
Happy Monday! ☀️
Welcome to the 879 new hungry minds who have joined us since last Monday!
If you aren't subscribed yet, join smart, curious, and hungry folks by subscribing here.

📚 Software Engineering Articles
Building the foundation for running extra-large language models
Securing every Kubernetes workload at scale
Netflix's operations layer behind live at scale
Relational database design guide: master the fundamentals
Why MicroVMs power Docker's sandbox architecture
🗞️ Tech and AI Trends
Google opens Gemma 4 under Apache 2.0 with multimodal capabilities
Claude Opus 4.7 released: faster reasoning, better performance
Zero-copy GPU inference from WebAssembly on Apple Silicon
👨🏻‍💻 Coding Tip
React Server Components eliminate client-side data fetching and dramatically reduce bundle size
Time-to-digest: 5 minutes

NVIDIA's OCR v1 model was solid for English but completely fell apart on other languages — we're talking normalized edit distance (NED) scores of 0.56–0.92, where lower is better (basically gibberish). Instead of manually annotating millions of images across six languages, they went all-in on synthetic data generation and built Nemotron OCR v2.
The challenge: Train a single unified OCR model that handles English, Chinese, Japanese, Korean, and Russian without language-specific variants — and do it without the prohibitive cost of annotating millions of real-world images with word-, line-, and paragraph-level bounding boxes.
Implementation highlights:
Synthetic data pipeline: Generate 12.2M training images with pixel-perfect ground truth by rendering text from mOSCAR (a 163-language web corpus) onto randomized backgrounds with diverse fonts, layouts, and augmentations
Hierarchical annotations for free: Every synthetic image comes with word, line, and paragraph bounding boxes plus a reading order graph — labels that would cost a fortune to annotate manually
Line-level recognition for CJK: Chinese and Japanese don't use spaces between words, so they switched from word-level to line-level recognition — eliminating the need for a separate word segmentation step
Shared backbone architecture: Based on the FOTS design, a single RegNetX-8GF backbone processes the image once, then detection, recognition, and relational models all reuse those features — no redundant computation
Language-agnostic extensibility: Adding a new language only requires source text + fonts. No architecture changes, no manual annotation. The pipeline generates millions of annotated pages per day on a single machine
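Because the renderer places every word itself, the word/line/paragraph hierarchy really does come for free. A minimal sketch of the idea (types, field names, and the pixel tolerance are all illustrative, not NVIDIA's code): group word boxes into lines by vertical-center proximity, then read each line left to right.

```typescript
// Hypothetical sketch: derive line-level annotations from the word boxes
// a synthetic renderer already knows. Thresholds are illustrative.
interface Box { x: number; y: number; w: number; h: number; text: string }

// Merge a group of boxes into one enclosing box with concatenated text.
function union(boxes: Box[]): Box {
  const x = Math.min(...boxes.map(b => b.x));
  const y = Math.min(...boxes.map(b => b.y));
  const x2 = Math.max(...boxes.map(b => b.x + b.w));
  const y2 = Math.max(...boxes.map(b => b.y + b.h));
  return { x, y, w: x2 - x, h: y2 - y, text: boxes.map(b => b.text).join(" ") };
}

// Words whose vertical centers sit within `tol` pixels of a line's first
// word belong to that line; reading order within a line is x order.
function groupIntoLines(words: Box[], tol = 5): Box[][] {
  const sorted = [...words].sort((a, b) => (a.y + a.h / 2) - (b.y + b.h / 2));
  const lines: Box[][] = [];
  for (const w of sorted) {
    const last = lines[lines.length - 1];
    const cy = w.y + w.h / 2;
    if (last && Math.abs((last[0].y + last[0].h / 2) - cy) <= tol) last.push(w);
    else lines.push([w]);
  }
  lines.forEach(l => l.sort((a, b) => a.x - b.x));
  return lines;
}
```

The same trick repeats one level up (group lines into paragraphs by vertical gaps), which is why annotations that "would cost a fortune" manually are a byproduct here.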
Results and learnings:
From gibberish to near-perfect: NED scores on non-English languages dropped from 0.56–0.92 down to 0.035–0.069 — a single model outperforming even language-specific competitors
Absurdly fast: 34.7 pages/second on a single A100 GPU, 28x faster than PaddleOCR v5's server pipeline
Data > architecture: Expanding the character set alone barely helped. The model could theoretically output CJK characters — it just had never learned what they looked like
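For context on those numbers, NED is just Levenshtein edit distance normalized by the longer string's length, so 0 is a perfect transcription and values above ~0.5 are mostly wrong. A quick sketch of the metric (not NVIDIA's evaluation code):

```typescript
// Levenshtein distance via a single rolling row of the DP table.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0]; // holds dp[i-1] from the previous row
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(
        dp[i] + 1,     // edit against shorter b
        dp[i - 1] + 1, // edit against shorter a
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // match or substitute
      );
      prev = tmp;
    }
  }
  return dp[a.length];
}

// Normalized edit distance: 0 = perfect, 1 = nothing recovered.
function ned(pred: string, truth: string): number {
  if (pred.length === 0 && truth.length === 0) return 0;
  return levenshtein(pred, truth) / Math.max(pred.length, truth.length);
}
```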
NVIDIA's approach is a masterclass in letting synthetic data do the heavy lifting. If you can programmatically generate your labels, you skip the most expensive part of ML, and you get to control exactly what your model sees during training. The dataset and model are both open, so you can extend this to your own languages today.

ARTICLE (design-code-tango)
Figma Design to Code, Code to Design: Clearly Explained
GITHUB REPO (stack-em-high)
Stacked PRs
ARTICLE (math-goes-brrr)
All elementary functions from a single binary operator
ARTICLE (protocol-party-time)
Scaling MCP adoption
ARTICLE (money-moves-smooth)
Engineering the Forge Billing Platform for Reliability and Scale
ARTICLE (tile-time-magic)
Frontend Engineering at Palantir: Polar Scaled Tiles in Zodiac
ARTICLE (robot-future-vibes)
The Agent Stack Bet
ARTICLE (slice-it-vertical)
Vertical Slice Architecture in Node.js: One Folder Per Use Case
ARTICLE (gpt-plays-favorites)
Why ChatGPT Cites One Page Over Another
Want to reach 200,000+ engineers?
Let’s work together! Whether it’s your product, service, or event, we’d love to help you connect with this awesome community.

Brief: Google releases Gemma 4, an open-weight model family with Apache 2.0 licensing spanning 2B-31B parameters, featuring native video, image, and audio processing, 256K context windows, and agentic capabilities like function-calling and structured JSON output—with the 31B variant matching performance of models 3-5x larger and broad distribution across Hugging Face, Kaggle, vLLM, Ollama, and other platforms.
Brief: As open LLM models grow increasingly capable, inference engineering—the practice of optimizing how AI models run in production—is shifting from a niche skill at frontier labs to a critical competency for any company building AI products; key techniques include quantization, speculative decoding, caching, parallelism, and disaggregation to achieve faster, cheaper, and more reliable model serving.
Brief: Anthropic launched Claude Opus 4.7, a significant improvement over Opus 4.6 that excels in advanced software engineering, complex long-running tasks, and vision capabilities—processing images at 3.75 megapixels—while introducing cybersecurity safeguards and new effort control options, available across all Claude products at unchanged pricing.
Brief: Anthropic unveils Claude Design, an AI-powered tool that lets teams collaborate with Claude to create polished designs, prototypes, and presentations by describing what they need, with features like automatic brand consistency, inline editing, and seamless handoffs to code—available now for Claude Pro, Max, Team, and Enterprise subscribers.
🤖 OpenAI Upgrades Codex with Computer Control, Image Generation, and Advanced Developer Tools (4 min)
Brief: OpenAI significantly expands Codex capabilities for its 3M+ weekly users, adding computer control to operate apps, image generation, native web browsing, 90+ new plugins, and memory features to streamline the entire software development lifecycle from coding to deployment.
Brief: A developer demonstrates that WebAssembly modules can share memory directly with Apple Silicon GPUs without copying data, using the chip's unified memory architecture to run stateful AI inference with portable KV caches that can be serialized, restored across machines, and recovered 5.45× faster than recomputation.

This week’s tip:
Use React Server Components (RSCs) with Server Actions to eliminate client-side data-fetching boilerplate and reduce JavaScript bundle size by moving logic server-side. RSCs render on the server, stream HTML to the client, and support direct database/API access without intermediate API endpoints.

When to use it:
Data-heavy dashboards: Fetch and filter data server-side; stream only rendered HTML.
Secure mutations: Run sensitive operations (auth checks, DB writes) in Server Actions; client never sees tokens or credentials.
Progressive enhancement: RSCs work without client JavaScript; add interactivity with 'use client' islands only where needed.
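Here's a minimal sketch of the pattern in Next.js's App Router (the `db` client, its schema, and the file path are assumed for illustration — swap in your own data layer):

```tsx
// app/dashboard/page.tsx — a Server Component: this code never ships to the browser.
import { db } from "@/lib/db";          // assumed server-side DB client
import { revalidatePath } from "next/cache";

export default async function Dashboard() {
  // Direct data access: no useEffect, no fetch("/api/..."), no loading state.
  const orders = await db.order.findMany({ where: { status: "open" } });

  // Server Action: the mutation runs server-side; the client never sees
  // credentials or the DB connection.
  async function closeOrder(formData: FormData) {
    "use server";
    await db.order.update({
      where: { id: String(formData.get("id")) },
      data: { status: "closed" },
    });
    revalidatePath("/dashboard");
  }

  return (
    <ul>
      {orders.map((o) => (
        <li key={o.id}>
          {o.customer}
          <form action={closeOrder}>
            <input type="hidden" name="id" value={o.id} />
            <button type="submit">Close</button>
          </form>
        </li>
      ))}
    </ul>
  );
}
```

Note that the `<form action={closeOrder}>` submit works even before client JavaScript loads, which is what makes the progressive-enhancement bullet above possible.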
Greatness comes from living with purpose and passion.
Ralph Marston


That’s it for today! ☀️
Enjoyed this issue? Send it to your friends here to sign up, or share it on Twitter!
If you want to submit a section to the newsletter or tell us what you think about today’s issue, reply to this email or DM me on Twitter! 🐦
Thanks for spending part of your Monday morning with Hungry Minds.
See you in a week — Alex.
Icons by Icons8.
*I may earn a commission if you get a subscription through the links marked with “aff.” (at no extra cost to you).




