    Building a Fast Multilingual OCR Model with Synthetic Data

    Ryan Chesler's avatar


    Training a high-quality OCR model requires a large quantity of annotated image-text pairs: images with precise bounding boxes, transcriptions, and ideally reading order information at the word, line, and paragraph level. Every approach to curating this data comes with tradeoffs. Existing benchmark datasets like ICDAR and Total-Text have clean labels but limited scale, typically tens of thousands of images skewed toward English and Chinese. Manual annotation produces the highest quality labels but is expensive and slow, making it impractical at the millions-of-images scale needed for robust multilingual models. Web-scraped PDFs offer enormous quantity, but the embedded text is often noisy: characters recorded as individual strokes instead of words, text baked into images with no extractable layer, or scanned pages where a weak OCR model was applied and the resulting text layer is unreliable. You can extract usable signal from web PDFs, but it takes significant filtering effort and the result is never perfectly clean.

    Synthetic data generation offers a way out of these tradeoffs. By rendering text onto images programmatically, we get both the scale of web scraping and the label purity of hand annotation. Every bounding box, transcription, and reading order relationship is known exactly because we placed it there, and we have full control over which layouts, font styles, and edge cases appear in the training set. The challenge is realism. Simulating diverse layouts and realistic document scenarios is difficult, but with the right rendering engine and strong randomization across fonts, colors, backgrounds, augmentations, and layout structures, it is possible to build enough invariance that models trained on synthetic data generalize well to real-world documents.

    Using this approach, we built Nemotron OCR v2, a multilingual OCR model that is both accurate and fast. Accuracy is driven by data: 12 million synthetic training images across six languages brought NED scores from 0.56–0.92 down to 0.035–0.065 on non-English languages. Speed is driven by architecture: a shared detection backbone whose features are reused by both the recognizer and the relational model, eliminating redundant computation and enabling 34.7 pages/second on a single A100 GPU. The synthetic data pipeline is generic enough to extend to any language for which fonts and source text exist.

    The dataset is publicly available at nvidia/OCR-Synthetic-Multilingual-v1 and the model at nvidia/nemotron-ocr-v2. You can try the model directly in the browser at the Nemotron OCR v2 demo.

    Nemotron OCR v2 HuggingFace Space demo showing detected text regions and extracted text




    The Problem: Data, Not Architecture

    Nemotron OCR v1 was a strong English OCR model, but it was not trained on other languages, so it failed to read non-English documents accurately. On our SynthDoG benchmark, v1 produced Normalized Edit Distance (NED) scores between 0.56 and 0.92 for Japanese, Korean, Russian, and Chinese. At these error rates, the model output bears little resemblance to the ground truth.

    Language                 Nemotron OCR v1 NED
    Japanese                 0.723
    Korean                   0.923
    Russian                  0.564
    Chinese (Simplified)     0.784
    Chinese (Traditional)    0.700
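
    To make the metric concrete, here is a minimal sketch of Normalized Edit Distance. The article does not spell out the normalization, so this assumes one common convention: Levenshtein distance divided by the length of the longer string. The function names are illustrative, not part of any released evaluation code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance divided by the longer length (assumed convention)."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return edit_distance(pred, truth) / longest


print(normalized_edit_distance("kitten", "sitting"))  # 3 edits over 7 characters
```

    Under this convention a score of 0 means a perfect transcription and a score of 1 means nothing matched, which is why values above 0.7 indicate output that is essentially unrelated to the page.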

    Part of the issue was the character set. The v1 model supported only 855 characters, which simply did not cover CJK (Chinese, Japanese, Korean) or Cyrillic scripts. We ran an experiment where we expanded the character set to 14,244 characters to cover all the target languages. This helped slightly, but without sufficient training data actually containing those characters, the improvement was marginal. The model could theoretically output the right characters, but it had never learned what they looked like. The bottleneck was data, not architecture.
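
    The character-set limitation can be illustrated with a small coverage check. This is a sketch under assumptions: the Latin-only set below is a hypothetical stand-in, not the actual 855-character vocabulary, and `charset_coverage` is an illustrative helper.

```python
def charset_coverage(text: str, charset: set[str]) -> float:
    """Fraction of non-whitespace characters in `text` that the charset can emit."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    covered = sum(1 for c in chars if c in charset)
    return covered / len(chars)


# Hypothetical Latin-only character set standing in for a small vocabulary.
latin_charset = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,!?"
)

english = "Hello, world!"
russian = "Привет, мир!"

print(charset_coverage(english, latin_charset))  # every character is covered
print(charset_coverage(russian, latin_charset))  # only the punctuation is covered
```

    Coverage is necessary but not sufficient: expanding the set raises this number to 1.0 for every target language, yet the recognizer still has to see those characters rendered in training images before it can read them.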

    Collecting and annotating millions of real-world images across six languages with word-, line-, and paragraph-level bounding boxes plus reading order graphs would be prohibitively expensive. We needed a different approach.




    A Generic Synthetic Data Pipeline

    Our key insight is that the recipe for multilingual OCR training data is fundamentally language-agnostic. You need two ingredients:

    1. Source text in the target language, drawn from a realistic distribution
    2. Fonts that can render that language’s script

    Given those, a synthetic renderer can produce an effectively unlimited number of annotated training images, with pixel-perfect ground truth at every level of granularity and no annotation cost.



    Text: mOSCAR

    For source text, we use mOSCAR, a large-scale multilingual web corpus covering 163 language subsets across dozens of scripts including Latin, CJK, Cyrillic, Arabic, Devanagari, and Thai. Sampling from mOSCAR gives us text that follows a realistic distribution of vocabulary, sentence length, and character frequency for each language. This is far more representative than dictionary word lists or machine-generated text.



    Rendering: Modified SynthDoG

    We built our pipeline on a heavily modified version of SynthDoG (Synthetic Document Generator) from the Donut project. The original SynthDoG generates document-like images with page-level text labels. We extended it in several important ways.

    Multi-level bounding boxes. Vanilla SynthDoG provides only page-level text. Our pipeline generates pixel-precise annotations at three levels simultaneously: word, line, and paragraph. Each level includes both axis-aligned bounding boxes and 4-point quads, with indices linking words to their parent lines and paragraphs.
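
    One way to picture these annotations is as three parallel lists of regions linked by parent indices. The sketch below is illustrative only: the class and field names are hypothetical and do not reflect the released dataset's schema.

```python
from dataclasses import dataclass, field
from typing import Optional

Point = tuple[float, float]
Box = tuple[float, float, float, float]


@dataclass
class Region:
    """One annotated region: a word, a line, or a paragraph."""
    text: str
    quad: list[Point]             # four corners, clockwise from top-left
    parent: Optional[int] = None  # index into the next level up, None at the top

    @property
    def bbox(self) -> Box:
        """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the quad."""
        xs = [x for x, _ in self.quad]
        ys = [y for _, y in self.quad]
        return (min(xs), min(ys), max(xs), max(ys))


@dataclass
class PageAnnotation:
    words: list[Region] = field(default_factory=list)
    lines: list[Region] = field(default_factory=list)
    paragraphs: list[Region] = field(default_factory=list)

    def words_in_line(self, line_idx: int) -> list[Region]:
        return [w for w in self.words if w.parent == line_idx]


# A slightly rotated two-word line inside one paragraph.
page = PageAnnotation(
    paragraphs=[Region("Hello world", [(8, 8), (112, 4), (113, 26), (9, 30)])],
    lines=[Region("Hello world", [(10, 10), (110, 6), (111, 22), (11, 26)], parent=0)],
    words=[
        Region("Hello", [(10, 10), (55, 8), (56, 24), (11, 26)], parent=0),
        Region("world", [(62, 8), (110, 6), (111, 22), (63, 24)], parent=0),
    ],
)

print(page.lines[0].bbox)                       # (10, 6, 111, 26)
print([w.text for w in page.words_in_line(0)])  # ['Hello', 'world']
```

    Storing the quad and deriving the axis-aligned box from it keeps the two representations consistent, which matters for rotated or distorted text where the box alone would overstate the region.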

    Relation graph for reading order. Most publicly available OCR datasets do not include reading order annotations. This makes it hard to train models that understand document structure beyond just detecting text. We took inspiration from the HierText dataset, which pioneered hierarchical (word, line, paragraph) annotations with structural relationships. Our synthetic pipeline generates a relation graph for every sample, encoding which words compose each line, which lines compose each paragraph, and in what order they should be read. This is what powers the relational model component of Nemotron OCR v2, which handles multi-column layouts, tables, and other structures where a simple top-to-bottom, left-to-right merge would produce garbled output.
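
    The sketch below shows why a relation graph matters on a two-column page. The tuple layout and the "next line" edge encoding are assumptions made for illustration; the article does not specify the graph format used in the dataset.

```python
# Each line: (text, x, y, paragraph_id, order_within_paragraph)
lines = [
    ("Left column, line 1",    0,  0, 0, 0),
    ("Right column, line 1", 300,  0, 1, 0),
    ("Left column, line 2",    0, 20, 0, 1),
    ("Right column, line 2", 300, 20, 1, 1),
]


def naive_order(lines):
    """Top-to-bottom, left-to-right merge that ignores document structure."""
    return [line[0] for line in sorted(lines, key=lambda l: (l[2], l[1]))]


def build_next_edges(lines):
    """Map each line index to the line read immediately after it in its paragraph."""
    by_paragraph = {}
    for idx, (_, _, _, para, order) in enumerate(lines):
        by_paragraph.setdefault(para, []).append((order, idx))
    edges = {}
    for members in by_paragraph.values():
        members.sort()
        for (_, a), (_, b) in zip(members, members[1:]):
            edges[a] = b
    return edges


def graph_order(lines, paragraph_order):
    """Follow the 'next' edges paragraph by paragraph."""
    edges = build_next_edges(lines)
    has_incoming = set(edges.values())
    result = []
    for para in paragraph_order:
        # The first line of a paragraph is the one no edge points to.
        node = next(i for i, l in enumerate(lines)
                    if l[3] == para and i not in has_incoming)
        while node is not None:
            result.append(lines[node][0])
            node = edges.get(node)
    return result


print(naive_order(lines))          # interleaves the two columns
print(graph_order(lines, [0, 1]))  # reads the left column, then the right
```

    Because the synthetic renderer places every line itself, these edges come for free at generation time, whereas recovering them from real documents would require manual annotation.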

    Diverse layout modes. We created a set of layout templates that cover a range of real-world document scenarios: flowing multi-column text, scattered scene-text-like words, vertical text columns (important for Japanese and Chinese), tables with headers and borders, table-of-contents pages with dot leaders, PowerPoint-style slides, and Word-document-style pages with headings and body text. Each generation run randomly selects a layout mode, so the model sees a wide variety of structures during training.

    Line-level recognition for CJK. An important design decision for the multilingual variant was moving from word-level to line-level text recognition. Languages like Chinese and Japanese do not use spaces between words, so there is no natural word boundary to segment on. Korean uses spaces inconsistently. By operating at the line level, the recognizer handles these languages naturally without needing a separate word segmentation step. The English variant continues to use word-level recognition where it makes sense.
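
    A quick illustration of the segmentation problem: whitespace splitting yields usable word regions for English but returns a single run for Chinese. The sample sentences are arbitrary examples, not taken from the training data.

```python
english = "The quick brown fox"
chinese = "敏捷的棕色狐狸跳过了懒狗"

english_regions = english.split()
chinese_regions = chinese.split()

print(len(english_regions))  # 4
print(len(chinese_regions))  # 1
```

    With no delimiter to split on, a word-level recognizer would need a separate segmenter for such text, whereas a line-level recognizer can consume the whole run directly.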

    Open-source font pool. Depending on the language, we assembled between 165 and 1,258 unique fonts from open-source collections including Google Fonts and the Noto family, covering serif, sans-serif, handwritten, decorative, and variable-weight styles.

    Augmentations. Each rendered page goes through a stack of randomized augmentations to improve generalization. At the text level, these include border/outline effects, drop shadows, extrusion, and sprinkle noise on glyph edges. Custom effects modulate stroke opacity and stroke width variation across text using random fields. At the image level, we apply morphological operations (dilation, erosion), median blur, and elastic distortion, gated by a minimum text height to avoid destroying small text. At the full-page level, the pipeline applies contrast and brightness jitter, Gaussian and motion blur, color shifting, shadow overlays, and additive Gaussian noise. Backgrounds are either image textures or solid colors, with optional semi-transparent tinted rectangles behind individual words or lines.
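
    The gating idea can be sketched as a list of (function, probability, minimum text height) entries applied in sequence. Everything here is a simplified assumption: the three toy operations, their probabilities, and the 16-pixel threshold are placeholders rather than the pipeline's actual settings, and images are plain nested lists of grayscale values to keep the example dependency-free.

```python
import random


def clamp(value):
    return min(255.0, max(0.0, value))


def brightness_jitter(img, rng):
    delta = rng.uniform(-30, 30)
    return [[clamp(p + delta) for p in row] for row in img]


def gaussian_noise(img, rng):
    return [[clamp(p + rng.gauss(0, 8)) for p in row] for row in img]


def horizontal_blur(img, rng):
    """3-tap horizontal blur standing in for heavier blur and morphology ops."""
    out = []
    for row in img:
        padded = [row[0]] + list(row) + [row[-1]]
        out.append([(padded[i] + padded[i + 1] + padded[i + 2]) / 3
                    for i in range(len(row))])
    return out


# (function, probability, minimum text height in pixels); values are placeholders.
AUGMENTATIONS = [
    (brightness_jitter, 0.5, 0),
    (gaussian_noise, 0.5, 0),
    (horizontal_blur, 0.3, 16),  # skipped when the text is small
]


def augment(img, text_height, seed=None):
    """Apply each augmentation with its probability, honoring the height gate."""
    rng = random.Random(seed)
    applied = []
    for fn, prob, min_height in AUGMENTATIONS:
        if text_height < min_height:
            continue  # destructive ops are gated off for small text
        if rng.random() < prob:
            img = fn(img, rng)
            applied.append(fn.__name__)
    return img, applied


image = [[0.0, 128.0, 255.0, 128.0]] * 3
_, applied_small = augment(image, text_height=8, seed=0)
print("horizontal_blur" in applied_small)  # False: gated by text height
```

    Keeping the gate next to each operation makes it easy to add a new effect without touching the loop, and seeding the generator makes a given sample reproducible.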




    What the Data Looks Like

    Here is a sample of raw synthetic images across all six languages. Each image is generated with a random layout mode, font selection, background, and augmentation stack:

    Mosaic of synthetic training images across all languages

    Below is an annotated example showing the hierarchical structure. Dashed outlines indicate paragraph boundaries, shaded regions show line-level groupings (colored by paragraph), and arrows trace the reading order between lines within each paragraph.

    English paragraph layout with reading order annotations

    Here are more annotated examples across languages showing the range of layouts, scripts, and augmentation styles the pipeline produces. Each subtitle describes what the example highlights:

    Annotated examples across languages and layout types




    Dataset at a Glance

    The full dataset contains 12.2 million samples across six languages:

    Language                 Total Samples    Train        Test         Validation
    English                  1,825,089        1,460,304    183,629      181,156
    Japanese                 1,889,137        1,502,712    193,779      192,646
    Korean                   2,269,540        1,814,994    227,091      227,455
    Russian                  1,724,733        1,380,404    171,678      172,651
    Chinese (Simplified)     2,335,343        1,914,948    210,143      210,252
    Chinese (Traditional)    2,214,304        1,772,280    221,867      220,157
    Total                    12,258,146       9,845,642    1,208,187    1,204,317
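
    As a quick sanity check on the table, the per-language counts can be summed to recover the totals and the overall split proportions. The numbers are copied from the table above; only the variable names are new.

```python
# (train, test, validation) counts per language, copied from the table.
splits = {
    "English":               (1_460_304, 183_629, 181_156),
    "Japanese":              (1_502_712, 193_779, 192_646),
    "Korean":                (1_814_994, 227_091, 227_455),
    "Russian":               (1_380_404, 171_678, 172_651),
    "Chinese (Simplified)":  (1_914_948, 210_143, 210_252),
    "Chinese (Traditional)": (1_772_280, 221_867, 220_157),
}

totals = [sum(column) for column in zip(*splits.values())]
grand_total = sum(totals)

for name, count in zip(("train", "test", "validation"), totals):
    print(f"{name}: {count:,} ({count / grand_total:.1%})")
# train: 9,845,642 (80.3%)
# test: 1,208,187 (9.9%)
# validation: 1,204,317 (9.8%)
```

    The split works out to roughly 80/10/10 overall, so the "12 million" figure refers to the whole dataset and about 9.8 million samples fall in the training split.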

    Download: nvidia/OCR-Synthetic-Multilingual-v1




    Extensibility

    The pipeline we have described is deliberately generic. We chose six languages for this release, but adding a new language only requires source text and fonts that cover the script. No changes to the model architecture or manual annotation are needed. The rendering pipeline can generate millions of annotated pages per day on a single machine, making it practical to produce large-scale training sets for new languages quickly. With mOSCAR covering 163 language subsets and the Noto font family supporting virtually every Unicode script in active use, there is a clear path to scaling this approach broadly.




    The Model: Nemotron OCR v2

    Nemotron OCR v2 is a production-ready, commercially usable OCR model trained on this synthetic data along with approximately 680K real-world images. It uses a three-component end-to-end architecture:

    • Text Detector (RegNetX-8GF backbone): localizes text regions in the image
    • Text Recognizer (pre-norm Transformer): transcribes detected regions
    • Relational Model: predicts logical groupings, reading order, and layout relationships

    Two variants are available:

                         v2_english    v2_multilingual
    Languages            English       EN, ZH, JA, KO, RU
    Region level         Word          Line
    Recognizer layers    3             6
    Character set        855           14,244
    Parameters           54M           84M

    An important distinction from other OCR pipelines: Nemotron OCR v2 multilingual is a single unified model that handles all five languages simultaneously (with Chinese covering both the Simplified and Traditional scripts). You do not need to know the document language ahead of time or select a language-specific variant. By contrast, pipeline tools like PP-OCR v5 (PaddleOCR) and OpenOCR offer specialized per-language models that perform well on their target language but require you to either detect the language first or fall back to a base variant that is less capable across the board.



    Why the Model Is Fast



    The architecture is based on the FOTS (Fast Oriented Text Spotting) design, which unifies detection and recognition into a single network with a shared convolutional backbone. The detection backbone (RegNetX-8GF) processes the input image once and produces feature maps that are reused by all three components. The text recognizer receives rectified feature crops from detected regions and decodes them with a small Transformer. The relational model reasons over per-region embeddings derived from the same feature maps using a compact Transformer encoder. Because the expensive convolutional pass happens only once, the downstream components add minimal overhead. This feature reuse is what drives the model’s efficiency, enabling 34.7 pages/second on a single A100 GPU.
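
    The effect of feature reuse can be shown with a toy pipeline that counts backbone invocations. This is a schematic sketch, not the model's code: the "backbone" just averages rows, and the three head functions are hypothetical stand-ins for the detector, recognizer, and relational model.

```python
class CountingBackbone:
    """Stand-in for the convolutional backbone; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        return [sum(row) / len(row) for row in image]  # toy feature: row means


def detect(features):
    return [i for i, f in enumerate(features) if f > 0.5]


def recognize(features, regions):
    return [f"region {i}: {features[i]:.2f}" for i in regions]


def relate(features, regions):
    return list(zip(regions, regions[1:]))  # toy reading-order edges


def shared_pipeline(image, backbone):
    features = backbone(image)  # single expensive pass
    regions = detect(features)
    return recognize(features, regions), relate(features, regions)


def separate_pipeline(image, backbone):
    regions = detect(backbone(image))            # pass 1
    texts = recognize(backbone(image), regions)  # pass 2
    order = relate(backbone(image), regions)     # pass 3
    return texts, order


image = [[0.9, 0.8], [0.1, 0.2], [0.7, 0.9]]
shared, separate = CountingBackbone(), CountingBackbone()
shared_out = shared_pipeline(image, shared)
separate_out = separate_pipeline(image, separate)
print(shared.calls, separate.calls)  # 1 3
```

    Both pipelines return the same result, but the shared version pays for the backbone once. In a real model the backbone dominates the cost, so the saving is far larger than this toy suggests.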



    Results: What Synthetic Data Buys You



    Multilingual Benchmark (SynthDoG)

    Normalized Edit Distance on SynthDoG-generated pages (lower is better). The v2 multilingual model, trained on our synthetic data, brings NED scores from unusable levels down to near-zero across all target languages:

    Language                 PaddleOCR (base)    PaddleOCR (specialized)    OpenOCR (server)    Nemotron OCR v1    Nemotron OCR v2 (multi)
    English                  0.117               0.096                      0.105               0.078              0.069
    Japanese                 0.201               0.201                      0.586               0.723              0.046
    Korean                   0.943               0.133                      0.837               0.923              0.047
    Russian                  0.959               0.163                      0.950               0.564              0.043
    Chinese (Simplified)     0.054               0.054                      0.061               0.784              0.035
    Chinese (Traditional)    0.094               0.094                      0.127               0.700              0.065

    Note that the “PaddleOCR (specialized)” column uses language-specific models (e.g., the Korean model for Korean), which is the best-case scenario when you already know the language. Nemotron OCR v2 multilingual achieves better results on this synthetic data than even these specialized variants across all languages, using a single model.



    Real-World Benchmark (OmniDocBench)

    On OmniDocBench, a real-world document OCR benchmark with English, Chinese, and mixed-language content, Nemotron OCR v2 multilingual achieves competitive accuracy at 34.7 pages/second, over 28x faster than PaddleOCR v5:

    Model                      pages/s    EN       ZH       Mixed
    PaddleOCR v5 (server)      1.2        0.027    0.037    0.041
    OpenOCR (server)           1.5        0.024    0.033    0.049
    Nemotron OCR v2 (multi)    34.7       0.048    0.072    0.142
    Nemotron OCR v2 (EN)       40.7       0.038    0.830    0.437
    EasyOCR                    0.4        0.095    0.117    0.326
    Nemotron OCR v1            39.3       0.038    0.876    0.436

    NED scores (lower is better). Speed measured on a single A100 GPU with the v2 batched pipeline. All comparison models were benchmarked using their default detector + recognizer pipeline with optional extras disabled.
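
    To put the throughput column in practical terms, the short calculation below derives the speed ratio and the time needed for a batch of pages. The pages/s values are copied from the table; the 10,000-page workload is an arbitrary example, and the helper names are illustrative.

```python
pages_per_second = {
    "PaddleOCR v5 (server)": 1.2,
    "OpenOCR (server)": 1.5,
    "EasyOCR": 0.4,
    "Nemotron OCR v2 (multi)": 34.7,
}


def speedup(fast: float, slow: float) -> float:
    """How many times more pages per second the faster model processes."""
    return fast / slow


def minutes_for(pages: int, rate: float) -> float:
    """Wall-clock minutes to process `pages` at `rate` pages per second."""
    return pages / rate / 60


reference = pages_per_second["Nemotron OCR v2 (multi)"]
for name, rate in pages_per_second.items():
    print(f"{name}: {speedup(reference, rate):.1f}x, "
          f"{minutes_for(10_000, rate):.1f} min per 10,000 pages")
```

    The ratio against PaddleOCR v5 comes out just under 29, which matches the "over 28x" figure quoted above. At these rates a 10,000-page batch takes about five minutes with the multilingual model and over two hours with the slower pipelines.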

    A note on the speed differences between variants: v1 and v2 English are faster than v2 multilingual because the multilingual recognizer is larger (6 Transformer layers with a 14,244-token vocabulary versus 3 layers with 855 tokens). The recognizer processes every detected text region, so a heavier recognizer directly reduces throughput on text-dense pages. The v2 English model is slightly faster than v1 because the backbone has been swapped from RegNetY to RegNetX.







    Acknowledgments

    Thank you to Bo Liu, Théo Viel and Mike Ranzinger for contributing code, strategy and additional validation to this work.
