- NVIDIA Nemotron 3 Nano Omni is a new omni-modal understanding model built for real-world document analysis, multi-image reasoning, automatic speech recognition, long audio-video understanding, agentic computer use, and general reasoning.
- It extends the Nemotron multimodal line from a strong vision-language system to a broader text + image + video + audio model.
- Nemotron 3 Nano Omni delivers best-in-class accuracy on complex document intelligence leaderboards such as MMLongBench-Doc and OCRBenchV2, while also leading video and audio leaderboards like WorldSense and DailyOmni. It achieves top accuracy on VoiceBench for audio understanding and ranks as the most cost-efficient open video understanding model on MediaPerf.
- Under the hood, it combines the Nemotron 3 hybrid Mamba-Transformer Mixture-of-Experts backbone with a C-RADIOv4-H vision encoder and Parakeet-TDT-0.6B-v2 audio encoder.
- The architecture is designed to preserve fine visual detail, add native audio understanding, and scale to very long multimodal contexts for dense images, documents, videos, and mixed-modality reasoning.
- The training recipe uses staged multimodal alignment and context extension, followed by preference optimization and multimodal reinforcement learning.
- Nemotron 3 Nano Omni delivers up to 9x higher throughput and 2.9x the single-stream reasoning speed on multimodal use cases, compared to alternatives.
- Download the BF16, FP8, and NVFP4 checkpoints on Hugging Face.
- For more information about the model architecture, training recipe, data pipelines and benchmarks, read the full Nemotron 3 Nano Omni report.
Benchmark highlights
Building on Nemotron Nano V2 VL, Nemotron 3 Nano Omni delivers substantial visual gains and adds entirely new audio and video+audio capabilities, while also outperforming another open-weights omni model, Qwen3-Omni, in many domains.
| Task | Benchmark | Nemotron 3 Nano Omni | Nemotron Nano V2 VL | Qwen3-Omni 30B-A3B |
|---|---|---|---|---|
| Document understanding | OCRBenchV2-En | 65.8 | 61.2 | – |
| | MMLongBench-Doc | 57.5 | 38.0 | 49.5 |
| | CharXiv reasoning | 63.6 | 41.3 | 61.1 |
| GUI | ScreenSpot-Pro | 57.8 | 5.5 | 59.7 |
| | OSWorld | 47.4 | 11.0 | 29.0 |
| Video understanding | Video-MME | 72.2 | 63.0 | 70.5 |
| Video + Audio understanding | WorldSense | 55.4 | – | 54.0 |
| | DailyOmni | 74.1 | – | 73.6 |
| Voice interaction | VoiceBench | 89.4 | – | 88.8 |
| ASR | HF Open ASR (lower is better) | 5.95 | – | 6.55 |
Efficiency highlights
Compared to other open omni models with the same interactivity, Nemotron 3 Nano Omni delivers 7.4x higher system efficiency for multi-document use cases and 9.2x higher system efficiency for video use cases.
Figure 1. Total system throughput for multi-document and video use cases sustained by each model at a fixed per‑user interactivity threshold (tokens/sec/user)
What Nemotron 3 Nano Omni is designed for
At a high level, Nemotron 3 Nano Omni is aimed at five classes of workloads:
1. Real-world document analysis
This is not only about OCR. The model is positioned for long, messy, high-value documents where understanding depends on layout, tables, figures, formulas, section structure, and cross-page references. Think contracts, technical papers, reports, manuals, multi-page forms, or compliance packets. The model can handle 100+ page documents.
2. Automatic Speech Recognition
Nemotron 3 Nano Omni includes strong speech understanding capabilities that enable high-quality transcription across diverse audio conditions. It handles long-form audio with varying speakers, accents, and background noise. These capabilities can be integrated into broader workflows, allowing spoken content to be transcribed, analyzed, and combined with other modalities for tasks like summarization, question answering, and cross-modal reasoning.
3. Long audio-video understanding
Many enterprise and developer workflows depend on mixed audio and visual evidence: screen recordings with narration, training videos, meetings with slides, tutorials, product demos, customer support captures, and long-form video archives. Nemotron 3 Nano Omni is built to reason over those inputs jointly.
4. Agentic computer use
The Nemotron 3 Nano Omni model is specifically trained for agentic computer use, enabling it to assist with tasks in graphical user interface (GUI) environments. Its capabilities include interpreting screenshots, monitoring the state of the user interface, grounding its reasoning in on-screen visuals, and helping with action selection or workflow automation.
5. General multimodal reasoning
The model is designed for more than perception. It excels at reasoning-intensive tasks that require synthesizing information across long context windows, multiple modalities, and structured or semi-structured evidence. It can carry out multi-step reasoning, perform calculations, and connect signals from text, images, tables, and other inputs to arrive at coherent, well-supported answers.
Model architecture and key innovations
Nemotron 3 Nano Omni uses a unified encoder-projector-decoder design. The language backbone is Nemotron 3 Nano 30B-A3B, paired with the C-RADIOv4-H vision encoder and the Parakeet-TDT-0.6B-v2 audio encoder. The modality-specific encoders connect into the LLM backbone through lightweight projectors.

Figure 2. Model architecture of NVIDIA Nemotron 3 Nano Omni 30B-A3B
A hybrid Mamba-Transformer-MoE backbone for long multimodal context
The model backbone interleaves three key components: 23 Mamba selective state-space layers for efficient long-context processing; 23 MoE layers with 128 experts, top-6 routing, and a shared expert for conditional capacity; and 6 grouped-query attention layers to preserve strong global interaction and expressivity.
Nemotron 3 Nano Omni combines state-space models, attention, and MoE in a unified design that maintains strong reasoning performance while remaining practical for long, multimodal contexts.
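As a rough illustration of the conditional capacity in the MoE layers, the sketch below implements top-6 routing over 128 experts plus an always-on shared expert. The hidden size, gating details, and single-Linear experts are placeholders for readability, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative MoE routing: 128 routed experts, top-6 selection per token,
# plus a shared expert that processes every token. Each expert is shown as a
# single Linear for brevity; real experts are MLPs.
NUM_EXPERTS, TOP_K, HIDDEN = 128, 6, 2048

router = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList([torch.nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)])
shared_expert = torch.nn.Linear(HIDDEN, HIDDEN)

def moe_layer(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, HIDDEN). Each token goes through 6 routed experts + the shared one."""
    logits = router(x)                               # (tokens, 128) routing scores
    weights, idx = logits.topk(TOP_K, dim=-1)        # pick the 6 best experts per token
    weights = F.softmax(weights, dim=-1)
    routed = torch.stack([
        sum(w * experts[int(e)](tok) for w, e in zip(weights[t], idx[t]))
        for t, tok in enumerate(x)
    ])
    return shared_expert(x) + routed                 # real code uses gather/scatter, not a loop

print(moe_layer(torch.randn(4, HIDDEN)).shape)       # torch.Size([4, 2048])
```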
Dynamic resolution for dense documents, charts, and screens
On the vision side, Nemotron 3 Nano Omni replaces the tiling strategy used in the v2 model with dynamic resolution processing at native aspect ratio. Each image can be represented using a variable number of 16 x 16 patches, from a minimum of 1,024 to a maximum of 13,312 visual patches per image. For square images, this corresponds to roughly 512 x 512 and 1840 x 1840 pixels, respectively.
That flexibility is critical for handling high-resolution, complex visual inputs such as OCR-heavy documents, financial tables, slides, research figures, screenshots, and GUI layouts—especially when both fine details and overall structure need to be understood together.
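A minimal sketch of how an image size could map onto that patch budget at native aspect ratio; the exact resizing and rounding rules of the released preprocessor may differ.

```python
import math

PATCH = 16                                  # patch size from the text
MIN_PATCHES, MAX_PATCHES = 1_024, 13_312    # per-image budget from the text

def target_grid(width: int, height: int):
    """Pick a 16 x 16 patch grid near the native aspect ratio within the budget."""
    patches = (width / PATCH) * (height / PATCH)
    # Scale the image up or down so the patch count lands inside the budget.
    if patches < MIN_PATCHES:
        scale = math.sqrt(MIN_PATCHES / patches)
    elif patches > MAX_PATCHES:
        scale = math.sqrt(MAX_PATCHES / patches)
    else:
        scale = 1.0
    cols = max(1, round(width * scale / PATCH))
    rows = max(1, round(height * scale / PATCH))
    return cols, rows, cols * rows

# A 300 DPI US Letter scan is downscaled to stay under the 13,312-patch cap.
print(target_grid(3300, 2550))   # -> (131, 101, 13231) with this rounding
print(target_grid(512, 512))     # -> (32, 32, 1024), the minimum budget
```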
Conv3D temporal compression for video
For video, Nemotron 3 Nano Omni uses a dedicated Conv3D tubelet embedding path. Instead of embedding each frame independently, every pair of consecutive frames is fused into a single “tubelet” before the ViT, halving the number of vision tokens the language model has to attend to. This allows us to either double the number of frames within the same token budget, or halve the number of tokens for the same number of frames.
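The sketch below shows the tubelet idea with a standard Conv3d: a temporal kernel and stride of 2 fuse each pair of frames, so an 8-frame, 512 x 512 clip yields 4,096 visual tokens instead of 8,192. The embedding dimension is a placeholder, not the model's actual size.

```python
import torch
import torch.nn as nn

# Tubelet embedding sketch: fuse pairs of consecutive frames before the ViT.
# The 1280-dim output channel count is a placeholder.
tubelet_embed = nn.Conv3d(
    in_channels=3, out_channels=1280,
    kernel_size=(2, 16, 16),   # 2 frames x 16 x 16 pixels per tubelet
    stride=(2, 16, 16),        # non-overlapping in time and space
)

frames = torch.randn(1, 3, 8, 512, 512)      # (batch, rgb, frames, height, width)
tokens = tubelet_embed(frames)               # (1, 1280, 4, 32, 32)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 4 * 32 * 32, 1280)
print(tokens.shape)  # 4,096 tokens instead of 8,192 with per-frame embedding
```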
Efficient Video Sampling (EVS)
EVS is an inference-time feature that drops redundant video tokens after the vision encoder, reducing latency and improving throughput while maintaining accuracy. The first frame of the video is kept entirely; for each subsequent frame, EVS keeps the “dynamic” tokens where the video is changing and drops the “static” ones where nothing has changed from the previous frame. We combine this with Conv3D for superior compression: Conv3D fuses tokens from pairs of frames into one, and EVS then prunes the redundant static information.
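The sketch below illustrates the static/dynamic split on vision-encoder outputs. The similarity measure, threshold, and bookkeeping here are assumptions for illustration; the actual EVS criterion may differ.

```python
import torch

def evs_prune(tokens: torch.Tensor, keep_threshold: float = 0.1):
    """Drop 'static' video tokens that barely change between frames.

    tokens: (frames, tokens_per_frame, dim) vision-encoder outputs.
    Returns kept (frame, token) index pairs. Illustrative criterion only.
    """
    kept = [(0, t) for t in range(tokens.shape[1])]   # keep the first frame fully
    for f in range(1, tokens.shape[0]):
        # Relative change versus the same spatial position in the previous frame.
        delta = (tokens[f] - tokens[f - 1]).norm(dim=-1)
        delta = delta / tokens[f - 1].norm(dim=-1).clamp(min=1e-6)
        dynamic = torch.nonzero(delta > keep_threshold).squeeze(-1)
        kept.extend((f, int(t)) for t in dynamic)
    return kept

video_tokens = torch.randn(16, 1024, 1280)            # 16 frames x 1,024 tokens each
print(len(evs_prune(video_tokens)), "of", 16 * 1024, "tokens kept")
```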
Native audio input, not just text transcripts
The audio side is powered by Parakeet-TDT-0.6B-v2, connected to the backbone through its own 2-layer MLP projector. Audio is sampled at 16 kHz, and the model is trained with inputs up to 1,200 seconds (20 minutes), while the LLM's maximum context length can accommodate more than 5 hours of audio.
This represents a shift from traditional VLM pipelines by enabling native audio processing within a shared multimodal sequence, allowing audio, visual, and text tokens to be jointly modeled. This is crucial for scenarios like narrated screen recordings, video Q&A where speech alters visual meaning, long-form instructional or meeting content, and tasks requiring temporally grounded multimodal reasoning.
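For reference, here is a small preprocessing sketch that resamples input audio to 16 kHz and splits anything beyond the 20-minute training window into chunks. The file name is hypothetical, and whether chunking is needed at inference depends on your serving setup.

```python
import torchaudio
import torchaudio.functional as F

TARGET_SR = 16_000          # sampling rate expected by the audio encoder
MAX_SECONDS = 1_200         # longest audio seen during training (20 minutes)

waveform, sr = torchaudio.load("meeting_recording.wav")    # hypothetical input file
if sr != TARGET_SR:
    waveform = F.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)
waveform = waveform.mean(dim=0, keepdim=True)               # downmix to mono

# Split anything longer than the 20-minute training window into chunks.
chunk = TARGET_SR * MAX_SECONDS
chunks = [waveform[:, i:i + chunk] for i in range(0, waveform.shape[1], chunk)]
print(f"{waveform.shape[1] / TARGET_SR:.1f}s of audio -> {len(chunks)} chunk(s)")
```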
Lightweight modality projectors and unified token interleaving
Each encoder is connected to the LLM with a lightweight 2-layer MLP projector that maps encoder features into the shared embedding space. Once projected, vision, audio, and text tokens are interleaved and processed jointly.
This design keeps the overall system modular while still enabling genuine cross-modal reasoning inside the backbone itself.
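A minimal sketch of the projector-plus-interleaving pattern: each modality's features pass through a 2-layer MLP into a shared embedding space and are concatenated into one sequence. All dimensions and the activation are placeholders, not the released configuration.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """2-layer MLP mapping encoder features into the LLM embedding space."""
    def __init__(self, encoder_dim: int, llm_dim: int, hidden_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.GELU(),                         # activation choice is an assumption
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

llm_dim = 2048                                  # placeholder embedding size
vision_proj = ModalityProjector(encoder_dim=1280, llm_dim=llm_dim)
audio_proj = ModalityProjector(encoder_dim=1024, llm_dim=llm_dim)

vision_tokens = vision_proj(torch.randn(1, 4096, 1280))   # projected image tokens
audio_tokens = audio_proj(torch.randn(1, 750, 1024))      # projected audio tokens
text_tokens = torch.randn(1, 120, llm_dim)                 # embedded text tokens

# Interleave along the sequence dimension in prompt order; the backbone then
# attends over one mixed-modality sequence.
sequence = torch.cat([text_tokens, vision_tokens, audio_tokens], dim=1)
print(sequence.shape)   # (1, 4966, 2048)
```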
Training data, infrastructure and systems story
The SFT stages are trained on NVIDIA H100 GPUs, scaling from 32 to 128 nodes depending on the stage. The stack uses Megatron-LM, Transformer Engine, and Megatron Energon, with tensor parallelism, expert parallelism, sequence parallelism, context parallelism for the long-context stages, online sequence packing, and selective activation recomputation.
Post-SFT reinforcement learning uses NeMo-RL and NeMo Gym with a Megatron backend. The RL infrastructure used a Ray-based distributed setup across B200 and H100 clusters, plus multimodal deduplication, so repeated rollouts do not multiply image, video, and audio memory.
We open-source substantial parts of our training code.
Using RL to shape reliable multimodal behavior
We introduce multi-environment text and omni RL training in Nemotron 3 Nano Omni. The text RL stage trains across diverse environments in NeMo Gym, which evaluate the model's ability to perform sequences of actions such as tool calling, writing code, and multi-part planning that satisfy verifiable criteria.
Omni RL trains the model to reason across images, video, audio, and text within a unified framework, covering tasks from single-modality to fully multimodal scenarios. A diverse verifier suite evaluates outputs across formats like multiple-choice, math, GUI grounding, and ASR, while intentionally including unanswerable cases to teach the model to abstain when evidence is insufficient rather than hallucinate.
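As a toy illustration of what such a verifier can look like, the sketch below rewards exact multiple-choice matches and rewards abstention on intentionally unanswerable items. The production verifiers and reward shaping are not shown here.

```python
def verify_multiple_choice(model_answer: str, gold: str | None) -> float:
    """Toy reward function in the spirit of the verifier suite described above.

    gold=None marks an intentionally unanswerable case: the model is rewarded
    for abstaining and penalized for guessing. Illustrative only.
    """
    answer = model_answer.strip().lower()
    if gold is None:
        return 1.0 if "not answerable" in answer else 0.0
    return 1.0 if answer.startswith(gold.strip().lower()) else 0.0

print(verify_multiple_choice("B. Paris", "b"))            # 1.0
print(verify_multiple_choice("C. Rome", None))            # 0.0 (should have abstained)
print(verify_multiple_choice("Not answerable", None))     # 1.0
```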
Data and data pipelines
Nemotron 3 Nano Omni is trained on an enhanced dataset that emphasizes high-quality reasoning across multiple modalities. We significantly expand task coverage and introduce synthetic data for complex reasoning scenarios where public datasets are limited. To support this, we build task-specific, multi-stage pipelines for scalable synthetic data generation.
As one example, we generated approximately 11.4M synthetic QA pairs (~45B tokens) from a large corpus of real-world PDFs using NeMo Data Designer. This dataset is used to strengthen long-context document reasoning during post-training and delivers a 2.19× improvement in overall accuracy on MMLongBench-Doc.
We detail the full pipeline evolution, including failure analysis and key lessons learned, in our Data Designer developer note. The note also includes nine runnable pipeline recipes that can serve as starting points for building your own document-understanding datasets.
Example workflows
Example 1: Long multi-page document analysis
Nemotron 3 Nano Omni can analyze and reason over long documents such as financial reports, academic papers, product manuals, etc. The following example retrieves financial metrics across a 100+ page document to calculate another metric:

Figure 3: Stylized example from the MMLongBench-Doc benchmark
The model was prompted to answer the question as follows:
Extract information from all the given images, then answer the question using a single word or phrase. Return 'Not answerable' if the answer cannot be derived from the images.
The model is capable of long-context retrieval, structured extraction, table/chart reading, and multi-page reasoning, all in one go.
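If you serve a checkpoint behind an OpenAI-compatible endpoint (for example with vLLM), a request like the sketch below can send the page images together with the prompt above. The endpoint URL, model identifier, file paths, and the question itself are placeholders.

```python
import base64
from pathlib import Path
from openai import OpenAI

# Placeholder endpoint and model name; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def page_to_part(path: Path) -> dict:
    """Encode one page image as an OpenAI-style image_url content part."""
    b64 = base64.b64encode(path.read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

pages = sorted(Path("report_pages").glob("*.png"))   # one PNG per document page
prompt = (
    "Extract information from all the given images, then answer the question "
    "using a single word or phrase. Return 'Not answerable' if the answer "
    "cannot be derived from the images.\n\n"
    "Question: What was the year-over-year revenue growth?"   # hypothetical question
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni",   # placeholder model identifier
    messages=[{
        "role": "user",
        "content": [*map(page_to_part, pages), {"type": "text", "text": prompt}],
    }],
)
print(response.choices[0].message.content)
```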
Example 2: Video + audio understanding
Nemotron 3 Nano Omni performs joint audio-visual analysis, both locally for specific scenes and globally across the entire video. This enables it to answer complex questions that require cross-modal reasoning, such as identifying the specific visuals shown exactly when a certain topic is mentioned in the audio.
Video
Question
Watch the video and listen to the narration.
1. What structure is on fire as shown in the video, and how much money was being spent on its renovation project?
2. Describe briefly what visuals are shown when the eyewitness narrates her experience.
Answer
1. The structure on fire is the Notre Dame Cathedral, which was undergoing an almost $7 million renovation project.
2. When the eye-witness narrates her experience, the visuals show the cathedral engulfed in flames and thick smoke, scaffolding surrounding the burning roof, firefighters using ladders to spray water, a crowd of people watching from the riverbank, and views of the burning structure against the Paris skyline.
Example 3: Agentic computer use / screenshot reasoning
Nemotron 3 Nano Omni can be integrated into agentic computer-use systems to reason over user intents, analyze GUI elements, and execute actions to accomplish tasks.
The model was given access to a web environment with the following system prompt:
You are a GUI agent. You are given an instruction, a screenshot of the screen, and your previous interactions with the computer. You need to perform a series of actions to complete the task. The password of the computer is {password}.
For each step, provide your response in this format:
{thought}
## Action:
{action}
## Code:
{code}
In the code section, the code should be either pyautogui code or one of the following functions wrapped in the code block:
- {"name": "computer.wait", "description": "Make the computer wait for 20 seconds for installation, running code, etc.", "parameters": {"type": "object", "properties": {}, "required": []}}
- {"name": "computer.terminate", "description": "Terminate the current task and report its completion status", "parameters": {"type": "object", "properties": {"status": {"type": "string", "enum": ["success", "failure"], "description": "The status of the task"}, "answer": {"type": "string", "description": "The answer of the task"}}, "required": ["status"]}}
User Instruction
Find the Driver License Eligibility Requirements
Model Response:
The GIF above shows the actions the agent takes to fulfill the user’s request:
- It clicks on the “License & IDs” tab
- Then it clicks on “Driver’s License” from the dropdown menu
- Then it clicks on “Driver’s License Eligibility” on the left
- It scrolls down on this page and back up
- Finally it produces the following response for the user:
Successfully found the Driver License Eligibility Requirements page on the Virginia DMV website. The page contains comprehensive information about driver license eligibility including:
**General Requirements:**
- Must be a resident of the Commonwealth of Virginia
- Must be at least 16 years and 3 months of age
**First Time Drivers:**
- Must complete a state-approved driver education program
- Must obtain a learner's permit at least 60 days before taking the road skills test
- Must be at least 15 years and 6 months of age for a learner's permit
- No exceptions to age requirements
**New Residents:**
- Temporary residents: must obtain license within 6 months
- Permanent residents: must obtain license within 1 year
**Military Personnel:**
- Active-duty members of the Armed Forces stationed in Virginia
- Spouses and dependents 16 years and 3 months or older may drive with valid out-of-state license
- If vehicle is co-owned and out-of-state plates are used, all co-owners must be active duty members
**Out-of-State Students:**
- Full-time students enrolled in accredited Virginia schools (not employed) may drive with valid out-of-state license
- Full-time students who are employed are considered Virginia residents for vehicle laws
**Non-Residents:**
- Non-residents temporarily living in Virginia may drive with home state license for no more than 6 months
- Note: This does not apply to commercial vehicle drivers
- If becoming gainfully employed, must hold Virginia driver's license
**Identification and Residency Requirements:**
- Proof of identity (acceptable documents listed in DMV 141)
- Proof of legal presence in the U.S.
- Proof of Virginia residency
- Proof of Social Security number
- Proof of name change (if applicable)
Additional resources available: interactive document guide, "Acceptable Documents for Obtaining a Driver's License or Photo ID Card" (DMV 141), and Spanish language version.
Example 4: Mixed document + chart + narration reasoning
Nemotron 3 Nano Omni can analyze charts and figures shown in images together with audio files, identifying commonalities and discrepancies across the media.
Image

Audio
Question
Use both the slide content and the spoken commentary.
List any extra points mentioned by the speaker that are not shown in the slide.
Answer
Based on the speaker's commentary and the visual content of the slide, here are the extra points mentioned that are not explicitly shown or detailed in the slide:
* **The Topic:** The speaker explicitly states she will be covering **multitask learning** with neural networks, while the slide title is simply "Some notation" (though it contains the mathematical definitions for tasks).
* **Nature of the Output:** She describes the output ($y$) as a **probability distribution** over possible labels (e.g., tiger, tiger cat, lynx, cat), whereas the slide only shows a bar chart with class labels without explicitly labeling the vector as a distribution.
* **Specific Task Example (Paper Titles):** She introduces a concrete example where the input is the **title of a paper** and the neural network is tasked with estimating the **length of the paper**. The slide shows "length of paper" near the bar chart, but does not visually distinguish how a paper title maps to these bars versus representing animal classes.
* **Optimization Techniques:** She mentions that the loss function is typically optimized using **stochastic gradient descent** or a **“favorite optimizer,”** which are procedural steps not written on the slide (which only defines the loss function $\mathcal{L}$ to minimize).
* **Input Diversity:** She specifies that inputs can be a **"wide range of unstructured inputs,"** specifically mentioning **images** and **text**. The slide visually displays a single image (a structured input) and uses a generic variable $x$ for the input class.
Example 5: Soundscape
Nemotron 3 Nano Omni can interpret and reason about general audio, including soundscapes and environmental sounds.
Audio
Question
What is producing the main sound in this audio, and what does the acoustic environment suggest about where it was recorded?
Answer
The main sound is a bird chirping, and the presence of multiple bird calls and a natural, open acoustic environment suggests the recording was made outdoors in a natural setting like a forest or park.
Example 6: Music
Nemotron 3 Nano Omni can also interpret and reason about music, characterizing its overall mood and the scenarios it would complement.
Audio
Question
What's the overall vibe of this song, and what kind of scenario would it complement well?
Answer
The song has a calm, reflective, and slightly melancholic vibe, with a gentle piano melody and soft string accompaniment. It would complement a quiet, introspective scene such as a rainy evening, a peaceful walk, or a moment of personal reflection.

