AI News Today

    Gemma 4 VLA Demo on Jetson Orin Nano Super

    By Asier Arranz · 7 Mins Read


    Talk to Gemma 4, and she’ll decide on her own if she needs to look through the webcam to answer you. All running locally on a Jetson Orin Nano Super.

    You speak → Parakeet STT → Gemma 4 → [Webcam if needed] → Kokoro TTS → Speaker
    

    Press SPACE to record, SPACE again to stop. This is a simple VLA (vision-language-action) loop: the model decides on its own whether to act based on the context of what you asked; no keyword triggers, no hardcoded logic. If your question needs Gemma to open her eyes, she’ll decide to take a photo, interpret it, and answer you with that context in mind. She’s not describing the picture; she’s answering your actual question using what she saw.
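    The loop above can be sketched in a few lines of Python. This is a hedged sketch, not the actual structure of Gemma4_vla.py: the stage helpers (stt, chat, capture_frame, tts) are hypothetical stand-ins, injected as callables so the flow is easy to follow.

```python
# Hedged sketch of one voice turn; helper names are stand-ins,
# NOT the real functions inside Gemma4_vla.py.
def handle_turn(audio_wav, stt, chat, capture_frame, tts):
    text = stt(audio_wav)                # Parakeet STT: audio -> text
    reply = chat(text)                   # Gemma 4 via llama-server
    if reply.get("tool_call") == "look_and_answer":
        frame = capture_frame()          # grab one webcam frame
        reply = chat(text, image=frame)  # re-ask with the image attached
    tts(reply["content"])                # Kokoro TTS -> speaker
    return reply["content"]
```

    The key point is the branch: the image path only runs if the model itself asked to look.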

    And honestly? It’s pretty impressive that this runs on a Jetson Orin Nano. 🙂



    Get the code

    The full script for this tutorial lives on GitHub, in my Google_Gemma repo next to the Gemma 2 demos:

    👉 github.com/asierarranz/Google_Gemma

    Grab it with either of these (pick one):

    
    git clone https://github.com/asierarranz/Google_Gemma.git
    cd Google_Gemma/Gemma4
    
    
    wget https://raw.githubusercontent.com/asierarranz/Google_Gemma/main/Gemma4/Gemma4_vla.py
    

    That single file (Gemma4_vla.py) is all you need. It pulls the STT/TTS models and voice assets from Hugging Face on first run.




    Hardware

    What we used:

    • NVIDIA Jetson Orin Nano Super (8 GB)
    • Logitech C920 webcam (mic built in)
    • USB speaker
    • USB keyboard (to press SPACE)

    Not tied to these exact devices: any webcam, USB mic, and USB speaker that Linux sees should work.




    Step 1: System packages

    Fresh Jetson? Let’s install the basics:

    sudo apt update
    sudo apt install -y \
      git build-essential cmake curl wget pkg-config \
      python3-pip python3-venv python3-dev \
      alsa-utils pulseaudio-utils v4l-utils psmisc \
      ffmpeg libsndfile1
    

    build-essential and cmake are only needed if you go the native llama.cpp route (Option A in Step 4). The rest is for audio, webcam, and Python.




    Step 2: Python environment

    python3 -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install opencv-python-headless onnx_asr kokoro-onnx soundfile huggingface-hub numpy
    



    Step 3: Free up RAM (optional but recommended)

    Heads up: this step may not be strictly necessary. But we’re pushing this 8 GB board pretty hard with a fairly capable model, so giving ourselves some headroom makes the whole experience smoother, especially if you’ve been playing with Docker or other heavy stuff before this.

    These are just the commands that worked nicely for me. Use them if they help.



    Add some swap

    Swap won’t speed up inference, but it acts as a safety net during model loading so you don’t get OOM-killed at the worst moment.

    sudo fallocate -l 8G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
    



    Kill memory hogs

    sudo systemctl stop docker 2>/dev/null || true
    sudo systemctl stop containerd 2>/dev/null || true
    pkill -f tracker-miner-fs-3 || true
    pkill -f gnome-software || true
    free -h
    

    Close browser tabs, IDE windows, anything you don’t need. Every MB counts.

    If you’re going with the Docker route in Step 4, obviously don’t stop Docker here; you’ll need it. Still kill the rest, though.



    Still tight on RAM?

    From our tests, Q4_K_M (native build) and Q4_K_S (Docker) run comfortably on the 8 GB board once you’ve done the cleanup above. But if you’ve got other stuff you can’t kill and memory is still tight, you can drop down one step to a Q3 quant: same model, a bit less smart, noticeably lighter. Just swap the filename in Step 4:

    gemma-4-E2B-it-Q3_K_M.gguf   # instead of Q4_K_M
    

    Honestly though, stick with Q4_K_M if you can. It’s the sweet spot.




    Step 4: Serve Gemma 4

    You need a running llama-server with Gemma 4 before launching the demo. We’ll build llama.cpp natively on the Jetson; it gives the best performance and full control over the vision projector that the VLA demo needs.



    Build llama.cpp

    cd ~
    git clone https://github.com/ggml-org/llama.cpp.git
    cd llama.cpp
    cmake -B build \
      -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES="87" \
      -DGGML_NATIVE=ON \
      -DCMAKE_BUILD_TYPE=Release
    cmake --build build --config Release -j4
    



    Download the model and vision projector

    mkdir -p ~/models && cd ~/models
    
    wget -O gemma-4-E2B-it-Q4_K_M.gguf \
      https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf

    wget -O mmproj-gemma4-e2b-f16.gguf \
      https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF/resolve/main/mmproj-gemma4-e2b-f16.gguf
    

    The mmproj file is the vision projector. Without it Gemma can’t see, so don’t skip it.



    Start the server

    ~/llama.cpp/build/bin/llama-server \
      -m ~/models/gemma-4-E2B-it-Q4_K_M.gguf \
      --mmproj ~/models/mmproj-gemma4-e2b-f16.gguf \
      -c 2048 \
      --image-min-tokens 70 --image-max-tokens 70 \
      --ubatch-size 512 --batch-size 512 \
      --host 0.0.0.0 --port 8080 \
      -ngl 99 --flash-attn on \
      --no-mmproj-offload --jinja -np 1
    

    One flag worth mentioning: -ngl 99 tells llama-server to push all of the model’s layers onto the GPU (99 just means “as many as the model has”). If you ever run into memory issues, lower that number to keep fewer layers on the GPU and run the rest on the CPU. For this setup, though, all layers on GPU should work fine.



    Verify it’s up

    From another terminal:

    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"gemma4","messages":[{"role":"user","content":"Hi!"}],"max_tokens":32}' \
      | python3 -m json.tool
    

    If you get JSON back, you’re good.
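    If you’d rather script the check than paste curl, here’s a standard-library Python equivalent. It’s a sketch assuming the endpoint and payload from this tutorial; build_probe and server_is_up are my names, not part of the demo script.

```python
import json
import urllib.request

LLAMA_URL = "http://localhost:8080/v1/chat/completions"

def build_probe(max_tokens=32):
    # Same payload as the curl example above.
    return {
        "model": "gemma4",
        "messages": [{"role": "user", "content": "Hi!"}],
        "max_tokens": max_tokens,
    }

def server_is_up(url=LLAMA_URL, timeout=30):
    # POST the probe and report whether a completion came back.
    req = urllib.request.Request(
        url,
        data=json.dumps(build_probe()).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return "choices" in json.load(resp)
    except OSError:
        return False
```

    server_is_up() returning True is the scripted version of “if you get JSON back, you’re good.”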




    Step 5: Find your mic, speaker, and webcam



    Microphone

    arecord -l
    

    Look for your USB mic. In our case the C920 showed up as plughw:3,0.



    Speaker

    pactl list short sinks
    

    This lists your PulseAudio sinks. Pick the one that matches your speaker; it’ll be a long ugly name like alsa_output.usb-.... In my case it was alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo, but yours will be different.



    Webcam

    v4l2-ctl --list-devices
    

    Usually index 0 (i.e. /dev/video0).
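    To confirm OpenCV can actually read from that index before running the full demo, a small check like this works. The helper names are mine; the demo script reads the same WEBCAM variable, which defaults to 0.

```python
import os

def webcam_index():
    # Same convention as the demo: WEBCAM env var, defaulting to /dev/video0.
    return int(os.environ.get("WEBCAM", "0"))

def grab_frame(index=None):
    # Lazy import so this file still loads where OpenCV isn't installed.
    import cv2  # opencv-python-headless, installed in Step 2
    cap = cv2.VideoCapture(webcam_index() if index is None else index)
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None
```

    If grab_frame() returns None, try the other /dev/videoN entries that v4l2-ctl listed.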



    Quick test

    export MIC_DEVICE="plughw:3,0"
    export SPK_DEVICE="alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo"
    
    arecord -D "$MIC_DEVICE" -f S16_LE -r 16000 -c 1 -d 3 /tmp/test.wav
    paplay --device="$SPK_DEVICE" /tmp/test.wav
    

    If you hear yourself, you’re set.




    Step 6: Run the demo

    Make sure the server from Step 4 is running, then:

    source .venv/bin/activate
    
    export MIC_DEVICE="plughw:3,0"
    export SPK_DEVICE="alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo"
    export WEBCAM=0
    export VOICE="af_jessica"
    
    python3 Gemma4_vla.py
    

    On first launch, the script downloads Parakeet STT, Kokoro TTS, and generates voice prompt WAVs. Takes a minute, then you’re live.

    • SPACE → start recording
    • speak your question
    • SPACE → stop recording

    There’s also a text-only mode if you want to skip audio setup and test the LLM path directly:

    python3 Gemma4_vla.py --text
    



    Changing the voice

    Kokoro ships with many voices. Switch with:

    export VOICE="am_puck"
    python3 Gemma4_vla.py
    

    Some good ones: af_jessica, af_nova, am_puck, bf_emma, am_onyx.




    How it works

    The script exposes exactly one tool to Gemma 4:

    {
      "name": "look_and_answer",
      "description": "Take a photo with the webcam and analyze what is visible."
    }
    

    When you ask a question:

    1. Your speech is transcribed locally (Parakeet STT)
    2. Gemma gets the text plus the tool definition
    3. If the question needs vision, she calls look_and_answer, the script grabs a webcam frame and sends it back
    4. Gemma answers, and Kokoro speaks it out loud

    There’s no keyword matching. The model decides when it needs to see. That’s the VLA part.

    The --jinja flag on llama-server is what enables this: it activates Gemma’s native tool-calling support.
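    In OpenAI-compatible terms, the round trip looks roughly like this. A sketch under assumptions: the payloads follow the standard chat-completions schema that llama-server speaks, but the function names and exact message shapes inside Gemma4_vla.py may differ.

```python
import base64

# The single tool the model is offered, wrapped in the OpenAI tool schema.
LOOK_TOOL = {
    "type": "function",
    "function": {
        "name": "look_and_answer",
        "description": "Take a photo with the webcam and analyze what is visible.",
        "parameters": {"type": "object", "properties": {}},
    },
}

def first_request(user_text):
    # Turn 1: the model sees your question plus the tool definition.
    return {
        "model": "gemma4",
        "messages": [{"role": "user", "content": user_text}],
        "tools": [LOOK_TOOL],
    }

def followup_with_image(user_text, jpeg_bytes):
    # If the model called look_and_answer, the webcam frame goes back
    # as a base64 data URL in a multimodal user message.
    b64 = base64.b64encode(jpeg_bytes).decode()
    return {
        "model": "gemma4",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": user_text},
                {"type": "image_url",
                 "image_url": {"url": "data:image/jpeg;base64," + b64}},
            ],
        }],
    }
```

    Whether turn 2 fires is entirely the model’s call: the script just inspects the response for a tool_calls entry.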




    Troubleshooting

    Server runs out of memory? Do the cleanup from Step 3 again. Close everything. This model fits in 8 GB, but you have to be tidy.

    No sound? Check pactl list short sinks and make sure SPK_DEVICE matches a real sink.

    Mic records silence? Double-check with arecord -l, then test recording manually.

    First run is slow? Normal. It’s downloading models and generating voice prompts. The second run is fast.




    Environment variables

    Variable     Default                                      Description
    LLAMA_URL    http://127.0.0.1:8080/v1/chat/completions    llama-server endpoint
    MIC_DEVICE   plughw:3,0                                   ALSA capture device
    SPK_DEVICE   alsa_output.usb-...analog-stereo             PulseAudio sink for playback
    WEBCAM       0                                            Webcam index (/dev/videoN)
    VOICE        af_jessica                                   Kokoro TTS voice

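    These are plain environment lookups with fallbacks; a sketch of the pattern (SPK_DEVICE is left out of the defaults on purpose, since a working sink name is machine-specific and has to be set explicitly):

```python
import os

# Defaults mirroring the table above; the device values are examples
# from my setup and will differ on yours.
DEFAULTS = {
    "LLAMA_URL": "http://127.0.0.1:8080/v1/chat/completions",
    "MIC_DEVICE": "plughw:3,0",
    "WEBCAM": "0",
    "VOICE": "af_jessica",
}

def setting(name):
    # Environment wins; otherwise fall back to the table default.
    return os.environ.get(name, DEFAULTS[name])
```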


    Bonus: just want to try Gemma 4 in text mode?

    If you don’t care about the full VLA demo and just want to poke at Gemma 4 on your Jetson without building anything, there’s a ready-to-go Docker image from Jetson AI Lab with llama.cpp pre-compiled for Orin:

    sudo docker run -it --rm --pull always \
      --runtime=nvidia --network host \
      -v $HOME/.cache/huggingface:/root/.cache/huggingface \
      ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
      llama-server -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S
    

    One line, no compilation, and -hf pulls the GGUF from Hugging Face on first run. Hit http://localhost:8080 with any OpenAI-compatible client and chat away.

    Heads up: this Docker path is text-only. It doesn’t load the vision projector, so it won’t work with the VLA demo above. For the full webcam experience, stick with the native build in Step 4.


    Hope you enjoyed this tutorial! If you have any questions or ideas, feel free to reach out. 🙂

    Asier Arranz | NVIDIA
