    microsoft/VibeVoice

    Screenshot of a macOS terminal running an mlx-audio speech-to-text command using the VibeVoice-ASR-4bit model on lenny.mp3, showing download progress, a warning that audio duration (99.8 min) exceeds the 59 min maximum so it

    microsoft/VibeVoice. VibeVoice is Microsoft’s Whisper-style audio model for speech-to-text, MIT licensed and with speaker diarization built into the model.

    Microsoft released it on January 21st, 2026 but I hadn’t tried it until today. Here’s a one-liner to run it on a Mac with uv, mlx-audio (by Prince Canuma) and the 5.71GB mlx-community/VibeVoice-ASR-4bit MLX conversion of the 17.3GB VibeVoice-ASR model, in this case against a downloaded copy of my recent podcast appearance with Lenny Rachitsky:

    uv run --with mlx-audio python -m mlx_audio.stt.generate \
      --model mlx-community/VibeVoice-ASR-4bit \
      --audio lenny.mp3 --output-path lenny \
      --format json --verbose --max-tokens 32768
    

    The tool reported back:

    Processing time: 524.79 seconds
    Prompt: 26615 tokens, 50.718 tokens-per-sec
    Generation: 20248 tokens, 38.585 tokens-per-sec
    Peak memory: 30.44 GB
    

    So that’s 8 minutes 45 seconds for an hour of audio (running on a 128GB M5 Max MacBook Pro).
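As a quick check of that arithmetic, here is a minimal Python sketch. The `format_duration` helper is my own name for illustration, and the one-hour audio length is an assumption taken from the prose above rather than a measured value.

```python
def format_duration(seconds: float) -> str:
    """Format a duration as 'M minutes S seconds', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} minutes {secs} seconds"


processing_seconds = 524.79  # "Processing time" reported by the tool
audio_seconds = 60 * 60      # assumption: roughly one hour of audio was transcribed

print(format_duration(processing_seconds))

# How many seconds of audio were handled per second of processing.
speedup = audio_seconds / processing_seconds
print(f"roughly {speedup:.1f}x faster than real time")
```

On that assumption the run works out to somewhere between six and seven times faster than real time on this machine.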

    I’ve tested it against .wav and .mp3 files and they both worked fine.

    If you omit --max-tokens it defaults to 8192, which is enough for about 25 minutes of audio. I discovered that through trial and error, then quadrupled it to 32768 to make sure I’d get the full hour.
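Those two numbers suggest a rough rule of thumb for picking a token budget. The sketch below is only an estimate under my own assumptions: `estimate_max_tokens` is a hypothetical helper, the tokens-per-minute rate comes from the "8192 tokens for about 25 minutes" observation, and the 1.5x headroom is an arbitrary safety margin. Real token usage will vary with how much speech the audio contains.

```python
import math


def estimate_max_tokens(audio_minutes: float,
                        tokens_per_minute: float,
                        headroom: float = 1.5) -> int:
    """Rough --max-tokens estimate: expected tokens times a safety margin."""
    return math.ceil(audio_minutes * tokens_per_minute * headroom)


# Observed above: the 8192 default covered about 25 minutes of audio.
tokens_per_minute = 8192 / 25

print(round(tokens_per_minute))                    # approximate tokens per minute
print(estimate_max_tokens(60, tokens_per_minute))  # budget for an hour, with headroom
```

For comparison, the run above generated 20248 tokens for the hour, which is in the same range as this rate predicts, and the 32768 used in the command sits comfortably above the estimate.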

    That command reported using 30.44GB of RAM at peak, but in Activity Monitor I observed 61.5GB of usage during the prefill stage and around 18GB during the generating phase.

    Here’s the resulting JSON. The key structure looks like this:

    {
      "text": "And an open question for me is how many other knowledge work fields are actually prone to these agent loops?",
      "start": 13.85,
      "end": 19.5,
      "duration": 5.65,
      "speaker_id": 0
    },
    {
      "text": "Now that we have this power, people almost underestimate what they can do with it.",
      "start": 19.5,
      "end": 22.78,
      "duration": 3.280000000000001,
      "speaker_id": 1
    },
    {
      "text": "Today, probably 95% of the code that I produce, I didn't type it myself. I write so much of my code on my phone. It's wild.",
      "start": 22.78,
      "end": 30.0,
      "duration": 7.219999999999999,
      "speaker_id": 0
    }
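Segments in this shape are easy to turn into a readable transcript. Here is a minimal sketch that uses two of the segments above, wrapped in square brackets to make a valid JSON array. The `fmt_time` and `to_transcript` helpers are illustrative names of my own, not part of mlx-audio.

```python
import json

# Two segments copied from the excerpt above, wrapped in [] to form an array.
raw = """[
  {"text": "And an open question for me is how many other knowledge work fields are actually prone to these agent loops?",
   "start": 13.85, "end": 19.5, "duration": 5.65, "speaker_id": 0},
  {"text": "Now that we have this power, people almost underestimate what they can do with it.",
   "start": 19.5, "end": 22.78, "duration": 3.280000000000001, "speaker_id": 1}
]"""


def fmt_time(seconds: float) -> str:
    """Format an offset in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def to_transcript(segments: list[dict]) -> str:
    """Render one line per segment: [MM:SS] Speaker N: text."""
    return "\n".join(
        f"[{fmt_time(seg['start'])}] Speaker {seg['speaker_id']}: {seg['text']}"
        for seg in segments
    )


segments = json.loads(raw)
print(to_transcript(segments))
```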
    

    Since that’s an array of objects we can open it in Datasette Lite, making it easier to browse.

    Amusingly, that Datasette Lite view shows three speakers: it identified Lenny and me for the conversation, and then a separate Lenny for the voice he used for the additional intro and the sponsor reads!
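A quick way to spot this kind of surprise without opening a browser is to total the speaking time per `speaker_id`. This is a sketch using only the timing fields of the three example segments shown earlier. `speaking_time` is a hypothetical helper of my own, and the full file would of course give different totals.

```python
from collections import defaultdict

# Timing and speaker fields from the three example segments (text omitted).
segments = [
    {"start": 13.85, "end": 19.5, "speaker_id": 0},
    {"start": 19.5, "end": 22.78, "speaker_id": 1},
    {"start": 22.78, "end": 30.0, "speaker_id": 0},
]


def speaking_time(segments: list[dict]) -> dict[int, float]:
    """Sum (end - start) in seconds for each speaker_id."""
    totals: dict[int, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker_id"]] += seg["end"] - seg["start"]
    return dict(totals)


totals = speaking_time(segments)
print(f"{len(totals)} distinct speaker IDs")
for speaker, seconds in sorted(totals.items()):
    print(f"speaker {speaker}: {seconds:.2f}s")
```

An unexpected extra ID with only a small share of the total time would be a hint that the model has split one person into two speakers.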

    VibeVoice can only handle up to an hour of audio, so running the above command transcribed just the first hour of the podcast. To transcribe more than that you’d need to split the audio, ideally with a minute or so of overlap so you can avoid errors from partially transcribed words at the split point. You’d also need to then line up the identified speaker IDs across the multiple segments.
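The splitting step can be planned with a little arithmetic. The sketch below only computes overlapping (start, end) windows in seconds. It does not cut the audio or reconcile speaker IDs, and the 55 minute chunk length is my own assumption, chosen to stay under the limit. `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(total_seconds: float,
                  chunk_seconds: float,
                  overlap_seconds: float) -> list[tuple[float, float]]:
    """Return (start, end) windows covering the audio, each overlapping the previous one."""
    if total_seconds <= 0 or chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than a chunk")

    step = chunk_seconds - overlap_seconds  # how far each new window advances
    windows = []
    start = 0
    while True:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            return windows
        start += step


# A 99.8 minute file is 5988 seconds. Assumed: 55 minute chunks, 60 seconds of overlap.
for start, end in chunk_windows(5988, 55 * 60, 60):
    print(f"{start / 60:.1f} min -> {end / 60:.1f} min")
```

Each window could then be cut out with an audio tool and transcribed separately. Segment timestamps from later chunks would need the window's start offset added back, and the overlapping minute gives you shared text to help match speaker IDs between chunks.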
