Two Korean undergraduates with limited AI experience have accomplished something remarkable. In just three months, they created Dia, a text-to-speech AI model that rivals Google’s NotebookLM and other commercial offerings from companies like ElevenLabs and Sesame. Their success highlights a significant shift in AI development: powerful tools are now within reach of small teams with modest resources.
What Makes Dia Stand Out
Dia is a 1.6 billion parameter model that generates realistic dialogue from text. What sets it apart is its ability to create podcast-style conversations complete with natural-sounding speech patterns, emotional tones, and non-verbal sounds like laughter and coughing.
Unlike many other speech generators, Dia gives users precise control over their generated audio. You can:
- Create multi-speaker dialogues using simple [S1] and [S2] tags
- Add non-verbal elements like laughs and coughs
- Clone voices from sample audio
- Maintain speaker consistency through audio prompts or fixed random seeds
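Because the speaker tags are plain text, a dialogue script can be assembled programmatically before being handed to the model. A minimal sketch (the `build_script` helper and turn structure below are illustrative conveniences, not part of Dia's API):

```python
def build_script(turns):
    """Format (speaker, line) pairs into Dia's [S1]/[S2] tag syntax.

    Non-verbal cues such as (laughs) or (coughs) are written inline,
    exactly where they should occur in the generated audio.
    """
    return " ".join(f"[S{speaker}] {line}" for speaker, line in turns)

script = build_script([
    (1, "Did you hear about the open-source TTS model?"),
    (2, "The one built by two undergrads? (laughs) Incredible."),
    (1, "Three months, 1.6 billion parameters."),
])
print(script)
```

The resulting string alternates `[S1]`/`[S2]` markers, which is all the model needs to assign lines to distinct voices.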
The model runs on most modern PCs with at least 10GB of VRAM and can generate audio in real time on enterprise-grade GPUs. A workstation-class A4000 produces about 40 tokens per second; with 86 tokens corresponding to one second of audio, that works out to a bit under half of real-time speed.
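Those two figures make it easy to estimate generation time on your own hardware. Using the published numbers (86 tokens per second of audio, roughly 40 tokens per second of throughput on an A4000):

```python
TOKENS_PER_AUDIO_SECOND = 86   # tokens that make up one second of audio
a4000_tokens_per_second = 40   # approximate throughput on an A4000

# Real-time factor: values above 1.0 mean faster than real time
rtf = a4000_tokens_per_second / TOKENS_PER_AUDIO_SECOND
print(f"Real-time factor on an A4000: {rtf:.2f}")

def generation_time(audio_seconds, tokens_per_second):
    """Wall-clock seconds needed to generate `audio_seconds` of speech."""
    return audio_seconds * TOKENS_PER_AUDIO_SECOND / tokens_per_second

print(f"{generation_time(60, a4000_tokens_per_second):.0f} s to generate one minute of audio")
```

A GPU sustaining 86 or more tokens per second crosses the real-time threshold, which is why the real-time claim is limited to enterprise-grade cards.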
The Technical Achievement
What makes Dia particularly impressive is its modest size. At 1.6 billion parameters, it’s relatively small compared to many commercial models, yet produces results that compete with much larger systems. This efficiency suggests clever architecture choices and training approaches.
The team leveraged Google’s TPU Research Cloud program for free access to specialized AI chips during training. Their GitHub page mentions that their work was heavily inspired by SoundStorm, Parakeet, and Descript Audio Codec research.
For developers, Dia is refreshingly accessible. The model weights are hosted on Hugging Face, and implementation requires just a few lines of code.
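A hedged sketch of what that looks like, based on the project's published quick-start (exact module paths and argument names may differ between releases, so check the repository's README before copying this verbatim):

```python
import soundfile as sf
from dia.model import Dia  # package from the nari-labs/dia repository

model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # weights pulled from Hugging Face

text = "[S1] Dia generates dialogue directly from a transcript. [S2] Complete with reactions. (laughs)"
audio = model.generate(text)

sf.write("dialogue.wav", audio, 44100)  # Dia outputs 44.1 kHz audio
```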
The Rise of Open-Source AI Speech Models
The voice AI market is growing rapidly. PitchBook reports that startups in this space raised hundreds of millions of dollars in venture capital last year alone. But Dia represents something different: an open-source alternative that puts advanced capabilities in the hands of individual developers, small businesses, and researchers.
Open-weight models like Dia are changing how AI tools reach the market. Instead of being locked behind subscription paywalls or API rate limits, these tools allow direct implementation and modification. This openness accelerates innovation as developers can build on each other’s work.
For small businesses and independent content creators, models like Dia offer affordable access to tools that were previously only available to well-funded companies. A podcast producer can now generate realistic dialogue without expensive voice actors. A game developer can create dynamic NPC conversations without a massive audio recording budget.
From Proof of Concept to Practical Applications
The potential applications for Dia extend far beyond simple voice generation:
- Content Creation: Podcasters can test scripts and formats before recording
- Game Development: Indie games can feature rich dialogue without extensive voice acting budgets
- Accessibility: Text content can be converted to audio for vision-impaired users
- Language Learning: Generate conversation examples between native speakers
- Prototyping: UX designers can quickly mock up voice interfaces
For developers looking to implement Dia in their projects, the process is straightforward: the model can be installed directly from GitHub, and the team provides a Gradio UI for experimentation.
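The setup, roughly as described in the repository's README at the time of writing (commands may change, and the pretrained weights download from Hugging Face on first run):

```shell
git clone https://github.com/nari-labs/dia.git
cd dia
pip install -e .   # the project also supports uv
python app.py      # launches the Gradio UI locally
```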
Ethics and Responsible Use
The ease with which Dia can clone voices and generate realistic speech raises important ethical questions. The model offers few built-in safeguards, making it easy to create misleading or deceptive content.
Developers implementing Dia should consider adding their own safeguards:
- Audio watermarking to identify synthetic content
- Clear disclosure when AI-generated voices are used
- Verification systems for sensitive applications
- Consent mechanisms before cloning someone’s voice
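Production watermarking relies on robust, perceptually-shaped schemes, but the core idea can be illustrated with a toy least-significant-bit marker on 16-bit PCM samples. The helpers below are hypothetical and purely illustrative; a real deployment should use a dedicated watermarking library:

```python
def embed_marker(samples, marker_bits):
    """Hide marker bits in the LSBs of successive PCM samples (toy example).

    Flipping the least significant bit changes a 16-bit sample by at most 1,
    which is inaudible, but this naive scheme does not survive re-encoding.
    """
    out = list(samples)
    for i, bit in enumerate(marker_bits):
        out[i] = (out[i] & ~1) | bit  # overwrite the least significant bit
    return out

def extract_marker(samples, n_bits):
    """Read the first n_bits least significant bits back out."""
    return [s & 1 for s in samples[:n_bits]]

marker = [1, 0, 1, 1, 0, 0, 1, 0]  # e.g. a synthetic-content flag
pcm = [1000, -2000, 3000, 4001, 5, -6, 70, 800, 900, 1000]
tagged = embed_marker(pcm, marker)
print(extract_marker(tagged, len(marker)))
```

Real schemes spread the mark across the frequency domain so it survives compression and resampling, but the disclosure principle is the same.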
While Nari Labs disclaims responsibility for misuse, developers who implement the technology bear ethical responsibility for its applications. The industry is still working to establish best practices and standards for synthetic media disclosure.
The Data Question
Like many AI models, Dia ships with little information about its training data. The team hasn't disclosed which datasets they used, raising questions about whether copyrighted content was included in the training process.
A commenter on Hacker News noted that some samples sound similar to hosts from NPR’s “Planet Money” podcast, suggesting that commercial audio may have been used without explicit permission. This highlights the ongoing legal questions surrounding AI training data.
For developers working with Dia, this creates potential legal uncertainties. While fair use arguments are common in AI development, the legal landscape remains unsettled. Companies building commercial products with Dia may want to seek legal advice about potential copyright implications.
What’s Next for Dia and Nari Labs
The Nari Labs team has ambitious plans. They intend to create a synthetic voice platform with social features built on top of Dia and larger future models. They’re also planning to release a technical report detailing Dia’s architecture and expand language support beyond English.
For developers, several improvements are in the pipeline:
- Docker support for easier deployment
- Optimized inference speed
- Quantization for better memory efficiency
- CPU support for systems without dedicated GPUs
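Quantization in particular is easy to illustrate: storing weights as 8-bit integers plus a per-tensor scale cuts memory roughly 4x versus float32, at the cost of a small rounding error. A minimal symmetric-quantization sketch (not Dia's actual scheme, which hasn't been published):

```python
def quantize(weights):
    """Symmetric int8 quantization: map floats into [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.52, -1.27, 0.003, 0.89, -0.41]
q, scale = quantize(weights)
restored = dequantize(q, scale)

# Each restored weight is within half a quantization step of the original
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f}, max error={max_err:.5f}")
```

The trade-off is visible in `max_err`: the error is bounded by half the quantization step, which is typically negligible relative to model accuracy but quarters the memory footprint.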
The Democratization of AI Development
Perhaps the most significant aspect of Dia is what it represents: two undergraduates with limited AI expertise created a competitive speech synthesis model in just three months. This would have been unthinkable just a few years ago when AI development was limited to well-funded research labs and tech giants.
Today’s AI landscape offers unprecedented access to research, computing resources, and open-source building blocks. Google’s TPU Research Cloud program provided the computing power. Previous research from SoundStorm, Parakeet, and Descript Audio Codec provided the theoretical foundation. Hugging Face and GitHub provided distribution channels.
This democratization means we can expect to see more breakthrough tools from unexpected sources. The next game-changing AI application might come from a dorm room, a garage startup, or an independent researcher with a good idea.
Try It Yourself
If you want to experiment with Dia, there are several options:
- Visit the Hugging Face Space demo: https://huggingface.co/spaces/nari-labs/Dia-1.6B
- Clone the GitHub repository and run the model locally
- Join their Discord server for community support and access to new features
- Get on the waitlist for access to larger versions of the model
The team is also open to contributions, noting they are “a tiny team of 1 full-time and 1 part-time research-engineers.”
What will you create with this new tool? The possibilities are vast, from enhancing existing products with voice features to building entirely new applications that weren’t possible before. As AI tools become more accessible, the limiting factor is no longer technical capability but your imagination and creativity.