Project Vaani

India's Largest Open-Source Speech Dataset

Built from the ground up across India's villages, towns, and cities, Project Vaani captures real voices from tribal belts to urban metros. It spans 165 districts and surfaces dozens of languages that existing datasets have never touched.

Open-sourced via Bhashini and Hugging Face, it gives developers, researchers, and public innovators a foundation for building AI tools in education, health, governance, and beyond: tools that can genuinely reach India's population.

Features

Coverage: 165 districts, 28 states, 3 union territories

Long-term Goal: 150,000+ hours from all districts of India

Data Volume: 31,255 hours, 22M+ audio segments, over 288k images

Access: Open-source via Vaani and Hugging Face

Language Programs

Vaani

India's Largest Open-Source Speech Dataset

RESPIN

Recognising Speech in Indian Languages

SYSPIN

Synthesising Speech in Indian Languages
