Project Vaani
India's Largest Open-Source Speech Dataset
Built from the ground up across India's villages, towns, and cities, Project Vaani captures real voices from tribal belts to urban metros. It spans 165 districts and surfaces dozens of languages that existing datasets have never touched.
Open-sourced via Bhashini and Hugging Face, it gives developers, researchers, and public innovators a foundation for building AI in education, health, governance, and beyond — tools that can genuinely reach India's population.
Features
Coverage: 165 districts, 28 states, 3 union territories
Long-term Goal: 150,000+ hours from all districts of India
Data Volume: 31,255 hours, 22M+ audio segments, over 288k images
Access: Open-source via Vaani and Hugging Face
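To put the figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers listed under Features; the function names are illustrative and not part of any Vaani tooling. Because "22M+" is a lower bound on the segment count, the average segment length it yields (about 5.1 seconds) is best read as an approximate upper bound.

```python
# Back-of-the-envelope figures derived from the published dataset stats.
# All function names here are hypothetical helpers, not a Vaani API.

SECONDS_PER_HOUR = 3600


def average_segment_seconds(total_hours: float, segment_count: int) -> float:
    """Mean audio segment length in seconds."""
    if segment_count <= 0:
        raise ValueError("segment_count must be positive")
    return total_hours * SECONDS_PER_HOUR / segment_count


def progress_toward_goal(current_hours: float, goal_hours: float) -> float:
    """Fraction of the long-term hour target collected so far."""
    if goal_hours <= 0:
        raise ValueError("goal_hours must be positive")
    return current_hours / goal_hours


if __name__ == "__main__":
    hours = 31_255          # current data volume, in hours
    segments = 22_000_000   # "22M+" segments, treated as a lower bound
    goal = 150_000          # long-term goal, "150,000+" hours

    avg = average_segment_seconds(hours, segments)
    share = progress_toward_goal(hours, goal)
    print(f"Average segment length: about {avg:.1f} s")
    print(f"Progress toward goal: about {share:.1%}")
```

By this estimate, the current 31,255 hours amount to roughly a fifth of the 150,000-hour long-term goal.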
Language Programs
Vaani
India's Largest Open-Source Speech Dataset
RESPIN
Recognising Speech in Indian Languages
SYSPIN
Synthesising Speech in Indian Languages