Building AI That Speaks Every Indian Language
India has 22 scheduled languages, hundreds of dialects, and over a billion speakers. Most AI understands none of them well. ARTPARK, in partnership with IISc, is building the data foundation to change that.
Scale of the work
The World's Most Comprehensive Open Language Dataset
Three interlocking programs - Vaani, RESPIN, and SYSPIN - address speech recognition, text-to-speech, and open data across India's full linguistic spectrum. The goal is a 150,000-hour open-source corpus covering all districts of India.
Why it matters
A Billion People.
Largely unheard by the machines meant to serve them.
Language is the interface to AI. If systems don’t understand how people speak, they don’t work for them.
Most models still fall short on India’s reality - accents, dialects, low-resource languages, and code-switched speech. Vaani addresses this at the data layer.
Inclusion, Not Coverage
Moves beyond “supported languages” to represented speech
Brings low-resource languages into mainstream AI systems
Built for the Real World
Trained on noisy, mixed, everyday conversations
Designed for production, not just clean benchmarks
Proven Performance Gains
~21% WER reduction (SandLogic)
Up to ~55% improvement in real-world deployments
3.10 WER across 200+ languages (Shunya Labs, Pingala V1)
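For readers new to the metric: word error rate (WER) is the number of word-level substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the evaluation code used by any collaborator named here. The function names and sample numbers are hypothetical, and it assumes a "WER reduction" is reported relative to a baseline, which the figures above do not specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline (assumed meaning)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only, not taken from any reported result.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
gain = relative_reduction(baseline_wer=0.20, new_wer=0.158)
print(f"WER: {wer:.3f}, relative reduction: {gain:.0%}")
```

Under this reading, a model that moves from 20% to 15.8% WER has achieved a 21% relative reduction, even though the absolute drop is 4.2 percentage points.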
From Capability to Application
Healthcare, governance, customer experience, and voice-first platforms (e.g., Convozen)
Systems that understand users the way they actually speak
Language Programs
Vaani
India's Largest Open-Source Speech Dataset
RESPIN
Recognising Speech in Indian Languages
SYSPIN
Synthesising Speech in Indian Languages
Collaborator Feedback
“The Vaani Datasets have been invaluable in improving our Speech Models. The quality is excellent, with a great balance of gender variation, detailed metadata, and highly accurate transcripts with precise noise tagging.”
Reverie Language Technology LTD
Pranjal Nayak, Head of R&D
“At SandLogic, we believe India’s AI future must be sovereign, inclusive, and representative of our people. The Vaani dataset captures the richness of Indian speech and has helped us benchmark and enhance our models for stronger performance in both research and enterprise use cases.”
SandLogic Technologies
Dr. Kruthika K R, Founding Researcher
“The Vaani dataset has been instrumental in bridging the data gap for Northeast Indian languages. Covering around 30 tribal languages, it enabled MWire Labs to build the first publicly available ASR system for Garo with a 9.74% Word Error Rate and ~3% Character Error Rate, performance that even impressed native Garo speakers during evaluation and sets a new benchmark for the language.”
MWire Labs
Badal Nyalang, Director
Case Studies
Building Speech Technology for Garo: A Low-Resource ASR Breakthrough Using the Vaani Dataset
Leveraging Vaani Dataset: Fine-Tuning Hindi ASR for Real-World Call Analytics
Shunya Labs + Google Vaani: Speech to text for India and the World
Diverse Data, Real Results: Vaani Drives a 31% Gain in Voice Naturalness
Want to Build with India's Language Data?
Whether you are a researcher, a developer, or an institution building tools for India's population, ARTPARK's language datasets are open and available.