Building AI That Speaks Every Indian Language

India has 22 scheduled languages, hundreds of dialects, and over a billion speakers. Most AI systems understand none of them well. ARTPARK, in partnership with IISc, is building the data foundation to change that.

Scale of the work

The World's Most Comprehensive Open Language Dataset 

Three interlocking programs - Vaani, RESPIN, and SYSPIN - address open speech data, speech recognition, and text-to-speech, respectively, across India's full linguistic spectrum. The goal is a 150,000-hour open-source corpus covering all districts of India.

Why it matters

A Billion People.
Largely unheard by the machines meant to serve them.

Language is the interface to AI. If systems don’t understand how people speak, they don’t work for them.
Most models still fall short on India’s reality: accents, dialects, low-resource languages, and code-switched speech. Vaani addresses this at the data layer.

Inclusion, Not Coverage

  • Moves beyond “supported languages” to represented speech

  • Brings low-resource languages into mainstream AI systems

Built for the Real World

  • Captures noisy, mixed, everyday conversations

  • Designed for production, not just clean benchmarks

Proven Performance Gains

  • ~21% WER reduction (SandLogic)

  • Up to 55% improvement in real-world deployments

  • 3.10 WER across 200+ languages (Shunya Labs, Pingala V1)
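
For readers less familiar with the metric: word error rate (WER) is the number of word-level substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition in Python, not the evaluation code used by any of the teams named above. The function names and the example sentences are hypothetical, and the numbers in the usage lines are made up for illustration. This page does not say whether the gains quoted are relative or absolute, so the second function shows relative reduction as one common reading.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer


# Illustrative inputs only: one deleted word ("on") and one
# substituted word ("light" -> "lights") against a 5-word reference.
example_wer = word_error_rate("turn on the kitchen light",
                              "turn the kitchen lights")
print(f"WER: {example_wer:.2f}")

# Made-up figures: a model going from 20% WER to 15% WER.
print(f"Relative reduction: {relative_reduction(0.20, 0.15):.0%}")
```

Production evaluations usually normalise text first (casing, punctuation, numerals, script variants), which can shift WER noticeably, so figures from different sources are only comparable when the normalisation matches.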

From Capability to Application

  • Healthcare, governance, customer experience, and voice-first platforms (e.g., Convozen)

  • Systems that understand users the way they actually speak

Language Programs

Vaani

India's Largest Open-Source Speech Dataset

RESPIN

Recognising Speech in Indian Languages

SYSPIN

Synthesising Speech in Indian Languages

Collaborator Feedback

“The Vaani Datasets have been invaluable in improving our Speech Models. The quality is excellent, with a great balance of gender variation, detailed metadata, and highly accurate transcripts with precise noise tagging.”

Reverie Language Technology LTD
Pranjal Nayak, Head of R&D

“At SandLogic, we believe India’s AI future must be sovereign, inclusive, and representative of our people. The Vaani dataset captures the richness of Indian speech and has helped us benchmark and enhance our models for stronger performance in both research and enterprise use cases.”

SandLogic Technologies
Dr. Kruthika K R, Founding Researcher

“The Vaani dataset has been instrumental in bridging the data gap for Northeast Indian languages. Covering around 30 tribal languages, it enabled MWire Labs to build the first publicly available ASR system for Garo with a 9.74% Word Error Rate and ~3% Character Error Rate, performance that even impressed native Garo speakers during evaluation and sets a new benchmark for the language.”

MWire Labs
Badal Nyalang, Director

Media Coverage

Case Studies

Building Speech Technology for Garo: A Low-Resource ASR Breakthrough Using the Vaani Dataset

Leveraging Vaani Dataset: Fine-Tuning Hindi ASR for Real-World Call Analytics

Shunya Labs + Google Vaani: Speech to text for India and the World

Diverse Data, Real Results: Vaani Drives a 31% Gain in Voice Naturalness

Want to Build with India's Language Data?

Whether you are a researcher, a developer, or an institution building tools for India's population, ARTPARK's language datasets are open and available.