Building AI That Speaks Every Indian Language

India has 22 scheduled languages, hundreds of dialects, and over a billion speakers. Most AI understands none of them well. ARTPARK, in partnership with IISc, is building the data foundation to change that.

Scale of the work

The World's Most Comprehensive Open Language Dataset

Three interlocking programs - Vaani, RESPIN, and SYSPIN - address speech recognition, text-to-speech, and open data across India's full linguistic spectrum. The goal is a 150,000-hour open-source corpus covering all districts of India.

Why it matters

A Billion People.
Largely unheard by the machines meant to serve them.

Language is the interface to AI. If systems don’t understand how people speak, they don’t work for them.
Most models still fall short on India’s reality - accents, dialects, low-resource languages, and code-switched speech. Vaani addresses this at the data layer.

Inclusion, Not Coverage

Moves beyond “supported languages” to represented speech
Brings low-resource languages into mainstream AI systems

Built for the Real World

Trained on noisy, mixed, everyday conversations
Designed for production, not just clean benchmarks

Proven Performance Gains

~21% WER reduction (SandLogic)
Up to ~55% improvement in real-world deployments
3.10 WER across 200+ languages (Shunya Labs, Pingala V1)
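
For readers new to the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The sketch below shows one common way to compute it, along with a relative reduction between two WER values. It is an illustration only: the function names and the sample numbers are hypothetical, and it assumes the reductions quoted above are relative, which the figures themselves do not state.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up transcripts, not taken from any Vaani recording.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Made-up WER values chosen only to show the arithmetic.
    print(relative_reduction(0.20, 0.158))
```

Under this reading, a model whose WER falls from 20% to 15.8% has achieved a relative reduction of about 21%, even though the absolute drop is 4.2 percentage points.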

From Capability to Application

Healthcare, governance, customer experience, and voice-first platforms (e.g., Convozen)
Systems that understand users the way they actually speak

Language Programs

Vaani

India's Largest Open-Source Speech Dataset

RESPIN

Recognising Speech in Indian Languages

SYSPIN

Synthesising Speech in Indian Languages

Collaborator Feedback

“The Vaani Datasets have been invaluable in improving our Speech Models. The quality is excellent, with a great balance of gender variation, detailed metadata, and highly accurate transcripts with precise noise tagging.”

Reverie Language Technology LTD
Pranjal Nayak, Head of R&D

“At SandLogic, we believe India’s AI future must be sovereign, inclusive, and representative of our people. The Vaani dataset captures the richness of Indian speech and has helped us benchmark and enhance our models for stronger performance in both research and enterprise use cases.”

SandLogic Technologies
Dr. Kruthika K R, Founding Researcher

“The Vaani dataset has been instrumental in bridging the data gap for Northeast Indian languages. Covering around 30 tribal languages, it enabled MWire Labs to build the first publicly available ASR system for Garo with a 9.74% Word Error Rate and ~3% Character Error Rate, performance that even impressed native Garo speakers during evaluation and sets a new benchmark for the language.”

MWire Labs
Badal Nyalang, Director

Case Studies

Building Speech Technology for Garo: A Low-Resource ASR Breakthrough Using the Vaani Dataset

Leveraging Vaani Dataset: Fine-Tuning Hindi ASR for Real-World Call Analytics

Shunya Labs + Google Vaani: Speech to text for India and the World

Diverse Data, Real Results: Vaani Drives a 31% Gain in Voice Naturalness

Want to Build with India's Language Data?

Whether you are a researcher, a developer, or an institution building tools for India's population, ARTPARK's language datasets are open and available.

The World's Most Comprehensive Open Language Dataset

31.2k Hours of Speech Data

109 Languages Covered

165 Districts

156k+ Speakers Recorded
