Legal Framework for Open-Sourcing Language Data in India

A white paper to inform stakeholders and the ecosystem
September 2024

Introduction

Given the growing demand for artificial intelligence (AI) models localized to the Indian experience, there is an urgent need for reliable datasets. These include speech data in regional languages, which can facilitate the widespread implementation of speech and language technologies among India’s marginalized communities. These speech datasets, comprising audio files, underlying text/transcripts, and annotations/metadata (Speech Datasets), have diverse applications, from automatic speech recognition to text-to-speech systems. Many organisations in India are collecting and hosting Speech Datasets, or are planning to do so (Organisations), with the aim of creating a substantial open-source repository adaptable to various contexts. Organisations therefore need to bear in mind certain legal and ethical considerations across the lifecycle of a Speech Dataset.

This document is intended to serve as a foundational resource for Organisations seeking to understand the legal issues pertaining to the creation and hosting of Speech Datasets in India. It is not legal advice, and Organisations should seek legal counsel on specific situations and challenges.

AUTHORS

Trilegal
Rahul Matthan
Shreya Ramann
Pranay Jalan

IISc/ARTPARK
Nihar Desai
Prof. Prasanta Ghosh
Raghu Dharmaraju