Mitesh Khapra: A Leading Force in AI for Indian Languages

Biography

Mitesh M. Khapra, currently an Associate Professor at the Indian Institute of Technology Madras (IIT Madras), is one of the most influential academic leaders in artificial intelligence, recognized globally for his commitment to democratizing AI technology across India’s linguistic diversity. In 2025, TIME Magazine named him among its 100 most influential people in AI, a list on which high-profile technology CEOs and founders feature prominently. Khapra’s success, unlike that of many of his peers on the list, is measured not in revenues or company valuations but in the impact of his research, his open-source initiatives, and the access to technology he has brought to millions of non-English speakers in India.

Khapra’s core motivation is achieving parity in AI technology between English and India’s 22 official languages, which represent a rich tapestry of linguistic and cultural identities. As co-founder and principal scientific architect of AI4Bharat, a non-profit research lab at IIT Madras, he leads efforts to build the foundational data, tools, and models needed for robust natural language processing (NLP) and machine learning (ML) in Indian languages. These initiatives have supported advances in digital services, governance, education, and entrepreneurship, and in the process have shifted national academic research priorities towards solving “Indian problems” rather than adapting Western models.

Beyond his research, Mitesh Khapra is distinguished as a mentor, teacher, and builder of communities that foster open science. His journey bridges academic rigor and real-world pragmatism—anchoring theoretical advances in NLP and ML with deployable, open-source applications that have become the backbone for India’s booming AI ecosystem. His leadership at AI4Bharat and his significant role in government-industry initiatives like the Bhashini mission have helped shape a future where India aspires not only to adopt global AI advances, but to become a foundational contributor to them.


Academic Background

Education

Mitesh Khapra's academic journey began at the Indian Institute of Technology Bombay (IIT Bombay), one of India’s most prestigious engineering and research institutions. He completed his M.Tech in 2008 and subsequently his Ph.D. in 2012 under the guidance of distinguished advisors including Pushpak Bhattacharyya and A. Kumaran. His doctoral thesis, titled "Reusing Resources for Multilingual Computation," broke new ground by exploring efficient methods to simultaneously process multiple languages—work that anticipated many of the challenges facing India’s multilingual digital landscape today.

During his doctoral studies, Khapra received the IBM PhD Fellowship and Microsoft Rising Star Award, both in 2011—a testament to the international academic community’s recognition of his early potential. His research at this stage laid the conceptual foundation for scalable multilingual language technologies, particularly relevant for countries with rich linguistic diversity such as India.


Professional Trajectory

Before delving fully into academic research, Khapra gained valuable industry experience as a software engineer at Infosys Technologies in Pune from 2002 to 2004. There, he worked on web services and scalable software systems, cementing his understanding of the practical constraints that real-world technology deployments face. This early exposure to industry challenges would later inform Khapra’s uniquely pragmatic approach to research, blending theory with application.

After earning his doctorate, Khapra joined IBM Research India. Over four and a half years, he developed critical expertise in machine translation, cross-language learning, deep learning, multimodal language processing, and argument mining. He published several influential papers in leading conferences and journals during this period, setting the stage for his eventual move to academia and open science.


Academic Role at IIT Madras

In 2016, Mitesh Khapra joined IIT Madras as a faculty member. He is now an Associate Professor in the Department of Data Science and AI at the Wadhwani School of Data Science and AI. Khapra also heads the AI4Bharat Research Lab at IIT Madras, where he guides major research, development, and outreach efforts aimed at advancing AI technologies for Indian languages.

His teaching portfolio includes foundational and advanced courses on deep learning, natural language processing, and computational engineering for undergraduate and graduate students. Notably, Khapra’s commitment to mentorship has produced several Ph.D. graduates, multiple recipients of elite fellowships, and cohorts of student researchers now at the forefront of India’s AI renaissance.


Research Contributions: NLP, Multilingual AI, and Machine Learning

Research Philosophy and Objectives

At the heart of Khapra’s research philosophy lies a drive to bridge the digital divide caused by language exclusion. He points out that “the reason Indian language technology is behind English is because we do not have enough data for Indian languages.” Western language models, he observes, perform well on globally represented languages but fall far behind for India’s lower-resourced languages. Addressing this disparity, Khapra’s work aims for two ambitious goals: (1) to achieve genuine technological parity for all Indian languages in AI, and (2) to make such advances openly accessible and reusable by researchers, startups, and the broader public.

Key Research Areas

Multilingual Pretrained Language Models

Khapra’s work on multilingual pretrained models, spanning both text and speech, has shaped the state of the art in Indian NLP. His research has contributed to the development and evaluation of models like IndicBERT and IndicBART, which are designed for the linguistic complexities and resource constraints of Indian languages. Trained on large datasets drawn from diverse Indian language sources, these models have matched or exceeded multilingual baselines such as mBERT and XLM-R on Indian-language benchmarks. They also enable applications such as zero-shot transfer, cross-lingual retrieval, and sentiment analysis across Indian languages.

IndicBERT, for instance, is an ALBERT-based multilingual architecture pretrained specifically on 12 major Indian languages, demonstrating state-of-the-art results on classification, translation, and information retrieval benchmarks relevant to the Indian context. Its frequent adoption by academic groups and technology startups highlights its impact on the ecosystem.

Similarly, Khapra’s contributions to IndicBART and Airavata extend these capabilities to generative tasks, including summarization and text generation, across major Indian languages. This work is supported by the careful curation of training corpora such as IndicCorp, which includes billions of tokens sourced from news, literature, and web content in Indian languages.

Machine Translation and Low-Resource Language Processing

Khapra’s research has advanced neural machine translation (NMT) for Indian languages. His lab was pivotal in the release of IndicTrans and its successor, IndicTrans2, the first open-source transformer-based NMT model to support translation across all 22 official Indian languages, including multiple scripts and dialects. IndicTrans2 is designed for high-quality translation between English and Indian languages in both directions, as well as directly between Indian languages. It uses script unification methods and draws on parallel corpora collected at large scale for low-resource scenarios.
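
One way to picture script unification: the Unicode blocks for the major Brahmi-derived scripts (Devanagari, Bengali, Tamil, and others) are each 128 code points wide and laid out largely in parallel, so a character can often be mapped to its counterpart in another script by shifting its code point. The Python sketch below illustrates that general idea only. It is not IndicTrans2's actual preprocessing, the function name to_common_script is hypothetical, and the offset mapping is approximate because some scripts lack certain letters or add their own.

```python
# Illustrative sketch only: NOT the IndicTrans2 implementation.
# Start code points of the Unicode blocks for several Brahmi-derived scripts.
# Each block spans 128 code points and follows a largely parallel layout.
SCRIPT_BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def to_common_script(text, source_script, target_script="devanagari"):
    """Shift characters from the source script's block into the target's block.

    Hypothetical helper: characters outside the source block (spaces, digits,
    Latin letters, punctuation) are passed through unchanged.
    """
    src = SCRIPT_BLOCK_START[source_script]
    tgt = SCRIPT_BLOCK_START[target_script]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + tgt))  # same offset, different block
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Bengali letter KA (U+0995) maps to Devanagari letter KA (U+0915).
    print(to_common_script("\u0995", "bengali") == "\u0915")  # True
```

Mapping related languages into a single script lets a model share subword vocabulary across them, which is the usual motivation for a step like this in low-resource settings.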

These models have enabled robust translation, even in languages with minimal digitized data, and their open-source release has ensured wide accessibility and adoption. They are validated on new benchmarks such as IN22, specifically created for Indian language pairs—a significant step in closing the AI gap for languages beyond Hindi and English.

Speech Recognition, Synthesis, and Multimodal Learning

Acknowledging the importance of speech as a medium for digital inclusion, Khapra’s research has also driven advances in automatic speech recognition (ASR) and text-to-speech (TTS) for Indian languages. His team created and released large-scale datasets such as Kathbath, Shrutilipi, and Rasa, supporting the development of models like IndicWav2Vec and AI4BTTS. These tools are capable of transcribing, synthesizing, and understanding speech across all official Indian languages, ensuring accessibility for populations traditionally excluded from English-dominated internet services.

IndicVoices and Rasa deserve special mention as the first datasets to capture expressive, conversational, and emotional speech at national scale. This work supports applications from education and healthcare to governance and entertainment, and extends inclusiveness for users with varying dialects, accents, and linguistic backgrounds.

Data Collection Initiatives and Open Collaboration

Among Khapra’s most important contributions is his leadership in collecting, annotating, and releasing open data resources for Indian languages. AI4Bharat’s speech data collection project saw researchers fan out to nearly 500 of India’s more than 700 districts, gathering thousands of hours of voices covering all 22 official languages and a broad array of dialects and demographics. The ongoing goal is to collect 15,000 hours of transcribed speech and create a bilingual corpus with 2.2 million translation pairs, an effort comparable to large international projects in scope and ambition.

These efforts have often been undertaken in close partnership with government agencies, most notably through AI4Bharat’s role as the lead data management unit for India’s Bhashini mission, a national platform that aims to deliver multilingual AI-enabled services to every Indian citizen. Because these resources are released openly under Creative Commons licenses, developers, researchers, and even global tech companies have been able to build on AI4Bharat’s work for both commercial and societal benefit.

Evaluation Metrics, Fairness, and Bias in Indian AI

Khapra’s research extends to responsible AI—developing new evaluation metrics for natural language generation and exploring fairness and bias in multilingual AI systems. With his collaborators, he has worked on large-scale evaluation frameworks like IndicGLUE and on methods to systematically detect, analyze, and mitigate biases in datasets and models arising due to societal and linguistic factors in India. This work informs not only technical innovation but policy and design decisions towards more equitable AI systems.


AI4Bharat: Vision, Mission, and Major Initiatives

Genesis and Mission

AI4Bharat was co-founded by Mitesh Khapra in 2019 at IIT Madras with the vision of bringing technological parity in AI for all Indian languages relative to English. Conceived initially as a research collective, it has rapidly grown into India’s foremost non-profit, open-source initiative for language technology. AI4Bharat’s stated mission is to treat the digital and AI infrastructure for Indian languages as a public good, focusing on open data, open models, and open research—removing barriers posed by proprietary solutions prevalent in major Western AI ecosystems.

Strategic Focus Areas

AI4Bharat operates at the intersection of research, application, and policy, advancing core NLP areas critical to Indian society:

  • Transliteration: Creating models like IndicXlit to enable seamless script conversion—essential for interoperable content across India’s many language scripts.
  • Natural Language Understanding (NLU): Building datasets and models for semantic parsing, intent detection, and question answering in regional languages.
  • Machine Translation: Developing IndicTrans models for high-quality, bi-directional translation between English and all official Indian languages.
  • Automatic Speech Recognition (ASR): Training models such as IndicWav2Vec for accurate speech-to-text conversion, essential for voice-driven apps and accessibility.
  • Text to Speech (TTS): Producing expressive-sounding synthetic voices for regional languages with projects like AI4BTTS and the Rasa dataset.
  • Optical Character Recognition (OCR): Early-stage efforts for document layout parsing and OCR to digitize Indian scripts and archival content.

Data Collection and Diversity

AI4Bharat’s approach is marked by its insistence on diversity, depth, and scale. Supported by grants from philanthropies and government agencies (including the Ministry of Electronics and IT, and Nilekani Philanthropies), AI4Bharat has undertaken data collection at a national scale. It has assembled and trained a team of over 100 translators, established professional recording studios for high-fidelity TTS data, and used crowdsourced collection, as in the Kathbath dataset, to reach low-resource languages and rural areas.

This model of inclusive data gathering ensures representation across educational, socioeconomic, and regional groups—addressing not only linguistic but also social gaps plaguing digital transformation in India.

Open-Source Ecosystem Impact

AI4Bharat’s open-source releases—spanning datasets, pretrained models, and toolkits—form the language layer of India's AI stack. These resources are used by startups developing chatbots, educational tools, and healthcare solutions. Government platforms, like Bhashini and NPCI’s UPI voice payments, rely on AI4Bharat infrastructure to reach citizens in their mother tongues. The Supreme Court’s SUVAS system, for example, translates judgments into regional languages using AI4Bharat's technology, while agricultural advisory bots powered by AI4Bharat models help farmers across linguistic divides.

AI4Bharat’s models are hosted on India’s open AI repository, AIKosh, reaffirming the center’s commitment to public digital infrastructure and the democratization of cutting-edge technology.

Collaboration with the Bhashini Program

As the principal data management unit of India’s ambitious Bhashini program, AI4Bharat supplies 80% of the AI models and datasets that power official digital translation and speech services. Bhashini seeks to enable every Indian citizen to interact with digital government platforms in their native language. AI4Bharat’s datasets and models form the backbone of this mission, embodying Khapra’s belief that “inclusion begins with access—and for Bharat, that means language.”


Summary of Major Projects Led by Mitesh Khapra and AI4Bharat

The projects described above, most of which are open-source and available for public use and for building applications, exemplify Khapra’s strategy of building “foundational infrastructure.” IndicBERT and IndicTrans2, for instance, are not just research artifacts but have seen “production-grade” adoption in government mission-mode projects and in private-sector products. Bhashini’s reliance on AI4Bharat’s data and models illustrates the role Khapra’s leadership plays in national digital inclusion policy.

Rasa and Kathbath deserve special emphasis for their innovative approaches to expressive speech and scalable, rural-centric data collection—addressing both quality and inclusion. AI4Bharat’s transliteration and OCR projects, with IndicXlit and ongoing script digitization efforts, exemplify the sustained drive to tackle less-explored but essential components of India’s digital transformation.


Academic Publications and Community Leadership

Scholarly Output in Top-Tier Venues

Mitesh Khapra’s publication record includes over 80 papers in major international conferences and journals, including ACL, NeurIPS, EMNLP, AAAI, and TACL—establishing him as a thought leader on the global AI stage. His contributions range from the theoretical foundations of NLP (such as resource-efficient learning, transfer learning, and representation learning) to the practical challenges of deploying AI in multilingual, low-resource, and technologically constrained environments.

Some notable works include:

  • A Primer on Pretrained Multilingual Language Models: A comprehensive survey outlining challenges, innovations, and future opportunities for multilingual AI worldwide, with a focus on methods applicable to Indian contexts.
  • Multilingual Multimodal Language Processing using Neural Networks (NAACL-HLT 2016): Early work predicting the convergence of modalities and languages—a theme that is now central to large generative models and AI systems.
  • Indic LLM Suite: A recent series of papers detailing blueprints for creating pre-training and fine-tuning datasets, and benchmarking methodologies tailored for Indian language LLMs.
  • Large-Scale Datasets and Speech Translation Systems for Indian Languages (ACL 2025): Explores the creation of inclusive multilingual speech datasets such as IndicVoices, vital for underrepresented communities.

Editorial and Professional Service

Khapra has also served as an Area Chair, Senior Program Committee Member, or reviewer for several leading AI conferences (e.g., ICLR, AAAI, EMNLP) and plays an active role in nurturing global dialogue around low-resource and multilingual AI research. His professional engagements extend to advising government panels and working groups focused on digital language policy, machine translation, and responsible AI in India.

Mentorship and Teaching

At IIT Madras and through AI4Bharat, he has supervised multiple PhD students and postdoctoral scholars, many of whom have gone on to secure prestigious fellowships (such as the Google PhD Fellowship) and academic positions. His courses in deep learning, advanced machine learning, and NLP are widely recognized for their rigor and industry-linked relevance; many lecture videos and resources are disseminated openly, reinforcing his commitment to open science and inclusive education.

Initiatives for Skill Development

Khapra is also a co-founder of One Fourth Labs, which aims to offer affordable, high-quality AI education to a broad workforce in India, further amplifying the societal impact of his research and outreach efforts.


Recognitions and Awards

Mitesh Khapra’s list of recognitions underscores his dual impact as an academic leader and technological nation-builder.

  • TIME Magazine 100 Most Influential People in AI (2025): For his pioneering role in democratizing AI in India, particularly across its many languages.
  • IBM PhD Fellowship (2011): For early contributions to multilingual natural language computation.
  • Microsoft Rising Star Award (2011): Awarded for innovative doctoral work in NLP.
  • Google Faculty Research Award (2018): For impactful research in language technologies.
  • IIT Madras Young Faculty Recognition Award (2019): For excellence in research and teaching.
  • Prof. B. Yegnanarayana Award for Excellence in Research and Teaching (2020).
  • NASSCOM AI Game Changer Award (2021): Accepted on behalf of Team Samanantar for groundbreaking work in machine translation.
  • Srimathi Marti Annapurna Gurunath Award for Excellence in Teaching (2022).

In addition to individual honors, the teams Khapra leads have received national and sectoral accolades for their contribution to digital inclusion, language technology, and academic excellence.


Impact on Indian and Global AI Community

Reshaping Indian Research and Startup Ecosystem

Khapra’s efforts have helped shift the mindset in Indian academia from adapting imported solutions to developing homegrown innovations for local challenges. As he notes, previously “an average PhD student in India working on language technology would end up working on English problems,” but today, “with these datasets available, I see a shift: now Indian students are working on Indian problems.” This shift has downstream effects, fueling a surge in startups building voice- and language-centric products for India’s large market of non-English users.

Numerous Indian startups, particularly in fintech, agriculture, healthcare, and education, leverage models and datasets from AI4Bharat for their solutions. The open-source philosophy has also fostered vibrant developer and research communities, accelerating the time from research prototype to scalable, user-facing applications.

Policy and Societal Influence

By collaborating with the Indian government, especially on the Bhashini mission, Khapra and AI4Bharat have become architects of the technological backbone for future digital India initiatives. Their work has informed digital policy, data stewardship practices, and the national approach to language inclusivity in AI. The fact that Supreme Court judgments, banking transactions, and farming advisories are now accessible in multiple Indian languages is directly linked to these foundational efforts.

Championing Open Science and Sovereignty

While AI4Bharat’s open repository has enabled rapid adoption by both Indian and international technology firms, Khapra insists on the importance of India developing its own foundational models. He has often stated that “unless we learn that skill, we will always be in a perpetually dependent position,” emphasizing the importance of technological sovereignty for national development.

Global Academic and Industry Attention

Khapra’s inclusion in the TIME AI 100 list alongside global tech leaders like Elon Musk and Sam Altman is emblematic of the recognition India’s linguistic AI research now commands internationally. His approach is now seen as a model for other multilingual, resource-constrained regions seeking to roll out inclusive AI solutions.


Conclusion: The Road Ahead

Mitesh Khapra’s vision for Indian AI is expansive and long-term. He is quick to acknowledge that the journey is far from complete—a sentiment echoed by the continuous growth of AI4Bharat’s team, the addition of new domains like document parsing and OCR, and ongoing efforts to capture endangered and underrepresented languages. Philanthropic support, exemplified by multiple grants from Nandan Nilekani totaling INR 70 crore, underscores the national and international confidence invested in this project.

Khapra’s model of research-led open innovation, community collaboration, and government partnership has created a template for digital inclusion not just in India, but potentially for hundreds of languages and communities worldwide. As generative AI and LLM technologies continue to accelerate, his insistence on inclusive, ethical, and publicly accessible infrastructure positions India as a leader rather than a follower in the age of intelligent machines.

His story highlights that impactful AI is not only measured by commercial metrics, but also by the tangible empowerment it brings to people and the knowledge commons it builds for future generations. Through his academic legacy, open-source advocacy, and societal focus, Mitesh Khapra exemplifies the transformative power of research applied with purpose, vision, and national pride.

