Introducing BharatBench: Making AI Understand India

Ola Krutrim

04 Feb 2025 — 8 min read

Have you ever wondered if the AI you use every day truly understands India and its diverse culture? Most AI models are trained primarily on English, Chinese, or European languages, which means they might miss the nuances of Indian languages and cultural contexts. That's where BharatBench comes in.

What is BharatBench?

BharatBench is a Comprehensive, Multimodal, Multilingual, Multi-task Indic benchmark. It is a special evaluation tool designed to test how well AI models understand and perform in Indian languages and cultural contexts. Think of it as a report card for AI, but specifically for its ability to work with the diverse needs of India. It helps us see which AI models are truly ready for the Indian market and which ones need improvement.

Why is BharatBench Important?

• Addresses the gap: Many current AI models don’t work well with Indian languages. They often struggle with understanding the different ways people speak and the cultural context. BharatBench helps highlight these issues.

• Lack of a Comprehensive, Multimodal, Multilingual, Multi-task Indic benchmark: Existing benchmarks in English and Chinese, such as MMLU, ARC, TruthfulQA and others, do not adequately capture the unique linguistic and cultural nuances of India. There is a specific need for a benchmark that addresses the diversity of India.

• Ensures inclusivity: With over a billion people speaking various Indian languages, it's important that AI is inclusive. BharatBench ensures that AI models are evaluated on their performance with these diverse languages and cultures.

• Improves AI quality: By using BharatBench, researchers and developers can create better AI models that are specifically optimized for the Indian market. This means more accurate and relevant AI for everyone in India.

• Promotes fair evaluation: BharatBench was created because current AI evaluation methods don't always work well for Indian languages. It's important to have specific tools that can test AI fairly in these contexts.

What Does BharatBench Evaluate?

BharatBench tests AI models across several key areas:

• Language Understanding: It checks how well AI models can understand and generate text in Indian languages. Tasks include comprehension, translation, and grammar correction. It also checks whether models understand Indian cultural contexts, such as the importance of festivals and traditions.

• Distributional Representations: It assesses embedding models on sentence retrieval, evaluating their ability to handle Indian languages, contexts, and large documents.

• Visual Understanding: It assesses how well AI understands images related to India. This includes things like recognizing Indian festivals, art, historical sites, and more. It also covers reading text in images, such as old books and handwritten notes.

• Speech Understanding: BharatBench evaluates how well AI can understand spoken Indian languages. This includes transcribing speech to text and translating it accurately.
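To make the sentence-retrieval evaluation mentioned under Distributional Representations more concrete, here is a minimal sketch of one common way to score it: top-1 retrieval accuracy over cosine similarity. This is an assumption for illustration; the post does not say which retrieval metric BharatBench uses. The function names and the tiny 3-dimensional vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_retrieval_accuracy(query_vecs, candidate_vecs):
    """Fraction of queries whose most similar candidate is the correct one.

    Assumes query i's correct match is candidate i (for example, a sentence
    and its translation in a parallel corpus).
    """
    hits = 0
    for i, query in enumerate(query_vecs):
        scores = [cosine(query, cand) for cand in candidate_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += best == i
    return hits / len(query_vecs)


# Toy vectors standing in for embeddings from a real model (hypothetical).
queries = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.0]]
candidates = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 1.0]]

accuracy = top1_retrieval_accuracy(queries, candidates)
print(f"top-1 retrieval accuracy: {accuracy:.2f}")
```

In this toy setup the third query is closer to the second candidate than to its own, so two of the three queries are retrieved correctly.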

How Does BharatBench Work?

BharatBench uses a collection of real-world examples and datasets specifically designed to represent the linguistic and cultural diversity of India. This ensures that AI models are evaluated on their performance with real-world use cases. The models are tested across multiple Indic languages, including Hindi, Marathi, Tamil, Telugu, Kannada, Gujarati, Bengali, Malayalam, and others. The framework also evaluates how well AI models perform on specific language tasks, such as:

• Indian Cultural Context (ICC): This assesses AI’s understanding of Indian customs, traditions, and social practices.

• Multi-turn Comprehension: This evaluates the model’s ability to understand and interpret passages of text from varied domains.

• Multi-turn Translation: This evaluates the model’s ability to translate accurately between languages, including the handling of idioms and colloquial phrases.

• Text Classification: This tests how well AI can classify text into different categories, like sentiment, topic, intent, and language.

• Grammar Correction: It checks the AI's ability to identify and correct grammatical errors in Indic languages.
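As a rough illustration of how tasks like these can be scored per language, the sketch below loops over examples and reports exact-match accuracy for each (task, language) pair. It is not BharatBench's actual harness: the `evaluate` function, the example records, and the stub model are all hypothetical. Exact match suits classification-style tasks; translation and comprehension would need other metrics.

```python
from collections import defaultdict


def evaluate(model_fn, examples):
    """Return exact-match accuracy for every (task, language) pair.

    model_fn: any callable mapping a prompt string to an output string.
    examples: dicts with 'task', 'language', 'prompt' and 'reference' keys.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        key = (ex["task"], ex["language"])
        prediction = model_fn(ex["prompt"]).strip().lower()
        total[key] += 1
        correct[key] += prediction == ex["reference"].strip().lower()
    return {key: correct[key] / total[key] for key in total}


# Hypothetical sentiment examples, for illustration only.
examples = [
    {"task": "text_classification", "language": "Hindi",
     "prompt": "यह फिल्म बहुत अच्छी थी", "reference": "positive"},
    {"task": "text_classification", "language": "Hindi",
     "prompt": "सेवा बहुत खराब थी", "reference": "negative"},
    {"task": "text_classification", "language": "Tamil",
     "prompt": "இந்த உணவு அருமை", "reference": "positive"},
]


def stub_model(prompt):
    """Stand-in for a real model call; always answers 'Positive'."""
    return "Positive"


scores = evaluate(stub_model, examples)
for (task, language), accuracy in sorted(scores.items()):
    print(f"{task:20s} {language:8s} {accuracy:.2f}")
```

Keeping the model behind a plain callable makes it easy to swap in an API client or a local model without touching the scoring code.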

Key Findings

• BharatBench uses real-world scenarios and data to assess the AI landscape in India.

• AI models that are trained specifically for Indian languages tend to perform much better than generic models.

• Some AI models are better at certain tasks than others. For example, some may be better at grammar correction while others might excel at translation.

• The Krutrim LLM series, trained specifically for Indic languages, has shown very promising results across various tasks.

• Vyakyarth, a new embedding model, has shown high performance in understanding the nuances of multiple Indic languages.

• For visual understanding, Chitrarth outperforms other models on image understanding, suggesting a better grasp of Indian cultural contexts.

• Shabdarth, a custom OCR model, has shown good performance in reading Indic content, including old books with complex scripts.

• Custom speech models perform better than generic models, indicating a need for models specifically designed for Indic languages.

• Need for Better Evaluation: Existing evaluation methods are often biased towards high-resource languages. There is a need for custom multilingual evaluators, which BharatBench addresses.

Results

Language	MuRIL	IndicBERT	Jina-Embeddings-V3	Vyakyarth
Bengali	77.0	91.0	97.4	98.7
Gujarati	67.0	92.4	97.3	98.7
Hindi	84.2	90.5	98.8	99.9
Kannada	88.4	89.1	96.8	99.2
Malayalam	82.2	89.2	96.3	98.7
Marathi	83.9	92.5	97.1	98.8
Sanskrit	36.4	30.4	84.1	90.1
Tamil	79.4	90.0	95.8	97.9
Telugu	43.5	88.6	97.3	97.5

Table: Performance of Indic Embedding models across different languages

Speech-to-text transcription, measured by Word Error Rate (WER); lower is better.

Speech-to-text translation, measured by BLEU score; higher is better.
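For readers unfamiliar with BLEU, the sketch below shows the core of the metric: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a simplified, single-reference version with whitespace tokenization and no smoothing, written for illustration. The post does not say which BLEU implementation or tokenizer was used, and the example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0 to 100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each n-gram's count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # No smoothing: any empty n-gram order gives zero.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


perfect = corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
partial = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(f"identical: {perfect:.1f}, one word off: {partial:.1f}")
```

Because tokenization strongly affects BLEU, especially for Indic scripts, published comparisons normally rely on a standard implementation with a documented tokenizer rather than a hand-rolled one like this.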

Performance of baseline models on BharatBench-V Evaluation framework

Performance of VLMs on BharatBench translated multi-modal academic datasets

Figure: Character Error Rate (CER) at the top and Word Error Rate (WER) below; lower is better. The Shabdarth model outperforms GPT-4o and GCP on the BharatBench OCR dataset.
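Both error rates are edit distances normalized by the reference length: WER counts word-level substitutions, insertions, and deletions, while CER does the same at character level. The sketch below is a minimal pure-Python version for illustration. It treats each Unicode code point as a character, which is an assumption; for Indic scripts an evaluation may instead count grapheme clusters, and the post does not say which BharatBench uses. The Hindi sentences are made-up examples.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate over Unicode code points."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "मैं घर जा रहा हूँ"
hypothesis = "मैं घर जा रहा हूं"  # last word uses anusvara instead of candrabindu

print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

A single wrong diacritic costs a whole word under WER but only one character under CER, which is why OCR results are often reported with both.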

Model	Bengali	English	Gujarati	Hindi	Kannada	Malayalam	Marathi	Tamil	Telugu
Meta-Llama-3.1-70B-Instruct-Turbo	0.55	0.72	0.82	0.79	0.39	0.49	0.77	0.52	0.46
Meta-Llama-3.1-8B-Instruct-Turbo	0.47	0.69	0.69	0.74	0.35	0.35	0.74	0.46	0.49
Mistral-Nemo-Instruct-2407	0.51	0.67	0.54	0.71	0.31	0.4	0.64	0.49	0.49
gemma-2-27b-it	0.67	0.79	0.75	0.8	0.45	0.49	0.79	0.58	0.55
gemma-2-9b-it	0.56	0.73	0.72	0.76	0.47	0.46	0.73	0.54	0.49
gpt-4o	0.59	0.77	0.79	0.8	0.41	0.52	0.76	0.62	0.56
gpt-4o-mini	0.53	0.76	0.6	0.81	0.4	0.43	0.77	0.55	0.54
Krutrim-1	0.43	0.51	0.21	0.58	0.2	0.29	0.12	0.43	0.24
Krutrim-2	0.51	0.63	0.71	0.67	0.32	0.46	0.73	0.45	0.41

Table: Performance of different models on the BharatBench NER task. We report 5-shot accuracy.
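In a 5-shot setting, the model sees five worked examples in the prompt before the item it has to answer. The sketch below shows one way such a prompt could be assembled. The template, the `build_few_shot_prompt` helper, and the demonstration sentences are hypothetical; the post does not publish BharatBench's actual prompt format.

```python
def build_few_shot_prompt(instruction, demonstrations, query, k=5):
    """Assemble a k-shot prompt: instruction, k solved examples, then the query."""
    if len(demonstrations) < k:
        raise ValueError(f"need at least {k} demonstrations, got {len(demonstrations)}")
    blocks = [instruction]
    for text, answer in demonstrations[:k]:
        blocks.append(f"Sentence: {text}\nEntities: {answer}")
    # The final block leaves the answer empty for the model to complete.
    blocks.append(f"Sentence: {query}\nEntities:")
    return "\n\n".join(blocks)


# Made-up demonstrations, for illustration only.
demonstrations = [
    ("Sachin was born in Mumbai.", "Sachin=PER; Mumbai=LOC"),
    ("ISRO launched the mission from Sriharikota.", "ISRO=ORG; Sriharikota=LOC"),
    ("Lata sang in Kolkata.", "Lata=PER; Kolkata=LOC"),
    ("Infosys opened an office in Pune.", "Infosys=ORG; Pune=LOC"),
    ("Kalam studied in Chennai.", "Kalam=PER; Chennai=LOC"),
]

prompt = build_few_shot_prompt(
    "Tag the named entities in each sentence.",
    demonstrations,
    "Tagore founded a school in Santiniketan.",
)
print(prompt)
```

A 0-shot evaluation would use the same template with no demonstrations, so the model has only the instruction to go on.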

Model	Bengali	English	Gujarati	Hindi	Kannada	Malayalam	Marathi	Tamil	Telugu
Llama-3.2-3B-Instruct-Turbo	0.77	0.8	0.33	0.57	0.47	0.48	0.63	0.32	0.58
Meta-Llama-3.1-70B-Instruct-Turbo	0.87	0.92	0.5	0.93	0.8	0.77	0.88	0.75	0.9
Meta-Llama-3.1-8B-Instruct-Turbo	0.8	0.83	0.33	0.77	0.58	0.52	0.73	0.6	0.75
Mistral-Nemo-Instruct-2407	0.83	0.83	0.45	0.88	0.65	0.62	0.73	0.7	0.78
gemma-2-27b-it	0.83	0.92	0.67	0.95	0.72	0.75	0.83	0.73	0.9
gemma-2-9b-it	0.78	0.87	0.6	0.9	0.7	0.68	0.82	0.72	0.8
gpt-4o	0.83	0.93	0.68	0.97	0.85	0.7	0.85	0.8	0.9
gpt-4o-mini	0.8	0.93	0.65	0.93	0.75	0.72	0.88	0.78	0.88
Krutrim-1	0.75	0.77	0.53	0.72	0.65	0.6	0.6	0.62	0.73
Krutrim-2	0.82	0.92	0.5	0.83	0.68	0.65	0.72	0.68	0.78

Table: Performance of different models on the BharatBench text classification task. We report 0-shot accuracy.
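When comparing models from a per-language table like this one, it can help to look at a macro-average across languages and at each model's weakest language. The sketch below does that for two rows copied from the table above. The helper names are hypothetical, and this kind of aggregation is a reader's convenience, not a score reported by BharatBench.

```python
# Column order and two rows copied from the text classification table above.
LANGUAGES = ["Bengali", "English", "Gujarati", "Hindi", "Kannada",
             "Malayalam", "Marathi", "Tamil", "Telugu"]
ROWS = {
    "gpt-4o": [0.83, 0.93, 0.68, 0.97, 0.85, 0.7, 0.85, 0.8, 0.9],
    "Krutrim-2": [0.82, 0.92, 0.5, 0.83, 0.68, 0.65, 0.72, 0.68, 0.78],
}


def macro_average(scores, exclude=()):
    """Unweighted mean over languages, optionally leaving some out."""
    kept = [s for lang, s in zip(LANGUAGES, scores) if lang not in exclude]
    return sum(kept) / len(kept)


def weakest_language(scores):
    """Language with the lowest score in a row."""
    return min(zip(scores, LANGUAGES))[1]


for model, scores in ROWS.items():
    print(
        f"{model:10s} all={macro_average(scores):.3f} "
        f"indic-only={macro_average(scores, exclude=('English',)):.3f} "
        f"weakest={weakest_language(scores)}"
    )
```

Excluding English gives a view of Indic-language performance alone, which is the comparison the benchmark is built for.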

Model	Bengali	English	Gujarati	Hindi	Kannada	Malayalam	Marathi	Tamil	Telugu
Llama-3.2-3B-Instruct-Turbo	0.82	0.89	0.87	0.88	0.85	0.85	0.83	0.86	0.86
Meta-Llama-3.1-70B-Instruct-Turbo	0.84	0.9	0.89	0.89	0.86	0.88	0.87	0.89	0.88
Meta-Llama-3.1-8B-Instruct-Turbo	0.83	0.89	0.87	0.88	0.85	0.87	0.85	0.88	0.87
Mistral-Nemo-Instruct-2407	0.8	0.89	0.84	0.85	0.83	0.83	0.82	0.84	0.85
gemma-2-27b-it	0.83	0.89	0.88	0.88	0.86	0.87	0.85	0.88	0.88
gemma-2-9b-it	0.83	0.9	0.88	0.88	0.85	0.86	0.85	0.88	0.88
gpt-4o	0.85	0.9	0.89	0.9	0.87	0.88	0.88	0.91	0.89
gpt-4o-mini	0.85	0.91	0.89	0.89	0.86	0.88	0.87	0.9	0.89
Krutrim-1	0.85	0.9	0.9	0.89	0.87	0.89	0.87	0.9	0.89
Krutrim-2	0.85	0.86	0.9	0.89	0.87	0.88	0.88	0.9	0.89

Table: Performance of different models on the BharatBench ICC task. We report 0-shot BERTScore.
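BERTScore compares a generated answer with a reference by matching token embeddings rather than exact words: each token is greedily paired with its most similar token on the other side, giving a precision, a recall, and their F1. The sketch below shows only that matching step, using tiny hand-written vectors in place of contextual embeddings from a BERT-style model. It omits the optional IDF weighting and baseline rescaling of the full metric, and the post does not say which embedding model or settings BharatBench uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy token matching.

    Precision: each candidate token's best similarity to any reference token.
    Recall: each reference token's best similarity to any candidate token.
    """
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy 2-d "token embeddings" (hypothetical): the candidate covers only
# one of the two reference tokens.
candidate = [[1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]

p, r, f1 = greedy_match_scores(candidate, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because similarity between contextual embeddings is rarely low in practice, raw BERTScore values tend to cluster in a narrow high range, so small differences between models should be read with care.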

Why This Matters

◦ Better AI for India: BharatBench is a step towards creating AI that better understands and serves the diverse population of India.

◦ More Inclusive AI: The framework promotes inclusivity in the AI ecosystem by focusing on underrepresented languages and cultures.

◦ Informed Decisions: It helps businesses and researchers choose the best AI models for Indian use cases.

The Future of BharatBench

BharatBench is in its nascent stage and will continue to expand its evaluation capabilities, including adding more languages, prompts, and tasks. It is an important step toward building AI that is inclusive and relevant to India. By continuing to refine and use BharatBench, we can ensure that AI works for everyone.

Conclusion

BharatBench is a critical tool for evaluating and improving AI models for Indian languages and culture. It’s a step toward building more inclusive and effective AI that can serve the diverse needs of India.

Please check out our technical report for a more detailed discussion.

Authors: Shubham Agarwal, Abhinav Ravi, Rajkiran Panuganti, Sharath Adavanne, Ashish Kulkarni, Hareesh Kumar, Chandra Khatri