Krutrim-2: A Best-in-Class Large Language Model for Indic Languages

Author: Chandra Khatri, Ashish Kulkarni, Rajkiran Panuganti, Sharath Adavanne, Kumar Ashish, Abhinav Ravi

Description

The plurality of Indian languages and cultures poses unique challenges in building general-purpose AI models for India. For large language models, these include multilingual understanding across this linguistic diversity and the ability to respond while honoring cultural nuances and maintaining a personable conversational tone. Toward that goal, we trained Krutrim-1, our first multilingual foundation model for India, in 2023 and released it to the public in January 2024. The model delivered promising performance on multiple Indic benchmarks. However, owing to its smaller size (7B parameters) and training on relatively fewer FLOPs, it left a lot to be desired for its users. Given the prevalence of synthetic web data and its limited capacity for alignment, it exhibited a tendency toward confirmation bias, leading to hallucinations such as claiming it was built by other AI labs.

Building upon our foundational work, we now present Krutrim-2, a best-in-class large language model for Indic. The model has been meticulously crafted to cater to various linguistic needs within India and beyond. Krutrim-2 is a 12-billion-parameter dense transformer model built on the Mistral-NeMo architecture. It received comprehensive training on a rich dataset encompassing English, Indic languages (hundreds of billions of tokens), code snippets, mathematical concepts, literary works, and high-quality synthetically generated content. It is natively multilingual (English and 22 Indian languages) and supports a context window of 128K tokens.

The model delivers best-in-class performance across Indic tasks and promising performance on English benchmarks, comparable to models 5-10x its size. We present details of the model architecture, pre-training, post-training, and evaluation results. We also publicly release the post-trained versions of the model. We are continuously improving the model through post-training techniques such as RLVR (reinforcement learning with verifiable rewards).

At Krutrim, we have consistently strived to create advanced models capable of understanding, processing, and generating content in multiple languages. With Krutrim-2, our journey takes another significant leap and paves the way forward toward our mission to build models for India.

Use Cases

Creative writing and more relevant responses in Indian languages

Krutrim-2 is natively multilingual, delivering state-of-the-art performance on Indic benchmarks. It also matches or outperforms models up to six times larger in multilingual tasks such as creative writing, summarization, and translation.

Long-form generation

The model supports a context window of 128K tokens, enhancing its ability to handle extensive inputs and maintain context over longer interactions. This makes it well suited for long-form generation, multi-turn conversations, document translation, coding, and other long-context tasks.
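
For inputs that exceed even a 128K-token window, a common pattern is to split the document into overlapping chunks. The sketch below is a generic, illustrative chunker: it counts whitespace-separated words as a stand-in for real tokenization, and the `max_tokens`/`overlap` values are arbitrary, not Krutrim-2 settings.

```python
def chunk_text(words, max_tokens, overlap):
    """Split a word list into overlapping chunks of at most max_tokens words."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    chunks, start = [], 0
    while start < len(words):
        chunks.append(words[start:start + max_tokens])
        if start + max_tokens >= len(words):
            break
        start += max_tokens - overlap  # slide forward, keeping some context
    return chunks

# Toy document of 300 "words"; a real pipeline would use the model tokenizer.
doc = ("token " * 300).split()
chunks = chunk_text(doc, max_tokens=128, overlap=16)
```

The overlap preserves some shared context between consecutive chunks, which helps when summaries or translations of each chunk are later stitched together.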

Multimodal applications

Its improved multilingual understanding and generation capabilities in Indian languages make it the model of choice as a backbone for large multimodal models targeting visual understanding, captioning, and speech applications in the Indian context.

Cost efficient AI applications

With best-in-class performance on Indic tasks, and performance better than or competitive with much larger models on tasks like coding and instruction following, the model offers a significant cost advantage when integrated into AI applications for India. Further, its enhanced multilingual understanding and generation capabilities can be distilled into much smaller models.

Demo

Engages in multi-turn conversations in Indic languages

ex_1.png
ex_2.png

Solves math problems in Indic

ex_3.png

Explains code in your language

ex_4.png

Indian context understanding

ex_5.png

Supports low-resource languages like Sanskrit

ex_6.png

Use it in agentic applications

ex_7.png

Model Architecture and Training

Krutrim-2 is a 12B-parameter dense transformer model based on the Mistral-NeMo architecture. The model is pre-trained on high-quality data comprising a curated mix of English, Indic, code, math, books, and synthetic data. It is natively multilingual (English and 22 Indian languages) and supports a context window of 128K tokens. We followed a multi-stage training procedure, varying the data mix, context size, and batch size at every stage, leading to stable and efficient model training.

After pre-training, the model underwent supervised fine-tuning for cross-task instruction following and direct preference optimization (DPO) for alignment.

| # | Hyperparameter | Value |
|---|----------------|-------|
| 1 | Layers | 40 |
| 2 | Max sequence length | 128K |
| 3 | Vocab size | 131K |
| 4 | Attention type | GQA (Grouped Query Attention) |
| 5 | Positional embeddings | RoPE |
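
To illustrate the positional scheme listed above, a minimal rotary position embedding (RoPE) rotates consecutive dimension pairs of a query/key vector by position-dependent angles. The vector size and base frequency below are illustrative only, not Krutrim-2's actual head dimension or RoPE base.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary position embeddings to an even-length vector."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE rotates dimension pairs"
    out = []
    for i in range(0, dim, 2):
        theta = pos / (base ** (i / dim))  # lower frequency for higher dims
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation of the pair
    return out

q = [0.1, -0.4, 0.7, 0.2]
q_rot = rope(q, pos=5)
```

Because each pair is only rotated, the vector's norm is unchanged, and the relative angle between two rotated vectors depends only on the difference of their positions, which is what makes RoPE work well with long contexts.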

Evaluation

EN Benchmarks

We use the LM Evaluation Harness to evaluate our model on the English benchmark tasks. Please note that at the time of writing this report, we were unable to use the evaluation framework for Llama-3.3-70B, Gemini-1.5 Flash, and GPT-4o. We currently report the available published numbers for these models. We realise that the prompt templates and few-shot settings might vary and are working to make these evaluations consistent.

| # | Benchmark | Task | Metric | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | Llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
|---|-----------|------|--------|--------------|-----------------|---------------|---------------|------------------|--------|
| 1 | Hellaswag (0-shot) | Common sense reasoning | Accuracy | 0.74 | 0.82 | 0.83 (0-shot) | 0.95 | 0.87 (10-shot) | 0.95 (10-shot) |
| 2 | CommonSenseQA (0-shot) | Common sense reasoning | Accuracy | 0.74 | 0.70 | 0.74 (0-shot) | - | - | 0.85 |
| 3 | TruthfulQA (0-shot) | Factuality | Accuracy | 0.49 | 0.54 | 0.59 (0-shot) | - | - | 0.59 |
| 4 | MMLU (5-shot) | Language understanding | Accuracy | 0.47 | 0.68 | 0.63 (5-shot) | 0.82 | 0.79 | 0.86 |
| 5 | TriviaQA (5-shot) | Reading comprehension | EM | 0.44 | 0.72 | 0.62 (5-shot) | - | - | - |
| 6 | GSM8K (0-shot) | Math | EM | 0.07 | 0.74 | 0.71 (0-shot) | 0.93 (8-shot, CoT) | 0.86 (11-shot) | 0.89 |
| 7 | ARC_Challenge (0-shot) | Knowledge reasoning | Accuracy | 0.48 | 0.59 | 0.60 (0-shot) | 0.93 (25-shot) | - | 0.50 |
| 8 | HumanEval | Coding | Pass@10 | 0.00 | 0.23 | 0.80 (Pass@10) | 0.88 | 0.74 (0-shot) | 0.90 |
| 9 | IF_Eval (0-shot) | Instruction following | Accuracy | 0.27 | 0.46 | 0.73 (0-shot) | 0.92 | - | 0.84 |
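
The HumanEval row above reports Pass@10. The standard unbiased estimator from the HumanEval paper computes pass@k from n generated samples of which c pass the unit tests; the sketch below is a generic implementation of that estimator, not necessarily the exact evaluation script used for these numbers.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations is correct,
    given that c of the n generations pass."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-draw must contain
        # a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 samples per problem, 4 of which pass the tests
p = pass_at_k(n=20, c=4, k=10)
```

Averaging this quantity over all problems gives the reported Pass@k score; it is less biased than simply drawing one batch of k samples.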

BharatBench

The existing Indic benchmarks are not natively in Indian languages; rather, they are translations of existing English benchmarks. They do not sufficiently capture the linguistic nuances of Indian languages and aspects of Indian culture. To address this, Krutrim released BharatBench, a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that the evaluations are relevant and representative of real-world use cases in India. For more information, please refer to the BharatBench blog.

| # | Benchmark | Task | Metric | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | Llama-3.1-70B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
|---|-----------|------|--------|--------------|-----------------|---------------|------------------------|----------------------|--------|
| 1 | Indian Cultural Context (0-shot) | Generation | BERTScore | 0.86 | 0.56 | 0.88 | 0.88 | 0.87 | 0.89 |
| 2 | Grammar Correction (5-shot) | Language understanding | BERTScore | 0.96 | 0.94 | 0.98 | 0.98 | 0.96 | 0.97 |
| 3 | Multi Turn (0-shot) | Generation | BERTScore | 0.88 | 0.87 | 0.91 | 0.90 | 0.89 | 0.92 |
| 4 | Multi Turn Comprehension (0-shot) | - | BERTScore | 0.90 | 0.89 | 0.92 | 0.93 | 0.91 | 0.94 |
| 5 | Multi Turn Translation (0-shot) | - | BERTScore | 0.85 | 0.87 | 0.92 | 0.91 | 0.91 | 0.92 |
| 6 | Text Classification (5-shot) | - | Accuracy | 0.61 | 0.71 | 0.76 | 0.88 | 0.86 | 0.89 |
| 7 | Named Entity Recognition (5-shot) | - | Accuracy | 0.31 | 0.51 | 0.53 | 0.61 | 0.65 | 0.65 |

Indic Benchmarks

We also report the performance of our model on the existing Indic benchmarks - IndicXTREME, IndicGenBench, and IN-22 for translation. The numbers are averaged across 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

| # | Benchmark | Task | Metric | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | Llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
|---|-----------|------|--------|--------------|-----------------|---------------|---------------|------------------|--------|
| 1 | IndicSentiment (0-shot) | Text classification | Accuracy | 0.65 | 0.70 | 0.95 | 0.96 | 0.99 | 0.98 |
| 2 | IndicXParaphrase (0-shot) | Generation | Accuracy | 0.67 | 0.74 | 0.88 | 0.87 | 0.89 | 0.91 |
| 3 | IndicQA (0-shot) | Generation | BERTScore | 0.90 | 0.90 | 0.91 | 0.89 | 0.94 | TBD |
| 4 | FloresIN Translation xx-en (1-shot) | Translation | chrF++ | 0.54 | 0.50 | 0.58 | 0.60 | 0.62 | 0.63 |
| 5 | FloresIN Translation en-xx (1-shot) | Translation | chrF++ | 0.41 | 0.34 | 0.48 | 0.46 | 0.47 | 0.48 |
| 6 | IN22 Translation xx-en (0-shot) | Translation | chrF++ | 0.50 | 0.48 | 0.57 | 0.58 | 0.55 | 0.54 |
| 7 | IN22 Translation en-xx (0-shot) | Translation | chrF++ | 0.36 | 0.33 | 0.45 | 0.42 | 0.44 | 0.43 |
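
The translation rows above use chrF++, for which sacreBLEU is the standard implementation. The character n-gram F-score at its core can be sketched in plain Python; this is a deliberately simplified version (a single n-gram order, no word n-grams, and β=2 as in chrF), so it will not reproduce official scores.

```python
from collections import Counter

def char_ngram_fscore(hypothesis, reference, n=3, beta=2.0):
    """Simplified chrF: F-score over character n-grams of a single order."""
    def ngrams(text):
        text = text.replace(" ", "")  # chrF ignores whitespace
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    if not hyp or not ref:
        return 0.0
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta  # beta=2 weights recall higher than precision
    return (1 + b2) * precision * recall / (b2 * precision + recall)

score = char_ngram_fscore("नमस्ते दुनिया", "नमस्ते दुनिया")
```

Character-level matching is what makes chrF++ comparatively robust for morphologically rich Indic languages, where word-level metrics penalize valid inflectional variants.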

Qualitative Evaluation

In addition to the quantitative evaluations on the academic benchmarks above, we also conducted manual evaluation of prompt-response pairs across languages and task categories. Scores range from 1 to 5 (higher is better). Model names were anonymised during the evaluation. We currently limit the evaluation to 8 languages and plan to extend it.

cumulative_score_category.png

cumulative_score_language.png

How to Access the Model?

Chat Application

Users can directly access the model on our chat application here: chat.olakrutrim.com/home

API Integration

Developers can integrate the model into their applications via the model API available on Krutrim cloud.
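
As a sketch of what such an integration might look like, the snippet below builds an OpenAI-style chat completion request. The endpoint URL and model identifier are placeholders, not Krutrim Cloud's documented values; consult the Krutrim cloud API documentation for the actual endpoint, model name, and authentication scheme.

```python
import json
import urllib.request

# Placeholder endpoint and model id -- replace with the values from the
# Krutrim cloud API documentation.
API_URL = "https://cloud.example.com/v1/chat/completions"
MODEL_ID = "Krutrim-2"

def build_request(prompt, api_key, max_tokens=512):
    """Build (but do not send) a chat completion HTTP request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = build_request("ताजमहल कहाँ है?", api_key="YOUR_KEY")
# urllib.request.urlopen(req) would send the request and return the response
```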

Run locally

Please visit the Krutrim-2 repository or Krutrim-2 HF page for details on running the model locally.
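Please visit the Krutrim-2 repository or Krutrim-2 HF page for details on running the model locally.

A minimal local-inference sketch using Hugging Face Transformers is shown below. The repository id and the `[INST]` chat template are assumptions based on the model's Mistral-NeMo lineage; verify both against the Krutrim-2 HF page before use. The weight download is large, so it is kept inside a function that is not called at import time.

```python
def format_chat(user_message):
    # Mistral-style instruction template (an assumption -- prefer the
    # tokenizer's own apply_chat_template from the HF page).
    return f"[INST] {user_message} [/INST]"

def run_demo(prompt, model_id="krutrim-ai-labs/Krutrim-2-instruct"):
    """Download the weights (many GB) and generate a reply. model_id is assumed."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(format_chat(prompt), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# run_demo("भारत की राजधानी क्या है?")  # uncomment to actually generate
```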

License

Krutrim Community License