Survey of Tokenization Mechanisms in Multilingual Large Language Models with a Focus on Indian Languages

Deepthi Chintha; Nanda Vikas Konduru

Volume 12 Issue 4
April-2025
eISSN: 2349-5162

Impact factor: 7.95 (calculated by Google Scholar)

Published Paper ID: JETIR2504A94

Registration ID: 560578

Tokenization is a crucial preprocessing step in Natural Language Processing (NLP) that significantly influences the performance of NLP systems and large language models (LLMs). While it is relatively straightforward for languages such as English, which are abundantly represented on the internet, it poses substantial challenges for morphologically rich and script-diverse languages, such as those spoken in India. This survey provides an extensive overview of tokenization mechanisms, tracing their evolution from word-level and character-level tokenization to subword, byte-level, and hybrid approaches. We analyze the tokenization techniques employed in major multilingual LLMs such as mBERT, XLM-R, IndicBERT, BLOOM, and LLaMA, highlighting their impact on Indian languages. Specific linguistic challenges such as agglutination, lack of whitespace separation, multi-script diversity, and code-mixing are discussed in depth. We also evaluate the performance of existing Indic tokenizers and review recent innovations, including character-aware and token-free models. We identify open challenges, including the lack of standardized benchmarks, the underrepresentation of dialects, and resource constraints. Finally, we outline future research directions focused on building inclusive, adaptive, and efficient tokenization systems for India’s multilingual ecosystem. This survey aims to serve as a comprehensive comparison and analysis of tokenization for Indian languages in the era of multilingual LLMs.
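To make the granularities named in the abstract concrete, the following is a minimal sketch comparing naive word-level, character-level, and byte-level segmentation of a short Hindi phrase. The sample phrase and all variable names are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the subword tokenizers used by the models the survey covers.

```python
# Illustrative only: three naive segmentation granularities on a Hindi phrase.
# The phrase and names below are assumptions chosen for this sketch.
text = "भारत की भाषाएँ"

# Word level: split on whitespace.
word_tokens = text.split()

# Character level: one token per Unicode code point, ignoring spaces.
# Note that vowel signs (matras) become separate tokens here.
char_tokens = [ch for ch in text if not ch.isspace()]

# Byte level: one token per UTF-8 byte (spaces included).
byte_tokens = list(text.encode("utf-8"))

print("words:", len(word_tokens))
print("characters:", len(char_tokens))
print("bytes:", len(byte_tokens))
```

Each Devanagari code point occupies three bytes in UTF-8, so the byte-level sequence is roughly three times as long as the character-level one. This is one reason byte-level schemes can yield longer sequences for Indic scripts than for English text.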

Natural Language Processing, Tokenizers, Large Language Models, Indian Language Tokenizers

"Survey of Tokenization Mechanisms in Multilingual Large Language Models with a Focus on Indian Languages", International Journal of Emerging Technologies and Innovative Research (www.jetir.org), ISSN: 2349-5162, Vol. 12, Issue 4, pp. k759-k772, April 2025. Available: http://www.jetir.org/papers/JETIR2504A94.pdf

Published In: Volume 12 | Issue 4 | Year April-2025

Page No: k759-k772

Country: Bengaluru, Karnataka, India

Area: Engineering

ISSN Number: 2349-5162

Publisher: IJ Publication
