UGC Approved Journal no 63975(19)

ISSN: 2349-5162 | ESTD Year : 2014

Published in:

Volume 12 Issue 4
April-2025
eISSN: 2349-5162

UGC and ISSN approved | UGC Approved Journal no. 63975 | Impact factor 7.95 (calculated by Google Scholar)

Unique Identifier

Published Paper ID: JETIR2504A94
Registration ID: 560578
Page Number: k759-k772


Title

Survey of Tokenization Mechanisms in Multilingual Large Language Models with a Focus on Indian Languages

Abstract

Tokenization is a crucial preprocessing step in Natural Language Processing (NLP) that significantly influences the performance of NLP systems and large language models (LLMs). While it is relatively straightforward for languages such as English, which are abundantly represented on the internet, it poses substantial challenges for morphologically rich and script-diverse languages such as those spoken in India. This survey provides an exhaustive overview of tokenization mechanisms, tracing their evolution from word-level and character-level tokenization to subword, byte-level, and hybrid approaches. We analyze the tokenization techniques employed in major multilingual LLMs such as mBERT, XLM-R, IndicBERT, BLOOM, and LLaMA, highlighting their impact on Indian languages. Specific linguistic challenges, including agglutination, lack of whitespace separation, multi-script diversity, and code-mixing, are discussed in depth. We also evaluate the performance of existing Indic tokenizers and review recent innovations, including character-aware and token-free models. We identify open challenges such as the lack of standardized benchmarks, the underrepresentation of dialects, and resource constraints. Finally, we outline future research directions focused on building inclusive, adaptive, and efficient tokenization systems for India’s multilingual ecosystem. This survey aims to serve as a comprehensive comparison and analysis of tokenization for Indian languages in the era of multilingual LLMs.
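As a small illustration of why the choice of tokenization granularity matters for Indic scripts, the following sketch compares character-level and byte-level sequence lengths for an English word and a Hindi word. It is not taken from the paper: the function names and the two sample words are assumptions chosen for illustration, and only the Python standard library is used.

```python
# Illustrative sketch (not from the paper): compare how many tokens a
# character-level and a byte-level scheme produce for the same word.

def char_tokens(text: str) -> list[str]:
    """Character-level tokenization: one token per Unicode code point."""
    return list(text)

def byte_tokens(text: str) -> list[int]:
    """Byte-level tokenization: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

# Hypothetical sample words chosen for this example.
english = "hello"
# "namaste" in Devanagari, written with escapes so the code points are explicit:
# NA, MA, SA, VIRAMA, TA, VOWEL SIGN E
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"

for word in (english, hindi):
    print(word, len(char_tokens(word)), len(byte_tokens(word)))
```

Each Devanagari code point occupies three bytes in UTF-8, so the six-code-point Hindi word becomes 18 byte-level tokens, while the five-letter English word stays at 5. This length inflation is one concrete form of the disadvantage that byte-level schemes can impose on Indic text.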

Key Words

Natural Language Processing, Tokenizers, Large Language Models, Indian Language Tokenizers

Cite This Article

"Survey of Tokenization Mechanisms in Multilingual Large Language Models with a Focus on Indian Languages", International Journal of Emerging Technologies and Innovative Research (www.jetir.org), ISSN: 2349-5162, Vol. 12, Issue 4, page no. k759-k772, April 2025. Available: http://www.jetir.org/papers/JETIR2504A94.pdf

Publication Details

Published Paper ID: JETIR2504A94
Registration ID: 560578
Published In: Volume 12 | Issue 4 | April 2025
DOI (Digital Object Identifier):
Page No: k759-k772
Country: Bengaluru, Karnataka, India
Area: Engineering
ISSN Number: 2349-5162
Publisher: IJ Publication
