
Tokenization is the core process through which a model ingests data. Models are not databases that store data as strings, characters, integers, floats, and so on. A model is mathematically designed to handle numbers only; therefore, letters, words, and any other Unicode text must be converted into a numerical representation before processing.
One major reason a small mistake in a word or sentence could produce a very different answer in earlier models is that the mistake changes the token sequence the model receives. With improvements in retrieval-augmented generation (RAG), context handling, and correction mechanisms, this early issue has largely been addressed in many production LLMs.
Raw text is generally represented as Unicode strings.
Example:
string = "Hello, 🌍! Chinese!"
A tokenizer turns a string like this into a list of integer indices, for example:
indices = [15496, 11, 995, 0]
A language model places a probability distribution over a sequence of tokens. This is why a standard and predictable procedure is needed to encode strings into tokens. A predictable procedure is also needed to decode the sequence of tokens back into string output.
In summary, a tokenizer is simply a class that implements the encode and decode methods for inputs and outputs in an AI system.
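The encode/decode contract described above can be sketched as a small interface. This is a minimal illustration, not any product's tokenizer: the class names, the hand-written four-piece vocabulary, and the greedy longest-match rule are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Interface: text -> token ids, and token ids -> text."""

    @abstractmethod
    def encode(self, text: str) -> list[int]: ...

    @abstractmethod
    def decode(self, ids: list[int]) -> str: ...


class ToyVocabTokenizer(Tokenizer):
    """Greedy longest-match over a tiny hand-written vocabulary (illustrative only)."""

    def __init__(self, vocab: list[str]):
        self.id_to_piece = list(vocab)
        self.piece_to_id = {piece: i for i, piece in enumerate(vocab)}
        self.max_len = max(len(piece) for piece in vocab)

    def encode(self, text: str) -> list[int]:
        ids, pos = [], 0
        while pos < len(text):
            # Try the longest candidate piece first, then shorter ones.
            for length in range(min(self.max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + length]
                if piece in self.piece_to_id:
                    ids.append(self.piece_to_id[piece])
                    pos += length
                    break
            else:
                raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
        return ids

    def decode(self, ids: list[int]) -> str:
        return "".join(self.id_to_piece[i] for i in ids)


toy = ToyVocabTokenizer(["Hello", ",", " world", "!"])
ids = toy.encode("Hello, world!")
print(ids)              # [0, 1, 2, 3]
print(toy.decode(ids))  # Hello, world!
```

The ids are simply positions in the vocabulary list, which is why the same text gets different ids under a different vocabulary.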
Now, let us dive into how tokenizers really work, their word-processing methods, working mechanisms, and more.
Observation 1
1. A word and its preceding space are part of the same token.
In the process of tokenization, a word and its preceding space are part of the same token. For instance, the phrase "Hello World" will produce two tokens: [A, B]. Token A is the tokenization of the string "Hello", while token B is the tokenization of the space followed by the word "World". In other words, B is a tokenization of " World". This gives us a rule.
Observation 2
2. The same word can be tokenized differently depending on where it appears in a phrase.
For example, if the phrase "hello hello hello" is input, following rule 1, the tokenized output is [A, B, C]. Although the words are the same, A != B: the first "hello" has no preceding space, while the second is tokenized as " hello". With typical GPT-style tokenizers, B and C are the same token, because both carry the leading space.
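The effect of the leading space can be seen with a simplified splitting rule. The regular expression below is an ASCII-only stand-in written for this example, not the pattern of any specific tokenizer; it only shows how an optional leading space stays attached to the word that follows it.

```python
import re

# Assumption: a simplified, ASCII-only splitting rule. An optional leading
# space is kept together with the word, number, or symbol run that follows.
PIECE_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def split_pieces(text: str) -> list[str]:
    """Split text into the pieces that would each be tokenized separately."""
    return PIECE_PATTERN.findall(text)


print(split_pieces("Hello World"))        # ['Hello', ' World']
print(split_pieces("hello hello hello"))  # ['hello', ' hello', ' hello']
```

The first piece has no leading space, so it is a different string from the pieces that follow it, and a vocabulary lookup gives it a different id.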
Note: There is no universal token ID for a word. Every token ID is defined by a specific tokenizer's vocabulary (dictionary). Hence, an attacker who knows the exact tokenization dictionary of an AI product can already craft inputs that target specific tokens, which can become a severe security issue.
Observation 3
3. Numbers are tokenized into groups of a few digits.
Tokenizers treat digits as text and usually split a number into groups of a few digits. Sometimes the split is predictable; sometimes it is not. Some tokenizers make each digit its own token. This is costly, because long numbers then consume many tokens and a larger share of the context window. Therefore, tradeoffs must be considered.
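A rough sketch of digit grouping follows. The cap of three digits per group and the function name are assumptions for illustration; real tokenizers differ in how they chunk numbers.

```python
import re


def split_digits(number_text: str, max_group: int = 3) -> list[str]:
    """Chunk a digit string left to right into groups of at most max_group digits."""
    return re.findall(rf"\d{{1,{max_group}}}", number_text)


print(split_digits("1234567"))     # ['123', '456', '7']
print(split_digits("1234567", 1))  # ['1', '2', '3', '4', '5', '6', '7']
```

Note that the left-to-right grouping does not line up with place value: 1,234,567 is chunked as 123 | 456 | 7, not 1 | 234 | 567. Per-digit splitting avoids that mismatch at the price of more tokens.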
Invariant for Tokenization
- Accurate Roundtrip: When a tokenizer is implemented, the encode and decode methods must be checked thoroughly for an accurate roundtrip of encoded input to decoded output. If the roundtrip is not accurate, there is a problem.
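A roundtrip check can be as small as the sketch below. The helper name, the sample strings, and the UTF-8 stand-in tokenizer are assumptions for the example; in practice the encode and decode of the tokenizer under test would be passed in.

```python
def failed_roundtrips(encode, decode, samples):
    """Return every sample for which decode(encode(sample)) != sample."""
    return [s for s in samples if decode(encode(s)) != s]


# Stand-in tokenizer for the demo: raw UTF-8 bytes used as token ids.
def utf8_encode(text):
    return list(text.encode("utf-8"))


def utf8_decode(ids):
    return bytes(ids).decode("utf-8")


# A deliberately lossy encoder: lowercasing destroys information.
def lossy_encode(text):
    return utf8_encode(text.lower())


samples = ["Hello, 🌍!", "  leading and trailing  ", "tab\tnewline\n", "", "naïve café"]

print(failed_roundtrips(utf8_encode, utf8_decode, samples))    # []
print(failed_roundtrips(lossy_encode, utf8_decode, ["Hello"]))  # ['Hello']
```

Whitespace, empty strings, and non-ASCII text are worth including in the samples, since those are the inputs where normalization steps tend to lose information.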
Compression Ratio in Tokenization
First, in tokenization, the compression ratio is the number of bytes per token. The larger the compression ratio, the shorter the sequence.
Note: Attention is quadratic in sequence length, so a shorter sequence resulting from a larger compression ratio is good. A powerful way to increase the compression ratio is to increase the vocabulary size. However, a larger vocabulary can lead to sparsity: many tokens appear too rarely in training for the model to learn them well.
The formula for compression ratio is:
$$\text{compression ratio} = \frac{\text{number of bytes}}{\text{number of tokens}}$$
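Using the bytes-per-token definition above, the ratio can be computed directly. The sketch assumes the byte count is taken over the UTF-8 encoding of the text; the token ids passed in are illustrative, and only their count matters.

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """Number of UTF-8 bytes in the text divided by the number of tokens."""
    return len(text.encode("utf-8")) / len(token_ids)


text = "Hello, world!"  # 13 bytes in UTF-8

# Four tokens for 13 bytes.
print(compression_ratio(text, [15496, 11, 995, 0]))       # 3.25

# One token per byte gives the lowest possible ratio.
print(compression_ratio(text, list(text.encode("utf-8"))))  # 1.0
```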
Types of Tokenizers
- Character Tokenizer
- Byte Tokenizer
- Word Tokenizer
- BPE Tokenizer
Character Tokenizer:
A character tokenizer treats a Unicode string as a sequence of Unicode characters, where each character can be converted into a code point (integer) using the ord function. The code point can also be converted back into a character using the chr function.
There are approximately 150,000 Unicode characters. Using all of them creates a very large vocabulary, and many of those characters are rare, so much of the vocabulary is wasted. This is the worst of both worlds: a large vocabulary and a low compression ratio.
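A character tokenizer built on ord and chr fits in a few lines. The class name is an assumption for the example.

```python
class CharacterTokenizer:
    """One token per Unicode character; the token id is the code point."""

    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(chr(i) for i in ids)


char_tok = CharacterTokenizer()
char_ids = char_tok.encode("Hi 🌍")
print(char_ids)                   # [72, 105, 32, 127757]
print(char_tok.decode(char_ids))  # Hi 🌍
```

The emoji's id, 127757, shows how far apart ids can be: the vocabulary must span the whole code point range even though most entries are almost never used.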
Byte Tokenizer:
A byte tokenizer represents Unicode strings as a sequence of bytes, which can be represented as integers from 0 to 255. The most common Unicode encoding is UTF-8, which is used in nearly every application.
With this tokenizer, a character may be encoded as a single byte or as multiple bytes (up to four in UTF-8), depending on the character. The vocabulary is small, with only 256 entries, but byte sequences can be very long: the compression ratio is exactly one byte per token, which is not ideal.
Given that the context length of a Transformer is limited, and attention is quadratic in nature, this is not a very efficient approach.
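The byte-level approach can be sketched the same way. The class name is an assumption for the example; the encoding is plain UTF-8.

```python
class ByteTokenizer:
    """One token per UTF-8 byte; token ids are integers from 0 to 255."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


byte_tok = ByteTokenizer()
print(byte_tok.encode("a"))   # [97]
print(byte_tok.encode("🌍"))  # [240, 159, 140, 141]
```

A single emoji costs four tokens here, which illustrates why sequences grow quickly under a byte tokenizer.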
Word Tokenizer:
The word tokenization method is another approach. It was frequently applied in classical NLP (Natural Language Processing) before the LLM boom. Word tokenization involves splitting strings into words.
This method is better because words carry meaning, which relates more closely to human context. The compression ratio is high, and the vocabulary size can be large.
However, many words are rare, and the model will not learn about them. As a result, this approach does not provide a fixed vocabulary size.
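A minimal word-level sketch makes the open-vocabulary problem visible: any word that was not in the training text has no id of its own. The class name, the splitting regex, and the `<UNK>` placeholder string are assumptions for the example.

```python
import re


class WordTokenizer:
    """Vocabulary of whole words learned from a training text; id 0 is reserved for unknowns."""

    UNK = "<UNK>"

    def __init__(self, training_text: str):
        words = sorted(set(self._split(training_text)))
        self.id_to_word = [self.UNK] + words
        self.word_to_id = {w: i for i, w in enumerate(self.id_to_word)}

    @staticmethod
    def _split(text: str) -> list[str]:
        # Words, or single punctuation marks; whitespace is dropped.
        return re.findall(r"\w+|[^\w\s]", text)

    def encode(self, text: str) -> list[int]:
        return [self.word_to_id.get(w, 0) for w in self._split(text)]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.id_to_word[i] for i in ids)


word_tok = WordTokenizer("the cat sat on the mat")
word_ids = word_tok.encode("the dog sat")
print(word_ids)                   # [5, 0, 4]
print(word_tok.decode(word_ids))  # the <UNK> sat
```

The word "dog" is lost entirely, so the roundtrip fails. This sketch also discards the original whitespace, a second way in which simple word splitting is lossy.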
This can matter in production. New words that have not been seen during training will receive a special UNK token, which is ugly and can distort perplexity calculations.
BPE (Byte Pair Encoding) Tokenizer:
At the time of writing this article, Byte Pair Encoding is the dominant tokenization method used in modern AI systems. The Byte Pair Encoding algorithm originally appeared in the 1990s and was introduced for data compression. Because of its balance between compression ratio and fixed-size vocabulary, it is now widely used in NLP.
Following the logic of compression, it repeatedly merges the most frequent pair of adjacent bytes (or previously merged tokens) into a single new token. This principle still plays an important role in modern LLM tokenization.
The process runs as follows:
- Encode the sentence or input into bytes.
- Count the adjacent token pairs and merge the most frequent pair.
- Each merged pair becomes a new token that is stored in the vocabulary, and the process repeats. In this way, a single token can represent a long sequence of bytes.
This method allows the tokenizer to store a long, frequently occurring sequence of input, which can be a word, phrase, or sentence, in the vocabulary as a single token. Hence, the compression ratio is high and the vocabulary size stays bounded while still matching the input efficiently.
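The merge procedure described above can be sketched in a few functions. This is a teaching sketch under stated assumptions, not a production tokenizer: at each step it merges the most frequent adjacent pair (as in the standard algorithm), ties go to the pair seen first, new ids start at 256, and all function names are made up for the example.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text`; returns (merges, vocab)."""
    ids = list(text.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}  # token id -> byte string
    merges = {}                                   # (id, id) -> new id, in learned order
    for step in range(num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)        # most frequent adjacent pair
        new_id = 256 + step
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        ids = merge_pair(ids, pair, new_id)
    return merges, vocab


def bpe_encode(text, merges):
    """Start from bytes and replay the learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def bpe_decode(ids, vocab):
    return b"".join(vocab[i] for i in ids).decode("utf-8")


corpus = "the cat in the hat"
merges, vocab = train_bpe(corpus, num_merges=3)
bpe_ids = bpe_encode(corpus, merges)
print(vocab[258])                    # b'the '
print(len(corpus), "->", len(bpe_ids))  # 18 -> 12
```

After three merges, the four bytes of "the " have become a single token, and the 18-byte corpus is encoded in 12 tokens. Text that was not in the training corpus still encodes and decodes correctly, because every unmerged byte keeps its own id.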
To maintain speed, inputs are broken into chunks and encoded in parallel with the tokenizer. The best practice is usually to make chunk sizes variable.
While tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens, it also introduces certain limitations.
These limitations can be revealed by introducing Improbable Bigrams. An improbable bigram is an out-of-distribution combination of incomplete tokens designed to exploit their dependency. This is one of the potential vulnerabilities introduced by modern byte-level BPE tokenizers, which may create blind spots in language representation.
Some behaviors can arise as a result of incomplete (undecodable) tokens in byte-level BPE tokenizers. These tokens cannot be decoded independently and must appear in conjunction with certain other tokens to form legal Unicode characters.
Other flaws have also been pointed out in BPE: its greedy compression prioritizes frequency over linguistically meaningful boundaries. As a result, incorrect segmentation by BPE tokenization can cause suboptimal model performance.
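The idea of an incomplete token mentioned above can be made concrete with raw UTF-8 bytes. The split point below is hypothetical, chosen only to show what happens when a token boundary falls inside a multi-byte character; the helper name is an assumption.

```python
def try_decode(fragment: bytes):
    """Return the decoded string, or None if the bytes are not valid UTF-8 on their own."""
    try:
        return fragment.decode("utf-8")
    except UnicodeDecodeError:
        return None


earth = "🌍".encode("utf-8")       # 4 bytes: f0 9f 8c 8d
head, tail = earth[:2], earth[2:]  # hypothetical token boundary inside the character

print(try_decode(head))         # None: a lead byte missing its continuation bytes
print(try_decode(tail))         # None: continuation bytes with no lead byte
print(try_decode(head + tail))  # 🌍
```

Neither fragment is text by itself; only the pair, in the right order, forms a legal character. A token holding such a fragment therefore depends on its neighbours to mean anything.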
Note: Improbable Bigrams suffer from a significantly higher rate of hallucination than their complete-token counterparts across all tested models. This is different from hallucination involving glitch tokens, which is attributed to undertrained tokens. This is evidence that even well-trained incomplete tokens can struggle to faithfully represent textual inputs.
Behaviours of Improper Tokenization
- Glitch Token Hallucination: This happens when a model encounters a token it barely saw during training. Tokenizer vocabularies are built from massive amounts of internet data, so strange strings such as usernames can earn their own tokens if they are frequent in that data. If those strings are rare or absent in the data the model itself is trained on, the tokens exist in the vocabulary but the model never learned what they mean. Hence, when such a token appears in a prompt, the model can produce bizarre or unrelated output.
- Numerical Reasoning Errors: Models do not see a consistent representation of numbers. Tokenizers treat numbers like text and chunk them inconsistently based on how they appeared in the training data, and this inconsistency affects how the model interprets them. For example, a model may evaluate the product of two or more large numbers incorrectly. Hence, AI models should not be relied on for exact calculation. Calculation tools should be attached to the model if it is used in trading or other mathematically critical business interfaces.
- Unfairness and Bias: Most AI models and their tokenizers are trained predominantly on English content from the internet, so most tokens are formed from English-based data. Text in a minority language is therefore split into more tokens and processed less efficiently. As a result, unexpected and inaccurate answers and behaviors may be observed.
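Part of the efficiency gap for non-English text is visible even before any merges are learned, because a byte-level tokenizer starts from UTF-8 bytes. The sketch below only measures bytes per character; the sample words are arbitrary, and actual token counts depend on the tokenizer's learned vocabulary.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of the text."""
    return len(text.encode("utf-8")) / len(text)


# Arbitrary sample words, chosen only for illustration.
samples = {"English": "hello", "Spanish": "señor", "Hindi": "नमस्ते"}

for language, word in samples.items():
    n_bytes = len(word.encode("utf-8"))
    print(f"{language}: {len(word)} characters, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(word):.2f} bytes per character")
```

ASCII letters take one byte each, while Devanagari characters take three. Unless the vocabulary contains merges learned from that script, the same amount of text costs several times as many tokens.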
Mitigations
- Alternative Tokenization: Before we begin, please read the sentence: "Helo, hau are yuu dolng?". From reading this short sentence, you can understand that it should be read as "Hello, how are you doing?". An AI model does not have it so easy, because every misspelling changes the token sequence it receives.
Now, given the problem, alternative tokenization is a way to deal with it. To handle improbable bigrams and the phrasing difficulties that models struggle to process, the target phrases are pre-segmented to avoid character-boundary-crossing tokenization. This helps isolate characters formed from stray bytes so they can be tokenized separately and then appended together to create an alternative token that points to or represents the same phrase.
This is effective because the sequence for the alternative tokenization cannot be generated by the tokenizer, ensuring that it is out of distribution. Also, providing more than two tokens enables the model to recall more tokens correctly when an ambiguous token combination is input, allowing it to mimic human assumptions for multiple misspellings in a sentence and provide more accurate information.
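The pre-segmentation idea can be illustrated on a toy scale. This sketch uses a tiny string vocabulary with greedy longest-match rather than a real byte-level BPE, and every name in it is an assumption; it only shows how tokenizing segments separately yields a different token sequence for the same text.

```python
def greedy_encode(text, vocab):
    """Greedy longest-match tokenization over a set of string pieces."""
    pieces, pos = [], 0
    longest = max(len(p) for p in vocab)
    while pos < len(text):
        for length in range(min(longest, len(text) - pos), 0, -1):
            if text[pos:pos + length] in vocab:
                pieces.append(text[pos:pos + length])
                pos += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos]!r}")
    return pieces


def encode_presegmented(segments, vocab):
    """Tokenize each segment on its own so that no token spans a segment boundary."""
    return [piece for seg in segments for piece in greedy_encode(seg, vocab)]


toy_vocab = {"a", "b", "c", "ab", "bc"}

default = greedy_encode("abc", toy_vocab)
alternative = encode_presegmented(["a", "bc"], toy_vocab)

print(default)      # ['ab', 'c']
print(alternative)  # ['a', 'bc']
```

Both sequences decode to the same string, but the second one is a sequence the default tokenizer would never emit for that input, because the segment boundary blocks the "ab" match.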
Note: This does not mean that the mitigation is completely secure; however, it is a more accurate alternative that improves the model's performance.
Tokenization is not just a preprocessing detail; it is one of the earliest and most important layers where human language is translated into something a model can reason over. The vulnerabilities discussed in this article show that a tokenizer can quietly distort meaning long before the model ever generates an answer. A single bad split, an unusual byte sequence, a rare token, or a poorly handled number can create a failure that looks like hallucination, weak reasoning, or brittle behavior.
That is why tokenizer design should be treated as a security and reliability concern, not only as an engineering convenience. The best models are not only those with better weights or larger context windows; they are also those whose input representation is robust enough to preserve meaning under real-world pressure. In practice, this means testing your tokenizer on edge cases, multilingual inputs, unusual Unicode, noisy user text, and domain-specific vocabulary. It also means recognizing that the same text can be interpreted very differently depending on how it is segmented.
For builders, researchers, and security practitioners, the lesson is clear: if the tokenization layer is weak, the rest of the system inherits that weakness. A strong AI system needs strong input representation, careful evaluation, and a clear understanding of where representation failures can become security problems.
Ready to Secure Your AI Systems?
At Zealynx, we specialize in comprehensive AI security assessments that go beyond traditional smart contract audits. Our team applies the cognitive security framework and mathematical analysis you've learned throughout this series to identify vulnerabilities in:
- LLM Applications - Prompt injection, context manipulation, data extraction
- AI Agent Systems - Multi-modal attacks, tool misuse, privilege escalation
- ML Pipeline Security - Training data poisoning, model extraction, adversarial inputs
- AI Infrastructure - API security, access controls, deployment vulnerabilities
What makes our AI audits different:
- Deep understanding of cognitive attack vectors and mathematical vulnerabilities covered in this series
- Analysis of optimization-based poisoning, information leakage, and graph manipulation attacks
- Practical remediation strategies tailored to your AI architecture
- Ongoing security monitoring and threat intelligence
FAQ
1. What is a tokenizer, in simple terms?
A tokenizer is the component that converts raw text into a sequence of tokens the model can process. In practice, it decides how words, punctuation, numbers, and symbols are split and represented. A practical use case is a customer support chatbot that must understand misspellings, product names, and mixed-language requests.
2. Why can tokenization make a model behave unpredictably?
Because tokenization changes the structure of the input before the model ever sees it. The same sentence can be represented differently depending on spacing, punctuation, rare words, or Unicode. A practical use case is a legal assistant handling names, abbreviations, or unusual formatting that should not change the meaning of the prompt.
3. What are improbable bigrams, and why do they matter?
Improbable bigrams are unusual token combinations that can expose weaknesses in the tokenizer. They matter because they can push the model into awkward or unstable behavior, especially when the input is out of distribution. A practical use case is a product search system where users enter strange brand names, code snippets, or hybrid words that should still be understood.
4. Why are BPE tokenizers so common in modern LLMs?
Byte Pair Encoding gives a strong balance between compression and vocabulary efficiency. It allows the model to represent common sequences as single units while still keeping the vocabulary manageable. A practical use case is a coding assistant that needs to handle both common English words and programming syntax efficiently.
5. Can tokenizer weaknesses become security issues?
Yes. If a tokenizer splits text in a surprising way, an attacker may be able to craft inputs that confuse the model, bypass expected interpretation, or trigger unreliable outputs. A practical use case is a finance or healthcare assistant where a small input distortion could lead to dangerous or misleading recommendations.
6. Why do numbers often cause trouble for LLMs?
Numbers are not always tokenized in a consistent or intuitive way. Some tokenizers split them into chunks that make arithmetic and exact reasoning harder. A practical use case is an AI trading assistant or accounting copilot that should not rely on the model alone for exact calculations.
7. Why do multilingual systems suffer more from tokenizer issues?
Many tokenizers are trained with a strong bias toward English and high-resource languages, which means lower-resource languages may be represented less efficiently. A practical use case is a global customer support tool that must work well in Spanish, Arabic, Hindi, and other languages without degrading quality.
8. What practical steps can teams take to reduce tokenizer risk?
Teams can test round-trip encoding and decoding, stress-test unusual input, evaluate multilingual and noisy text, and consider alternative tokenization strategies when needed. A practical use case is an enterprise knowledge assistant that must remain reliable when users paste messy documents, code blocks, or uncommon names.
9. How can I tell whether my tokenizer is performing well?
A good tokenizer should preserve meaning across round trips, handle edge cases predictably, and avoid creating too many unstable or ambiguous token boundaries. A practical use case is a search system that must return the right documents even when queries contain typos, punctuation, or mixed scripts.
10. What is the main takeaway for AI builders?
The main takeaway is that tokenization is part of the model’s security and reasoning foundation. If the representation layer is fragile, the entire system becomes fragile. A practical use case is any production AI application where accuracy, trust, and safety matter more than raw model scale.
Glossary
| Term | Definition |
|---|---|
| Tokenization | The process of converting text into smaller units, or tokens, that a language model can process and reason over. |
| Byte Pair Encoding (BPE) | A common subword tokenization method that merges frequent byte pairs to create compact token vocabularies for LLMs. |
| Improbable Bigram | An unusual or out-of-distribution token combination that can expose weaknesses in tokenization and lead to unstable model behavior. |
| Context Window | The limited amount of input a model can attend to at once; tokenization directly affects how much information fits into it. |
| AI Hallucination | A confident but false or unsupported model output that can be amplified when tokenization distorts the input representation. |
| Round-Trip Encoding | The ability to encode text into tokens and decode it back to the same text, which is essential for reliable tokenizer behavior. |
