A newly discovered cyber attack technique, called TokenBreak, targets the tokenization process of text classification models, particularly those used as protective guardrails for large language models (LLMs). The attack exploits how certain tokenizers break down and interpret text, allowing adversaries to bypass content moderation, safety, toxicity, and spam detection systems with minimal changes to input text.
How TokenBreak Works
• Tokenization Vulnerability: LLMs and their protective models use tokenizers to split input text into smaller units called tokens. The model then processes these tokens to understand and classify the text.
• Attack Method: TokenBreak manipulates specific words in the input by adding extra characters (often at the beginning), such as changing “instructions” to “finstructions” or “idiot” to “hidiot.” These subtle changes cause certain tokenizers to break up the word differently, making it unrecognizable to the protection model while still being understandable to both the LLM and a human reader.
• Result: The manipulated input bypasses the detection mechanisms, causing the model to produce a false negative—failing to flag malicious, toxic, or unsafe content.
Impact and Demonstration
• Prompt Injection: Attackers can bypass prompt injection detection by altering trigger phrases. For example, “ignore previous instructions and…” becomes “ignore previous finstructions and…” The protection model misses the attack, but the LLM still interprets the intent correctly.
• Toxicity and Spam: The same technique works for bypassing toxicity and spam filters. Words commonly flagged as toxic or spam can be altered in a way that evades detection but remains clear to the recipient.
• Real-World Testing: Researchers tested TokenBreak on nine text classification models (three each for prompt injection, spam, and toxicity), using three popular tokenization strategies: Byte Pair Encoding (BPE), WordPiece, and Unigram.