New TokenBreak attack bypasses LLM protective guardrails.

June 13, 2025Cybersecurity News

A newly discovered cyber attack technique, called TokenBreak, targets the tokenization process of text classification models, particularly those used as protective guardrails for large language models (LLMs). The attack exploits how certain tokenizers break down and interpret text, allowing adversaries to bypass content moderation, safety, toxicity, and spam detection systems with minimal changes to input text.

How TokenBreak Works

• Tokenization Vulnerability: LLMs and their protective models use tokenizers to split input text into smaller units called tokens. The model then processes these tokens to understand and classify the text.
• Attack Method: TokenBreak manipulates specific words in the input by adding extra characters (often at the beginning), such as changing “instructions” to “finstructions” or “idiot” to “hidiot.” These subtle changes cause certain tokenizers to break up the word differently, making it unrecognizable to the protection model while still being understandable to both the LLM and a human reader.
• Result: The manipulated input bypasses the detection mechanisms, causing the model to produce a false negative—failing to flag malicious, toxic, or unsafe content.

Impact and Demonstration

• Prompt Injection: Attackers can bypass prompt injection detection by altering trigger phrases. For example, “ignore previous instructions and…” becomes “ignore previous finstructions and…” The protection model misses the attack, but the LLM still interprets the intent correctly.
• Toxicity and Spam: The same technique works for bypassing toxicity and spam filters. Words commonly flagged as toxic or spam can be altered in a way that evades detection but remains clear to the recipient.
• Real-World Testing: Researchers tested TokenBreak on nine text classification models (three each for prompt injection, spam, and toxicity), using three popular tokenization strategies: Byte Pair Encoding (BPE), WordPiece, and Unigram.

LLM TokenBreak

Last updated on June 13, 2025

United States	~59% of ransomware attacks globally Thousands per year
Poland	1,000+ per week
Russia	Highest cybercrime threat level
China	Thousands per year
India	115% surge in attacks Q2 2024
Ukraine	Significant surge since 2022
Brazil	Among top countries for blocked attacks
Mexico	65% of businesses hit in 2024
Germany	High targeted rate (EU)
France	High targeted rate (EU)

AS Name	ASN
Bharat Sanchar Nigam Ltd	9829
No.31,Jin-rong Street	4134
CHINA UNICOM China169 Backbone	4837
DigitalOcean, LLC	14061
HUAWEI INTERNATIONAL PTE. LTD.	136907
Amazon.com, Inc.	14618
Alibaba (US) Technology Co., Ltd.	45102
Google LLC	396982
Amazon.com, Inc.	16509
3xK Tech GmbH	200373

IP Address	Notable Exploits/Context
104.238.159.149	SharePoint zero-day, broad exploitation
107.191.58.76	SharePoint zero-day, government targets
96.9.125.147	SharePoint, previously Ivanti exploits
139.162.47.194	Exploits on CitrixBleed 2
38.180.148.215	CitrixBleed 2 campaigns
185.224.128.17	High activity, Netherlands
89.248.163.200	High activity, Netherlands
15.235.218.150	Associated with APT, active C2
45.9.148.114	Associated with C2, malicious netflow
91.107.150.184	C2 infrastructure, recent IoC

Share this:

Related