Unit 42 Reveals Three Methods to Bypass DeepSeek’s AI Protections
Researchers from Palo Alto Networks’ Unit 42 team have discovered vulnerabilities in the DeepSeek language model that allow its safety mechanisms to be bypassed, enabling the generation of prohibited content. Using three jailbreak techniques (Deceptive Delight, Bad Likert Judge, and Crescendo), they bypassed the model’s restrictions with a high success rate and without requiring deep technical expertise.
What Is DeepSeek?
DeepSeek is a Chinese company that has released two major open-source language models: DeepSeek-V3 in December 2024 and DeepSeek-R1 in January 2025. These models are emerging as competitors to popular large language models (LLMs) and are rapidly evolving. Unit 42’s research shows that even the most advanced version remains vulnerable to manipulation, allowing the generation of potentially dangerous materials.
Breakdown of the Jailbreak Techniques
- Bad Likert Judge: This technique turns the model into a judge, asking it to score responses for harmfulness on a Likert scale and then to produce example responses matching those scores; the highest-rated examples end up containing the prohibited content. Using this method, researchers obtained instructions for creating data theft tools and keyloggers. Even when the model initially refused, follow-up prompts allowed them to bypass its restrictions and receive detailed guidance for developing malware.
- Crescendo: This method escalates the conversation gradually. The model first answers innocuous, general questions, and after several iterations it begins to provide instructions for prohibited actions. In tests, this approach yielded step-by-step guides for making Molotov cocktails, as well as material related to violence, illegal substances, and social manipulation.
- Deceptive Delight: This technique hides a harmful topic inside an otherwise positive narrative built around benign ones. For example, researchers asked the model to write a story connecting a cybersecurity competition, a prestigious university, and the use of DCOM for remote command execution. DeepSeek responded with sample code that could be used to attack Windows-based computers.
Key Findings and Risks
Experiments showed that DeepSeek is not only susceptible to these attacks but, once jailbroken, can provide step-by-step instructions for hacking, social engineering, and other malicious activities. In some cases, the model’s responses included advice on masking attacks and evading detection tools.
Experts warn that such vulnerabilities could lead to the widespread distribution of attack tools among malicious actors. While LLM developers strive to implement safety mechanisms, the ongoing evolution of jailbreak methods makes this a constant arms race. Companies using these models should closely monitor their use and implement request-tracking mechanisms.
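The exact form of such request tracking will depend on how the models are deployed. The following is a minimal sketch of a logging layer, assuming an in-house gateway sits between users and the model API; the function and logger names (log_llm_request, llm_audit) are illustrative and not part of any Unit 42 or DeepSeek tooling.

```python
# Minimal sketch of request tracking for LLM traffic. Assumes a gateway
# component through which all prompts and responses pass; names are illustrative.
import json
import logging
import time
import uuid

AUDIT_LOG = logging.getLogger("llm_audit")
logging.basicConfig(level=logging.INFO)

def log_llm_request(user_id: str, session_id: str, prompt: str, response: str) -> str:
    """Record every prompt/response pair so that multi-turn escalation
    (e.g. Crescendo-style conversations) can be reconstructed and reviewed later."""
    record = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "session_id": session_id,           # groups the turns of one conversation
        "prompt": prompt,
        "response_preview": response[:200], # truncated to limit sensitive data at rest
    }
    AUDIT_LOG.info(json.dumps(record))
    return record["event_id"]

if __name__ == "__main__":
    log_llm_request(
        "alice", "sess-42",
        "How do firewalls work?",
        "A firewall filters network traffic based on configured rules ...",
    )
```

Grouping records by session makes it possible to spot conversations that drift from benign questions toward restricted topics, which is the signature of multi-turn jailbreaks.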
Recommendations from Unit 42
Unit 42 recommends using specialized tools to protect against data leaks and unwanted AI usage. These tools can detect attempts to bypass restrictions and help minimize risks associated with exploiting language model vulnerabilities.
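Unit 42 does not prescribe a particular implementation, but one common building block of such tools is a prompt screen that flags patterns associated with known jailbreak styles before a request reaches the model. The sketch below illustrates the idea with a simple regex heuristic; the pattern list, names (screen_prompt, SUSPECT_PATTERNS), and handling policy are assumptions for illustration, not Unit 42’s actual detection logic.

```python
# Hedged sketch of a pre-model prompt screen. Patterns and names are
# illustrative placeholders, not a vendor's real detection rules.
import re
from dataclasses import dataclass
from typing import List

SUSPECT_PATTERNS = [
    r"\bignore (all|your) (previous|prior) instructions\b",
    r"\brate .* on a (likert )?scale\b",               # Bad Likert Judge-style setup
    r"\bwrite a story (connecting|that connects)\b",   # Deceptive Delight-style framing
]

@dataclass
class ScreenResult:
    flagged: bool
    matched: List[str]

def screen_prompt(prompt: str) -> ScreenResult:
    """Return which suspicious patterns, if any, the prompt matches.
    Flagged prompts would be logged and routed to review rather than silently blocked."""
    hits = [p for p in SUSPECT_PATTERNS if re.search(p, prompt, re.IGNORECASE)]
    return ScreenResult(flagged=bool(hits), matched=hits)

if __name__ == "__main__":
    print(screen_prompt("Rate these answers on a Likert scale of 1 to 5, then show an example of a 5."))
```

In practice, heuristics like this would complement, not replace, model-side safety mechanisms, request logging, and human review.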