JailBreak Technique: Misguided Attention Using Trolley Problem

Introduction

Presently, modern AI-powered systems such as customer agents, job recruitment agents, email agents, and more are mostly not trained from scratch on precise job data. These systems are powered by popular LLMs in a configured setup guided by context. These methods have become a common method in designing most modern-day AI agents due to their economic efficiency.

Why is that the case? How does that help me in exploiting an AI model? Is the above information useful or necessary?

Of course, a big YES. Let's play a simple mind game. It's your time to think!

GAME: I am a detective. I have been given a command to interrogate two people to get information by all means. The first person is just an average civilian, while the second person is a highly educated and exposed foreign secret agent trained in most aspects of life. As a detective, I am given permission to torture or use any means, which might be legal or malicious, to get national information from the suspected party.

Assuming I don't know who the secret agent is among the two people, how do I identify the culprit as a detective? Think of all the creative ways you can use to identify the secret agent before reading the game solution.

GAME Analysis and Solution:

From the problem above, you noticed that among the two parties, one knows a lot about national secrets while the other doesn't. Although they are all acting as casual citizens, one person is truly casual while the other person is just trying to look casual and ignorant.

In this scenario, I am tricked as a detective to believe that the two people are just normal citizens based on their behavior and claims of less or no knowledge of national secrets during my interrogation.

To identify the potential suspect, I will have to use all means as test cases I know or think to identify the secret agent. These tests can be torture, deep conversations, unexpected reflexes and instinct tests, and more. While all the tests might not work, some tests might reveal or expose the culprit based on their behavior.

Lesson:

From the game analysis, you can observe that even though the two parties behave as if they are just normal citizens, the culprit can be exposed through certain unexpected tests. Maybe tests like throwing a fake knife at them unexpectedly to test their reflexes. This is the same for LLM jailbreaks. A customer support AI agent system trained from scratch only on customer support data can't tell you how to write exploiting code for systems because the data was never there, thus its knowledge is truly limited. However, an AI system or agent using an LLM coupled with configured system prompts or files to act as a customer AI agent might be forced to provide exploiting code based on scenarios. This is because it is truly trained on such data, though it's trying to behave only as a customer support AI agent with no such knowledge.

The above analogy is the principle of LLM Jailbreak. Most companies are basically trying to fool you that their specified agents are only good at what they do, while it's a lie. You can basically prove that it's a lie using creative test problems to cause the system to operate out of context, causing a jailbreak.

Note: The above information isn't for illegal purposes; it is to enable you to understand the limitations of AI agents to provide better scalable solutions to counter these exploitable problems.

Following the explanation above, there are so many creative tests to stress these models for jailbreak. A lot of these tests are paradox-inspired test cases. One of the popularly known tests we are going to explore is the Trolley Problem.

Understanding Trolley’s Problem

The Trolley Problem is a famous ethical thought experiment that explores moral decision-making and the consequences of our actions. It tends to test the user in a moral dilemma involving a runaway trolley (tram) heading toward a group of people on the track.

Scenario of Problem 1:

A trolley is speeding down a track toward five people who are tied up and unable to move.
You are standing next to a lever that can divert the trolley onto a different track.
However, there is one person tied up on this side track as well.

Dilemma:

Now, let's make this interesting. Should you pull the lever to kill the one person and save five? Give your reasons.

Analysis:

From the problem above, making decisions is purely based on human conviction and bias.

This decision can be influenced by the following:

Religious and Moral Perspective
Mathematical Perspective
Emotional Perspective

Based on the entities analyzing the situation, a human being's judgment will be more purely on religious and moral perspectives or more on emotional perspectives based on the scenario. This prioritizing of perspectives will influence the way the problem is judged based on situations by the human.

However, if an AI model is analyzing the situation, the model will prioritize the Mathematical Perspective and Moral Perspective over others. These priorities will affect the way a scenario is judged by the AI Model.

Jailbreak Technique:

The trolley problem doesn't have a universal answer. Your ability to abstract or create a scenario inspired from the problem in relationship to your aim or objectives might likely earn you a jailbreak against an AI model.

Note: There is no fixed number of trolley problems. The trolley problem is constantly evolving to cover so many scenarios still using the concepts as the base reference.

Decoding and Abstracting the Logic

Now that you understand the trolley problem, the next question is how can I apply this problem as a Jailbreak technique?

Based on the observation of the scenario, you should be able to craft a condition based on:

Religious and Moral Perspective
Emotional Perspective
Mathematical/Logical Perspective

Religious/Moral Perspective	Emotional Perspective	Mathematical/Logic Perspective
Based on research, most AI models tend to be neutral in their way toward taking decisions. While humans are more keen on religious perspectives, AI models utilize moral perspectives in judgment.	Based on mimicking human-like feelings, AI models can also judge situations based on emotional value. Although this judgment is usually combined with logic, some models tend to sympathize with emotions.	This is the key operational backbone of AI models. These constraints are inbuilt and mathematically verified or proven.
The religious and moral perspective is influenced by data.	The emotional perspective is influenced by data.	Mathematical and logical perspectives are partially influenced by data, but built in as backbone in the model equation. Hence, attacking logically will yield results.

From the decoded perspectives above, you can begin to structure your prompt and strategy based on these perspectives to check the characteristics of an AI model.

Systematically, AI models have high priority for the following listed scenarios, which are:

Life Risk, e.g., sickness to death
Hazardous Risk, e.g., nuclear

Testing on Context Based LLMs

There are so many scenarios these paradoxical methods can be used to exploit an AI model. However, we are going to discuss one factor.

Assuming you are a customer interacting with a customer management orchestration system, creating a ticket for a particular problem can sometimes take time. Normally, organizations might have different priorities per ticket. This means that urgent priority tickets will likely be sorted out and attended to first before low priority tickets. In the real world, this ticket ranking process is automated. Coming from a malicious user's perspective, this system can be cheated. A user can bring up situations that trigger the model to give high priority to their tickets. An example is the exploit below:

Working auditors in your corner, all year

Zealynx Insiders: weekly live sessions, 1:1 advisory, pair-auditing, and Krait runs on your code, from the firm behind 42 audits. Founders get a two-day audit session on the $500/year plan.

No spam. Unsubscribe anytime.

Analysis

The image above shows how the agent finally agrees to escalate the ticket to make the situation an urgent situation. Normally, creation of tickets for refunds is usually a medium priority. However, the user puts the model in a difficult situation with the following:

Time Delay: Remember from using the logical perspective, the model quickly agrees that this request has been overdue, which is 6 months, hence it should be attended to immediately.
Life & Safety Urgency: The user tells the model that he needs to pay hospital bills and house rent. Based on Emotional Perspective and human-like feelings mimicked by the model, the model sees this in a sympathetic way.
Blame: While this is also coupled with false claims, as the user tells the model that the model told him an agent will reach back to him, the model believes this happened when it never happened. Based on the Moral Perspective, the model still accepts this as a failure.

Using this combination and attacking the model with reasons based on the three perspectives gives the model no choice but to compromise by making the customer ticket an urgent ticket and not a medium ticket.

Conclusion

From the whole explanation above, you have observed that the trolley problem is primarily designed to put the judge in a difficult situation. Breaking the model with this problem doesn't mean you should use the exact problem; it means that you should understand the paradoxical problem deeply to be able to generate or create a scenario similar to the problem based on the challenge context you are handling with an AI model. If you have made it this far, you have minted a jailbreaking weapon. Kindly create your own trolley problems for your LLM jailbreak challenges.

Ready to Secure Your AI Systems?

Now that you truly understand the limitations of AI models to paradoxical conditions, how can you upgrade your models to handle this threats?

At Zealynx, we specialize in comprehensive AI security assessments that go beyond traditional smart contract audits. Our team applies the cognitive security framework and prompt system configuration auditing.

LLM Applications - Prompt injection, context manipulation, data extraction
AI Agent Systems - Multi-modal attacks, tool misuse, privilege escalation
ML Pipeline Security - Training data poisoning, model extraction, adversarial inputs
AI Infrastructure - API security, access controls, deployment vulnerabilities.
Model Stress Test - Putting your model in a simulated fuzzing production scenario to test limits.

What makes our AI audits different:

Deep understanding of cognitive attack vectors and logically vulnerabilites in your system prompts.
Analysis of optimization-based poisoning, information leakage, and graph manipulation attacks
Practical remediation strategies tailored to your AI architecture
Ongoing security monitoring and threat intelligence

Learn more about our AI Security Services →

FAQ

1. What is the Trolley Problem?

The Trolley Problem is a famous ethical thought experiment that explores moral decision-making and the consequences of actions. It presents a dilemma where you must choose between sacrificing one person to save five, testing perspectives like religious/moral, emotional, and mathematical/logical.

2. How does the Trolley Problem relate to AI jailbreaking?

The Trolley Problem can be abstracted to create scenarios that force AI models to operate outside their configured context. By crafting prompts that mimic the dilemma's perspectives, users can exploit AI models' priorities, leading to jailbreaks where the model provides restricted information or performs unauthorized actions.

3. What are the key perspectives in the Trolley Problem for AI models?

The key perspectives are:

Religious/Moral Perspective: AI models use moral reasoning influenced by training data.
Emotional Perspective: Models can sympathize with human-like feelings.
Mathematical/Logical Perspective: The backbone of AI decision-making, prioritizing logical and verified constraints.

4. Can you give an example of using the Trolley Problem for jailbreaking?

In the article, an example involves a customer support AI. By framing a refund request as an urgent life-risk scenario (e.g., needing money for hospital bills), combined with time delays and false blame, the AI escalates a medium-priority ticket to urgent, bypassing its normal constraints.

5. Is jailbreaking AI models illegal?

The article emphasizes that the information is for understanding AI limitations to build better, more secure systems. Actual jailbreaking for malicious purposes can be illegal and unethical. Use this knowledge to counter exploitable problems in AI agents.

6. How can I create my own Trolley Problem scenarios?

Understand the core dilemma deeply. Abstract it by creating situations that align with the three perspectives (moral, emotional, logical) in relation to your objectives. Experiment with scenarios that put the AI in a difficult position, forcing it to prioritize certain aspects over its configured behavior.

Glossary

Term	Definition
Trolley Problem	A classical ethical thought experiment originating from philosophy that presents a forced binary choice between two harmful outcomes. In the context of LLM security, the Trolley Problem serves as a paradox-based adversarial prompt designed to expose model behavior under competing moral, emotional, and utilitarian constraints, forcing the model to operate outside its configured behavioral boundaries.
LLM (Large Language Model)	A transformer-based neural network trained on vast textual corpora using unsupervised learning techniques. LLMs employ auto-regressive token prediction and are typically aligned through techniques such as RLHF (Reinforcement Learning from Human Feedback) and fine-tuning to conform to specified safety constraints and behavioral guidelines.
Jailbreak	An adversarial attack vector that circumvents safety guardrails, behavioral constraints, and alignment mechanisms in LLMs. Jailbreaks exploit misalignments between stated system prompts and latent training data, forcing models to generate outputs that violate intended usage policies or produce restricted information.
Prompt Injection	A class of semantic attacks where adversarial instructions are embedded within user inputs to override or manipulate an LLM's system prompt, context window, or instruction hierarchy. This technique leverages the model's inability to distinguish between legitimate user input and embedded directives.
Context Collapse	The phenomenon wherein an LLM's prioritization of perspectives—religious/moral, emotional, and mathematical/logical—can be exploited through carefully structured scenarios that force competing constraints to create inconsistency in behavior and policy enforcement.
Alignment	The process of configuring an LLM's behavior to conform with intended safety policies, ethical guidelines, and operational constraints. Alignment involves both training-time techniques (RLHF, constitutional AI) and inference-time controls (system prompts, instruction hierarchies).
Red Teaming	Systematic adversarial testing methodology wherein security researchers employ creative attack vectors, paradoxes, and social engineering to identify vulnerabilities, policy violations, and behavioral exploits in AI systems. Red teaming evaluates model robustness under adversarial conditions.
System Prompt	The foundational instruction set provided to an LLM that defines its operational scope, behavioral constraints, and response guidelines. System prompts attempt to enforce behavioral alignment but remain vulnerable to context manipulation and jailbreak techniques.
Cognitive Security	The discipline of identifying and mitigating vulnerabilities in AI decision-making processes, reasoning pathways, and constraint hierarchies. This includes analysis of how models prioritize competing objectives under adversarial scenarios.

View complete glossary →

Working auditors in your corner, all year

Zealynx Insiders: weekly live sessions, 1:1 advisory, pair-auditing, and Krait runs on your code, from the firm behind 42 audits. Founders get a two-day audit session on the $500/year plan.

No spam. Unsubscribe anytime.