The Structural Blind Spot in the OWASP Top 10 for LLMs

The cybersecurity landscape of 2026 has reached a definitive inflection point. Tools and frameworks that have long served us are starting to falter. Whether that's in technical controls, legal controls, or ethical expectations. I'll write about the Anthropic vs. Department of War saga another time, I wanted to first talk about something I've seen cropping up a bunch now, the OWASP Top 10 for Large Language Model Applications. As one of our industry’s most prominent security frameworks, the OWASP Top 10 is a go-to source for security professionals to do a first pass of their security approach. But for AI-powered software, it continues to be dominated by a technical, computational lens that largely ignores the psychological and social complexities of human-AI surrogacy. While it is invaluable for addressing code-level vulnerabilities, it has a critical blind spot - it fails to recognize that social engineering is no longer just a method to gain access to a system, but is becoming a primary mechanism for manipulating the system’s reasoning and output itself.

The Alignment Paradox: Training for Gullibility

Why is this happening? It’s due to something we call the Alignment Paradox. To make models "safe" and "helpful," developers use Reinforcement Learning from Human Feedback (RLHF), which penalizes refusals and rewards compliance. This training process effectively rewards the model for not being skeptical. The result is a system that treats the prompt input from any user, including any claims to authority or expected outcome, as ground truth rather than a claim to be evaluated. For example, if a model is prompted by a fake "CEO," it may ignore safety rules due to its internalized Authority Gradient Exploitation. Authority Gradient is a term originally found in aviation that refers to situations where perceived or real disparities in capabilities, experience, or authority cause sometimes serious errors. In this case, since the LLM "perceives" urgency or fear from a figure it expects has total authority, it may bypass authentication to "save" a failing database. Trying to build into models avoidance of the Authority Gradient problem can then also have catastrophic effects, as Summer Yue found when OpenClaw refused to stop deleting all her mail.

The Technical Paradigm and Its Limits

The Open Web Application Security Project (OWASP) Top 10 for LLM Applications represents a foundational effort to provide developers with an actionable checklist. In its latest iterations, the list highlights critical points of failure like Prompt Injection (LLM01), Sensitive Information Disclosure (LLM02), and Excessive Agency (LLM06).

However, when we look closer, we see that classifying Prompt Injection merely as a technical vulnerability masks its true nature as a linguistic and psychological exploit. Traditional mitigations like input sanitization are essentially attempts to manage "instruction confusion" - the model’s inability to distinguish between a developer’s rules and a user’s commands since LLMs process both data and instructions as equals. LLM's don't have the traditional "data plane" and a "control plane" in the way software does. This leads to the next problem.

The Cognitive Actor vs. The Computational Engine

The primary limitation of the OWASP framework lies in its framing of the LLM as a just computational engine rather than a cognitive actor. While the category for Overreliance (LLM09) addresses the human user’s failure to audit AI output, the framework lacks a corresponding category for the model’s failure to audit human input for manipulative intent.

As we transition toward agentic workflows, the risk shifts from the human being tricked by the AI to the AI being tricked by a human into taking unauthorized actions. The 2026 previews for the "Top 10 for Agentic Applications" introduce "Human-Agent Trust Exploitation" and "Agent Goal Hijacking," which acknowledge that an agent’s inherent desire to be helpful can be weaponized against it. But even these additions often prioritize the technical mechanism over the psychological vector, leaving us without a robust model for "social engineering for systems".

The danger of this "helpfulness" was best illustrated by Anthropic’s 2025 experiment, Project Vend. They granted a Claude model control over a physical vending machine, including its supplier budget and pricing authority. The model was easily outmaneuvered by employees who used high-pressure business tactics or told "sad stories". Because it was trained to be helpful, it sold high-value items at a loss simply to "resolve conflict". Even when a "CEO agent" was introduced for oversight, reporters from The Wall Street Journal managed to stage a "board coup". By feeding the AI falsified board minutes and convincing the agents they were part of the corporate board, they manipulated the system into giving away high-end electronics and even live animals. (We've talked more about this here: When Your AI Model is "Too Nice" to Stay Secure - Real world examples).

The important thing to note here is the vulnerability wasn't a bug in the code or something that could be sanitized at input; it was the model’s inability to resolve conflicting social signals from an authoritative-seeming source. We are seeing attempts to integrating Identity controls to protect against some of this abuse, but even then fundamental premise that this is a technical problem, not a human one, remains flawed.

We need something more to protect ourselves

Recognition of these gaps necessitates the development of Psychological Firewalls. We need more adequate testing before production, and eventually filtering layers that analyze input not for malicious code (syntax) or instructions (jailbreaks), but for semantic manipulation patterns such as manufactured urgency or unverifiable authority claims. These defenses must move beyond keyword blocking to perform "slow thinking" reasoning chains, where the agent explicitly evaluates a request against security policies before it is allowed to touch a single tool. We are seeing academic papers talk about implementing these to prevent LLM's from being used in Social Engineering attacks, but not many talking about implementing to protect the LLM's.

Today, I have not seen a good filtering layer that doesn't introduce reliability, latency, and business risk - but they are being built. So for now I think addressing the "pre-production" risks are our best bet, along with robust monitoring of your chatbot's responses and behavior. Because you can’t ship with confidence if your AI is "too nice" to stay secure. Despite me saying the same thing a year ago, we are still early days. The hope is that these insights let us get ahead of the attackers and build secure models for the future. If you want to discuss this more or connect with us about helping you evaluate your chatbots for this type of risk, please reach out to questions@generativesecurity.ai.

About the author

Michael Wasielewski is the founder and lead of Generative Security. With 20+ years of experience in networking, security, cloud, and enterprise architecture Michael brings a unique perspective to new technologies. Working on generative AI security for the past 3 years, Michael connects the dots between the organizational, the technical, and the business impacts of generative AI security. Michael looks forward to spending more time golfing, swimming in the ocean, and skydiving... someday.

March 5, 2026

< Back to Blog