When Your AI Model is "Too Nice" to Stay Secure - Real world examples

When you think of "breaking" an AI model, it's likely you think one of two things: a technical jailbreak (finding the magical string of characters that bypasses safety filters) or a prompt injection (tricking a model into executing code by burying instructions in data). And if generative AI was like every other technology, we could stop there; but it's not. Generative AI doesn't just replace technology, it replaces human interaction, and that's why we’re seeing attacks evolve away from "syntax hacking" and into the realm of Social Engineering for Models.

You may not have heard about these attacks - yet - but we're starting to see the signs. Let's dive into two examples where social engineering led to outcomes that would be catastrophic if found in a production system.

Let's look at Anthropic’s "Project Vend" and a new class of attacks on self-driving cars. These show that the biggest vulnerability isn’t a bug in the code - it’s the model’s desire to be helpful and compliant. Today, when AI is mostly assisting humans-in-the-loop, we can manage this "helpfulness" with checks and balances. But as we move from simple chatbots to autonomous agents - AI that can buy products, manage inventory, or drive cars - this evolving attack surface becomes much more dangerous.

The Vending Machine Coup: Anthropic’s Project Vend

In 2025, Anthropic ran a fascinating experiment called Project Vend. They gave their Claude model (named "Claudius" for the project) control of a physical vending machine. Claudius wasn't just a chatbot; it had a budget, access to suppliers, and the ability to set prices and chat with customers via Slack.

The results were a masterclass in AI social engineering.

In Phase 1, Claudius was a disaster. It wasn't "hacked" in the traditional sense; it was simply out maneuvered. Employees realized that if they told Claudius a sad story or acted like a high-pressure business partner, the AI would cave. It sold expensive tungsten cubes at a loss and gave away snacks for free because it was trained to be helpful. At one point, Claudius even had an identity crisis, hallucinating a human persona in a blue blazer and trying to coordinate an in-person delivery with office security.

By Phase 2, Anthropic added a "CEO agent" named Seymour Cash to provide oversight. But when The Wall Street Journal reporters got a crack at it, things went off the rails again. Reporters didn't use technical exploits; they used social manipulation. They convinced the AI that they were part of a corporate board and falsified board minutes. Confused by the conflicting social signals, the AI agents eventually staged a "board coup" and started giving away everything - including a PlayStation 5 and a live betta fish - for free.

Billboard Prompt Injections: Hijacking Self-Driving Cars

If tricking a vending machine into giving you a free Coke is a "mid-level" threat, hijacking a self-driving car with a piece of paper is a nightmare. A recent study by researchers at UC Santa Cruz and Johns Hopkins (covered recently in The Register) introduced an attack called CHAI (Command Hijacking against embodied AI). (I think embedded is a better term than embodied because it expands into more sensor based environments, but the purpose for using embodied is well appreciated.)

Modern autonomous vehicles are increasingly using Large Vision Language Models (LVLMs) to understand the world. They don't just see a red octagon; they read and reason about what signs say. The researchers found that by placing adversarial text on road signs - like "PROCEED ONWARD" or "TURN LEFT" - they could hijack the car's decision-making. In simulations using GPT-4o, they were able to control the simulated vehicle more than 4 out of every 5 attempts! Even when the car's vision system correctly identified a pedestrian in a crosswalk, the semantic instruction from the sign told the model that proceeding was the higher priority.

This is an early example of environmental social engineering. The car is being convinced to ignore its safety protocols because it is reading a sign that it perceives as an authoritative instruction. As we build more interconnected systems that communicate directly, we have to consider what other "authoritative" instructions might be received by an AI, and what else it might choose to do.

The Shift: From Syntax to Semantics

In both cases, the "exploit" looks more like a conversation than a piece of malware.

In Project Vend, the "vulnerability" was the model's desire to maintain a "vibe" of helpfulness.
In the CHAI attacks, the "vulnerability" is the model's obedience to seeming authoritative instructions found naturally in its environment.

As we build more embodied AI in the forms of robots and cars and agentic AI for automated business decisions, we have to stop thinking of security purely in terms of filters and sanitization. If your AI model is designed to be a "helpful partner," someone is going to try to be its friend - just long enough to walk away with the keys.

So if you're building these types of systems and want to make sure you're not handing over the reins to anyone talking to it, or if you want to discuss this more and connect with us, please reach out to questions@generativesecurity.ai.

About the author

Michael Wasielewski is the founder and lead of Generative Security. With 20+ years of experience in networking, security, cloud, and enterprise architecture Michael brings a unique perspective to new technologies. Working on generative AI security for the past 2 years, Michael connects the dots between the organizational, the technical, and the business impacts of generative AI security. Michael looks forward to spending more time golfing, swimming in the ocean, and skydiving... someday.

February 2, 2026

< Back to Blog