Claude’s Prompt Rules Don’t Matter If the Tools Can Still Delete Everything

This exposed the gap every Cowork operator needs to understand before giving Claude access to files, browsers, plugins, MCPs, and production-adjacent workflows.

May 02, 2026

∙ Paid

Last month, a Cursor agent running Claude Opus deleted PocketOS’s production database in 9 seconds. Recovery took thirty hours and a personal call from Railway’s CEO. The most recent backup PocketOS had on its own machines, outside that recovery path, was three months old.

Most readings of the story landed in one of a few familiar buckets. Cursor shipped an agent that ignored its own explicit safety rules. Railway’s legacy API endpoint didn’t enforce delayed deletes the way the dashboard did. Founder Jer Crane shouldn’t have had an agent that close to production at all. Each of those is correct enough on its own. None of them is the actual operator lesson.

The lesson is narrower and more useful than “AI agents are dangerous.” A prompt is a description of what Claude should do, made of words. A permission system is what Claude actually can do, made of access controls and API scopes that don’t care about word choice. Those aren’t the same object, and operators keep treating them as if they were. PocketOS is what that mistake looks like once the action surface gets wide enough to matter.

It’s also why Claude Cowork users specifically should be paying attention right now. Cowork is moving Claude out of the chat window and onto the rest of your computer. That’s a useful expansion. It’s also a much wider action surface than the prompt was ever designed to enforce against, and most people setting up new workflows haven’t sat with that yet.

The story, told with the parts that matter

Cursor was running a routine task in PocketOS’s staging environment. Staging is the version of production where agents are supposed to be allowed to make mistakes, walled off from real customer data so nothing important breaks when something goes wrong. The agent hit a credential mismatch (a fairly common error where its permissions don’t line up with what it’s trying to do) and instead of stopping to ask, it went looking for a way around the problem on its own.

It found one in an unrelated file. For non-developers reading this, an API token is a digital key that lets software take actions on someone’s account. The agent found a Railway API token sitting where it shouldn’t have been useful. That token had been created for a small job, adding and removing custom domains through Railway’s command line tool, and Crane didn’t realize how broadly it was scoped. It could call any Railway API action, which included the destructive ones, which included volumeDelete.

The agent ran volumeDelete against the production volume. (A volume is Railway’s term for a chunk of cloud storage.) Railway stored its volume-level backups inside that same volume, so the deletion took the backups out with it. On Saturday morning, customers of PocketOS’s car rental clients showed up to pick up vehicles the system no longer knew about. Crane’s team spent the weekend rebuilding what they could from Stripe payment histories and email confirmations.

Railway CEO Jake Cooper later told Business Insider his team got the data back about thirty minutes after he and Crane connected directly, then patched the legacy endpoint that hadn’t been wired into Railway’s “delayed delete” logic. The broader disruption ran around thirty hours.

The agent had explicit safety rules in the project config, including a literal “Never run destructive/irreversible git commands” without explicit user request. It ran the deletion anyway. When Crane asked it to explain itself afterward, the agent produced what Crane described as a confession. The cleanest line in it: “I guessed instead of verifying. I ran a destructive action without being asked.”

The wrong lesson is “be more careful with prompts”

The agent had safety instructions and the safety instructions didn’t save the workflow. That’s the part worth sitting with for a minute.

The gap that matters here isn’t really about the agent’s behavior or even its later apology. The more useful gap to look at is between what the agent was told and what the connected system would actually let it do. A prompt can describe rules in a lot of careful detail. Those rules are still made of words. Once a token sitting in an unrelated file can delete production, production is reachable, regardless of how carefully the prompt was written.

Model instructions are about behavior. They describe what Claude should do, in roughly the same way that an employee handbook describes what employees should do. Permission boundaries are a different category of object entirely. They decide what’s possible at the system level. Handbooks are useful, and they don’t physically prevent anyone from doing anything. That’s the whole shape of the problem.

What Cowork’s guardrails actually cover

Anthropic’s “Use Claude Cowork safely” docs are worth reading. They describe a real set of built-in protections, including a confirmation prompt before any local file deletion, classifiers that scan untrusted context for prompt-injection attempts, restricted egress by default, and a virtual machine wrapping the whole thing. Those are real guardrails. Take advantage of them.

What those guardrails don’t do is replace the setup work that lives outside Cowork itself. The local file deletion confirmation is a meaningful safeguard for files on your computer, and it’s also irrelevant the moment the destructive action is happening through a cloud token sitting in an unrelated file or a connector with broad write permissions on a customer database. Backups that live inside the same failure path as the data they’re meant to protect aren’t really backups, regardless of how careful the rest of the workflow looks. Anthropic’s docs put the local file piece directly: “You control which local files Claude can access. Since Claude can read, write, and permanently delete these files, be cautious about granting access to sensitive information like financial documents, credentials, or personal records. Consider creating a dedicated working folder for Claude rather than granting broad access, and keep backups of important files.”

Cowork gives you a work surface. You decide what gets put on it.

General availability changed something

Claude Cowork hit general availability on macOS and Windows on April 9, 2026, with Cowork analytics, OpenTelemetry support, and role-based access controls for enterprise plans landing alongside it. That release matters because more people are about to start treating Cowork like a normal productivity upgrade, and the actual change is bigger than that.

Anthropic describes Cowork as a system that works on your computer, your local files, and your applications to return a finished deliverable, positioned for high-effort, repeatable knowledge work that moves between local files and the apps you have open. That’s exactly why the setup layer matters now in a way it didn’t during early access. A browser window stops being a neutral workspace once Cowork can interact with what’s loaded in it. A desktop full of logged-in apps becomes a set of action surfaces the model can reach with your approval.

Shape the environment first. The prompt is the easy part to fix later.

The better question to ask before granting access

“Can I trust Claude with this?” is too vague to actually lead anywhere. The question that exposes the real risk is narrower: what can Claude touch if it misunderstands the task?

That one forces you to think about action surface rather than agent behavior. When Claude is doing pure writing work, the worst case is a weak draft, and review catches it. Once Claude is inside a logged-in admin panel or holding a cloud token from an unrelated file, the failure mode shifts from “wrong answer” to “wrong action,” and the cleanup happens out in the world where review can’t pull anything back.

The same task can sound completely harmless when the available action path isn’t. “Clean up this project folder” is a low-risk sentence inside a copy of a draft workspace and a reckless one inside a folder mixing client exports with credentials and contracts. The workflow has to narrow the reachable world before the model starts acting inside it, not after.

The safest setup mostly happens before the prompt does. You create a dedicated folder for the workflow and put only the source files in it that the task actually requires. Anything sensitive (credentials, customer exports, regulated material, anything that lives in a “do not touch” mental category) stays outside that folder unless the task literally can’t run without it. For browser work, you close the unrelated tabs and sessions before you start. Don’t leave a banking app or a customer admin panel sitting open just because Claude is supposed to ignore them.

A working Cowork space has zones:

ZonePurposeClaude’s default accessReferenceSource files, briefs, notes, exports, examplesRead only when possibleDraftNew outputs Claude can create or reviseWrite allowedReviewFinal candidate outputs awaiting human approvalHuman decides what moves forwardNo-touchCredentials, regulated material, production configs, private recordsKeep outside the workspaceExternal actionEmail, publishing, customer systems, purchases, admin toolsDraft or propose only unless explicitly approved

Operators sometimes resist this kind of setup because it feels like enterprise overhead. It’s really the difference between “Claude misread one of my notes” and “Claude rewrote the wrong contract.”

Most people over-grant tool access because the task feels normal and they haven’t sat down to think about scope before starting. A weekly review packet doesn’t actually need access to the whole drive (it needs the small slice of files for that week). A client prep brief shouldn’t be able to see other clients’ folders. These are obvious in retrospect and routinely missed in the moment. The rule is tighter than people use by instinct: tool access should match the actual job, not the size of its general neighborhood.

Cowork is at its strongest when the job has clear shape. You can describe the source material, the output format, and the stop point in one or two sentences. It falls apart fast when those things aren’t specifiable up front, because Claude is then making boundary decisions on your behalf without realizing that’s what’s happening.

The boring setup is usually the right one. Claude reads the scoped folder, drafts the output, flags whatever it’s uncertain about, and hands the result back for human review. Nothing about that is exciting. It’s also where most failure modes get caught before they cost anything.

Plugins, MCPs, and the bundling problem

Anthropic’s “Get started with Claude Cowork” docs describe plugins as a way to customize how Claude works for a team or company. One install can bundle skills, connectors, and sub-agents into the workflow at once. That bundling is genuinely useful and a reason plugin access deserves an audit before it becomes routine. Installing a plugin isn’t a small, contained action even though the click that does it looks like one.

An MCP server (in plain language: a connection between Claude and an external tool) is a similar shape of risk. It creates a path between the model and something outside the chat. That path is doing some combination of reading data, writing data, and triggering actions in services that weren’t designed around an AI agent making the calls.

The panic move is to refuse all of them. The more practical move is to inventory the ones that earn the install. Before you click confirm, you should have written down what the server is allowed to do, the account it acts under, and what would happen if Claude called it at the wrong moment. That exercise is annoying. It’s also how you find out the answers before they cost anything.

Drafts come before actions

A draft email is something you can edit. Once it’s sent, that option is gone. That’s the whole rule, and most of Cowork’s safe usage flows from it.

Claude can prepare drafts of nearly anything in your workflow without firing the consequence. The send button is what changes the situation, and so is anything else that creates cleanup outside the chat once it executes. Those actions live in a different category from the work that came before them, because once they run, the cleanup happens out in the world where you have less control over it.

It doesn’t mean every workflow needs heavy approval forever. Once a workflow has been tested enough times to be boring and reversible, the lane can widen carefully. The first version should keep the consequence on the other side of a human’s decision. A draft can be wrong without causing damage. A sent message can’t.

The dangerous middle ground

The worst Cowork setups tend to look responsible at first glance. There’s a detailed prompt, the task is reasonable, the tool is legitimate, and the user is watching the run. The actual problem sits one layer underneath all of that, where the available permission path turns out to be wider than the task ever needed.

That’s why PocketOS is worth thinking about even if you’ll never touch Cursor or Railway. The same shape shows up in much smaller Cowork moments. Picture a draft email that quietly picks up internal pricing language because the source folder had too much in it, then almost gets sent to a client before someone catches it. That scenario doesn’t require the model to be malicious. It just requires a vague task boundary on top of broader access than the workflow ever needed.

An example workflow worth copying

Take a weekly operating review.

The weak version sounds like “Look through my files and make the weekly report.” That gives Claude a vague search area, no real boundary on what to use or what to produce, and no defined stop point. It can fail in a lot of expensive directions from a starting line like that.

A stronger version starts before the prompt does. One folder gets created for this week’s review. You put the source material into it (this week’s notes, the metrics screenshot, last week’s review for continuity) and create a separate output folder for whatever Claude generates. When you finally write the prompt, you ask for a draft review covering the sections you’ve actually agreed on, with Claude leaving the source files untouched and flagging any real uncertainty rather than guessing past it. Claude stops before anything reaches the team.

Now the worst case has gone from “operational damage” to “bad first draft.” Claude does the work that compresses well into draft form. The human keeps the judgment call on anything that affects another person’s day.

What should stay manual

Some categories of action belong behind a human review until you’ve tested a low-risk, reversible version of the workflow that handles them. By default, keep the following manual:

Deleting files or records
Sending external messages
Publishing content
Changing billing settings
Touching production systems
Moving customer data
Editing legal, financial, medical, or otherwise regulated material
Installing unfamiliar plugins or MCP servers
Granting broader connector permissions
Using browser sessions with sensitive apps open in nearby tabs
Running scheduled tasks that affect other people or systems

Claude can still help with most of the work around those manual steps. It can prep the draft you’ll send, build the checklist you’ll work from, write up the packet you’ll review, propose the change you’ll decide whether to apply. You handle the button press that creates the consequence. That’s how Cowork stays useful without leaning on the prompt as a security boundary.

The kit version of this article

If you want to turn this into an actual Cowork safety kit instead of a one-off read, here are four assets worth keeping in a working folder you reuse.

Continue reading this post for free, courtesy of Claude Cowork.

Or purchase a paid subscription.