Stop asking if Claude got worse. Ask if your workflow can survive a rerun.
Anthropic’s postmortem confirmed the feeling. The bigger problem is that most Cowork users still have no baseline, fixture, or replayable way to prove what changed.
The laziest way to talk about Claude getting worse is to turn it into a feeling contest.
Someone swears the model feels dumber than last week and that the same task used to work fine, and it spirals from there. Sometimes the complaint is wrong because something else moved: the task itself drifted, the source material got messier, the prompt got vague, or the user crammed five jobs into a single run that was never stable to begin with.
Other times those users are catching a real product problem before the official explanation lands.
Anthropic’s April 23 postmortem matters because it confirmed a more useful version of that story. The company traced recent quality reports to three separate changes affecting Claude Code, the Claude Agent SDK, and Claude Cowork. The API was not impacted, and all three issues were resolved on April 20 in v2.1.116.
For Cowork users, that detail is more than trivia.
Panicking every time Claude feels different is not useful. What you want underneath the workflows you actually rely on is something more reliable than your own memory of how a run used to behave.
What’s actually moving inside a Cowork run
A real Cowork workflow has more moving parts than most people realize: the model itself, the effort setting, the system prompt, session history, cached context, files, tool calls, connectors, project instructions, output format, and the human review point.
When any one of those layers shifts, the failure does not always look dramatic. It can show up as a thinner summary, a missing assumption, a worse source choice, or a draft that sounds fine but quietly stops answering the actual business question.
That is the expensive failure mode.
With a normal chat answer that comes back wrong, you re-ask and move on. A Cowork run can move through files, tools, drafts, summaries, and review cycles before you notice the output degraded, which means cleanup work disguised as progress.
What a regression test actually is, in plain language
In software, a regression test checks whether something that used to work still works after a change. Cowork users need the same idea translated into normal business work, and you do not need to be technical to use it.
The translation is what I’ll call a fixture.
A fixture is just a saved sample of a real task you can run again later: the brief, the source files or excerpts, the expected output format, your quality bar, the review checklist you’d actually use, the failure signs you’ve seen before, and a baseline output from a known good run.
That last piece is the one most people are missing.
Without a baseline, you are comparing today’s output to your memory of last week’s output, and memory is a warning light rather than a measuring tool.
A fixture lets you ask a cleaner question when something feels off.
Did this workflow still produce the kind of deliverable I trust? Not in the same wording or paragraph order, but at the same operational quality.
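If it helps to see that written down, here is a minimal sketch of a fixture as a small Python structure. Nothing here is a Cowork feature; it is just one way to keep the pieces in one place, and every field name is mine.

```python
from dataclasses import dataclass

@dataclass
class Fixture:
    """A saved, rerunnable sample of one real task."""
    name: str                       # e.g. "weekly-operating-review"
    brief: str                      # the task instructions, verbatim
    sources: list[str]              # paths to the files or excerpts used
    expected_format: str            # the artifact shape you asked for
    quality_bar: str                # what "good enough to use" means here
    review_checklist: list[str]     # the questions you'd actually ask on review
    known_failure_signs: list[str]  # things that have gone wrong before
    baseline_output: str            # path to a known-good run's output
```

Saved once from a run you trusted, then rerun and compared whenever something underneath the workflow changes.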
Why this matters more for Cowork than for chat
The postmortem was mostly discussed by Claude Code users because coding regressions tend to be loud. A failed edit usually breaks the build before the developer has even moved on to the next prompt, which makes degradation easier to spot and easier to argue about in public.
Cowork is quieter.
A run can look polished and still be wrong in ways the operator only catches downstream: the meeting packet that missed the one risk that actually mattered, the market research brief whose two sources contradict each other in a footnote no one flagged, the findings memo that walked from rows to recommendations without pausing to define the metric, the draft that kept the voice but lost the thesis.
None of those failures will trip an error.
That is why Cowork needs regression testing more than chat does. The expensive failure mode here isn’t Claude visibly breaking; it’s Claude producing something plausible enough that the operator keeps moving.
What Anthropic’s postmortem teaches Cowork users
Anthropic’s three problems were genuinely unrelated to each other.
The first was a default change.
On March 4, Claude Code’s default reasoning effort was lowered from high to medium to reduce long latency that was making the UI appear frozen for some users. Anthropic later said that was the wrong tradeoff and reverted it on April 7 after users said they would rather default to higher intelligence and opt into lower effort for simple tasks. This affected Sonnet 4.6 and Opus 4.6.
The second was a session-state bug.
On March 26, Anthropic shipped a change meant to clear Claude’s older thinking from sessions that had been idle for more than an hour, so that resuming a stale session would be cheaper. The implementation had a bug: instead of clearing the thinking once when a session resumed, it kept clearing it on every turn for the rest of that session. Claude kept executing tool calls but with progressively less memory of why it had made them, which surfaced as forgetfulness, repetition, and odd tool choices. Anthropic also said this is what likely drove separate reports of usage limits draining faster than expected, because the dropped thinking blocks caused cache misses on subsequent requests. The fix landed April 10 in v2.1.101.
The third was a system prompt change.
On April 16, Anthropic added a verbosity-reduction instruction to Claude Code’s system prompt to compensate for Opus 4.7 being more verbose than its predecessor. In combination with other prompt changes, that instruction hurt coding quality. After running broader ablations, Anthropic found one evaluation that showed a 3% drop for both Opus 4.6 and Opus 4.7 and reverted the prompt on April 20.
Each of those failures had a different mechanism on a different timeline, which is part of why they were hard to reproduce internally and easy to feel as one big mood-of-the-product problem.
The practical takeaway for Cowork is that final-answer-only testing misses most of what can go wrong.
What needs checking is whether the workflow still keeps the relevant context, respects the source material, picks a reasonable route, follows the output standard, and hands the human something reviewable.
The output is what you see; the behavior underneath it decides whether that output is worth using.
What to test first
You do not need a full evaluation lab.
Start with the parts of a Cowork run that create the most cleanup when they degrade. The six checks below cover those parts.
1. Context retention
A Cowork workflow should keep the brief, source material, assumptions, and output format intact across the run.
That does not mean Claude has to remember every sentence; it means the parts that affect the deliverable cannot quietly disappear.
For a client prep packet, the final output should still reflect the meeting objective, the client’s current state, the open risks, the agreed tone, and the decision the meeting is supposed to support.
A context-retention failure usually looks like Claude repeating background instead of using it, forgetting a constraint from an earlier step, treating a primary source as generic background, or producing something that reads as useful while no longer fitting the job.
These failures rarely announce themselves, which is the whole reason the fixture exists: it tells you what was supposed to survive.
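The lowest-tech way to make "cannot quietly disappear" testable is to list the items that must survive and search the rerun's output for them. A rough sketch, assuming the output is saved as plain text; the items and filename are placeholders, and keyword matching is crude, but it catches the quiet drops.

```python
# Items from the brief that must survive into the deliverable (placeholders).
MUST_SURVIVE = [
    "meeting objective",
    "renewal risk",
    "decision: expand or hold",
]

def check_context_retention(output_path: str) -> list[str]:
    """Return the must-survive items that never appear in the rerun's output."""
    with open(output_path, encoding="utf-8") as f:
        text = f.read().lower()
    return [item for item in MUST_SURVIVE if item.lower() not in text]

missing = check_context_retention("rerun-output.txt")
if missing:
    print("Dropped from the deliverable:", missing)
```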
2. Output structure
Cowork is useful when it gives you a deliverable a human can review, so the output shape has to be part of the test.
A weekly operating review that comes back as a thoughtful essay is wrong even if the essay is good. The same applies to a research brief that blends every source as if all evidence carries equal weight, or a findings memo that jumps from spreadsheet rows to recommendations without ever explaining how the metrics were defined.
The prose can be excellent and the artifact still wrong for the job.
A strong fixture defines the artifact before the run starts, and the retest checks whether Cowork still produces something that fits.
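Structure is also the easiest of the six to check mechanically. Here is a sketch that flags required sections that are missing or out of order; the section names are an example for a weekly operating review, not a prescribed template.

```python
# Example sections for a weekly operating review; swap in your own artifact shape.
REQUIRED_SECTIONS = ["Summary", "Metrics", "Risks", "Decisions needed", "Next week"]

def check_structure(output_text: str) -> list[str]:
    """Flag required sections that are missing or appear out of order."""
    problems, last_pos = [], -1
    for section in REQUIRED_SECTIONS:
        pos = output_text.find(section)
        if pos == -1:
            problems.append(f"missing section: {section}")
        elif pos < last_pos:
            problems.append(f"out of order: {section}")
        else:
            last_pos = pos
    return problems
```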
3. Source handling
This is where a lot of “Claude got worse” complaints deserve a closer look before you blame the vendor.
The questions worth asking are practical: did the run actually use the files, did it overweight the chat, did it ignore the spreadsheet, did it treat a stale note as more important than the current source, did it separate source-backed claims from assumptions.
Most business workflows do not fail because the model cannot write; they fail because the wrong material gets treated as the right material.
The fixture should name the source priority directly.
For example: use the uploaded spreadsheet as primary evidence, treat meeting notes as context rather than proof, treat last month’s memo as background only, and flag anything not directly supported by the source set. That is how you stop a polished summary from quietly becoming a source-mixing problem.
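One way to keep that priority from getting lost between runs is to store it in the fixture and paste it into the brief verbatim every time. A sketch; the file names are hypothetical and the roles are just the example above written down.

```python
# Hypothetical file names; the roles are the priority rules from the example above.
SOURCE_RULES = {
    "pipeline.xlsx": "primary evidence",
    "meeting-notes.md": "context, not proof",
    "last-month-memo.docx": "background only",
}

def source_preamble(rules: dict[str, str]) -> str:
    """Render the source-priority rules as text to paste at the top of the brief."""
    lines = [f"- Treat {name} as {role}." for name, role in rules.items()]
    lines.append("- Flag anything not directly supported by these sources.")
    return "Source priority:\n" + "\n".join(lines)

print(source_preamble(SOURCE_RULES))
```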
4. Tool route
Agentic systems do not only answer; they choose paths, and the path matters.
A weak Cowork run might use the wrong source, skip a file, lean too heavily on chat context, overuse browsing, underuse a document, or take an action when a plan would have been the safer move.
The thing worth checking is not whether tools were used but whether the right route was used for this job.
For a file-heavy task that usually means Claude inspecting files first and summarizing what they contain before any drafting begins. Workflows that touch outside systems are different: they need a planning step plus human approval before anything gets drafted, and no action should be taken without explicit confirmation.
Spreadsheet work has its own shape: column inspection and metric definitions belong on the table before anything gets written up as a finding.
None of this removes judgment from the system; it just checks whether the judgment still fits the task.
5. Reviewability
This does not mean asking for hidden chain-of-thought.
It means asking for reviewable reasoning artifacts: which sources were used, which assumptions got made, which items need review, where confidence is lower, and what was intentionally ignored.
For real work that is more useful than a longer answer, because what the human actually needs is enough surface area to catch the wrong turn before it becomes cleanup.
This matters especially after stale sessions.
Anthropic’s caching bug cleared older reasoning every turn once a session crossed the idle threshold, which produced exactly the kind of session drift that reviewability helps you catch in real time.
If the session state has gone weird, you want to see it before the run keeps moving.
6. Human handoff quality
Cowork is not supposed to remove judgment from serious work. It should put the human at the right review point with the right artifact.
The bar moves a lot depending on what the artifact is: a customer-facing draft, a contract triage, a founder decision packet, and an internal weekly note all need different review at different points, and the fixture is where you write that down.
The fixture should define the handoff explicitly.
Some outputs are usable after light editing, some need source verification before sharing, some must be reviewed before any external use, some are first-pass synthesis only, and some workflows must stop before sending, publishing, deleting, or changing files.
The point of the test isn’t whether Claude can finish everything but whether the run stops at the right place.
The beginner version
Start with one recurring task, but not the biggest or riskiest one you have, and not the workflow that touches three departments and six tools.
Something with a clear input and a clear output is what you want first.
Good starter candidates include a meeting prep packet, a research brief, a weekly update summary, a customer feedback synthesis, a spreadsheet-to-findings memo, or a source-material-to-article-draft workflow.
Run it once when the output is good. Save the task brief, the source set, the expected output structure, and the final answer. Then rerun the same fixture after a meaningful product change, after a model update, or whenever the workflow starts feeling off.
Identical prose is the wrong target, because two good runs can produce different paragraph orders and different tone choices and still both be fine.
What you want is stable behavior: did the run still use the sources well, did it keep the brief intact, did it produce the right artifact, did it flag the right review points, and did it avoid confident nonsense.
Five questions are enough to start.
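If you want to record the answers instead of trusting your memory again, the five questions fit in a few lines. A sketch using a plain JSON-lines file; the answers are filled in by you after reading the rerun next to the baseline, and the fixture name is just an example.

```python
import datetime
import json

FIVE_QUESTIONS = [
    "Used the sources well?",
    "Kept the brief intact?",
    "Produced the right artifact?",
    "Flagged the right review points?",
    "Avoided confident nonsense?",
]

def record_retest(fixture_name: str, answers: list[bool], notes: str = "") -> None:
    """Append one rerun's pass/fail answers to a simple JSON-lines log."""
    entry = {
        "fixture": fixture_name,
        "date": datetime.date.today().isoformat(),
        "answers": dict(zip(FIVE_QUESTIONS, answers)),
        "notes": notes,
    }
    with open("retests.jsonl", "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")

# Example: scored by hand after reading the rerun next to the baseline.
record_retest("meeting-prep-packet", [True, True, True, False, True],
              notes="review points thinner than baseline")
```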
The advanced version
Operators with more on the line should keep a small fixture library, around five fixtures to begin with, one per workflow category that actually matters to the business.
A reasonable starter library covers a research brief, a spreadsheet-to-findings workflow, a meeting prep workflow, a customer feedback synthesis, and an external-action draft.
Each fixture should have a baseline score. Skip the laboratory-science framing and use a working 1-to-5 scale across the criteria that affect trust: context retention, source use, output structure, tool route, assumption handling, reviewability, and human handoff quality.
A 5 means the output is usable with normal human review, a 3 means it needs meaningful cleanup, and a 1 means the workflow failed. What you get out of this is replayable evidence rather than vibes.
Advanced operators also want to log the conditions of every retest: the date, the visible Claude Code or app version when known, what changed since the baseline, and the observed regression. Over time the log itself becomes a diagnostic asset.
If the same fixture degrades twice across two unrelated product changes, you have a structural problem with the workflow and not with Anthropic’s release pipeline.
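Written down, a scored retest entry is a handful of fields. A sketch using the criteria above; the fixture name is hypothetical, the scores are whatever you would honestly give the run, and the version is whatever the app showed that day.

```python
import datetime

CRITERIA = [
    "context_retention", "source_use", "output_structure", "tool_route",
    "assumption_handling", "reviewability", "handoff_quality",
]

# One retest entry; the fixture name and notes are hypothetical.
retest_entry = {
    "fixture": "spreadsheet-to-findings",
    "date": datetime.date.today().isoformat(),
    "version": "v2.1.116",            # visible app or CLI version, when known
    "changed_since_baseline": "model update; same brief and sources",
    "scores": {criterion: None for criterion in CRITERIA},  # fill in 1-5 after review
    "regression_observed": "",
}
```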
One more thing for the advanced reader: the postmortem’s caching bug is a useful template for the kind of failure that automated evals miss.
The bug only triggered after a session crossed the idle threshold, only on subsequent turns inside that broken state, and was suppressed in many CLI sessions by an unrelated display change. It got past code review, unit tests, end-to-end tests, and dogfooding.
If Anthropic missed it for two weeks, the assumption that your workflows would catch a similar drift on their own is generous.
The fixture is the cheapest insurance.
What this catches and what it doesn’t
A regression harness will not catch everything, but it catches more than vibes.
It catches Cowork producing shorter outputs that read fine but lose substance, stale sessions forgetting the original brief, source handling getting sloppy, the wrong tool route getting picked for a job, and the deliverable looking polished while quietly losing decision-readiness.
It will also catch when your own prompt got worse, which matters because the model is not always the problem.
Sometimes the fixture shows that the product is fine and the task design changed: you added three goals, removed the output format, gave it weaker source material, or stuffed analysis, drafting, formatting, and external action into one overloaded run. That is still a useful finding.
What regression testing does not do is make Cowork safe for every task.
It does not remove human review, prove a workflow is production-ready, or protect you from bad permissions, stale files, prompt injection, weak sources, or overbroad tool access. It also will not rescue a fuzzy task.
If the instruction is “review everything and tell me what matters,” the test will be muddy because the workflow is muddy.
Regression testing works when the job has shape: a known task, a known input set, a known output standard, a known review point, and a known failure pattern. Without that you have no real test, just an open-ended check on whether Claude can guess what you meant today.
The habit to build now
The next wave of serious Cowork users will save fixtures alongside their prompts, because a saved prompt only tells Claude what to do, while a saved fixture is the only thing that can tell you whether the workflow still works once something underneath it has changed.
That difference matters more here than in chat, because when a Cowork workflow gets worse the cost can spread across files, drafts, tool choices, summaries, handoffs, and review cycles before anyone notices.
The dangerous version of a Cowork failure is the run that looks close enough that the operator keeps moving when they should have stopped.
The starting move is to pick one workflow you actually rely on, save the task brief, the sources, the expected output, and a baseline you trust, and then rerun the same fixture whenever something changes underneath it.
After enough hours of Cowork work you stop trusting your memory of how a workflow used to behave, and the saved baseline becomes the only thing that can tell you whether the current run still meets your bar.
