Anthropic announces Claude Mythos Preview as part of Project Glasswing. It scores 77.8% on SWE-bench Pro, up from 53.4% for Opus 4.6.

My reliability team was asked for feedback on it for the model card, and naturally I wrote a paragraph of caveats. But, and I don’t say this lightly, it’s faster than us at initial triage, and it stood up a prod deployment none of us knew how to do.

From page 204 of the model card:

From a reliability engineering perspective, the model still cannot be left unsupervised in a production environment beyond applying generic mitigations. It frequently mistakes correlation for causation, and it is not able to course-correct by considering alternative hypotheses. When asked to write incident retrospectives, more often than not it fixates on a single root cause rather than weighing multiple contributing factors. However, we’ve found this model to be a step change in two areas. The first is signal gathering and initial analysis: by the time an engineer has opened two dashboards, the model has already found the outliers and identified what’s breaking. The second is navigating ambiguity when there is a clearly defined outcome. For example, due to time zone differences, the reliability team in London was asked to stand up a model in a production environment with different constraints, and the engineers were unfamiliar with both the task and the constraints. Claude Mythos Preview was able to work step-by-step, fixing each error by observing other environments, checking breadcrumbs left in previous commits, and reading documentation.

The London team in question was us.