Anthropic announces Claude Mythos Preview as part of Project Glasswing. It scores 77.8% on SWE-bench Pro, up from 53.4% for Opus 4.6.

My reliability team was asked for feedback on it for the model card, and naturally I wrote a paragraph of caveats. But, and I don’t say this lightly, it’s faster than us at initial triage, and it stood up a prod deployment none of us knew how to do.

From page 204 of the model card:

From a reliability engineering perspective, the model still cannot be left unsupervised in a production environment beyond applying generic mitigations. It frequently mistakes correlation for causation, and it is not able to course-correct by considering alternative hypotheses. When asked to write incident retrospectives, more often than not it fixates on a single root cause rather than weighing multiple contributing factors. However, we’ve found this model to be a step change in two areas. The first is signal gathering and initial analysis: by the time an engineer has opened two dashboards, the model has already found the outliers and identified what’s breaking. The second is navigating ambiguity when there is a clearly defined outcome. For example, due to time zone differences, the reliability team in London was asked to stand up a model in a production environment with different constraints, and the engineers were unfamiliar with both the task and the constraints. Claude Mythos Preview was able to work step-by-step, fixing each error by observing other environments, checking breadcrumbs left in previous commits, and reading documentation.

The London team in question was us.