Claude Sonnet 4.5 and the promise of agents that “program themselves”: between the leap to 30 continuous hours and the old problem of delivering working software

Anthropic has reignited the debate about autonomous programming by introducing Claude Sonnet 4.5 with a deceptively simple yet difficult-to-replicate example: an agent that worked for 30 straight hours, generated about 11,000 lines of code to “clone” a Slack/Teams-style application, and stopped on its own once the task was complete. The milestone more than quadruples the 7 hours previously attributed to Opus 4 in May. For the company, the message is clear: Sonnet 4.5 is, in its words, “the best model in the world for real agents, programming, and computer use”.

The announcement comes amidst a core battle among Anthropic, OpenAI, and Google to win the enterprise market for autonomous agents: assistants that browse, operate a PC, orchestrate tools, and write code for hours without supervision. The stakes are high—licenses, services, and data—and the race is fueled by public demos and, increasingly, infrastructure around models.


What Claude Sonnet 4.5 Brings (Beyond the Headlines)

Anthropic isn’t just updating the model: it is trying to ship an agent development stack. Alongside the launch, the company is making available virtual machines, memory, context management, and multi-agent support. In its words, these are the “building blocks” used internally in Claude Code, now packaged so that developers can build their own “next-generation” agents. The approach mirrors what OpenAI (tools, computer control, integrations) and Google (Gemini plus app/device control, workbenches) are pursuing, and points to a shared conclusion: a model alone is not an agent.

In The Verge, Scott White (product lead) described Sonnet 4.5 as an assistant capable of operating at a “chief of staff” level: coordinating calendars, viewing data dashboards, extracting insights, and drafting status updates. Dianne Penn (product management) emphasized that it is “more than three times” better at computer use than the October version: browsing, filling out forms, copying and pasting, automating workflows. Canva, a beta tester, said it has helped with complex long-context tasks, from engineering work in its repository to product features and research.


…and what programmers see in their daily work

Alongside the announcements, many developers are experiencing a more prosaic reality. As Miguel Ángel Durán (@midudev) summarized: “Claude Sonnet 4.5 refactored my entire project in a prompt. 20 minutes thinking. 14 new files. 1,500 lines of code modified. Clean architecture. Nothing worked. But it was so beautiful.” Several tests reveal a common pattern: pristine structure, textbook naming, well-separated layers… and failures as soon as it is time to compile, test, or launch the system without human intervention.

This is no coincidence: delivering software requires more than generating files. It means closing the integration (authentication, permissions, state, persistence, webhooks), managing dependencies and environments (runtimes, package managers, build systems), and passing end-to-end tests (not just happy paths). Current models write ever better code, but they often fail to deliver a coherent whole that actually works without human help.


Why does the “beautiful code vs. working software” gap persist?

  1. Invisible complexity. A Slack clone isn’t just a UI: there are events, syncs, granular permissions, caches, migrations, observability… The agent tends to over-architect and to underestimate integration details.
  2. Environments and reproducibility. Agents rarely show environment discipline: pinned versions, build/run scripts, seed data, CI configuration, robust Dockerfiles.
  3. Meaningful tests. Writing tests isn’t the same as passing tests that matter: happy-path cases abound, edge cases are scarce (see the sketch after this list).
  4. Planning and coherence. Without a strategy of packages and contracts between modules, a massive refactor leaves subtle inconsistencies that break the product.
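
To make point 3 tangible, here is a minimal sketch in Python; the parse_mentions helper and the Slack-style mention format are hypothetical, chosen only for illustration. It contrasts the happy-path test an agent typically writes with the edge cases that usually break the build:

```python
import pytest

# Hypothetical helper an agent might generate for a Slack-style clone:
# it extracts user IDs from "<@U123>" mentions in a message body.
def parse_mentions(text: str) -> list[str]:
    ids = []
    i = 0
    while True:
        start = text.find("<@", i)
        if start == -1:
            return ids
        end = text.find(">", start)
        if end == -1:              # unterminated mention: stop instead of crashing
            return ids
        candidate = text[start + 2:end]
        if candidate and candidate.isalnum():
            ids.append(candidate)
        i = end + 1

def test_happy_path():
    assert parse_mentions("hi <@U123>") == ["U123"]

# The cases below are the ones agent-written suites tend to leave out.
@pytest.mark.parametrize("text,expected", [
    ("", []),                            # empty message
    ("<@>", []),                         # empty mention
    ("<@U123", []),                      # unterminated mention
    ("<@U1> and <@U2>", ["U1", "U2"]),   # multiple mentions in one message
])
def test_edge_cases(text, expected):
    assert parse_mentions(text) == expected
```

The point is not the parser itself but the second test: a suite covering only the first case would pass while the product still breaks on empty or malformed messages.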

What is actually a step forward

Nevertheless, there is real progress here. An agent capable of maintaining context for hours, returning to its own files, revisiting earlier decisions, automating the “digital sweat” (gathering LinkedIn profiles, preparing spreadsheets, drafting briefs), and operating a browser reliably does improve productivity. And the added stack (VMs, persistent memory, context management, multi-agent support) is an explicit acknowledgement of the Achilles’ heel: a model on its own isn’t enough; it needs mechanisms for state, tools, and control to approximate something like a “working system” of agents.


How to evaluate usefulness without falling for the hype

For engineering teams

  • Scope the tasks: request self-contained pieces (CRUDs, migrations, parsers, basic telemetry) and evaluate them with real tests in CI.
  • Reproducibility is mandatory: scripts (Makefile/NPX/Poetry), pinned versions, and a README with exact build/run steps.
  • Delivery metrics: time to green build, bugs per diff, PR approval time (a sketch follows this list).
  • Controlled environment: containers and linters to prevent “refactoring for sport”.
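
As an illustration of the delivery-metrics bullet, a minimal sketch in Python (the PullRequest fields and the sample data are invented) that computes time to green build, bugs per diff, and PR approval time from exported PR records:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class PullRequest:
    opened_at: datetime
    first_green_build_at: datetime   # first CI run that passed
    approved_at: datetime
    bugs_reported: int               # defects traced back to this diff
    lines_changed: int

def delivery_metrics(prs: list[PullRequest]) -> dict[str, float]:
    """Aggregate the three delivery metrics suggested above."""
    return {
        "avg_hours_to_green": mean(
            (p.first_green_build_at - p.opened_at).total_seconds() / 3600 for p in prs
        ),
        "bugs_per_kloc_changed": 1000 * sum(p.bugs_reported for p in prs)
        / max(1, sum(p.lines_changed for p in prs)),
        "avg_hours_to_approval": mean(
            (p.approved_at - p.opened_at).total_seconds() / 3600 for p in prs
        ),
    }

# Usage with made-up data:
now = datetime(2025, 10, 1, 9, 0)
prs = [
    PullRequest(now, now + timedelta(hours=2), now + timedelta(hours=5), 1, 800),
    PullRequest(now, now + timedelta(hours=6), now + timedelta(hours=20), 3, 1500),
]
print(delivery_metrics(prs))
```

Fed from the Git host’s API or CI logs, these three numbers make it visible whether an agent actually shortens delivery or merely inflates diffs.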

For product/business teams

  • Use cases with ROI: slides, dashboards, meeting summaries, reports, classifications.
  • Human in the loop (HITL): the agent proposes, someone validates.
  • Hours saved: quantify weekly time and perceived quality.

Industry context: “agent runtime” vs. “model size”

The race isn’t just about having the biggest model. OpenAI, Google, and Anthropic are moving towards execution environments that provide working memory, planning, tools (browser, terminal, editor), retries, and security (permissions, sandboxes). The winner won’t be whoever recites best, but whoever delivers reproducible, useful workflows without a human having to rebuild everything at the end.
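
In code, that “agent runtime” idea boils down to a control loop. The sketch below is generic and deliberately simplified; the planner, tool names, and permission policy are placeholders and do not correspond to any vendor’s actual API:

```python
from typing import Callable, Optional

# Simplified runtime loop: plan -> pick a tool -> check permissions -> execute
# with retries -> record the outcome in working memory.

ALLOWED_TOOLS = {"browser", "editor"}            # sandbox policy: no raw terminal

def run_agent(goal: str,
              plan: Callable[[str, list[str]], Optional[tuple[str, str]]],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 20,
              max_retries: int = 2) -> list[str]:
    memory: list[str] = []                       # working memory shared across steps
    for _ in range(max_steps):                   # hard budget so the loop terminates
        step = plan(goal, memory)                # planner proposes (tool, argument)
        if step is None:                         # planner declares the goal done
            break
        tool_name, argument = step
        if tool_name not in ALLOWED_TOOLS:       # permission check before execution
            memory.append(f"blocked: {tool_name}")
            continue
        for attempt in range(max_retries + 1):   # retries around flaky tool calls
            try:
                memory.append(f"{tool_name}({argument}) -> {tools[tool_name](argument)}")
                break
            except Exception as exc:
                if attempt == max_retries:
                    memory.append(f"failed: {tool_name}({argument}): {exc}")
    return memory

# Toy usage: a planner that opens one page and then declares itself finished.
def toy_plan(goal, memory):
    return None if memory else ("browser", goal)

print(run_agent("https://example.com", toy_plan, {"browser": lambda url: "200 OK"}))
```

Everything valuable in the commercial runtimes (real planning, persistence, sandboxing) lives in how each of these placeholder pieces is implemented; the loop itself is the easy part.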


Checklist of ‘serious’ signals to track progress

  • Reproducible repositories: infra as code, Dockerfiles, seed data, public CI.
  • Delivery benchmarks: measure time-to-green and MTBF (mean time between failures) after initial deployment.
  • Real integrations: OAuth, webhooks, queues (Kafka/Rabbit), Postgres/Redis, observability (OpenTelemetry).
  • Unit economics: ensure that the agent’s compute cost doesn’t exceed the value of the automated work (see the sketch below).
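
To ground the last two bullets, a minimal Python sketch (every figure is invented) that computes MTBF from failure timestamps and runs the unit-economics check of compute cost against the value of the work saved:

```python
from datetime import datetime

def mtbf_hours(deployed_at: datetime, failures: list[datetime]) -> float:
    """Mean time between failures since the initial deployment, in hours."""
    events = [deployed_at] + sorted(failures)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(events, events[1:])]
    return sum(gaps) / len(gaps) if gaps else float("inf")

def agent_pays_off(tokens_used: int, price_per_mtok_eur: float,
                   hours_saved: float, hourly_rate_eur: float) -> bool:
    """Unit-economics check: compute cost must not exceed the value of the work saved."""
    compute_cost_eur = tokens_used / 1_000_000 * price_per_mtok_eur
    value_of_work_eur = hours_saved * hourly_rate_eur
    return compute_cost_eur <= value_of_work_eur

# Invented figures: two incidents after deployment, 40M tokens at 10 EUR/Mtok,
# 12 hours of manual work saved at 60 EUR/hour.
deploy = datetime(2025, 10, 1)
print(mtbf_hours(deploy, [datetime(2025, 10, 3), datetime(2025, 10, 10)]))  # 108.0
print(agent_pays_off(40_000_000, 10.0, 12, 60))                             # True: 400 <= 720 EUR
```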

When will we see truly “self-programming” agents?

There’s no fixed date. The path likely runs through less monolithic models and more composite systems (medium-sized models plus RAG and tools), silicon optimized for the workload (GPUs plus NPUs/ASICs), and, above all, agent runtimes that reduce uncertainty. Meanwhile, the most realistic way to leverage Claude Sonnet 4.5 (or its rivals) is as an accelerator: offload repetitive work, produce draft code and documentation of acceptable quality, and automate low-risk tasks. The “AI that programs itself”, with no human intervention at all, is not here yet.


Conclusion

The 30-hour experiment with 11,000 lines places Claude Sonnet 4.5 exactly where Anthropic wanted it: at the center of the debate on autonomous agents. At the same time, the experience of real teams reminds us that clean code and delivered software remain two distinct crafts. The positive takeaway is that the machinery around agents (memory, tools, context, multi-agent support) keeps improving. Prudence suggests grounding the promises in measurable tests, with reproducibility and delivery benchmarks. Until then, instead of programmers on vacation while an agent “clones” a Slack, we will have working pairs: AI that prepares the work and humans who put it into production.


Frequently Asked Questions

Can Claude Sonnet 4.5 build complex applications unaided?
It can generate large codebases and maintain context for hours. In practice, releasing a complex app usually requires an engineer to finalize integrations, environments, and tests.

What real advantages does it bring over previous versions?
According to Anthropic, it’s 3× better at computer use and arrives with a stack of agent capabilities (VMs, memory, context management, multi-agent) for long-context tasks. Beta testers (e.g., Canva) report improvements in complex automations.

Why does “textbook” code sometimes fail to compile or run?
LLMs imitate structure and style, but they often falter on the contracts between modules, dependencies, environments, and meaningful tests. Without reproducibility and CI, the leap from looking right to actually working is fragile.

How to use it today without risk?
Scope tasks (CRUD, migrations, parsers), require scripts and fixed versions, integrate CI with real tests, and measure €/result (time saved, lead time, bugs avoided). Use the agent as a copilot, not as a full team.
