Google has taken a giant step forward in its multi-architecture strategy. Following the announcement of Axion, its first custom-designed Arm CPUs, the company has explained how it enabled tens of thousands of internal applications to be compiled and run simultaneously on x86 and ARM within its production clusters. This is not an experiment: YouTube, Gmail, and BigQuery already serve traffic on both ISAs in parallel, with ARM hardware running at full utilization and more servers deployed each month.
The incentives are clear. According to Google's data, Axion-based instances deliver up to 65% better price-performance and up to 60% better energy efficiency than comparable instances in Google Cloud. At data center scale, that improvement compounds: lower energy bills and more usable capacity per watt for customers and first-party services.
Migrating to “multiarch” at the warehouse scale: what was not the problem
The technical account debunks some common assumptions. Before starting, both the multiarch team and application teams assumed the major hurdles would be architecture differences: floating-point deviations, concurrency, vector intrinsics, platform-specific operators, or performance gaps that are hard to close. That turned out not to be the case, or at least not to the expected extent.
During the early migrations of "top" services (such as F1, Spanner, or Bigtable), engineers did encounter real cases of these differences, but fewer than expected. Modern tooling, such as compilers and sanitizers, had surfaced many of these issues over the years. The greatest friction lay elsewhere:
- Overly x86-tuned tests assuming specific timings or platform details.
- Old build and release systems that did not account for Arm variants.
- Deployment configurations that didn’t understand the same service running on two ISAs.
- Operational risk in touching critical systems in production.
One engineer summarized it as: “Everyone obsessed over the completely different toolchain, and ‘surely everything will break’. Most of the difficulty was in configs and boring stuff.”
The challenge wasn’t migrating the “big ones”: it was the long tail
Moving a dozen critical jobs with dedicated squads and weekly meetings works, but it does not scale. Roughly 60% of in-flight compute resides in the top 50 applications, but in a monorepo with over 100,000 applications, the remaining usage forms a rather flat "long tail". If the goal is for Borg (Google's internal scheduler) to fit jobs efficiently onto cells with both x86 and ARM machines, the more multiarch jobs, the better. The only viable path is automation.
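The scheduling argument can be illustrated with a toy placement model (all names and the greedy policy are hypothetical; Borg's actual algorithm is far more sophisticated): jobs flagged multiarch can land on whichever pool has free capacity, while x86-only jobs constrain the packing.

```python
# Toy bin-packing sketch: why more multiarch jobs raise utilization.
# Hypothetical model, not Borg's actual placement algorithm.

def place_jobs(jobs, capacity):
    """Greedily place jobs onto per-ISA pools.

    jobs: list of (name, cores, archs), where archs is the set of
          ISAs the job supports, e.g. {"x86"} or {"x86", "arm"}.
    capacity: dict of free cores per ISA pool.
    Returns (placements, unplaced).
    """
    free = dict(capacity)
    placements, unplaced = {}, []
    for name, cores, archs in jobs:
        # Candidate pools: supported ISAs with enough free cores.
        candidates = [a for a in archs if free.get(a, 0) >= cores]
        if not candidates:
            unplaced.append(name)
            continue
        # Prefer the pool with the most free capacity.
        target = max(candidates, key=lambda a: free[a])
        free[target] -= cores
        placements[name] = target
    return placements, unplaced

cap = {"x86": 8, "arm": 8}
jobs_x86_only = [("svc-a", 6, {"x86"}), ("svc-b", 6, {"x86"})]
jobs_multiarch = [("svc-a", 6, {"x86", "arm"}), ("svc-b", 6, {"x86", "arm"})]

# x86-only: the second job is stranded even though the arm pool is idle.
_, stranded = place_jobs(jobs_x86_only, cap)
# multiarch: both jobs fit, one on each pool.
_, stranded_multi = place_jobs(jobs_multiarch, cap)
```

With identical demand and capacity, the x86-only variant leaves one job unplaced while the multiarch variant places everything, which is the utilization argument in miniature.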
Google approaches this with a set of tools — some already used before the project — and with a new AI agent to close the gap.
Automation: bulk commits, validation, and deployments without “human hand”
- Rosie: generates large-scale commits and guides them through review flows. Example: activating Arm mode in a job’s Blueprint with a single config line.
- Sanitizers and fuzzers: detect execution differences between x86 and ARM before they reach production (e.g., data races hidden by the x86 TSO memory model that surface under ARM's weaker memory ordering), preventing nondeterministic behavior on recompilation.
- CHAMP (Continuous Health Monitoring Platform): an automated framework to deploy and monitor multiarch jobs. If a job crashes on ARM or reduces throughput, it is automatically evacuated for offline tuning, without risking cluster health.
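The CHAMP decision described above could work along these lines (a minimal sketch; the thresholds, field names, and policy are assumptions, as the real platform is not public):

```python
# Minimal sketch of a CHAMP-like decision: evacuate an ARM task back
# to x86 if it crashes or its throughput regresses past a threshold.
# All thresholds and field names here are hypothetical.

from dataclasses import dataclass

@dataclass
class JobHealth:
    crashes: int             # crashes observed in the monitoring window
    qps_arm: float           # throughput measured on ARM
    qps_x86_baseline: float  # reference throughput on x86

def decide(h: JobHealth, max_regression: float = 0.10) -> str:
    """Return 'evacuate' to send the job back to x86 for offline
    tuning, or 'keep' to leave it serving on ARM."""
    if h.crashes > 0:
        return "evacuate"
    if h.qps_arm < (1.0 - max_regression) * h.qps_x86_baseline:
        return "evacuate"
    return "keep"

# A job within 10% of its x86 baseline stays; a crashing one leaves.
print(decide(JobHealth(crashes=0, qps_arm=980.0, qps_x86_baseline=1000.0)))  # keep
print(decide(JobHealth(crashes=1, qps_arm=990.0, qps_x86_baseline=1000.0)))  # evacuate
```

The point of the design is that the cluster never pays for a struggling ARM deployment: misbehaving jobs are pulled aside automatically and debugged offline.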
With these tools, the team began to industrialize the migration.
38,156 commits and three phases: tooling, code, configuration
To understand what kinds of changes the migration required at scale, Google analyzed 38,156 commits in its Google3 monorepo, totaling over 700,000 changed lines. Using Gemini Flash and a 1-million-token context window, it classified the changes into 16 categories grouped into four main blocks and traced their evolution over time:
- Tooling and test adaptation: during the rollout of the multiarch toolchain, most commits involved tooling and test tweaks.
- Code adaptation: in migrating initial large applications, code changes (dependency fixes, platform-specific ifdefs, data representation, etc.) increased.
- Configuration and processes: in the final phase, almost all changes related to config files and support processes; the commit count surged, reflecting scaling across the repository.
A noteworthy detail: most commits related to migration were small. The larger ones mainly involved big lists or configurations, not intricate changes in single files.
AI’s role: CogniPort, an agent to “fix” builds and tests
To tackle what remains, the long tail of apps that still fail to build or pass their tests on ARM, Google built CogniPort, a generative AI agent aimed at automating the rest of the migration.
How it works: CogniPort activates when a build or test fails on ARM. It operates as nested agent loops: an orchestrator invokes build-fixer and test-fixer agents that modify code, build scripts, or configs, retrying iteratively until the target builds and its tests pass. Early tests on historical commits showed it fixed 30% of test failures without additional prompting.
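The nested-loop structure described above might look roughly like this (a sketch with stubbed fixers and a made-up target label; CogniPort's internals are not public):

```python
# Sketch of nested fix loops: an orchestrator retries build-fixer and
# test-fixer agents until the target builds and its tests pass, or an
# attempt budget runs out. All function names here are hypothetical stubs.

def fix_target(target, build, test, build_fixer, test_fixer, max_attempts=5):
    """Return True once `target` builds and passes its tests on ARM."""
    for _ in range(max_attempts):
        if not build(target):
            build_fixer(target)   # agent edits code or build scripts
            continue              # re-check the build before testing
        if not test(target):
            test_fixer(target)    # agent edits code or configs
            continue              # re-run build and tests after the fix
        return True               # build and tests are green
    return False

# Toy harness: the build breaks once, then a test breaks once; each
# "agent" simply flips its flag to simulate a successful fix.
state = {"build_ok": False, "test_ok": False}
result = fix_target(
    "//app:server_arm",  # hypothetical build-target label
    build=lambda t: state["build_ok"],
    test=lambda t: state["test_ok"],
    build_fixer=lambda t: state.update(build_ok=True),
    test_fixer=lambda t: state.update(test_ok=True),
)
```

In this toy run the orchestrator needs three iterations: one to repair the build, one to repair the test, and one to confirm everything is green.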
Axion + multiarch: implications for Google… and the industry
- Scalability and elasticity: if the scheduler can place the same load on x86 or ARM depending on availability and cost, it raises average cluster utilization and reduces effective cost per service.
- Sustainability: with up to 60% improvement in energy efficiency over comparable instances, the overall impact on consumption and emissions is significant.
- Reduced ISA vendor dependency: designing applications multiarch by default lowers technological risk and broadens hardware options (Axion today, other Arm designs in the future) without overhauling the software.
- Cultural shifts: migration has revealed that the real friction was in tests, builds, and pipelines. This often “invisible” layer should be designed for portability.
Implications for Google Cloud clients
Although the post mainly discusses internal infrastructure, some lessons are clearly relevant for clients:
- More instance options offering better price-performance and efficiency (Axion), with software compatibility if applications are multiarch-ready.
- Improved TCO for elastic workloads, microservices, and data plane, leveraging the scheduler’s flexibility across ISAs.
- A guideline on tackling realistic migrations in large organizations: start with tooling/tests, automate builds and rollouts, and use AI as an enabler, not a replacement for engineering.
What’s next: “multiarch by default”
Google plans to deploy this automation against the tens of thousands of pending apps and has committed to making new applications "multiarch by default". The structural components (the monorepo, the multiarch tooling, and automations like Rosie and CHAMP) are in place. With CogniPort taking on the last mile, the stated goal is to bring production services closer to architecture neutrality.
The overarching message for the industry is clear: ISA migration is no longer science fiction; it is an industrial operation. The hard part, at least in this experience, is not "vector translation" or "assembler adjustments" but taming the surrounding layer: tests, builds, configs, pipelines. And there, generative AI can be the tireless worker that fills in the gaps.
Frequently Asked Questions
What are Google Axion processors, and what advantages do they offer over comparable instances?
Axion are Google-designed Arm CPUs for data centers. Google claims they deliver up to 65% better price-performance and up to 60% higher energy efficiency compared to similar instances, while maintaining software compatibility when applications are multiarch.
How does CogniPort, the AI agent for ISA portability, work?
CogniPort activates when a build or test fails on ARM. It uses nested agent loops (an orchestrator plus build-fixer and test-fixer agents) that modify code, build scripts, or configs iteratively until the build succeeds and the tests pass. Early tests on historical commits showed it fixed 30% of test failures without extra prompting.
How many applications has Google migrated, and which already run in multiarch production?
Google reports migrating over 30,000 applications to ARM, with services such as YouTube, Gmail, and BigQuery serving traffic on x86 and ARM simultaneously. ARM hardware is fully utilized and more capacity is deployed month by month.
Where were the main issues during the multiarch migration?
Contrary to expectations, the biggest challenges weren’t due to fundamental ISA differences but to overly x86-tuned tests, old build systems, and configurations that didn’t account for architecture variants. Automation —Rosie, sanitizers/fuzzers, CHAMP— plus AI, have been key to industrializing the process.
Sources: Technical input from Google Cloud (“Using AI and automation to migrate between instruction sets”) and the preprint “Instruction Set Migration at Warehouse Scale”; coverage on El Chapuzas Informático about Axion/Arm migration and AI’s role in the process.
via: cloud.google

