Global Outage in Google Cloud: Quota System Error Causes Hours of Worldwide Disruptions


On June 12, Google Cloud suffered one of its largest global outages in recent years, disrupting critical services for businesses and users worldwide. The incident began at 7:51 PM (Spanish time) and lasted at least three and a half hours, affecting dozens of products across Google Cloud Platform (GCP) and Google Workspace, from infrastructure services to email, storage, and data analytics.

What Happened?

According to the official information released by Google, the main cause was an incorrect automatic quota update in the API management system. The update was distributed globally and caused external requests to be rejected en masse. With quota checks failing, legitimate requests were blocked and 503 errors cascaded through services such as Compute Engine, Cloud Storage, BigQuery, App Engine, Cloud SQL, Cloud Run, Vertex AI, Cloud Pub/Sub, Cloud DNS, Gmail, Google Drive, and Google Calendar, among others.

Although Google quickly detected the error and applied a temporary mitigation by disabling the faulty quota checks, recovery was uneven. In the us-central1 region (Iowa), where many resources are concentrated, restoration took longer because the quota policy database was overloaded.
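An overloaded backend during recovery is a common pattern: many clients retry at the same moment and keep the database saturated. A widely used client-side countermeasure is exponential backoff with random jitter. The sketch below is illustrative only and does not describe what Google did; the function names, the default delays, and the choice to retry only on 503 are assumptions made for the example.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delays(base: float = 0.5, cap: float = 30.0, retries: int = 5,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one randomized delay per retry ("full jitter").

    The upper bound doubles on each retry and is capped at `cap` seconds;
    the actual delay is drawn uniformly between 0 and that bound.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


def call_with_backoff(request: Callable[[], int], retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> int:
    """Call `request` (hypothetical: returns an HTTP status code).

    Retries only while the status is 503, waiting a jittered delay
    before each retry, and returns the last status seen.
    """
    status = request()
    for delay in backoff_delays(retries=retries, rng=rng):
        if status != 503:
            break
        sleep(delay)
        status = request()
    return status
```

Because each client picks a different random delay, retries spread out over time instead of arriving in synchronized waves.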

Impact on Businesses and Users

For several hours, thousands of organizations in Europe, Asia, and the Americas experienced intermittent failures when accessing dashboards, APIs, automatic backups, application execution, automations, and AI services, as well as office services like Gmail and Drive. Resources that were already running were not halted, but the inability to reach management consoles, check logs, monitor incidents, or scale resources created uncertainty and continuity problems for IT teams.

The impact was particularly severe on managed data services such as Cloud Bigtable, BigQuery, Spanner, Firestore, Cloud SQL, and Cloud Storage, where read and write interruptions were recorded, as well as on key AI and analytics products like Vertex AI and Looker Studio.

Google states that the incident should not have occurred and has announced immediate measures:

  • Strengthen the API management platform to prevent failures due to corrupt or invalid data.
  • Improve validation, testing, and monitoring before the global propagation of changes to metadata.
  • Reinforce error handling systems and testing against invalid data scenarios.
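As a rough illustration of the first two measures, the sketch below validates quota records before they are distributed and "fails open" when a record turns out to be invalid at request time. The record shape, field names, and the fail-open choice are assumptions made for this example and do not reflect Google's actual schema or implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuotaPolicy:
    # Hypothetical record shape, invented for this example.
    project_id: Optional[str]
    limit_per_minute: Optional[int]


def validate_policy(policy: QuotaPolicy) -> List[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    if not policy.project_id:
        problems.append("project_id is missing or empty")
    if policy.limit_per_minute is None:
        problems.append("limit_per_minute is missing")
    elif policy.limit_per_minute < 0:
        problems.append("limit_per_minute is negative")
    return problems


def safe_to_propagate(policies: List[QuotaPolicy]) -> bool:
    """Rollout gate: refuse to distribute a batch with any invalid record."""
    return all(not validate_policy(p) for p in policies)


def allow_request(policy: QuotaPolicy, used_this_minute: int) -> bool:
    """Fail open: an invalid policy admits the request instead of rejecting it."""
    if validate_policy(policy):
        return True
    return used_this_minute < policy.limit_per_minute
```

Failing open trades strict quota enforcement for availability. Whether that trade is acceptable depends on the service, so treat it here as one design option rather than a recommendation.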

How Did It Affect Spain and Europe?

The affected data centers included those in Madrid, Finland, Paris, Berlin, London, Milan, Frankfurt, Brussels, and Warsaw, as well as the European multi-region locations. The incident extended across the entire Cloud and Workspace infrastructure, affecting businesses large and small, governments, startups, and public administrations that depend on Google for their daily operations.

Recovery and Current Status

By 10:49 PM (Spanish time), Google confirmed that most services had been restored, except for certain residual operations in heavily affected regions (like us-central1) and AI services like Vertex AI Online Prediction, which returned to normal a few hours later. The company acknowledged the severity of the incident and committed to publishing a detailed technical report with the root cause analysis and improvement actions.

Reflection: What Can We Learn?

This incident is a reminder that while cloud services offer high availability, automation, and scalability, no provider is immune to catastrophic failures in its control plane. Companies should:

  • Implement multicloud strategies and independent backups.
  • Document contingency plans and responses for external provider outages.
  • Monitor critical services from external platforms.
  • Periodically evaluate SLAs and recovery capabilities against systemic failures.
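Monitoring from outside the provider can be as simple as a probe that runs on independent infrastructure and raises an alert after several consecutive failures. The following is a minimal sketch using only the Python standard library; the function names, the three-failure threshold, and the rule that any status from 200 to 399 counts as healthy are assumptions chosen for illustration.

```python
import urllib.error
import urllib.request
from typing import Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar.
        return 0


def is_healthy(status: int) -> bool:
    """Treat 2xx and 3xx as healthy; 0 means the probe got no response."""
    return 200 <= status < 400


def should_alert(statuses: Iterable[int], threshold: int = 3) -> bool:
    """True once `threshold` consecutive probes have failed."""
    streak = 0
    for status in statuses:
        streak = 0 if is_healthy(status) else streak + 1
        if streak >= threshold:
            return True
    return False
```

In practice a scheduler outside the monitored cloud would call http_status against each critical endpoint at a fixed interval and pass the recent results to should_alert. Requiring several consecutive failures avoids paging on a single transient error.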

Google, for its part, faces pressure to regain the trust of thousands of affected businesses. The ecosystem awaits details about the design flaw and about the measures being implemented to prevent a simple quota failure from triggering another global outage.

Source: Google Status
