Yandex Experiences Historic Data Center Outage in Moscow Due to Power Failure

The outage, which affected Yandex Cloud and other critical services, was caused by the simultaneous failure of both power lines from the supplying substation and is considered the first of its magnitude in 15 years.

On March 30, Yandex’s main data center suffered an unprecedented outage that affected multiple of the company’s services, including its cloud platform Yandex Cloud. The incident was attributed to the simultaneous failure of both high-voltage power lines coming from a substation near Moscow, according to an official company statement and a detailed post on its technical blog on Habr.

This data center, inaugurated in the 2010s on former industrial land, was strategically sited near a powerful 220 kV substation that had not recorded a single failure since it entered service in 1960. Yandex had installed two independent 110 kV power lines, which in theory guaranteed enough redundancy to prevent interruptions. However, both lines failed at the same time, triggering what the company has described as an event “with a probability of occurrence once every 20 years.”

A power outage that tested all systems

The blackout, which began at 12:25 PM local time, forced the activation of emergency diesel generators and reliance on diesel rotary uninterruptible power supplies (DRUPS). Although critical elements such as the network infrastructure and monitoring services remained operational, Yandex Cloud’s ru-central1-b availability zone was completely unavailable for hours. Some services deployed across multiple zones also experienced availability issues.

Power from the substation was restored at 3:30 PM, but the full reactivation of infrastructure and services extended until midnight the following day. The complexity of the procedure, which required manual validation and direct supervision by engineers, lengthened the recovery time.

Lessons and future measures

Yandex has announced that the event has prompted a complete review of its energy resilience model, including the possibility of adding a third level of generator-based backup on top of the existing two. The company will also run more rigorous disaster recovery exercises, including simulations of dual failures, and further automate the cold-start procedures for its systems.

In parallel, Yandex will continue to strengthen the multi-zone resilience tools in Yandex Cloud. One highlight is “Zonal Shift,” a traffic diversion mechanism that has already proven effective: it lets customers with distributed architectures mitigate the impact of a zone failure by automatically redirecting load to the remaining available zones.
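To illustrate the general idea behind zone-aware traffic diversion, here is a minimal sketch of the concept; it is not Yandex Cloud’s actual Zonal Shift implementation or API, and the zone names, health data, and routing function are hypothetical.

# Minimal sketch of zone-aware traffic diversion. All names (zone labels,
# health data, routing function) are hypothetical and illustrate the concept
# only; this is not Yandex Cloud's Zonal Shift API.
import random

ZONES = ["zone-a", "zone-b", "zone-c"]

# In a real platform this map would be fed by health probes and monitoring.
zone_healthy = {"zone-a": True, "zone-b": False, "zone-c": True}

def route_request(preferred_zone: str) -> str:
    """Send traffic to the preferred zone, shifting it away if that zone is down."""
    if zone_healthy.get(preferred_zone, False):
        return preferred_zone
    # "Zonal shift": exclude the failed zone and spread load over healthy ones.
    healthy = [z for z in ZONES if zone_healthy.get(z, False)]
    if not healthy:
        raise RuntimeError("no healthy zones available")
    return random.choice(healthy)

# Example: zone-b is marked unhealthy, so its traffic lands in zone-a or zone-c.
print([route_request("zone-b") for _ in range(5)])

In practice, a workload only benefits from this kind of shift if it is already deployed across several zones, which is the point Yandex emphasizes in its report.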

A warning for the entire industry

The incident has served as a reminder for operators of critical infrastructure: even the most robust systems can fail if exceptional risks are not taken into account. “Multi-zone is no longer an option; it is a necessity for any mission-critical service,” Yandex warned in its report.

The company, known as “the Russian Google,” operates five data centers in the country, in Vladimir, Sasovo, Ivanteevka, Mytishchi, and Kaluga Oblast, the last of which recently opened with 63 MW of capacity. Since its structural split from its European operations, now under the name Nebius, Yandex has doubled down on its infrastructure within Russia.

Although the event was contained with no significant losses, it will serve as a case study for the entire tech industry, demonstrating the importance of planning for extreme scenarios, redundancy, and transparency in managing critical incidents.

Sources: Habr and DCD
