OpenAI Accelerates Codex with Cerebras: 1,000 Tokens/Second and a Real “Plan B” to NVIDIA

OpenAI has taken a step that, beyond the headlines, could have significant implications for the inference market: its new gpt-5.3-codex-spark (a variant oriented toward “live work,” with ultra-fast responses) is hosted on Cerebras infrastructure. The message is twofold: on one hand, OpenAI emphasizes latency as the new obsession of coding AI; on the other, it hints that there is already a second path to run large-scale models without relying solely on the NVIDIA ecosystem.

The news comes at a time when programming assistants compete less on how well they guess the right answer and more on how immediate they feel: time to first token (TTFT), streaming fluidity, and the ability to sustain a technical dialogue without pauses have become the product's real differentiators. And here, OpenAI believes it has found a tangible advantage with Cerebras.


What exactly is Codex-Spark and why does it matter?

According to OpenAI, Codex-Spark is designed for programming tasks with a user experience closer to that of a pair programmer: faster responses, more continuity in streaming, and fewer “micro-pauses” as the model generates code and explanations. The company claims that this variant:

  • Cuts time to first token (TTFT) by roughly 50%.
  • Reaches up to ~1,000 output tokens per second in favorable scenarios, which matters for quick edits and test-and-fix loops (see the measurement sketch below).
  • Keeps a long context window (OpenAI positions it for intensive programming and tool-heavy sessions).

In other words: it’s not just “another model,” but a commitment to extreme interactivity. This aligns with the market shift toward agentic flows (tools, function calls, automated testing, navigation, etc.), where latency impacts productivity more than a small fraction of extra precision.
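
To make those two headline numbers concrete, here is a minimal measurement sketch using the OpenAI Python SDK's streaming interface. It is illustrative only: the model identifier is simply the one cited above (its exact API name is an assumption), and counting streamed chunks as tokens is a rough approximation, so treat the output as an estimate rather than a benchmark.

```python
# Minimal TTFT / tokens-per-second probe for a streamed chat completion.
# Assumptions: the openai>=1.x Python SDK, an OPENAI_API_KEY in the
# environment, and the model name used below (taken from the article).
import time
from openai import OpenAI

client = OpenAI()

def measure_stream(model: str, prompt: str) -> tuple[float, float]:
    """Return (ttft_seconds, output_tokens_per_second) for one streamed reply."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if not delta:
            continue
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # each streamed delta is roughly one token

    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    gen_time = end - first_token_at if first_token_at else 0.0
    tps = chunks / gen_time if gen_time > 0 else 0.0
    return ttft, tps

ttft, tps = measure_stream("gpt-5.3-codex-spark", "Write a binary search in Python.")
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {tps:.0f} tokens/s")
```

Run against any streaming-capable model, the interesting result is not the absolute numbers but the gap between a standard model and a latency-optimized one measured the same way.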


The key piece: what does Cerebras bring to inference?

Cerebras has argued for years that its wafer-scale approach (a chip the size of an entire wafer) is not just an experimental lab curiosity, but an architecture with practical advantages when the bottleneck is memory and data movement, not just FLOPS.

For the WSE-3 (Wafer-Scale Engine 3), the publicly cited numbers are impressive on the on-chip side:

  • Transistors: ~4 trillion
  • Cores: ~900,000
  • On-chip memory: ~44 GB
  • Memory bandwidth: ~21 PB/s (per published specifications)

This design aims to achieve a clear goal: minimize internal bottlenecks and sustain a very high token cadence with low latency. In models oriented toward programming — involving repetitive patterns, iterative editing, and the need for instant responses — such an advantage can translate into a more “human-like” experience: less waiting and more continuity.
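
There is a standard back-of-envelope way to see why that bandwidth figure matters for interactive use. When autoregressive decoding is bound by weight movement rather than FLOPS, every generated token requires streaming the model's weights through the compute units once, so tokens per second is capped at roughly bandwidth divided by weight bytes. The sketch below applies that bound with purely illustrative assumptions (a hypothetical 20B-parameter model at 8-bit weights, an HBM-class ~3.35 TB/s figure for comparison); none of these are OpenAI or Cerebras deployment numbers.

```python
# Roofline-style ceiling on single-stream decode speed when the bottleneck is
# reading the weights, not compute. All figures below are illustrative.
def max_decode_tps(params_billions: float, bytes_per_param: float,
                   bandwidth_tb_per_s: float) -> float:
    weight_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth_bytes = bandwidth_tb_per_s * 1e12
    return bandwidth_bytes / weight_bytes  # ignores KV cache, batching, overlap

# Hypothetical 20B-parameter model at 8-bit weights (~20 GB, small enough to
# fit in ~44 GB of on-chip memory):
print(f"{max_decode_tps(20, 1.0, 3.35):,.0f} tok/s ceiling at ~3.35 TB/s (HBM-class)")
print(f"{max_decode_tps(20, 1.0, 21_000):,.0f} tok/s ceiling at ~21 PB/s (on-chip SRAM)")
```

Real deployments land far below these ceilings because of attention, KV-cache traffic, scheduling, and network overheads, but the ratio between the two lines is the point: when weights sit in fast on-chip memory, the theoretical per-stream ceiling moves by orders of magnitude, which is exactly the kind of headroom a 1,000 tokens/second target needs.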


Does this mean NVIDIA is losing its throne? Not so fast

It’s tempting to frame this as a “breakthrough,” but it’s more likely — for now — to be a pragmatic move:

  • NVIDIA continues to dominate the stack (software, ecosystem, availability, OEM integration, etc.) and the economics of large-scale inference, especially in batches and general-purpose deployments.
  • What OpenAI seems to be indicating is something else: for certain products (like an ultra-fast code Copilot), the main concern isn’t just token cost but the response time and the sense of immediacy.

In simple terms: the industry is discovering that inference isn’t a single market. There’s “cheap” inference (high throughput in batches) and “instant” inference (low latency, constant interaction). And different architectures often excel in each.


Why this could change the game in product (more than in benchmarks)

In programming, every second counts, and not in the abstract: an assistant that responds almost instantly enables:

  1. Shorter iterations: propose → apply → test → correct.
  2. More useful agents: if the agent calls tools, searches, runs tests, and returns, accumulated latency determines whether the loop is worth continuing or stopping (see the sketch after this list).
  3. Reduced cognitive friction: when flow is interrupted, developers lose context, which diminishes potential productivity gains.
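
To put rough numbers on point 2, here is a toy calculation of how per-step latency compounds across an agentic loop; every figure in it is invented for illustration and does not come from OpenAI.

```python
# Toy model of accumulated wall-clock time in an agentic loop: each step pays
# TTFT, then generation time, then tool execution. All numbers are invented.
def loop_latency_seconds(steps: int, ttft_s: float, tokens_per_step: int,
                         tokens_per_second: float, tool_time_s: float) -> float:
    per_step = ttft_s + tokens_per_step / tokens_per_second + tool_time_s
    return steps * per_step

# A 10-step propose/apply/test loop with ~400 output tokens per step and 2 s
# of tool time per step:
print(loop_latency_seconds(10, 1.0, 400, 100, 2.0))   # 70.0 s at 100 tok/s
print(loop_latency_seconds(10, 0.5, 400, 1000, 2.0))  # 29.0 s with halved TTFT and 1,000 tok/s
```

The tool time is identical in both runs; the entire difference comes from TTFT and generation speed, which is why, in this toy setup, a latency-first variant cuts the loop's wall-clock time by more than half.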

If OpenAI manages to make Codex-Spark consistently more agile, it’s not just a technical improvement — it’s a psychological and operational one. And in product design, that’s often a decisive factor.


Strategic reading: diversification and bargaining power

OpenAI explicitly mentioning Cerebras as the infrastructure for a visible part of its catalog also suggests an industry-level reading:

  • Supply resilience: less dependence on a single provider in a market where stock and computing capacity remain a key competitive leverage.
  • Real alternatives: while NVIDIA continues to dominate, having a second viable platform in production improves bargaining position for any large buyer.
  • Workload segmentation: training, general chat serving, interactive coding — these may end up on distinct hardware “islands.”

It’s no coincidence that the public discourse around inference is shifting toward concepts like TTFT, streaming overhead, “latency-sensitive workloads,” and full-path optimization (network + runtime + hardware). The value isn’t just in the model anymore; it’s in how it’s deployed.


What to watch from now on

If this trend consolidates, three clear signals to follow in 2026 are:

  • Adoption across more products: whether Codex-Spark is an isolated case or the beginning of a broader shift.
  • Response from the GPU ecosystem: targeted improvements in token-to-token latency and TTFT for interactive scenarios.
  • Emergence of more “alternative hardware” in inference: ASICs, non-NVIDIA GPUs, specialized architectures aiming to prioritize user experience over raw throughput.

via: wccftech
