In a move that could redefine AI infrastructure, Cerebras Systems showcased its record-breaking Wafer Scale Engine (WSE) chip at Web Summit Vancouver, claiming it now holds the title of the world’s fastest AI inference engine.
Roughly the size of a dinner plate, the latest WSE chip spans 8.5 inches (22 cm) per side and packs an astonishing 4 trillion transistors — a monumental leap from traditional processors like Intel’s Core i9 (33.5 billion transistors) or Apple’s M2 Max (67 billion).
The result? A groundbreaking 2,500 tokens per second on Meta’s Llama 4 model, about 2.5 times Nvidia’s recently announced benchmark of 1,000 tokens per second.
“Inference is where speed matters the most,” said Naor Penso, Chief Information Security Officer at Cerebras. “Last week Nvidia hit 1,000 tokens per second — which is impressive — but today, we’ve surpassed that with 2,500 tokens per second.”
Inference is the stage at which a trained AI model processes input to generate outputs like text, images, or decisions. Tokens, which can be whole words, fragments of words, or single characters, are the basic units a model uses to read and produce text. As AI agents take on more complex, multi-step tasks, inference speed becomes increasingly essential.
“Agents need to break large tasks into dozens of sub-tasks and communicate between them quickly,” Penso explained. “Slow inference disrupts that entire flow.”
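To put rough numbers on that point, here is a back-of-the-envelope sketch in Python. The sub-task count and tokens per sub-task are hypothetical figures chosen for illustration, not numbers from Cerebras. The calculation assumes sub-tasks run one after another and ignores network, prompt-processing, and tool-call time.

```python
def generation_seconds(sub_tasks: int, tokens_per_task: int, tokens_per_second: float) -> float:
    """Total time spent generating tokens when sub-tasks run sequentially."""
    return sub_tasks * tokens_per_task / tokens_per_second


SUB_TASKS = 30         # hypothetical stand-in for "dozens" of sub-tasks
TOKENS_PER_TASK = 500  # hypothetical output length per sub-task

# Compare the two headline rates quoted in the article.
for label, rate in [("1,000 tokens/sec", 1000), ("2,500 tokens/sec", 2500)]:
    secs = generation_seconds(SUB_TASKS, TOKENS_PER_TASK, rate)
    print(f"{label}: {secs:.1f} s of generation time")
```

Under these assumed figures, the agent would spend 15 seconds generating text at 1,000 tokens per second and 6 seconds at 2,500. Real workloads will differ.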
What sets Cerebras apart isn’t just transistor count but the chip’s design. Unlike Nvidia GPUs, which rely on off-chip memory, the WSE integrates 44 GB of high-speed RAM directly on-chip, which shortens data access and reduces latency.
Independent benchmarks back Cerebras’ claims.
Artificial Analysis, a third-party testing agency, confirmed the WSE achieved 2,522 tokens per second on Llama 4, outperforming Nvidia’s new Blackwell GPU (1,038 tokens/sec). “Cerebras is the only inference solution that currently outpaces Blackwell for Meta’s flagship model,” said Artificial Analysis CEO Micah Hill-Smith.
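The measured figures can be turned into a speedup ratio with a few lines of Python. This is simple arithmetic on the two numbers quoted above, not part of the Artificial Analysis methodology. The per-token times are averages that assume a steady output rate.

```python
CEREBRAS_TPS = 2522   # tokens/sec reported for the WSE on Llama 4
BLACKWELL_TPS = 1038  # tokens/sec reported for Nvidia's Blackwell GPU

# Ratio of the two throughput figures.
speedup = CEREBRAS_TPS / BLACKWELL_TPS

# Average milliseconds per token, assuming a steady output rate.
ms_per_token_cerebras = 1000 / CEREBRAS_TPS
ms_per_token_blackwell = 1000 / BLACKWELL_TPS

print(f"Speedup: {speedup:.2f}x")
print(f"Cerebras:  {ms_per_token_cerebras:.2f} ms per token")
print(f"Blackwell: {ms_per_token_blackwell:.2f} ms per token")
```

On these numbers the ratio works out to roughly 2.43, slightly below the rounded 2.5 implied by the headline figures of 2,500 and 1,000.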
While CPUs and GPUs have driven AI advancements for decades, Cerebras’ WSE represents a shift toward a new compute paradigm. “This isn’t x86 or ARM. It’s a new architecture designed to supercharge AI workloads,” said Julie Shin, Chief Marketing Officer at Cerebras.