A new system, WaferLLM, shows these oversized chips can slash latency, improve efficiency and deliver up to 100× faster inference, signaling a major shift in how next-generation AI systems will be built.

A new class of dinner-plate-sized chips is emerging as a key contender for powering the next leap in artificial intelligence, as traditional GPU-based systems strain under the demands of ever-larger models. Wafer-scale processors, silicon slabs roughly five times the size of conventional GPUs, promise radically lower latency and higher efficiency by keeping massive neural networks on a single piece of silicon rather than splitting them across hundreds or thousands of interconnected chips.
The push comes as AI models balloon in size and complexity. Modern large language models require vast memory and ultra-fast data movement between processing units. That’s a growing problem for GPUs, originally built for gaming workloads. Their limited onboard memory forces AI models to be divided across large clusters of GPUs, slowing inference and increasing energy use. OpenAI’s GPT-5, for instance, reportedly required around 200,000 GPUs, 20× more than GPT-3.
Companies like Cerebras have been betting on wafer-scale chips as an alternative. These processors pack hundreds of thousands of compute cores and substantial on-chip memory, removing the need for high-speed interconnects between separate chips. But hardware alone doesn’t solve the deeper bottleneck: communication delays within the wafer itself.
A research collaboration between the University of Edinburgh and Microsoft Research has now tackled that software challenge with a system called WaferLLM, designed to run large models efficiently on wafer-scale hardware. The team reorganised LLM computation so each core works primarily on locally stored data, avoiding long cross-wafer memory hops that can be a thousand times slower. New algorithms break large matrix operations into smaller tasks processed by neighbouring cores, while coordinated scheduling keeps computation, communication and data preparation overlapping.
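To get a feel for the idea of breaking a large matrix operation into tiles that only ever travel between neighbouring cores, the sketch below simulates Cannon's algorithm, a classic tiled matrix-multiply scheme for a 2D mesh. This is an illustration of the general technique, not WaferLLM's actual code: the grid size, tile shapes and function names here are assumptions made for the example.

```python
import numpy as np

def cannon_matmul(A, B, P):
    """Multiply A @ B by tiling both onto a simulated P x P mesh of 'cores'.

    Each (i, j) core holds one tile of A and one tile of B in its local
    memory, multiplies them, and then passes tiles only to adjacent cores,
    mimicking neighbour-to-neighbour movement on a wafer-scale mesh.
    NOTE: illustrative sketch only, not the WaferLLM implementation.
    """
    n = A.shape[0]
    assert n % P == 0, "matrix size must divide evenly into the core grid"
    t = n // P  # tile edge length

    # Split the global matrices into P x P grids of t x t tiles.
    A_tiles = [[A[i*t:(i+1)*t, j*t:(j+1)*t].copy() for j in range(P)] for i in range(P)]
    B_tiles = [[B[i*t:(i+1)*t, j*t:(j+1)*t].copy() for j in range(P)] for i in range(P)]
    C_tiles = [[np.zeros((t, t)) for _ in range(P)] for _ in range(P)]

    # Initial skew: row i of A shifts left by i steps, column j of B shifts up by j.
    A_tiles = [[A_tiles[i][(j + i) % P] for j in range(P)] for i in range(P)]
    B_tiles = [[B_tiles[(i + j) % P][j] for j in range(P)] for i in range(P)]

    for _ in range(P):
        # Every core multiplies the tiles currently sitting in its local memory.
        for i in range(P):
            for j in range(P):
                C_tiles[i][j] += A_tiles[i][j] @ B_tiles[i][j]
        # Each core then passes its A tile one hop left and its B tile one hop up,
        # so all traffic stays between adjacent cores on the mesh.
        A_tiles = [[A_tiles[i][(j + 1) % P] for j in range(P)] for i in range(P)]
        B_tiles = [[B_tiles[(i + 1) % P][j] for j in range(P)] for i in range(P)]

    return np.block(C_tiles)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    B = rng.standard_normal((8, 8))
    assert np.allclose(cannon_matmul(A, B, P=4), A @ B)
```

On real wafer-scale hardware, the multiply and the neighbour-to-neighbour shift in each round would be overlapped rather than run one after the other, which is the kind of coordinated scheduling the researchers describe; the simulation above serialises them only for clarity.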
Tests on major models including Llama and Qwen at Europe’s largest wafer-scale AI facility showed dramatic gains: up to 100× faster text generation on wafer hardware, 10× lower latency than a 16-GPU cluster, and 2× better energy efficiency.

The approach suggests that software optimisation, not exotic new chips, may unlock the next tier of AI performance. Wafer-scale processors aren’t positioned to replace GPUs but to complement them in ultra-low-latency or energy-sensitive workloads such as financial modelling, drug discovery and scientific analysis. As AI systems grow, co-designing hardware and software is becoming essential, and wafer-scale computing signals a shift toward architectures purpose-built for the demands of real-time, model-heavy AI.
