By integrating compute, memory, and acceleration, Maia 200 is designed to run large language models efficiently, supporting the next generation of AI applications at scale.

As AI chatbots and assistants scale to more users, the cost of inference, the process of generating responses from a trained model, has become a major challenge. Speed, efficiency, and stability are now as important as raw training power, especially for cloud providers managing continuous, high-volume workloads. This has increased demand for AI chips optimised specifically for inference at scale.
Microsoft has introduced the Maia 200, its second-generation in-house AI processor, designed for large-scale inference. Maia 200 builds on the 2023 Maia 100, offering a significant boost in performance while supporting current and future AI models.
The Maia 200 integrates over 100 billion transistors and delivers more than 10 petaflops of 4-bit (FP4) compute and roughly 5 petaflops at 8-bit (FP8) precision. Large amounts of fast on-chip SRAM reduce latency for repeated queries, keeping responses quick even under high user traffic. The chip is optimised for real-world AI workloads rather than training benchmarks, allowing Microsoft to run large models efficiently and continuously.
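Those two figures are related: halving numeric precision roughly doubles the number of values the same silicon can store, move, and multiply per cycle, which is why the FP4 rating is about twice the FP8 one. As a rough illustration only (this is generic block-wise 4-bit quantisation in plain NumPy, not Microsoft's number format, toolchain, or kernels), the sketch below shows how weights can be mapped to 4-bit codes plus per-block scales:

```python
import numpy as np

def quantise_int4(weights: np.ndarray, block_size: int = 32):
    """Map FP32 weights to signed 4-bit codes with one FP32 scale per block.

    Generic illustration only; real low-precision formats and kernels differ.
    """
    flat = weights.astype(np.float32).ravel()
    flat = np.pad(flat, (0, (-flat.size) % block_size))  # pad to whole blocks
    blocks = flat.reshape(-1, block_size)

    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # map into [-8, 7]
    scales[scales == 0] = 1.0                                 # avoid divide-by-zero
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales  # codes held in int8 here; real kernels pack two per byte

def dequantise_int4(codes: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Recover approximate FP32 weights from the 4-bit codes."""
    return (codes.astype(np.float32) * scales).ravel()[: np.prod(shape)].reshape(shape)

w = np.random.randn(512, 512).astype(np.float32)
codes, scales = quantise_int4(w)
w_hat = dequantise_int4(codes, scales, w.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Stored this way, a model's weights occupy roughly an eighth of their FP32 footprint (plus the per-block scales), which is the memory and bandwidth saving that low-precision inference hardware is built to exploit.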
Manufactured on TSMC’s 3 nanometre process, Maia 200 also features high-bandwidth memory. Microsoft claims the chip delivers three times the FP4 performance of Amazon’s Trainium 3 and stronger FP8 performance than Google’s latest TPU. Maia 200 is part of the company’s broader strategy to reduce reliance on NVIDIA GPUs while maintaining competitive performance and cost control.
Key features of the chip include:
- Up to 180 TOPS of combined AI compute via CPU, GPU, and NPU
- Hybrid CPU architecture with high-performance, efficiency, and low-power cores
- Large on-chip SRAM for low-latency inference
- Industrial-grade design for continuous operation at scale
- 3 nanometre process technology with high-bandwidth memory
The chip is also paired with open-source AI software tools for efficient development, and the platform is designed to meet the growing need for fast, reliable, and cost-effective AI inference.
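Much of the cost saving in inference comes from how software schedules requests onto the hardware, not just from the silicon itself. The sketch below is a generic illustration of the micro-batching pattern that open serving stacks commonly use, pooling requests briefly so each accelerator call is shared across several users; the `run_model` stand-in and the batching parameters are hypothetical, not part of Microsoft's published tooling.

```python
import queue
import threading
import time

def run_model(prompts):
    # Hypothetical stand-in: a real server would dispatch the whole batch
    # to the accelerator here and return one response per prompt.
    return [f"response to: {p}" for p in prompts]

requests = queue.Queue()

def serve(max_batch: int = 8, max_wait_s: float = 0.01) -> None:
    """Pool incoming requests into small batches so each model call is
    amortised over several users; the basic pattern behind cost-effective
    inference serving."""
    while True:
        batch = [requests.get()]                      # wait for the first request
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch and time.monotonic() < deadline:
            try:
                batch.append(requests.get(timeout=max(deadline - time.monotonic(), 0)))
            except queue.Empty:
                break
        prompts, reply_slots = zip(*batch)
        for slot, reply in zip(reply_slots, run_model(prompts)):
            slot.put(reply)

threading.Thread(target=serve, daemon=True).start()

# Submit a few requests, then collect the replies.
slots = []
for prompt in ["hello", "what's the weather?", "hello"]:
    slot = queue.Queue(maxsize=1)
    requests.put((prompt, slot))
    slots.append(slot)

for slot in slots:
    print(slot.get())
```

The trade-off this sketch makes is the usual one for large-scale serving: a few milliseconds of queuing latency in exchange for much better utilisation of the accelerator.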





