
New ASIC Chip Embeds AI Models Directly Into Hardware 

New inference hardware claims up to 10x faster AI response times with drastically lower power and cost by embedding models directly into custom silicon rather than relying on GPUs.

A new AI accelerator architecture from Taalas challenges the conventional GPU-driven approach to artificial intelligence inference by converting entire AI models into dedicated silicon. The chip architecture hardwires model weights and parameters directly into ASICs, enabling ultra-low latency and significantly lower energy consumption for large language model (LLM) inference workloads. 

Unlike GPUs and even many AI accelerators that remain programmable, the new approach creates model-specific chips optimized for a single neural network. The result is dramatically faster inference, especially for high-volume workloads where the same AI model serves millions of queries repeatedly. According to disclosed benchmarks, the first-generation chip running a compressed Llama 3.1 8B model reportedly achieves 14,000–17,000 tokens per second, far beyond most currently deployed inference hardware. 
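To put the claimed throughput in perspective, a quick back-of-the-envelope calculation helps. The token rates below are the article's figures; the per-token latency and the 500-token reply example are derived here purely for illustration:

```python
# Back-of-the-envelope: what 14,000-17,000 tokens/s means per token.
# Token rates are the article's claimed benchmarks; everything else is derived.
for tokens_per_sec in (14_000, 17_000):
    ms_per_token = 1000 / tokens_per_sec
    print(f"{tokens_per_sec:,} tok/s -> {ms_per_token:.3f} ms per token")

# Illustrative: generating a 500-token response at the lower claimed rate.
reply_ms = 500 / 14_000 * 1000
print(f"500-token reply: ~{reply_ms:.0f} ms")
```

At the claimed rates, each token takes well under a tenth of a millisecond, which is what underpins the "ultra-low latency" framing.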

The key features are:

  • Hardwires AI models directly into ASIC silicon
  • Delivers 14,000–17,000 tokens per second
  • Eliminates dependence on external HBM memory
  • Operates at significantly lower rack-level power
  • Supports chip revisions in roughly two months

The architecture removes dependence on high-bandwidth memory (HBM), one of the costliest and most supply-constrained elements in modern AI systems. Instead, model parameters are etched directly into silicon, reducing memory bottlenecks and enabling lower rack-level power consumption. The company claims AI racks based on the architecture consume roughly 12–15kW, compared to 120–600kW for GPU-based AI racks. 
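Taken at face value, the claimed figures imply roughly an 8x to 50x rack-power reduction. A minimal sketch of that arithmetic (the kilowatt ranges are the article's claims; the ratios are computed here):

```python
# Rack-level power comparison using the article's claimed ranges.
asic_rack_kw = (12, 15)     # claimed for the model-specific ASIC racks
gpu_rack_kw = (120, 600)    # cited range for GPU-based AI racks

# Best case: lowest ASIC draw vs. highest GPU draw; worst case: the reverse.
best_case = gpu_rack_kw[1] / asic_rack_kw[0]    # 600 / 12
worst_case = gpu_rack_kw[0] / asic_rack_kw[1]   # 120 / 15
print(f"claimed power reduction: ~{worst_case:.0f}x to ~{best_case:.0f}x")
```

The wide spread reflects the wide range cited for GPU racks, not measured data; these are vendor claims.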

The system also avoids liquid cooling and advanced packaging requirements, potentially allowing deployment in standard server infrastructure. Analysts suggest this could significantly reduce infrastructure costs for cloud providers and enterprise AI deployments. 

However, the tradeoff is flexibility. Since each chip is tied to a specific AI model, updates require new silicon revisions. To address this, the startup says only two metal layers need modification during redesign, allowing updated chips to be produced within roughly two months instead of years. 

The company plans to scale the architecture beyond smaller LLMs toward larger reasoning models and multi-chip AI systems later this year. Industry observers see the development as part of a broader trend toward highly specialized AI hardware, similar to how TPUs transformed cloud AI acceleration.

Akanksha Gaur
Akanksha Sondhi Gaur is a journalist at EFY. She holds a German patent and brings seven years of combined industrial and academic experience. Passionate about electronics, she has written numerous research papers showcasing her expertise and insight.
