Intel, Weizmann to Speed up AI Further With New Method

Researchers from Intel Labs and the Weizmann Institute of Science introduced a major advance in speculative decoding. The new technique, presented at the International Conference on Machine Learning (ICML) in Vancouver, Canada, enables any small draft model to accelerate any large language model (LLM) regardless of vocabulary differences.

According to Oren Pereg, senior researcher in the Natural Language Processing Group at Intel Labs, the new method solves a core inefficiency in generative AI. “Our research shows how to turn speculative acceleration into a universal tool. This isn’t just a theoretical improvement; these are practical tools that are already helping developers build faster and smarter applications today.”

How Speculative Decoding Works

Speculative decoding is an inference optimization technique designed to make LLMs faster and more efficient without compromising accuracy. It works by pairing a small, fast model with a larger, more accurate one, creating a “team effort” between models.

For example, given the prompt “What is the capital of France?”, a traditional LLM computes the response “Paris, a famous city…” step by step, writing it one token at a time. With speculative decoding, the small assistant model quickly drafts the full phrase “Paris, a famous city…” and the large model then verifies the draft, dramatically reducing the compute cycles spent per output token.
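The draft-then-verify loop is easiest to see in code. The sketch below is a minimal greedy version of speculative decoding in general, not the Intel and Weizmann implementation; draft_next and target_next are hypothetical stand-ins for the small draft model and the large target model, each returning the next token for a given sequence.

```python
# Minimal sketch of greedy speculative decoding (illustrative only).
# `draft_next` and `target_next` are placeholders for a small draft model
# and a large target model; each maps a token sequence to its next token.

from typing import Callable, List

Token = str
NextToken = Callable[[List[Token]], Token]


def speculative_decode(prompt: List[Token],
                       draft_next: NextToken,
                       target_next: NextToken,
                       num_draft: int = 4,
                       max_new: int = 16) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The small model cheaply drafts a short continuation.
        draft: List[Token] = []
        for _ in range(num_draft):
            draft.append(draft_next(out + draft))

        # 2. The large model verifies the draft. In a real system all draft
        #    positions are checked in one batched forward pass, which is
        #    where the speedup comes from.
        accepted = 0
        for i in range(num_draft):
            if target_next(out + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        out += draft[:accepted]

        # 3. At the first mismatch (or after a fully accepted draft), the
        #    large model contributes one token of its own, so every round
        #    makes progress and the output matches plain greedy decoding.
        out.append(target_next(out))
    return out[:len(prompt) + max_new]
```

Because every emitted token is either confirmed or produced by the large model, the output is identical to running the large model alone; the draft model only changes how fast that output is reached.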

The Intel and Weizmann Institute method removes the requirement for shared vocabularies or co-trained model families, making speculative decoding practical across heterogeneous models. It delivers up to 2.8 times faster inference without loss of output quality.
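One way to picture how drafting can work across mismatched vocabularies, purely as an illustration of the idea and not necessarily the published algorithm, is to hand plain text rather than token IDs from the draft model to the target model, which then re-tokenizes the draft with its own tokenizer before verifying it. The helper names below are hypothetical.

```python
# Illustrative sketch: bridging two different tokenizers by exchanging text.
# `draft_generate_text` and `target_tokenize` are hypothetical stand-ins,
# not the algorithm published by Intel Labs and the Weizmann Institute.

from typing import Callable, List


def draft_in_target_tokens(prompt_text: str,
                           draft_generate_text: Callable[[str, int], str],
                           target_tokenize: Callable[[str], List[int]],
                           num_draft_words: int = 8) -> List[int]:
    # 1. The small model drafts a continuation as a plain string,
    #    using whatever vocabulary it was trained with.
    draft_text = draft_generate_text(prompt_text, num_draft_words)

    # 2. The target model's tokenizer converts that string into its own
    #    token IDs, so the two models never need to share a vocabulary;
    #    verification then proceeds on the target's tokens as usual.
    return target_tokenize(draft_text)
```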

The method also works across models from different developers and ecosystems, and it is open source ready.

Intel and Weizmann said the speculative decoding breakthrough promotes openness, interoperability, and cost-effective deployment from cloud to edge. For that reason, developers, enterprises, and researchers can now mix and match models to suit their performance needs and hardware constraints.

 “This work removes a major technical barrier to making generative AI faster and cheaper,” said Nadav Timor, Ph.D. student in the research group of Prof. David Harel at the Weizmann Institute. “Our algorithms unlock state-of-the-art speedups that were previously available only to organizations that train their own small draft models.”

19 July 2025