
Faster LLMs with speculative decoding and AWS Inferentia2

Machine Learning Blog



This article discusses speculative sampling (also called speculative decoding), a technique that can improve the compute efficiency and inference speed of large language models (LLMs) such as Llama-3-70B and Llama-2-70B when running on AWS Inferentia and Trainium chips.

Specifically, the article covers:

  • Introduction to sequential token generation in LLMs and its computational challenges
  • How speculative sampling works by using a smaller "draft" model to speculate tokens that are verified by a larger "target" model
  • A walkthrough of using speculative sampling with Llama-2-70B as the target model and Llama-2-7B as the draft model on Inferentia2 and Trainium instances
  • Considerations for loading models, memory requirements, and tensor parallelism
  • Conclusion on how speculative sampling can enable using larger, higher-quality LLMs while maintaining speed and cost-efficiency
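
The draft-and-verify loop described in the second bullet can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the article's code: `toy_draft` and `toy_target` are hypothetical stand-ins for the small and large models, the vocabulary has only four tokens, and the accept/reject rule is the standard one from the speculative sampling literature, assumed here rather than taken from the article.

```python
import random
from typing import Callable, List, Sequence

Dist = List[float]
Model = Callable[[Sequence[int]], Dist]  # context -> next-token probabilities

VOCAB = 4  # toy vocabulary size (assumption for illustration)


def toy_target(ctx: Sequence[int]) -> Dist:
    # Hypothetical "large" model: strongly prefers (last token + 1) mod VOCAB.
    nxt = (ctx[-1] + 1) % VOCAB
    return [0.7 if t == nxt else 0.1 for t in range(VOCAB)]


def toy_draft(ctx: Sequence[int]) -> Dist:
    # Hypothetical "small" model: same preference, but less confident.
    nxt = (ctx[-1] + 1) % VOCAB
    return [0.4 if t == nxt else 0.2 for t in range(VOCAB)]


def sample(dist: Dist, rng: random.Random) -> int:
    """Draw one token id from a list of (possibly unnormalized) weights."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(prefix: Sequence[int], draft: Model, target: Model,
                     k: int, rng: random.Random) -> List[int]:
    """Run one draft-then-verify round and return 1 to k+1 new tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed, draft_dists = [], []
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        proposed.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. The target model scores every drafted position. On an accelerator
    #    this would be a single batched forward pass; here it is a plain loop.
    target_dists = [target(list(prefix) + proposed[:i]) for i in range(k + 1)]

    # 3. Accept each drafted token with probability min(1, p/q).
    out: List[int] = []
    for i, tok in enumerate(proposed):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # 4. Every draft was accepted: the target contributes one bonus token.
    out.append(sample(target_dists[k], rng))
    return out


rng = random.Random(0)
tokens = [0]
while len(tokens) < 20:
    tokens.extend(speculative_step(tokens, toy_draft, toy_target, k=4, rng=rng))
print(tokens)
```

Each round emits between one and k+1 tokens for a single verification pass of the target model, which is where the speedup comes from when the draft model agrees with the target often enough. The accept-or-resample rule keeps the output distribution the same as sampling from the target model alone.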




Related articles

  • Apr 15, 2026: Accelerating decode-heavy LLM inference with speculative decoding on AWS Trainium and vLLM
  • Mar 13, 2026: P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM
  • Apr 2, 2024: Gradient makes LLM benchmarking cost-effective and effortless with AWS Inferentia
  • Apr 7, 2025: How AWS and Intel make LLMs more accessible and cost-effective with DeepSeek
