Running and optimizing small language models on-premises and at the edge

Compute Blog

This article provides a comprehensive guide to running and optimizing small language models (SLMs) on-premises and at the edge using AWS infrastructure, specifically AWS Local Zones and AWS Outposts.

- Discusses two generative AI deployment options: Large Language Models (LLMs) and Small Language Models (SLMs)
- Uses the Llama.cpp framework for efficient SLM deployment across different computing environments
- Provides detailed technical steps for:
  - Launching a GPU instance
  - Installing NVIDIA drivers
  - Setting up Llama.cpp
  - Downloading and converting SLM models
- Explains optimization parameters for SLMs, including GPU layer allocation, thread usage, and context size
- Demonstrates practical use cases such as chatbot interactions and text summarization
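
The three optimization parameters named above (GPU layer allocation, thread usage, and context size) correspond to command-line options of Llama.cpp's `llama-cli` tool: `-ngl`, `-t`, and `-c`. The following is a minimal sketch of how such an invocation might be assembled, not the article's own code. The helper name `build_llama_cli_args`, the model path, and the numeric values are hypothetical placeholders, and suitable values depend on the instance's GPU memory and CPU count.

```python
import os
import shlex


def build_llama_cli_args(model_path, prompt, n_gpu_layers=32, threads=None, ctx_size=4096):
    """Assemble an argument list for llama.cpp's llama-cli binary.

    Hypothetical helper for illustration; the default values are
    placeholders, not recommendations from the article.
    """
    if threads is None:
        # Fall back to the number of CPUs visible to Python.
        threads = os.cpu_count() or 1
    return [
        "llama-cli",
        "-m", model_path,           # path to a GGUF model file
        "-ngl", str(n_gpu_layers),  # number of model layers offloaded to the GPU
        "-t", str(threads),         # number of CPU threads used for generation
        "-c", str(ctx_size),        # context window size in tokens
        "-p", prompt,               # prompt text
    ]


if __name__ == "__main__":
    args = build_llama_cli_args(
        "models/example.gguf",  # placeholder path
        "Summarize the following text: ...",
        n_gpu_layers=20,
        threads=8,
        ctx_size=2048,
    )
    # Print a shell-safe command line; pass `args` to subprocess.run to execute it.
    print(shlex.join(args))
```

Offloading more layers generally shifts work from the CPU to the GPU at the cost of GPU memory, while a larger context size increases memory use for the key-value cache, so the three values are usually tuned together.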

The solution enables organizations to deploy generative AI models while addressing data residency, security, and latency requirements in edge computing environments.



Jun 5, 2025

Run small language models cost-efficiently with AWS Graviton and Amazon SageMaker AI

Dec 19, 2025

Deploying Small Language Models at Scale with AWS IoT Greengrass and Strands Agents

Apr 2, 2025

Using Large Language Models on Amazon Bedrock for multi-step task execution

Sep 25, 2024

Opportunities for telecoms with small language models: Insights from AWS and Meta



Related articles