Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

Big Data Blog

This article discusses how to preprocess and fine-tune large language models (LLMs) using Amazon EMR Serverless and Amazon SageMaker.

Specifically, the article covers:

Introduction to the Common Crawl dataset and methods for exploring, filtering, and processing it using Amazon Athena and Apache Spark on Amazon EMR
Processing the filtered Common Crawl data using Amazon EMR Serverless to prepare it for LLM fine-tuning
Fine-tuning the Llama 2 LLM on the preprocessed data using Amazon SageMaker JumpStart, without writing any code
Evaluating and comparing the performance of the fine-tuned model vs. the original model on a sample task
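
To make the exploration and filtering step more concrete, here is a minimal sketch of how one might build an Amazon Athena query against the Common Crawl columnar index. It assumes the index is registered in Athena as a table named ccindex with the column names from Common Crawl's published schema. The function name build_ccindex_query, the crawl ID, and the domain are hypothetical illustrations and are not taken from the article.

```python
import re

# Assumption: crawl IDs and domains contain only these characters.
# Rejecting anything else keeps quotes out of the SQL text.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9.-]+")


def build_ccindex_query(crawl: str, domain: str, limit: int = 100) -> str:
    """Build an Athena SQL query that lists WARC record locations
    for one registered domain in one Common Crawl crawl.

    Assumes a table named `ccindex` (hypothetical setup) that exposes the
    Common Crawl columnar index columns used below.
    """
    for value in (crawl, domain):
        if not _SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value for SQL literal: {value!r}")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")

    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        "FROM ccindex\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url_host_registered_domain = '{domain}'\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Hypothetical crawl ID and domain, for illustration only.
    print(build_ccindex_query("CC-MAIN-2023-23", "example.com", limit=10))
```

The resulting string could then be submitted to Athena (for example through the console or an SDK), and the returned WARC file names, offsets, and lengths could drive a downstream Spark job on EMR Serverless that fetches and cleans only the selected records.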

Apr 22, 2025
Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Jul 24, 2024
LLM experimentation at scale using Amazon SageMaker Pipelines and MLflow

Apr 24, 2024
Improve LLM performance with human and AI feedback on Amazon SageMaker for Amazon Engineering

Mar 26, 2026
Accelerating LLM fine-tuning with unstructured data using SageMaker Unified Studio and S3

Related articles