
Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

Big Data Blog



This article discusses how to preprocess and fine-tune large language models (LLMs) using Amazon EMR Serverless and Amazon SageMaker.

Specifically, the article covers:

  • An introduction to the Common Crawl dataset and methods for exploring, filtering, and processing it using Amazon Athena and Apache Spark on Amazon EMR
  • Processing the filtered Common Crawl data using Amazon EMR Serverless to prepare it for LLM fine-tuning
  • Fine-tuning the Llama 2 LLM on the preprocessed data using Amazon SageMaker JumpStart, without writing any code
  • Evaluating and comparing the performance of the fine-tuned model vs. the original model on a sample task
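
As a rough illustration of the Athena exploration step in the first bullet, the sketch below builds a SQL query that selects pointers to WARC records from Common Crawl's columnar index. It is not taken from the article. It assumes the index is registered in Athena as "ccindex"."ccindex", and that columns such as crawl, subset, url_host_tld, content_languages, and warc_filename are available as in Common Crawl's published index schema. The function name, the crawl ID, and the filter values are hypothetical placeholders.

```python
def build_ccindex_query(crawl_id: str, tld: str, language: str, limit: int = 100) -> str:
    """Return an Athena SQL string selecting WARC record pointers for matching pages.

    Assumptions (not from the article): the table is "ccindex"."ccindex" and the
    column names follow Common Crawl's columnar index schema.
    """
    # Values are interpolated into SQL text, so accept only simple identifiers
    # made of letters, digits, hyphens, and underscores.
    for value in (crawl_id, tld, language):
        if not value.replace("-", "").replace("_", "").isalnum():
            raise ValueError(f"unexpected characters in filter value: {value!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")

    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'
        f"WHERE crawl = '{crawl_id}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url_host_tld = '{tld}'\n"
        f"  AND content_languages = '{language}'\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_ccindex_query("CC-MAIN-2023-23", "org", "eng", limit=10))
```

The resulting string could be pasted into the Athena console or submitted through an AWS SDK. The returned file name, offset, and length identify the byte range of each record, which a later Spark job could use to fetch only the matching pages.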




Related articles

  • Apr 22, 2025: Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15
  • Jul 24, 2024: LLM experimentation at scale using Amazon SageMaker Pipelines and MLflow
  • Apr 24, 2024: Improve LLM performance with human and AI feedback on Amazon SageMaker for Amazon Engineering
  • Mar 26, 2026: Accelerating LLM fine-tuning with unstructured data using SageMaker Unified Studio and S3
