Fine-tune multimodal models for vision and text use cases on Amazon SageMaker JumpStart

Machine Learning Blog

This article discusses how to fine-tune large multimodal models like Meta Llama 3.2 for vision and text tasks using Amazon SageMaker JumpStart. It covers how to fine-tune these models through the SageMaker Studio UI or Python SDK, and showcases improved performance on the DocVQA visual question answering benchmark after fine-tuning.

Specifically, the article covers:

Overview of Meta Llama 3.2 Vision models and the DocVQA dataset
Using SageMaker JumpStart to fine-tune models through the Studio UI or Python SDK
Quantitative metrics showing ANLS score improvements after fine-tuning on DocVQA
Qualitative examples of fine-tuned model outputs on visual question answering
Technical details such as low-rank adaptation (LoRA) and mixed-precision training
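
For readers unfamiliar with the ANLS metric mentioned in the list above, the following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined for DocVQA: each prediction is scored as 1 minus its normalized edit distance to the closest ground-truth answer, scores whose normalized distance reaches the threshold (conventionally 0.5) are set to 0, and the results are averaged over all questions. The function names, the lowercase and whitespace normalization, and the sample strings are assumptions made for illustration; this is not the evaluation code used in the article.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    references[i] holds every accepted ground-truth answer for question i.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = normalize(prediction)
        best = 0.0
        for answer in answers:
            ref = normalize(answer)
            longest = max(len(pred), len(ref))
            distance = levenshtein(pred, ref) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    # Hypothetical predictions and ground truths, for illustration only.
    print(anls(["pariz", "london"], [["Paris"], ["paris"]]))
```

Because near-misses below the threshold earn partial credit, ANLS tolerates small OCR-style character errors while still scoring unrelated answers as zero.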

May 29, 2024

Fine-tune large multimodal models using Amazon SageMaker

Nov 15, 2024

Cohere Embed multimodal embeddings model is now available on Amazon SageMaker JumpStart

May 14, 2026

New models for image generation and text embeddings are now available in Amazon SageMaker JumpStart

May 6, 2026

4 new Qwen models for multimodal reasoning, agentic coding, and multilingual applications are now available in Amazon SageMaker JumpStart
