Home icon

Fine-tune large multimodal models using Amazon SageMaker

Machine Learning Blog



This article discusses how to fine-tune and deploy large multimodal models like LLaVA (Large Language and Vision Assistant) using Amazon SageMaker. LLaVA is trained to understand both visual and textual data, combining pre-trained language models like Vicuna or LLaMA with visual models like CLIP's visual encoder.

Specifically, the article covers:

  • An overview of LLaVA's architecture and training process
  • Preparing a dataset of image-text pairs with detailed annotations for fine-tuning
  • Fine-tuning the LLaVA model using Amazon SageMaker, including techniques like LoRA and DeepSpeed
  • Deploying the fine-tuned model on a SageMaker endpoint and making inferences
  • Conclusion on the potential of multimodal models and future challenges


Go to article

The AWS News Feed is currently looking for gold sponsors. If you want to support the AWS community and reach a large audience of AWS professionals, consider sponsoring the AWS News Feed.

Related articles

Nov 15
2024
Fine-tune multimodal models for vision and text use cases on Amazon SageMaker JumpStart
Nov 21
2024
Fine-tune large language models with Amazon SageMaker Autopilot
Jul 11
2025
Advanced fine-tuning methods on Amazon SageMaker AI
May 4
2026
Agent-guided workflows to accelerate model customization in Amazon SageMaker AI

The AWS News Feed is currently looking for silver sponsors. If you want to support the AWS community and reach a large audience of AWS professionals, consider sponsoring the AWS News Feed.