
Deploy a GPU-powered application with Defang

This tutorial guides you through creating and deploying a GPU-powered application on AWS using Defang and Mistral. We will walk you through the whole deployment process, based on the Deploying Mistral with vLLM sample.

Prerequisites

AWS Account with GPU Access

For any of this to work, you'll need access to GPU instances in your AWS account. To get it, go to the "Service Quotas" console in your AWS account and request access to spot GPU instances. Request a value of 8 or more, because the quota is counted in vCPUs and the smallest GPU instance has 8 vCPUs. The instance types you're requesting fall under "All G and VT spot instances".
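If you prefer to do this from the command line, the AWS CLI can submit the same request. A minimal sketch, assuming the Service Quotas API and that the quota code returned by the first command is the one for spot G and VT instances:

# Find the quota code for the spot G and VT instance quota
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'G and VT Spot')].[QuotaName,QuotaCode]"

# Request 8 or more vCPUs, using the quota code returned above
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code <quota-code-from-above> --desired-value 8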

Timing

This process can take a few days for AWS to approve.

[Screenshot: the Service Quotas console]

HuggingFace Token

This sample requires a HuggingFace token to download the model. You can get one by signing up at HuggingFace and then generating a token in your account settings.

Step 1: Clone the sample project

You'll need to clone this sample to go through this tutorial.
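For example, assuming the sample lives in the DefangLabs samples repository on GitHub (the repository URL and folder layout here are assumptions; use the link on the sample page if it differs):

# Clone the Defang samples repository, then change into the Mistral vLLM sample's directory
git clone https://github.com/DefangLabs/samples.git
cd samples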

Step 2: Check your Defang BYOC settings

If you use the AWS CLI, you can verify that you are authenticated against AWS using the following command (note that the AWS CLI itself is not required to use the Defang CLI in BYOC mode):

aws sts get-caller-identity
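If you're authenticated, the command prints the identity Defang will deploy with, along these lines (with your own values in place of the placeholders):

{
    "UserId": "AIDAEXAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/your-user"
}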

Step 3: Prepare your Environment

  • Log into your Defang account
defang login
  • Set the HuggingFace token using the defang config command
defang config set --name HF_TOKEN

Defang's configuration system securely stores sensitive information such as API keys, passwords, and other credentials for you.

Step 4: Explore the Compose File

The compose.yml file is where you define your services and their configurations.

The Mistral Service

In there you'll see the configuration we're using to deploy the Mistral model. We've highlighted some of the key aspects.

services:
  mistral:
    image: ghcr.io/mistralai/mistral-src/vllm:latest
    ports:
      - mode: host
        target: 8000
    command: ["--host","0.0.0.0","--model","TheBloke/Mistral-7B-Instruct-v0.2-AWQ","--quantization","awq","--dtype","auto","--tensor-parallel-size","1","--gpu-memory-utilization",".95","--max-model-len","8000"]
    deploy:
      resources:
        reservations:
          cpus: '2.0'
          memory: 8192M
          devices:
            - capabilities: ["gpu"]
    healthcheck:
      test: ["CMD","curl","http://localhost:8000/v1/models"]
      interval: 5m
      timeout: 30s
      retries: 10
    environment:
      - HF_TOKEN

Let's break it down.

We start with the latest vLLM Docker image provided by Mistral AI.

mistral:
  image: ghcr.io/mistralai/mistral-src/vllm:latest

We specify that we require a GPU to run our application.

deploy:
  resources:
    reservations:
      cpus: '2.0'
      memory: 8192M
      devices:
        - capabilities: ["gpu"]

The Mistral model will be downloaded from HuggingFace. Downloading it requires a HuggingFace token, so we specify that the service needs the HF_TOKEN configuration value from Defang.

Specifying the HF_TOKEN in the environment section of the service in the compose.yml file tells Defang to fetch the value from the encrypted configuration store.

environment:
  - HF_TOKEN

The UI Service

In this sample we also provide a simple UI to interact with the endpoint created by vLLM. The UI service is a Next.js application that runs on port 3000.

Networking

You can see here how Defang's networking works. The mistral service is available at http://mistral:8000, exactly as it would be in a local docker-compose environment.

  ui:
    restart: unless-stopped
    build:
      context: ui
      dockerfile: Dockerfile
    ports:
      - mode: ingress
        target: 3000
    deploy:
      resources:
        reservations:
          memory: 256M
    healthcheck:
      test: ["CMD","wget","--spider","http://localhost:3000"]
      interval: 10s
      timeout: 2s
      retries: 10
    environment:
      - OPENAI_BASE_URL=http://mistral:8000/v1/
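Because vLLM exposes an OpenAI-compatible API, the UI simply points an OpenAI client at OPENAI_BASE_URL. As a rough sketch of what a request from the ui container to the mistral service looks like (assuming vLLM's standard /v1/chat/completions endpoint and the model name from the compose file):

# Run from a container on the same Compose network, e.g. the ui service
curl http://mistral:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'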

Step 5: Deploy to Your Own AWS Account with Defang

Run the following command to deploy your service:

defang compose up
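Defang will build the images, deploy the services to your AWS account, and print the service endpoints when it's done. To follow the services while they come up, you can stream their logs (this assumes the tail subcommand is available in your version of the Defang CLI):

defang tail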