Discover how to deploy and serve the Llama 3.1 model using Outpost Services.
Meta AI introduced the Llama 3.1 model family on July 23, 2024. Of these, the 405B-parameter model was, at release, the most capable openly available LLM, positioned to compete with leading proprietary models such as GPT-4o and Claude 3.5 Sonnet.
In this tutorial, we will cover the steps to deploy Llama 3.1 models on a GPU node using Outpost Services and package them for hassle-free deployment across various GPU types.
The Llama 3.1 models come in several sizes, each with its own GPU requirements. Here is a compatibility matrix for both pretrained and instruction-tuned models. GPU entries use the form type:count, so L4:8 means eight L4 GPUs on one node:
GPU | Meta-Llama-3.1-8B | Meta-Llama-3.1-70B | Meta-Llama-3.1-405B-FP8 |
---|---|---|---|
L4:1 | ✅ with --max-model-len 4096 | ❌ | ❌ |
L4:8 | ✅ | ❌ | ❌ |
A100:8 | ✅ | ✅ | ❌ |
A100-80GB:8 | ✅ | ✅ | ✅ with --max-model-len 4096 |
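To see roughly where the matrix comes from, here is a minimal sketch of a weights-only memory check. The per-GPU memory sizes (24 GB for L4, 40 GB for A100, 80 GB for A100-80GB), the rounded parameter counts, and the bytes-per-parameter figures are assumptions added for illustration, not values taken from the table. The function names are hypothetical.

```python
# Rough lower bound: do the model weights alone fit in total GPU memory?
# All numbers below are illustrative assumptions, not Outpost-provided values.

GPU_MEMORY_GB = {"L4": 24, "A100": 40, "A100-80GB": 80}

# model name -> (parameters in billions, bytes per parameter)
MODELS = {
    "Meta-Llama-3.1-8B": (8, 2),          # 16-bit weights
    "Meta-Llama-3.1-70B": (70, 2),        # 16-bit weights
    "Meta-Llama-3.1-405B-FP8": (405, 1),  # 8-bit weights
}


def weights_gb(model: str) -> int:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    params_billion, bytes_per_param = MODELS[model]
    return params_billion * bytes_per_param


def total_gpu_gb(accelerators: str) -> int:
    """Total memory for a 'type:count' string such as 'A100-80GB:8'."""
    gpu_type, count = accelerators.rsplit(":", 1)
    return GPU_MEMORY_GB[gpu_type] * int(count)


def weights_fit(model: str, accelerators: str) -> bool:
    """True if the weights alone fit; says nothing about KV cache or overhead."""
    return weights_gb(model) <= total_gpu_gb(accelerators)


if __name__ == "__main__":
    for accel in ("L4:1", "L4:8", "A100:8", "A100-80GB:8"):
        row = {m: weights_fit(m, accel) for m in MODELS}
        print(accel, row)
```

Treat this check as necessary rather than sufficient. Under these assumptions the 8B weights take about 16 GB of a 24 GB L4, which leaves little room for the KV cache and is consistent with the reduced context length in the table. The 70B model passes the weights-only check on L4:8 yet is marked unsupported above, so other limits (KV cache, activations, runtime overhead) clearly matter as well.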
For a complete list of available models, check here:
Now, let's package the model for a smooth deployment process.
Using Outpost Services offers significant benefits:
Start by creating a deployment file (`outpost.yaml`) to launch a fully managed service with load balancing and auto-recovery:
```yaml
service:
  # The service is considered ready once GET /health succeeds.
  readiness_probe:
    path: /health
    headers:
      Authorization: Bearer $API_KEY
  replica_policy:
    min_replicas: 1
    max_replicas: 2
    target_qps_per_replica: 5
    upscale_delay_seconds: 300
    downscale_delay_seconds: 1200

resources:
  accelerators: L4:1
  use_spot: True
  ports: 8080

envs:
  MODEL_ID: meta-llama/Meta-Llama-3.1-8B-Instruct
  HF_TOKEN: "" # TODO: Enter your Hugging Face token.
  API_KEY: "" # TODO: Choose an API key; clients send it as a Bearer token.

run: |
  # HF_TOKEN is passed into the container so TGI can download the gated model.
  docker run --gpus all --shm-size 1g -p 8080:80 \
    -e HF_TOKEN=$HF_TOKEN \
    -v ~/data:/data ghcr.io/huggingface/text-generation-inference \
    --model-id $MODEL_ID \
    --api-key $API_KEY
```
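To make the replica policy concrete, here is a minimal sketch of how these settings might translate into a replica count. It assumes the autoscaler aims for `ceil(qps / target_qps_per_replica)` replicas, clamped between `min_replicas` and `max_replicas`. That reading follows from the field names and is not a description of Outpost's actual algorithm. The function name is hypothetical.

```python
import math


def desired_replicas(
    qps: float,
    target_qps_per_replica: float = 5,
    min_replicas: int = 1,
    max_replicas: int = 2,
) -> int:
    """Replica count implied by the policy above (assumed ceil-and-clamp rule)."""
    wanted = math.ceil(qps / target_qps_per_replica)
    # Never go below min_replicas or above max_replicas.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for qps in (0, 3, 5, 6, 40):
        print(f"{qps} QPS -> {desired_replicas(qps)} replica(s)")
```

With the values in the file, anything up to 5 QPS stays on one replica, and sustained load above that asks for the second. The delay fields presumably mean the load has to stay above the threshold for 300 seconds before scaling up, and below it for 1200 seconds before scaling down, which avoids churn on short spikes.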
Once the service is ready, set your endpoint:
ENDPOINT=$*****onoutpost.com
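If you would rather wait for readiness from a script than check by hand, the following is a minimal sketch that polls the same `/health` path the readiness probe uses. The function names, retry count, and interval are hypothetical choices. The endpoint is assumed to be a full URL including the scheme, and the `Authorization` header is assumed to be required because the probe in the deployment file sends one.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_ok(endpoint: str, api_key: str, timeout: float = 5.0) -> bool:
    """Return True if GET {endpoint}/health answers with HTTP 200."""
    request = urllib.request.Request(
        f"{endpoint}/health",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 30,
    interval_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(interval_seconds)
    return False
```

A typical call would be `wait_until_ready(lambda: health_ok(endpoint, api_key))`, with `endpoint` and `api_key` set to your own values. Passing the check as a callable keeps the retry loop independent of the network, which makes it easy to exercise with a stub.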
You can use curl to interact with TGI’s Messages API. Here is an example:
```bash
ENDPOINT=$*****onoutpost.com

# API_KEY must match the value set in outpost.yaml,
# since the server was started with --api-key.
curl $ENDPOINT/v1/chat/completions \
  -X POST \
  -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $API_KEY"
```
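Because the request sets `"stream": true`, the response arrives as server-sent events rather than a single JSON document. Here is a minimal Python sketch of the same call using only the standard library. It assumes each event line has the OpenAI-style form `data: {...}` with the text under `choices[0].delta.content`, and that the stream may end with `data: [DONE]`. The function names are hypothetical, and the `Authorization` header is included on the assumption that the server was started with an API key.

```python
import json
import urllib.request
from typing import Iterator, Optional


def build_payload(prompt: str, max_tokens: int = 20) -> dict:
    """Same request body as the curl example, with the user prompt filled in."""
    return {
        "model": "tgi",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "stream": True,
        "max_tokens": max_tokens,
    }


def parse_sse_line(line: str) -> Optional[str]:
    """Return the text fragment in one SSE line, or None if it carries none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines, comments, other fields
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(endpoint: str, api_key: str, prompt: str) -> Iterator[str]:
    """Yield text fragments as they arrive. `endpoint` must include the scheme."""
    request = urllib.request.Request(
        f"{endpoint}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # the response object iterates line by line
            fragment = parse_sse_line(raw_line.decode("utf-8"))
            if fragment:
                yield fragment
```

To print the reply as it streams, loop over `stream_chat(endpoint, api_key, "What is deep learning?")` and write each fragment with `print(fragment, end="", flush=True)`.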
🎉 Congratulations! You are now successfully serving a Llama 3.1 8B model.