Deploying Llama 3.1 with TGI on Outpost Services

Discover how to deploy and serve the Llama 3.1 model using Outpost Services.

Meta AI introduced the Llama 3.1 model family on July 23, 2024. Among these models, the 405B-parameter model stands out as the most advanced open LLM, challenging leading proprietary models like GPT-4o and Claude 3.5 Sonnet.

Introduction to Llama 3.1

In this tutorial, we will cover the steps to deploy a Llama 3.1 model on a GPU node using Outpost Services and package it for hassle-free deployment across various GPU types.

GPU Requirements for Different Llama 3.1 Models

The Llama 3.1 models come in several sizes, each with its own GPU requirements. Here is a compatibility matrix for both pretrained and instruction-tuned models:

| GPU | Meta-Llama-3.1-8B | Meta-Llama-3.1-70B | Meta-Llama-3.1-405B-FP8 |
|---|---|---|---|
| L4:1 | ✅, with --max-model-len 4096 | ❌ | ❌ |
| L4:8 | ✅ | ❌ | ❌ |
| A100:8 | ✅ | ✅ | ❌ |
| A100-80GB:8 | ✅ | ✅ | ✅, with --max-model-len 4096 |
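
The matrix follows from simple memory arithmetic: weights take roughly two bytes per parameter in FP16 and one byte in FP8, plus headroom for the KV cache. The figures below are rough estimates, not part of the compatibility matrix above; they are only meant to help you sanity-check other GPU configurations:

bash
# Rough weight-memory estimates: parameters (in billions) x bytes per parameter.
# Per-GPU memory for comparison: L4 = 24 GB, A100 = 40 GB, A100-80GB = 80 GB.
echo "Meta-Llama-3.1-8B  (FP16): ~$((8 * 2)) GB"    # fits one 24 GB L4 with a short context
echo "Meta-Llama-3.1-70B (FP16): ~$((70 * 2)) GB"   # needs 8x A100 (320 GB) or 8x A100-80GB
echo "Meta-Llama-3.1-405B (FP8): ~$((405 * 1)) GB"  # needs 8x A100-80GB (640 GB total)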

For reference, here is the complete list of available models:

  • Pretrained:
    • Meta-Llama-3.1-8B
    • Meta-Llama-3.1-70B
    • Meta-Llama-3.1-405B-FP8
  • Instruction tuned:
    • Meta-Llama-3.1-8B-Instruct
    • Meta-Llama-3.1-70B-Instruct
    • Meta-Llama-3.1-405B-Instruct-FP8

The full-precision 405B model (Meta-Llama-3.1-405B) requires multi-node inference.
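
All of these checkpoints are gated on Hugging Face: you must accept Meta's license on the model page and create an access token before TGI can download the weights. The token is referenced as HF_TOKEN in the deployment file in the next section. A minimal sketch of the one-time setup:

bash
# One-time setup: authenticate with Hugging Face so the gated Llama 3.1 weights can be pulled.
pip install -U "huggingface_hub[cli]"
huggingface-cli login --token hf_xxx   # replace hf_xxx with your own access token
export HF_TOKEN=hf_xxx                 # the same token goes into outpost.yaml below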

Packaging and Deployment with Outpost Services

Now, let's package the model for a smooth deployment process.

Using Outpost Services offers significant benefits:

  • Automatic load balancing across multiple replicas.
  • Automatic recovery of replicas.
  • Cost efficiency by mixing different GPU types and combining reserved and spot GPUs.

Start by creating a deployment file (outpost.yaml) to launch a fully managed service with load balancing and auto-recovery:

yaml
service:
  readiness_probe:
    path: /health
    headers:
      Authorization: Bearer $API_KEY
  replica_policy:
    min_replicas: 1
    max_replicas: 2
    target_qps_per_replica: 5
    upscale_delay_seconds: 300
    downscale_delay_seconds: 1200

resources:
  accelerators: L4:1
  use_spot: True
  ports: 8080

envs:
  MODEL_ID: meta-llama/Meta-Llama-3.1-8B-Instruct
  HF_TOKEN: "" # TODO: Enter your Hugging Face token.
  API_KEY: "" # TODO: Enter an API key to protect the endpoint.

run: |
  docker run --gpus all --shm-size 1g -p 8080:80 \
    -e HF_TOKEN=$HF_TOKEN \
    -v ~/data:/data ghcr.io/huggingface/text-generation-inference \
    --model-id $MODEL_ID \
    --api-key $API_KEY
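
Before launching on Outpost Services, you can optionally smoke-test the same container on any machine with an NVIDIA GPU and Docker installed. This local check is not part of the Outpost deployment itself, but it quickly confirms that your token and model ID work:

bash
# Optional local smoke test of the TGI container (assumes an NVIDIA GPU and Docker).
export MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/data:/data ghcr.io/huggingface/text-generation-inference \
  --model-id $MODEL_ID
# In a second terminal, once the weights have loaded:
curl localhost:8080/health   # returns HTTP 200 when the server is ready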

Once the service is ready, set your endpoint:

bash
ENDPOINT=$*****onoutpost.com
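
Since the service is protected by the API key, include it as a bearer token in every request. A quick health check against the readiness-probe path from outpost.yaml confirms a replica is up:

bash
# Expect an HTTP 200 response once a replica has loaded the model.
curl -i $ENDPOINT/health \
    -H "Authorization: Bearer $API_KEY"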

Making Requests to the Endpoint

You can use curl to interact with TGI’s OpenAI-compatible Messages API. Here is an example:

bash
ENDPOINT=$*****onoutpost.com

curl $ENDPOINT/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $API_KEY" \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}'
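
With "stream": true the endpoint returns the completion as a stream of server-sent events. For a single JSON response that is easier to post-process, set "stream": false; the example below extracts just the assistant's reply with jq (assumed to be installed):

bash
# Non-streaming request; jq pulls out only the generated text.
curl -s $ENDPOINT/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $API_KEY" \
    -d '{
  "model": "tgi",
  "messages": [{"role": "user", "content": "What is deep learning?"}],
  "stream": false,
  "max_tokens": 100
}' | jq -r '.choices[0].message.content'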

🎉 Congratulations! You are now successfully serving a Llama 3.1 8B model.