Run:AI - Triton with TensorRT

Triton Deployment with TensorRT and Streaming

This guide will walk you through setting up TensorRT-LLM using Docker for later deployment on Run:ai and Triton Inference Server. Follow the steps carefully to ensure a smooth setup process.

Triton Inference Server

Triton Inference Server is an open-source inference serving software designed to simplify AI inferencing. It allows teams to deploy any AI model from a variety of deep learning and machine learning frameworks, such as TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS AI, and more. The Triton Inference Server supports inference across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. It delivers optimized performance for various query types, including real-time, batched, ensembles, and audio/video streaming. Triton Inference Server is a component of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.

Major features include:

  • Supports multiple deep learning frameworks
  • Supports multiple machine learning frameworks
  • Concurrent model execution
  • Dynamic batching
  • Sequence batching and implicit state management for stateful models
  • Provides Backend API that allows adding custom backends and pre/post processing operations
  • Supports writing custom backends in Python, a.k.a. Python-based backends
  • Model pipelines using Ensembling or Business Logic Scripting (BLS)
  • HTTP/REST and GRPC inference protocols based on the community developed KServe protocol
  • A C API and Java API allow Triton to link directly into your application for edge and other in-process use cases
  • Metrics indicating GPU utilization, server throughput, server latency, and more

Where can I find all the backends that are available for Triton?

Anyone can develop a Triton backend, so it isn't possible for us to know about all available backends. But the Triton project does provide a set of supported backends that are tested and updated with each Triton release.

  • TensorRT: The TensorRT backend is used to execute TensorRT models. The tensorrt_backend repo contains the source for the backend.
  • ONNX Runtime: The ONNX Runtime backend is used to execute ONNX models. The onnxruntime_backend repo contains the documentation and source for the backend.
  • TensorFlow: The TensorFlow backend is used to execute TensorFlow models in both GraphDef and SavedModel formats. The same backend is used to execute both TensorFlow 1 and TensorFlow 2 models. The tensorflow_backend repo contains the documentation and source for the backend.
  • PyTorch: The PyTorch backend is used to execute PyTorch models in both TorchScript and PyTorch 2.0 formats. The pytorch_backend repo contains the documentation and source for the backend.
  • OpenVINO: The OpenVINO backend is used to execute OpenVINO models. The openvino_backend repo contains the documentation and source for the backend.
  • Python: The Python backend allows you to write your model logic in Python. For example, you can use this backend to execute pre/post processing code written in Python, or to execute a PyTorch Python script directly (instead of first converting it to TorchScript and then using the PyTorch backend). The python_backend repo contains the documentation and source for the backend.
  • DALI: DALI is a collection of highly optimized building blocks and an execution engine that accelerates the pre-processing of the input data for deep learning applications. The DALI backend allows you to execute your DALI pipeline within Triton. The dali_backend repo contains the documentation and source for the backend.
  • FIL: The FIL (Forest Inference Library) backend is used to execute a variety of tree-based ML models, including XGBoost models, LightGBM models, Scikit-Learn random forest models, and cuML random forest models. The fil_backend repo contains the documentation and source for the backend.
  • vLLM: The vLLM backend is designed to run supported models on a vLLM engine. This backend depends on python_backend to load and serve models. The vllm_backend repo contains the documentation and source for the backend.

📘

Not all of the above backends are supported on every platform that Triton supports. See the Backend-Platform Support Matrix for details.

TensorRT-LLM backend

This is the Triton backend for TensorRT-LLM. For more information on Triton backends, refer to the backend repository. The primary objective of the TensorRT-LLM backend is to enable you to serve TensorRT-LLM models with the Triton Inference Server.

The inflight_batcher_llm directory houses the C++ implementation of the backend, which supports inflight batching, paged attention, and more.

Starting with the setup

📘

We recommend not changing the library version (0.9.0) or the Triton Inference Server image (24.04-trtllm-python-py3), as this combination has been tested to work with most tools in the stack.

Run:AI Deployment

Create a container for the TensorRT temporal workload on Run:AI using the following command:

runai submit --image nvidia/cuda:12.4.0-devel-ubuntu22.04 --cpu <CPU> --gpu <GPU_NUM> --memory <MEM> --attach --interactive --existing-pvc claimname=<YOUR_PVC_NAME>,path=/<PATH>
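
For example, an invocation with hypothetical values (one GPU, eight CPUs, 32 GB of memory, and a PVC named my-pvc mounted at /srv) could look like the following; adjust these values to your cluster and storage:

runai submit --image nvidia/cuda:12.4.0-devel-ubuntu22.04 --cpu 8 --gpu 1 --memory 32G --attach --interactive --existing-pvc claimname=my-pvc,path=/srv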

Obtain the Basic Docker Image Environment

📘

This guide assumes that you have Docker installed with GPU support and that you also have sufficient disk space.

First, we need to run a Docker container with the necessary environment. This step is crucial; do not swap the container version unless you have tested the entire process against it.

mkdir TensorRT-LLM
cd TensorRT-LLM
docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.4.0-devel-ubuntu22.04

Update the package list and install the required dependencies:

apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git
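
Optionally, confirm that the interpreter and pip installed correctly before continuing:

python3 --version   # expect Python 3.10.x
pip3 --version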

Install Dependencies

TensorRT-LLM requires Python 3.10. Change the working directory to /srv/TensorRT-LLM and clone the TensorRT-LLM repository:

git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

After cloning, the path should be /srv/TensorRT-LLM/TensorRT-LLM.

Install the Stable Version of TensorRT-LLM

Install the stable version (corresponding to the cloned branch) of TensorRT-LLM:

pip3 install tensorrt_llm==0.9.0 -U --extra-index-url https://pypi.nvidia.com
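
As an optional sanity check, you can import the package and print its version; it should match the pinned release:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"   # expect 0.9.0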

Log in to Huggingface-cli

huggingface-cli login --token hf_*******************************
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir Meta-Llama-3-8B-Instruct --exclude "*original*"
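
If the download succeeded, the local directory should contain the tokenizer and weight files. A quick way to check (the exact file list may vary with the upstream repository):

ls Meta-Llama-3-8B-Instruct
# expect tokenizer.json, tokenizer_config.json, config.json, and the model-*.safetensors shards, among others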

Build the Llama 8B Model

Build the Llama 3 8B model using a single GPU and FP16. First, convert the Hugging Face checkpoint to the TensorRT-LLM format:

python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_f16 \
            --dtype float16

Then build the TensorRT engine:

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_f16 \
            --output_dir ./tmp/llama/8B/trt_engines/f16/1-gpu \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16 \
            --max_num_tokens 8192 \
            --streamingllm enable
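
Optionally, you can smoke-test the engine locally before wiring it into Triton, using the run.py example script shipped in the cloned repository. Treat this as a sketch; the flags below are those of the v0.9.0 examples:

python3 examples/run.py \
            --engine_dir ./tmp/llama/8B/trt_engines/f16/1-gpu \
            --tokenizer_dir ./Meta-Llama-3-8B-Instruct \
            --max_output_len 64 \
            --input_text "Explain recursion in one sentence."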

Clone and Set Up the TensorRT-LLM Backend

For this step we go back to /srv/TensorRT-LLM and execute the following:

git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/f16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/

Set the tokenizer_dir and engine_dir paths

HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
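
Note that both values are relative to the base directory that contains the two clones, not to tensorrtllm_backend itself. A quick check, run from that base directory, that the paths resolve and the engine was copied (assuming the v0.9.0 single-GPU build, which produces rank0.engine and config.json):

ls -d ${HF_LLAMA_MODEL} ${ENGINE_PATH}
ls ${ENGINE_PATH}   # expect rank0.engine and config.json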

Fill the template configurations

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
 
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
 
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False
 
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
 
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
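
Optionally, spot-check that the substitutions landed in the generated configs. This assumes the v0.9.0 templates expose the engine path as the gpt_model_path parameter and the tokenizer path as tokenizer_dir:

grep -A2 '"gpt_model_path"' all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
grep -A2 '"tokenizer_dir"' all_models/inflight_batcher_llm/preprocessing/config.pbtxt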

Launch the Triton Server

On the Run:ai platform, we have a dedicated image for this process, built with our storage and all the necessary attachments. To use it, you will need to generate an image with the desired number of GPUs for tensor parallelism support. In essence, the process is quite simple.

Change the working directory back to /srv/TensorRT-LLM and run the Triton Inference Server container with the necessary configurations:

docker run -it --rm --gpus all --network host --shm-size=1g -v $(pwd):/workspace --workdir /workspace nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3

Install additional Python dependencies inside the Triton container:

pip install sentencepiece protobuf

Run the Triton Server

python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1
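
Before sending inference requests, you can confirm that the server and models have loaded by polling Triton's readiness endpoint:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready   # 200 means the server is ready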

Test the Setup

You can test the setup by sending a request to the server:

curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d \
'{
  "text_input": "Compose a poem that explains the concept of recursion in programming.",
  "parameters": {
    "max_tokens": 150,
    "streaming": true
  }
}'

This will generate a response from the Llama model, verifying that your setup is complete.
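
Because the model is deployed in decoupled mode with "streaming" enabled, you can also consume the output incrementally through Triton's generate_stream endpoint, which returns responses as server-sent events. A sketch reusing the request body from the example above (adjust field names if your model's inputs differ):

curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate_stream -d \
'{
  "text_input": "Compose a poem that explains the concept of recursion in programming.",
  "parameters": {
    "max_tokens": 150,
    "streaming": true
  }
}'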

Final Notes

  • After completing the "Fill the template configurations" process, the Docker container can be terminated, as it is no longer needed.
  • We utilize only the tokenizer from the Meta-Llama-3-8B-Instruct model. Once the conversion is complete, you can delete the *.safetensors files to save disk space.
  • OpenAI-style streaming is available through openai_trtllm (https://github.com/npuichigo/openai_trtllm), an open-source library that is not supported by our partners.