Run:AI - Triton with TensorRT
Triton Deployment with TensorRT and Streaming
This guide will walk you through setting up TensorRT-LLM using Docker for later deployment on Run:ai and Triton Inference Server. Follow the steps carefully to ensure a smooth setup process.
Triton Inference Server
Triton Inference Server is an open-source inference serving software designed to simplify AI inferencing. It allows teams to deploy any AI model from a variety of deep learning and machine learning frameworks, such as TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS AI, and more. The Triton Inference Server supports inference across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. It delivers optimized performance for various query types, including real-time, batched, ensembles, and audio/video streaming. Triton Inference Server is a component of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.
Major features include:
- Supports multiple deep learning frameworks
- Supports multiple machine learning frameworks
- Concurrent model execution
- Dynamic batching
- Sequence batching and implicit state management for stateful models
- Provides Backend API that allows adding custom backends and pre/post processing operations
- Supports writing custom backends in Python (Python-based backends)
- Model pipelines using Ensembling or Business Logic Scripting (BLS)
- HTTP/REST and gRPC inference protocols based on the community-developed KServe protocol
- A C API and Java API allow Triton to link directly into your application for edge and other in-process use cases
- Metrics indicating GPU utilization, server throughput, server latency, and more
Where can I find all the backends that are available for Triton?
Anyone can develop a Triton backend, so it isn't possible for us to know about all available backends. But the Triton project does provide a set of supported backends that are tested and updated with each Triton release.
- TensorRT: The TensorRT backend is used to execute TensorRT models. The tensorrt_backend repo contains the source for the backend.
- ONNX Runtime: The ONNX Runtime backend is used to execute ONNX models. The onnxruntime_backend repo contains the documentation and source for the backend.
- TensorFlow: The TensorFlow backend is used to execute TensorFlow models in both GraphDef and SavedModel formats. The same backend is used to execute both TensorFlow 1 and TensorFlow 2 models. The tensorflow_backend repo contains the documentation and source for the backend.
- PyTorch: The PyTorch backend is used to execute PyTorch models in both TorchScript and PyTorch 2.0 formats. The pytorch_backend repo contains the documentation and source for the backend.
- OpenVINO: The OpenVINO backend is used to execute OpenVINO models. The openvino_backend repo contains the documentation and source for the backend.
- Python: The Python backend allows you to write your model logic in Python. For example, you can use this backend to execute pre/post processing code written in Python, or to execute a PyTorch Python script directly (instead of first converting it to TorchScript and then using the PyTorch backend). The python_backend repo contains the documentation and source for the backend.
- DALI: DALI is a collection of highly optimized building blocks and an execution engine that accelerates the pre-processing of the input data for deep learning applications. The DALI backend allows you to execute your DALI pipeline within Triton. The dali_backend repo contains the documentation and source for the backend.
- FIL: The FIL (Forest Inference Library) backend is used to execute a variety of tree-based ML models, including XGBoost models, LightGBM models, Scikit-Learn random forest models, and cuML random forest models. The fil_backend repo contains the documentation and source for the backend.
- vLLM: The vLLM backend is designed to run supported models on a vLLM engine. This backend depends on python_backend to load and serve models. The vllm_backend repo contains the documentation and source for the backend.
Not all of the above backends are supported on every platform that Triton supports. See the Backend-Platform Support Matrix to find out which backends are available on each platform.
TensorRT-LLM backend
The TensorRT-LLM backend is the Triton backend for TensorRT-LLM. Its primary objective is to let you serve TensorRT-LLM models with the Triton Inference Server. For more information on Triton backends, refer to the backend repository.
The inflight_batcher_llm directory houses the C++ implementation of the backend, which supports inflight batching, paged attention, and more.
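To make the idea of inflight batching more concrete, the following toy simulation admits waiting requests as soon as a slot frees up, instead of waiting for the whole batch to drain. It is only an illustrative sketch: the function name `simulate_inflight_batching` is hypothetical, and the model ignores prefill, paged attention, and memory limits, so it does not reflect the backend's actual C++ scheduler.

```python
from collections import deque


def simulate_inflight_batching(request_lengths, max_batch_size):
    """Return a dict mapping request index to the step at which it finishes.

    request_lengths[i] is the number of decode steps request i needs (>= 1).
    This is a toy model of the scheduling idea, not the real backend logic.
    """
    if any(length < 1 for length in request_lengths):
        raise ValueError("every request needs at least one decode step")

    waiting = deque(enumerate(request_lengths))
    active = {}        # request index -> remaining decode steps
    finished_at = {}   # request index -> step at which it completed
    step = 0
    while waiting or active:
        # Fill free slots before every step rather than once per batch.
        while waiting and len(active) < max_batch_size:
            request_id, length = waiting.popleft()
            active[request_id] = length
        step += 1
        for request_id in list(active):
            active[request_id] -= 1
            if active[request_id] == 0:
                finished_at[request_id] = step
                del active[request_id]
    return finished_at
```

With `simulate_inflight_batching([4, 1, 2], max_batch_size=2)`, the third request starts right after the short second request finishes, rather than waiting for the long first request to complete.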
Starting with the setup
We recommend keeping both the TensorRT-LLM library version (0.9.0) and the Triton Inference Server image tag (24.04-trtllm-python-py3) unchanged, as this combination has been tested to work with most tools in the stack.
Run:AI Deployment
Create a container for the temporary TensorRT workload on Run:AI using the following command:
runai submit --image nvidia/cuda:12.4.0-devel-ubuntu22.04 --cpu <CPU> --gpu <GPU_NUM> --memory <MEM> --attach --interactive --existing-pvc claimname=<YOUR_PVC_NAME>,path=/<PATH>
Obtain the Basic Docker Image Environment
This guide assumes that you have Docker installed with GPU support and that you also have sufficient disk space.
First, run a Docker container that provides the necessary environment.
This step is crucial; do not swap the container version unless you have tested the entire process against it.
mkdir TensorRT-LLM
cd TensorRT-LLM
docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.4.0-devel-ubuntu22.04
Update the package list and install the required dependencies:
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git
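The command above installs Python 3.10, which the rest of this guide relies on. Before continuing, it can help to confirm which interpreter `python3` actually resolves to. The following is a minimal sketch; the helper name `is_python_310` is hypothetical and not part of TensorRT-LLM.

```python
import sys


def is_python_310(version_info=sys.version_info):
    """Return True when the given version tuple is Python 3.10.x."""
    return tuple(version_info[:2]) == (3, 10)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if is_python_310() else "unexpected"
    print(f"python3 is {found} ({status}; this guide expects 3.10)")
```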
Install Dependencies
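TensorRT-LLM requires Python 3.10. Change to the base working directory, referred to as /srv/TensorRT-LLM throughout this guide (the docker run command above mounts it at /TensorRT-LLM inside the container), and clone the TensorRT-LLM repository: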
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
After cloning, the path should be /srv/TensorRT-LLM/TensorRT-LLM.
Install the Stable Version of TensorRT-LLM
Install the stable version (corresponding to the cloned branch) of TensorRT-LLM:
pip3 install tensorrt_llm==0.9.0 -U --extra-index-url https://pypi.nvidia.com
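Since the guide depends on a specific library version, a quick check after installation can catch an accidental upgrade or a failed install. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes the distribution is registered under the name `tensorrt_llm`, as in the pip command above.

```python
from importlib import metadata

EXPECTED_VERSION = "0.9.0"  # version pinned by this guide


def installed_version(package="tensorrt_llm"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(found, expected=EXPECTED_VERSION):
    """Return True when the installed version equals the pinned one."""
    return found == expected


if __name__ == "__main__":
    found = installed_version()
    print(f"tensorrt_llm: {found!r}, matches {EXPECTED_VERSION}: {matches_pin(found)}")
```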
Log in to the Hugging Face CLI and Download the Model
huggingface-cli login --token hf_*******************************
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir Meta-Llama-3-8B-Instruct --exclude "*original*"
Build the Llama 8B Model
Convert the downloaded Llama 3 8B checkpoint to the TensorRT-LLM checkpoint format, using a single GPU and FP16:
python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
--output_dir ./tllm_checkpoint_1gpu_f16 \
--dtype float16
Then build the TensorRT engine from the converted checkpoint:
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_f16 \
--output_dir ./tmp/llama/8B/trt_engines/f16/1-gpu \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_num_tokens 8192 \
--streamingllm enable
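Before copying the engine into the backend's model repository, it may be worth confirming that the build produced its output. The sketch below assumes that the build writes a `config.json` plus one `rank<N>.engine` file per GPU into the output directory; that layout is an assumption about this version, and the helper name `missing_engine_files` is hypothetical.

```python
from pathlib import Path


def missing_engine_files(engine_dir, world_size=1):
    """List expected engine artifacts that are absent from engine_dir.

    Assumption: the build writes config.json and one rank<N>.engine per GPU.
    """
    engine_dir = Path(engine_dir)
    expected = ["config.json"] + [f"rank{rank}.engine" for rank in range(world_size)]
    return [name for name in expected if not (engine_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_engine_files("./tmp/llama/8B/trt_engines/f16/1-gpu")
    print("engine directory looks complete" if not missing else f"missing: {missing}")
```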
Clone and Set Up TensorRTLLM Backend
For this step, return to /srv/TensorRT-LLM and run the following:
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/f16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/
Set the tokenizer_dir and engine_dir paths
HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
Both paths are relative to /srv/TensorRT-LLM, the directory from which the Triton server is launched later in this guide.
Fill the template configurations
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
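Each command above passes a comma-separated list of `key:value` pairs that are written into the corresponding config.pbtxt. The sketch below illustrates that substitution idea with the standard library, assuming the templates use `${key}` placeholders. It is not the code of tools/fill_template.py, and the function name `fill_config` is hypothetical.

```python
from string import Template


def fill_config(template_text, substitutions):
    """Replace ${key} placeholders using a 'key:value,key:value' string.

    Placeholders without a matching key are left untouched.
    """
    values = {}
    for pair in substitutions.split(","):
        key, value = pair.split(":", 1)  # split once so values may contain ':'
        values[key] = value
    return Template(template_text).safe_substitute(values)


if __name__ == "__main__":
    sample = 'max_batch_size: ${triton_max_batch_size}\nstring_value: "${engine_dir}"'
    print(fill_config(sample, "triton_max_batch_size:64"))
```

In this sketch, a placeholder that receives no value stays in the output as `${...}`, which makes a forgotten key easy to spot when you inspect the filled file.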
Launch the Triton Server
On the Run:ai platform, we have a dedicated image for this process, built with our storage and all the necessary attachments. To use it, you will need to generate an image with the desired number of GPUs for tensor parallelism support.
Change to the base working directory /srv/TensorRT-LLM and start the Triton Inference Server container with the necessary configuration:
docker run -it --rm --gpus all --network host --shm-size=1g -v $(pwd):/workspace --workdir /workspace nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
Install additional Python dependencies
pip install sentencepiece protobuf
Run the Triton Server
python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1
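Loading the engine can take some time, so requests sent immediately after launch may fail. One option is to poll Triton's HTTP readiness endpoint, `/v2/health/ready`, until it answers with status 200. The sketch below assumes the default HTTP port 8000 used elsewhere in this guide; the helper names, attempt count, and delay are hypothetical choices.

```python
import time
import urllib.error
import urllib.request


def http_ready(url="http://localhost:8000/v2/health/ready"):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx answers all count as "not ready".
        return False


def wait_until_ready(probe=http_ready, attempts=60, delay_s=5.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no need to sleep after the final attempt
    return False
```

Calling `wait_until_ready()` with the defaults polls for roughly five minutes before giving up. The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.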
Test the Setup
You can test the setup by sending a request to the server:
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d \
'{
"text_input": "Compose a poem that explains the concept of recursion in programming.",
"parameters": {
"max_tokens": 150,
"streaming": true
}
}'
This will generate a response from the Llama model, verifying that your setup is complete.
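For token-by-token output, Triton's generate extension also offers a streaming endpoint that returns Server-Sent Events, where each event is a line of the form `data: {...}`. The sketch below parses such lines and extracts the generated text. It assumes that the endpoint is `/v2/models/tensorrt_llm_bls/generate_stream` and that each event carries a `text_output` field; both are assumptions about this deployment, and the function name `iter_text_chunks` is hypothetical.

```python
import json


def iter_text_chunks(lines):
    """Yield the text_output field from each 'data: {...}' event line."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = json.loads(line[len("data:"):])
        if "text_output" in payload:
            yield payload["text_output"]


# Usage sketch (needs a running server; url and body are placeholders):
# import urllib.request
# request = urllib.request.Request(url, data=json.dumps(body).encode(), method="POST")
# with urllib.request.urlopen(request) as response:
#     for chunk in iter_text_chunks(response):
#         print(chunk, end="", flush=True)
```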
Final Notes
- After completing the "Fill the template configurations" process, the Docker container can be terminated, as it is no longer needed.
- At serving time, only the tokenizer from the Meta-Llama-3-8B-Instruct model is used. Once the conversion is complete, you can delete the *.safetensors files to save disk space.
- OpenAI-style streaming is available through openai_trtllm (https://github.com/npuichigo/openai_trtllm), an open-source library that is not supported by our partners.
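Before deleting the weight files mentioned in the notes above, a dry run that lists them and their total size can help avoid removing the wrong files. This is a minimal sketch; the helper name `safetensors_files` is hypothetical, and the directory name assumes the download location used earlier in this guide.

```python
from pathlib import Path


def safetensors_files(model_dir):
    """Return sorted (path, size_in_bytes) pairs for *.safetensors files in model_dir."""
    return sorted((path, path.stat().st_size) for path in Path(model_dir).glob("*.safetensors"))


if __name__ == "__main__":
    files = safetensors_files("Meta-Llama-3-8B-Instruct")
    total_gib = sum(size for _, size in files) / 2**30
    for path, size in files:
        print(f"{path}  {size / 2**30:.2f} GiB")
    print(f"{len(files)} file(s), {total_gib:.2f} GiB reclaimable")
```

The script only reports what it finds; deleting the files remains a separate, manual step.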