# Supported Model Servers

Any model server that conforms to the model server protocol is supported by the inference extension.
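The protocol is metrics-driven: the endpoint picker (EPP) scrapes a small set of Prometheus metrics from each model server, chiefly request queue depth and KV-cache utilization, and uses them to route requests. As a hedged illustration only, a conforming server's metrics endpoint might expose values like the following; the names shown follow vLLM's conventions, while other servers report their own names and map them via EPP flags, as the Triton section below shows:

```text
# Illustrative Prometheus exposition, not an exhaustive or authoritative list.
vllm:num_requests_waiting{model_name="my-model"} 4
vllm:gpu_cache_usage_perc{model_name="my-model"} 0.42
```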

## Compatible Model Server Versions

| Model Server | Version | Commit | Notes |
| --- | --- | --- | --- |
| vLLM V0 | v0.6.4 and above | 0ad216f | |
| vLLM V1 | v0.8.0 and above | bc32bc7 | |
| Triton (TensorRT-LLM) | TODO | Pending PR | The LoRA affinity feature is not available, as the required LoRA metrics haven't been implemented in Triton yet. |

## vLLM

vLLM is configured as the default in the endpoint picker extension. No further configuration is required.
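If you ever need to set the metric names explicitly, the defaults are roughly equivalent to passing the following flags; the metric names below are an assumption based on vLLM's metric naming, so verify them against the defaults compiled into your EPP release:

```yaml
# Assumed vLLM default metric names; confirm against your EPP release.
- -totalQueuedRequestsMetric
- "vllm:num_requests_waiting"
- -kvCacheUsagePercentageMetric
- "vllm:gpu_cache_usage_perc"
```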

## Triton with TensorRT-LLM Backend

Specify the Triton metric names when starting the EPP container by adding the following flags to the `args` of the EPP Deployment (a fuller manifest sketch follows this snippet):

```yaml
- -totalQueuedRequestsMetric
- "nv_trt_llm_request_metrics{request_type=waiting}"
- -kvCacheUsagePercentageMetric
- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
```