# Supported Model Servers
Any model server that conforms to the model server protocol is supported by the inference extension.
## Compatible Model Server Versions
| Model Server | Version | Commit | Notes |
|---|---|---|---|
| vLLM V0 | v0.6.4 and above | commit 0ad216f | |
| vLLM V1 | v0.8.0 and above | commit bc32bc7 | |
| Triton (TensorRT-LLM) | TODO | Pending PR. | LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in Triton yet. |
## vLLM
vLLM is configured as the default in the endpoint picker extension. No further configuration is required.
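
For reference, the defaults can also be set explicitly via the same flags used in the Triton section below. A minimal sketch, assuming the EPP defaults to the standard vLLM metrics `vllm:num_requests_waiting` (queue depth) and `vllm:gpu_cache_usage_perc` (KV cache utilization):

```yaml
# Sketch only: these flags mirror the assumed built-in vLLM defaults,
# so adding them to the EPP args is normally unnecessary.
- -totalQueuedRequestsMetric
- "vllm:num_requests_waiting"
- -kvCacheUsagePercentageMetric
- "vllm:gpu_cache_usage_perc"
```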
## Triton with TensorRT-LLM Backend
Specify the metric names when starting the EPP container by adding the following to the `args` of the EPP deployment:
```yaml
- -totalQueuedRequestsMetric
- "nv_trt_llm_request_metrics{request_type=waiting}"
- -kvCacheUsagePercentageMetric
- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
```