Install with LLM-D
This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vsr) together with LLM-D. It also illustrates a key design pattern: using vsr as a model picker in combination with LLM-D as an endpoint picker.
A model picker routes an LLM query to one of several models that are entirely different from each other, whereas an endpoint picker selects one of several endpoints that each serve an equivalent model (most often the exact same base model). This deployment therefore shows how vLLM Semantic Router, in its role as a model picker, complements endpoint-picker solutions such as LLM-D.
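To make the distinction concrete, the sketch below shows the kind of request a model picker acts on. The client sends an OpenAI-compatible chat completion with the special model name `auto`; vsr inspects the prompt and rewrites `auto` to one of the configured models, after which LLM-D picks which replica of that model serves the request. The gateway address, port, and the `auto` convention are placeholders here; the exact values for this deployment come from the steps later in the guide.

```bash
# Illustrative request through the Istio gateway; host/port are placeholders.
# vsr (model picker) rewrites "auto" to a concrete model based on the prompt,
# and LLM-D (endpoint picker) then chooses which endpoint of that model serves it.
curl -s http://<gateway-ip>:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "auto",
        "messages": [
          {"role": "user", "content": "Explain the difference between TCP and UDP."}
        ]
      }'
```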
Since LLM-D offers a number of deployment configurations, some of which require a larger hardware setup, we demonstrate a baseline version of LLM-D working in combination with vsr to introduce the core concepts. The same core concepts also apply when using vsr with more complex LLM-D configurations and the production-grade well-lit paths described in the LLM-D repo at this link.
We also use LLM-D with Istio as the Inference Gateway in order to build on the steps and hardware setup from the Istio deployment example documented in this repo. Istio is commonly used as the default gateway for LLM-D, with or without vsr.
Architecture Overview
The deployment consists of the following components (a minimal wiring sketch follows the list):
- vLLM Semantic Router: Provides intelligent request-routing and processing decisions to Envoy-based gateways
- LLM-D: Distributed inference platform for scale-out LLM inference with state-of-the-art performance
- Istio Gateway: Istio's implementation of the Kubernetes Gateway API, backed by an Envoy proxy
- Gateway API Inference Extension: Additional APIs that extend the Gateway API for inference via ExtProc servers
- Two vLLM instances, each serving one model: Example backend LLMs for illustrating semantic routing in this topology
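As a rough picture of how these pieces fit together, the sketch below shows an HTTPRoute on the Istio gateway whose backend is a Gateway API Inference Extension InferencePool, which in turn fronts the vLLM pods managed by LLM-D; vsr attaches to the gateway's Envoy as an ExtProc filter and makes the model-routing decision before the pool's endpoint picker runs. The resource names, namespace labels, and the InferencePool apiVersion are assumptions for illustration only; the manifests actually installed by the LLM-D and Istio guides are authoritative.

```bash
# Illustrative only -- names, labels, and API versions are placeholders
# and will differ from what the LLM-D charts install.
kubectl apply -f - <<'EOF'
apiVersion: inference.networking.x-k8s.io/v1alpha2   # version depends on the GAIE release
kind: InferencePool
metadata:
  name: llama-pool
spec:
  targetPortNumber: 8000          # port the vLLM pods listen on
  selector:
    app: vllm-llama               # label on the vLLM pods managed by LLM-D
  extensionRef:
    name: llama-epp               # LLM-D endpoint-picker (EPP) service
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway     # the Istio-managed Gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-pool
EOF
```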
Prerequisites
Before starting, ensure you have the following tools installed (a quick way to verify them is shown after the list):
- Docker - Container runtime
- minikube - Local Kubernetes
- kind - Kubernetes in Docker
- kubectl - Kubernetes CLI
- istioctl - Istio CLI
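To confirm the tools are on your PATH before proceeding, a quick check like the following works; the minimum versions required are whatever the Istio guide and the LLM-D documentation call for, which this sketch does not enforce.

```bash
# Print the installed version of each prerequisite; a missing tool will error out.
docker --version
minikube version
kind --version
kubectl version --client
istioctl version --remote=false
```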
We use minikube in the description below. As noted above, this guide builds on the vsr + Istio deployment guide from this repo, so we point to that guide for the common portions of the documentation and add only the incremental steps here.
As with the Istio guide, you will need a machine with GPU support and at least 2 GPUs to run this exercise, so that we can deploy two different LLM base models and test vsr routing between them.
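Assuming NVIDIA GPUs (as in the Istio guide), you can confirm that at least two devices are visible on the host before creating the cluster:

```bash
# List the GPUs visible on the host; this setup expects two or more.
nvidia-smi -L
```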
Step 1: Common Steps from Istio Guide
First, follow the steps documented in the Istio guide to create a local minikube cluster.
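For reference, the cluster start looks roughly like the command below; treat the driver, GPU, and sizing flags as placeholders and use the exact command from the Istio guide.

```bash
# Illustrative minikube start with GPU passthrough (docker driver);
# the Istio guide's exact flags and resource sizing take precedence.
minikube start --driver=docker --container-runtime=docker --gpus=all \
  --cpus=8 --memory=32g
```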