How llm-d Prefix-Cache Routing Made Qwen 7B on EKS 2.3x Faster

Date: 26 June 2026

llm-d prefix-cache routing benchmark on Amazon EKS

Introduction

I wanted to benchmark how much the routing layer matters for LLM inference when the workload has repeated long prefixes.

The setup was intentionally simple: Qwen2.5-7B-Instruct, vLLM, AWS EKS, FSx for Lustre, and eight g5.xlarge GPU nodes. Each node had one NVIDIA A10G GPU and ran one vLLM decode replica. The interesting part was the comparison in front of those same eight pods.

One path used a plain Kubernetes ClusterIP Service, which spreads traffic across the pods with no knowledge of their cache state (round-robin-style distribution). The other path used llm-d with the precise prefix-cache-aware endpoint picker.

The result was not small. With the same hardware and the same vLLM pods, llm-d finished the 512-concurrency benchmark in 358.7 seconds instead of 840.2 seconds. Output throughput went from 2,742 tok/s to 6,423 tok/s, and mean time to first token dropped from 19.0 seconds to 0.86 seconds.

The Problem

vLLM has a KV cache. If many requests share the same long prefix, the best case is to reuse the cached prefix blocks instead of recomputing the prefill again and again.

But there is a catch: each vLLM replica has its own KV cache.

With plain round-robin routing, repeated-prefix requests are scattered across replicas. A request may land on a pod that has never seen that prefix before, even though another pod already holds the right KV blocks. The cluster then burns GPU time on repeated prefill work, fills the KV cache, and eventually starts queueing requests.

llm-d solves this specific problem by making routing aware of prefix-cache locality. In this benchmark, the llm-d endpoint picker routed prompts to the replica that was most likely to already hold the matching prefix blocks.
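A toy simulation can make the locality argument concrete. The sketch below is not llm-d's algorithm and does not model vLLM's block-level cache. It assumes each replica can hold a fixed number of whole prefixes (`CACHE_SLOTS`, a made-up value), evicts with LRU, and compares a round-robin picker against a hypothetical picker that always sends a given prefix to the same replica.

```python
from collections import OrderedDict
from itertools import cycle
import random

NUM_REPLICAS = 8      # from the article: eight vLLM decode pods
NUM_PREFIXES = 150    # from the article: 150 distinct prefixes
NUM_REQUESTS = 9000   # from the article: 9000 prompts
CACHE_SLOTS = 25      # ASSUMPTION: whole prefixes one replica can cache at once


class Replica:
    """Toy replica with an LRU cache keyed by prefix id."""

    def __init__(self, slots):
        self.slots = slots
        self.cache = OrderedDict()

    def serve(self, prefix_id):
        """Return True on a cache hit, False when the prefix must be prefilled."""
        if prefix_id in self.cache:
            self.cache.move_to_end(prefix_id)  # mark as most recently used
            return True
        self.cache[prefix_id] = True
        if len(self.cache) > self.slots:
            self.cache.popitem(last=False)     # evict the least recently used prefix
        return False


def hit_rate(pick_replica, seed=0):
    """Replay a random repeated-prefix workload and return the fraction of hits."""
    rng = random.Random(seed)
    replicas = [Replica(CACHE_SLOTS) for _ in range(NUM_REPLICAS)]
    hits = 0
    for _ in range(NUM_REQUESTS):
        prefix_id = rng.randrange(NUM_PREFIXES)
        hits += replicas[pick_replica(prefix_id)].serve(prefix_id)
    return hits / NUM_REQUESTS


def make_round_robin():
    """Picker that ignores the prefix and cycles through the replicas."""
    order = cycle(range(NUM_REPLICAS))
    return lambda prefix_id: next(order)


def prefix_affinity(prefix_id):
    """Hypothetical picker: the same prefix always goes to the same replica."""
    return prefix_id % NUM_REPLICAS


rr = hit_rate(make_round_robin())
aff = hit_rate(prefix_affinity)
print(f"round-robin hit rate:     {rr:.1%}")
print(f"prefix-affinity hit rate: {aff:.1%}")
```

With these assumptions, each replica in the affinity case owns at most 19 prefixes, which fit in its 25 slots, so it only misses the first time it sees a prefix. The round-robin replicas each see all 150 prefixes and keep evicting. The exact percentages depend entirely on the assumed cache size and should not be read as a prediction of the measured hit rates.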

Figure: Round-robin Grafana dashboard

Architecture

Figure: Architecture diagram

The benchmark cluster was built on AWS EKS in us-west-2.

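The main components were Amazon EKS, eight g5.xlarge nodes (one NVIDIA A10G GPU and one vLLM decode replica each), FSx for Lustre holding the model weights, a ClusterIP Service for the round-robin path, and the llm-d endpoint picker for the prefix-cache-aware path.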

The important detail is that both paths used the same eight vLLM decode pods. The only meaningful difference in the A/B test was the routing layer.

Figure: Traffic flow from the vLLM benchmark client through round-robin and llm-d routing to the decode pods and FSx for Lustre

Implementation

The infrastructure was created with Terraform, then the cluster dependencies were installed with Helm and scripts:

./scripts/tf-init.sh
./scripts/tf-apply.sh
./scripts/update-kubeconfig.sh
./scripts/install-gpu-operator.sh
./scripts/install-fsx-csi-driver.sh
./scripts/install-monitoring.sh

The model weights were not downloaded by the inference pod at runtime. They were already available on FSx for Lustre and mounted into the pods. This made pod restarts much faster and avoided pushing large model downloads into the benchmark path.

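For the llm-d test, I installed the precise prefix-cache routing stack and used the repo customization under deploy/llm-d/. The essential mechanism is that vLLM emits KV events and the endpoint picker uses them to track which replica holds which prefix blocks.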

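There were also a few practical gotchas. One of them, a non-blocking FSx CSI controller permission warning, is described in Notes From The Run.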

Figure: 8-node Grafana overview

Benchmark Setup

The main benchmark used vllm bench serve with a repeated-prefix dataset:

dataset: prefix_repetition
prefixes: 150
prefix length: 2048 tokens
suffix length: 128 tokens
output length: 256 tokens
request rate: inf
max concurrency: 512
prompts: 9000
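The shape of this workload already hints at how much prefill work is avoidable. The arithmetic below is a back-of-the-envelope sketch using the nominal token lengths from the configuration above. It assumes the 9000 prompts are spread evenly over the 150 prefixes and ignores block granularity, tokenizer overhead, and cache eviction.

```python
NUM_PROMPTS = 9000
NUM_PREFIXES = 150
PREFIX_LEN = 2048   # tokens shared by every request with the same prefix
SUFFIX_LEN = 128    # tokens unique to each request
OUTPUT_LEN = 256    # tokens generated per request

# ASSUMPTION: prompts are distributed evenly across prefixes
requests_per_prefix = NUM_PROMPTS // NUM_PREFIXES

prompt_tokens = PREFIX_LEN + SUFFIX_LEN
shared_fraction = PREFIX_LEN / prompt_tokens

# Prefill tokens if nothing is ever reused
prefill_no_reuse = NUM_PROMPTS * prompt_tokens

# Prefill tokens if each prefix is computed exactly once cluster-wide
prefill_ideal = NUM_PREFIXES * PREFIX_LEN + NUM_PROMPTS * SUFFIX_LEN

total_output_tokens = NUM_PROMPTS * OUTPUT_LEN

print(f"requests per prefix:        {requests_per_prefix}")
print(f"shared share of a prompt:   {shared_fraction:.1%}")
print(f"prefill tokens, no reuse:   {prefill_no_reuse:,}")
print(f"prefill tokens, ideal:      {prefill_ideal:,}")
print(f"ideal / no reuse:           {prefill_ideal / prefill_no_reuse:.1%}")
print(f"total output tokens:        {total_output_tokens:,}")
```

Under these assumptions about 94% of every prompt is shared prefix, so a router that always finds the cached prefix would need only a small fraction of the prefill compute that a cache-blind router can end up spending.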

The benchmark was collected on 15 June 2026.

I also ran a smaller rate ladder at requested rates of 20, 40, and 60 requests per second. That helped show where the round-robin path started saturating and where llm-d still had useful headroom.

Results

Here is the 512-concurrency result:

| Metric | Round-robin | llm-d | llm-d advantage |
| --- | --- | --- | --- |
| Successful / failed requests | 9000 / 0 | 9000 / 0 | Same |
| Benchmark duration | 840.2 s | 358.7 s | 2.3x faster |
| Request throughput | 10.71 req/s | 25.09 req/s | +134% |
| Output token throughput | 2,742 tok/s | 6,423 tok/s | +134% |
| Total token throughput | 26,362 tok/s | 61,748 tok/s | +134% |
| Mean TTFT | 19,029 ms | 863 ms | -95% |
| Median TTFT | 18,458 ms | 340 ms | -98% |
| P99 TTFT | 36,739 ms | 12,544 ms | -66% |
| Mean TPOT | 109.2 ms | 75.3 ms | -31% |
| P99 TPOT | 157.4 ms | 111.0 ms | -29% |
| Prefix cache hit rate | about 11% | about 93% | Much higher |
| GPU KV cache usage | about 98-99% | about 64-71% | Avoided saturation |
| Waiting requests | about 180 | 0 | Queue removed |
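The advantage column can be re-derived from the raw numbers. The snippet below is a small sanity check, not part of the benchmark tooling. It recomputes the speedup and percentage changes from the table values, and cross-checks the output throughput against the workload size, assuming every request produced exactly 256 output tokens.

```python
# (round-robin, llm-d) pairs copied from the 512-concurrency table
results = {
    "duration_s": (840.2, 358.7),
    "request_throughput": (10.71, 25.09),
    "output_tok_per_s": (2742, 6423),
    "total_tok_per_s": (26362, 61748),
    "mean_ttft_ms": (19029, 863),
    "median_ttft_ms": (18458, 340),
    "p99_ttft_ms": (36739, 12544),
    "mean_tpot_ms": (109.2, 75.3),
    "p99_tpot_ms": (157.4, 111.0),
}


def pct_change(baseline, new):
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


speedup = results["duration_s"][0] / results["duration_s"][1]
changes = {name: pct_change(rr, llmd) for name, (rr, llmd) in results.items()}

print(f"speedup: {speedup:.2f}x")
for name, change in changes.items():
    print(f"{name:<20} {change:+.1f}%")

# Cross-check: throughput x duration should be close to 9000 prompts x 256 tokens
expected_output_tokens = 9000 * 256
measured_rr = results["output_tok_per_s"][0] * results["duration_s"][0]
measured_llmd = results["output_tok_per_s"][1] * results["duration_s"][1]
print(f"expected output tokens: {expected_output_tokens:,}")
print(f"round-robin measured:   {measured_rr:,.0f}")
print(f"llm-d measured:         {measured_llmd:,.0f}")
```

Both paths generated the same number of output tokens, so the throughput gain and the duration speedup are the same ratio seen from two sides.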

The rate ladder showed the same shape:

| Requested rate | Endpoint | Achieved req/s | Output tok/s | Mean TTFT | P99 TTFT | Mean TPOT |
| --- | --- | --- | --- | --- | --- | --- |
| 20 req/s | Round-robin | 7.05 | 1,805.9 | 3,338.5 ms | 17,075.5 ms | 78.5 ms |
| 20 req/s | llm-d | 11.85 | 3,034.4 | 514.0 ms | 1,142.9 ms | 52.8 ms |
| 40 req/s | Round-robin | 8.57 | 2,192.7 | 22,055.0 ms | 56,710.0 ms | 99.9 ms |
| 40 req/s | llm-d | 22.64 | 5,795.0 | 1,901.0 ms | 5,585.9 ms | 76.5 ms |
| 60 req/s | Round-robin | 8.90 | 2,278.1 | 41,661.2 ms | 90,767.7 ms | 104.3 ms |
| 60 req/s | llm-d | 21.52 | 5,507.9 | 3,496.8 ms | 8,605.3 ms | 122.3 ms |

Round-robin saturated very early. Even when I requested 40 or 60 req/s, it delivered only about 8 to 9 req/s, and mean TTFT climbed into tens of seconds.
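One way to read the rate ladder is to compare achieved rate with requested rate on each path. The sketch below does that with the numbers from the ladder. Note that the achieved rate of a finite run also includes ramp-up and the tail of the last requests, so a ratio below 100% is a rough indicator rather than a strict saturation measurement.

```python
# (requested req/s, endpoint, achieved req/s) copied from the rate ladder
ladder = [
    (20, "round-robin", 7.05),
    (20, "llm-d", 11.85),
    (40, "round-robin", 8.57),
    (40, "llm-d", 22.64),
    (60, "round-robin", 8.90),
    (60, "llm-d", 21.52),
]

# Group achieved rates by requested rate so the two paths can be compared
by_rate = {}
for requested, endpoint, achieved in ladder:
    by_rate.setdefault(requested, {})[endpoint] = achieved

ratios = {}
for requested, achieved in sorted(by_rate.items()):
    rr = achieved["round-robin"]
    llmd = achieved["llm-d"]
    ratios[requested] = llmd / rr
    print(
        f"{requested:>2} req/s requested: "
        f"round-robin {rr / requested:.0%} delivered, "
        f"llm-d {llmd / requested:.0%} delivered, "
        f"llm-d / round-robin = {ratios[requested]:.2f}x"
    )

rr_ceiling = max(a["round-robin"] for a in by_rate.values())
llmd_ceiling = max(a["llm-d"] for a in by_rate.values())
print(f"highest achieved rate: round-robin {rr_ceiling}, llm-d {llmd_ceiling}")
```

The round-robin path never exceeds 8.90 req/s at any requested rate, while llm-d peaks at 22.64 req/s at the 40 req/s step and dips slightly at 60 req/s.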

llm-d did not make the GPUs infinitely fast, of course. Eight A10Gs still have a real ceiling. But it moved the useful ceiling much higher because it avoided a large amount of repeated prefill work.

Why llm-d Won

The workload had 150 repeated long prefixes. That is exactly the kind of traffic where cache locality matters.

Round-robin distributed requests without knowing which replica had which prefix in its KV cache. So requests kept forcing prefills that would have been unnecessary had they been routed to a replica already holding the prefix.

With llm-d, vLLM emitted KV events and the router used those events to build a prefix-cache-aware view of the replicas. When the next request arrived, the endpoint picker could prefer the replica that already had the relevant prefix blocks.
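The idea of an event-driven prefix index can be sketched in a few lines. This is an illustration of the general technique, not llm-d's implementation. The class and method names are hypothetical, the 16-token block size is an assumption, the hashing scheme is invented for the example, and the picker ignores load, which a real scheduler would have to weigh as well.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 16  # ASSUMPTION: tokens per KV block


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash full blocks so each hash identifies the whole prefix up to it."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class PrefixIndex:
    """Hypothetical router-side view of which pod holds which KV blocks."""

    def __init__(self):
        self.holders = defaultdict(set)  # block hash -> set of pod names

    def on_block_stored(self, pod, block_hash):
        """Handle an event saying a pod cached a block."""
        self.holders[block_hash].add(pod)

    def on_block_removed(self, pod, block_hash):
        """Handle an event saying a pod evicted a block."""
        self.holders[block_hash].discard(pod)

    def matched_blocks(self, pod, hashes):
        """Count how many leading blocks of the request the pod already holds."""
        count = 0
        for block_hash in hashes:
            if pod not in self.holders.get(block_hash, ()):
                break
            count += 1
        return count

    def pick(self, pods, token_ids):
        """Prefer the pod with the longest cached prefix (ties go to the first pod)."""
        hashes = block_hashes(token_ids)
        return max(pods, key=lambda pod: self.matched_blocks(pod, hashes))


# Usage: pod-b has cached a 64-token prefix, pod-a has cached nothing
index = PrefixIndex()
shared_prefix = list(range(64))
for block_hash in block_hashes(shared_prefix):
    index.on_block_stored("pod-b", block_hash)

request_tokens = shared_prefix + [999] * 16   # same prefix, new 16-token suffix
chosen = index.pick(["pod-a", "pod-b"], request_tokens)
matched = index.matched_blocks("pod-b", block_hashes(request_tokens))
print(chosen, matched)
```

Chaining each block hash to its parent means a match on block N implies a match on every earlier block, so the index only needs a set lookup per block instead of comparing token sequences.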

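The result: the prefix cache hit rate rose from about 11% to about 93%, GPU KV cache usage dropped from about 98-99% to about 64-71%, and the waiting queue disappeared.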

The most interesting part is that this was not a model change, GPU change, or replica-count change. It was the routing layer.

Figure: NVIDIA SMI during load

Notes From The Run

The vLLM logs showed the llm-d path running with no waiting queue while the prefix hit rate warmed up:

Running: 64 reqs, Waiting: 0 reqs, GPU KV cache usage: 62.1%, Prefix cache hit rate: 63.5%
Running: 68 reqs, Waiting: 0 reqs, GPU KV cache usage: 68.9%, Prefix cache hit rate: 66.4%
Running: 76 reqs, Waiting: 0 reqs, GPU KV cache usage: 70.1%, Prefix cache hit rate: 72.7%
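Log lines like these are easy to turn into numbers for a quick look at the warm-up trend. The parser below is a minimal sketch that assumes the four fields appear in exactly the order and wording shown in the excerpt above. Other vLLM versions may format the line differently, so treat the pattern as something to adapt.

```python
import re

# Pattern for the four fields shown in the excerpt, in that order
LOG_PATTERN = re.compile(
    r"Running: (?P<running>\d+) reqs, "
    r"Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv_usage>[\d.]+)%, "
    r"Prefix cache hit rate: (?P<hit_rate>[\d.]+)%"
)


def parse_line(line):
    """Return the metrics in one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "running": int(match["running"]),
        "waiting": int(match["waiting"]),
        "kv_usage_pct": float(match["kv_usage"]),
        "prefix_hit_rate_pct": float(match["hit_rate"]),
    }


lines = [
    "Running: 64 reqs, Waiting: 0 reqs, GPU KV cache usage: 62.1%, Prefix cache hit rate: 63.5%",
    "Running: 68 reqs, Waiting: 0 reqs, GPU KV cache usage: 68.9%, Prefix cache hit rate: 66.4%",
    "Running: 76 reqs, Waiting: 0 reqs, GPU KV cache usage: 70.1%, Prefix cache hit rate: 72.7%",
]

samples = [parse_line(line) for line in lines]
hit_rates = [s["prefix_hit_rate_pct"] for s in samples]
queue_free = all(s["waiting"] == 0 for s in samples)

print(f"hit rate trend: {hit_rates}")
print(f"no waiting requests in any sample: {queue_free}")
```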

The final aggregate showed a higher P99 TTFT than the steady-state Grafana view, because the beginning of the run included cold-cache ramp-up. The median TTFT over the whole run was 340 ms, and once the cache had warmed, the steady-state dashboard showed the system serving 512-concurrency traffic without queue buildup.

There was also an FSx CSI controller warning about missing DescribeFileSystems permission. In this setup it was not blocking, because I used static FSx PV/PVC configuration. The file system identity and mount details were already known, so dynamic FSx discovery was not part of the benchmark path.

Conclusion

This benchmark was a good reminder that LLM inference performance is not only about the GPU count.

For repeated-prefix workloads, the routing layer can decide whether the cluster reuses KV cache or recomputes the same long prefixes again and again. In this run, llm-d precise prefix-cache routing made the same 8 x A10G fleet finish the workload 2.3x faster, while cutting mean TTFT by 95%.

If your traffic has shared system prompts, long common instructions, retrieval templates, chat prefixes, or agent scaffolding, round-robin routing can quietly waste a lot of GPU time. Prefix-cache-aware routing is one of those changes that looks small in the architecture diagram but very large in the benchmark results.

I hope you enjoyed this article.

You can find all of my code in my GitHub repository: https://github.com/andygolubev/aws-eks-inference-llmd-vllm-benchmark-qwen-7b

Feel free to connect with me on LinkedIn: https://www.linkedin.com/in/andy-golubev/

Developed and designed by Olga Golubev