vLLM V1 Migration: Inference Accuracy Issues and Solutions

vLLM V1 introduces significant architectural changes, and inference results can diverge from V0 after migration. The ServiceNow AI team describes four corrections it needed when migrating PipelineRL.

What Changed

vLLM V1 is a complete rewrite of the V0 engine. The ServiceNow AI team encountered severe inference discrepancies in their migration from vLLM 0.8.5 to 0.18.1, particularly in reinforcement learning (RL) systems.

PipelineRL uses vLLM as its inference engine to sample tokens and obtain their log probabilities (logprobs). The trainer uses these logprobs to calculate policy ratios, KL divergence, clip rates, entropy, and rewards, so any difference in how logprobs are calculated can alter training dynamics.

Initial V1 runs showed clear deviations from the V0 reference in the clamp_log_ratio_new_old_indicator, kl_new_old, entropy, and reward metrics. The same problem can occur in PPO, GRPO, and other online RL systems.
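To make the dependency concrete, here is a minimal sketch of how a constant logprob offset propagates into ratio, clip, and KL statistics. The function name, the clip threshold, and the choice of the k3 KL estimator are assumptions for illustration; the source does not show how PipelineRL defines these metrics.

```python
import math

def rollout_vs_trainer_metrics(new_logprobs, old_logprobs, clip_eps=0.2):
    """Summarize per-token logprobs from the trainer (new) and the rollout engine (old).

    Hypothetical helper: names and estimator choices are illustrative only.
    """
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("need two non-empty logprob lists of equal length")
    log_ratios = [new - old for new, old in zip(new_logprobs, old_logprobs)]
    ratios = [math.exp(d) for d in log_ratios]
    n = len(ratios)
    # Fraction of tokens whose ratio leaves the PPO-style range [1 - eps, 1 + eps].
    clip_rate = sum(1 for r in ratios if abs(r - 1.0) > clip_eps) / n
    # k3 estimator of KL(old || new) from tokens sampled under the old policy.
    kl_estimate = sum((r - 1.0) - d for r, d in zip(ratios, log_ratios)) / n
    return {"mean_ratio": sum(ratios) / n, "clip_rate": clip_rate, "kl_estimate": kl_estimate}

trainer = [-0.70, -1.20, -0.05]
# Same logprobs on both sides: ratio 1, nothing clipped, zero KL.
matched = rollout_vs_trainer_metrics(trainer, trainer)
# Rollout logprobs shifted down by a constant 0.5: every token is clipped.
shifted = rollout_vs_trainer_metrics(trainer, [lp - 0.5 for lp in trainer])
print(matched)
print(shifted)
```

Even when the policy itself is unchanged, a systematic offset on the rollout side is enough to move every one of these statistics.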

(Source: vLLM V0 to V1: Correctness Before Corrections in RL)

Four Corrections for V1 Backend

Unifying Logprob Semantics

The first issue was one of semantics. By default, vLLM V1 returns logprobs computed from the raw model output, that is, before post-processing such as temperature scaling, penalties, and top-k/top-p filtering. PipelineRL expected logprobs from the processed distribution that the sampler actually draws from.

The necessary setting is:

logprobs-mode=processed_logprobs

With this setting, the obvious average offset in the rollout logprobs disappeared.
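The gap between the two semantics can be illustrated without vLLM. The sketch below is a simplified stand-in for a sampler's post-processing, with hypothetical function names, and it covers only temperature and top-k; it is not vLLM's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax; -inf entries get probability zero."""
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - lse for x in logits]

def processed_log_softmax(logits, temperature=1.0, top_k=None):
    """Log-probs after temperature scaling and an optional top-k mask."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep the k largest logits (ties at the threshold are all kept).
        threshold = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= threshold else float("-inf") for x in scaled]
    return log_softmax(scaled)

logits = [2.0, 1.0, 0.0, -1.0]
raw = log_softmax(logits)                                          # raw model output
processed = processed_log_softmax(logits, temperature=0.7, top_k=2)  # what the sampler uses
sampled = 0  # index of the token that was sampled
print(f"raw={raw[sampled]:.4f} processed={processed[sampled]:.4f}")
```

For the sampled token, the processed logprob is higher than the raw one because sharpening and filtering move probability mass onto the surviving tokens. A trainer that expects one convention but receives the other sees a systematic offset of this kind.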

Adjusting Runtime Defaults

Runtime defaults were the second source of discrepancy. V0 and V1 ship with different default values for some settings, and those differences affected inference results.
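One way to surface such differences is to dump the effective settings of both deployments and diff them. The helper below is a generic sketch; the keys and values are hypothetical placeholders, since the source does not list which defaults changed.

```python
def diff_settings(reference, candidate):
    """Return {key: (reference_value, candidate_value)} for every key that differs."""
    unset = "<unset>"
    changed = {}
    for key in sorted(set(reference) | set(candidate)):
        ref_value = reference.get(key, unset)
        cand_value = candidate.get(key, unset)
        if ref_value != cand_value:
            changed[key] = (ref_value, cand_value)
    return changed

# Hypothetical placeholder keys and values, not real vLLM option names.
v0_effective = {"option_a": True, "option_b": 4096, "option_c": "auto"}
v1_effective = {"option_a": False, "option_b": 4096, "option_d": 1}

for key, (old, new) in diff_settings(v0_effective, v1_effective).items():
    print(f"{key}: V0={old!r} V1={new!r}")
```

Every key reported here is a candidate for being set explicitly rather than left to the engine's default.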

Correcting In-Flight Weight Update Paths

V1 also changed how weights are updated dynamically. The in-flight weight update mechanism behaved differently than in V0, so this path had to be corrected as well.

Applying fp32 lm_head

The final correction was to run lm_head, the final projection that produces the logits, in fp32. At lower precision, this projection introduced subtle discrepancies in the inference results.
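The effect of output precision can be sketched with plain Python. The example assumes the lower-precision format is bfloat16 (the source does not name it) and emulates bf16 rounding by hand; the logit values are made up for illustration.

```python
import math
import struct

def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round to nearest even, then drop the low 16 bits of the float32 pattern.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def log_softmax(logits):
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - lse for x in logits]

# Made-up logits: two near-tied candidates and one unlikely token.
fp32_logits = [10.03, 9.97, 3.0]
bf16_logits = [to_bf16(x) for x in fp32_logits]

fp32_logprobs = log_softmax(fp32_logits)
bf16_logprobs = log_softmax(bf16_logits)
gap = abs(fp32_logprobs[0] - bf16_logprobs[0])
print(bf16_logits)
print(f"logprob gap on token 0: {gap:.4f}")
```

In the range 8 to 16, adjacent bfloat16 values are 0.0625 apart, so both near-tied logits collapse to the same value and the top token's logprob shifts by a few hundredths. Small per-token errors like this are what the trainer's ratio and KL statistics pick up.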

These four corrections enabled V1 to produce results consistent with the V0 reference.

(Source: vLLM V0 to V1: Correctness Before Corrections in RL)

Diagnostic Approach During Migration

The ServiceNow AI team diagnosed issues by categorizing them into three layers:

Backend operation issues: logprob calculations, default settings, weight update processing
Integration layer issues: differences in API interfaces or calling methods
RL algorithm issues: implementation differences specific to reinforcement learning

It was crucial to rule out the first two categories, backend behavior and the integration layer, before suspecting the RL algorithm itself. Working through the layers in this order helped identify the root causes.
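A backend-level check can be as simple as scoring the same token sequence with the reference and the candidate engine and comparing the per-token logprobs. The helper below is a hypothetical sketch; the tolerance and the sample values are illustrative, not taken from the source.

```python
def compare_logprobs(reference, candidate, atol=1e-3):
    """Compare per-token logprobs for the same tokens from two backends."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("need two non-empty logprob lists of equal length")
    diffs = [cand - ref for ref, cand in zip(reference, candidate)]
    n = len(diffs)
    return {
        # A consistent sign suggests a systematic difference, such as logprob semantics.
        "mean_offset": sum(diffs) / n,
        # Large isolated gaps point at specific tokens worth inspecting.
        "max_abs_diff": max(abs(d) for d in diffs),
        "frac_outside_atol": sum(1 for d in diffs if abs(d) > atol) / n,
    }

# Illustrative values only.
v0_reference = [-0.5, -1.0, -2.0]
v1_candidate = [-0.4, -0.9, -1.9]
report = compare_logprobs(v0_reference, v1_candidate)
print(report)
```

As a rough heuristic, a nonzero mean offset matches the kind of average shift the logprob-mode fix removed, while small zero-mean differences are more typical of numeric precision effects.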

Summary

Setting logprobs-mode=processed_logprobs keeps logprob semantics consistent when migrating to vLLM V1
Explicitly setting runtime defaults keeps inference results consistent between V0 and V1
Correcting the in-flight weight update path and running lm_head in fp32 keeps training dynamics consistent in reinforcement learning systems
Ruling out backend and integration issues before adjusting the RL algorithm makes the migration more efficient