vLLM

Prerequisites

The Outlines VLLM model is intended to be used with a vLLM instance running on a separate server, which can be local or remote. Make sure you have a vLLM server running and accessible before using the VLLM model. As the vLLM client relies on the openai Python SDK, you also need the openai package installed. You can install all optional dependencies for the VLLM model with: pip install openai.
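
For example, assuming vLLM is installed on the machine hosting the server, you can start an OpenAI-compatible server matching the examples below with the vllm serve command (replace <model-name> with the model you want to serve; the port is chosen to match the examples in this page):

vllm serve <model-name> --port 8080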

If you want to use the vllm offline inference mode instead of the server mode, please refer to the VLLMOffline model documentation.

Model Initialization

To load the model, you can use the from_vllm function. The argument of the function is either an OpenAI or AsyncOpenAI instance from the openai library. Make sure the value of the base_url argument of the OpenAI client points to your running vLLM server. Consult the vLLM documentation on using an OpenAI client with a vLLM server for more information.

Based on whether the openai client instance is synchronous or asynchronous, you will receive a VLLM or AsyncVLLM model instance.

For instance:

import openai
import outlines

# Create the OpenAI client
sync_openai_client = openai.OpenAI(base_url="http://localhost:8080")
async_openai_client = openai.AsyncOpenAI(base_url="http://localhost:8080")

# Create a sync model
sync_model = outlines.from_vllm(sync_openai_client)
print(type(sync_model)) # <class 'outlines.models.vllm.VLLM'>

# Create an async model
async_model = outlines.from_vllm(async_openai_client)
print(type(async_model)) # <class 'outlines.models.vllm.AsyncVLLM'>

Text Generation

To generate text, you can simply call the model with a prompt.

For instance:

import openai
import outlines

# Create the model
openai_client = openai.OpenAI(base_url="http://localhost:8080")
model = outlines.from_vllm(openai_client)

# Call it to generate text
result = model("Write a short story about a cat.", max_tokens=100)
print(result) # 'In a quiet village where the cobblestones hummed softly beneath the morning mist...'

The VLLM model also supports streaming. For instance:

import openai
import outlines

# Create the model
openai_client = openai.OpenAI(base_url="http://localhost:8080")
model = outlines.from_vllm(openai_client)

# Stream text
for chunk in model.stream("Write a short story about a cat.", max_tokens=100):
    print(chunk) # 'In ...'

Asynchronous Calls

To make asynchronous calls, pass an AsyncOpenAI client to from_vllm instead of a regular OpenAI client. You will receive an AsyncVLLM model instance that supports async/await patterns.

Basic Async Generation

import asyncio
import openai
import outlines

async def generate_text():
    async_client = openai.AsyncOpenAI(base_url="http://localhost:8080")
    async_model = outlines.from_vllm(async_client)

    result = await async_model("Write a haiku about Python.", max_tokens=50)
    print(result)

asyncio.run(generate_text())

Async Streaming

The async model also supports streaming with async iteration:

import asyncio
import openai
import outlines

async def stream_text():
    async_client = openai.AsyncOpenAI(base_url="http://localhost:8080")
    async_model = outlines.from_vllm(async_client)

    async for chunk in async_model.stream("Tell me a story about a robot.", max_tokens=100):
        print(chunk, end="")

asyncio.run(stream_text())

Concurrent Async Requests

One of the main benefits of async calls is the ability to make multiple concurrent requests:

import asyncio
import openai
import outlines

async def generate_multiple():
    async_client = openai.AsyncOpenAI(base_url="http://localhost:8080")
    async_model = outlines.from_vllm(async_client)

    prompts = [
        "Write a tagline for a coffee shop.",
        "Write a tagline for a bookstore.",
        "Write a tagline for a gym."
    ]

    tasks = [async_model(prompt, max_tokens=30) for prompt in prompts]
    results = await asyncio.gather(*tasks)

    for prompt, result in zip(prompts, results):
        print(f"{prompt}\n{result}\n")

asyncio.run(generate_multiple())

Structured Generation

vLLM supports all output types available in Outlines. Simply provide an output_type after the prompt when calling the model. All structured generation features work with both synchronous and asynchronous models.

Simple Type

import openai
import outlines

output_type = int

openai_client = openai.OpenAI(base_url="http://localhost:8080")
model = outlines.from_vllm(openai_client)

result = model("How many countries are there in the world?", output_type)
print(result) # '200'

JSON Schema

import openai
import outlines
from typing import List
from pydantic import BaseModel

class Character(BaseModel):
    name: str
    age: int
    skills: List[str]

openai_client = openai.OpenAI(base_url="http://localhost:8080")
model = outlines.from_vllm(openai_client)

result = model("Create a character.", output_type=Character, frequency_penalty=1.5)
print(result) # '{"name": "Evelyn", "age": 34, "skills": ["archery", "stealth", "alchemy"]}'
print(Character.model_validate_json(result)) # name='Evelyn' age=34 skills=['archery', 'stealth', 'alchemy']

Multiple Choice

from typing import Literal
import openai
import outlines

output_type = Literal["Paris", "London", "Rome", "Berlin"]

openai_client = openai.OpenAI(base_url="http://localhost:8080")
model = outlines.from_vllm(openai_client)

result = model("What is the capital of France?", output_type, temperature=0)
print(result) # 'Paris'

Regex

import openai
import outlines
from outlines.types import Regex

output_type = Regex(r"\d{3}-\d{2}-\d{4}")

openai_client = openai.OpenAI(base_url="http://localhost:8080")
model = outlines.from_vllm(openai_client)

result = model("Generate a fake social security number.", output_type, top_p=0.1)
print(result) # '782-32-3789'

Context-Free Grammar

import openai
import outlines
from outlines.types import CFG

arithmetic_grammar = """
?start: sum

?sum: product
| sum "+" product   -> add
| sum "-" product   -> sub

?product: atom
| product "*" atom  -> mul
| product "/" atom  -> div

?atom: NUMBER           -> number
| "-" atom         -> neg
| "(" sum ")"

%import common.NUMBER
%import common.WS_INLINE

%ignore WS_INLINE
"""
output_type = CFG(arithmetic_grammar)

openai_client = openai.OpenAI(base_url="http://localhost:8080")
model = outlines.from_vllm(openai_client)

result = model("Write an addition.", output_type, extra_body={"guided_decoding_backend": "outlines"})
print(result) # '23 + 48'

Async Structured Generation

All structured generation features work seamlessly with async models:

import asyncio
import openai
import outlines
from pydantic import BaseModel

class User(BaseModel):
    name: str
    email: str
    age: int

async def generate_user():
    async_client = openai.AsyncOpenAI(base_url="http://localhost:8080")
    async_model = outlines.from_vllm(async_client)

    result = await async_model("Generate a random user profile.", output_type=User)
    user = User.model_validate_json(result)
    print(f"Name: {user.name}, Email: {user.email}, Age: {user.age}")

asyncio.run(generate_user())

Inference Arguments

When calling the model, you can provide optional parameters on top of the prompt and the output type. Those will be passed on to the chat.completions.create method of the OpenAI client.

An optional parameter of particular interest is extra_body, a dictionary containing arguments that are specific to vLLM and are not part of the standard openai interface. Among those, guided_decoding_backend lets you select the library the vLLM server uses to control structured generation; set it to outlines to generate structured text with Outlines.
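
For instance, the sketch below combines standard OpenAI sampling parameters with a vLLM-specific argument passed through extra_body (the base_url, prompt and parameter values are illustrative):

import openai
import outlines
from typing import Literal

openai_client = openai.OpenAI(base_url="http://localhost:8080")
model = outlines.from_vllm(openai_client)

# max_tokens and temperature are forwarded to chat.completions.create;
# vLLM-specific arguments such as guided_decoding_backend go in extra_body.
result = model(
    "Is the ocean salty? Answer yes or no.",
    output_type=Literal["yes", "no"],
    max_tokens=5,
    temperature=0,
    extra_body={"guided_decoding_backend": "outlines"},
)
print(result) # 'yes'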

See the vLLM documentation on extra parameters for the OpenAI-compatible server for more information on inference parameters.