vLLM
Installation
You need to install the vllm
library to use the vLLM integration. See the installation section for instructions to install vLLM for CPU or ROCm. To get started you can also run:
Load the model
Outlines supports models available via vLLM's offline batched inference interface. You can load a model using:
Or alternatively:
import vllm
from outlines import models
llm = vllm.LLM("microsoft/Phi-3-mini-4k-instruct")
model = models.VLLM(llm)
Models are loaded from the HuggingFace hub.
Device
The default installation of vLLM only allows to load models on GPU. See the installation instructions to run models on CPU.
You can pass any parameter that you would normally pass to vllm.LLM
, as keyword arguments:
from outlines import models
model = models.vllm(
"microsoft/Phi-3-mini-4k-instruct",
trust_remote_code=True,
gpu_memory_utilization=0.7
)
Main parameters:
Parameters | Type | Description | Default |
---|---|---|---|
tokenizer_mode |
str |
"auto" will use the fast tokenizer if available and "slow" will always use the slow tokenizer. | auto |
trust_remote_code |
bool |
Trust remote code when downloading the model and tokenizer. | False |
tensor_parallel_size |
int |
The number of GPUs to use for distributed execution with tensor parallelism. | 1 |
dtype |
str |
The data type for the model weights and activations. Currently, we support float32 , float16 , and bfloat16 . If auto , we use the torch_dtype attribute specified in the model config file. However, if the torch_dtype in the config is float32 , we will use float16 instead. |
auto |
quantization |
Optional[str] |
The method used to quantize the model weights. Currently, we support "awq", "gptq" and "squeezellm". If None, we first check the quantization_config attribute in the model config file. If that is None, we assume the model weights are not quantized and use dtype to determine the data type of the weights. |
None |
revision |
Optional[str] |
The specific model version to use. It can be a branch name, a tag name, or a commit id. | None |
tokenizer_revision |
Optional[str] |
The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. | None |
gpu_memory_utilization |
float |
The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, it may cause out-of-memory (OOM) errors. | 0.9 |
swap_space |
int |
The size (GiB) of CPU memory per GPU to use as swap space. This can be used for temporarily storing the states of the requests when their best_of sampling parameters are larger than 1. If all requests will have best_of=1 , you can safely set this to 0. Otherwise, too small values may cause out-of-memory (OOM) errors. |
4 |
enforce_eager |
bool |
Whether to enforce eager execution. If True, we will disable CUDA graph and always execute the model in eager mode. If False, we will use CUDA graph and eager execution in hybrid. | False |
enable_lora |
bool |
Whether to enable loading LoRA adapters | False |
See the vLLM code for a list of all the available parameters.
Use quantized models
vLLM supports AWQ, GPTQ and SqueezeLLM quantized models:
from outlines import models
model = models.vllm("TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
model = models.vllm("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ", quantization="gptq")
model = models.vllm("https://huggingface.co/squeeze-ai-lab/sq-llama-30b-w4-s5", quantization="squeezellm")
Dependencies
To use AWQ model you need to install the autoawq library pip install autoawq
.
To use GPTQ models you need to install the autoGTPQ and optimum libraries pip install auto-gptq optimum
.
Multi-GPU usage
To run multi-GPU inference with vLLM you need to set the tensor_parallel_size
argument to the number of GPUs available when initializing the model. For instance to run inference on 2 GPUs:
from outlines import models
model = models.vllm(
"microsoft/Phi-3-mini-4k-instruct"
tensor_parallel_size=2
)
Load LoRA adapters
You can load LoRA adapters and alternate between them dynamically:
from outlines import models
model = models.vllm("facebook/opt-350m", enable_lora=True)
model.load_lora("ybelkaa/opt-350m-lora") # Load LoRA adapter
model.load_lora(None) # Unload LoRA adapter
Generate text
In addition to the parameters described in the text generation section you can pass an instance of SamplingParams
directly to any generator via the sampling_params
keyword argument:
from vllm.sampling_params import SamplingParams
from outlines import models, generate
model = models.vllm("microsoft/Phi-3-mini-4k-instruct")
generator = generate.text(model)
params = SamplingParams(n=2, frequency_penalty=1., min_tokens=2)
answer = generator("A prompt", sampling_params=params)
This also works with generators built with generate.regex
, generate.json
, generate.cfg
, generate.format
and generate.choice
.
Note
The values passed via the SamplingParams
instance supersede the other arguments to the generator or the samplers.
SamplingParams
attributes:
Parameters | Type | Description | Default |
---|---|---|---|
n |
int |
Number of output sequences to return for the given prompt. | 1 |
best_of |
Optional[int] |
Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n . This is treated as the beam width when use_beam_search is True. By default, best_of is set to n . |
None |
presence_penalty |
float |
Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. | 0.0 |
frequency_penalty |
float |
Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. | 0.0 |
repetition_penalty |
float |
Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. | 1.0 |
temperature |
float |
Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. | 1.0 |
top_p |
float |
Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | 1.0 |
top_k |
int |
Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. | -1 |
min_p |
float |
Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this. | 0.0 |
seed |
Optional[int] |
Random seed to use for the generation. | None |
use_beam_search |
bool |
Whether to use beam search instead of sampling. | False |
length_penalty |
float |
Float that penalizes sequences based on their length. Used in beam search. | 1.0 |
early_stopping |
Union[bool, str] |
Controls the stopping condition for beam search. It accepts the following values: True , where the generation stops as soon as there are best_of complete candidates; False , where an heuristic is applied and the generation stops when is it very unlikely to find better candidates; "never" , where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm). |
False |
stop |
Optional[Union[str, List[str]]] |
List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. | None |
stop_token_ids |
Optional[List[int]] |
List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. | None |
include_stop_str_in_output |
bool |
Whether to include the stop strings in output text. Defaults to False. | False |
ignore_eos |
bool |
Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | False |
max_tokens |
int |
Maximum number of tokens to generate per output sequence. | 16 |
min_tokens |
int |
Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated | 0 |
skip_special_tokens |
bool |
Whether to skip special tokens in the output. | True |
spaces_between_special_tokens |
bool |
Whether to add spaces between special tokens in the output. Defaults to True. | True |
Streaming
Warning
Streaming is not available for the offline vLLM integration.
Installation
By default the vLLM library is installed with pre-commpiled C++ and CUDA binaries and will only run on GPU:
CPU
You need to have the gcc
compiler installed on your system. Then you will need to install vLLM from source. First clone the repository:
Install the Python packages needed for the installation:
pip install --upgrade pip
pip install wheel packaging ninja setuptools>=49.4.0 numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
and finally run:
See the vLLM documentation for more details, alternative installation methods (Docker) and performance tips.
ROCm
You will need to install vLLM from source. First install Pytorch on ROCm:
pip install torch==2.2.0.dev20231206+rocm5.7 --index-url https://download.pytorch.org/whl/nightly/rocm5.7 # tested version
You will then need to install flash attention for ROCm following these instructions. You can then install xformers=0.0.23
and apply the patches needed to adapt Flash Attention for ROCm:
And finally build vLLM:
cd vllm
pip install -U -r requirements-rocm.txt
python setup.py install # This may take 5-10 minutes.
See the vLLM documentation for alternative installation methods (Docker).