Llama.cpp
Outlines provides an integration with Llama.cpp using the llama-cpp-python library. llama.cpp makes it possible to run quantized models on machines with limited compute.
Installation
You need to install the llama-cpp-python library to use the llama.cpp integration. See the installation section for instructions to install llama-cpp-python with CUDA, Metal, ROCm and other backends.
Load the model
You can initialize the model by passing the name of the repository on the HuggingFace Hub, and the filenames (or glob pattern):
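For example, a minimal sketch using llama-cpp-python's Llama.from_pretrained helper (the repository and filename match the examples below; from_pretrained requires the huggingface-hub package):

from llama_cpp import Llama
from outlines import models

# Download the GGUF file from the Hugging Face Hub (cached locally) and load it
llm = Llama.from_pretrained(
    repo_id="TheBloke/phi-2-GGUF",
    filename="phi-2.Q4_K_M.gguf",
)
model = models.LlamaCpp(llm)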
This will download the model files to the hub cache folder and load the weights in memory.
You can also initialize the model by passing the path to the weights on your machine. Assuming Phi-2's weights are in the current directory:
from outlines import models
from llama_cpp import Llama
llm = Llama("./phi-2.Q4_K_M.gguf")
model = models.LlamaCpp(llm)
If you need more control, you can pass the same keyword arguments to the model as you would pass in the llama-cpp-python library:
from outlines import models

model = models.llamacpp(
    "TheBloke/phi-2-GGUF",
    "phi-2.Q4_K_M.gguf",
    n_ctx=512,  # to set the context length value
)
Main parameters:
| Parameters | Type | Description | Default |
|---|---|---|---|
| `n_gpu_layers` | `int` | Number of layers to offload to GPU. If `-1`, all layers are offloaded. | `0` |
| `split_mode` | `int` | How to split the model across GPUs. `1` for layer-wise split, `2` for row-wise split. | `1` |
| `main_gpu` | `int` | Main GPU. | `0` |
| `tensor_split` | `Optional[List[float]]` | How split tensors should be distributed across GPUs. If `None` the model is not split. | `None` |
| `n_ctx` | `int` | Text context size. Inferred from the model if set to `0`. | `0` |
| `n_threads` | `Optional[int]` | Number of threads to use for generation. All available threads if set to `None`. | `None` |
| `verbose` | `bool` | Print verbose outputs to `stderr`. | `False` |
See the llama-cpp-python documentation for the full list of parameters.
Load the model on GPU
Note
Make sure that you installed llama-cpp-python with GPU support.

To load the model on GPU, pass n_gpu_layers=-1:
from outlines import models
model = models.llamacpp(
"TheBloke/phi-2-GGUF",
"phi-2.Q4_K_M.gguf",
n_gpu_layers=-1, # to use GPU acceleration
)
This also works with generators built with generate.regex, generate.json, generate.cfg, generate.format and generate.choice.
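For instance, here is a sketch of constrained generation with generate.choice on a GPU-loaded model (the prompt and labels are illustrative):

from outlines import models, generate

model = models.llamacpp(
    "TheBloke/phi-2-GGUF",
    "phi-2.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
)

# Constrain the output to one of the two labels
generator = generate.choice(model, ["Positive", "Negative"])
answer = generator("Review: The film was a delight.\nSentiment:")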
Load LoRA adapters
You can load LoRA adapters dynamically:
from outlines import models, generate
model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)
answer_1 = generator("prompt")
model.load_lora("./path/to/adapter.gguf")
answer_2 = generator("prompt")
To load another adapter you need to re-initialize the model. Otherwise the adapter will be added on top of the previous one:
from outlines import models
model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
model.load_lora("./path/to/adapter1.gguf") # Load first adapter
model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
model.load_lora("./path/to/adapter2.gguf") # Load second adapter
Generate text
In addition to the parameters described in the text generation section, you can pass extra keyword arguments, for instance to set sampling parameters not exposed in Outlines' public API:
from outlines import models, generate
model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)
answer = generator("A prompt", presence_penalty=0.8)
Extra keyword arguments:
The values of the keyword arguments you pass to the generator supersede the values set when initializing the sampler or generator. All extra sampling methods and repetition penalties are disabled by default.
| Parameters | Type | Description | Default |
|---|---|---|---|
| `suffix` | `Optional[str]` | A suffix to append to the generated text. If `None` no suffix is added. | `None` |
| `echo` | `bool` | Whether to prepend the prompt to the completion. | `False` |
| `seed` | `int` | The random seed to use for sampling. | `None` |
| `max_tokens` | `Optional[int]` | The maximum number of tokens to generate. If `None` the maximum number of tokens depends on `n_ctx`. | `16` |
| `frequency_penalty` | `float` | The penalty to apply to tokens based on their frequency in the past 64 tokens. | `0.0` |
| `presence_penalty` | `float` | The penalty to apply to tokens based on their presence in the past 64 tokens. | `0.0` |
| `repeat_penalty` | `float` | The penalty to apply to repeated tokens in the past 64 tokens. | `1.0` |
| `stopping_criteria` | `Optional[StoppingCriteriaList]` | A list of stopping criteria to use. | `None` |
| `logits_processor` | `Optional[LogitsProcessorList]` | A list of logits processors to use. The logits processor used for structured generation will be added to this list. | `None` |
| `temperature` | `float` | The temperature to use for sampling. | `1.0` |
| `top_p` | `float` | The top-p value to use for nucleus sampling. | `1.0` |
| `min_p` | `float` | The min-p value to use for minimum-p sampling. | `0.0` |
| `typical_p` | `float` | The p value to use for locally typical sampling. | `1.0` |
| `stop` | `Optional[Union[str, List[str]]]` | A list of strings that stop generation when encountered. | `[]` |
| `top_k` | `int` | The top-k value used for top-k sampling. Negative value to consider all logit values. | `-1` |
| `tfs_z` | `float` | The tail-free sampling parameter. | `1.0` |
| `mirostat_mode` | `int` | The mirostat sampling mode. | `0` |
| `mirostat_tau` | `float` | The target cross-entropy for mirostat sampling. | `5.0` |
| `mirostat_eta` | `float` | The learning rate used to update `mu` in mirostat sampling. | `0.1` |
See the llama-cpp-python documentation for the full and up-to-date list of parameters and the llama.cpp code for the default values of other sampling parameters.
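As an illustration, several of these keyword arguments can be combined in a single call (a sketch; the prompt and values are arbitrary):

from outlines import models, generate

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)

# Values passed here supersede those set when the sampler or generator was created
answer = generator(
    "Write a haiku about the sea.",
    max_tokens=64,
    seed=42,
    repeat_penalty=1.1,
)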
Streaming
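Outlines generators support token streaming. A minimal sketch, assuming the generator's .stream() method:

from outlines import models, generate

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)

# Print tokens as they are generated instead of waiting for the full completion
for token in generator.stream("Tell me a short story about a llama."):
    print(token, end="")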
Installation
You need to install the llama-cpp-python library to use the llama.cpp integration.
CPU
For a CPU-only installation run:
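pip install llama-cpp-python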
Warning
Do not run this command if you want support for BLAS, Metal or CUDA. Follow the instructions below instead.
CUDA
It is also possible to install pre-built wheels with CUDA support (Python 3.10 and above):
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
Where `<cuda-version>` is one of the following, depending on the version of CUDA installed on your system:

- `cu121` for CUDA 12.1
- `cu122` for CUDA 12.2
- `cu123` for CUDA 12.3
Metal
It is also possible to install pre-built wheels with Metal support (Python 3.10 or above, macOS 11.0 and above):
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
OpenBLAS
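The llama-cpp-python project documents building with OpenBLAS support via CMake arguments; a sketch (the exact flag names depend on your llama-cpp-python version, so check its installation guide):

# Flag names may differ between llama-cpp-python releases; see its installation docs
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python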
Other backends
llama.cpp supports many other backends. Refer to the llama.cpp documentation to use the following backends:
- CLBlast (OpenCL)
- hipBLAS (ROCm)
- Vulkan
- Kompute
- SYCL