Llama.cpp
Outlines provides an integration with Llama.cpp through the llama-cpp-python library. Llama.cpp makes it possible to run quantized models on machines with limited compute.
Installation
You need to install the llama-cpp-python library to use the llama.cpp integration. See the installation section for instructions on installing llama-cpp-python with CUDA, Metal, ROCm and other backends. To get started quickly you can also run:
Load the model
You can initialize the model by passing the name of the repository on the Hugging Face Hub and the file name (or a glob pattern that matches it):
This will download the model files to the hub cache folder and load the weights in memory.
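To get a feel for what a glob pattern selects, the sketch below filters a file listing with Python's standard `fnmatch` module. The file list is hypothetical, and whether llama-cpp-python applies exactly these matching rules is an assumption, so check its documentation if a pattern behaves unexpectedly.

```python
from fnmatch import fnmatch

# Hypothetical listing of a GGUF repository, for illustration only.
repo_files = [
    "README.md",
    "phi-2.Q2_K.gguf",
    "phi-2.Q4_K_M.gguf",
    "phi-2.Q8_0.gguf",
]


def select_files(pattern, files):
    """Return the file names that match a shell-style glob pattern."""
    return [name for name in files if fnmatch(name, pattern)]


exact = select_files("phi-2.Q4_K_M.gguf", repo_files)   # exact file name
by_glob = select_files("*Q4_K_M.gguf", repo_files)      # pattern, one match
too_broad = select_files("*.gguf", repo_files)          # pattern, several matches
```

A pattern that matches a single file is the safest choice. Here `*.gguf` would match three files, which leaves the choice of weights ambiguous.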
You can also initialize the model by passing the path to the weights on your machine. Assuming Phi2's weights are in the current directory:
from outlines import models
from llama_cpp import Llama
llm = Llama("./phi-2.Q4_K_M.gguf")
model = models.LlamaCpp(llm)
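When loading from a local path it can help to fail early with a clear message if the file is missing. The helper below is a minimal sketch: `load_local_model` is a hypothetical name introduced for illustration, not part of Outlines, and it only wraps the two calls shown above.

```python
from pathlib import Path


def load_local_model(path):
    """Check that a GGUF file exists, then load it and wrap it for Outlines."""
    weights = Path(path)
    if not weights.is_file():
        raise FileNotFoundError(f"No GGUF file found at {weights.resolve()}")

    # Imported lazily so the path check runs before the heavy libraries load.
    from llama_cpp import Llama
    from outlines import models

    return models.LlamaCpp(Llama(str(weights)))
```

Calling `load_local_model("./phi-2.Q4_K_M.gguf")` would then behave like the snippet above, with an explicit error if the weights are not in the current directory.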
If you need more control, you can pass the same keyword arguments to the model as you would pass to the llama-cpp-python library:
from outlines import models
model = models.llamacpp(
    "TheBloke/phi-2-GGUF",
    "phi-2.Q4_K_M.gguf",
    n_ctx=512,  # to set the context length value
)
Main parameters:
Parameters | Type | Description | Default |
---|---|---|---|
n_gpu_layers | int | Number of layers to offload to GPU. If -1, all layers are offloaded. | 0 |
split_mode | int | How to split the model across GPUs. 1 for layer-wise split, 2 for row-wise split. | 1 |
main_gpu | int | Main GPU. | 0 |
tensor_split | Optional[List[float]] | How split tensors should be distributed across GPUs. If None, the model is not split. | None |
n_ctx | int | Text context size. Inferred from the model if set to 0. | 0 |
n_threads | Optional[int] | Number of threads to use for generation. All available threads are used if set to None. | None |
verbose | bool | Print verbose outputs to stderr. | False |
See the llama-cpp-python documentation for the full list of parameters.
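As an illustration of how the GPU-related parameters fit together, the sketch below builds a keyword-argument dictionary for a multi-GPU setup. `gpu_split_kwargs` is a hypothetical helper, and normalizing the shares so they sum to 1 is a choice made here for readability, not a documented requirement.

```python
def gpu_split_kwargs(shares, main_gpu=0):
    """Build keyword arguments that spread a model across several GPUs.

    `shares` are relative weights per GPU, e.g. [3, 1] for a 75/25 split.
    """
    if not shares or any(s < 0 for s in shares) or sum(shares) == 0:
        raise ValueError("shares must be non-negative and not all zero")
    total = float(sum(shares))
    return {
        "n_gpu_layers": -1,   # offload every layer
        "split_mode": 1,      # layer-wise split, as listed in the table
        "main_gpu": main_gpu,
        "tensor_split": [s / total for s in shares],
    }


kwargs = gpu_split_kwargs([3, 1])
```

The result can be unpacked into the loader, for example `models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf", **kwargs)`.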
Load the model on GPU
Note
Make sure that you installed llama-cpp-python with GPU support.
To load the model on GPU, pass n_gpu_layers=-1:
from outlines import models
model = models.llamacpp(
    "TheBloke/phi-2-GGUF",
    "phi-2.Q4_K_M.gguf",
    n_gpu_layers=-1,  # to use GPU acceleration
)
This also works with generators built with generate.regex, generate.json, generate.cfg, generate.format and generate.choice.
Load LoRA adapters
You can load LoRA adapters dynamically:
from outlines import models, generate
model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)
answer_1 = generator("prompt")
model.load_lora("./path/to/adapter.gguf")
answer_2 = generator("prompt")
To load another adapter you need to re-initialize the model. Otherwise the adapter will be added on top of the previous one:
from outlines import models
model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
model.load_lora("./path/to/adapter1.gguf") # Load first adapter
model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
model.load_lora("./path/to/adapter2.gguf") # Load second adapter
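The re-initialization pattern above can be wrapped in a small generator so that each adapter is always applied to a fresh base model. This is a sketch under stated assumptions: `iter_adapter_models` and `make_model` are hypothetical names, and the only method relied on is the `load_lora` call shown above.

```python
from typing import Callable, Iterable, Iterator, Tuple


def iter_adapter_models(
    make_model: Callable[[], object],
    adapter_paths: Iterable[str],
) -> Iterator[Tuple[str, object]]:
    """Yield (adapter_path, model) pairs, each built on a fresh base model."""
    for path in adapter_paths:
        model = make_model()    # re-initialize so adapters never stack
        model.load_lora(path)   # apply exactly one adapter to this instance
        yield path, model
```

With Outlines, `make_model` could be `lambda: models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")`. Because this is a generator, only one model needs to be alive at a time if you drop each one before requesting the next.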
Generate text
In addition to the parameters described in the text generation section you can pass extra keyword arguments, for instance to set sampling parameters not exposed in Outlines' public API:
from outlines import models, generate
model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)
answer = generator("A prompt", presence_penalty=0.8)
Extra keyword arguments:
The values of the keyword arguments you pass to the generator supersede the values set when initializing the sampler or generator. All extra sampling methods and repetition penalties are disabled by default.
Parameters | Type | Description | Default |
---|---|---|---|
suffix | Optional[str] | A suffix to append to the generated text. If None, no suffix is added. | None |
echo | bool | Whether to prepend the prompt to the completion. | False |
seed | int | The random seed to use for sampling. | None |
max_tokens | Optional[int] | The maximum number of tokens to generate. If None, the maximum number of tokens depends on n_ctx. | 16 |
frequency_penalty | float | The penalty to apply to tokens based on their frequency in the past 64 tokens. | 0.0 |
presence_penalty | float | The penalty to apply to tokens based on their presence in the past 64 tokens. | 0.0 |
repeat_penalty | float | The penalty to apply to repeated tokens in the past 64 tokens. | 1.0 |
stopping_criteria | Optional[StoppingCriteriaList] | A list of stopping criteria to use. | None |
logits_processor | Optional[LogitsProcessorList] | A list of logits processors to use. The logits processor used for structured generation will be added to this list. | None |
temperature | float | The temperature to use for sampling. | 1.0 |
top_p | float | The top-p value to use for nucleus sampling. | 1.0 |
min_p | float | The min-p value to use for minimum-p sampling. | 0.0 |
typical_p | float | The p value to use for locally typical sampling. | 1.0 |
stop | Optional[Union[str, List[str]]] | A string or list of strings that stop generation when encountered. | [] |
top_k | int | The top-k value used for top-k sampling. Use a negative value to consider all logit values. | -1 |
tfs_z | float | The tail-free sampling parameter. | 1.0 |
mirostat_mode | int | The mirostat sampling mode. | 0 |
mirostat_tau | float | The target cross-entropy for mirostat sampling. | 5.0 |
mirostat_eta | float | The learning rate used to update mu in mirostat sampling. | 0.1 |
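To make the three penalty parameters more concrete, here is a simplified sketch of how such penalties are commonly applied to logits over a window of recent tokens. It is modelled on the usual description of llama.cpp's behaviour and is an assumption, not a copy of its implementation; `apply_penalties` and the toy vocabulary are hypothetical.

```python
from collections import Counter


def apply_penalties(logits, recent_tokens, repeat_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0, window=64):
    """Return a copy of `logits` with penalties applied to recently seen tokens."""
    counts = Counter(recent_tokens[-window:])
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        logit = adjusted[token]
        # Repeat penalty: push the logit towards "less likely" whatever its sign.
        logit = logit * repeat_penalty if logit <= 0 else logit / repeat_penalty
        # Frequency penalty scales with the count; presence penalty is flat.
        logit -= count * frequency_penalty + presence_penalty
        adjusted[token] = logit
    return adjusted


logits = {"a": 2.0, "b": 1.0, "c": -1.0}
adjusted = apply_penalties(
    logits, ["a", "a", "c"],
    repeat_penalty=2.0, frequency_penalty=0.5, presence_penalty=0.25,
)
unchanged = apply_penalties(logits, ["a", "a", "c"])  # default values
```

With the default values (1.0, 0.0 and 0.0) the logits come back unchanged, which matches the statement that repetition penalties are disabled by default.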
See the llama-cpp-python documentation for the full and up-to-date list of parameters and the llama.cpp code for the default values of other sampling parameters.
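The precedence rule described in this section, where keyword arguments passed to the generator call win over values set at initialization, can be pictured as a dictionary merge. The sketch below is only a model of that rule, not Outlines' internal code, and `effective_settings` is a hypothetical name.

```python
def effective_settings(init_settings, call_kwargs):
    """Merge settings so that call-time keyword arguments take precedence."""
    merged = dict(init_settings)   # start from the values set at initialization
    merged.update(call_kwargs)     # call-time values overwrite them
    return merged


init_settings = {"temperature": 0.7, "top_p": 0.9}
call_kwargs = {"temperature": 0.2, "presence_penalty": 0.8}
settings = effective_settings(init_settings, call_kwargs)
```

Here `temperature` ends up at the call-time value, `top_p` keeps its initial value, and `presence_penalty` is added because it was only given at call time.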
Streaming
Installation
You need to install the llama-cpp-python library to use the llama.cpp integration.
CPU
For a CPU-only installation run:
Warning
Do not run this command if you want support for BLAS, Metal or CUDA. Follow the instructions below instead.
CUDA
It is also possible to install pre-built wheels with CUDA support (Python 3.10 and above):
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
Where <cuda-version> is one of the following, depending on the version of CUDA installed on your system:
- cu121 for CUDA 12.1
- cu122 for CUDA 12.2
- cu123 for CUDA 12.3
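If you script your installs, the mapping from CUDA version to wheel index can be expressed in a few lines. This is a sketch that covers only the three tags listed above; `cuda_wheel_index` is a hypothetical helper and other CUDA versions are deliberately rejected rather than guessed.

```python
WHEEL_INDEX = "https://abetlen.github.io/llama-cpp-python/whl"

# Only the tags listed in this section.
CUDA_TAGS = {"12.1": "cu121", "12.2": "cu122", "12.3": "cu123"}


def cuda_wheel_index(cuda_version):
    """Return the extra index URL for a CUDA version such as '12.1' or '12.2.140'."""
    major_minor = ".".join(cuda_version.split(".")[:2])
    if major_minor not in CUDA_TAGS:
        raise ValueError(f"No pre-built wheel listed for CUDA {cuda_version}")
    return f"{WHEEL_INDEX}/{CUDA_TAGS[major_minor]}"


url = cuda_wheel_index("12.1")
```

The returned URL is the value to pass to `--extra-index-url` in the pip command above.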
Metal
It is also possible to install pre-built wheels with Metal support (Python 3.10 and above, macOS 11.0 and above):
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
OpenBLAS
Other backends
llama.cpp supports many other backends. Refer to the llama.cpp documentation to use the following backends:
- CLBlast (OpenCL)
- hipBLAS (ROCm)
- Vulkan
- Kompute
- SYCL