# llama.cpp
Outlines provides an integration with llama.cpp through the `llama-cpp-python` library. llama.cpp lets you run quantized models on machines with limited compute.
## Installation

You need to install the `llama-cpp-python` library to use the llama.cpp integration. Install all optional dependencies of the `LlamaCpp` model with: `pip install outlines[llamacpp]`.

See the llama-cpp-python GitHub page for instructions on installing with CUDA, Metal, ROCm, and other backends.
## Model Initialization

To load the model, use the `from_llamacpp` function. Its single argument is a `Llama` model instance from the `llama_cpp` library. Consult the Llama class API reference for detailed information on how to create a model instance and on the various available parameters.

For instance:
```python
import outlines
from llama_cpp import Llama

model = outlines.from_llamacpp(
    Llama.from_pretrained(
        repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        filename="mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    )
)
```
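If you already have a GGUF file on disk, you can also construct the `Llama` instance directly from a local path instead of downloading it from the Hugging Face Hub. Below is a minimal sketch; the file path and context size are placeholders, not values from this guide:

```python
import outlines
from llama_cpp import Llama

# Build the Llama instance from a local GGUF file
# (the path and context size below are placeholders).
model = outlines.from_llamacpp(
    Llama(
        model_path="path/to/model.gguf",
        n_ctx=4096,
    )
)
```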
## Text Generation
To generate text, you can simply call the model with a prompt.
For instance:
```python
import outlines
from llama_cpp import Llama

# Create the model
model = outlines.from_llamacpp(
    Llama.from_pretrained(
        repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        filename="mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    )
)

# Call it to generate text
result = model("What's the capital of Latvia?", max_tokens=20)
print(result) # 'Riga'
```
The `LlamaCpp` model also supports streaming. For instance:
```python
import outlines
from llama_cpp import Llama

# Create the model
model = outlines.from_llamacpp(
    Llama.from_pretrained(
        repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        filename="mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    )
)

# Stream text
for chunk in model.stream("Write a short story about a cat.", max_tokens=100):
    print(chunk) # 'In...'
```
## Structured Generation

The `LlamaCpp` model supports all output types available in Outlines except for context-free grammars. Simply provide an `output_type` after the prompt when calling the model.
### Basic Type
```python
import outlines
from llama_cpp import Llama

output_type = int

model = outlines.from_llamacpp(
    Llama.from_pretrained(
        repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        filename="mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    )
)

result = model("How many countries are there in the world?", output_type)
print(result) # '200'
```
### JSON Schema
```python
from typing import List
from pydantic import BaseModel

import outlines
from llama_cpp import Llama

class Character(BaseModel):
    name: str
    age: int
    skills: List[str]

model = outlines.from_llamacpp(
    Llama.from_pretrained(
        repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        filename="mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    )
)

result = model("Create a character.", output_type=Character, max_tokens=200)
print(result) # '{"name": "Evelyn", "age": 34, "skills": ["archery", "stealth", "alchemy"]}'
print(Character.model_validate_json(result)) # name=Evelyn, age=34, skills=['archery', 'stealth', 'alchemy']
```
### Multiple Choice
```python
from typing import Literal

import outlines
from llama_cpp import Llama

output_type = Literal["Paris", "London", "Rome", "Berlin"]

model = outlines.from_llamacpp(
    Llama.from_pretrained(
        repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        filename="mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    )
)

result = model("What is the capital of France?", output_type)
print(result) # 'Paris'
```
### Regex
```python
from outlines.types import Regex

import outlines
from llama_cpp import Llama

output_type = Regex(r"\d{3}-\d{2}-\d{4}")

model = outlines.from_llamacpp(
    Llama.from_pretrained(
        repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        filename="mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    )
)

result = model("Generate a fake social security number.", output_type)
print(result) # '782-32-3789'
```
## Inference Arguments
When calling the model, you can provide optional inference parameters on top of the prompt and the output type. These parameters are passed on to the `__call__` method of the `llama_cpp.Llama` model. Some common inference arguments include `max_tokens`, `temperature`, `frequency_penalty`, and `top_p`.
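As a sketch, assuming `model` was created as in the examples above, a call with a few of these arguments could look like the following; the values are illustrative rather than recommendations:

```python
# Inference arguments are forwarded to llama_cpp.Llama.__call__
# (the values below are only illustrative).
result = model(
    "Describe the water cycle in one sentence.",
    max_tokens=64,
    temperature=0.7,
    top_p=0.9,
    frequency_penalty=0.5,
)
print(result)
```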
See the llama-cpp-python documentation for more information on inference parameters.