# mlx-lm
Outlines provides an integration with mlx-lm, allowing models to be run quickly on Apple Silicon via the mlx library.
## Installation
You need to install the `mlx` and `mlx-lm` libraries on a device which supports Metal to use the mlx-lm integration.
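Both libraries are published on PyPI, so on a supported machine you can typically install them with pip:

```shell
pip install mlx mlx-lm
```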
## Load the model
You can initialize the model by passing the name of a repository on the Hugging Face Hub. The official repository for mlx-lm supported models is mlx-community.
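For example, to load a quantized Llama model from mlx-community (the same model used in the examples below):

```python
from outlines import models

model = models.mlxlm("mlx-community/Meta-Llama-3.1-8B-Instruct-8bit")
```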
This will download the model files to the hub cache folder and load the weights in memory.
The arguments `model_config` and `tokenizer_config` are available to modify loading behavior. For example, per the mlx-lm documentation, you must set an `eos_token` for `qwen/Qwen-7B`. In Outlines you may do so via:
```python
from outlines import models

model = models.mlxlm(
    "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
```
**Main parameters:**

*(Subject to change. Table based on the `mlx-lm.load` docstring.)*
| Parameters | Type | Description | Default |
|---|---|---|---|
| `tokenizer_config` | `dict` | Configuration parameters specifically for the tokenizer. Defaults to an empty dictionary. | `{}` |
| `model_config` | `dict` | Configuration parameters specifically for the model. Defaults to an empty dictionary. | `{}` |
| `adapter_path` | `str` | Path to the LoRA adapters. If provided, applies LoRA layers to the model. | `None` |
| `lazy` | `bool` | If `False`, evaluate the model parameters to make sure they are loaded in memory before returning. | `False` |
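As a sketch of how these parameters compose, assuming the keyword arguments are forwarded to `mlx_lm.load` as the table describes (the adapter path below is illustrative):

```python
from outlines import models

model = models.mlxlm(
    "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit",
    adapter_path="path/to/lora-adapters",  # illustrative path to LoRA adapter weights
    lazy=False,  # load all weights into memory before returning
)
```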
## Generate text
You may generate text using the parameters described in the text generation documentation.
With the loaded model, you can generate text or perform structured generation, for example:
```python
from outlines import models, generate

model = models.mlxlm("mlx-community/Meta-Llama-3.1-8B-Instruct-8bit")
generator = generate.text(model)
answer = generator("A prompt", temperature=2.0)
```
## Streaming
You may create a streaming iterator with minimal changes:
```python
from outlines import models, generate

model = models.mlxlm("mlx-community/Meta-Llama-3.1-8B-Instruct-8bit")
generator = generate.text(model)

for token_str in generator.stream("A prompt", temperature=2.0):
    print(token_str)
```
## Structured
You may perform structured generation with mlxlm to guarantee your output will match a regex pattern, JSON schema, or Lark grammar.
Example: phone number generation with the pattern `"\\+?[1-9][0-9]{7,14}"`:
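A minimal sketch using `generate.regex` (the prompt string below is illustrative):

```python
from outlines import models, generate

model = models.mlxlm("mlx-community/Meta-Llama-3.1-8B-Instruct-8bit")

# Constrain generation so the output always matches the phone-number pattern
phone_generator = generate.regex(model, "\\+?[1-9][0-9]{7,14}")
phone_number = phone_generator("Give me a phone number: ")
print(phone_number)
```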