Working with Llama models
This guide provides step-by-step guidance on using Llama.cpp based models in `arey`. We will download a model from Huggingface, configure `arey` to use the model, and finally run a few commands.
Ollama is an alternative way to download models automatically and run them locally. See our Ollama guide if you prefer an automated quick start.
Concepts
Llama is a family of large language models (LLMs) from Meta. While this was the initial open-weights architecture available for local use, Mistral and its fine-tunes are quite popular too.
LLMs are usually referred to by the number of parameters they were trained with. E.g., 7B, 13B, 33B and 70B are the usual buckets. The size of the model grows with the number of training parameters.
Llama.cpp provides a quantization mechanism (GGUF) to reduce the file size, which allows running these models on smaller form factors, including CPU-only devices. You will see the quantization level in the model name. E.g., `Q4_K_M` implies 4-bit quantization.
Choosing a quantization
Always prefer a lower quantization of a higher-parameter model. E.g., Q4_K_M of a 13B model is better than Q8_0 of a 7B model.
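As a rough illustration, a 13B model at Q4_K_M takes about 8 GB on disk, which is comparable to a 7B model at Q8_0 (around 7 GB), yet the larger model typically gives noticeably better responses for the same memory budget. Exact sizes vary by model family.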
Get the models
Use the Huggingface search to find GGUF models.
Ollama maintains a registry of their supported models.
https://www.reddit.com/r/LocalLLaMA/ is a fantastic community to stay updated and learn more about local models.
Our favorite models
Model | Parameters | Quant | Purpose |
---|---|---|---|
OpenHermes-2.5-Mistral | 7B | Q4_K_M | General chat |
Deepseek-Coder-6.7B | 7B | Q4_K_M | Coding |
NousHermes-2-Solar-10.7B | 11B | Q4_K_M | General chat |
After you locate the Huggingface repository, download the model locally. Here's an example that downloads the TinyDolphin model.
```sh
$ mkdir ~/models
$ cd ~/models

# If wget is not available on your platform, open the below link
# in your browser and save it to ~/models.
# Size of this model: ~668MB
$ wget https://huggingface.co/s3nh/TinyDolphin-2.8-1.1b-GGUF/resolve/main/tinydolphin-2.8-1.1b.Q4_K_M.gguf
# ...

$ ls
tinydolphin-2.8-1.1b.Q4_K_M.gguf
```
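Alternatively, if you have the `huggingface_hub` Python package installed, its CLI can fetch the same file. This is a sketch; the exact flags may depend on your `huggingface_hub` version.

```sh
# Requires: pip install -U huggingface_hub
# Downloads the single GGUF file into ~/models.
$ huggingface-cli download s3nh/TinyDolphin-2.8-1.1b-GGUF \
    tinydolphin-2.8-1.1b.Q4_K_M.gguf --local-dir ~/models
```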
Configure
Let's add an entry for this model in arey's config file.
Noteworthy changes to the configuration file:
- Lines 7-10: we added a new model definition with the path of the downloaded model.
- Line 38: we instruct `arey` to use the `tinydolphin` model for chat.
- Line 41: `arey` will use `tinydolphin` for the queries in the `ask` command.
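For illustration, the relevant entries might look roughly like the sketch below. The key names (`models`, `chat`, `task`) and layout are assumptions based on a typical setup; compare with the default config file that `arey` creates on first run for the exact schema and line numbers.

```yaml
# Hypothetical excerpt of arey's config file; key names are illustrative.
models:
  tinydolphin:                 # model definition for the downloaded file
    path: ~/models/tinydolphin-2.8-1.1b.Q4_K_M.gguf
    type: gguf

chat:
  model: tinydolphin           # model used by the chat command

task:
  model: tinydolphin           # model used for ask queries
```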
Usage
You can use the `chat`, `ask` or `play` commands to run this model.
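For example (the arguments below are illustrative; run `arey --help` for the full syntax):

```sh
# Start an interactive chat session
$ arey chat

# Ask a one-off question from the shell
$ arey ask "How do I sort a list in Python?"

# Run the prompt in a markdown file whenever it is saved
$ arey play ~/prompts/draft.md
```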
Completion settings
You can use profiles to configure `arey`. A profile is a collection of settings used for tuning the AI model's response. It usually includes the following parameters:
Parameter | Value | Purpose |
---|---|---|
max_tokens | 512 | Maximum number of tokens to generate |
repeat_penalty | 1-2 | Higher value discourages repetition of tokens |
stop | [] | Comma-separated list of stop words |
temperature | 0.0-1.0 | Lower temperature implies a more precise response |
top_k | 0-30 | Number of tokens to consider for sampling |
top_p | 0.0-1.0 | Lower value samples from the most likely tokens |
See the list of all parameters in the create_completion API documentation.
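As an illustration, a profile in the config file could look something like this. The `profiles` key and the nesting are assumptions; the parameter names follow the table above, and arey's default config is the authoritative reference.

```yaml
# Hypothetical "precise" profile; tweak values to taste.
profiles:
  precise:
    max_tokens: 512
    temperature: 0.1
    repeat_penalty: 1.17
    top_k: 20
    top_p: 0.6
    stop: []
```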
Chatting with the model
Let's run `arey chat` to start a chat session. See below for an illustration.
```
❯ arey chat
Welcome to arey chat!
Type 'q' to exit.
✓ Model loaded. 0.13s.

How can I help you today?

> Who are you?

I am an artificial intelligence model that has been programmed to simulate human behavior, emotions, and responses based on data gathered from various sources. My primary goal is to provide
assistance in various areas of life, including communication, problem-solving, decision-making, and learning.

◼ Completed. 0.49s to first token. 2.10s total. 75.58 tokens/s. 159 tokens. 64 prompt tokens.

>
```
See the quickstart for an example of the `ask` and `play` commands.