# Working with Ollama models
This guide provides step-by-step instructions for using Ollama-based models in `arey`. We will download a model with `ollama`, configure `arey` to use the model, and finally run a few commands.
Llama.cpp is an alternate way to download models and run them locally. See our Llama guide if you prefer a more customizable way to manage your models.
## Concepts
Llama is a family of large language models (LLMs) from Meta. While it was the first open-weights architecture widely available for local use, Mistral and its fine-tunes are quite popular too.
LLMs are commonly described by the number of parameters they were trained with; 7B, 13B, 33B, and 70B are the usual buckets. The size of a model increases with its parameter count.
Llama.cpp provides a quantization mechanism (the GGUF format) to reduce the file size and allows running these models on smaller form factors, including CPU-only devices. You will see the quantization in the model name; e.g., `Q4_K_M` implies 4-bit quantization.
### Choosing a quantization

Always prefer a lower quantization of a higher-parameter model. E.g., `Q4_K_M` of a 13B model is better than `Q8_0` of a 7B model.
## Get the models
Ollama maintains a registry of supported models at https://ollama.com/library.
https://www.reddit.com/r/LocalLLaMA/ is a fantastic community to stay updated and learn more about local models.
### Our favorite models
| Model | Parameters | Quant | Purpose |
| --- | --- | --- | --- |
| OpenHermes-2.5-Mistral | 7B | Q4_K_M | General chat |
| Deepseek-Coder-6.7B | 7B | Q4_K_M | Coding |
| NousHermes-2-Solar-10.7B | 11B | Q4_K_M | General chat |
After you locate the model in the registry, download and run it with `ollama`. In this example, we will try the TinyDolphin model.
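For example, using the `tinydolphin` tag from the Ollama registry:

```sh
# Download the model (if not already present) and start an interactive session
ollama run tinydolphin

# Or fetch the weights without starting a session
ollama pull tinydolphin
```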
## Configure
Let's add an entry for this model in `arey`'s config file.
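A minimal sketch of the relevant entries is shown below. The key names are illustrative assumptions based on the notes that follow; verify them against the schema of your own `arey` config file.

```yaml
# Sketch of the relevant arey config entries; key names are assumptions.
models:
  tinydolphin:
    name: tinydolphin   # model tag as known to ollama
    type: ollama        # assumed marker for ollama-backed models

chat:
  model: tinydolphin    # model used by `arey chat`

task:
  model: tinydolphin    # model used by `arey ask`
```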
Noteworthy changes to the configuration file:

- We added a new model definition referring to the downloaded model.
- We instruct `arey` to use the `tinydolphin` model for chat.
- `arey` will use `tinydolphin` for the queries in the `ask` command.
## Usage
You can use the `chat`, `ask`, or `play` commands to run this model.
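For example (exact flags and arguments may vary by `arey` version):

```sh
# Start an interactive chat session with the configured chat model
arey chat

# Ask a one-off question; the task model answers it
arey ask "Who are you?"

# Open a playground file and regenerate the completion on every save
arey play
```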
### Completion settings
You can use profiles to configure `arey`. A profile is a collection of settings used for tuning the AI model's response. It usually includes the following parameters:
| Parameter | Value | Purpose |
| --- | --- | --- |
| `num_predict` | 512 | Maximum number of tokens to generate |
| `repeat_penalty` | 1-2 | Higher value discourages repetition of tokens |
| `stop` | [] | Comma-separated list of stop words |
| `temperature` | 0.0-1.0 | Lower temperature implies a more precise response |
| `top_k` | 0-30 | Number of tokens to consider for sampling |
| `top_p` | 0.0-1.0 | Lower value samples from the most likely tokens |
See the list of all parameters in Ollama's Modelfile documentation.
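For illustration, a profile wiring these parameters together might look like the sketch below; the `profiles` key and its nesting are assumptions, while the parameter names follow the table above.

```yaml
# Hypothetical profile definition; the surrounding structure is an assumption.
profiles:
  precise:
    num_predict: 512
    repeat_penalty: 1.17
    stop: []
    temperature: 0.1
    top_k: 30
    top_p: 0.1
```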
### Chatting with the model
Let's run `arey chat` to start a chat session. See below for an illustration.
```
❯ arey chat
Welcome to arey chat!
Type 'q' to exit.

✓ Model loaded. 0.13s.
How can I help you today?

> Who are you?
I am an artificial intelligence model that has been programmed to simulate human behavior, emotions, and responses based on data gathered from various sources. My primary goal is to provide assistance in various areas of life, including communication, problem-solving, decision-making, and learning.

◼ Completed. 0.49s to first token. 2.10s total. 75.58 tokens/s. 159 tokens. 64 prompt tokens.

>
```
See the quickstart for examples of the `ask` and `play` commands.