~ 3 min read
Using Ollama to Run LLMs on the Raspberry Pi 5
I recently got my hands on a Raspberry Pi 5. It’s made right here in Wales at the Sony factory in Pencoed. I have every previous iteration of it because it’s such a versatile and affordable little device.
For the 5th version I went for the larger 8GB model from Pimoroni, at just under £80. I wanted to see whether it was possible to run large language models on it and how they performed. It turns out it is, and Ollama makes it incredibly easy to do so.
Install
I’ve recorded myself going through the motions here in this video, half expecting it to fail on a fresh install of Raspbian. Installation only requires running the given curl command, and then you can get straight to downloading and running models. You should of course inspect the script first to make sure you’re happy with its contents.
curl https://ollama.ai/install.sh | sh
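If you’d rather read the script before it touches your system, you can download it, review it, and run it as a separate step (the filename here is just for illustration):
# fetch the installer, review it, then run it
curl -fsSL https://ollama.ai/install.sh -o ollama-install.sh
less ollama-install.sh
sh ollama-install.sh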
Performance
I ran llama 2 (uncensored) and tinyllama, along with llava for some image analysis. Initially I was very impressed with tinyllama, which generated responses at about 15 tokens/s, but that is a model with only 1B parameters. Default models typically start at 7B parameters.
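Each model can be pulled and run with a single command; ollama run downloads the model on first use, and the names below are as they appear in the Ollama library at the time of writing:
ollama run tinyllama
ollama run llama2-uncensored
# for llava, you can point it at a local image by including the file path in your prompt
ollama run llava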
llama 2 and llava, with their default 7B parameter models, are, as one might expect, much slower. I was getting under 2 tokens/s from each for relatively simple prompts, which meant a response time of several minutes per prompt. For comparison, my 16GB M1 Pro manages about 30 tokens/s and my M1 about half that, so responses come back in a couple of seconds.
If you want to see how each model performs on your own machine, you can supply the --verbose flag when you run it:
ollama run llama2-uncensored --verbose
Then go ahead and enter your prompt; in this case I asked for a regex for matching emails. After the model’s response you get some stats like the following, showing how fast the model is running:
total duration: 3m33.368881798s
load duration: 543.462µs
prompt eval count: 33 token(s)
prompt eval duration: 15.3866445s
prompt eval rate: 2.14 tokens/s
eval count: 352 token(s)
eval duration: 3m17.976257s
eval rate: 1.78 tokens/s
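The same flag also works on a one-shot run, if you’d rather pass the prompt as an argument and have the stats printed as soon as the response finishes (the prompt below is just an example):
ollama run llama2-uncensored --verbose "Write a regex for matching email addresses"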
You can see this isn’t going to win any performance awards on my Pi, but what’s really impressive is that it’s running fully locally on an £80 computer! I’ll take that, even if I have to wait a little while for a response. You can imagine smaller models like tinyllama being perfectly acceptable in an educational setting.