Complete solution for running large language models locally using Ollama
I took a detour down the rabbit-hole of local LLMs (worth it) and learned how to run models locally, serve responses to a GUI instead of the default terminal, tunnel via ngrok, and expose the backend to a public client.
It definitely felt like drinking from a firehose, and I leaned on AI heavily throughout, since there were so many moving pieces I wasn't familiar with.
Some of the more interesting things I learned along the way include:
- Running LLMs locally with Ollama
- Setting model hyperparameters
- Chunking model responses
- Model quantization
- ngrok tunnelling
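To give a flavour of the first few items, here's a minimal sketch of calling a local Ollama server with streaming enabled. It assumes Ollama is running on its default port (11434) and that a model named "llama3" has already been pulled; the model name and temperature value are just placeholders. Ollama streams newline-delimited JSON chunks, and hyperparameters go under the "options" key.

```python
import json
import urllib.request

def join_chunks(lines):
    """Stitch Ollama's streamed chunks back into one string.

    Each line is a JSON object carrying a 'response' fragment;
    the final chunk has 'done': true."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def generate(prompt, model="llama3", temperature=0.7):
    """Send a streaming generate request to a local Ollama server."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": True,
        # Model hyperparameters live under "options" in the Ollama API.
        "options": {"temperature": temperature},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response body is newline-delimited JSON, one chunk per line.
        return join_chunks(resp)

# Example (requires a running Ollama server):
# print(generate("Why is the sky blue?"))
```

Handling the raw chunked stream yourself like this is what Part 2 below digs into; in practice you'd forward each chunk to the GUI as it arrives rather than joining them first.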
If you're interested in doing this too, I've linked a 3-part walkthrough below. You can also check out the repo for some helpful documentation on how to choose the right model, install dependencies, configure your server, deploy your app, and manage model responses.
Part 1: Run LLMs Locally with Ollama CLI
Part 2: Handling Raw Bytes Stream from Ollama API Endpoint
Part 3: Exposing Your Local API for Remote Access w/ ngrok
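For the last step, the core of it is a single command: pointing an ngrok tunnel at the port your local API listens on. This assumes ngrok is installed, an authtoken is configured, and that your server runs on Ollama's default port (swap in your own port if you put a backend in front of it).

```shell
# Expose the local API (here, Ollama's default port) via an ngrok tunnel.
# ngrok prints a public forwarding URL you can hand to a remote client.
ngrok http 11434
```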