Cloud-based AI coding tools are incredibly useful, but they come with a tradeoff: your code gets sent to external servers. I wanted the benefits of an AI coding assistant without that compromise, so I built Local Coder: a CLI tool that runs entirely on your local machine using llama.cpp and open-source GGUF models.
Why I Built This
I use AI coding tools daily, and tools like Claude Code have become a core part of my workflow. But there are situations where sending code to an external API is not ideal — working on proprietary projects, coding offline, or just wanting full control over your data. I wanted something that gave me a similar interactive experience but kept everything local.
What It Does
Local Coder is a Python CLI with three main modes:
- Interactive Chat — Start a multi-turn conversation with `python main.py chat`. Ask coding questions, get explanations, and iterate on ideas with streaming markdown-rendered responses right in your terminal.
- One-Shot Questions — Ask a quick question without entering a session using `python main.py ask "your question here"`.
- Code Editing — Request changes to files with `python main.py edit "Add error handling to @helpers.py"`. It supports a preview-and-confirmation workflow so you can review changes before they are applied.
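The preview-and-confirmation step in edit mode can be sketched with Python's standard difflib. This is an illustrative reimplementation, not the project's actual internals; the function names here are hypothetical.

```python
import difflib


def preview_changes(path: str, original: str, updated: str) -> str:
    """Build a unified diff so the user can review an edit before applying it.
    (Hypothetical helper; the real tool's internals may differ.)"""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


def confirm_and_apply(path: str, original: str, updated: str) -> bool:
    """Show the diff, then apply only if the user explicitly accepts."""
    print(preview_changes(path, original, updated))
    return input("Apply changes? [y/N] ").strip().lower() == "y"
```

Generating the diff before touching the file is what makes the workflow safe: nothing is written to disk until the user says yes.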
The Tech Stack
The project is built with:
- llama-cpp-python for running GGUF models locally with GPU acceleration
- Typer for the CLI framework
- Rich for beautiful terminal rendering with syntax-highlighted markdown and code blocks
By default I use the Qwen2.5-Coder 7B model, but you can swap in any GGUF model (DeepSeek-Coder, CodeLlama, or whatever works best for your use case) with the `/model` command in chat. The tool supports both CUDA and Apple Silicon Metal acceleration, so it runs fast on most modern hardware.
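Loading a GGUF model with llama-cpp-python looks roughly like this. The model path is illustrative, `n_gpu_layers=-1` offloads every layer to the GPU (CUDA or Metal, whichever the wheel was built for), and the import is guarded so the sketch degrades gracefully when the library is absent:

```python
try:
    from llama_cpp import Llama
    HAVE_LLAMA = True
except ImportError:
    HAVE_LLAMA = False

# Hypothetical path; point this at whatever GGUF file you downloaded.
MODEL_PATH = "models/qwen2.5-coder-7b-instruct-q4_k_m.gguf"


def load_model(model_path: str = MODEL_PATH,
               n_ctx: int = 8192,
               n_gpu_layers: int = -1):
    """Load a local GGUF model; -1 offloads all layers to the GPU."""
    if not HAVE_LLAMA:
        raise RuntimeError("llama-cpp-python is not installed")
    return Llama(
        model_path=model_path,
        n_ctx=n_ctx,              # context window size
        n_gpu_layers=n_gpu_layers,
        verbose=False,
    )
```

Swapping models is then just a matter of pointing `model_path` at a different GGUF file.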
How It Works Under the Hood
The architecture is straightforward:
- User input is parsed by Typer in `main.py`
- `@filename` references are extracted and file contents are loaded by `helpers.py`
- The prompt is constructed with proper context by `prompt_builder.py`
- llama.cpp processes the prompt and streams tokens back
- Rich renders the markdown output in real time
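The `@filename` step can be sketched with a small regex pass. This is an illustrative reimplementation under assumed names, not the exact code in `helpers.py`:

```python
import re
from pathlib import Path

# Matches @helpers.py, @src/utils.py, etc.
FILE_REF = re.compile(r"@([\w./-]+)")


def extract_file_refs(prompt: str) -> list[str]:
    """Pull @filename references out of the user's request."""
    return FILE_REF.findall(prompt)


def load_file_context(prompt: str) -> dict[str, str]:
    """Read each referenced file so its contents can be embedded in the prompt."""
    return {
        ref: Path(ref).read_text()
        for ref in extract_file_refs(prompt)
        if Path(ref).is_file()
    }
```

The loaded contents are then handed to the prompt builder, which wraps them in context before the request reaches the model.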
There is also Docker support if you prefer running it in a container.
Update

I recently added a custom skill to the repo called local-coder. It gives other coding agents (Claude Code, Cursor, Codex, etc.) the instructions they need to set up and use this project's functionality.
What I Learned
Building this taught me a lot about working with local LLMs. Tuning the context window size (n_ctx), managing GPU layer offloading, and picking the right quantization level all have a noticeable impact on the balance between speed and quality. A Q4_K_M quantization hits a sweet spot for most coding tasks — fast enough for interactive use while retaining good output quality.
I also learned that prompt construction matters even more with smaller local models than it does with large cloud models. Being precise about how you format file context and instructions makes a big difference in the quality of responses you get from a 7B parameter model.
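Concretely, a prompt template along these lines keeps file context and instructions clearly delimited. This is an illustrative sketch, not the project's exact format in `prompt_builder.py`:

```python
def build_edit_prompt(instruction: str, files: dict[str, str]) -> str:
    """Format file context with explicit delimiters; smaller models follow
    instructions much more reliably when each section is clearly labeled."""
    parts = ["You are a coding assistant. Use only the files provided below.\n"]
    for path, content in files.items():
        parts.append(f"### File: {path}\n```\n{content}\n```\n")
    parts.append(f"### Task\n{instruction}\n")
    return "\n".join(parts)
```

With a 7B model, this kind of explicit labeling (file headers, fenced content, a separate task section) noticeably reduces the chance of the model confusing instructions with code.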
Try It Out
The project is open source and available on GitHub. If you have a machine with a decent GPU (or even just an M-series Mac), give it a try. Your code stays on your machine, and you might be surprised at how capable local models have become for everyday coding tasks.