Local AI coding
Local LLM inference on macOS Apple Silicon (M-series) — no API keys, no rate limits, no cloud — used as the default backend for opencode and pi (and as a generic OpenAI-compatible endpoint for anything else).
Overview
| Layer | Tool | Where it lives |
|---|---|---|
| Server | mlxserve (mlx-openai-server) | LaunchAgent dev.cade.mlxserve (KeepAlive) or foreground shell function; port 8080, OpenAI-compat + tool calling |
| Server (fallback) | Ollama | LaunchAgent, port 11434, OpenAI-compat |
| Client | opencode, pi | Both point at localhost:8080/v1 by default on macOS |
| Cloud | Anthropic, OpenAI | Available everywhere via ANTHROPIC_API_KEY / OPENAI_API_KEY |
MLX is the primary backend because it’s roughly 2-3× faster than Ollama
(llama.cpp) on the M3 Max for the same quants, and mlx-openai-server adds
OpenAI tool-call parsing on top — which mlx_lm.server upstream still lacks.
Ollama remains installed as a plain fallback.
Quick start
# LaunchAgent (preferred — survives terminal close, KeepAlive):
mlxstart # launchctl bootstrap dev.cade.mlxserve
mlxstatus # is it running?
mlxstop
# Or foreground in a terminal:
mlxserve # default: Qwen3.6-27B 8-bit (served as "qwen3.6-27b")
mlxserve qwen3.6-35b-a3b # MoE alternative — fast tokens (3B active)
mlxserve coder-next # Qwen3-Coder-Next 80B/3B MoE (no thinking)
# Then launch any client:
opencode # TUI agent, full tool-calling loop
pi # TUI agent, full tool-calling loop
All requests use the served-model-name qwen3.6-27b regardless of which
physical model is loaded — client configs stay stable when you swap models.
mlxserve and mlx-openai-server
mlxserve is a shell function (defined in both .zshrc and .bashrc) that
starts mlx-openai-server with the right parsers for the chosen model:
mlx-openai-server launch \
--model-type lm \
--model-path unsloth/Qwen3.6-27B-MLX-8bit \
--served-model-name qwen3.6-27b \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--reasoning-parser qwen3_5 \
--kv-bits 8 --kv-group-size 64 \
--host 127.0.0.1 --port 8080
The parser flags are critical: opencode and pi are tool-call-heavy, and the
upstream mlx_lm.server does not emit tool_calls[] in OpenAI format
(ml-explore/mlx-lm#1096).
mlx-openai-server adds parser layers that translate model output into the
standard format. Qwen3.6 emits Qwen3-Coder’s XML tool-call wire format, so
the tool parser is qwen3_coder even on non-Coder variants; the reasoning
parser (qwen3_5) strips <think> blocks before clients see the output.
Override the port with MLX_PORT=9000 mlxserve.
Pre-pulled models
Models live in packages/mlx-models.txt:
unsloth/Qwen3.6-27B-MLX-8bit # primary (~35 GB, 256K ctx, reasoning-tuned)
# mlx-community/Qwen3.6-35B-A3B-8bit # MoE alternative — pull on demand
# mlx-community/Qwen3-Coder-Next-8bit# max tool-call throughput (~85 GB)
Pre-pull the default set in one shot:
bash ~/dotfiles/install/local-llm.sh pull-models
This is opt-in (the default local-llm.sh run only verifies binaries —
pulling ~35 GB of models on every bootstrap would be unfriendly). The
commented entries are one mlxpull <alias> away.
HF_HOME is set by .zprofile to $_LOCAL_PLAT/.cache/huggingface, so
weights live on scratch when scratch is configured.
Per-tool config
Both coding agents are configured to use localhost:8080/v1 as their
default backend on macOS. Each one lives under chezmoi:
| Tool | Default config | AGENTS file |
|---|---|---|
| opencode | ~/.config/opencode/opencode.json (+ plugin/git-context.ts) | ~/.config/opencode/AGENTS.md |
| pi | ~/.pi/agent/{settings,models}.json (+ themes/dotfiles.json) | ~/.pi/agent/AGENTS.md |
Both AGENTS files (plus Claude’s CLAUDE.md and Codex’s AGENTS.md)
include a shared partial — see Agent guidance. Cloud model pins
are single-sourced in home/.chezmoidata.toml ({{ .models.opus }} etc.).
Switching to cloud
# opencode — switch agent or model in the TUI
/agent plan # plan agent runs Opus
/model anthropic/claude-sonnet-4-6
# pi — Ctrl+L (or /model)
/model anthropic/claude-sonnet-4-6
API keys come from ~/.<service>.env files (written by bash auth.sh),
sourced into the shell by ~/.zprofile.
Ollama (fallback)
Installed via Homebrew (brew "ollama"). Managed as a LaunchAgent on macOS
— starts at login at http://127.0.0.1:11434. No model fleet is maintained
for it; an ad-hoc pull (ollama pull qwen3-coder:30b) is one command away.
(The old context-boosted alias machinery was removed — nothing consumed it.)
run_onchange hooks
| Trigger file | Script re-run |
|---|---|
packages/pip.txt | install/local-llm.sh (verifies binaries) |
home/dot_config/opencode/opencode.json.tmpl | install/opencode.sh (binary check) |
chezmoi update after pulling dotfile changes re-verifies the setup.