AI playground — where I try out models and prompts
A local playground for LLMs, embeddings, and prompt engineering. No token costs, no data leaving the box, and as many painful iterations as I like.
Why this exists
Professionally I work on AI product development. I can't do that credibly without first-hand experience of how a model behaves when fed real data. I use cloud models too — but for experimentation I need a place where I don't count every token.
Second motivation: my own projects (SIDELINE, AKTA, LERN) need LLMs. I don't want to depend on an external provider — and my family shouldn't end up in a training dataset because I sent documents through a cloud API.
Time invested
Bursty. New interesting model? A weekend installing, testing, comparing against the incumbents. Own project with a new use case? A few evenings prompting and evaluating.
What worked
- Ollama as the central model server. Pull, run, done. No conda hell, no CUDA version surprises.
- Local embeddings (nomic, mxbai) for semantic search across full text in AKTA. Surprisingly good, and it runs on CPU at acceptable latency.
- Prompt versioning as a plain folder of markdown files. No tool, no SaaS — works.
- Comparing different models (Llama 3, Qwen, Gemma) on the same prompts. Makes it clear how much of an oversimplification "the LLM" is.
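
A minimal sketch of how a folder of markdown prompts and a side-by-side model comparison can fit together. It is an illustration, not my exact setup. It assumes an Ollama server on its default local port and uses Ollama's `/api/generate` endpoint with streaming turned off. The function names (`load_prompts`, `compare`, `ollama_generate`) and the `prompts/` folder are hypothetical.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable

# Default address of a locally running Ollama server (assumption: unchanged port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model: str, prompt: str) -> str:
    """Send one non-streaming generation request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def load_prompts(folder: Path) -> dict[str, str]:
    """Read every *.md file in the folder; the file stem serves as the prompt name."""
    return {p.stem: p.read_text(encoding="utf-8") for p in sorted(folder.glob("*.md"))}


def compare(
    prompts: dict[str, str],
    models: list[str],
    generate: Callable[[str, str], str] = ollama_generate,
) -> dict[str, dict[str, str]]:
    """Run every prompt against every model and return {prompt_name: {model: answer}}."""
    return {
        name: {model: generate(model, text) for model in models}
        for name, text in prompts.items()
    }
```

Calling `compare(load_prompts(Path("prompts")), ["llama3", "qwen2.5", "gemma2"])` would then return one answer per model for each prompt file; the model tags are examples and have to match whatever has been pulled locally. Passing a different `generate` function makes it possible to try the comparison logic without a running server.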
What didn't
- Hardware limits. Beyond 30 B parameters RAM gets tight. 70 B only with aggressive quantisation — and answer quality drops noticeably.
- Hallucinations in German are often worse than in English. The models have clearly seen far more English training data. Surprised me initially.
- Tool calling on local models is, in 2025, still not a solved problem. It works, but breaks in interesting ways.
- Prompt drift: a prompt that was good yesterday is suddenly worse after a model upgrade. From that point on, versioning is a necessity.
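
The hardware limit in the first bullet follows from simple arithmetic: the weights alone need roughly parameters × bits per weight / 8 bytes. A small sketch of that estimate, with the loud caveat that it ignores the KV cache, activations, and runtime overhead, and that real quantised formats typically land somewhat above their nominal bit width. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    params_billion * 1e9 weights, each bits_per_weight / 8 bytes,
    divided by 1e9 bytes per GB; the two factors of 1e9 cancel.
    """
    return params_billion * bits_per_weight / 8


# Print a small table: model size versus precision.
for params in (8, 30, 70):
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        print(f"{params:>2} B at {bits:>2}-bit: ~{size:.0f} GB")
```

By this estimate a 70 B model needs about 140 GB at 16-bit and still about 35 GB at 4-bit, which is why it only fits with aggressive quantisation, while a 30 B model at 4-bit comes to about 15 GB.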
Where I am now
The playground is my sparring partner. Professional discussions about AI come easier because I have personally tested every myth.