AI playground — where I try out models and prompts

Published: March 3, 2025

A local playground for LLMs, embeddings, and prompt engineering. No token costs, no data leaving the box, and as many (often painful) iterations as it takes.

Why this exists

Professionally, I work on AI product development. I can't do that credibly without first-hand experience of how a model behaves when it is fed real data. I use cloud models too, but for experimentation I need a place where I don't have to count every token.

Second motivation: my own projects (SIDELINE, AKTA, LERN) need LLMs. I don't want to depend on an external provider — and my family shouldn't end up in a training dataset because I sent documents through a cloud API.

Time invested

Bursty. An interesting new model? A weekend of installing, testing, and comparing it against the incumbents. A new use case in one of my own projects? A few evenings of prompting and evaluating.

What worked

Ollama as the central model server. Pull, run, done. No conda hell, no CUDA version surprises.
Local embeddings (nomic, mxbai) for full-text search in AKTA. Surprisingly good, runs on CPU at acceptable latency.
Prompt versioning as a plain folder of markdown files. No tool, no SaaS, and it works.
Comparing different models (Llama 3, Qwen, Gemma) on the same prompts. It shows what a rough oversimplification it is to talk about "the LLM".
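A minimal sketch of how embedding-based search ranks documents: every document is embedded once, the query is embedded at search time, and results are ordered by cosine similarity. This is not AKTA's actual code. The document ids and the 3-dimensional vectors are made up; in a real setup the vectors would come from a local embedding model such as nomic or mxbai and have hundreds of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec: list[float], index: dict[str, list[float]], top_k: int = 3):
    """Return the top_k (doc_id, score) pairs, best match first.

    index maps a document id to its precomputed embedding.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy vectors standing in for real embeddings
index = {
    "tax-2023.pdf": [0.9, 0.1, 0.0],
    "insurance.pdf": [0.1, 0.9, 0.1],
    "recipe.md": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # would be the embedding of the user's search text

print(search(query, index, top_k=2))
```

The linear scan keeps the sketch short; larger collections would typically use an approximate nearest-neighbour index instead.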
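One way a plain folder of markdown prompts can carry versions without any tooling: put the version in the file name and let the code either take the latest or pin a specific one. The naming scheme `<name>.v<N>.md` and the function `load_prompt` are hypothetical illustrations, not the author's actual layout.

```python
from __future__ import annotations

import re
import tempfile
from pathlib import Path

# Hypothetical naming scheme: <prompt-name>.v<N>.md, e.g. summarize.v3.md
VERSION_RE = re.compile(r"^(?P<name>.+)\.v(?P<version>\d+)\.md$")


def load_prompt(folder: Path, name: str, version: int | None = None) -> tuple[int, str]:
    """Return (version, text) for a prompt; the highest version if none is pinned."""
    versions: dict[int, Path] = {}
    for path in folder.glob("*.md"):
        match = VERSION_RE.match(path.name)
        if match and match.group("name") == name:
            versions[int(match.group("version"))] = path
    if not versions:
        raise FileNotFoundError(f"no versions of prompt {name!r} in {folder}")
    # Compare versions as integers so that v10 sorts after v9
    chosen = max(versions) if version is None else version
    if chosen not in versions:
        raise FileNotFoundError(f"{name}.v{chosen}.md not found in {folder}")
    return chosen, versions[chosen].read_text(encoding="utf-8")


if __name__ == "__main__":
    # Demo with a throwaway folder instead of a real prompt collection
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "summarize.v1.md").write_text("Summarize the text.", encoding="utf-8")
        (folder / "summarize.v2.md").write_text(
            "Summarize the text in three bullet points.", encoding="utf-8"
        )
        print(load_prompt(folder, "summarize"))     # latest: version 2
        print(load_prompt(folder, "summarize", 1))  # pinned: version 1
```

Pinning an explicit version keeps old prompts reproducible: after a model upgrade, the pinned prompt and the latest one can be run against the same inputs and compared.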

What didn't

Hardware limits. Beyond 30B parameters, RAM gets tight. 70B models only run with aggressive quantisation, and answer quality drops noticeably.
Hallucinations in German are often worse than in English; the models are clearly trained more heavily on English. That surprised me at first.
As of 2025, tool calling on local models is still not a solved problem. It works, but it breaks in interesting ways.
Prompt drift: a prompt that worked well yesterday is suddenly worse after a model upgrade. From that point on, versioning is a necessity.
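The hardware limit comes down to simple arithmetic: the weights alone need roughly parameters × bits per weight ÷ 8 bytes. The sketch below is a lower bound only. It ignores the KV cache, activations, and runtime overhead, and real quantisation formats store extra scale factors, so actual model files come out somewhat larger. It makes no assumption about the author's machine.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough lower bound (in GB) for the memory needed just to hold the weights.

    Ignores KV cache, activations and runtime overhead, which come on top.
    """
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billion * bytes_per_weight


# Compare common model sizes at full, 8-bit and 4-bit precision
for params in (8, 30, 70):
    for bits in (16, 8, 4):
        print(f"{params:>3}B @ {bits:>2}-bit ~ {weight_memory_gb(params, bits):6.1f} GB")
```

At 16 bits, a 70B model needs about 140 GB for its weights; at 4 bits, about 35 GB, which is why aggressive quantisation is the only way to run it on ordinary hardware.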

Where I am now

The playground is my sparring partner. Professional discussions about AI come more easily because I have tested every myth myself.