Gemma 4 Hack Recipes


🏃Run Gemma 4 with Ollama

Time: 5 min · Hardware: Any · Goal: Run
The absolute lowest-friction path. One shell command downloads E4B, sets up the runtime, and drops you into an interactive chat. If you want to see Gemma 4 running on your machine in the next five minutes, this is it. ⚠️ One caveat: Ollama's default Gemma 4 tags are NOT the QAT-Q4 quants — they're the higher-precision variants. gemma4:e4b ≈ 9.6 GB, gemma4:e2b ≈ 7.2 GB on disk. For genuinely small/fast inference on an 8 GB Air, see Recipe 2 (llama.cpp + the QAT Q4_K_M GGUF, ~3 GB) instead.

Prerequisites

  • ~10 GB free disk for E4B (~7.5 GB for E2B)
  • Homebrew (Mac) or curl (Linux)
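Before kicking off the ~10 GB download, it's worth a quick disk-space preflight. A minimal sketch (the 10 GiB threshold mirrors the E4B prerequisite above; `df -k` works on both macOS and Linux):

```shell
# Preflight: do we have ~10 GB free for gemma4:e4b?
# (10485760 KiB = 10 GiB; column 4 of df -k is available space.)
free_kb=$(df -k . | awk 'NR==2 {print $4}')
if [ "$free_kb" -ge 10485760 ]; then
  msg="OK: enough free disk for gemma4:e4b (~9.6 GB)"
else
  msg="Tight: only $((free_kb / 1024)) MiB free; consider gemma4:e2b"
fi
echo "$msg"
```

If you land in the "Tight" branch, pull gemma4:e2b (~7.2 GB) instead, or free up space first.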

Recipe

1. Install Ollama.

   # macOS
   brew install ollama

   # Linux
   curl -fsSL https://ollama.com/install.sh | sh

2. Pull and run Gemma 4 E4B. The first run downloads ~9.6 GB (Ollama's default is high-precision, NOT Q4); subsequent runs start instantly.

   ollama run gemma4:e4b

   # Or the smaller edge variant (~7.2 GB):
   ollama run gemma4:e2b

3. You're now at an interactive prompt. Try: Explain PLE in three sentences.

4. To exit: Ctrl+D or type /bye.
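You don't have to use the chat REPL at all: Ollama's CLI also accepts a one-shot prompt as an argument, which is handy for scripting. A small sketch (the `command -v` guard just keeps the snippet runnable on machines where Ollama isn't installed yet):

```shell
# One-shot, non-interactive generation: pass the prompt as an
# argument instead of entering the chat REPL.
prompt="Explain PLE in three sentences."
cmd="ollama run gemma4:e2b"
if command -v ollama >/dev/null 2>&1; then
  $cmd "$prompt"          # prints the model's reply, then exits
else
  echo "ollama not found; would run: $cmd \"$prompt\""
fi
```

The same pattern composes with pipes (e.g. `cat notes.txt | ollama run gemma4:e2b "Summarize:"`) for batch use.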

Expected result

Measured on an 8 GB M1 Air (April 2026): gemma4:e2b at ~1–2 tok/s for short replies (~13 s for a 15-word summary, warm). Cold first-token load ~8 s. On a 16 GB M2/M3, expect 8–15 tok/s. Memory footprint matches disk size because Ollama's default is high-precision, not Q4: ~9.6 GB resident for E4B, ~7.2 GB for E2B. On an 8 GB Air the default E4B variant will swap heavily — use E2B or switch to the Q4 GGUF path in Recipe 2.

Gotchas

  • Ollama's default gemma4:e4b is NOT the QAT Q4_K_M variant. It's high-precision (~9.6 GB) and will thrash an 8 GB Air. For the genuinely small ~3 GB variant, use Recipe 2 with the official google/gemma-4-E4B-QAT-GGUF from HuggingFace.
  • Do NOT set num_predict on Gemma 4 via the API. Gemma 4 produces hidden reasoning tokens before visible output; a low cap (e.g. 80) consumes them all during internal thought and returns an empty string. Leave num_predict unset and let Ollama use the default, or set it to at least 400.
  • If E4B's 9.6 GB is too much, use ollama run gemma4:e2b (~7.2 GB) — genuinely smaller and runs on 8 GB machines, though still tight.
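The num_predict gotcha above is easiest to respect by writing the API request body explicitly. A sketch against Ollama's standard /api/generate endpoint on its default port 11434 — note the body deliberately omits num_predict so the hidden reasoning tokens aren't truncated (the /tmp path is just an example):

```shell
# Build a /api/generate request body that leaves num_predict unset.
req=/tmp/gemma4_req.json
cat > "$req" <<'EOF'
{
  "model": "gemma4:e4b",
  "prompt": "Explain PLE in three sentences.",
  "stream": false
}
EOF
cat "$req"

# Send it once the Ollama server is up (uncomment to use):
# curl -s http://localhost:11434/api/generate -d @"$req"
```

If you must cap output length, add `"options": {"num_predict": 400}` (or higher) rather than a small value like 80.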