Gemma 4 Hack Recipes


🏃Run Gemma 4 with Ollama

Time: 5 min · Hardware: Any · Goal: Run
The absolute lowest-friction path. One shell command downloads E4B, sets up the runtime, and drops you into an interactive chat. If you want to see Gemma 4 running on your machine in the next five minutes, this is it. ⚠️ One caveat: Ollama's default Gemma 4 tags are NOT the QAT-Q4 quants — they're the higher-precision variants. gemma4:e4b ≈ 9.6 GB, gemma4:e2b ≈ 7.2 GB on disk. For genuinely small/fast inference on an 8 GB Air, see Recipe 2 (llama.cpp + the QAT Q4_K_M GGUF, ~3 GB) instead.

Prerequisites

  • ~10 GB free disk for E4B (~7.5 GB for E2B)
  • Homebrew (Mac) or curl (Linux)
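Before kicking off the ~10 GB download, it's worth a quick disk-space preflight. A minimal sketch (the 10 GiB threshold mirrors the E4B prerequisite above; `df -k` works on both macOS and Linux):

```shell
# Preflight: do we have ~10 GB free for gemma4:e4b?
# (10485760 KiB = 10 GiB; column 4 of df -k is available space.)
free_kb=$(df -k . | awk 'NR==2 {print $4}')
if [ "$free_kb" -ge 10485760 ]; then
  msg="OK: enough free disk for gemma4:e4b (~9.6 GB)"
else
  msg="Tight: only $((free_kb / 1024)) MiB free; consider gemma4:e2b"
fi
echo "$msg"
```

If you land in the "Tight" branch, pull gemma4:e2b (~7.2 GB) instead, or free up space first.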

Recipe

1. Install Ollama.

   # macOS
   brew install ollama

   # Linux
   curl -fsSL https://ollama.com/install.sh | sh

2. Pull and run Gemma 4 E4B. The first run downloads ~9.6 GB (Ollama's default is high-precision, NOT Q4); subsequent runs start instantly.

   ollama run gemma4:e4b

   # Or the smaller edge variant (~7.2 GB):
   ollama run gemma4:e2b

3. You're now at an interactive prompt. Try: Explain PLE in three sentences.

4. To exit: Ctrl+D or type /bye.
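You don't have to use the chat REPL at all: Ollama's CLI also accepts a one-shot prompt as an argument, which is handy for scripting. A small sketch (the `command -v` guard just keeps the snippet runnable on machines where Ollama isn't installed yet):

```shell
# One-shot, non-interactive generation: pass the prompt as an
# argument instead of entering the chat REPL.
prompt="Explain PLE in three sentences."
cmd="ollama run gemma4:e2b"
if command -v ollama >/dev/null 2>&1; then
  $cmd "$prompt"          # prints the model's reply, then exits
else
  echo "ollama not found; would run: $cmd \"$prompt\""
fi
```

The same pattern composes with pipes (e.g. `cat notes.txt | ollama run gemma4:e2b "Summarize:"`) for batch use.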

Expected result

Measured on an 8 GB M1 Air (April 2026): gemma4:e2b at ~1–2 tok/s for short replies (~13 s for a 15-word summary, warm). Cold first-token load ~8 s. On a 16 GB M2/M3, expect 8–15 tok/s. Memory footprint matches disk size because Ollama's default is high-precision, not Q4: ~9.6 GB resident for E4B, ~7.2 GB for E2B. On an 8 GB Air the default E4B variant will swap heavily — use E2B or switch to the Q4 GGUF path in Recipe 2.

Gotchas

  • Ollama's default gemma4:e4b is NOT the QAT Q4_K_M variant. It's high-precision (~9.6 GB) and will thrash an 8 GB Air. For the genuinely small ~3 GB variant, use Recipe 2 with the official google/gemma-4-E4B-QAT-GGUF from HuggingFace.
  • Do NOT set num_predict on Gemma 4 via the API. Gemma 4 produces hidden reasoning tokens before visible output; a low cap (e.g. 80) consumes them all during internal thought and returns an empty string. Leave num_predict unset and let Ollama use the default, or set it to at least 400.
  • If E4B's 9.6 GB is too much, use ollama run gemma4:e2b (~7.2 GB) — genuinely smaller and runs on 8 GB machines, though still tight.
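The num_predict gotcha above is easiest to respect by writing the API request body explicitly. A sketch against Ollama's standard /api/generate endpoint on its default port 11434 — note the body deliberately omits num_predict so the hidden reasoning tokens aren't truncated (the /tmp path is just an example):

```shell
# Build a /api/generate request body that leaves num_predict unset.
req=/tmp/gemma4_req.json
cat > "$req" <<'EOF'
{
  "model": "gemma4:e4b",
  "prompt": "Explain PLE in three sentences.",
  "stream": false
}
EOF
cat "$req"

# Send it once the Ollama server is up (uncomment to use):
# curl -s http://localhost:11434/api/generate -d @"$req"
```

If you must cap output length, add `"options": {"num_predict": 400}` (or higher) rather than a small value like 80.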