mattmireles/gemma-tuner-multimodal: Fine-tune Gemma 4 and 3n with audio, images, and text on Apple Silicon, using PyTorch and Metal Performance Shaders.

Gemma macOS Tuner wizard: system check, then LoRA / model / dataset steps

Fine-tune Gemma on text, images, and audio — on your Mac, on data that doesn’t fit on your Mac.

  • 🖼️ Image + text LoRA — captioning and VQA on local CSV.
  • 🎙️ Audio + text LoRA — the only Apple-Silicon-native path that does this.
  • 📝 Text-only LoRA — instruction or completion on CSV.
  • ☁️ Stream from GCS / BigQuery — train on terabytes without filling your SSD.
  • 🍎 Runs on Apple Silicon — MPS-native, no NVIDIA box required.

Source: github.com/mattmireles/gemma-tuner-multimodal (public).


LoRA for Gemma 4 & 3n — why not just use…?

|                                                    | This | MLX-LM | Unsloth | axolotl |
|----------------------------------------------------|------|--------|---------|---------|
| Fine-tune Gemma (text-only CSV)                    | ✅   | ✅     | ✅      | ✅      |
| Fine-tune Gemma image + text (caption / VQA CSV)   | ✅   | ⚠️ varies | ⚠️ varies | ⚠️ varies |
| Fine-tune Gemma audio + text                       | ✅   | ❌     | ⚠️ CUDA only | ❌ |
| Runs on Apple Silicon (MPS)                        | ✅   | ✅     | ❌      | ❌      |
| Stream training data from cloud                    | ✅   | ❌     | ❌      | ⚠️ partial |
| No NVIDIA GPU required                             | ✅   | ✅     | ❌      | ❌      |

If you want to fine-tune Gemma on text, images, or audio without renting an H100 or copying a terabyte of data to your laptop, this is the only toolkit that does all three modalities on Apple Silicon.

Text-only fine-tuning (instruction or completion on CSV) is supported: set modality = text in your profile and use local CSV splits under data/datasets/<your_dataset>/. See Text-only fine-tuning below.

Image + text fine-tuning (captioning or VQA on local CSV) uses modality = image, image_sub_mode, and image_token_budget; see Image fine-tuning below. v1 is local CSV only (same constraint as text-only).

Under the hood: Hugging Face Gemma checkpoints + PEFT LoRA, supervised fine-tuning in gemma_tuner/models/gemma/finetune.py, exported as a merged HF / SafeTensors tree by gemma_tuner/scripts/export.py. For Core ML conversion and GGUF inference tooling, see README/guides/README.md — this repo’s training path is Gemma-only by design.
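A back-of-the-envelope view of what "merged" means here: for each adapted layer, LoRA's low-rank update folds directly into the base weight, so the exported tree needs no adapter at inference time. The shapes and scaling below are a generic PEFT-style sketch, not code from this repo.

```python
import numpy as np

# Illustrative LoRA merge (hypothetical shapes, not the repo's code):
# the merged weight is W' = W + (alpha / r) * B @ A, which is what
# exporting a "merged HF / SafeTensors tree" amounts to per adapted layer.
d_out, d_in, r, alpha = 8, 16, 4, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01  # LoRA "down" projection
B = np.zeros((d_out, r))                 # LoRA "up" projection, zero-init

W_merged = W + (alpha / r) * (B @ A)

# With B zero-initialized (the standard LoRA init), merging before any
# training is a no-op: the model starts exactly at the base weights.
assert np.allclose(W_merged, W)
```

After training updates B, the same one-line merge bakes the adaptation into W permanently.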

Deeper reading: README/guides/README.md · README/specifications/Gemma3n.md


What you can build with this

  • Domain-specific ASR — fine-tune on medical dictation, legal depositions, call-center recordings, or any field where off-the-shelf Whisper / Gemma mishears the jargon.
  • Domain-specific vision — captioning or VQA on receipts, charts, screenshots, manufacturing defects, medical imagery — any visual domain where generic models hallucinate.
  • Document & screen understanding — train on screenshot → structured-output pairs for UI agents, OCR-adjacent pipelines, or chart QA.
  • Accent, dialect, and low-resource language adaptation — adapt a base Gemma model to underrepresented voices and languages with your own labeled audio.
  • Multimodal assistants — extend Gemma’s text reasoning with image or audio grounding for transcription, captioning, and Q&A pipelines.
  • Private, on-device pipelines — train and run entirely on your Mac. Data never leaves the machine; weights never touch a third-party API.

If your data lives in GCS or BigQuery, you can do all of this on a laptop without copying terabytes locally — the dataloader streams shards on demand.
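The key idea behind that is lazy, shard-at-a-time iteration: only the shard currently being consumed is resident. A torch-free sketch of the pattern (illustrative only; `stream_shards` and the fake fetcher are not the repo's dataloader):

```python
from itertools import islice

def stream_shards(shard_uris, fetch):
    """Yield training rows shard by shard, so at most one shard's worth
    of data is in memory at a time (hypothetical sketch)."""
    for uri in shard_uris:
        # In practice fetch() would read from GCS or BigQuery on demand.
        for row in fetch(uri):
            yield row

# Stand-in fetcher: pretend each "shard" holds three rows.
def fake_fetch(uri):
    return (f"{uri}:{i}" for i in range(3))

shards = ["gs://bucket/s0", "gs://bucket/s1"]
first_four = list(islice(stream_shards(shards, fake_fetch), 4))
assert first_four == [
    "gs://bucket/s0:0", "gs://bucket/s0:1",
    "gs://bucket/s0:2", "gs://bucket/s1:0",
]
```

Because the generator is pull-based, the second shard is never touched unless training actually reaches it.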


Training targets Gemma multimodal (text + image + audio) checkpoints loaded via base_model in config/config.ini and routed to gemma_tuner/models/gemma/finetune.py. The default file ships [model:…] entries for the supported checkpoints (LoRA on top of the Hub weights).

Add your own [model:your-name] section with group = gemma and a compatible base_model if you need another any-to-any Gemma 3n / Gemma 4 E2B–E4B checkpoint. Larger Gemma 4 weights on Hugging Face (for example 26B or 31B class) use a different Transformers architecture than this trainer’s AutoModelForCausalLM audio path—they are not supported here yet.

Wizard time and memory hints come from gemma_tuner/wizard/base.py (ModelSpecs).


Architecture (what actually calls what)

Run layout (typical):

output/
├── <run-name>/
│   ├── metadata.json
│   ├── metrics.json
│   ├── checkpoint-*/
│   └── adapter_model/          # LoRA artifacts when applicable

Configuration: hierarchical INI—defaults, groups, models, datasets, then profiles—read by gemma_tuner/core/config.py. Set GEMMA_TUNER_CONFIG if you invoke the CLI outside the repo root.
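To make the layering concrete, here is an illustrative fragment; [defaults] and the specific key placement are assumptions for the example, while section shapes ([model:…], [profile:…]) and keys (group, base_model, modality, text_sub_mode) come from this README:

```ini
; Later layers override earlier ones: defaults -> groups -> models -> datasets -> profiles.
[defaults]
max_seq_length = 2048

[model:gemma-3n-e2b]
group = gemma
base_model = google/gemma-3n-E2B-it

[profile:my-text-run]
modality = text
text_sub_mode = instruction
```

A profile only needs to state what differs from the layers beneath it.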


| Requirement | Notes |
|-------------|-------|
| Python      | 3.10+ (matches pyproject.toml; 3.8 is a fond memory) |
| macOS       | 12.3+ for MPS; use native arm64 Python, not Rosetta |
| RAM         | 16 GB workable for small Gemma runs; more is calmer |
| CUDA        | Optional; install the CUDA build of PyTorch that matches your driver |


1. Create a virtual environment (do this first)

macOS’s built-in Python is 3.9 — too old. This project requires Python 3.10+.
Homebrew has a newer one; install it if you haven't:

brew install python@3.12
Then create a virtual environment (this also gives you pip — macOS doesn’t ship it standalone):

python3.12 -m venv .venv
source .venv/bin/activate

Your prompt changes to (.venv) …. Every command below assumes the venv is active.
To reactivate in a new terminal: source .venv/bin/activate.

2. Prove you are on arm64 (Mac)

python -c "import platform; print(platform.machine())"
# arm64  ← good
# x86_64 ← wrong Python; fix before blaming MPS

If you see x86_64, your Python is running under Rosetta. Install a native arm64 Python
from python.org or via Homebrew (brew install python@3.12),
then recreate the venv.

3. Install dependencies

pip install torch torchaudio

The default dependency pin is tested for Gemma 3n on Transformers 4.x. To train or load Gemma 4 checkpoints you need a newer Transformers line (see README/plans/gemma4-upgrade.md):

pip install -r requirements/requirements-gemma4.txt

Use a separate virtual environment if you want to keep a Gemma 3n-only env and a Gemma 4 env side by side.

Gemma 3n vs Gemma 4 elsewhere: pip install -e . covers Gemma 3n everywhere (including finetune); Gemma 4 training additionally needs requirements/requirements-gemma4.txt. Several non-training commands (gemma_generate, the dataset-prep validation used for multimodal probing, ASR eval, etc.) still reject Gemma 4 model ids with an explicit error until those code paths are upgraded; export uses the same family-aware loader as finetune. For those commands, use a Gemma 3n id, or stick to finetune for Gemma 4.

The wizard walks you through model selection, dataset config, and training — answering questions and writing config/config.ini for you.

Before the wizard downloads model weights, you need a Hugging Face account with access to Gemma.
Accept the license on the model card, then authenticate:

huggingface-cli login

Or set HF_TOKEN in your environment.

If something seems broken, run gemma-macos-tuner system-check first.


# Dataset prep (profile names come from config.ini)
gemma-macos-tuner prepare <dataset-profile>

# Train (model in profile must be a Gemma id / local path with "gemma" in the string)
gemma-macos-tuner finetune <profile> --json-logging

# Evaluate
gemma-macos-tuner evaluate <profile-or-run>

# Export merged HF/SafeTensors tree (LoRA merged when adapter_config.json is present)
gemma-macos-tuner export <run-dir-or-profile>

# Blacklist generation from errors
gemma-macos-tuner blacklist <profile>

# Run index
gemma-macos-tuner runs list

# Guided setup
gemma-macos-tuner wizard

Migration from main.py / old habits: docs/MIGRATION.md. Runs management moved to the runs subcommand—not a separate manage.py in this tree.


Text-only fine-tuning

Train on CSV text (local splits under data/datasets/<your_dataset>/) without audio. v1 supports local CSV only — not BigQuery or Granary streaming (those remain audio-oriented).

Set in your [profile:…] (see also README/Datasets.md):

  • modality = text
  • text_sub_mode = instruction — user/assistant turns: set prompt_column and text_column (response).
  • text_sub_mode = completion — one column; the full sequence is trained (no prompt mask).

Optional: max_seq_length (default 2048).
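Conceptually, the two sub-modes differ only in how labels are built: instruction masks the prompt so loss falls on the response alone, while completion trains on the whole sequence. A torch-free sketch (illustrative, not the repo's actual collator):

```python
IGNORE_INDEX = -100  # the standard HF/PyTorch "skip this token in the loss" label

def build_labels(prompt_ids, response_ids, instruction_mode):
    """Sketch of the difference between the two text_sub_modes:
    instruction -> prompt tokens masked out of the loss;
    completion  -> full sequence trained (no prompt mask)."""
    if instruction_mode:
        return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return list(prompt_ids) + list(response_ids)

# Instruction: loss only on the response tokens.
assert build_labels([1, 2, 3], [4, 5], instruction_mode=True) == [-100, -100, -100, 4, 5]
# Completion: every token contributes to the loss.
assert build_labels([1, 2, 3], [4, 5], instruction_mode=False) == [1, 2, 3, 4, 5]
```

This is why completion mode needs only one column: there is no prompt to mask.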

Instruction example (profile snippet):

modality = text
text_sub_mode = instruction
text_column = response
prompt_column = prompt
max_seq_length = 2048

Completion example:

modality = text
text_sub_mode = completion
text_column = text
max_seq_length = 2048

The checkpoint is still a multimodal Gemma AutoModelForCausalLM; the USM audio tower weights remain in memory in v1 even when you only train on text. See README/KNOWN_ISSUES.md.


Image fine-tuning

Train on image + text pairs from local CSV splits under data/datasets/<your_dataset>/ (train.csv / validation.csv). v1 supports captioning (image_sub_mode = caption) and VQA (image_sub_mode = vqa). See README/Datasets.md for all keys.

  • Caption / OCR-style: user turn = image + fixed instruction (“Describe this image.”); assistant = your caption column.
  • VQA: user turn = image + question (prompt_column); assistant = answer (text_column).

Profile snippet (caption):

modality = image
image_sub_mode = caption
text_column = caption
image_path_column = image_path
image_token_budget = 280

Profile snippet (VQA):

modality = image
image_sub_mode = vqa
prompt_column = question
text_column = answer
image_path_column = image_path
image_token_budget = 560

image_token_budget must be one of 70, 140, 280, 560, 1120. Use the same value at inference as during training. Higher budgets improve detail but increase memory and step time on MPS. Export saves the processor next to weights; if metadata.json from the run is present, export reapplies the stored budget to the processor for consistency.
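Since a mismatched budget fails only at train or inference time, it is worth validating early. A hypothetical fail-fast helper (the constant comes from this README; the function name is made up for illustration):

```python
VALID_BUDGETS = (70, 140, 280, 560, 1120)  # allowed image_token_budget values per the README

def check_budget(budget):
    """Hypothetical helper: reject a budget the processor won't accept,
    before any weights are downloaded or training starts."""
    if budget not in VALID_BUDGETS:
        raise ValueError(
            f"image_token_budget must be one of {VALID_BUDGETS}, got {budget}"
        )
    return budget

assert check_budget(280) == 280      # valid ladder value passes through
try:
    check_budget(300)                # not on the ladder -> rejected
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for an off-ladder budget")
```

Pinning the same constant at inference time is the easy way to honor the "same value as during training" rule.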


Gemma 3n / Gemma 4 on Apple Silicon

End-to-end notes live in README/specifications/Gemma3n.md. Multimodal Gemma 4 + MPS field guide: README/guides/apple-silicon/gemma4-guide.md. Short version:

python -m gemma_tuner.scripts.gemma_preflight
python -m gemma_tuner.scripts.gemma_profiler --model google/gemma-3n-E2B-it

gemma-macos-tuner wizard

python -m gemma_tuner.scripts.gemma_tiny_overfit --profile gemma-lora-test --max-samples 32

python tools/eval_gemma_asr.py \
  --csv data/datasets/<your_dataset>/validation.csv \
  --model google/gemma-3n-E2B-it \
  --adapters output/<your_run>/ \
  --text-column text \
  --limit 200

MPS reality check: prefer bf16 when supported; attention is forced to eager for stability; do not leave PYTORCH_ENABLE_MPS_FALLBACK=1 on in production (it hides silent CPU fallbacks).


Data: CSVs, GCS, BigQuery

  • Local / HTTP / GCS paths in your prepared CSV; use gemma-macos-tuner prepare --no-download to avoid copying GCS audio locally.
  • BigQuery import (wizard or scripts): needs pip install .[gcp] and Application Default Credentials (gcloud auth application-default login or GOOGLE_APPLICATION_CREDENTIALS). The wizard can materialize _prepared.csv and append a dataset section to config/config.ini.

Patch layout (by dataset source):

data_patches/{source}/
├── override_text_perfect/
├── do_not_blacklist/
└── delete/
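The directory names suggest a precedence: rows under delete/ are dropped, override_text_perfect/ replaces a row's text, and do_not_blacklist/ protects rows from automated blacklisting. A speculative sketch of that precedence (function and data shapes are invented for illustration; check the repo for the real semantics):

```python
def apply_patches(rows, deletes, overrides):
    """Hypothetical illustration of the patch precedence implied by the
    layout: deletion wins outright; otherwise an override replaces the
    original text; untouched rows pass through unchanged."""
    patched = []
    for key, text in rows:
        if key in deletes:
            continue  # delete/ removes the row entirely
        patched.append((key, overrides.get(key, text)))
    return patched

rows = [("a.wav", "teh cat"), ("b.wav", "dog")]
result = apply_patches(rows, deletes={"b.wav"}, overrides={"a.wav": "the cat"})
assert result == [("a.wav", "the cat")]
```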

Training visualizer (optional)

Install viz extras, set visualize=true in the profile, open the URL the trainer prints (default bind 127.0.0.1, port starting at 8080). If Flask isn’t installed, training continues without drama.


NVIDIA Granary & streaming

Large-corpus workflows: gemma-macos-tuner prepare-granary and streaming-oriented dataset keys—see README/Datasets.md.


# Debug only—surfaces unsupported ops by falling back to CPU (slow)
export PYTORCH_ENABLE_MPS_FALLBACK=1

# Cap MPS allocator appetite (try 0.7–0.9)
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.8

Preprocessing worker count and dataloader settings are controlled from config/config.ini; defaults favor using available CPU cores for Dataset.map.


Workflows under .github/workflows/: lint (ruff), fast tests (pytest -k "not slow"), macOS smoke. Regenerate lockfiles with pip-compile when you change pyproject.toml—see comments in requirements/requirements.txt.


Runs update output/experiments.csv and, optionally, a SQLite database; query either with standard SQL, swapping profile names for whatever you actually train.


| Symptom | Likely fix |
|---------|------------|
| Unsupported model from finetune | Use a Gemma model id / path containing "gemma". |
| MPS not available | macOS 12.3+, arm64 Python, current PyTorch. |
| OOM / swap storm | Smaller batch, gradient checkpointing, lower PYTORCH_MPS_HIGH_WATERMARK_RATIO. |
| Slow training with fallback env on | Unset PYTORCH_ENABLE_MPS_FALLBACK after debugging. |
| Config not found | Set GEMMA_TUNER_CONFIG, run from the repo with config/config.ini, or pass --config. |
| 401 / gated model / cannot download weights | Accept the license on the model's Hugging Face page; run huggingface-cli login or set HF_TOKEN. |


See docs/CONTRIBUTING.md. Prefer extending cli_typer.py and shared helpers in gemma_tuner/core/ over one-off scripts.


Google’s Gemma team, Hugging Face Transformers & PEFT, PyTorch MPS maintainers—and everyone who filed an issue after watching Activity Monitor turn red.


Released under the MIT License.
