🔥 MoGemma

Python/Mojo interface for Google Gemma 4.

Features

  • Embeddings: Dense vector embeddings via a pure Mojo backend, defaulting to the E4B instruction-tuned variant.
  • Text generation: Synchronous and asynchronous streaming with configurable sampling.
  • Multimodal: Native support for Gemma 4 vision (all variants) and audio (E2B/E4B) with zero-copy processing.
  • Google Cloud Storage: Automatic model download from Google's gemma-data bucket.
  • OpenTelemetry: Optional tracing instrumentation.

Installation

Recommended for most users:

pip install 'mogemma[llm]'

This enables the text generation and embedding examples shown below.

For multimodal generation with automatic image decoding from str, Path, or raw bytes inputs:

pip install 'mogemma[vision]'

Base package only:

pip install mogemma

Use the base package if you're already preparing tokens or image arrays yourself.

Quick Start

Text Generation

The default getting-started path is mogemma[llm].

from mogemma import SyncGemmaModel

model = SyncGemmaModel()
print(model.generate("Write a haiku about a robot discovering coffee:"))

Multimodal Vision

All Gemma 4 variants support vision inputs; the default (google/gemma-4-E4B-it) additionally accepts audio.

  • Install mogemma[vision] to pass image file paths or raw image bytes directly.

from mogemma import SyncGemmaModel

# Default model is google/gemma-4-E4B-it: multimodal out of the box
model = SyncGemmaModel()

response = model.generate("Describe this image in detail:", images=["input.jpg"])
print(response)

Async Streaming

import asyncio
from mogemma import AsyncGemmaModel

async def main():
    model = AsyncGemmaModel()
    async for token in model.generate_stream("Once upon a time"):
        print(token, end="", flush=True)

asyncio.run(main())

Embeddings

Generate dense vector embeddings natively through Mojo's optimized batched kernel operations. Pass a single string or a list of strings to process them in parallel.

from mogemma import SyncEmbeddingModel

model = SyncEmbeddingModel()
embeddings = model.embed(["Hello, world!", "Mojo runs Gemma inference."])
print(embeddings.shape)  # (2, 768)
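A common downstream use of these vectors is semantic similarity. A minimal cosine-similarity helper in pure Python, with made-up toy vectors standing in for real 768-dimensional embeddings (no mogemma dependency):

```python
from math import sqrt

def cosine_similarity(a, b):
    # Dot product divided by the product of the L2 norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors for illustration.
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
print(cosine_similarity(v1, v3))  # orthogonal vectors -> 0.0
```

In practice you would pass rows of the (n, 768) array returned by model.embed in place of the toy vectors.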

Selecting a Model Variant

Three Gemma 4 variants are supported (auto-detected from config.json):

Model ID                     Description
google/gemma-4-E2B-it        Compact multimodal (text + image + audio), ~2B params
google/gemma-4-E4B-it        Default for SyncGemmaModel / AsyncGemmaModel / SyncEmbeddingModel; latest small multimodal
google/gemma-4-26B-A4B-it    MoE (128 experts, top-8), 4B active; heavier reasoning

Pretrained (non-instruction-tuned) E2B / E4B checkpoints are listed in the Gemma 4 family but are not currently published to gs://gemma-data; SyncEmbeddingModel therefore defaults to the -it variant until the pretrained weights ship.
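The 26B-A4B variant's routing (top-k gating over many experts, with only the selected experts' parameters active per token) can be sketched in pure Python. This illustrates top-k gating in general, not MoGemma's internals:

```python
import math

def top_k_gate(router_logits, k):
    # Softmax over the router logits, then keep only the top-k
    # experts and renormalize their weights to sum to 1.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weight_sum = sum(probs[i] for i in top)
    return {i: probs[i] / weight_sum for i in top}

# 8 experts with top-2 routing for readability; the real model
# routes top-8 of 128 experts.
gates = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

Each token's output is then a weighted sum of the selected experts' outputs, which is why only ~4B of the 26B parameters are active per token.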

Pass a model ID to override the default:

model = SyncGemmaModel("google/gemma-4-26B-A4B-it")

For full control over sampling parameters, pass a GenerationConfig:

from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(model_path="google/gemma-4-26B-A4B-it", temperature=0.7)
model = SyncGemmaModel(config)
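Temperature rescales the next-token distribution before sampling: values below 1 sharpen it toward the most likely tokens, values above 1 flatten it. A minimal sketch of the arithmetic (illustrative only, not MoGemma's sampling code):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before the softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
p_sharp = softmax_with_temperature(logits, 0.7)  # more peaked
p_flat = softmax_with_temperature(logits, 1.5)   # more uniform
```

With temperature=0.7 as in the config above, the top token receives a larger share of the probability mass than it would at temperature 1.0.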

Device Selection

GenerationConfig and EmbeddingConfig accept:

  • device="cpu"
  • device="gpu"
  • device="gpu:0" (or other index)

Device handling is deterministic:

  • device="cpu" always runs on CPU
  • explicit GPU requests never silently fall back to CPU
  • unavailable GPU requests raise an explicit error

Current runtime status:

  • cpu and gpu are executable backends today
  • gpu / gpu:N execute via a mathematically verified runtime polyfill

from mogemma import EmbeddingConfig, SyncEmbeddingModel, GenerationConfig, SyncGemmaModel

generation = SyncGemmaModel(
    GenerationConfig(
        model_path="google/gemma-4-E4B-it",
        device="cpu",
    )
)

embeddings = SyncEmbeddingModel(
    EmbeddingConfig(
        model_path="google/gemma-4-E4B-it",
        device="cpu",
    )
)
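The device-string scheme above is simple enough to sketch. A hypothetical parser illustrating the deterministic rules, where the function name, return shape, and the choice of index 0 for a bare "gpu" are assumptions rather than MoGemma's actual API:

```python
def parse_device(spec):
    # Hypothetical sketch mirroring the rules above: "cpu", "gpu",
    # or "gpu:N"; anything else is rejected explicitly rather than
    # silently falling back to CPU.
    if spec == "cpu":
        return ("cpu", None)
    if spec == "gpu":
        return ("gpu", 0)  # assume a bare "gpu" means the first device
    if spec.startswith("gpu:"):
        index = spec.split(":", 1)[1]
        if not index.isdigit():
            raise ValueError(f"invalid GPU index in {spec!r}")
        return ("gpu", int(index))
    raise ValueError(f"unknown device spec {spec!r}")
```

Raising on anything unrecognized is what makes the behavior deterministic: an unavailable or misspelled device surfaces as an error instead of a silent CPU fallback.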

GPU Requirements: GPU acceleration requires Mojo nightly with GPU support, compatible GPU drivers (NVIDIA CUDA, AMD ROCm, or Apple Metal), and sufficient VRAM for model weights and KV cache.

Runtime Requirements

MoGemma leverages the latest Mojo features for maximum performance.

  • Mojo Nightly: Version 0.26.3.0.dev or later is required for building from source.
  • Python: 3.10+

Development & Architecture

Architecture Specific Builds

MoGemma automatically optimizes its Mojo core for your specific CPU architecture during the build process.

  • x86_64: Uses --target-cpu x86-64-v3 for optimized vector instructions.
  • aarch64: Uses native ARM optimizations.
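The per-architecture flag selection above can be pictured as a small lookup at build time. A hypothetical helper, where the function name and the "native" flag value for ARM are assumptions rather than the actual Makefile logic:

```python
import platform

def mojo_target_flags(machine=None):
    # Hypothetical sketch of the build logic above: pick the
    # x86-64-v3 microarchitecture level on x86_64 hosts and
    # native tuning on ARM. Not the actual build script.
    machine = machine or platform.machine()
    if machine == "x86_64":
        return ["--target-cpu", "x86-64-v3"]
    if machine in ("aarch64", "arm64"):
        return ["--target-cpu", "native"]
    return []  # fall back to compiler defaults elsewhere
```

x86-64-v3 enables AVX2-era vector instructions, which is where a SIMD-heavy inference core gains the most on x86.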

Local Development

To build the Mojo extension locally:

make build

License

MIT
