🔥 MoGemma

Python/Mojo interface for Google Gemma 4.

Features

  • Embeddings: Dense vector embeddings via a pure Mojo backend, defaulting to the E4B instruction-tuned variant.
  • Text generation: Synchronous and asynchronous streaming with configurable sampling.
  • Multimodal: Native support for Gemma 4 vision (all variants) and audio (E2B/E4B) with zero-copy processing.
  • Google Cloud Storage: Automatic model download from Google's gemma-data bucket.
  • OpenTelemetry: Optional tracing instrumentation.

Installation

Recommended for most users:

pip install 'mogemma[llm]'

This enables the text generation and embedding examples shown below.

For multimodal generation with automatic image decoding from str, Path, or raw bytes inputs:

pip install 'mogemma[vision]'

Base package only:

pip install mogemma

Use the base package if you're already preparing tokens or image arrays yourself.

Quick Start

Text Generation

The default getting-started path is mogemma[llm].

from mogemma import SyncGemmaModel

model = SyncGemmaModel()
print(model.generate("Write a haiku about a robot discovering coffee:"))

Multimodal Vision

All Gemma 4 variants support vision inputs; the default (google/gemma-4-E4B-it) additionally accepts audio.

  • Install mogemma[vision] to pass image file paths or raw image bytes directly.

from mogemma import SyncGemmaModel

# Default model is google/gemma-4-E4B-it: multimodal out of the box
model = SyncGemmaModel()

response = model.generate("Describe this image in detail:", images=["input.jpg"])
print(response)

Async Streaming

import asyncio
from mogemma import AsyncGemmaModel

async def main():
    model = AsyncGemmaModel()
    async for token in model.generate_stream("Once upon a time"):
        print(token, end="", flush=True)

asyncio.run(main())

Embeddings

Generate dense vector embeddings natively through Mojo's optimized batched kernel operations. Pass a single string or a list of strings to process them in parallel.

from mogemma import SyncEmbeddingModel

model = SyncEmbeddingModel()
embeddings = model.embed(["Hello, world!", "Mojo runs Gemma inference."])
print(embeddings.shape)  # (2, 768)
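A common downstream use of these vectors is semantic similarity. A minimal cosine-similarity helper in pure Python, with made-up toy vectors standing in for real 768-dimensional embeddings (no mogemma dependency):

```python
from math import sqrt

def cosine_similarity(a, b):
    # Dot product divided by the product of the L2 norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors for illustration.
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
print(cosine_similarity(v1, v3))  # orthogonal vectors -> 0.0
```

In practice you would pass rows of the (n, 768) array returned by model.embed in place of the toy vectors.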

Selecting a Model Variant

Three Gemma 4 variants are supported (auto-detected from config.json):

Model ID                     Description
google/gemma-4-E2B-it        Compact multimodal (text + image + audio), ~2B params
google/gemma-4-E4B-it        Default for SyncGemmaModel / AsyncGemmaModel / SyncEmbeddingModel; latest small multimodal
google/gemma-4-26B-A4B-it    MoE (128 experts, top-8), 4B active; heavier reasoning

Pretrained (non-instruction-tuned) E2B / E4B checkpoints are listed in the Gemma 4 family but are not currently published to gs://gemma-data; SyncEmbeddingModel therefore defaults to the -it variant until the pretrained weights ship.
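The 26B-A4B variant's routing (top-k gating over many experts, with only the selected experts' parameters active per token) can be sketched in pure Python. This illustrates top-k gating in general, not MoGemma's internals:

```python
import math

def top_k_gate(router_logits, k):
    # Softmax over the router logits, then keep only the top-k
    # experts and renormalize their weights to sum to 1.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weight_sum = sum(probs[i] for i in top)
    return {i: probs[i] / weight_sum for i in top}

# 8 experts with top-2 routing for readability; the real model
# routes top-8 of 128 experts.
gates = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

Each token's output is then a weighted sum of the selected experts' outputs, which is why only ~4B of the 26B parameters are active per token.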

Pass a model ID to override the default:

model = SyncGemmaModel("google/gemma-4-26B-A4B-it")

For full control over sampling parameters, pass a GenerationConfig:

from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(model_path="google/gemma-4-26B-A4B-it", temperature=0.7)
model = SyncGemmaModel(config)
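Temperature rescales the next-token distribution before sampling: values below 1 sharpen it toward the most likely tokens, values above 1 flatten it. A minimal sketch of the arithmetic (illustrative only, not MoGemma's sampling code):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before the softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
p_sharp = softmax_with_temperature(logits, 0.7)  # more peaked
p_flat = softmax_with_temperature(logits, 1.5)   # more uniform
```

With temperature=0.7 as in the config above, the top token receives a larger share of the probability mass than it would at temperature 1.0.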

Device Selection

GenerationConfig and EmbeddingConfig accept:

  • device="cpu"
  • device="gpu"
  • device="gpu:0" (or other index)

Device handling is deterministic:

  • device="cpu" always runs on CPU
  • explicit GPU requests never silently fall back to CPU
  • unavailable GPU requests raise an explicit error

Current runtime status:

  • cpu and gpu are executable backends today
  • gpu / gpu:N execute via a mathematically verified runtime polyfill

from mogemma import EmbeddingConfig, SyncEmbeddingModel, GenerationConfig, SyncGemmaModel

generation = SyncGemmaModel(
    GenerationConfig(
        model_path="google/gemma-4-E4B-it",
        device="cpu",
    )
)

embeddings = SyncEmbeddingModel(
    EmbeddingConfig(
        model_path="google/gemma-4-E4B-it",
        device="cpu",
    )
)
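The device-string scheme above is simple enough to sketch. A hypothetical parser illustrating the deterministic rules, where the function name, return shape, and the choice of index 0 for a bare "gpu" are assumptions rather than MoGemma's actual API:

```python
def parse_device(spec):
    # Hypothetical sketch mirroring the rules above: "cpu", "gpu",
    # or "gpu:N"; anything else is rejected explicitly rather than
    # silently falling back to CPU.
    if spec == "cpu":
        return ("cpu", None)
    if spec == "gpu":
        return ("gpu", 0)  # assume a bare "gpu" means the first device
    if spec.startswith("gpu:"):
        index = spec.split(":", 1)[1]
        if not index.isdigit():
            raise ValueError(f"invalid GPU index in {spec!r}")
        return ("gpu", int(index))
    raise ValueError(f"unknown device spec {spec!r}")
```

Raising on anything unrecognized is what makes the behavior deterministic: an unavailable or misspelled device surfaces as an error instead of a silent CPU fallback.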

GPU Requirements: GPU acceleration requires Mojo nightly with GPU support, compatible GPU drivers (NVIDIA CUDA, AMD ROCm, or Apple Metal), and sufficient VRAM for model weights and KV cache.

Runtime Requirements

MoGemma leverages the latest Mojo features for maximum performance.

  • Mojo Nightly: Version 0.26.3.0.dev or later is required for building from source.
  • Python: 3.10+

Development & Architecture

Architecture Specific Builds

MoGemma automatically optimizes its Mojo core for your specific CPU architecture during the build process.

  • x86_64: Uses --target-cpu x86-64-v3 for optimized vector instructions.
  • aarch64: Uses native ARM optimizations.
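The per-architecture flag selection above can be pictured as a small lookup at build time. A hypothetical helper, where the function name and the "native" flag value for ARM are assumptions rather than the actual Makefile logic:

```python
import platform

def mojo_target_flags(machine=None):
    # Hypothetical sketch of the build logic above: pick the
    # x86-64-v3 microarchitecture level on x86_64 hosts and
    # native tuning on ARM. Not the actual build script.
    machine = machine or platform.machine()
    if machine == "x86_64":
        return ["--target-cpu", "x86-64-v3"]
    if machine in ("aarch64", "arm64"):
        return ["--target-cpu", "native"]
    return []  # fall back to compiler defaults elsewhere
```

x86-64-v3 enables AVX2-era vector instructions, which is where a SIMD-heavy inference core gains the most on x86.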

Local Development

To build the Mojo extension locally:

make build

License

MIT
