Contrib: FLUX.1-lite-8B-alpha (native FLUX.1 compatibility) by jimburtoft · Pull Request #147 · aws-neuron/neuronx-distributed-inference

jimburtoft · 2026-04-28T06:06:59Z

Summary

FLUX.1-lite-8B-alpha (Freepik) is architecturally identical to FLUX.1-dev with 8 double-stream MMDiT blocks instead of 19. All other components (CLIP + T5-XXL encoders, VAE, scheduler, RoPE) are the same.
NxDI's first-party FLUX.1 implementation reads num_layers from the model's config.json at runtime, so FLUX.1-lite works out of the box with no custom modeling code.
This contrib provides a standalone generation script, integration tests, and documentation demonstrating native compatibility.
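
To illustrate why a config-driven layer count makes the lite checkpoint a drop-in, here is a minimal sketch. It is not NxDI's code: `TransformerConfig`, `load_transformer_config`, `build_block_names`, and the `transformer_blocks.{i}` naming are hypothetical stand-ins, and the JSON strings hold only the one field discussed above.

```python
import json
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    """Hypothetical config holder; only the field discussed in this PR."""
    num_layers: int  # number of double-stream MMDiT blocks


def load_transformer_config(config_text: str) -> TransformerConfig:
    # Read the block count from the config JSON instead of hard-coding it.
    cfg = json.loads(config_text)
    return TransformerConfig(num_layers=cfg["num_layers"])


def build_block_names(cfg: TransformerConfig) -> list[str]:
    # One entry per double-stream block; the names are illustrative only.
    return [f"transformer_blocks.{i}" for i in range(cfg.num_layers)]


# Illustrative config fragments (real config.json files carry many more keys).
FLUX_DEV_CONFIG = '{"num_layers": 19}'
FLUX_LITE_CONFIG = '{"num_layers": 8}'

dev_blocks = build_block_names(load_transformer_config(FLUX_DEV_CONFIG))
lite_blocks = build_block_names(load_transformer_config(FLUX_LITE_CONFIG))
print(len(dev_blocks), len(lite_blocks))
```

The same construction path yields 19 blocks for FLUX.1-dev and 8 for FLUX.1-lite, so no model-specific code path is needed.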

Validation Results (trn2.3xlarge, LNC=2, TP=4)

Metric	Value
Resolution	1024x1024
Inference steps	25
E2E generation time	5.91s avg
Pipeline steps/sec	4.23
Backbone forward/sec	4.49
Compilation time	~128s
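
The throughput rows can be cross-checked against each other with a little arithmetic. The sketch below assumes one backbone forward pass per inference step, which the table does not state; the variable names are illustrative.

```python
# Measured values copied from the table above.
steps = 25
e2e_seconds = 5.91
backbone_fwd_per_sec = 4.49

# Pipeline throughput is steps divided by end-to-end time.
pipeline_steps_per_sec = steps / e2e_seconds

# Assumption: one backbone forward per step, so backbone time is steps / rate.
backbone_seconds = steps / backbone_fwd_per_sec

# Whatever remains is spent outside the backbone (text encoders, VAE, scheduler).
non_backbone_seconds = e2e_seconds - backbone_seconds

print(f"pipeline steps/sec: {pipeline_steps_per_sec:.2f}")
print(f"backbone time: {backbone_seconds:.2f}s")
print(f"non-backbone time: {non_backbone_seconds:.2f}s")
```

Under that assumption the backbone accounts for roughly 5.57s of the 5.91s end-to-end time.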

Checklist

Model Type

Diffusion/image generation model

Contribution Contents

README with model info, benchmarks, usage instructions
src/ directory with generation script
test/integration/test_model.py with 3 passing tests
Sample output image
vLLM integration (N/A -- diffusion model)

Testing

All code tested on Neuron hardware (trn2.3xlarge)
All numbers in README are measured, not estimated
Integration tests pass: smoke test, image generation, timing

SDK Compatibility

Neuron SDK 2.29 (DLAMI 20260410)
NxD Inference 0.9
PyTorch 2.9

FLUX.1-lite-8B-alpha (Freepik) is architecturally identical to FLUX.1-dev with 8 double-stream blocks instead of 19. NxDI's FLUX.1 implementation reads num_layers from config.json at runtime, so it works out of the box with FLUX.1-lite weights -- no custom modeling code needed.

Validated on trn2.3xlarge (LNC=2, TP=4):
- 5.91s per 1024x1024 image (25 steps)
- 4.49 backbone fwd/sec
- ~128s compilation time
- SDK 2.29, NxD Inference 0.9

tejasamx-aws · 2026-05-10T19:07:18Z

@@ -0,0 +1,3 @@
+# NxDI FLUX.1-lite-8B-alpha Diffusion Model


This __init__.py is just 3 comment lines with no actual imports or exports. Since there is no custom modeling code, just a generation script, this is technically fine. In future, though, we should export the generation function for programmatic use: from .generate_flux_lite import run_generate.

tejasamx-aws · 2026-05-10T19:12:26Z

Regarding the first commit, "Remove InternVL3 contrib (belongs to separate PR)": in future, please squash commits for a cleaner history.

tejasamx-aws · 2026-05-10T19:10:20Z

+    print(f"Image saved, pixel std={img_array.std():.1f}")
+
+
+def test_warm_generation_time(neuron_app):


The description reports a 5.91s average generation time, but the test threshold is 15s. We should tighten it to 10s to catch performance regressions earlier.
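
A tighter bound could look something like the sketch below. It is not the test from this PR: `timed_average`, `check_warm_generation_time`, and `MAX_WARM_SECONDS` are hypothetical names, and a no-op callable stands in for the real pipeline call so the snippet runs without Neuron hardware.

```python
import time

# Proposed threshold from the review; the measured average is 5.91s.
MAX_WARM_SECONDS = 10.0


def timed_average(generate, runs=3):
    """Run one warm-up call, then return the mean wall time of `runs` calls."""
    generate()  # warm-up so one-time setup cost is excluded
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def check_warm_generation_time(generate, limit=MAX_WARM_SECONDS):
    """Fail when the warm average exceeds the limit; return the average."""
    avg = timed_average(generate)
    assert avg < limit, f"warm generation took {avg:.2f}s, limit is {limit:.1f}s"
    return avg


# Stand-in for the real generation call (hypothetical).
average_seconds = check_warm_generation_time(lambda: None)
print(f"warm average: {average_seconds:.4f}s")
```

A 10s limit still leaves about 4s of headroom over the measured 5.91s, so ordinary run-to-run noise should not trip it.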

Add generate_flux_lite_highres.py script and documentation for generating images at 2048x2048 and 4096x4096 resolution on Neuron. The original FLUX.1-lite/dev/schnell models do not natively support these resolutions.

Key implementation:
- 2K (16,384 tokens): Backbone compiled natively at TP=4 on trn2.3xlarge. VAE uses tiled decode (4 tiles) since it exceeds instruction limit at 2K.
- 4K (65,536 tokens): Context parallelism (TP=4, CP=4, world_size=16) on trn2.48xlarge splits sequence to 16,384 tokens/shard. VAE uses 25 tiles. Requires NEURON_RT_VISIBLE_CORES=0-15 to prevent 64-rank collective deadlock.

Benchmark results:
- 1K: 5.91s, 2K: 31.53s, 4K: 107.25s (25 steps, guidance_scale=3.5)
- 4K backbone: 3.98s/step, tiled VAE: 0.29s/tile

Validated on SDK 2.30 (DLAMI 20260522), trn2.48xlarge.
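
The token counts and shard sizes quoted in this commit message follow from FLUX.1's 8x VAE downsampling and 2x2 latent patching. The sketch below reproduces that arithmetic; the helper name `image_tokens` is hypothetical, and the timing breakdown assumes the 25 VAE tiles are decoded once per image at 0.29s each, which the message does not spell out.

```python
VAE_DOWNSAMPLE = 8  # FLUX.1 VAE reduces each spatial side by 8x
PATCH_SIZE = 2      # latents are packed into 2x2 patches before the backbone


def image_tokens(height: int, width: int) -> int:
    """Number of image tokens the backbone sees for a given resolution."""
    stride = VAE_DOWNSAMPLE * PATCH_SIZE
    return (height // stride) * (width // stride)


tokens_1k = image_tokens(1024, 1024)
tokens_2k = image_tokens(2048, 2048)
tokens_4k = image_tokens(4096, 4096)

# Context parallelism splits the 4K sequence across CP ranks.
TP, CP = 4, 4
world_size = TP * CP
tokens_per_shard_4k = tokens_4k // CP

# Rough 4K time budget from the reported per-step and per-tile numbers.
# Assumption: 25 tiles decoded once per image.
backbone_seconds_4k = 25 * 3.98
vae_seconds_4k = 25 * 0.29

print(tokens_1k, tokens_2k, tokens_4k, tokens_per_shard_4k, world_size)
print(f"{backbone_seconds_4k + vae_seconds_4k:.2f}s of the reported 107.25s")
```

Each 4K shard therefore carries the same 16,384 tokens as the full 2K sequence, which matches the 2K configuration that compiles natively at TP=4.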

jimburtoft added 3 commits April 28, 2026 02:06

Remove estimated comparison table from README (only measured numbers)

85261a0

Remove InternVL3 contrib (belongs to separate PR)

e4fc514

tejasamx-aws reviewed May 10, 2026

View reviewed changes

tejasamx-aws approved these changes May 10, 2026

View reviewed changes

lutfanm-aws approved these changes May 27, 2026

View reviewed changes

jimburtoft wants to merge 4 commits into aws-neuron:main from jimburtoft:contrib/flux1-lite-8b
