
Adds support for int8 w8a8_gemlite quantization #34

Open
anm-ol wants to merge 19 commits into wp-1.5 from
quantization

Conversation


@anm-ol anm-ol commented Mar 20, 2026

No description provided.

@anm-ol anm-ol requested a review from lapp0 March 20, 2026 07:04
@lapp0 (Collaborator) left a comment


Nice work, requesting some cleanup changes. Please merge latest wp-1.5 first.

# "quant": "int8_weights",
"quant": None,
"taehv_ae": True,
}
Collaborator:

You can run with:

MODEL_URI="Overworld-Models/MR160k" pytest ./examples/benchmark.py

There is no need to specify all these overrides.

I recommend updating MODEL_OVERRIDES to:

MODEL_OVERRIDES = [
    {},  # default
    {"quant": "intw8a8"},
]
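To show how that override list could drive the benchmark, here is a hedged sketch; `build_engine_kwargs` is a hypothetical helper invented for illustration, and the real benchmark may wire overrides into the engine differently:

```python
import os

# Mirrors the MODEL_OVERRIDES list suggested above.
MODEL_OVERRIDES = [
    {},                    # default: no quantization
    {"quant": "intw8a8"},  # int8 weights + int8 activations
]

def build_engine_kwargs(overrides, model_uri=None):
    """Merge one override dict with the MODEL_URI environment variable.

    Hypothetical helper for this sketch only; it just shows how a
    parametrized benchmark could construct per-run engine kwargs.
    """
    kwargs = {"model_uri": model_uri or os.environ.get("MODEL_URI")}
    kwargs.update(overrides)
    return kwargs
```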


frame = cv2.imdecode(np.frombuffer(urllib.request.urlopen(url).read(), np.uint8), cv2.IMREAD_COLOR)
engine.append_frame(torch.from_numpy(np.repeat(frame[None], 4, axis=0)))
frame = cv2.resize(frame, (1024, 512))[:, :, ::-1]
Collaborator:

No resize needed after #33 merged

device="cuda")

total_linear_params = sum(mod.weight.numel() for _, mod in engine.model.named_modules() if isinstance(mod, torch.nn.Linear))
print(f"Total linear layer parameters: {total_linear_params:,}")
Collaborator:

No need to update gen_sample.py. Could you document the available quants in a brief section in README.md though?

Author:

Got it. Should quant be included by default in gen_sample.py?

# Create inference engine
engine = WorldEngine(sys.argv[1], quant="intw8a8", device="cuda")

Or

# Create inference engine
engine = WorldEngine(sys.argv[1], quant=None, device="cuda")

src/quantize.py Outdated
try:
from lmdeploy.pytorch.models.q_modules import QLinear
except ImportError:
QLinear = None
Collaborator:

Only gemlite works, so let's use it exclusively and call the mode intw8a8 or similar, please.
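For context on what an intw8a8 mode means, here is a plain-Python sketch of the w8a8 idea (weights and activations both quantized to int8, integer accumulate, float rescale). It is illustrative only and does not use gemlite's actual API:

```python
def quantize_sym(values, bits=8):
    """Symmetric quantization: returns (int values, float scale)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def w8a8_dot(weights, activations):
    """w8a8 dot product: int8 weights x int8 activations, rescaled to float."""
    qw, sw = quantize_sym(weights)
    qa, sa = quantize_sym(activations)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer accumulation
    return acc * sw * sa  # dequantize the result
```

A real kernel would keep the int8 tensors packed and fuse the rescale into the matmul epilogue; this sketch only shows the numerics.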

from .ae import get_ae
from .patch_model import apply_inference_patches
from .quantize import quantize_model
from .quantize import quantize_model, apply_ptq_model, apply_qat
Collaborator:

These imports don't exist.

pyproject.toml Outdated
"torchvision==0.25.0",
"torchaudio==2.10.0",
"torchao==0.16.0",
"flashinfer-python==0.6.6",
Collaborator:

Both of these are out of scope for this PR.

pyproject.toml Outdated
"torchao==0.16.0",
"flashinfer-python==0.6.6",
"fbgemm-gpu-genai==1.5.0; sys_platform == 'linux'",
"gemlite==0.5.1.post1"
Collaborator:

Can you check if this works on Windows?
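If gemlite turns out to be Linux-only, one option is to gate it both in pyproject.toml (with a `sys_platform == 'linux'` environment marker, as the fbgemm-gpu-genai entry above already does) and at import time. A sketch, where `load_gemlite` is a hypothetical helper:

```python
import sys

def load_gemlite():
    """Return the gemlite module if usable on this platform, else None."""
    if not sys.platform.startswith("linux"):
        return None  # dependency marker would skip install on non-Linux
    try:
        import gemlite  # only expected to be installed on Linux
    except ImportError:
        return None
    return gemlite
```

Callers can then fall back to the unquantized path whenever this returns None.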
