good work!
I'm training codec-bpe on 10M audio codec sequences with vocab_size=30k, num_codebooks=4, codebook_size=1024, but after training I find the tokenizer's compression rate is nearly 1. For example, for 4 s of audio at a 25 Hz frame rate I get 400 flattened codec tokens, and codec-bpe only reduces this to a bit more than 350 tokens, a ratio of barely above 1. That is far from the README's claim that "this can yield savings of 2-5x in sequence length compared to directly modeling the flattened codebooks". I followed the steps in the README, so could you share your codec-bpe tokenizer based on EnCodec or some other codec?
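For reference, here is the arithmetic behind my numbers, as a quick sketch (the helper functions are my own, not part of codec-bpe):

```python
# Back-of-envelope check of the token counts above.
def flattened_token_count(duration_s: float, frame_rate_hz: float, num_codebooks: int) -> int:
    """Tokens produced by flattening all codebooks frame by frame."""
    return int(duration_s * frame_rate_hz * num_codebooks)

def compression_ratio(raw_tokens: int, bpe_tokens: int) -> float:
    """Average number of raw codec tokens replaced by one BPE token."""
    return raw_tokens / bpe_tokens

raw = flattened_token_count(4, 25, 4)   # 400 tokens for the 4 s clip
ratio = compression_ratio(raw, 350)     # ~1.14, far from the claimed 2-5x
print(raw, round(ratio, 2))
```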