Skip to content

Chapter 8 Zarr Chromosome def process_chunk logic and labels are incorrect #6

@hanqi-monarch

Description

@hanqi-monarch

https://github.com/tiagoantao/python-performance/blob/master/08-persistence/sec4-zarr/creation.py#L43-L60

I assume each process_chunk is only supposed to work within their assigned chunks.

In the current implementation, a certain process_chunk could be doing the copying work that is supposed to be done by the next process_chunk with a my_chunk value that is 1 larger.

Imagine chunk sizes are 5 and the first file's (chromosome-1/calls) chrom_size is 7 (any number larger than chunk_size will do)

When my_chunk = 0,
all_start = my_chunk * chunk_size -> all_start is 0, remaining too

remaining -= chrom_size -> 0 - 7 = -7
chrom_start = chrom_size + remaining -> 7 - 7 = 0 , chrom_start is 0, negative remaining gets flipped into positive 7

Issue is at

    while remaining > 0:
        write_from_chrom = min(remaining, CHUNK_SIZE)

This while loop will have 2 iterations

  • First iteration consumes CHUNK_SIZE 5 from the remaining 7.
  • Second iteration consumes the last 2 from remaining. However this is the job of the next process_chunk with my_chunk = 1, so there is duplicate writing to the same destination from different process_chunk functions.

Even if such "doing the next process's job" was intended, there seems to still be missing code.
Currently the destination indexes are updated (all_start = all_start + write_from_chrom) but the source indexes are not. I expected chrom_start = chrom_start + write_from_chrom after the destination index update.

If we want 1 process_chunk to only do it's own job, a solution is to track how much a single process_chunk has already written. Once CHUNK_SIZE is reached, its job is done. Do not try to consume the entire file to make entire use of the source, because its responsibility should solely be to fill up a CHUNK_SIZE amount in write destination. This could mean we shouldn't even while loop. But always just run write_from_chrom = min(remaining, CHUNK_SIZE) once.

Then check if less than CHUNK_SIZE is written with a remaining < CHUNK_SIZE. This would mean we need the next file to fill in this CHUNK.
Which is fulfilled by the commented section https://github.com/tiagoantao/python-performance/blob/master/08-persistence/sec4-zarr/creation.py#L63-L67.

I'm not sure why this section is commented because I think it is necessary.
Imagine the last chunk is size 5, and the last file is size 3, and second last file is size 5.
remaining will be 2 after break, which leaves 3 more in the last chunk to be filled. I think this requires the commented out code.

In the book, there is an arrow annotation at while remaining > 0: saying "A chunk might require more than one chromosome".
However, the pointed at code does not demonstrate how the next chromosome file is read. It is the commented code on github that is not printed in the book that reads the next file.

Overall this section is impossible to understand if someone only reads the book and doesn't see the source code with the commented out part.
The annotation was also confusing.

I wish this book had some technical reviewers before publishing because there were numerous (20+ at least) mistakes in labelling, code naming consistency and logic, It was very strong in first 3 chapters but quality went off a cliff from chapter 4 onwards. The content is still understandable and very good, but the frictions from all the tiny and not so tiny mistakes makes me hesitant to recommend to another reader.

The most shocking logic error was at chapter 7 in https://github.com/tiagoantao/python-performance/blob/master/07-pandas/sec2-speed/traversing.py, where the first 3 functions summed down the rows before dividing 2 columns, while the rest of the functions divided across the columns first within each row before mean. That's when my heart sank.

Imagine data like
10 2
8 4
First 3 functions will give (10+8)/(2+4) = 3
Last 3 functions will give mean((10/2) , (8/4)) = mean (5,2) = 3.5

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions