Chapter 8 Zarr Chromosome `def process_chunk` logic and labels are incorrect

https://github.com/tiagoantao/python-performance/blob/master/08-persistence/sec4-zarr/creation.py#L43-L60

I assume each `process_chunk` is only supposed to work within their assigned chunks.

In the current implementation, a certain `process_chunk` could be doing the copying work that is supposed to be done by the next `process_chunk` with a `my_chunk` value that is 1 larger.

Imagine chunk sizes are 5 and the first file's (`chromosome-1/calls`) `chrom_size` is 7 (any number larger than chunk_size will do)

When my_chunk = 0,
all_start = my_chunk * chunk_size -> all_start is 0, remaining too

remaining -= chrom_size -> 0 - 7 = -7
chrom_start = chrom_size + remaining -> 7 - 7 = 0 , chrom_start is 0, negative remaining gets flipped into positive 7

Issue is at 
```
    while remaining > 0:
        write_from_chrom = min(remaining, CHUNK_SIZE)
```
This while loop will have 2 iterations
- First iteration consumes CHUNK_SIZE 5 from the remaining 7. 
- Second iteration consumes the last 2 from remaining. However this is the job of the next `process_chunk` with my_chunk = 1, so there is duplicate writing to the same destination from different `process_chunk` functions.

Even if such "doing the next process's job" was intended, there seems to still be missing code.
Currently the destination indexes are updated (`all_start = all_start + write_from_chrom`) but the source indexes are not. I expected `chrom_start  = chrom_start + write_from_chrom`  after the destination index update.

If we want 1 `process_chunk` to only do it's own job, a solution is to track how much a single `process_chunk` has already written. Once CHUNK_SIZE is reached, its job is done. Do not try to consume the entire file to make entire use of the source, because its responsibility should solely be to fill up a CHUNK_SIZE amount in write destination. This could mean we shouldn't even while loop. But always just run `write_from_chrom = min(remaining, CHUNK_SIZE)` once.

Then check if less than CHUNK_SIZE is written with a remaining < CHUNK_SIZE. This would mean we need the next file to fill in this CHUNK. 
Which is fulfilled by the commented section https://github.com/tiagoantao/python-performance/blob/master/08-persistence/sec4-zarr/creation.py#L63-L67. 

I'm not sure why this section is commented because I think it is necessary.
Imagine the last chunk is size 5, and the last file is size 3, and second last file is size 5. 
remaining will be 2 after break, which leaves 3 more in the last chunk to be filled. I think this requires the commented out code.

In the book, there is an arrow annotation at `while remaining > 0:` saying "A chunk might require more than one chromosome". 
However, the pointed at code does not demonstrate how the next chromosome file is read. It is the commented code on github that is not printed in the book that reads the next file.

Overall this section is impossible to understand if someone only reads the book and doesn't see the source code with the commented out part. 
The annotation was also confusing. 

I wish this book had some technical reviewers before publishing because there were numerous (20+ at least) mistakes in labelling, code naming consistency and logic, It was very strong in first 3 chapters but quality went off a cliff from chapter 4 onwards. The content is still understandable and very good, but the frictions from all the tiny and not so tiny mistakes makes me hesitant to recommend to another reader.  

The most shocking logic error was at chapter 7 in https://github.com/tiagoantao/python-performance/blob/master/07-pandas/sec2-speed/traversing.py, where the first 3 functions summed down the rows before dividing 2 columns, while the rest of the functions divided across the columns first within each row before mean. That's when my heart sank.

Imagine data like 
10 2
8  4
First 3 functions will give (10+8)/(2+4) = 3
Last 3 functions will give mean((10/2) , (8/4)) = mean (5,2) = 3.5



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Chapter 8 Zarr Chromosome `def process_chunk` logic and labels are incorrect #6

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Chapter 8 Zarr Chromosome def process_chunk logic and labels are incorrect #6

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Chapter 8 Zarr Chromosome `def process_chunk` logic and labels are incorrect #6