Problems with pyxlma_flash_sort_grid script

I found a possible issue while trying to run the `examples/pyxlma_flash_sort_grid.py` file. The code breaks and throws an error at the line `dataset, start_time = lma_read.dataset(paths_to_read)` within the `flash_sort_grid` function. This occurs because the `lmafile` class calls the `gen_sta_data` function:

```
overview = pd.DataFrame(self.gen_sta_data(),
            columns=['ID','Name','win(us)', 'data_ver', 'rms_error(ns)',
                     'sources','percent','<P/P_m>','active'])
```

Apparently, this happens when some `LYLOUT*.dat.gz` files have an inconsistent number of columns in the station data table. For example, here's what I found in two different OKLMA files from the same day:

Notice how the second file doesn't contain any values corresponding to the `dec_win(us)` column header.


File content in `LYLOUT_110524_000000_0600.dat.gz`

<img width="793" alt="Screenshot 2024-08-26 at 4 58 58 PM" src="https://github.com/user-attachments/assets/8ed847c2-9e18-4d49-8fb9-a0a217023e68">

File content in `LYLOUT_110524_205000_0600.dat.gz`


<img width="791" alt="Screenshot 2024-08-26 at 5 00 37 PM" src="https://github.com/user-attachments/assets/0dbb6a32-0e50-4c80-bc69-53b01fe328b2">
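To check other files for the same mismatch, something like the sketch below could tally how many whitespace-delimited fields each station data line has. It assumes those lines start with a `Sta_data:` label (adjust `prefix` if your files differ); `station_field_counts` is a hypothetical helper, not part of pyxlma, and the sample lines are made-up placeholders rather than real file contents.

```python
from collections import Counter


def station_field_counts(lines, prefix="Sta_data:"):
    """Tally the number of whitespace-delimited fields on each line that
    starts with `prefix` (assumed label for station data lines)."""
    counts = Counter()
    for line in lines:
        if isinstance(line, bytes):  # gzip files opened in binary mode yield bytes
            line = line.decode("utf-8")
        if line.lstrip().startswith(prefix):
            counts[len(line.split())] += 1
    return counts


# Placeholder lines (not real data): one with 11 fields, one with 12.
sample = [
    "Sta_data: X Example Station v1 v2 v3 v4 v5 v6 v7",
    "Sta_data: Y Another Station v0 v1 v2 v3 v4 v5 v6 v7",
    "some other header line",
]
counts = station_field_counts(sample)
print(dict(counts))
```

For a real file, the lines could come from `gzip.open(path, "rt")`. Note that a multi-word station name also changes the field count, so more than one distinct count is only a hint that the layout differs, not proof.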

I figured some flexibility in both `gen_sta_data` and `gen_sta_info` functions can deal with this inconsistency. For example, here's what worked for me:

```
def gen_sta_info(self):
    """ Parse the station info table from the header. Some files do not
    have fixed width columns, and station names may have spaces, so this
    function chops out the space-delimited columns to the left and right
    of the station names.
    """
    nstations = self.station_data_start-self.station_info_start-1
    with open_gzip_or_dat(self.file) as f:
        for i in range(self.station_info_start+1):
            line = next(f)
        for line_num in range(nstations):
            line = next(f)
            parts = line.decode("utf-8").split()

            if line_num == 0:
                slen = len(parts)

            if slen == 9:
                name = ' '.join(parts[2:-5])
                sta_info, code = parts[0:2]
                yield (code, name) + tuple(parts[-5:-1])

            elif slen == 10: # files with one extra station data column
                name = ' '.join(parts[2:-6])
                sta_info, code = parts[0:2]
                yield (code, name) + tuple(parts[-6:-2])
```
 
```
def gen_sta_data(self):
    """ Parse the station data table from the header. Some files do not
    have fixed width columns, and station names may have spaces, so this
    function chops out the space-delimited columns to the left and right
    of the station names.
    """
    nstations = self.station_data_start-self.station_info_start-1

    with open_gzip_or_dat(self.file) as f:
        for i in range(self.station_data_start+1):
            line = next(f)

        for line_num in range(nstations):
            line = next(f)
            parts = line.decode("utf-8").split()

            if line_num == 0:  # Calculate slen only for the first line
                slen = len(parts)

            if slen == 11:
                name = ' '.join(parts[2:-7])
                sta_info, code = parts[0:2]
                yield (code, name) + tuple(parts[-7:])

            elif slen == 12: # files with one extra station data column
                name = ' '.join(parts[2:-8])
                sta_info, code = parts[0:2]
                # parts[-8] is skipped so the output still has 7 value columns
                yield (code, name) + tuple(parts[-7:])
```
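To make the slicing above easier to follow, here is a minimal standalone sketch of the same idea: take the code from the second token, a fixed number of columns from the right, and treat whatever is left in the middle as the station name. `split_station_line`, `n_trailing`, and `n_keep` are hypothetical names for illustration, and the lines are placeholders, not real file contents.

```python
def split_station_line(line, n_trailing, n_keep=None):
    """Split a station line into (code, name, values).

    n_trailing: number of whitespace-delimited columns to the right of the name.
    n_keep: how many of those columns to return, counted from the right
            (defaults to all of them).
    """
    parts = line.split()
    code = parts[1]                         # parts[0] is the line label
    name = " ".join(parts[2:-n_trailing])   # the name may contain spaces
    values = parts[-n_trailing:]
    if n_keep is not None:
        values = values[-n_keep:]
    return code, name, tuple(values)


# Placeholder lines (not real data).
line_11 = "Sta_data: X Example Station v1 v2 v3 v4 v5 v6 v7"
line_12 = "Sta_data: X Example Station v0 v1 v2 v3 v4 v5 v6 v7"

# Mirrors the slen == 11 branch: 7 trailing columns, all kept.
print(split_station_line(line_11, n_trailing=7))
# Mirrors the slen == 12 branch: 8 trailing columns, the leftmost one dropped.
print(split_station_line(line_12, n_trailing=8, n_keep=7))
```

Both calls return the same code, name, and seven values. One caveat with my fix is that the total token count also depends on how many words the first station's name has, so detecting the layout from `len(parts)` on the first line may not hold for every network.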

I could run the flash_sort script after these modifications, but it was quite slow compared to simply running lmatools. Ingesting too many files at the same time overloaded the kernel due to out-of-memory issues with the xarray data handler. I am not sure if this script is still a WIP or is meant to replace lmatools eventually, but at the time of testing, it did not offer any advantage over good old lmatools' processing speed. I'd love to hear what @deeplycloudy or @wx4stg have to say. Happy to be corrected, of course!
