Support deleted rows in SAS data files #366
I've rebased on the latest commits in
If there is anything else that should be changed, please let me know. The existing unit tests for reading SAS data files all rely on ReadStat's write feature to write the files first. Would it be okay to add a unit test that uses the example file from #284? It's quite small, so committing it to the repository seems okay to me. Or should I add a feature to the write API to write deleted rows?
I've tested this PR with SAS data files from SAS 9.4 and with the file from the linked issue, which was created with SAS 8.0202M0. It works fine for those as far as I can tell. However, the PR doesn't work for the SAS file from #379, which was generated with SAS 7. The SAS 7 file is also treated differently by SAS 9.4 clients. Usually, the number of deleted rows is visible in the GUI of a local SAS installation, but not for this file. While the incremental row number in the data view is shown correctly for the (only) non-deleted row, the total number of rows in the metadata is shown as just 1. There are two reasons why the PR doesn't work for the SAS 7 file:
Both problems are fixable:
If there is interest in an adaptation of this PR for SAS 7 files, please let me know.
As far as I am concerned (but I am just a biased user), support for SAS8+ files with deleted rows would already be very nice. Support for SAS7 files would be even nicer, but not if that delays merging the original PR.
Resolves #284.
Introduction
This PR adds support for deleted rows in SAS data files, which may be compressed or uncompressed.
The handling of deleted uncompressed rows follows the description of the sas7bdat file format here: https://github.com/FredHutch/sas7bdat-specification/blob/master/sas7bdat.rst
The handling of deleted compressed rows is not described in the document linked above. However, it is much simpler as it doesn't involve a dedicated bitmap but only the compression type `0x05`, which marks compressed data rows as deleted. The compression type `0x05` seems to be the combination of `0x01` (meaning the data can be skipped) and `0x04` (indicating a compressed data row), which fits well with my theory put forth in #365 that the compression type is actually a bitmap.
Row limits and counters
Currently, the `row_limit` is not only an upper limit on the number of rows but also the exact number of rows that are expected to be parsed. As the `row_limit` also includes the deleted rows, the corresponding counter `parsed_row_count` is now also increased when encountering a deleted row. This allows for the validation checks against `row_limit` to remain unchanged.
The two new variables `deleted_row_limit` and `parsed_deleted_row_count` serve a corresponding purpose but count only deleted rows.
The row count in the meta data is computed as `row_limit - deleted_row_limit`. Accordingly, the row id that is passed to the value handler and is used in error messages is now computed as `parsed_row_count - parsed_deleted_row_count`.
Implementation alternative
With the proposed change, deleted rows are invisible to the user of the library, which differs from SAS itself. There, the number of deleted rows is visible in the metadata, and the GUI viewer indicates the positions of deleted rows by non-consecutive row ids.
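To make the difference between the two id schemes concrete, here is a minimal Python sketch. It is not ReadStat code: the function name `delivered_row_ids`, the `keep_gaps` switch, the 0-based ids, and the layout of deleted rows are all assumptions made for illustration.

```python
def delivered_row_ids(deleted_flags, keep_gaps=False):
    """Return the ids of the non-deleted rows (hypothetical helper).

    keep_gaps=False: ids are consecutive, so deleted rows are hidden
                     (the behaviour proposed in this PR).
    keep_gaps=True:  ids are storage positions, so deleted rows leave
                     gaps (closer to what the SAS viewer shows).
    """
    ids = []
    for position, is_deleted in enumerate(deleted_flags):
        if is_deleted:
            continue  # deleted rows are never delivered
        ids.append(position if keep_gaps else len(ids))
    return ids


# Assumed layout: five stored rows, the second and fourth are deleted.
flags = [False, True, False, True, False]
print(delivered_row_ids(flags))                  # [0, 1, 2]
print(delivered_row_ids(flags, keep_gaps=True))  # [0, 2, 4]
```

With consecutive ids, a caller cannot tell that rows were deleted. With gaps, the same value handler signature would still work, but the meaning of the row id would change.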
One could consider adding the number of deleted rows to the metadata. One could also consider marking rows as deleted, either by explicitly passing the information to the value handler or by implicitly not passing data for deleted row ids to it. This would require a breaking change in the API, I think. It's mostly a discussion of what exactly `row_count` in the metadata should refer to in the case of deleted rows.
Validation
The code works as expected with the example file from #284, which uses a single page and contains only 5 rows (of which 2 are deleted).
I've also tested it successfully against data files written on both Windows and Linux that contain multiple pages and have deleted rows across them.
As generating generic test data is quite easy in this case, I can generate test files if needed.
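For a file shaped like the five-row example from #284, the counter bookkeeping described under "Row limits and counters" can be traced with a small Python sketch. This is an illustration only, not the C implementation in this PR: the function `parse_rows`, the `(is_deleted, values)` row representation, the check against `deleted_row_limit`, and the positions of the two deleted rows are assumptions.

```python
def parse_rows(rows, row_limit, deleted_row_limit):
    """Walk all stored rows and mimic the counters (hypothetical sketch).

    rows: list of (is_deleted, values) tuples in storage order.
    Returns (row_count_for_metadata, delivered), where delivered is a
    list of (row_id, values) for the non-deleted rows.
    """
    parsed_row_count = 0
    parsed_deleted_row_count = 0
    delivered = []
    for is_deleted, values in rows:
        if is_deleted:
            parsed_deleted_row_count += 1
        else:
            # Row id as seen by the value handler: deleted rows are skipped.
            row_id = parsed_row_count - parsed_deleted_row_count
            delivered.append((row_id, values))
        # Deleted rows count as parsed, so the row_limit check is unchanged.
        parsed_row_count += 1

    if parsed_row_count != row_limit:
        raise ValueError("expected %d rows, parsed %d" % (row_limit, parsed_row_count))
    if parsed_deleted_row_count != deleted_row_limit:
        raise ValueError("unexpected number of deleted rows")
    return row_limit - deleted_row_limit, delivered


# Assumed layout: 5 stored rows, of which 2 are deleted.
rows = [(False, "a"), (True, "b"), (False, "c"), (True, "d"), (False, "e")]
row_count, delivered = parse_rows(rows, row_limit=5, deleted_row_limit=2)
print(row_count)   # 3
print(delivered)   # [(0, 'a'), (1, 'c'), (2, 'e')]
```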