Support deleted rows in SAS data files by hpoettker · Pull Request #366 · WizardMac/ReadStat

hpoettker · 2026-04-18T21:05:20Z

Resolves #284.

Introduction

This PR adds support for deleted rows in both compressed and uncompressed SAS data files.

The handling of deleted uncompressed rows follows the description of the sas7bdat file format here: https://github.com/FredHutch/sas7bdat-specification/blob/master/sas7bdat.rst
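As a rough illustration of the bitmap idea for uncompressed rows, checking a single row could look like the sketch below. The helper name `row_is_deleted` and the most-significant-bit-first ordering are assumptions made for illustration; neither is taken from this PR's code, and the bit order should be checked against the specification.

```c
#include <stddef.h>

/* Hypothetical helper, not part of ReadStat.
 * Returns 1 if the bit for row_in_page is set in the page's
 * deleted-row bitmap, 0 otherwise.
 * Assumption: one bit per row, most significant bit first. */
static int row_is_deleted(const unsigned char *bitmap, size_t row_in_page) {
    unsigned char mask = (unsigned char)(0x80u >> (row_in_page % 8));
    return (bitmap[row_in_page / 8] & mask) != 0;
}
```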

The handling of deleted compressed rows is not described in the document linked above. However, it is much simpler as it doesn't involve a dedicated bitmap but only the compression type 0x05, which marks compressed data rows as deleted. The compression type 0x05 seems to be the combination of 0x01 (meaning the data can be skipped) and 0x04 (indicating a compressed data row), which fits well with my theory put forth in #365 that the compression type is actually a bitmap.
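If the compression type really is a bitmap, the check for a deleted compressed row reduces to a mask test. The following is a minimal sketch under that assumption; the flag and function names are hypothetical and only the values 0x01, 0x04 and 0x05 come from the description above.

```c
/* Hypothetical flag names; the values are those named in the PR text. */
enum {
    SAS_ROW_FLAG_SKIP       = 0x01, /* the data can be skipped */
    SAS_ROW_FLAG_COMPRESSED = 0x04  /* compressed data row */
};

/* Returns 1 if both flags are set, i.e. the compression type
 * marks a deleted compressed row (0x05), 0 otherwise. */
static int is_deleted_compressed_row(unsigned char compression_type) {
    const unsigned char deleted = SAS_ROW_FLAG_SKIP | SAS_ROW_FLAG_COMPRESSED;
    return (compression_type & deleted) == deleted;
}
```

A strict comparison against 0x05 would behave the same for the values observed so far; the mask test is simply the form that matches the bitmap theory.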

Row limits and counters

Currently, the row_limit is not only an upper limit on the number of rows but also the exact number of rows that are expected to be parsed. As the row_limit also includes the deleted rows, the corresponding counter parsed_row_count is now also increased when encountering a deleted row. This allows for the validation checks against row_limit to remain unchanged.

The two new variables deleted_row_limit and parsed_deleted_row_count serve a corresponding purpose but count only deleted rows.

The row count in the metadata is computed as row_limit - deleted_row_limit. Accordingly, the row id that is passed to the value handler and is used in error messages is now computed as parsed_row_count - parsed_deleted_row_count.
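The bookkeeping described above can be sketched as follows. The struct and function names are hypothetical and do not mirror the actual ReadStat context struct; only the four counter names and the two subtractions come from the description.

```c
/* Hypothetical struct holding the counters named in the PR description. */
typedef struct {
    int row_limit;                /* all rows, including deleted ones */
    int deleted_row_limit;        /* deleted rows only */
    int parsed_row_count;         /* incremented for every row, deleted or not */
    int parsed_deleted_row_count; /* incremented for deleted rows only */
} row_counters_t;

/* Row count reported in the metadata. */
static int visible_row_count(const row_counters_t *c) {
    return c->row_limit - c->deleted_row_limit;
}

/* Row id passed to the value handler and used in error messages. */
static int current_row_id(const row_counters_t *c) {
    return c->parsed_row_count - c->parsed_deleted_row_count;
}

/* Update the counters after a row has been encountered. */
static void record_row(row_counters_t *c, int deleted) {
    c->parsed_row_count++;
    if (deleted)
        c->parsed_deleted_row_count++;
}
```

For a file with 5 rows of which 2 are deleted, `visible_row_count` yields 3, while `parsed_row_count` still reaches 5 so that the existing checks against `row_limit` hold.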

Implementation alternative

With the proposed change, deleted rows are invisible to users of the library, which differs slightly from SAS itself. There, the number of deleted rows is visible in the metadata, and the GUI viewer indicates the positions of deleted rows by non-consecutive row ids.

One could consider adding the number of deleted rows to the metadata. One could also consider marking rows as deleted, either by explicitly passing the information to the value handler or by implicitly not passing data for deleted row ids to it. I think this would require a breaking change in the API. It's mostly a question of what exactly row_count in the metadata should refer to when there are deleted rows.

Validation

The code works as expected with the example file from #284, which uses a single page and contains only 5 rows (of which 2 are deleted).

I've also tested it successfully against data files written on both Windows and Linux that contain multiple pages and have deleted rows across them.

As generating generic test data is quite easy in this case, I can generate test files if needed.

hpoettker · 2026-04-25T16:56:07Z

I've rebased on the latest commits in dev, and I've made some minor changes:

- the type unsigned char is now used instead of uint8_t for the bitmap
- the code style is more aligned with the existing code (indentation and placement of the asterisk in pointer declarations)

If there is anything else that should be changed, please let me know.

The existing unit tests for the reading of SAS data files all rely on the write feature of ReadStat to write the files first. Would it be okay to add a unit test that uses the example file from #284? It's quite small, so committing it to the repository seems okay to me. Or should I add a feature to the write API to write deleted rows?

hpoettker · 2026-06-30T20:54:03Z

I've tested this PR with SAS data files from SAS 9.4 and with the file from the linked issue, which was created with SAS 8.0202M0. It works fine for those as far as I can tell.

However, the PR doesn't work for the SAS file from #379, which was generated with SAS 7.

The SAS 7 file is also treated differently by SAS 9.4 clients. Usually, the number of deleted rows is visible in the GUI of a local SAS installation, but not for this file. While the incremental row number in the data view is presented correctly for the (only) non-deleted row, the total number of rows in the metadata is presented as just 1.

There are two reasons why the PR doesn't work for the SAS 7 file:

1. The row size subheader doesn't provide the total and deleted row count as 178 and 177, respectively, but as 1 and 0. This also explains the different display in the SAS GUI.
2. The calculation of the offset of the deleted row bitmap is off by 14 for all 5 pages.

Both problems are fixable:

1. The deleted_row_limit would need to be removed and the parsed_row_count would not be incremented for deleted rows.
2. Instead of calculating the bitmap position by starting from the beginning of the page area for the uncompressed rows, the position could be calculated from the position of the first subheader. In all files I've analyzed, the bitmap is directly before the first subheader with no bytes in between.
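The second fix could be sketched roughly as below. This is not the PR's code: the function name is hypothetical, and the bitmap length (one bit per row on the page, rounded up to whole bytes, with no padding) is an assumption that would need to be confirmed against real files.

```c
#include <stddef.h>

/* Hypothetical helper: computes the offset of the deleted-row bitmap
 * within a page by walking backwards from the first subheader.
 * Assumptions: one bit per row on the page, rounded up to whole bytes,
 * and no bytes between the bitmap and the first subheader.
 * Returns 0 on success, -1 if the bitmap would not fit before the subheader. */
static int deleted_bitmap_offset(size_t first_subheader_offset,
                                 size_t rows_on_page,
                                 size_t *offset_out) {
    size_t bitmap_len = (rows_on_page + 7) / 8;
    if (bitmap_len > first_subheader_offset)
        return -1;
    *offset_out = first_subheader_offset - bitmap_len;
    return 0;
}
```

Anchoring on the subheader position avoids depending on the layout of the row area, which is where the current calculation is off by 14 for the SAS 7 file.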

If there is interest in an adaptation of this PR for SAS 7 files, please let me know.

gdementen · 2026-06-30T21:16:29Z

As far as I am concerned (but I am just a biased user) support for SAS8+ files with deleted rows would already be very nice. I mean, support for SAS7 files would be even nicer but not if that delays merging the original PR.

hpoettker force-pushed the deleted-rows branch from bb004aa to fe54fb9 on April 19, 2026 16:48

hpoettker mentioned this pull request Apr 19, 2026

Deleted observations being included with read_sas7bdat Roche/pyreadstat#307

Closed

hpoettker force-pushed the deleted-rows branch 4 times, most recently from 2e793b5 to 189d873 on April 25, 2026 15:22

Support deleted rows in SAS data files

1cbd51e

hpoettker force-pushed the deleted-rows branch from 189d873 to 1cbd51e on April 25, 2026 15:54

hpoettker mentioned this pull request Jun 30, 2026

Relax version string validation for SAS 7 #379

Open

hpoettker wants to merge 1 commit into WizardMac:dev from hpoettker:deleted-rows
