@mpetroff mpetroff commented Apr 6, 2021

This PR adds read-only support for Dirfiles stored in uncompressed Zip files. The patch was motivated by a need to reduce the total file count for FLAC-encoded Dirfiles, to alleviate the backup and data transfer overheads that result from having a very large number of small files. CLASS has been using these changes for more than a year at this point. The PR is identical to the patch attached to my 2020-02-28 post to the getdata-devel mailing list, except that it omits the documentation (since the documentation isn't part of this Git repository). The original version of the patch dates back to 2018.

Documentation

Separate from the Dirfile encoding scheme, GetData will read Dirfiles contained in uncompressed Zip files. This functionality is meant for reading archival data, so writing to these Zip files is not supported. Using the Info-ZIP `zip` utility, a Zip file can be created by running `zip -r0 ../dirfile.zip *` from within the root of an existing Dirfile. All encoding schemes are supported by this functionality except for the two encoding schemes that already use Zip files, zzip and zzslim. The encoding scheme must be specified using the /ENCODING directive, even if the Dirfile is unencoded. For /INCLUDE directives and LINTERP field look up table files, only relative paths are supported and only without `./` and `../` syntax.
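
For environments without the Info-ZIP utility, the same kind of archive can be approximated with Python's standard `zipfile` module. The following is only a sketch and is not part of this PR: the helper names `zip_dirfile` and `non_stored_members` are hypothetical, and it assumes the output Zip file is written outside the Dirfile directory. It stores every file without compression, with archive names relative to the Dirfile root, and provides a check that no member was compressed.

```python
import os
import zipfile


def zip_dirfile(dirfile_root, zip_path):
    """Pack a Dirfile directory into a Store-only (uncompressed) Zip file.

    Hypothetical helper: zip_path is assumed to lie outside dirfile_root,
    otherwise the archive would try to include itself.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for folder, _subdirs, files in os.walk(dirfile_root):
            for name in sorted(files):
                full = os.path.join(folder, name)
                # Archive names are relative to the Dirfile root, so the
                # top-level "format" file sits at the top of the archive.
                zf.write(full, arcname=os.path.relpath(full, dirfile_root))


def non_stored_members(zip_path):
    """Return the names of archive members that are not Store-compressed."""
    with zipfile.ZipFile(zip_path) as zf:
        return [info.filename for info in zf.infolist()
                if info.compress_type != zipfile.ZIP_STORED]
```

An empty list from `non_stored_members("dirfile.zip")` indicates that every member, including the data files, uses Store compression.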

Although Zip files are most commonly created using Deflate compression, the Zip standard (ISO/IEC 21320-1) also supports Store compression, i.e., no compression at all. GetData's Zip file support requires Store compression for all data files, although either Store compression or Deflate compression can be used for any format files or any LINTERP field look up table files. With Store compression, a Zip file effectively concatenates a Dirfile's individual files together into a single file. Since a Zip file contains an offset table, unlike a tarball, random reads are supported without the need to load the entire file from disk.
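
To illustrate why Store compression allows random reads, here is a minimal Python sketch (not GetData's implementation, which is written in C). The function name `read_stored_slice` is hypothetical. It looks up a member's local header offset in the Zip central directory, skips the 30-byte fixed local header plus the variable-length name and extra fields, and then seeks straight to the requested bytes. It assumes the member is unencrypted.

```python
import struct
import zipfile


def read_stored_slice(zip_path, member, start, length):
    """Read `length` bytes at offset `start` of a Store-compressed member."""
    with zipfile.ZipFile(zip_path) as zf:
        info = zf.getinfo(member)  # entry from the central directory
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError(f"{member} is not Store-compressed")
    with open(zip_path, "rb") as fh:
        # The local file header has 30 fixed bytes; bytes 26-29 hold the
        # lengths of the file name and extra field that follow it.
        fh.seek(info.header_offset)
        header = fh.read(30)
        if header[:4] != b"PK\x03\x04":
            raise ValueError("local file header signature not found")
        name_len, extra_len = struct.unpack("<HH", header[26:30])
        data_start = info.header_offset + 30 + name_len + extra_len
        # Clamp the request so it never runs past the end of the member.
        start = max(0, start)
        length = max(0, min(length, info.file_size - start))
        fh.seek(data_start + start)
        return fh.read(length)
```

Only the central directory, one local header, and the requested bytes are read; the rest of the archive is never touched. With Deflate compression this shortcut is unavailable, because byte offsets in the compressed stream do not map directly to offsets in the original data.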

Documentation patch

Index: html/dirfile.html.in
===================================================================
--- html/dirfile.html.in	(revision 1175)
+++ html/dirfile.html.in	(working copy)
@@ -1222,6 +1222,30 @@
       example isn't strictly necessary, since <i>z.r</i> could be used wherever
       <i>re_z</i> would be.)
 
+      <h2><a name="zippeddirfiles">Zipped Dirfiles</a></h2>
+      <p>Separate from the Dirfile encoding scheme, GetData will read Dirfiles
+      contained in uncompressed Zip files. This functionality is meant for
+      reading archival data, so writing to these Zip files is not supported.
+      Using the Info-ZIP <span class="syntax">zip</span> utility, a Zip file can
+      be created by running <span class="syntax">zip -r0 ../dirfile.zip *</span>
+      from within the root of an existing Dirfile. All encoding schemes are
+      supported by this functionality except for the two encoding schemes that
+      already use Zip files, <b>zzip</b> and <b>zzslim</b>. The encoding scheme
+      must be specified using the /ENCODING directive, even if the Dirfile is
+      unencoded. For /INCLUDE directives and LINTERP field look up table files,
+      only relative paths are supported and only without
+      <span class="syntax">./</span> and <span class="syntax">../</span> syntax.
+      <p>Although Zip files are most commonly created using <i>Deflate</i>
+      compression, the Zip standard (ISO/IEC 21320-1) also supports <i>Store</i>
+      compression, i.e., no compression at all. GetData's Zip file support
+      requires <i>Store</i> compression for all data files, although either
+      <i>Store</i> compression or <i>Deflate</i> compression can be used for any
+      <b>format</b> files or any LINTERP field look up table files. With
+      <i>Store</i> compression, a Zip file effectively concatenates a Dirfile's
+      individual files together into a single file. Since a Zip file contains an
+      offset table, unlike a tarball, random reads are supported without the
+      need to load the entire file from disk.
+
       <h2><a name="versions">History</a></h2>
       <p>The latest version of the Dirfile Standards is Version 10.
       <div class="inset">

@ketiltrout
Owner

@mpetroff This looks reasonable, but I need to take a closer look. I'll probably have a few, mostly stylistic, issues.

Could you remove all the libtool wrappers from the PR (like test/sie_get_little_zip) and instead list them in the .gitignore file?

@mpetroff mpetroff force-pushed the add-zipped-dirfile-support branch from 756e4b9 to 3fda822 Compare May 22, 2021 15:39
@mpetroff
Author

> Could you remove all the libtool wrappers from the PR (like test/sie_get_little_zip) and instead list them in the .gitignore file?

Those were added accidentally when I converted the existing patch into the Git commit. I just removed them and added them to the .gitignore file. I squashed and force-pushed this change to remove the files from the branch history.

@ketiltrout
Owner

@mpetroff I'm working through this, and hope to have it ready some time next week. Among other things, I've added a ./configure option to enable/disable the feature and improved error propagation.

Would you prefer I push the changes to your fork or copy your branch here and do it locally?

Also, would you have time to test the changes?

@mpetroff
Author

Pushing to my fork's branch is fine. It might take me a week or two to get to it, but I'll have time to test the changes. Thanks for working on getting this merged.

@ketiltrout
Owner

Just to keep you up to date: I was hoping to get this finished before releasing v0.11.0, but I think it needs more work, and it's pointing me to some things that need fixing within the encoding framework. So it'll have to wait for the next release, which I'm hoping won't be too long from now. (Pushing GetData-0.11.0 out the door has laid bare some things that really do need some work.)
