Skip to content

Conversation

@mlell
Copy link

@mlell mlell commented Jan 4, 2026

While CSVReader has numerous options to skip header or footer lines, or skip lines based on prefixes, it is cumbersome to read lines with varying lines before or after the actual statements. Users would need to override CSVReader.read(). Simply extending using super().read(...) is possible but requires creating a temporary file because read() needs a physical file path. Simply adding an open(str) -> file-like method to CSVReader allows to keep the useful functionality of read() while still being able to preprocess the file, for example, skipping a variable number of lines based on a pattern.

This would fix #196.

This allows subclasses to customize raw file reading.
@johannesjh
Copy link
Contributor

I second the proposition. It would also help with #197

Instead of strictly requiring a file-like object, the importer could allow more general Iterable[String] type of objects. Users could still pass file-like objects, but also generator expressions. Generator expressions could be a convenient and idiomatic way of manipulating / filtering CSV lines.

This method uses the class members header, footer, names, dialect,
comments, and order. Overriding this method causes those members
to be ignored unless the overriding method explicitly uses them.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not belong to this commit or this PR. I also find it unnecessary.

"""Open the CSV file for reading.
This method can be overridden in subclasses to customize raw file reading,
for example to skip lines before processing or to handle special file formats.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "special file formats" mean here? This class supports a declarative way to define an importer for CSV files. Why would you want to use it for something else?

Returns:
A file object opened for reading.
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @johannesjh: this interface is not optimal. Having the open() method return an Iterable[str] would be better. A

def iterlines(self, fd):
    yield from fd

and use it where appropriate, would be another, maybe even better, possibility.

c, d
"""

class CustomReader(CSVReader):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class CustomReader(CSVReader):
class Reader(CSVReader):

Please keep the classes in the tests named uniformly.

Comment on lines +465 to +479
class CustomReader(CSVReader):
first = Column("First")
second = Column("Second")

def open(self, filepath):
"""Skip lines until we find the column headers."""
fd = super().open(filepath)
# Read lines until we find one containing "First"
for line in fd:
if "First" in line:
# Create a new file-like object with the header line and remaining content
remaining = fd.read()
fd.close()
return io.StringIO(line + remaining)
return fd
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the iterlines() idea exposed above, this would become simply:

from itertools import dropwhile

class Reader(CSVReader):
    first = Column("First")
    second = Column("Second")

    def iterlines(self, fd):
        return dropwhile(lambda line: "First" not in line, fd)

which seems much better to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Preprocessing a file for csvbase.Importer requires acrobatics with temporary files

3 participants