Skip to content

TypeError (str/bytes) in warc.py error path #23

@bnewbold

Description

@bnewbold

In production at IA, probably caused by petabox downtime or network error, I got a the following exception and stack trace:

TypeError: sequence item 0: expected str instance, bytes found
  File "extraction_ungrobided.py", line 272, in <module>
    MRExtractUnGrobided.run()
  File "mrjob/job.py", line 424, in run
    mr_job.execute()
  File "mrjob/job.py", line 433, in execute
    self.run_mapper(self.options.step_num)
  File "mrjob/job.py", line 517, in run_mapper
    for out_key, out_value in mapper(key, value) or ():
  File "extraction_ungrobided.py", line 228, in mapper
    info, status = self.extract(info)
  File "extraction_ungrobided.py", line 143, in extract
    info['file:cdx']['c_size'])
  File "extraction_ungrobided.py", line 126, in fetch_warc_content
    gwb_record = rstore.load_resource(warc_uri, offset, c_size)
  File "wayback/resourcestore.py", line 65, in load_resource
    return create_resource(loader.load_block(bstart, blen))
  File "wayback/resource.py", line 583, in create_resource
    record, errors, offset = parser.parse(rs, 0, line)
  File "hanzo/warctools/warc.py", line 223, in parse
    % (",".join(self.KNOWN_VERSIONS)),

self.KNOWN_VERSIONS is defined as bytes at https://github.com/internetarchive/warctools/blob/master/hanzo/warctools/warc.py#L177, but is being joined with a string.

One fix, though i'm not sure it would work in Python 2.7, would be:

(",".join([s.decode('utf-8') for s in self.KNOWN_VERSIONS])

There's probably a more idiomatic way, but I can submit a patch for that.

While we're at it, might want to make it a join on ", ", not ","?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions