Commit 6a514b4

Merge pull request #35 from jelmervdl/metadata-only
Add `--jsonl` option
2 parents 7cec357 + 8be9393 commit 6a514b4

13 files changed

Lines changed: 278 additions & 157 deletions

CMakeLists.txt

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -16,7 +16,7 @@ if (NOT CMAKE_BUILD_TYPE)
1616
set(CMAKE_BUILD_TYPE Release)
1717
endif ()
1818

19-
find_package(Boost 1.71 COMPONENTS program_options log log_setup REQUIRED)
19+
find_package(Boost 1.75 COMPONENTS program_options json log log_setup REQUIRED)
2020

2121
# compile executable into bin/
2222
set(EXECUTABLE_OUTPUT_PATH ${PROJECT_BINARY_DIR}/bin)

README.md

Lines changed: 40 additions & 3 deletions
@@ -42,11 +42,15 @@ warc2text -o <output_folder> [ -f <output_files> ] [ --pdfpass <output_warc> ]
4242
[ --paragraph-identification ] [ --tag-filters <filters_file> ] <warc_file>...
4343
```
4444
* `--output`/`-o` output folder
45-
* `--files`/`-f` list of output files separated by commas (and without `.gz`); `text` and `url` are always written, while `mime` and `html` are optional
45+
* `--files`/`-f` list of output files separated by commas (and without `.gz`); Options are `text`,`html`,`url`,`mime`,`file` and `date`. Defaults to `text,url`. See [output](#output).
46+
* `--jsonl` Produce JSON Lines on stdout instead of writing to files per language.
4647
* `--pdfpass` WARC file where PDF records will be stored
48+
* `--robotstxtpass` WARC file where robots.txt related records will be stored
49+
* `--encode-urls` Escape non-ascii characters that appear in the record URL with `%dd` encoding.
50+
* `--multilang` Detect multiple languages in the document, and split the document accordingly. Only supported with CLD2 classifier.
4751
* `--paragraph-identification` print the paragraph identifier for each sentence extracted from the HTML
48-
* `--classifier` classifier to use: `cld2` or `fasttext`.
49-
* `--fasttext-model` path to FastText model for fasttext classifier.
52+
* `--classifier` classifier to use: `cld2` or `fasttext`. When `fasttext` is used, one also has to specify a model using `--fasttext-model`.
53+
* `--fasttext-model` path to FastText model for fasttext classifier. Models can be any [FastText language identification model](https://fasttext.cc/docs/en/language-identification.html) such as [OpenLID lid201-model.ftz](https://github.com/laurieburchell/open-lid-dataset#quantised-model)
5054
* `--tag-filters` file containing filters that are used to eliminate matching documents
5155
* `--invert-tag-filters` output only documents that match the filter
5256
* `--url-filters` file containing regular expressions that match urls of documents to eliminate
@@ -61,6 +65,39 @@ warc2text -o <output_folder> [ -f <output_files> ] [ --pdfpass <output_warc> ]
6165

6266
Lines beginning with `#` and empty lines are ignored. Any invalid filter will raise a warning message, but will not prevent other filters from being read.
6367

68+
## Output
69+
When used with `--output`/`-o` (with optionally `--files`/`-f`), warc2text will
70+
produce the following directory structure at the path specified by `--output`:
71+
72+
- `./{lang}/text.gz` will contain the plain text per document as base64 encoded lines. E.g. `gzip -cd en/text.gz | head -n5 | tail -n1 | base64 -d` will give you the 5th document's text.
73+
- `./{lang}/url.gz` contains [the crawled URL](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-target-uri) for each record.
74+
- `./{lang}/mime.gz` contains the mimetype as reported by the crawled server
75+
- `./{lang}/html.gz` contains lines of base64 encoded HTML as returned by the server. For ePub, MS Office or ODF files this is the extracted XML.
76+
- `./{lang}/file.gz` contains the `{filename}:{offset}:{length}` pointer to the warc archive the record was extracted from. `{offset}` and `{length}` are of the compressed data, e.g. `tail -c+{offset} < {filename} | head -c{length} | gzip -cd` will give you the original record.
77+
- `./{lang}/date.gz` gives you the original crawl date/time as reported by the crawler. [This should be a UTC timestamp](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-date-mandatory).
78+
79+
In every file, each line corresponds to the same record. E.g. the fifth line in `text.gz` and fifth line in `url.gz` together give you the text and url for a single record.
80+
81+
The `{lang}` part of the path is determined by the classifier (see `--classifier`) and may be a two-letter or three-letter code depending on the classifier used. See [this list](https://github.com/CLD2Owners/cld2/blob/b56fa78a2fe44ac2851bae5bf4f4693a0644da7b/internal/generated_language.cc#L647-L1262) for CLD2.
82+
83+
When using `--jsonl`, the output is instead a single JSON record per line, with the following keys (always in this order):
84+
```ts
85+
{
86+
f: string, # filename of warc file (same as the `{filename}` part in `file.gz`)
87+
o: number, # byte offset of record in warc file (same as `{offset}` in `file.gz`)
88+
s: number, # warc file record size (same as `{length}` in `file.gz`)
89+
rs: number, # byte size of record payload (uncompressed)
90+
ps: number, # byte size of text only payload (so compare this against `rs` and you should get amount of HTML removed)
91+
l: string, # identified language by classifier
92+
u: string, # url
93+
c: string, # content type as reported by the HTTP response header (or warc record header if that isn't present)
94+
ts: string, # crawl date/time as reported by the crawler
95+
p: string, # plain text
96+
}
97+
```
98+
99+
More keys might be added in the future (e.g. the raw HTML is not included now) and you should not expect the order of the keys to stay the same between different versions of warc2text.
100+
64101
## Included dependencies
65102
HTML Tokenizer by [c-smile](https://www.codeproject.com/Articles/14076/Fast-and-Compact-HTML-XML-Scanner-Tokenizer)
66103

src/bilangwriter.cc

Lines changed: 79 additions & 52 deletions
Original file line numberDiff line numberDiff line change
@@ -3,30 +3,26 @@
33
#include "util/exception.hh"
44
#include <cassert>
55
#include <string>
6+
#include <iomanip>
7+
#include <boost/json.hpp>
8+
69

710
namespace warc2text{
811

9-
GzipWriter::GzipWriter() {
10-
dest = nullptr;
11-
compressed = 0;
12-
s.zalloc = nullptr;
13-
s.zfree = nullptr;
14-
s.opaque = nullptr;
15-
int ret = deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 31, 8, Z_DEFAULT_STRATEGY);
16-
assert(ret == Z_OK);
17-
buf = new unsigned char[BUFFER_SIZE];
12+
GzipWriter::GzipWriter()
13+
: dest(nullptr),
14+
buf(new unsigned char[BUFFER_SIZE]) {
15+
//
1816
}
1917

2018
GzipWriter::~GzipWriter() {
21-
if (dest) {
22-
this->compress("", 0, Z_FINISH);
23-
deflateEnd(&s);
24-
std::fclose(dest);
25-
}
19+
if (is_open())
20+
close();
2621
delete[] buf;
2722
}
2823

2924
void GzipWriter::compress(const char *in, std::size_t size, int flush) {
25+
assert(is_open());
3026
if (size == 0 && flush == Z_NO_FLUSH) return;
3127
s.avail_in = size;
3228
s.next_in = (Bytef *) in;
@@ -39,7 +35,7 @@ namespace warc2text{
3935
s.next_out = buf;
4036
ret = deflate(&s, flush);
4137
assert(ret == Z_OK || ret == Z_STREAM_END); // Z_STREAM_END only happens if flush == Z_FINISH
42-
compressed = BUFFER_SIZE - s.avail_out;
38+
std::size_t compressed = BUFFER_SIZE - s.avail_out;
4339
//written = std::fwrite(buf, 1, compressed, dest);
4440
std::fwrite(buf, 1, compressed, dest);
4541
// TODO error handling
@@ -52,47 +48,68 @@ namespace warc2text{
5248
void GzipWriter::open(const std::string& filename) {
5349
dest = std::fopen(filename.c_str(), "wb");
5450
UTIL_THROW_IF(!dest, util::ErrnoException, "while creating " << filename);
51+
s.zalloc = nullptr;
52+
s.zfree = nullptr;
53+
s.opaque = nullptr;
54+
int ret = deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 31, 8, Z_DEFAULT_STRATEGY);
55+
assert(ret == Z_OK);
56+
}
57+
58+
void GzipWriter::close() {
59+
compress("", 0, Z_FINISH);
60+
deflateEnd(&s);
61+
std::fclose(dest);
62+
dest = nullptr;
5563
}
5664

5765
void GzipWriter::write(const char* text, std::size_t size) {
58-
this->compress(text, size, Z_NO_FLUSH);
66+
compress(text, size, Z_NO_FLUSH);
5967
}
6068

6169
void GzipWriter::writeLine(const char* text, std::size_t size) {
62-
this->compress(text, size, Z_NO_FLUSH);
63-
this->compress("\n", 1, Z_NO_FLUSH);
70+
compress(text, size, Z_NO_FLUSH);
71+
compress("\n", 1, Z_NO_FLUSH);
6472
}
6573

6674
void GzipWriter::writeLine(const std::string& text) {
67-
this->compress(text.c_str(), text.size(), Z_NO_FLUSH);
68-
this->compress("\n", 1, Z_NO_FLUSH);
75+
compress(text.c_str(), text.size(), Z_NO_FLUSH);
76+
compress("\n", 1, Z_NO_FLUSH);
6977
}
7078

7179
bool GzipWriter::is_open(){
7280
return dest != nullptr;
7381
}
7482

75-
void BilangWriter::write(const std::string& lang, const std::string& b64text, const std::string& url, const std::string& mime, const std::string& b64html) {
76-
GzipWriter* gzurl = &url_files[lang];
77-
GzipWriter* gztext = &text_files[lang];
78-
GzipWriter* gzmime = nullptr;
79-
GzipWriter* gzhtml = nullptr;
80-
if (output_files.count("mime") == 1) gzmime = &(mime_files[lang]);
81-
if (output_files.count("html") == 1) gzhtml = &(html_files[lang]);
82-
if (!gzurl->is_open()) {
83-
// if one file does not exist, the rest shouldn't either
84-
std::string path = folder + "/" + lang;
85-
util::createDirectories(path);
86-
gzurl->open(path + "/url.gz");
87-
gztext->open(path + "/text.gz");
88-
if (gzmime != nullptr) gzmime->open(path + "/mime.gz");
89-
if (gzhtml != nullptr) gzhtml->open(path + "/html.gz");
90-
}
83+
LangWriter::LangWriter(const std::string& path, const std::unordered_set<std::string>& output_files) {
84+
util::createDirectories(path);
85+
86+
if (output_files.count("url"))
87+
url_file.open(path + "/url.gz");
88+
if (output_files.count("text"))
89+
text_file.open(path + "/text.gz");
90+
if (output_files.count("mime"))
91+
mime_file.open(path + "/mime.gz");
92+
if (output_files.count("html"))
93+
html_file.open(path + "/html.gz");
94+
if (output_files.count("file"))
95+
file_file.open(path + "/file.gz");
96+
if (output_files.count("date"))
97+
date_file.open(path + "/date.gz");
98+
}
9199

92-
gzurl->writeLine(url);
93-
gztext->writeLine(b64text);
94-
if (gzmime != nullptr) gzmime->writeLine(mime);
95-
if (gzhtml != nullptr) gzhtml->writeLine(b64html);
100+
void LangWriter::write(Record const &record, std::string const &chunk) {
101+
if (url_file.is_open())
102+
url_file.writeLine(record.getURL());
103+
if (mime_file.is_open())
104+
mime_file.writeLine(record.getHTTPcontentType());
105+
if (file_file.is_open())
106+
file_file.writeLine(record.getFilename() + ":" + std::to_string(record.getOffset()) + ":" + std::to_string(record.getSize()));
107+
if (date_file.is_open())
108+
date_file.writeLine(record.getWARCdate());
109+
if (html_file.is_open())
110+
html_file.writeLine(util::encodeBase64(record.getPayload()));
111+
if (text_file.is_open())
112+
text_file.writeLine(util::encodeBase64(chunk));
96113
}
97114

98115
std::string get_paragraph_id(const std::string& text) {
@@ -111,23 +128,33 @@ namespace warc2text{
111128
}
112129

113130
void BilangWriter::write(const Record& record, bool paragraph_identification) {
114-
std::string base64text;
115-
std::string base64html;
116-
117-
if (output_files.count("html") == 1)
118-
util::encodeBase64(record.getPayload(), base64html);
119-
120131
for (const auto& it : record.getTextByLangs()) {
121-
std::string payload = it.second;
132+
std::string chunk = it.second;
122133

123-
if (paragraph_identification) {
124-
payload = get_paragraph_id(payload);
125-
}
134+
if (paragraph_identification)
135+
chunk = get_paragraph_id(chunk);
126136

127-
util::encodeBase64(payload, base64text);
128-
this->write(it.first, base64text, record.getURL(), record.getHTTPcontentType(), base64html);
137+
auto writer_it = writers.try_emplace(it.first, folder + "/" + it.first, output_files);
138+
writer_it.first->second.write(record, chunk);
129139
}
130140
}
131141

142+
void JSONLinesWriter::write(const Record& record, [[maybe_unused]] bool paragraph_identification) {
143+
// JSON lines format (https://jsonlines.org)
144+
for (auto &&chunk : record.getTextByLangs()) {
145+
out_ << boost::json::value{
146+
{"f", boost::json::string(record.getFilename())},
147+
{"o", boost::json::value(record.getOffset())},
148+
{"s", boost::json::value(record.getSize())},
149+
{"rs", boost::json::value(record.getPayload().size())},
150+
{"ps", boost::json::value(chunk.second.size())},
151+
{"l", boost::json::string(chunk.first)},
152+
{"u", boost::json::string(record.getURL())},
153+
{"c", boost::json::string(record.getHTTPcontentType())},
154+
{"ts", boost::json::string(record.getWARCdate())},
155+
{"p", boost::json::string(chunk.second)},
156+
} << "\n";
157+
}
158+
}
132159
}
133160
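
The new `BilangWriter::write` relies on `std::unordered_map::try_emplace` to create a per-language writer only the first time a language is seen. The sketch below isolates that behaviour under simplified assumptions: `FakeWriter` and `writer_for` are hypothetical stand-ins for `LangWriter` and the lookup in `BilangWriter::write`, and nothing is written to disk.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for LangWriter: counts constructions instead of
// opening gzip files. Like a type that owns file handles, it is neither
// copyable nor movable.
struct FakeWriter {
    static inline int constructed = 0;
    std::string path;
    int records = 0;

    explicit FakeWriter(std::string p) : path(std::move(p)) { ++constructed; }
    FakeWriter(const FakeWriter&) = delete;
    FakeWriter& operator=(const FakeWriter&) = delete;

    void write() { ++records; }
};

// Return the writer for `lang`, constructing it in place on first use.
// If the key already exists, try_emplace constructs nothing and simply
// returns an iterator to the existing element.
FakeWriter& writer_for(std::unordered_map<std::string, FakeWriter>& writers,
                       const std::string& folder,
                       const std::string& lang) {
    auto [it, inserted] = writers.try_emplace(lang, folder + "/" + lang);
    (void)inserted; // true only when a new writer was created
    return it->second;
}
```

Because the element is constructed directly inside the map node, the mapped type does not need a copy or move constructor. Note that the path argument is still computed on every call; only the construction is skipped for known languages.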

src/bilangwriter.hh

Lines changed: 47 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -3,65 +3,84 @@
33

44
#include <unordered_map>
55
#include <unordered_set>
6+
#include <ostream>
67
#include "record.hh"
78
#include "zlib.h"
89

910
namespace warc2text {
1011

12+
/**
13+
* Generic interface for writing records to some form of output.
14+
*/
15+
class RecordWriter {
16+
public:
17+
virtual void write(const Record& record, bool paragraph_identification = false) = 0;
18+
virtual ~RecordWriter() = default;
19+
};
20+
21+
/**
22+
* Writer used by BilangWriter to write a single compressed file
23+
* (i.e. a column for a specific language)
24+
*/
1125
class GzipWriter {
1226
private:
1327
FILE* dest;
1428
z_stream s{};
1529
unsigned char* buf;
16-
std::size_t compressed;
1730
void compress(const char* in, std::size_t size, int flush);
1831

1932
public:
2033
GzipWriter();
2134
~GzipWriter();
2235
void open(const std::string& filename);
36+
void close();
2337
void write(const char* text, std::size_t size);
2438
void writeLine(const char* text, std::size_t size);
2539
void writeLine(const std::string& text);
2640
bool is_open();
2741
static const std::size_t BUFFER_SIZE = 4096;
2842
};
2943

30-
class BilangWriter {
44+
/**
45+
* Writes records to a specific folder for a specific language.
46+
*/
47+
class LangWriter {
48+
private:
49+
GzipWriter url_file;
50+
GzipWriter mime_file;
51+
GzipWriter text_file;
52+
GzipWriter html_file;
53+
GzipWriter file_file;
54+
GzipWriter date_file;
55+
public:
56+
LangWriter(const std::string& folder, const std::unordered_set<std::string>& output_files);
57+
void write(const Record& record, const std::string &chunk);
58+
};
59+
60+
class BilangWriter : public RecordWriter {
3161
private:
3262
std::string folder;
33-
std::unordered_map<std::string, GzipWriter> url_files;
34-
std::unordered_map<std::string, GzipWriter> mime_files;
35-
std::unordered_map<std::string, GzipWriter> text_files;
36-
std::unordered_map<std::string, GzipWriter> html_files;
3763
std::unordered_set<std::string> output_files;
38-
39-
void write(const std::string& lang, const std::string& b64text, const std::string& url, const std::string& mime, const std::string& b64html);
40-
64+
std::unordered_map<std::string, LangWriter> writers;
4165
public:
42-
explicit BilangWriter(const std::string& folder) :
43-
folder(folder),
44-
url_files(),
45-
mime_files(),
46-
text_files(),
47-
html_files(),
48-
output_files({}) // url and text are mandatory regardless
49-
{};
50-
51-
explicit BilangWriter(const std::string& folder, const std::unordered_set<std::string>& output_files) :
52-
folder(folder),
53-
url_files(),
54-
mime_files(),
55-
text_files(),
56-
html_files(),
57-
output_files(output_files)
58-
{};
59-
60-
void write(const Record& record, bool paragraph_identification = false);
66+
BilangWriter(const std::string& folder, const std::unordered_set<std::string>& output_files = {})
67+
: folder(folder)
68+
, output_files(output_files)
69+
{
70+
//
71+
};
6172

73+
virtual void write(const Record& record, bool paragraph_identification = false);
6274
};
6375

76+
class JSONLinesWriter : public RecordWriter {
77+
private:
78+
std::ostream &out_;
79+
public:
80+
explicit JSONLinesWriter(std::ostream &out) : out_(out) {};
6481

82+
virtual void write(const Record& record, bool paragraph_identification = false);
83+
};
6584
}
6685

6786
#endif
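
The header above introduces `RecordWriter` as a common interface for `BilangWriter` and `JSONLinesWriter`. How the command-line front end picks between them is not part of the files shown here, so the following is only a sketch of the general pattern. `Record` is reduced to two fields, and `StreamWriter`, `MemoryWriter` and `make_writer` are hypothetical names that do not appear in the commit.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, much-reduced stand-in for warc2text's Record.
struct Record {
    std::string url;
    std::string text;
};

// Same shape as the RecordWriter interface in the diff, minus the
// paragraph_identification flag.
class RecordWriter {
public:
    virtual void write(const Record& record) = 0;
    virtual ~RecordWriter() = default;
};

// Stand-in for JSONLinesWriter: one line per record on a borrowed stream.
class StreamWriter : public RecordWriter {
    std::ostream& out_;
public:
    explicit StreamWriter(std::ostream& out) : out_(out) {}
    void write(const Record& record) override {
        out_ << record.url << '\t' << record.text << '\n';
    }
};

// Stand-in for BilangWriter: keeps records in memory instead of on disk.
class MemoryWriter : public RecordWriter {
    std::vector<Record>& sink_;
public:
    explicit MemoryWriter(std::vector<Record>& sink) : sink_(sink) {}
    void write(const Record& record) override { sink_.push_back(record); }
};

// Hypothetical factory: choose the implementation once, up front, so the
// record-processing loop only ever sees a RecordWriter.
std::unique_ptr<RecordWriter> make_writer(bool to_stream,
                                          std::ostream& out,
                                          std::vector<Record>& sink) {
    if (to_stream) return std::make_unique<StreamWriter>(out);
    return std::make_unique<MemoryWriter>(sink);
}
```

The virtual destructor on the base class is what makes it safe to destroy either implementation through a `std::unique_ptr<RecordWriter>`.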
