* `--files`/`-f` comma-separated list of output files (without the `.gz` suffix). Options are `text`, `html`, `url`, `mime`, `file` and `date`. Defaults to `text,url`. See [output](#output).
* `--jsonl` Produce JSON Lines on stdout instead of writing to files per language.
* `--pdfpass` WARC file where PDF records will be stored
* `--robotstxtpass` WARC file where robots.txt related records will be stored
* `--encode-urls` Escape non-ASCII characters that appear in the record URL with `%dd` encoding.
* `--multilang` Detect multiple languages in the document, and split the document accordingly. Only supported with the CLD2 classifier.
* `--paragraph-identification` print the paragraph identifier for each sentence extracted from the HTML
* `--classifier` classifier to use: `cld2` or `fasttext`. When `fasttext` is used, one also has to specify a model using `--fasttext-model`.
* `--fasttext-model` path to FastText model for the fasttext classifier. Models can be any [FastText language identification model](https://fasttext.cc/docs/en/language-identification.html) such as [OpenLID lid201-model.ftz](https://github.com/laurieburchell/open-lid-dataset#quantised-model).
* `--tag-filters` file containing filters that are used to eliminate matching documents
* `--invert-tag-filters` output only documents that match the filter
* `--url-filters` file containing regular expressions that match URLs of documents to eliminate

Lines beginning with `#` and empty lines are ignored. Any invalid filter will raise a warning message, but will not prevent other filters from being read.
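
As an illustration of this format (the patterns below are hypothetical examples, not filters shipped with warc2text), a `--url-filters` file could look like this:

```
# discard common image URLs
\.(jpe?g|png|gif)$

# discard a hypothetical private area of one site
^https?://example\.com/private/
```
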

## Output

When used with `--output`/`-o` (and optionally `--files`/`-f`), warc2text will produce the following directory structure at the path specified by `--output`:

- `./{lang}/text.gz` will contain the plain text per document as base64-encoded lines. E.g. `gzip -cd en/text.gz | head -n5 | tail -n1 | base64 -d` will give you the 5th document's text.
- `./{lang}/url.gz` contains [the crawled URL](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-target-uri) for each record.
- `./{lang}/mime.gz` contains the mimetype as reported by the crawled server.
- `./{lang}/html.gz` contains lines of base64-encoded HTML as returned by the server. For ePub, MS Office or ODF files this is the extracted XML.
- `./{lang}/file.gz` contains the `(unknown):{offset}:{length}` pointer to the warc archive the record was extracted from. `{offset}` and `{length}` are of the compressed data, e.g. `tail -c+{offset} < (unknown) | head -c{length} | gzip -cd` will give you the original record.
- `./{lang}/date.gz` gives you the original crawl date/time as reported by the crawler. [This should be a UTC timestamp](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-date-mandatory).

In every file, each line corresponds to the same record. E.g. the fifth line in `text.gz` and fifth line in `url.gz` together give you the text and url for a single record.
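
For instance, the pairing can be sketched with synthetic data standing in for real warc2text output (the two-record archives below are fabricated for illustration):

```shell
# Build tiny stand-ins for warc2text's per-language output (two records).
mkdir -p en
printf 'http://a.example/\nhttp://b.example/\n' | gzip > en/url.gz
{ printf 'hello' | base64; printf 'world' | base64; } | gzip > en/text.gz

# Line n in url.gz and line n in text.gz belong to the same record.
n=2
url=$(gzip -cd en/url.gz | sed -n "${n}p")
text=$(gzip -cd en/text.gz | sed -n "${n}p" | base64 -d)
echo "$url -> $text"   # http://b.example/ -> world
```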

The `{lang}` part of the path is determined by the classifier (see `--classifier`) and may be a two-letter or three-letter code depending on the classifier used. See [this list](https://github.com/CLD2Owners/cld2/blob/b56fa78a2fe44ac2851bae5bf4f4693a0644da7b/internal/generated_language.cc#L647-L1262) for CLD2.

When using `--jsonl`, the output is instead a single JSON record per line, with the following keys (always in this order):

```
{
...
c: string, # content type as reported by the HTTP response header (or WARC record header if that isn't present)
ts: string, # crawl date/time as reported by the crawler
p: string, # plain text
}
```
More keys might be added in the future (e.g. the raw HTML is not included now) and you should not expect the order of the keys to stay the same between different versions of warc2text.
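
As a sketch of consuming this stream, the plain text field can be extracted with `jq`; the literal record below is a fabricated stand-in for one line of real `warc2text --jsonl` output and uses only the keys documented above:

```shell
# One fabricated JSONL record (abridged to the keys documented above).
record='{"c":"text/plain","ts":"2021-01-01T00:00:00Z","p":"hello world"}'

# -r prints the raw string value of .p without JSON quoting.
p=$(printf '%s\n' "$record" | jq -r '.p')
echo "$p"   # hello world
```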
## Included dependencies
HTML Tokenizer by [c-smile](https://www.codeproject.com/Articles/14076/Fast-and-Compact-HTML-XML-Scanner-Tokenizer)