@@ -62,7 +62,7 @@ PARALLEL_SORT_THREADS=2
6262# echo -e "com.opus\ncom.opera\nco.mopus\nco.mopera" | shuf | LC_ALL=C sort
6363# This requirement is met by the output of the cc-pyspark job.
6464#
65- # 2 the second problem stems from the fact that a hyphen (valid in host and
65+ # 2 The second problem stems from the fact that a hyphen (valid in host and
6666# subdomain names) is sorted before the dot:
6767# ac.gov
6868# ac.gov.ascension
@@ -86,7 +86,37 @@ PARALLEL_SORT_THREADS=2
8686# a trailing dot:
8787# ac.gov.ascension-island
8888# ac.gov.ascension
89-
89+ #
90+ # 3 The public suffix list adds a further issue: there are multi-part suffixes,
91+ # such as "co.uk" (or "uk.co" in reverse domain name notation). The suffixes
92+ # of a multi-part suffix can be public suffixes themselves: "uk" is also a public
93+ # suffix. But they do not need to be. For example, "no" and "os.hordaland.no" are
94+ # in the public suffix list but "hordaland.no" is not. In this situation,
95+ # adding a trailing dot does not even guarantee that all hosts of a domain under
96+ # a public suffix are in a contiguous block:
97+ #
98+ # $> cat hordaland.txt
99+ # no.hordaland
100+ # no.hordaland-teater
101+ # no.hordaland.os
102+ # no.hordaland.os.bibliotek
103+ # no.hordaland.oygarden
104+ # no.hordalandfolkemusikklag
105+ #
106+ # $> cat hordaland.txt | sed 's/$/./' | LC_ALL=C sort
107+ # no.hordaland-teater.
108+ # no.hordaland.
109+ # no.hordaland.os.
110+ # no.hordaland.os.bibliotek.
111+ # no.hordaland.oygarden.
112+ # no.hordalandfolkemusikklag.
113+ #
114+ # The host names "no.hordaland." and "no.hordaland.oygarden." are both
115+ # under the domain "no.hordaland" (the public suffix is "no").
116+ #
117+ # Please see https://github.com/commoncrawl/cc-webgraph/issues/3
118+ # for further details.
119+ #
90120
91121export LC_ALL=C
92122
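The non-contiguity described in item 3 of the comment can be reproduced with a short script. This is a minimal sketch, not part of the patched script: the variable names `hosts`, `sorted`, `first`, and `second` are hypothetical, and the host list is copied from the `hordaland.txt` example in the comment. It appends the trailing dot, sorts bytewise, and reports where the two hosts of the domain "no.hordaland" end up.

```shell
#!/bin/sh
# Illustration only: variable names are assumptions, not taken from the script.
# Host names in reverse domain name notation, as in the hordaland.txt example.
hosts='no.hordaland
no.hordaland-teater
no.hordaland.os
no.hordaland.os.bibliotek
no.hordaland.oygarden
no.hordalandfolkemusikklag'

# Append a trailing dot to every host name and sort bytewise (C locale).
sorted=$(printf '%s\n' "$hosts" | sed 's/$/./' | LC_ALL=C sort)
printf '%s\n' "$sorted"

# Find the line numbers of the two hosts that belong to the domain
# "no.hordaland" (-x matches whole lines, -F treats the pattern literally).
first=$(printf '%s\n' "$sorted" | grep -n -x -F 'no.hordaland.' | cut -d: -f1)
second=$(printf '%s\n' "$sorted" | grep -n -x -F 'no.hordaland.oygarden.' | cut -d: -f1)

# Last line printed: no.hordaland. is on line 2, no.hordaland.oygarden. on line 5
echo "no.hordaland. is on line $first, no.hordaland.oygarden. on line $second"
```

Lines 3 and 4 of the sorted output hold "no.hordaland.os." and "no.hordaland.os.bibliotek.", which fall under the separate public suffix "os.hordaland.no". They sit between the two hosts of "no.hordaland", so the trailing dot alone does not produce one contiguous block per domain.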