Commit ddf59fd

Host2DomainGraph: complete documentation about host name sorting
Add information about issue #3 requiring buffering when folding using the public suffix list.
1 parent b6380bb commit ddf59fd

1 file changed

Lines changed: 32 additions & 2 deletions

src/script/host2domaingraph.sh

@@ -62,7 +62,7 @@ PARALLEL_SORT_THREADS=2
 # echo -e "com.opus\ncom.opera\nco.mopus\nco.mopera" | shuf | LC_ALL=C sort
 # This requirement is met by the output of the cc-pyspark job.
 #
-# 2 the second problem stems from the fact that a hyphen (valid in host and
+# 2 The second problem stems from the fact that a hyphen (valid in host and
 # subdomain names) is sorted before the dot:
 # ac.gov
 # ac.gov.ascension
@@ -86,7 +86,37 @@ PARALLEL_SORT_THREADS=2
 # a trailing dot:
 # ac.gov.ascension-island
 # ac.gov.ascension
-
+#
+# 3 The public suffix list adds a further issue: there are multi-part suffixes,
+# such as "co.uk" (or "uk.co" in reverse domain name notation). And the suffixes
+# of a multi-part suffix can be public suffixes themselves: "uk" is also a public
+# suffix. But they do not need to be. For example: "no" and "os.hordaland.no" are
+# in the public suffix list but "hordaland.no" is not. In this situation,
+# adding a trailing dot does not even guarantee that all hosts of a domain under
+# a public suffix are in a contiguous block:
+#
+# $> cat hordaland.txt
+# no.hordaland
+# no.hordaland-teater
+# no.hordaland.os
+# no.hordaland.os.bibliotek
+# no.hordaland.oygarden
+# no.hordalandfolkemusikklag
+#
+# $> cat hordaland.txt | sed 's/$/./' | LC_ALL=C sort
+# no.hordaland-teater.
+# no.hordaland.
+# no.hordaland.os.
+# no.hordaland.os.bibliotek.
+# no.hordaland.oygarden.
+# no.hordalandfolkemusikklag.
+#
+# The host names "no.hordaland." and "no.hordaland.oygarden." are both under
+# the domain "no.hordaland" (public suffix "no") but are not adjacent.
+#
+# Please see https://github.com/commoncrawl/cc-webgraph/issues/3
+# for further details.
+#
 
 export LC_ALL=C
 
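The ordering described in point 3 of the comment can be reproduced with a few lines of shell. This is a minimal sketch and not part of the commit: the function name `sort_with_trailing_dot` and the variables `hosts` and `sorted` are hypothetical, and the host list is copied from the comment in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of host2domaingraph.sh): append a trailing dot
# to every host name and sort bytewise, as the script's comments describe.
sort_with_trailing_dot() {
    sed 's/$/./' | LC_ALL=C sort
}

# Host names in reverse domain name notation, taken from the comment in the diff.
hosts='no.hordaland
no.hordaland-teater
no.hordaland.os
no.hordaland.os.bibliotek
no.hordaland.oygarden
no.hordalandfolkemusikklag'

sorted=$(printf '%s\n' "$hosts" | sort_with_trailing_dot)
printf '%s\n' "$sorted"

# Print the line numbers of the two hosts under the domain "no.hordaland".
# They are not consecutive: the "no.hordaland.os" hosts, which belong to the
# public suffix "os.hordaland.no", sort between them.
printf '%s\n' "$sorted" | grep -n -x -F -e 'no.hordaland.' -e 'no.hordaland.oygarden.'
```

The final `grep` reports the two hosts on lines 2 and 5 of the sorted output, which shows why a single pass over sorted input cannot fold them into one domain without buffering.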