Questions on statistics #23

@tfmorris

I've been trying to wrap my head around the overall process and the numbers associated with it. Below are the things I can't figure out:

  • Why are the CleanEval results different for the Java & Python implementations if it's the same algorithm?
  • The Phase 1 stats are inconsistent. The text says 22 hours, but the pasted log says 10.5 hrs.
  • The Phase 1 log says there were 34901 map tasks, which is suspiciously close to the number of files in the CC-MAIN-2016-07 crawl, not the 2015-48 crawl. Are these stats for a different crawl than the others?
  • The Phase 1 mapper output is 1.1 billion records, which is significantly lower than the 1.73B (or 1.82B) URLs listed for the crawl. That seems like too big a difference to be accounted for by content type filters (or is my perception wrong?). Is it known what factors contribute to this delta?
  • The paper says that there were only ~1% duplicates in the Common Crawl, but the Phase 2 reducer (exact duplicates filter) appears to have output only 39% of the input records (i.e. it filtered 60%+). Am I misunderstanding the stats, or is this the actual number of exact duplicates?
  • The Phase 1 stats seem to indicate that a significant amount (40%) of the time was spent in the shuffle phase, but it doesn't look like the reducer actually does anything. Could Phase 1 be implemented as a map-only job? Conversely, could Phase 1 & Phase 2 be merged so that the reducer actually does useful work?
  • The Phase 3 Step 3 stats for Tuples Creation (36 hrs, 7104 normalized instance hours) seem to indicate that very few instances were used for this phase. Is that an accurate observation? Would more instances reduce the elapsed time?
  • Are there stats on how many near-duplicate documents were eliminated in Phase 3/4?
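
For reference, here is a rough sketch of the arithmetic behind several of the questions above. It uses only the figures quoted in this issue; the constant and function names are illustrative, and the "average normalized capacity" is just normalized instance hours divided by elapsed hours under the assumption of a constant cluster size, not a number reported in the logs.

```python
# Figures quoted in the questions above; everything else is derived arithmetic.
PHASE1_MAPPER_OUTPUT = 1.10e9              # Phase 1 mapper output records
CRAWL_URL_COUNTS = {"1.73B": 1.73e9,       # the two URL totals listed for the crawl
                    "1.82B": 1.82e9}
PHASE2_RETAINED = 0.39                     # share of input records the Phase 2 reducer output
REPORTED_DUPLICATES = 0.01                 # ~1% duplicates stated in the paper
TUPLES_ELAPSED_HOURS = 36                  # Phase 3 Step 3 elapsed time
TUPLES_NORMALIZED_INSTANCE_HOURS = 7104    # Phase 3 Step 3 normalized instance hours


def fraction_dropped(output_records: float, input_records: float) -> float:
    """Share of input records that do not appear in the output."""
    return 1.0 - output_records / input_records


# Phase 1: share of crawl URLs missing from the mapper output, per quoted total.
phase1_dropped = {
    label: fraction_dropped(PHASE1_MAPPER_OUTPUT, total)
    for label, total in CRAWL_URL_COUNTS.items()
}

# Phase 2: share of records removed by the exact-duplicates filter.
phase2_filtered = 1.0 - PHASE2_RETAINED

# Phase 3 Step 3: hypothetical average capacity in normalized units,
# assuming the cluster size stayed constant for the whole run.
avg_normalized_capacity = TUPLES_NORMALIZED_INSTANCE_HOURS / TUPLES_ELAPSED_HOURS

for label, dropped in phase1_dropped.items():
    print(f"Phase 1 dropped vs {label} URLs: {dropped:.1%}")
print(f"Phase 2 filtered: {phase2_filtered:.0%} (paper: ~{REPORTED_DUPLICATES:.0%})")
print(f"Phase 3 Step 3 average normalized capacity: {avg_normalized_capacity:.1f}")
```

By this arithmetic, roughly 36% to 40% of the crawl's URLs are missing from the Phase 1 output, Phase 2 removes about 61% of its input versus the ~1% stated in the paper, and Tuples Creation averages about 197 normalized instance hours per elapsed hour.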

Thanks for any answers/insights you can offer!
