Problems
This page lists problems with the dataset. (ETA: this is referring to the DEQ portion of the dataset, the only one we're working with so far.)
-
There are 534 emails (~2%) whose From: field has no sender. For most of these (407), the cause is that the values were displaced below the keys in the headers (ex: deq01_Part316, textfile). (The remaining empties are not a problem: the name is just separated from the key by a newline, which is easily fixable.) As the example shows, the displacement also affects the receiver, timestamp, and subject. I spent some time looking for an easy coding fix and I don't think there is one, because the number of keys varies a lot (e.g. some emails have attachments or Ccs, some don't). Since there aren't a huge number of these, this might be something to have students do manually in the future.
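For the easily fixable case (the name sits on the line right after the key), something like the following might work. This is only a minimal sketch: it assumes each header is written as `Key: value` on its own line, and the key list `HEADER_KEYS` and the function name `join_wrapped_headers` are hypothetical, not taken from our actual scripts. It deliberately does not try to repair the displaced case; it only flags those files for manual review.

```python
import re

# Assumed header keys; the real textfiles may use others (e.g. for attachments).
HEADER_KEYS = ("From", "To", "Cc", "Sent", "Subject")
KEY_RE = re.compile(r"^(%s):\s*(.*)$" % "|".join(HEADER_KEYS))


def bare_key(line):
    """Return the key name if the line is a key with no value, else None."""
    m = KEY_RE.match(line)
    return m.group(1) if m and not m.group(2) else None


def join_wrapped_headers(text):
    """Return (fixed_text, needs_manual_review).

    If any bare key is followed by another key line, a blank line, or nothing,
    the file is treated as a possible displaced-values case: the text is
    returned unchanged and flagged. This is conservative, so a legitimately
    empty field (e.g. an empty Cc) is flagged too.
    """
    lines = text.splitlines()
    stripped = [ln.strip() for ln in lines]

    # Pass 1: look for anything that is not the simple wrapped-value case.
    for i, s in enumerate(stripped):
        if bare_key(s):
            nxt = stripped[i + 1] if i + 1 < len(stripped) else ""
            if not nxt or KEY_RE.match(nxt):
                return text, True

    # Pass 2: every bare key is now known to be followed by a plain value line.
    out = []
    i = 0
    while i < len(lines):
        key = bare_key(stripped[i])
        if key:
            out.append("%s: %s" % (key, stripped[i + 1]))
            i += 2
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out), False
```

Running this over every textfile and collecting the names where the second return value is `True` would give a worklist for the manual pass.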
-
There are several textfiles, primarily in deq01 it appears, where the body text was not OCR'd. For example, if you compare deq01_Part162 in the drive with the corresponding textfile, the body text is missing from the textfile. Our theory is that the body font is too thin for the OCR to pick up. (It is not the color, because on the third page the thicker blue font /is/ OCR'd.) To fix this, we may have to play with the OCR sensitivity.
-
deq09 appears not to have been OCR'd at all. Only one of the deq09 textfiles (this one) contains any text; the others are blank.
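To check which textfiles in a batch came out blank, a quick scan like this could help. It is a rough sketch that assumes the textfiles for a batch live in a single folder, end in `.txt`, and are UTF-8; the function name `find_blank_textfiles` is made up for illustration.

```python
from pathlib import Path


def find_blank_textfiles(folder, pattern="*.txt"):
    """Return the sorted names of files in `folder` that contain only whitespace."""
    blank = []
    for path in sorted(Path(folder).glob(pattern)):
        # errors="replace" keeps the scan going if a file has odd bytes in it.
        text = path.read_text(encoding="utf-8", errors="replace")
        if not text.strip():
            blank.append(path.name)
    return blank
```

Calling it on the deq09 folder (wherever that sits locally) should list every file except the one that has text, and running it on the other batches would show whether the problem is limited to deq09.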
Soon I will organize the pages into a kind of table of contents/outline below.