Currently, scrapers/profiles.go also does the parsing, which does not match our design.
Here is what I am proposing:
- Update scrapers/profiles.go to save .html files, similar to scrapers/coursebook.go
  - Add /professors to outDir
  - Save profiles as {first}-{last}.html
- Create a parser/profiles.go
  - Copy all of the parsing logic into it, modified to use goquery instead of chromedp
- Update flags in main.go
- Bonus
  - Add resume support to the scraper
  - Add a unit test for the parser
- Side effects
  - parser.go uses utils.GetAllFilesWithExtension, which would create an issue if the proposed /professors directory is added, so we might consider scraping coursebook into outDir/coursebook/... instead.
Sample dir structure:
outDir (ie data)
├───coursebook
│   ├───24f
│   │   └───cp_acct
│   │           acct2301.001.24f.html
│   │           acct2301.002.24f.html
│   │           ...
│   └───...
└───professors
        first-last.html
        ...
I haven't worked with the profiles scraper very much, but there does not seem to be any technical reason why this should not be possible.
If this is added as a task I don't mind working on it but if someone is interested feel free.