
thiborose/alector_corpus


alector_corpus

⚠️ I do not own this corpus ⚠️

More information at https://alectorsite.wordpress.com/corpus/.

I made this repository public because the corpus is hard to access online.

Setup

Developed on Ubuntu. You will need the following installed:

  • Firefox browser
  • geckodriver
  • selenium Python package

You will also need a registered account on the Alector website.

Installing geckodriver:

  1. Download and extract the latest release (https://github.com/mozilla/geckodriver/releases). Example:
    • wget https://github.com/mozilla/geckodriver/releases/download/v0.29.1/geckodriver-v0.29.1-linux64.tar.gz
    • tar -xvzf geckodriver-v0.29.1-linux64.tar.gz
  2. Make the file executable: chmod +x geckodriver
  3. Create a folder to hold the geckodriver binary (writing under /lib requires root). Example:
    • sudo mkdir -p /lib/geckodriver/
  4. Move the file into the newly created folder. Example:
    • sudo mv geckodriver /lib/geckodriver/geckodriver
  5. Add the folder to PATH. This applies to the current shell only; add the line to ~/.bashrc to make it permanent. Example:
    • export PATH=$PATH:/lib/geckodriver/
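
Once the steps above are done, a quick way to confirm the prerequisites is a small Python check. This is a minimal sketch and not part of the repository: the function name `check_setup` is hypothetical, and it assumes the Firefox executable is named `firefox` on your system.

```python
import importlib.util
import shutil


def check_setup(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return a dict mapping each prerequisite to True (found) or False.

    `which` and `find_spec` are parameters only so the check can be
    exercised without touching the real system.
    """
    return {
        # Assumption: the Firefox executable is called "firefox".
        "firefox": which("firefox") is not None,
        # geckodriver must be reachable through PATH (step 5 above).
        "geckodriver": which("geckodriver") is not None,
        # True when the selenium package can be imported.
        "selenium": find_spec("selenium") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If `geckodriver` is reported as missing right after step 5, the PATH change was probably made in a different shell session.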

Running the scraping script

Run python scrape_alector.py. Enter your Alector credentials when prompted, and voilà!
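
For readers unfamiliar with this kind of script, the sketch below illustrates the general pattern of prompting for credentials and then opening a page in Firefox through Selenium. It is an illustration under stated assumptions, not the contents of scrape_alector.py: the function names `prompt_credentials`, `open_browser` and `main` are hypothetical, and the site-specific login and download steps are deliberately left out.

```python
import getpass


def prompt_credentials(input_fn=input, getpass_fn=getpass.getpass):
    """Ask for a username and a password and return them as a tuple.

    The password is read with getpass so it is not echoed to the terminal.
    """
    username = input_fn("Alector username: ").strip()
    password = getpass_fn("Alector password: ")
    if not username or not password:
        raise ValueError("Both a username and a password are required.")
    return username, password


def open_browser(url):
    """Start Firefox through Selenium and load `url`."""
    # Imported here so the credential helper works without selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # Selenium drives Firefox via geckodriver
    driver.get(url)
    return driver


def main():
    username, password = prompt_credentials()
    driver = open_browser("https://alectorsite.wordpress.com/corpus/")
    try:
        # Hypothetical placeholder: the login form and the corpus pages are
        # site-specific, so those steps are not reproduced in this sketch.
        print(driver.title)
    finally:
        driver.quit()  # always close the browser, even after an error
```

Calling `main()` runs the sketch end to end; closing the browser in a `finally` block avoids leaving stray Firefox processes behind when a step fails.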

Acknowledgements

Núria Gala, Anaïs Tack, Ludivine Javourey-Drevet, Thomas François, Johannes C. Ziegler, Alector: A Parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers. Proceedings of the 12th Language Resources and Evaluation Conference. [aclweb]

About

Alector text-simplification corpus + web-scraping script.
