This technical track is offered as a part of the compulsory course Computer Tools for Linguistic Research at the National Research University Higher School of Economics.
It is aimed at building basic skills for retrieving data from external WWW resources and processing it for future linguistic research. The idea is to automatically obtain a dataset with a certain structure and appropriate content, and then perform morphological analysis using various natural language processing (NLP) libraries. Dataset requirements are described in :ref:`dataset-label`.
- Klimova Margarita Andreevna - linguistic track lecturer
- Lyashevskaya Olga Nikolaevna - linguistic track lecturer
- Demidovskij Alexander Vladimirovich - technical track lecturer
- Uraev Dmitry Yurievich - technical track practice lecturer
- Zharikov Egor Igorevich - technical track expert
- Nurtdinova Sofia Alekseevna - technical track assistant
- Podpryatova Anna Sergeevna - technical track assistant
- Klimov Andrey Petrovich - technical track assistant
- Evgrafova Anna Sergeevna - technical track assistant
- Scraper:
- Short summary: Your code can automatically parse a media website of your choice and save texts and their metadata in a proper format.
- Deadline: May, 11.
- Format: each student works in their own PR.
- Dataset volume: 100 articles.
- Design document: :ref:`scraper-label`.
- Pipeline:
- Short summary: Your code can automatically process raw texts from the previous step, performing part-of-speech tagging and basic morphological analysis.
- Deadline: TBD.
- Format: each student works in their own PR.
- Dataset volume: 100 articles.
- Design document: :ref:`pipeline-label`.
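The two milestones hand data off through the file system: the scraper dumps each article's raw text and metadata, and the pipeline later reads the texts back for analysis. A minimal sketch of that handoff is below; the directory `tmp/articles` and the file-naming scheme (`N_raw.txt`, `N_meta.json`) are assumptions made for this example, not the course's required layout:

```python
import json
from datetime import datetime
from pathlib import Path

# Hypothetical output directory; the course may mandate a different layout.
ASSETS = Path("tmp") / "articles"


def save_article(article_id: int, text: str, url: str, title: str) -> None:
    """Scraper step: dump raw text and its metadata as two sibling files."""
    ASSETS.mkdir(parents=True, exist_ok=True)
    (ASSETS / f"{article_id}_raw.txt").write_text(text, encoding="utf-8")
    meta = {
        "id": article_id,
        "url": url,
        "title": title,
        "scraped_at": datetime.now().isoformat(timespec="seconds"),
    }
    with (ASSETS / f"{article_id}_meta.json").open("w", encoding="utf-8") as f:
        json.dump(meta, f, ensure_ascii=False, indent=2)


def load_raw_text(article_id: int) -> str:
    """Pipeline step: read a saved text back from the same directory."""
    return (ASSETS / f"{article_id}_raw.txt").read_text(encoding="utf-8")
```

Keeping texts and metadata in separate files lets the pipeline process raw texts without re-parsing JSON for every article.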
| Date | Lecture topic | Important links |
|---|---|---|
| 06.04.2024 | Lecture: Introduction to technical track. 3rd party libraries. | N/A |
| 13.04.2024 | Lecture: Headers. HTML structure. | Listing. |
| 13.04.2024 | Seminar: Local setup. Choose website. | N/A |
| 20.04.2024 | Lecture: Search in HTML page. | Listing. |
| 20.04.2024 | Seminar: requests: install, API. | N/A |
A more complete summary of the lectures can be found in :ref:`ctlr-lectures-label`.
| Module | Description | Component | Needed for mark |
|---|---|---|---|
| pathlib | working with file paths | scraper | 4 |
| requests | downloading web pages | scraper | 4 |
| BeautifulSoup4 | finding information on web pages | scraper | 4 |
| lxml | optional parsing HTML | scraper | 6 |
| datetime | working with dates | scraper | 6 |
| json | working with the JSON text format | scraper, pipeline | 4 |
| spacy_udpipe | module for morphological analysis | pipeline | 6 |
| networkx | working with graphs | pipeline | 10 |
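In the scraper, a page downloaded with requests is then searched with BeautifulSoup4. To illustrate the same idea without third-party dependencies, here is a sketch that extracts paragraph text from an HTML string using only the standard library's `html.parser`; the class and function names are invented for this example:

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element on a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_p = False
        self.paragraphs: list[str] = []

    def handle_starttag(self, tag: str, attrs) -> None:
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag: str) -> None:
        if tag == "p":
            self._in_p = False

    def handle_data(self, data: str) -> None:
        if self._in_p:
            self.paragraphs[-1] += data


def extract_paragraphs(html: str) -> list[str]:
    """Return the non-empty, stripped text of each paragraph."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return [p.strip() for p in parser.paragraphs if p.strip()]
```

With BeautifulSoup4 the same extraction is shorter (`soup.find_all("p")`), which is why the course recommends it over hand-written parsers.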
The software solution is built on top of three components:
- scraper.py - a module for finding articles on the chosen media website, extracting their texts and dumping them to the file system. Students need to implement it.
- pipeline.py - a module for processing texts: part-of-speech tagging and basic morphological analysis. Students need to implement it.
- article.py - a module providing an article abstraction that encapsulates low-level manipulations with the article.
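To show what "encapsulating low-level manipulations" can look like, here is a hedged sketch of such an abstraction: a class that owns the article's fields and hides file-system details behind methods. The interface (attribute names, `get_raw_text_path`, `to_dict`) and the `tmp/` path are assumptions for illustration, not the actual article.py API:

```python
import json
from pathlib import Path


class Article:
    """A hypothetical article abstraction; the real article.py interface may differ."""

    def __init__(self, url: str, article_id: int) -> None:
        self.url = url
        self.article_id = article_id
        self.title = ""
        self.text = ""

    def get_raw_text_path(self) -> Path:
        # Hypothetical naming scheme for dumped texts.
        return Path("tmp") / f"{self.article_id}_raw.txt"

    def save_raw(self) -> None:
        """Dump the article text without the caller touching paths directly."""
        path = self.get_raw_text_path()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(self.text, encoding="utf-8")

    def to_dict(self) -> dict:
        """Metadata view of the article, ready for json.dump."""
        return {"id": self.article_id, "url": self.url, "title": self.title}
```

Because scraper.py and pipeline.py both talk to `Article` rather than to raw files, the storage layout can change in one place without breaking either module.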