This technical track is offered as a part of the compulsory course Computer Tools for Linguistic Research at the National Research University Higher School of Economics.
It is aimed at building basic skills for retrieving data from external WWW resources and processing it for future linguistic research. The idea is to automatically obtain a dataset with a certain structure and appropriate content, and then perform morphological analysis using various natural language processing (NLP) libraries. Dataset requirements are described in :ref:`dataset-label`.
- Klimova Margarita Andreevna - linguistic track lecturer
- Lyashevskaya Olga Nikolaevna - linguistic track lecturer
- Demidovskij Alexander Vladimirovich - technical track lecturer
- Uraev Dmitry Yurievich - technical track practice lecturer
- Zharikov Egor Igorevich - technical track expert
- Nurtdinova Sofia Alekseevna - technical track assistant
- Podpryatova Anna Sergeevna - technical track assistant
- Klimov Andrey Petrovich - technical track assistant
- Evgrafova Anna Sergeevna - technical track assistant
- Scraper:
- Short summary: Your code can automatically parse a media website of your choice and save texts and their metadata in a proper format.
- Deadline: May, 11.
- Format: each student works in their own PR.
- Dataset volume: 100 articles.
- Design document: :ref:`scraper-label`.
- Pipeline:
- Short summary: Your code can automatically process raw texts from the previous step, performing part-of-speech tagging and basic morphological analysis.
- Deadline: TBD.
- Format: each student works in their own PR.
- Dataset volume: 100 articles.
- Design document: :ref:`pipeline-label`.
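The two milestones hand data off through the file system: the scraper dumps each article's raw text and metadata, and the pipeline later reads the texts back for analysis. A minimal sketch of that handoff is below; the directory `tmp/articles` and the file-naming scheme (`N_raw.txt`, `N_meta.json`) are assumptions made for this example, not the course's required layout:

```python
import json
from datetime import datetime
from pathlib import Path

# Hypothetical output directory; the course may mandate a different layout.
ASSETS = Path("tmp") / "articles"


def save_article(article_id: int, text: str, url: str, title: str) -> None:
    """Scraper step: dump raw text and its metadata as two sibling files."""
    ASSETS.mkdir(parents=True, exist_ok=True)
    (ASSETS / f"{article_id}_raw.txt").write_text(text, encoding="utf-8")
    meta = {
        "id": article_id,
        "url": url,
        "title": title,
        "scraped_at": datetime.now().isoformat(timespec="seconds"),
    }
    with (ASSETS / f"{article_id}_meta.json").open("w", encoding="utf-8") as f:
        json.dump(meta, f, ensure_ascii=False, indent=2)


def load_raw_text(article_id: int) -> str:
    """Pipeline step: read a saved text back from the same directory."""
    return (ASSETS / f"{article_id}_raw.txt").read_text(encoding="utf-8")
```

Keeping texts and metadata in separate files lets the pipeline process raw texts without re-parsing JSON for every article.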
| Date | Lecture topic | Important links |
|---|---|---|
| 06.04.2024 | Lecture: Introduction to technical track. 3rd party libraries. | N/A |
| 13.04.2024 | Lecture: Headers. HTML structure. | Listing. |
| 13.04.2024 | Seminar: Local setup. Choose website. | N/A |
| 20.04.2024 | Lecture: Search in HTML page. | Listing. |
| 20.04.2024 | Seminar: requests: install, API. | N/A |
A more complete summary of the lectures can be found in :ref:`ctlr-lectures-label`.
| Module | Description | Component | Needed for mark |
|---|---|---|---|
| pathlib | working with file paths | scraper | 4 |
| requests | downloading web pages | scraper | 4 |
| BeautifulSoup4 | finding information on web pages | scraper | 4 |
| lxml | optional parsing HTML | scraper | 6 |
| datetime | working with dates | scraper | 6 |
| json | working with the JSON text format | scraper, pipeline | 4 |
| spacy_udpipe | module for morphological analysis | pipeline | 6 |
| networkx | working with graphs | pipeline | 10 |
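In the scraper, a page downloaded with requests is then searched with BeautifulSoup4. To illustrate the same idea without third-party dependencies, here is a sketch that extracts paragraph text from an HTML string using only the standard library's `html.parser`; the class and function names are invented for this example:

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element on a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_p = False
        self.paragraphs: list[str] = []

    def handle_starttag(self, tag: str, attrs) -> None:
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag: str) -> None:
        if tag == "p":
            self._in_p = False

    def handle_data(self, data: str) -> None:
        if self._in_p:
            self.paragraphs[-1] += data


def extract_paragraphs(html: str) -> list[str]:
    """Return the non-empty, stripped text of each paragraph."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return [p.strip() for p in parser.paragraphs if p.strip()]
```

With BeautifulSoup4 the same extraction is shorter (`soup.find_all("p")`), which is why the course recommends it over hand-written parsers.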
The software solution is built on top of three components:
- scraper.py - a module for finding articles on the chosen media website, extracting their texts and dumping them to the file system. Students need to implement it.
- pipeline.py - a module for processing texts: part-of-speech tagging and basic morphological analysis. Students need to implement it.
- article.py - a module providing an article abstraction that encapsulates low-level manipulations with the article.
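To show what "encapsulating low-level manipulations" can look like, here is a hedged sketch of such an abstraction: a class that owns the article's fields and hides file-system details behind methods. The interface (attribute names, `get_raw_text_path`, `to_dict`) and the `tmp/` path are assumptions for illustration, not the actual article.py API:

```python
import json
from pathlib import Path


class Article:
    """A hypothetical article abstraction; the real article.py interface may differ."""

    def __init__(self, url: str, article_id: int) -> None:
        self.url = url
        self.article_id = article_id
        self.title = ""
        self.text = ""

    def get_raw_text_path(self) -> Path:
        # Hypothetical naming scheme for dumped texts.
        return Path("tmp") / f"{self.article_id}_raw.txt"

    def save_raw(self) -> None:
        """Dump the article text without the caller touching paths directly."""
        path = self.get_raw_text_path()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(self.text, encoding="utf-8")

    def to_dict(self) -> dict:
        """Metadata view of the article, ready for json.dump."""
        return {"id": self.article_id, "url": self.url, "title": self.title}
```

Because scraper.py and pipeline.py both talk to `Article` rather than to raw files, the storage layout can change in one place without breaking either module.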