
fipl-hse/2025-2-level-ctlr


Technical Track of Computer Tools for Linguistic Research (2025/2026)

This technical track is part of the compulsory course Computer Tools for Linguistic Research at the National Research University Higher School of Economics.

This technical track is aimed at building basic skills for retrieving data from external WWW resources and processing it for future linguistic research. The idea is to automatically obtain a dataset with a defined structure and appropriate content, and to perform morphological analysis on it using various natural language processing (NLP) libraries. Dataset requirements are described in :ref:`dataset-label`.

Instructors:

Project Timeline

  1. Scraper:
    1. Short summary: Your code automatically parses a media website of your choice and saves texts and their metadata in the proper format.
    2. Deadline: May, 11.
    3. Format: each student works in their own PR.
    4. Dataset volume: 100 articles.
    5. Design document: :ref:`scraper-label`.
  2. Pipeline:
    1. Short summary: Your code automatically processes raw texts from the previous step, performing part-of-speech tagging and basic morphological analysis.
    2. Deadline: TBD.
    3. Format: each student works in their own PR.
    4. Dataset volume: 100 articles.
    5. Design document: :ref:`pipeline-label`.

Lectures history

Date Lecture topic Important links
06.04.2024 Lecture: Introduction to technical track. 3rd party libraries. N/A
13.04.2024 Lecture: Headers. HTML structure. Listing.
13.04.2024 Seminar: Local setup. Choose website. N/A
20.04.2024 Lecture: Search in HTML page. Listing.
20.04.2024 Seminar: requests: install, API. N/A

You can find a more complete summary from lectures in :ref:`ctlr-lectures-label`.

Technical solution

Module           Description                        Component          Needed for mark
pathlib          working with file paths            scraper            4
requests         downloading web pages              scraper            4
BeautifulSoup4   finding information on web pages   scraper            4
lxml (optional)  parsing HTML                       scraper            6
datetime         working with dates                 scraper            6
json             working with the JSON text format  scraper, pipeline  4
spacy_udpipe     morphological analysis             pipeline           6
networkx         working with graphs                pipeline           10
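To illustrate how the scraping modules fit together, here is a minimal sketch. The HTML snippet, tag names, and CSS classes below are invented for the example; a real scraper would first fetch the page with `requests.get(url).text` and target the markup of the chosen website.

```python
from bs4 import BeautifulSoup

# Hypothetical article page markup; in a real scraper this string
# would come from requests.get(url).text instead.
html = """
<html><body>
  <h1 class="article-title">Sample headline</h1>
  <div class="article-body"><p>First paragraph.</p><p>Second.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract the title and all body paragraphs by their (invented) classes.
title = soup.find("h1", class_="article-title").get_text(strip=True)
body = soup.find("div", class_="article-body")
paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
text = "\n".join(paragraphs)
```

The same `find`/`find_all` pattern is what the scraper component uses to locate article links and text on real pages.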

The software solution is built on top of three components:

  1. scraper.py - a module for finding articles on the chosen media website, extracting their text, and dumping it to the file system. Students need to implement it.
  2. pipeline.py - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.
  3. article.py - a module with an article abstraction that encapsulates low-level manipulations with the article.
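As a rough idea of what such an article abstraction might look like, here is a minimal sketch using only pathlib and json from the module table above. The field names and file-naming scheme are illustrative and are not the actual article.py interface.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Article:
    """Illustrative article abstraction (not the real article.py API)."""
    article_id: int
    url: str
    title: str
    text: str
    date: str  # publication date as an ISO-formatted string

    def save_raw(self, base_dir: Path) -> Path:
        """Dump the raw text and a JSON metadata file, return the meta path."""
        base_dir.mkdir(parents=True, exist_ok=True)
        (base_dir / f"{self.article_id}_raw.txt").write_text(
            self.text, encoding="utf-8"
        )
        meta = {k: v for k, v in asdict(self).items() if k != "text"}
        meta_path = base_dir / f"{self.article_id}_meta.json"
        meta_path.write_text(
            json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        return meta_path


# Usage: dump one article into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    article = Article(1, "https://example.com/news/1", "Sample",
                      "Raw article text.", "2025-05-11")
    meta_path = article.save_raw(Path(tmp))
    saved_meta = json.loads(meta_path.read_text(encoding="utf-8"))
```

Keeping the raw text and the metadata in separate files lets the pipeline re-read texts without parsing JSON on every pass.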

Resources

  1. Academic performance
  2. Media websites list
  3. Python programming course from previous semester
  4. Scraping tutorials (Russian)
  5. Scraping tutorials (English)
  6. Useful documentation
