The goal is to implement a parallel Java program that simulates the operation of a news aggregator. The program will process a large volume of articles, organize them by category and language, and generate statistics and aggregated reports. The final result will be a set of text files that present the information extracted from the input articles in a structured and deterministic manner.
The volume of information available online is very large and is continuously generated by social networks, press websites, and various digital platforms. While access to this data is easier than ever, its abundance also brings difficulties: the content is often repetitive, hard to organize, and difficult to follow in a coherent form. News aggregators offer a solution to this challenge. They collect articles from various sources and present them in a structured format, adapted to users' need to quickly browse relevant topics. At the same time, they can highlight topics of major interest, such as public health or politics.
The project aims to reproduce, on a smaller scale, the functionality of such an aggregator. The goal is to illustrate how a large volume of articles can be processed in parallel, how they can be grouped according to categories, and how clear and consistent reports can be generated, in a manner similar to real applications that support organized access to information.
My personal goals for this project were:
- To practice my programming skills using Java threads;
- To practice decomposing a problem described in natural language into subproblems that can be executed in parallel;
- To practice the decision-making process for identifying a scalable parallel solution to the proposed problem.
The project is implemented in Java. Its main goal is the parallel processing of a set of news articles stored in .json files.
Each file contains several articles, represented in JSON format, with multiple fields. For this project, I will use the values of the keys uuid, title, author, url, text, published, language, and categories.
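One way to hold these fields in memory is a small immutable type. The sketch below is only an illustration: the field names mirror the JSON keys listed above, but the Java types (for example, treating categories as a list of strings and published as a plain string), the class name ArticleExample, and the sample values are assumptions, not part of the project specification. It requires Java 16 or newer for records.

```java
import java.util.List;

public class ArticleExample {
    // Field names mirror the JSON keys; the Java types are assumptions.
    record Article(String uuid, String title, String author, String url,
                   String text, String published, String language,
                   List<String> categories) {
        Article {
            // Immutable copy, so an Article can be shared between threads safely.
            categories = List.copyOf(categories);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample values, for illustration only.
        Article a = new Article("hypothetical-uuid-1", "Sample title", "Jane Doe",
                "https://example.com/a", "Body text", "2024-01-01T00:00:00Z",
                "english", List.of("health", "politics"));
        System.out.println(a.title() + " [" + a.language() + "] " + a.categories());
    }
}
```

Because every field is final and the category list is copied on construction, worker threads can read the same Article instances without extra synchronization.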
A central objective of the assignment is the use of parallel programming to work with a large volume of articles.
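As a rough illustration of what this can look like with plain Java threads, the sketch below splits a list of articles into contiguous slices, lets each thread count categories in its own local map, and merges the partial results after all threads have been joined. The class name ParallelCountSketch, the method countByCategory, and the sample data are hypothetical, and the static partitioning shown here is just one possible strategy, not necessarily the one used in the project. Each article is reduced to its list of categories to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ParallelCountSketch {
    // Counts category occurrences, splitting the articles across numThreads threads.
    static Map<String, Integer> countByCategory(List<List<String>> categoriesPerArticle,
                                                int numThreads) throws InterruptedException {
        int n = categoriesPerArticle.size();
        List<Map<String, Integer>> partials = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int id = 0; id < numThreads; id++) {
            // Contiguous slice [start, end) assigned to this worker.
            int start = (int) ((long) id * n / numThreads);
            int end = (int) ((long) (id + 1) * n / numThreads);
            // Each worker writes only to its own map, so no locking is needed.
            Map<String, Integer> local = new TreeMap<>();
            partials.add(local);
            Thread t = new Thread(() -> {
                for (int i = start; i < end; i++) {
                    for (String category : categoriesPerArticle.get(i)) {
                        local.merge(category, 1, Integer::sum);
                    }
                }
            });
            workers.add(t);
            t.start();
        }
        // join() guarantees the workers' writes are visible to this thread.
        for (Thread t : workers) {
            t.join();
        }
        // TreeMap keeps keys sorted, so the merged result is deterministic.
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((k, v) -> total.merge(k, v, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        List<List<String>> sample = List.of(
                List.of("health", "politics"),
                List.of("health"),
                List.of("sports"),
                List.of("politics", "health"));
        System.out.println(countByCategory(sample, 2));
        // prints {health=3, politics=2, sports=1}
    }
}
```

Keeping a separate map per thread avoids contention on a shared structure, and merging into a sorted map means the output does not depend on how the threads were scheduled.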