MOTS Architecture
To implement state-of-the-art methods and to facilitate the implementation of new ones, we use the pipeline decomposed as shown in the figure.
At the core of our system, a summarization method is decomposed into four steps. These four steps follow the classic way of computing extractive or semi-extractive summaries. First, a method must define what a token is and how it is identified in the system: this is the index building step. Then, the method must express how a sentence is represented: this is the sentence characteristics building step. Next, it can score the sentences (not every extractive summarization method requires sentences to be scored): this is the sentence scoring step. Finally, sentences are selected to appear in the summary: this is the selection method step.
A method can combine multiple IndexBuilder, CharacteristicBuilder and ScoringMethod components to easily mix different approaches. Each method builds a summary that can be scored using the ROUGE package, which is encapsulated in our system.
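The four-step decomposition above can be sketched as plain Java methods. The step names mirror MOTS, but the signatures and the frequency-based implementation below are assumptions for illustration, not the actual MOTS API:

```java
// Illustrative sketch of the four-step decomposition (IndexBuilder,
// CharacteristicBuilder, ScoringMethod, SelectionMethod) using a simple
// word-frequency method. Names and signatures are hypothetical.
import java.util.*;

public class FourStepSketch {

    // Step 1: IndexBuilder - define tokens and count them over the corpus.
    static Map<String, Integer> buildIndex(List<String> sentences) {
        Map<String, Integer> index = new HashMap<>();
        for (String s : sentences)
            for (String tok : s.toLowerCase().split("\\W+"))
                if (!tok.isEmpty()) index.merge(tok, 1, Integer::sum);
        return index;
    }

    // Step 2: CharacteristicBuilder - represent each sentence by its tokens.
    static List<List<String>> buildCharacteristics(List<String> sentences) {
        List<List<String>> chars = new ArrayList<>();
        for (String s : sentences) {
            List<String> toks = new ArrayList<>();
            for (String tok : s.toLowerCase().split("\\W+"))
                if (!tok.isEmpty()) toks.add(tok);
            chars.add(toks);
        }
        return chars;
    }

    // Step 3: ScoringMethod - score a sentence by average token frequency.
    static double[] score(List<List<String>> chars, Map<String, Integer> index) {
        double[] scores = new double[chars.size()];
        for (int i = 0; i < chars.size(); i++) {
            double sum = 0;
            for (String tok : chars.get(i)) sum += index.getOrDefault(tok, 0);
            scores[i] = chars.get(i).isEmpty() ? 0 : sum / chars.get(i).size();
        }
        return scores;
    }

    // Step 4: SelectionMethod - keep the k best-scored sentences,
    // returned in their original document order.
    static List<String> select(List<String> sentences, double[] scores, int k) {
        Integer[] order = new Integer[sentences.size()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        List<Integer> kept =
            new ArrayList<>(Arrays.asList(order).subList(0, Math.min(k, order.length)));
        Collections.sort(kept);
        List<String> summary = new ArrayList<>();
        for (int i : kept) summary.add(sentences.get(i));
        return summary;
    }

    // Chain the four steps into one summarization method.
    static List<String> summarize(List<String> sentences, int k) {
        Map<String, Integer> index = buildIndex(sentences);
        List<List<String>> chars = buildCharacteristics(sentences);
        double[] scores = score(chars, index);
        return select(sentences, scores, k);
    }
}
```

Each step only consumes the output of the previous ones, which is what makes the steps swappable: a different IndexBuilder (e.g. over n-grams) or SelectionMethod can replace the corresponding function without touching the rest.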
The MOTS system thus consists in reading a "multicorpus" and applying a summarization model to compute summaries. A summarization model is composed of three stages, each of them decomposable and modular:
- Preprocess: apply preprocessing to each text of the multicorpus: cutting raw text into sentences/words, lemmatization, stemming, stop-word filtering, POS tagging;
- Process: at this stage, one or more methods are applied to summarize each corpus. A summarization method is decomposed as:
- IndexBuilder (5 implemented): index generation for words/n-grams;
- CharacteristicBuilder (8 implemented): generation of sentence characteristics based on the index;
- ScoringMethod (4 implemented): sentence scoring based on the characteristics;
- SelectionMethod (5 implemented): sentence selection according to sentence scores and/or characteristics;
- Post-processing (currently in development): methods applied after sentence selection, for example sentence re-ordering, pronominal reference resolution, sentence compression...
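A minimal version of the first stage can be sketched as follows. This covers only sentence splitting, word splitting and stop-word filtering; MOTS also performs lemmatization, stemming and POS tagging (typically delegated to an NLP toolkit), which are omitted here, and the method names are illustrative:

```java
// Hypothetical preprocessing sketch: cut raw text into sentences, then
// into lowercased word tokens with stop words removed. The stop-word
// list is a toy placeholder.
import java.util.*;

public class PreprocessSketch {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "of", "to", "is", "and"));

    // Cut raw text into sentences on terminal punctuation.
    static List<String> splitSentences(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+"))
            if (!s.trim().isEmpty()) out.add(s.trim());
        return out;
    }

    // Cut a sentence into lowercased word tokens, dropping stop words.
    static List<String> tokenize(String sentence) {
        List<String> toks = new ArrayList<>();
        for (String t : sentence.toLowerCase().split("\\W+"))
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) toks.add(t);
        return toks;
    }
}
```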
The architecture we propose results from a reflection on summarization and on how to handle modularity in our tool. The examples given above are not exhaustive, and summarization can be managed in our tool in many other ways. For example, one could argue that sentence compression is not a post-processing but a pre-processing. This is true in many ways, as sentence compression can be applied prior to sentence extraction to create several versions of each sentence and come closer to an abstractive system, and our tool allows it. Moreover, one could say that re-ordering is not a post-processing but should be integrated at the very core of a selection method (for example, by including a coherence score). This is also possible in our tool, since SelectionMethod returns an ordered list of sentences.
We decompose a summarization method into four steps (or atomic processings): IndexBuilder, CharacteristicBuilder, ScoringMethod and SelectionMethod. All steps are independent of one another, and communication between them is handled by the Process class, which controls their execution and compatibility. The atomic processings have their input and output specified via inheritance from ParameterizedMethod (see Figure) and the implementation of some of the interfaces listed below:
- IndexBasedIn/Out
- SentenceCharacteristicBasedIn/Out
- QueryBasedIn/Out
- ListClusterbasedIn/Out
- ScoreBasedIn/Out
These interfaces allow the Process class to use Java methods to adapt input and output between consecutive atomic processings.
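The pattern can be illustrated with one of the pairs above, IndexBasedIn/Out. The interface pair names follow the list above, but the method signatures and the controller class below are assumptions sketching the idea, not the MOTS source:

```java
// Sketch of I/O compatibility between atomic processings via interface
// pairs. PipelineProcess stands in for the Process class; all signatures
// here are hypothetical.
import java.util.*;

interface IndexBasedOut { Map<String, Integer> getIndex(); }
interface IndexBasedIn  { void setIndex(Map<String, Integer> index); }

// An IndexBuilder declares that it produces an index.
class NGramIndexBuilder implements IndexBasedOut {
    private final Map<String, Integer> index = new HashMap<>();
    public Map<String, Integer> getIndex() { return index; }
}

// A CharacteristicBuilder declares that it consumes an index.
class TfCharacteristicBuilder implements IndexBasedIn {
    Map<String, Integer> index;
    public void setIndex(Map<String, Integer> index) { this.index = index; }
}

class PipelineProcess {
    // Two atomic processings can be chained only when the producer's
    // output interface matches the consumer's input interface.
    static boolean compatible(Object producer, Object consumer) {
        return producer instanceof IndexBasedOut && consumer instanceof IndexBasedIn;
    }

    // Adapt the producer's output into the consumer's input.
    static void connect(Object producer, Object consumer) {
        if (!compatible(producer, consumer))
            throw new IllegalArgumentException("incompatible atomic processings");
        ((IndexBasedIn) consumer).setIndex(((IndexBasedOut) producer).getIndex());
    }
}
```

Because compatibility is expressed through the type system, an incompatible chain is rejected before any processing runs, which is what lets the steps be combined freely.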
Because all atomic processings are independent and follow these compatibility rules, our system architecture is completely modular.