Corpus-Viewer-Addons provides a set of distributed services to analyze texts from large document corpora in a fast and easy way.
- NLP Toolkit
- A set of Topic Models trained from internal corpora and packaged as Restful APIs
- A repository focused on search and explore documents from their topic distributions
-
Install Docker-Engine and Docker-Compose
-
Clone this repo and move into
src/test/docker/directory.git clone https://github.com/cbadenes/corpus-viewer-addons.git
- Move into
src/test/docker/nlp/endirectory. - Run the service by:
docker-compose up -d - You should be able to monitor the progress by:
docker-compose logs -f
- The above command runs two services: DBpedia Spotlight and librAIry NLP-EN, and uses the settings specified within
docker-compose.yml. - The HTTP Restful-API should be available at:
http://localhost:7777/en - More info here
- Move into
src/test/docker/modelsdirectory. - Run the models by:
docker-compose up -d - You should be able to monitor the progress by:
docker-compose logs -f
- The above command runs a web service for each model and uses the settings specified within
docker-compose.yml. - The HTTP Restful-APIs should be available at:
http://localhost:800[x]/model
A Topic Model is described by the following services:
/settings: read the metadata of the model./topics: lists the topics discovered by the model./topics/{id}: get topic details./topics/{id}/words: explore a topic from its word distribution./topics/{id}/neighbours: explore a topic from its concurrence topics./inferences: build a vector with the topic distributions from a given text.
- Move into
src/test/docker/solrdirectory. - Create collections by:
create.sh - Manage service by:
start.sh,stop.shandclean.shscripts" - You should be able to monitor the progress by:
docker solr logs -f
- The above command runs a Solr service and uses the settings specified within
docker-compose.yml. - The HTTP Admin console should be available at:
http://localhost:8983
In order to be able to index documents it is necessary to have the topic distributions (doctopics) along with the texts.
You can download the following corpus: CORDIS, Wikipedia or Patstat, by executing the following script:
```
./download-corpora.sh
```
Once executed, the /corpora folder is created with the texts and vectors associated for each corpus.
Then you can run the unit tests to load data (src/test/java/load) into the repository:
LoadDocuments: Saves texts and meta-information of each document in thedocumentscollectionLoadDocTopics: Saves the topic distribution of each document in thedoctopicscollectionLoadTopicDocs: Saves the topic info of each model in thetopicdocscollection
There are multiple ways to explore the content of the corpus. You can take a look at the queries we've defined in src/test/java/query.
You can use the following to cite this work:
@inproceedings{Badenes-Olmedo:2017:DTM:3103010.3121040,
author = {Badenes-Olmedo, Carlos and Redondo-Garcia, Jos{\'e} Luis and Corcho, Oscar},
title = {Distributing Text Mining Tasks with librAIry},
booktitle = {Proceedings of the 2017 ACM Symposium on Document Engineering},
series = {DocEng '17},
year = {2017},
isbn = {978-1-4503-4689-4},
pages = {63--66},
numpages = {4},
url = {http://doi.acm.org/10.1145/3103010.3121040},
doi = {10.1145/3103010.3121040},
acmid = {3121040},
publisher = {ACM},
keywords = {data integration, large-scale text analysis, nlp, scholarly data, text mining},
}