Stanislaw Jastrzebski edited this page May 19, 2014 · 3 revisions

Basic design

Mantis Shrimp is kept as separate from Ocean as possible. It is a generic worker framework; currently a tagging system is implemented on top of this architecture.

It is modelled as a set of Akka actors (Akka is a very popular Scala library for distributed programming) that pull work from RabbitMQ. For the current iteration I will use an NER tagger. The modules present in the current architecture are:

  • Mantis Node - node in the system

  • Mantis Tagger - an Akka actor that can receive news to tag, or be configured to pull news from Kafka itself (for speed). It pushes tags to Kafka or sends them back to the requester.

  • Mantis News Fetcher - a Python script that pulls things from Kafka and pushes them into Neo4j. Note: separating Kafka from Neo4j is important because we want to use Kafka for fast queries/inserts, such as user statistics.

  • Mantis Master - an Akka actor with registered Mantis Nodes (does not do much for now)

  • Mantis News Dumper - not ready yet (lionfish issues)
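The two modes of the Mantis Tagger described above (tag on request, or pull from a queue and publish the result) can be sketched roughly as follows. This is only an illustration, not the project's code: the real tagger is an Akka actor, while here `MantisTagger`, `toy_tagger`, `news_in` and `tags_out` are hypothetical names, in-memory `queue.Queue` objects stand in for the message broker, and a capitalized-word matcher stands in for the NER tagger.

```python
import queue
import re
from dataclasses import dataclass, field


def toy_tagger(text):
    """Stand-in for a real NER tagger: returns capitalized words as tags."""
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", text)))


@dataclass
class MantisTagger:
    # Both queues are in-memory stand-ins for broker topics (assumption).
    news_in: queue.Queue = field(default_factory=queue.Queue)
    tags_out: queue.Queue = field(default_factory=queue.Queue)

    def tag(self, news):
        """Request mode: a requester sends one item and gets the tags back."""
        return {"id": news["id"], "tags": toy_tagger(news["text"])}

    def pull_once(self):
        """Pull mode: take one item from the input queue and publish its tags.

        Returns False when there was nothing to pull.
        """
        try:
            news = self.news_in.get_nowait()
        except queue.Empty:
            return False
        self.tags_out.put(self.tag(news))
        return True


tagger = MantisTagger()
tagger.news_in.put({"id": 1, "text": "Akka talks to Kafka in Warsaw"})

# Drain the input queue, then read the published result.
while tagger.pull_once():
    pass
result = tagger.tags_out.get_nowait()
print(result)
```

Keeping `tag` separate from `pull_once` means the same tagging logic serves both modes; only the transport differs.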

Architecture

All nodes are connected in a tree; see the example conf files in the mantis_shrimp directory to get a feel for what is going on.
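As a rough picture of such a tree, the sketch below builds one from a nested dictionary and walks it depth-first. The layout is hypothetical and is not the format of the conf files in mantis_shrimp: the keys `name`, `kind` and `children`, the node names, and the choice of the master as the root are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str  # e.g. "master", "node", "tagger" (assumed labels)
    children: list = field(default_factory=list)


def build(conf):
    """Recursively turn a nested dict (hypothetical layout) into a Node tree."""
    return Node(conf["name"], conf["kind"],
                [build(c) for c in conf.get("children", [])])


def walk(node, depth=0):
    """Yield (depth, node) pairs depth-first, parent before children."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


conf = {
    "name": "master", "kind": "master",
    "children": [
        {"name": "node-1", "kind": "node", "children": [
            {"name": "tagger-1", "kind": "tagger"},
            {"name": "tagger-2", "kind": "tagger"},
        ]},
        {"name": "node-2", "kind": "node", "children": [
            {"name": "tagger-3", "kind": "tagger"},
        ]},
    ],
}

tree = build(conf)
layout = [(depth, n.name) for depth, n in walk(tree)]

# Print the tree with two spaces of indentation per level.
for depth, name in layout:
    print("  " * depth + name)
```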

Future ideas for tagging

The world of NLP is currently thriving with new ideas. I would like to use knowledge graphs (Freebase, DBpedia) in our application. Many deep learning solutions are also widely available (see word2vec, just one example of a great neural language model).

The first thing to do would be document classification using:

  • Knowledge Graph

  • A simple algorithm already available in some library, such as LDA (maybe Mahout, or Vowpal Wabbit from Microsoft)
