107 changes: 53 additions & 54 deletions README.md
@@ -16,20 +16,20 @@ Taxonomies and document classification
--------------------------------------

Clade taxonomies have a tree structure, with a single top-level category (e.g.
in the example data, "Social Psychology"). There is no distinction between
parent and child nodes (except that the former has children) and the hierarchical
structure of the taxonomy is completely orthogonal to the node data. The
structure may be freely edited.

Each node represents a category, which is represented by a set of "keywords"
(words or phrases) which should be present in a document belonging to that
category. Not all the keywords have to be present - they are joined with
Boolean OR rather than AND. A document may belong to multiple categories,
which are ranked according to standard Solr (TF-IDF) scoring. It is
also possible to exclude certain keywords from categories.
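
As an illustration of the OR-joined keywords and exclusions described above, here is a minimal sketch that builds a Lucene-style query string for one category. The function name `build_category_query` and the field name `text` are assumptions made for this example; they are not taken from the Clade source.

```python
def build_category_query(keywords, excluded=(), field="text"):
    """Join keywords with OR and prohibit excluded ones (Lucene query syntax).

    Hypothetical helper: the name and the default field are illustrative only.
    """
    def quote(term):
        # Escape backslashes and double quotes, then wrap the term in quotes
        # so that multi-word phrases are matched as phrases.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        return '{}:"{}"'.format(field, escaped)

    if not keywords:
        raise ValueError("a category needs at least one keyword")
    # Any single keyword is enough for a match, hence OR rather than AND.
    positive = " OR ".join(quote(k) for k in keywords)
    # A leading "-" marks a clause that must not match.
    negative = "".join(" -{}".format(quote(k)) for k in excluded)
    return "({}){}".format(positive, negative)


query = build_category_query(["conformity", "peer pressure"], excluded=["fashion"])
print(query)
```

Solr scores each matching document against a query of this shape, which is how a document matching several categories could be ranked within each of them.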

Clade will also suggest keywords to add to a category, based on the content of
the documents already in the category. This feature is currently slow as it
uses the standard Solr MoreLikeThis component to analyse a large number of
documents. We plan to improve this for a future release by writing a custom
Solr plugin.
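
For readers unfamiliar with MoreLikeThis, the following sketch shows what a request for "interesting terms" might look like. It only builds a URL and does not contact Solr. The base URL, the `/mlt` handler path, the `text` field and the threshold values are all assumptions for this example; a MoreLikeThis handler has to be configured in solrconfig.xml before such a request would work, and this is not necessarily how Clade issues its requests.

```python
from urllib.parse import urlencode


def more_like_this_url(doc_id, base="http://localhost:8983/solr/collection1",
                       field="text"):
    """Build a MoreLikeThis request URL (hypothetical helper, assumed paths)."""
    params = [
        ("q", "id:{}".format(doc_id)),     # the seed document
        ("mlt.fl", field),                 # field(s) to mine for terms
        ("mlt.interestingTerms", "list"),  # ask for the terms themselves
        ("mlt.mintf", 2),                  # minimum term frequency in the seed
        ("mlt.mindf", 2),                  # minimum document frequency
        ("rows", 0),                       # similar documents are not needed
        ("wt", "json"),
    ]
    return "{}/mlt?{}".format(base, urlencode(params))


print(more_like_this_url(42))
```

Issuing one such request per document in a category is one plausible reason the feature is slow for large categories.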
@@ -43,20 +43,20 @@ export the document categorisation to an XML or CSV file.
Installation prerequisites
--------------------------

- Java 6 or 7
- Java 6 or later

- Python 2.6 or 2.7
- Python 3

- Solr 4.6 (also tested with 4.8.1)
http://www.apache.org/dyn/closer.cgi/lucene/solr/4.6.2
Other versions may also work, but we have not tested them with Clade.

- Python modules
sunburnt: http://opensource.timetric.com/sunburnt/installation.html
scorched: https://github.com/lugensa/scorched
lxml: http://pypi.python.org/pypi/lxml/
httplib2: http://code.google.com/p/httplib2/downloads/list
On Windows you can download binaries of httplib2 from http://www.lfd.uci.edu/~gohlke/pythonlibs/
pyjq: https://github.com/doloopwhile/pyjq
httplib2


Download the latest versions from these sites and follow the installation
instructions. The system has been developed on Linux and the following instructions
@@ -69,32 +69,32 @@ as appropriate):

$ export CFLAGS=-Qunused-arguments
$ export CPPFLAGS=-Qunused-arguments
$ sudo -E pip install lxml

-------------------
Windows quick start
-------------------

On Windows, you can use the demo.bat file to run all the various parts of Clade with the
supplied example data. You will need to edit paths.bat to match your configuration first.

The demo_noclassify.bat file also runs Clade, but doesn't repeat the initial classification
to avoid duplicates. If you need it, cleanup.bat clears all the indexes.

------------
Example data
------------

The Clade distribution includes an example taxonomy and documents derived
from Wikipedia on the topic of Social Psychology. The taxonomy is
provided as a CSV file to illustrate how to import existing taxonomies.

To import the taxonomy, run on the command line:

$ python classify.py import data/socpsy.csv

on Windows:

C:\> python classify.py import data\socpsy.csv

This will create a Python data structure, and pickle it as data/tax.
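
To make the "Python data structure" idea concrete, here is a minimal sketch of turning CSV rows into a nested tree and pickling it. The three-column layout (name, parent, semicolon-separated keywords) and the dictionary shape are assumptions for this example; the actual format of data/socpsy.csv and of the pickled data/tax file may differ.

```python
import csv
import io
import pickle


def _new_node(name):
    return {"name": name, "keywords": [], "children": []}


def taxonomy_from_csv(handle):
    """Build a nested dict tree from rows of (name, parent, keywords).

    Hypothetical layout: an empty parent column marks the single root node.
    """
    nodes = {}
    root = None
    for name, parent, keywords in csv.reader(handle):
        node = nodes.setdefault(name, _new_node(name))
        node["keywords"] = [k.strip() for k in keywords.split(";") if k.strip()]
        if parent:
            # Create the parent on demand so row order does not matter.
            nodes.setdefault(parent, _new_node(parent))["children"].append(node)
        else:
            root = node
    return root


sample = io.StringIO(
    "Social Psychology,,social psychology\n"
    "Conformity,Social Psychology,conformity;peer pressure\n"
)
tree = taxonomy_from_csv(sample)
blob = pickle.dumps(tree)          # what would be written to a file
restored = pickle.loads(blob)
print(restored["children"][0]["keywords"])
```

The sketch pickles to a byte string rather than to data/tax so that running it cannot overwrite a real taxonomy.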
@@ -103,61 +103,61 @@ To import the documents, first copy the Clade Solr configuration files into your
Solr home, e.g.:

$ cp -f clade/solr-conf/* apache-solr-4.6.2/example/solr/collection1/conf

on Windows:

C:\> copy clade\solr-conf\*.* apache-solr-4.6.2\example\solr\collection1\conf

Then [re]start Solr, e.g.:

$ cd apache-solr-4.6.2/example
$ java -jar start.jar &

on Windows:

C:\> cd apache-solr-4.6.2\example
C:\> java -jar start.jar &

Now start the Stanford Named Entity Recognition server, which is used to pull
names, places etc. out of the source data:
$ cd stanford-ner-2011-09-14

$ cd stanford-ner-2018-10-16
$ ./server.sh &

or on Windows:
C:\> cd stanford-ner-2011-09-14

C:\> cd stanford-ner-2018-10-16
C:\> server.bat

Finally, add the example documents (which are provided as plaintext files):

$ python classify.py textdir data/socpsy-pages

on Windows:

C:\> python classify.py textdir data\socpsy-pages

This will output the name of each file as it is processed, and will take a
few minutes to complete.


--------------
Running the UI
--------------

The Clade UI is implemented as a web application. To start it, run:

$ python server.py

on Windows:

C:\> python server.py

Then point a browser at

http://localhost:8080/


----------------------
Taxonomy import/export
----------------------
@@ -169,9 +169,9 @@ optionally the document categorisations) to CSV or XML.
To import from CSV:

$ python classify.py import taxonomy.csv

on Windows:

C:\> python classify.py import taxonomy.csv

To import from XML, use the .xml file extension. The extension must be either .csv
@@ -180,22 +180,22 @@ or .xml and is used to determine the type of file read.
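
The extension rule can be sketched as a small dispatch function. The name `taxonomy_format` is hypothetical, and treating the extension case-insensitively is an assumption of this example rather than documented Clade behaviour.

```python
import os


def taxonomy_format(filename):
    """Return "csv" or "xml" based on the file extension (hypothetical helper)."""
    ext = os.path.splitext(filename)[1].lower()  # case-insensitivity is assumed
    if ext not in (".csv", ".xml"):
        raise ValueError(
            "unsupported extension {!r}: expected .csv or .xml".format(ext)
        )
    return ext[1:]


print(taxonomy_format("taxonomy.csv"))
print(taxonomy_format("taxonomy.xml"))
```

Rejecting any other extension up front gives a clearer error than attempting to parse a file in the wrong format.
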
To export to CSV:

$ python classify.py export taxonomy.csv

on Windows:

C:\> python classify.py export taxonomy.csv

To export to XML, use the .xml file extension. Again, the extension determines the
type of file written. When exporting to XML, document classifications may
be written to the file by specifying the number of document ids that should be
exported per category, e.g.:

$ python classify.py export taxonomy.xml 100

on Windows:

C:\> python classify.py export taxonomy.xml 100
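
The effect of the per-category limit can be sketched as follows. The element names (`category`, `keyword`, `doc`), the node dictionary shape and the function name are assumptions for this example and do not describe Clade's actual XML schema.

```python
import xml.etree.ElementTree as ET


def category_to_xml(node, doc_ids_by_category, max_docs=0):
    """Serialise one category and its children (hypothetical schema).

    At most max_docs document ids are written per category; 0 writes none.
    """
    elem = ET.Element("category", name=node["name"])
    for keyword in node.get("keywords", []):
        ET.SubElement(elem, "keyword").text = keyword
    # Ids are assumed to be ordered by rank already, so slicing keeps the top ones.
    for doc_id in doc_ids_by_category.get(node["name"], [])[:max_docs]:
        ET.SubElement(elem, "doc", id=str(doc_id))
    for child in node.get("children", []):
        elem.append(category_to_xml(child, doc_ids_by_category, max_docs))
    return elem


tree = {
    "name": "Social Psychology",
    "keywords": ["social psychology"],
    "children": [{"name": "Conformity", "keywords": ["conformity"], "children": []}],
}
ranked = {"Conformity": [7, 3, 9], "Social Psychology": [1]}
root = category_to_xml(tree, ranked, max_docs=2)
print(ET.tostring(root, encoding="unicode"))
```

With `max_docs=2`, the "Conformity" category keeps only its first two document ids, mirroring the role of the trailing `100` in the export command.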


-----------
UI Controls
@@ -206,14 +206,14 @@ The Clade UI has two modes: Taxonomy and Document. It starts up in the former.
Taxonomy mode
-------------

The page is divided into two halves: the taxonomy tree on the left, and
information about the currently selected category on the right. Clade can support
multiple taxonomies, which can be selected from the drop-down in the upper left.
Selecting a taxonomy will load it into the tree view, where nodes can be
expanded or collapsed, selected and manipulated.

The taxonomy tree display has three small icons in the upper right. The +
icon causes a new category to be added as a child of the currently selected
node. The "pen" icon allows the current node name to be edited, and the X
icon deletes the current node (with no warning!).

@@ -240,7 +240,7 @@ category, showing the document ID, the title, the current rank and the previous
rank (before changing the active keywords). To view a document, click the title,
which will switch the UI into document mode.

To add a new taxonomy, make sure the taxonomy drop-down has no taxonomy
selected, then click the Create button. Enter a name for the new taxonomy
and click OK. You can then create the taxonomy by adding nodes to the root
node. All changes will be immediately saved to the data/tax file.
@@ -262,12 +262,11 @@ Document mode
-------------

Again, in document mode the page has a left section and a right section. On the
left is a list of taxonomy categories which match the document, ranked in
decreasing score order. The document ID and full text are displayed on the
right, together with a list of keywords from the selected category. The document
text is highlighted for matching keywords.
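
Keyword highlighting of this kind can be sketched with a regular expression. This is an illustration only: the function name and the `<mark>` tags are assumptions, matching is assumed to be case-insensitive on word boundaries, and Clade's own highlighting may well be done differently (for example by Solr or in the browser).

```python
import re


def highlight(text, keywords, start="<mark>", end="</mark>"):
    """Wrap every keyword occurrence in start/end markers (hypothetical helper).

    The text is not HTML-escaped here; a real UI would need to do that first.
    """
    if not keywords:
        return text
    # Try longer keywords first so "peer pressure" wins over "peer".
    alternatives = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: start + m.group(0) + end, text)


print(highlight("Peer pressure drives conformity.",
                ["conformity", "peer pressure", "peer"]))
```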

To go back to the taxonomy display, click the Back button in the upper right.
Don't use your browser back button - it won't work, as the UI is implemented
in JavaScript/AJAX.
