Thanks for using Tinasoft Pytextminer

Pytextminer is part of a larger software package, Tinasoft Desktop, which you can find at http://github.com/moma/tinasoft.desktop/

A text-mining module for Python 2.6 that produces bottom-up semantic networks.
It uses :
- NLTK, the natural language processing toolkit (http://www.nltk.org/),
- SQLite3, the embedded database library (http://sqlite.org),
- Twisted web server (http://twistedmatrix) with jsonpickle as a serializer (http://jsonpickle.github.com),
- Numpy for n-dimensional array processing (http://numpy.scipy.org/),
- pyTenjin for exporting graphs to GEXF files (http://www.kuwata-lab.com/tenjin/)

Typical tasks are :
- support for multiple kinds of source files
- extraction of key-phrases (ngrams) using various simple Natural Language Processing methods (stopwords, part-of-speech tagging, stemming, etc)
- creation of document/corpus/ngram graph databases
- calculation of key-phrase cooccurrences on a per-corpus basis
- production of graphs of multiple entities and multiple relations (hybrid storage into GEXF files, http://gexf.net, and into an sqlite database)
- an HTTP server exposing the API and returning JSON results
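
To make the key-phrase and cooccurrence tasks more concrete, here is a minimal sketch in plain Python. It is not Pytextminer's implementation: the function names, the tiny stopword list and the filtering rule (drop n-grams that start or end with a stopword) are assumptions made for illustration, and the real module relies on NLTK for tokenizing, tagging and stemming. The sketch is written for a recent Python 3 interpreter, not for the Python 2.6 that the project targets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, tiny stopword list used for illustration only.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "on", "for", "to", "is", "by"}


def extract_ngrams(text, min_size=1, max_size=3, stopwords=STOPWORDS):
    """Return the set of n-grams (tuples of lowercase tokens) found in text.

    Assumed rule: an n-gram is dropped if it starts or ends with a stopword.
    """
    tokens = text.lower().split()
    ngrams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            gram = tuple(tokens[start:start + size])
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            ngrams.add(gram)
    return ngrams


def cooccurrences(documents, min_size=1, max_size=3):
    """Count in how many documents of the corpus each pair of n-grams co-occurs."""
    counts = Counter()
    for text in documents:
        # Sorting gives every pair a stable (smaller, larger) orientation.
        grams = sorted(extract_ngrams(text, min_size, max_size))
        for pair in combinations(grams, 2):
            counts[pair] += 1
    return counts


corpus = [
    "semantic network of text",
    "text mining for semantic network",
]
cooc = cooccurrences(corpus, min_size=1, max_size=2)
print(cooc[(("network",), ("semantic",))])
```

The min_size and max_size arguments play the role that "minSize" and "maxSize" play in the configuration file: they bound the length of the extracted n-grams.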

This software is part of TINA, a European Union FP7 coordination action - FP7-ICT-2009-C :
 - http://tinasoft.eu/
The software implements scientific results by David Chavalarias (CREA lab; CNRS/Ecole Polytechnique UMR 7656, http://chavalarias.com) and Jean-Philippe Cointet (INRA SENS, http://jph.cointet.free.fr).

SOURCE CODE REPOSITORY

    https://forge.iscpif.fr/projects/tinasoft-pytextminer
    http://github.com/moma/TinasoftPytextminer

AUTHORS

- Researchers and engineers at CREA lab (UMR 7656, CNRS, Ecole Polytechnique, France)
    julian bilcke <julian.bilcke (at) iscpif (dot) fr>
    david chavalarias <david.chavalarias (at) polytechnique (dot) edu>
    jean philippe cointet <jphcoi (at) yahoo (dot) fr>
    elias showk <elishowk (at) nonutc (dot) fr>

MAINTAINER

    elias showk <elishowk (at) nonutc (dot) fr>

DOCUMENTATION, SUPPORT AND FEEDBACK

    http://tinasoft.eu/ (project homepage)
    https://forge.iscpif.fr/projects/tinasoft-pytextminer (software development)

PYTEXTMINER AS A USER

    Download standalone packages from http://tinasoft.eu

    USER DOCUMENTATION

        http://tina.csregistry.org/tinauserdoc

PYTEXTMINER AS A DEVELOPER


    * we provide an HTTP server exposing the main API of the TinaApp class
    * alternatively, the apitests.py script provides examples of how to use the TinaApp class methods properly

    - get the source code :

    https://forge.iscpif.fr/projects/tinasoft-pytextminer/repository
    OR
    git clone https://sources.iscpif.fr/tinasoft.pytextminer.git

    PYTHON : you'll need a Python 2.6 interpreter : http://python.org/

    INSTALL THE PYTHON PACKAGE

        $ sudo python setup.py install
        or
        $ sudo python setup.py develop

    setup.py should check for these dependencies : numpy, nltk, twisted, jsonpickle, tenjin, pyyaml

    OTHERWISE MANUALLY INSTALL PYTHON DEPENDENCIES

        - they're listed in setup.py

    DOWNLOAD NLTK DATA

    You'll need to manually install the required NLTK corpus data :
        $ export NLTK_DATA="your/path/to/TinasoftPytextminer/shared/nltk_data"
        $ python
        >>> import nltk
        >>> nltk.download()
        Downloader> d punkt
        Downloader> d brown
        Downloader> d conll2000

    on MS WINDOWS:

            $ set NLTK_DATA=TinasoftPytextminer\shared\nltk_data
            $ PATH C:\Python26;%PATH%
            $ python apitests.py ... (see usage)

        - finally open your web browser at http://localhost:8888 (no internet connection needed)

    GNU/LINUX (and probably other UNIX-like systems)
        - use the standalone frozen HTTP server software

            $ export NLTK_DATA=shared/nltk_data
            $ python apitests.py ... (see usage)

    DEVELOPER DOCUMENTATION

        http://tina.csregistry.org/tinadevdoc

CONFIGURATION

    The config_*.yaml files are YAML configuration files.
    The main application (TinaApp class) loads the configuration file during initialization; its path is a required parameter.

    GUIDELINES

    - map each column name of your CSV file to the corresponding field name in the configuration file
    - undeclared columns are ignored by the software
    - here are the required and optional entries :

        #### REQUIRED
        titleField: document title
        contentField: document content
        authorField: document acronym
        corpusNumberField: corpus number
        docNumberField: document number
        #### OPTIONAL
        index1Field: document index 1
        index2Field: document index 2
        dateField: document publication date
        keywordsField: document keywords

    - check the format of your CSV file (encoding, delimiter, quoting character) and set the fields "locale", "delimiter" and "quotechar" accordingly
    - "minSize" and "maxSize" set the minimum and maximum length of the extracted n-grams
    - all other fields belong to the script configuration, or are default values for testing purposes

    WARNING : YAML indentation must use spaces, never tab characters, and all string values must be quoted (eg : 'prop_title'). Further information at http://en.wikipedia.org/wiki/YAML
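
    As an illustration of the guidelines above, the following sketch shows how a field-to-column mapping can drive the reading of a CSV file with Python's standard csv module. It is not the code used by TinaApp: the column names ('prop_title', 'doc_id', ...), the FIELDS and REQUIRED constants and the read_documents function are hypothetical, and the mapping is written as a Python dict instead of being loaded from a config_*.yaml file. It targets a recent Python 3 interpreter.

```python
import csv
import io

# Hypothetical mapping: configuration field name -> column name in the CSV file.
FIELDS = {
    "titleField": "prop_title",
    "contentField": "prop_abstract",
    "authorField": "prop_acronym",
    "corpusNumberField": "corpus_id",
    "docNumberField": "doc_id",
    "dateField": "pub_date",  # optional entry
}
REQUIRED = ("titleField", "contentField", "authorField",
            "corpusNumberField", "docNumberField")


def read_documents(stream, fields=FIELDS, delimiter=",", quotechar='"'):
    """Yield one dict per CSV row, keyed by configuration field name.

    Columns that are not declared in `fields` are ignored. A required
    field whose column is absent from the header raises ValueError.
    """
    reader = csv.DictReader(stream, delimiter=delimiter, quotechar=quotechar)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED if fields.get(name) not in header]
    if missing:
        raise ValueError("missing required columns for: " + ", ".join(missing))
    for row in reader:
        # Optional fields are kept only when their column exists in the file.
        yield dict((name, row[column])
                   for name, column in fields.items() if column in header)


sample = io.StringIO(
    'doc_id,corpus_id,prop_title,prop_abstract,prop_acronym,unused\n'
    '1,2009,"A title","Some, quoted content",ABC,ignored\n'
)
docs = list(read_documents(sample))
print(docs[0]["titleField"])
```

    The delimiter and quotechar arguments correspond to the "delimiter" and "quotechar" configuration fields. The file encoding, which the configuration stores under "locale", would be chosen when opening the real file.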

SOURCE FILES DIRECTORY

    - "source_files" is dedicated to the storage of your source files
    - these files are used during the indexing and extraction steps of the workflow
    - given an existing file name in this directory, the software will be able to read it

TESTED OPERATING SYSTEMS

    Tinasoft Pytextminer was tested on the following platforms:

        GNU/Linux (amd64, i386) with Python 2.6
        Windows XP (32bit) with Python 2.6
        Mac OS X >= 10.6

COPYRIGHT AND LICENSE

Copyright (C) 2009-2011 CREA Lab, CNRS/Ecole Polytechnique UMR 7656 (Fr)

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/gpl.html>.