Thanks for using Tinasoft Pytextminer
Pytextminer is part of a larger application, Tinasoft Desktop, available at http://github.com/moma/tinasoft.desktop/
It is a Python 2.6 text-mining module that builds semantic networks bottom-up.
It uses :
- NLTK, the natural language processing toolkit (http://www.nltk.org/),
- SQLite3, the embedded database library (http://sqlite.org),
- Twisted web server (http://twistedmatrix) with jsonpickle as a serializer (http://jsonpickle.github.com),
- Numpy for n-dimensional array processing (http://numpy.scipy.org/),
- pyTenjin for graph gexf files export (http://www.kuwata-lab.com/tenjin/)
Typical tasks are :
- multiple kinds of source file support
- extraction of key-phrases (ngrams) using various simple Natural Language Processing methods (stopwords, part-of-speech tagging, stemming, etc)
- creation of document/corpus/ngram graphs databases
- key-phrases cooccurrences calculation on a corpus basis
- production of graphs of multiple entities and multiple relations (hybrid storage into GEXF files, http://gexf.net, and into sqlite database)
- an httpserver exposing the API, sending json results
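The tasks above can be illustrated with a small, self-contained sketch. It is not Pytextminer's implementation: the function names, the tiny stopword list and the whitespace tokenizer are assumptions made for this example, the NLTK steps (part-of-speech tagging, stemming) are left out, and the GEXF output is built with the standard library instead of pyTenjin templates.

```python
import itertools
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def extract_ngrams(text, min_size=1, max_size=2):
    """Return the set of n-grams (as strings) found in one document."""
    # Naive tokenizer: lowercase, split on whitespace, keep alphabetic tokens.
    tokens = [t for t in text.lower().split() if t.isalpha()]
    ngrams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            window = tokens[start:start + size]
            # Skip n-grams that start or end with a stopword.
            if window[0] in STOPWORDS or window[-1] in STOPWORDS:
                continue
            ngrams.add(" ".join(window))
    return ngrams


def count_cooccurrences(documents, min_size=1, max_size=2):
    """Count, for each unordered n-gram pair, the documents containing both."""
    pairs = Counter()
    for text in documents:
        ngrams = sorted(extract_ngrams(text, min_size, max_size))
        pairs.update(itertools.combinations(ngrams, 2))
    return pairs


def to_gexf(pairs):
    """Serialize cooccurrence counts as a minimal GEXF 1.2 document string."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="undirected")
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    labels = sorted({ngram for pair in pairs for ngram in pair})
    ids = {label: str(index) for index, label in enumerate(labels)}
    for label in labels:
        ET.SubElement(nodes, "node", id=ids[label], label=label)
    for edge_id, ((left, right), weight) in enumerate(sorted(pairs.items())):
        ET.SubElement(edges, "edge", id=str(edge_id), source=ids[left],
                      target=ids[right], weight=str(weight))
    return ET.tostring(gexf, encoding="unicode")


documents = [
    "the semantic network of science",
    "semantic network analysis",
]
pairs = count_cooccurrences(documents, min_size=1, max_size=2)
gexf_text = to_gexf(pairs)
```

Each edge weight is the number of documents in which both n-grams appear, which is one simple way to read "cooccurrence on a corpus basis"; the real module may weight edges differently.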
This software is part of TINA, a European Union FP7 coordination action - FP7-ICT-2009-C :
- http://tinasoft.eu/
The software implements scientific results by David Chavalarias (CREA lab; CNRS/Ecole Polytechnique UMR 7656, http://chavalarias.com) and Jean-Philippe Cointet (INRA SENS, http://jph.cointet.free.fr).
SOURCE CODE REPOSITORY
https://forge.iscpif.fr/projects/tinasoft-pytextminer
http://github.com/moma/TinasoftPytextminer
AUTHORS
- Researchers and engineers at CREA lab (UMR 7656, CNRS, Ecole Polytechnique, France)
julian bilcke <julian.bilcke (at) iscpif (dot) fr>
david chavalarias <david.chavalarias (at) polytechnique (dot) edu>
jean philippe cointet <jphcoi (at) yahoo (dot) fr>
elias showk <elishowk (at) nonutc (dot) fr>
MAINTAINER
elias showk <elishowk (at) nonutc (dot) fr>
DOCUMENTATION, SUPPORT AND FEEDBACK
http://tinasoft.eu/ (project homepage)
https://forge.iscpif.fr/projects/tinasoft-pytextminer (software development)
PYTEXTMINER AS A USER
Download standalone packages from http://tinasoft.eu
USER DOCUMENTATION
http://tina.csregistry.org/tinauserdoc
PYTEXTMINER AS A DEVELOPER
* we provide an HTTP server exposing the main API of the TinaApp class
* alternatively, the apitests.py script provides examples of how to use the TinaApp class methods
- get the source code :
https://forge.iscpif.fr/projects/tinasoft-pytextminer/repository
OR
git clone https://sources.iscpif.fr/tinasoft.pytextminer.git
PYTHON : you'll need a Python 2.6 interpreter : http://python.org/
INSTALL THE PYTHON PACKAGE
$ sudo python setup.py install
or
$ sudo python setup.py develop
setup.py should check for these dependencies : numpy, nltk, twisted, jsonpickle, tenjin, pyyaml
OTHERWISE MANUALLY INSTALL PYTHON DEPENDENCIES
- they're listed in setup.py
DOWNLOAD NLTK DATA
You'll need to manually install the required NLTK corpus data
$ export NLTK_DATA="your/path/to/TinasoftPytextminer/shared/nltk_data"
$ python
> import nltk
> nltk.download()
Downloader> d punkt
Downloader> d brown
Downloader> d conll2000
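The interactive session above can also be scripted. The sketch below is an assumption about how one might do it, not part of Pytextminer: the helper name download_nltk_data and its "download" parameter are hypothetical. It relies only on nltk.download(), which accepts a package identifier and a download_dir argument and returns a boolean.

```python
# Corpora named in the interactive session above.
REQUIRED_PACKAGES = ("punkt", "brown", "conll2000")


def download_nltk_data(target_dir, packages=REQUIRED_PACKAGES, download=None):
    """Download each package into target_dir; return the names that failed.

    `download` defaults to nltk.download and can be replaced by a stub,
    which keeps this helper importable and testable without NLTK.
    """
    if download is None:
        import nltk  # imported lazily, only when a real download is requested
        download = nltk.download
    failed = []
    for name in packages:
        # nltk.download returns True on success and False otherwise.
        if not download(name, download_dir=target_dir):
            failed.append(name)
    return failed
```

Calling download_nltk_data("your/path/to/TinasoftPytextminer/shared/nltk_data") fetches the three packages into that directory. NLTK_DATA must still point to the same directory when the application runs.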
on MS WINDOWS:
$ set NLTK_DATA="TinasoftPytextminer\shared\nltk_data"
$ PATH C:\Python26;%PATH%
$ python apitests.py ... (see usage)
- finally open your web browser at http://localhost:8888 (no internet connection needed)
GNU/LINUX (and probably other UNIX-like systems)
- use the standalone frozen httpserver software
$ export NLTK_DATA=shared/nltk_data
$ python apitests.py ... (see usage)
DEVELOPER DOCUMENTATION
http://tina.csregistry.org/tinadevdoc
CONFIGURATION
config_*.yaml are YAML configuration files.
The main application (TinaApp class) loads one during initialization; its path is a required parameter
GUIDELINES
- map each column name of your csv file to the corresponding field name in the configuration file
- undeclared columns are ignored by the software
- the possible required and optional entries are :
#### REQUIRED
titleField: document title
contentField: document content
authorField: document acronym
corpusNumberField: corpus number
docNumberField: document number
#### OPTIONAL
index1Field: document index 1
index2Field: document index 2
dateField: document publication date
keywordsField: document keywords
- check the format of your csv file (encoding, delimiter, quoting character) and write these values into the fields "locale", "delimiter" and "quotechar"
- "minSize" and "maxSize" set the minimum and maximum length of the extracted n-grams
- all other fields are script configuration or default values for testing purposes
WARNING : YAML indentation must use spaces, not tabs, and all string values must be quoted (eg : 'prop_title'). Further information at http://en.wikipedia.org/wiki/YAML
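To make the column mapping concrete, here is a minimal sketch of how such entries could drive a csv reader. It is an illustration under assumptions, not Pytextminer's importer: the read_documents helper and every column name in CONFIG are invented for the example, and the settings are written as a Python dict rather than loaded from a config_*.yaml file. Encoding handling (the "locale" entry) is left out.

```python
import csv
import io

# Hypothetical settings mirroring the entries listed above; the column
# names on the right-hand side are invented for this example.
CONFIG = {
    "titleField": "prop_title",
    "contentField": "abstract",
    "authorField": "acronym",
    "corpusNumberField": "corpus_id",
    "docNumberField": "doc_id",
    "dateField": "pub_date",
    "delimiter": ",",
    "quotechar": '"',
}

REQUIRED_FIELDS = ("titleField", "contentField", "authorField",
                   "corpusNumberField", "docNumberField")


def read_documents(csv_file, config):
    """Yield one dict per csv row, keyed by configuration field name."""
    missing = [name for name in REQUIRED_FIELDS if name not in config]
    if missing:
        raise ValueError("missing required entries: %s" % ", ".join(missing))
    # Keep only the "...Field" entries: they map field names to csv columns.
    columns = {field: column for field, column in config.items()
               if field.endswith("Field")}
    reader = csv.DictReader(csv_file, delimiter=config["delimiter"],
                            quotechar=config["quotechar"])
    for row in reader:
        # Columns that are not declared in the configuration are dropped here.
        yield {field: row[column] for field, column in columns.items()}


sample = io.StringIO(
    "doc_id,corpus_id,prop_title,abstract,acronym,pub_date,extra\n"
    '1,2009,"Title, with comma",Some text,ABC,2009-01-01,ignored\n'
)
for document in read_documents(sample, CONFIG):
    print(document["titleField"], document["dateField"])
```

The "extra" column is not declared in CONFIG, so it never reaches the output, which matches the guideline that undeclared columns are ignored.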
SOURCE FILES DIRECTORY
- "source_files" is dedicated to the storage of your source files
- these files are used during the indexing and extraction steps of the workflow
- given the name of an existing file in this directory, the software will be able to read it
TESTED OPERATING SYSTEMS
Tinasoft Pytextminer was tested on the following platforms:
GNU/Linux (amd64, i386) with Python 2.6
Windows XP (32bit) with Python 2.6
Mac OS X >= 10.6
COPYRIGHT AND LICENSE
Copyright (C) 2009-2011 CREA Lab, CNRS/Ecole Polytechnique UMR 7656 (Fr)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/gpl.html>.