Topic modelling by DonggeLiu · Pull Request #188 · mediacloud/backend

DonggeLiu · 2017-08-27T11:07:37Z

No description provided.


1. Made every variable and method private where possible
2. Reformatted code with the PyCharm shortcut
3. Added tests for TokenPool (works well) and ModelGensim (does not work yet because of a 'no module named XXX' problem when model_gensim calls its abstract parent)
4. Decoupled token_pool and model_*
5. Used if __name__ == '__main__' to give a simple demonstration of how to use each method

Model_*
1. Renamed mode_lda.py and model_lda2.py to model_gensim.py (which uses the Gensim package) and model_lda.py (which uses the LDA package)
2. Added an abstract parent class TopicModel.py
3. Moved some code from summarise() to add_stories() (a. better code structure; b. improved performance)
4. Changed some constants to function arguments (e.g. total_topic_num, iteration_num, etc.)

TokenPool
1. Added mc_root_path() when locating the stopwords file
2. Modified the query in token pool:
   1. added "INNER JOIN stories WHERE language='en'" to guarantee all stories are in English
   2. added "LIMIT" and the corresponding "SELECT DISTINCT ... ORDER BY..." to guarantee that only the required number of stories is fetched (which improves performance)
   3. added "OFFSET"
3. Restructured token_pool.py so that the stories are traversed only once (which improves performance)
4. Decoupled DB from token_pool.py
5. Replaced regex tokenization with nltk.tokenizer
6. Added nltk.stem.WordNetLemmatizer to lemmatize tokens (which gives a better result than stemming)
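The query changes listed under TokenPool can be sketched as follows. This is only an illustration that uses Python's built-in sqlite3 instead of the project's own database layer. The table and column names (stories, story_sentences, stories_id, sentence_number, sentence, language) and the helpers fetch_english_sentences and build_demo_db are assumptions made for the example and are not taken from the PR.

```python
import sqlite3


def fetch_english_sentences(db, limit, offset=0):
    """Return {stories_id: [sentence, ...]} for at most `limit` English stories.

    LIMIT/OFFSET are applied to the set of distinct story ids, not to the
    sentences, so a story is never returned with only part of its sentences.
    """
    rows = db.execute(
        """
        SELECT ss.stories_id, ss.sentence
        FROM story_sentences AS ss
        INNER JOIN (
            SELECT DISTINCT stories_id
            FROM stories
            WHERE language = 'en'
            ORDER BY stories_id
            LIMIT ? OFFSET ?
        ) AS picked ON picked.stories_id = ss.stories_id
        ORDER BY ss.stories_id, ss.sentence_number
        """,
        (limit, offset),
    ).fetchall()

    result = {}
    for stories_id, sentence in rows:
        result.setdefault(stories_id, []).append(sentence)
    return result


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema) for the demo."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE stories (stories_id INTEGER PRIMARY KEY, language TEXT);
        CREATE TABLE story_sentences (
            stories_id INTEGER, sentence_number INTEGER, sentence TEXT
        );
        INSERT INTO stories VALUES (1, 'en'), (2, 'es'), (3, 'en'), (4, 'en');
        INSERT INTO story_sentences VALUES
            (1, 0, 'First story.'), (1, 1, 'Second sentence.'),
            (2, 0, 'Una historia.'),
            (3, 0, 'Third story.'),
            (4, 0, 'Fourth story.');
        """
    )
    return db


if __name__ == "__main__":
    demo = build_demo_db()
    print(fetch_english_sentences(demo, limit=2))
    print(fetch_english_sentences(demo, limit=2, offset=2))
```

Passing the connection in as an argument mirrors the "Decoupled DB from token_pool.py" item: the fetching code does not need to know how the connection was created.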


DonggeLiu and others added 30 commits June 29, 2017 10:43

Create token_pool.py

e24f3b7

tokenize articles

added the file created last time

9535b81

Merge branch 'master' of github.com:berkmancenter/mediacloud into topic_modelling

934da4b

1. Two LDA models (with different packages, not sure which one is better yet) 2. A path helper to assist import 3. modified token_pool to make it compatible with LDA model

2a8a0f2

Merge branch 'topic_modelling' of github.com:berkmancenter/mediacloud into topic_modelling

e888805

Merge branch 'master' of github.com:berkmancenter/mediacloud into topic_modelling

a23aa13

1. Define types for parameters and return values

83a31a7

Merge branch 'master' of github.com:berkmancenter/mediacloud into topic_modelling

ced8bb4

isolate import gensim to see if it causes failure #3839

943c696

verifying the reason of errors

3db49ee

reformat the output of model_gensim to make it in the same format as model_lda

06d1d37

1. updated tests according to the changes I made in model_gensim.py

e027dad

added tests for model_lda.py

336c0d8

trying to fix the 'module' object has no attribute 'plugin' problem

178226b

reference topic_model module with full path

ebc4715

Merge branch 'master' into topic_modelling

39c5e8c

added the requirement for sklearn, which supports the NMF algorithm

716fe91

Added msg for each assertion

f66ead6

added msg for each assertion

2d6c12d

added model_nmf.py to model topics with the NMF algorithm

6c50ed2

The result of this algorithm is similar but slightly different from the LDA model + It allows multiple topics for each story
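To illustrate why NMF can assign several topics to one story, here is a small sketch. It is a hand-rolled multiplicative-update NMF in pure Python, not scikit-learn's implementation and not the code in model_nmf.py. The function names (nmf, topics_for_story), the min_share threshold and the toy matrix are assumptions made for the example. Each row of W holds one story's non-negative topic weights, so more than one topic can carry a meaningful share.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, topic_num, iteration_num=200, seed=0, eps=1e-9):
    """Factorise a non-negative story-by-term matrix V into W (story-by-topic)
    and H (topic-by-term) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(topic_num)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(topic_num)]
    for _ in range(iteration_num):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        numer = matmul(wt, v)
        denom = matmul(matmul(wt, w), h)
        h = [[h[k][j] * numer[k][j] / (denom[k][j] + eps) for j in range(m)]
             for k in range(topic_num)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        numer = matmul(v, ht)
        denom = matmul(w, matmul(h, ht))
        w = [[w[i][k] * numer[i][k] / (denom[i][k] + eps) for k in range(topic_num)]
             for i in range(n)]
    return w, h


def topics_for_story(weights, min_share=0.2):
    """Indices of every topic holding at least `min_share` of a story's weight."""
    total = sum(weights)
    if total == 0:
        return []
    return [k for k, wt in enumerate(weights) if wt / total >= min_share]


if __name__ == "__main__":
    # Toy term counts: the last story mixes the vocabulary of the others.
    v = [[3, 3, 0, 0], [2, 2, 0, 0], [0, 0, 4, 4], [1, 1, 2, 2]]
    w, h = nmf(v, topic_num=2)
    for story_weights in w:
        print(topics_for_story(story_weights))
```

Thresholding on a share of the row total is only one possible rule for deciding which topics count for a story.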

test cases for model_nmf.py

679fef0

Merge branch 'master' of github.com:berkmancenter/mediacloud into topic_modelling

3ab2124

Merge branch 'topic_modelling' of github.com:berkmancenter/mediacloud into topic_modelling

025dece

sorted requirements.txt in alphabetical order

61517d1

cache WordNet

36817b9

install the WordNet via NLTK

b5562ad

relocate test files

e6b126c

remove unnecessary files after test suite relocation

c93fe63

1. removed JSON serialization after fetching sentences from database

730a4e9

2. renamed a few methods/variables due to the change of functionalities

DonggeLiu added 30 commits August 13, 2017 16:34

a finder that can identify the max/min points of a polynomial computed based on a few points

d1129a6

added two methods tune_*() to find out the optimal number of topics

4d5b9e4
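The idea behind the two commits above (fit a polynomial to a few sampled points, then read off its maximum or minimum to choose the number of topics) can be sketched as below. This is a simplified assumption, not the PR's implementation: it fits only a quadratic through exactly three points, and the names fit_quadratic, tune_topic_num and the score callback (for example, a model likelihood) are hypothetical.

```python
def fit_quadratic(points):
    """Coefficients (a, b, c) of y = a*x**2 + b*x + c through three points
    with distinct x values."""
    (x1, y1), (x2, y2), (x3, y3) = points
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3 ** 2 * (y1 - y2) + x2 ** 2 * (y3 - y1) + x1 ** 2 * (y2 - y3)) / denom
    c = (x2 * x3 * (x2 - x3) * y1
         + x3 * x1 * (x3 - x1) * y2
         + x1 * x2 * (x1 - x2) * y3) / denom
    return a, b, c


def tune_topic_num(score, candidates):
    """Estimate the topic number that maximises `score`.

    `score` is evaluated only at the three candidate topic numbers; the
    maximum of the fitted parabola is then used as the estimate.
    """
    points = [(n, score(n)) for n in candidates]
    a, b, _ = fit_quadratic(points)
    if a >= 0:
        # The parabola has no interior maximum: fall back to the best sample.
        return max(points, key=lambda p: p[1])[0]
    vertex = -b / (2 * a)
    low, high = min(candidates), max(candidates)
    # Clamp to the sampled range and round to a whole number of topics.
    return int(round(min(max(vertex, low), high)))


if __name__ == "__main__":
    # Stand-in score that peaks at 7 topics.
    print(tune_topic_num(lambda n: -(n - 7) ** 2 + 100, (2, 6, 12)))
```

The appeal of this approach is that the expensive step, training and scoring a model, runs only a few times instead of once per candidate topic number.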

removed some print()s and rewrote evaluation()

8e77ed4

added more test cases on checking the accuracy of the model via likelihood comparisons

809aad7

improved polynomial tuning algorithm

f819366

no longer test tune_with_iteration, as polynomial has significantly better efficiency and performance; I will combine these two later

9869ca8

larger sample for Travis to test against

e185dd0

modify tests according to change in sample_stories.txt

3545e0e

use smaller sample size so that Travis will not fail

7816ec8

do not test limit if limit is not specified

94ebc24

improved tune with polynomial algorithm

c1c257e

removed unnecessary tune_with_iteration as its advantage/feature has been combined with tune_with_polynomial

6d09265

fixed the algorithm of optimal point finder

2479107

removed useless code

51dd0ec

Merge branch 'master' of github.com:berkmancenter/mediacloud into topic_modelling

620afb4

Disable unit tests temporarily for Travis to have a chance to compile and cache dependencies

5ead4f2

Cache WordNet of NLTK

0fb4e4a

set test cases back

87efd01

revert the changes made on .travis.yml

6ea203b

added more story samples

b675559

new commits from git pull origin master

8753442

removed unnecessary code to keep higher level of accuracy

e39415b

changed sample file name

a674d26

this sample file has been replaced by 3 files with different size

6267f72

This allows more flexibility in Travis (i.e. use larger samples if we can run tests longer in Travis)

use a smaller sample to test on Travis due to limit restriction

d4e9d48

1. break large blocks of code up into more functions

0c3f7ee

2. improve performance based on empirical results

remove unnecessary code

4c12748

restructured tests to reduce running time

720dd7a

further improvements on the code structure

97afc48

added more comments

remove redundant code

016d01c

DonggeLiu wants to merge 106 commits into mediacloud:master from DonggeLiu:topic_modelling
