Embeddings process

Cleanup: delete the following files to ensure that they are recomputed:
- html_corpus.txt.gz ... contains the extracted training XPaths
- html_vocabulary.cvs.gz ... the vocabulary-to-id mapping
- html_corpus.bin.gz ... the binary version (translated using the vocabulary) of the HTML XPath corpus
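
The cleanup step can be scripted. The following is a minimal sketch, assuming the three files live in the directory the scripts are run from; the helper name `remove_generated_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names are taken from the cleanup list above.
GENERATED_FILES = (
    "html_corpus.txt.gz",
    "html_vocabulary.cvs.gz",
    "html_corpus.bin.gz",
)


def remove_generated_files(directory="."):
    """Delete the generated files found in `directory`.

    Returns the names of the files that were actually removed, so a
    missing file is skipped silently instead of raising an error.
    """
    removed = []
    for name in GENERATED_FILES:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Calling `remove_generated_files()` from the repository root removes whichever of the three files exist, so the later steps start from a clean state.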
Generate a file with the XPath representations using generate-html-corpus-texts.py. The resulting XPaths are stored in html_corpus.txt.gz.
Use triinput.py to generate (a) the HTML vocabulary file and (b) the HTML corpus file, as well as the corresponding training corpora.
Use trilearn.py to train the embeddings.
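
The three steps above can be chained in a small driver. This is only a sketch: it assumes each script runs without command-line arguments from the repository root, which this README does not confirm, and the names `pipeline_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Script names come from the steps above. Running them without
# arguments is an assumption, not something the README states.
PIPELINE_SCRIPTS = (
    "generate-html-corpus-texts.py",  # writes html_corpus.txt.gz
    "triinput.py",                    # builds vocabulary, corpus, training corpora
    "trilearn.py",                    # trains the embeddings
)


def pipeline_commands(python=sys.executable):
    """Return one command (as an argument list) per pipeline step, in order."""
    return [[python, script] for script in PIPELINE_SCRIPTS]


def run_pipeline(cwd="."):
    """Run each step in order; stop at the first step that fails."""
    for command in pipeline_commands():
        subprocess.run(command, cwd=cwd, check=True)
```

Because `check=True` raises `subprocess.CalledProcessError` on a non-zero exit code, a failed corpus generation does not silently feed stale files into training.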
