Suggestions for tutorial

Mariana Montes edited this page May 31, 2018 · 12 revisions
  1. We could add that the directory with the corpus can contain subdirectories.

  2. In section 1.4., the variable freqDict is actually the itemDict from previous snippets. The names should match, or a clarification should be added. Also, the instruction to filter the frequency dictionary outside Python is deprecated: it assumes the old format (word \t freq) instead of the current JSON format.
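Since the files are now JSON, the filtering could instead be done directly in Python. A minimal sketch (the words, frequencies, and threshold are made up for illustration; file handling is left to the user):

```python
import json

# Hypothetical toy data: a JSON-style frequency dictionary
freqs = {"the/det": 1500, "of/prep": 900, "cat/noun": 42, "sit/verb": 7}

# Keep only items with frequency >= 10
filtered = {w: f for w, f in freqs.items() if f >= 10}

# Write it back out as JSON, the format the workflow expects
print(json.dumps(filtered))
```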

  3. I would add a warning on how the itemDict (or freqDict) will be used in the workflow, so it's easier to plan how the lists are going to be cut and why. My suggestion is something along the lines of: The length of these files will determine the size of the matrices generated along the workflow; their content will define the information available in them. For the collocation matrix, you will need two files (which could of course be the same): one for target words (rows of the collocation matrix) and one for context words (columns of the collocation matrix). If the length of the target words file is A and the length of the context words file is B, the size of the collocation matrix will be AxB. That said:

  • When you generate a cosine similarity matrix, it will compare the rows, that is, the target words. The size of the cosine similarity matrix will then be AxA.
  • When you get the token level matrix, you will use the target words as context features in the first step, but the features of the final matrix will be the context words. If you have C tokens, you will first generate a matrix of CxA and end up with a matrix of CxB (B-dimensional token vectors). The cosine similarity matrix at the token level will be CxC. If your matrices are too large, the computing time might be too long; it would seem that the time it takes to generate the collocation matrix depends on the number of rows more than on the columns (a 10kx5k matrix took me almost an hour, a 5kx10k matrix took me half that). But the longer your vectors are, the more features you can compare. You might want to take this into account while filtering the frequency wordlists.
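The shape bookkeeping above can be sketched with plain numpy. This is only my understanding of the pipeline, not the library's actual code, and the toy sizes (A=4, B=3, C=5) are made up:

```python
import numpy as np

A, B, C = 4, 3, 5                  # targets, contexts, tokens (made-up sizes)
rng = np.random.default_rng(0)

colloc = rng.random((A, B))        # collocation matrix: targets x contexts (A x B)

# Row-wise cosine similarity compares targets with targets -> A x A
norm = colloc / np.linalg.norm(colloc, axis=1, keepdims=True)
cos_types = norm @ norm.T

tc = rng.random((C, A))            # token-by-target matrix (C x A)
tokvecs = tc @ colloc              # B-dimensional token vectors (C x B)

# Token level cosine similarity compares tokens with tokens -> C x C
tn = tokvecs / np.linalg.norm(tokvecs, axis=1, keepdims=True)
cos_tokens = tn @ tn.T
```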
  4. When you generate a wordlist, the output directory is created during that generation if it doesn't exist. But if you first create a wordlist and THEN update the output directory in the settings to a directory that doesn't already exist, the wordlist won't be found there.
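A minimal sketch of the pitfall (paths and filenames are hypothetical):

```python
import os
import tempfile

base = tempfile.mkdtemp()

# The output directory is created when the wordlist is generated...
old_out = os.path.join(base, "wordlists")
os.makedirs(old_out, exist_ok=True)
open(os.path.join(old_out, "corpus.freq.json"), "w").close()

# ...but a directory set in the settings afterwards is NOT created,
# so loading from it will fail.
new_out = os.path.join(base, "new-output")
print(os.path.isdir(old_out), os.path.isdir(new_out))
```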

  5. In section 1.5., after importing ColFreqManager, we also need from typetoken.Dict import CollocDict.

  6. We should add specific instructions for reloading files that we save. From the corresponding issue: there are classes for the different matrix formats. WCMatrix is for the collocate frequency matrix and the pmi matrix; WWMatrix is for the row-wise cosine similarity matrix and similarity rank matrix; CCMatrix is for the column-wise cosine similarity matrix and similarity rank matrix. For the token level there are also TCMatrix and TVMatrix. To load these matrices, use the following code:

>>> filename = "/path/to/output/xxx.wcmx.freq.json"
>>> freqMTX = WCMatrix.load(filename, fmt='freq')
>>> filename = "/path/to/output/xxx.wcmx.pmi.json"
>>> pmiMTX = WCMatrix.load(filename, fmt='pmi')
>>> meta_fname = "/path/to/output/xxx.wwmx.cos.meta"
>>> npy_fname = "/path/to/output/xxx.wwmx.cos.npy"
>>> cosMTX = WWMatrix.load(meta_fname, npy_fname, fmt='cos')

Please remember to specify fmt when loading matrices. This fmt attribute (actually format in the Matrix class) is necessary when you save matrices under the default name. (ALTHOUGH when I tried it with TCMatrix, it rejected the fmt argument.) Before using WCMatrix, WWMatrix, or CCMatrix to load matrices, remember to import them:

>>> from typetoken.Matrix import WCMatrix, WWMatrix, CCMatrix
  7. In section 2.1. we reuse good old itemDict (renamed freqDict in section 1.4). It would be useful to remind the user to load it, because that is likely to have been done much earlier. It's used to get the frequency of the types for the token vectors, so it doesn't need to be the original itemDict, as long as it includes those types:
>>> from typetoken.Dict import FrequencyDict
>>> filename = ...
>>> freqDict = FrequencyDict.load(filename, encoding=settings['encoding'])
  8. From Issue #10 Loading Matrices: the methods make_tc_weight_matrix() and make_token_vector() are now class methods, so the calls change as follows:
>>> tcWeightMTX = TokVecManager.make_tc_weight_matrix(tcPosMatrix, twMTX)
>>> tokvecs = TokVecManager.make_token_vector(soccMTX, tcWeightMTX)

with tcPosMatrix being the tcMATX generated by tvman.make_token_context_matrix() or loaded through TCMatrix.load(filename, encoding='utf-8', fmt='position').

  9. In section 2.3., the explanation of the weighting operation is misleading. We are not multiplying the token-by-context boolean matrix with a transposed type weight matrix; if we were, we would expect the columns of the former to match the rows of the latter, and that's not the case. Here's what I suggest as an explanation (to be refined if necessary); it derives from my understanding of the make_tc_weight_matrix method in Manager.py, which I hope is correct.

The method make_tc_weight_matrix() takes the token context position matrix (tcPosMatrix) and the transposed type-weight matrix (twMTX) as arguments. The columns of twMTX are the context features of tcPosMatrix, and the types of the tokens we selected must be among the rows of twMTX. The number of rows of twMTX does not matter, because rows that don't match the types of the tokens are simply ignored. For each token in tcPosMatrix, the method selects the token's type from the rows of twMTX and multiplies the value of each cell in the token's row by the corresponding value in the type's row of twMTX. The result is a matrix of the same size as tcPosMatrix but with weighted values (tcWeightMTX).

(Later, in the transition between tcWeightMTX and tokvecs:) The rows of the second order co-occurrence matrix (soccMTX) should be the columns of tcPosMatrix. Thus the make_token_vector() method, which takes soccMTX and tcWeightMTX as arguments, multiplies the two matrices, and the result is a matrix with as many rows as tcWeightMTX and tcPosMatrix (the tokens) and as many columns as soccMTX (the context words of the pmi matrix).
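My understanding of the two operations can be sketched in plain numpy. All names and dimensions here are made up for illustration; this is not the library's actual implementation:

```python
import numpy as np

# Made-up dimensions: 3 tokens, 4 first-order features, 5 second-order features
C, A, B = 3, 4, 5
rng = np.random.default_rng(0)

tcPos = (rng.random((C, A)) > 0.5).astype(float)  # token-by-context position matrix (C x A)
tw = rng.random((2, A))                           # transposed type-weight matrix: 2 types x A features
token_types = [0, 1, 0]                           # which row of tw each token's type corresponds to

# Row-wise weighting: each token's row is multiplied elementwise
# by its type's row of weights (NOT a matrix product with tw)
tcWeight = tcPos * tw[token_types, :]             # still C x A

socc = rng.random((A, B))                         # second order co-occurrence matrix (A x B)
tokvecs = tcWeight @ socc                         # token vectors: C x B
```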

  10. The size of a matrix is not very important when computing cosine similarity, but it is when saving the file. A token vector matrix with 12 thousand rows (like I got with 'state' from ten years of COHA) takes quite a while to save, so in that case it's useful to take a sample of the matrix and store that. The sample() method of a TVMatrix (e.g. tvsample = tokvecs.sample()) creates a random sample of 10% of the rows of the original matrix; the proportion can be adjusted with the percent argument (default value 0.1).
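The effect of such sampling can be sketched with numpy; sample_rows is a hypothetical stand-in for TVMatrix.sample, not the library's code:

```python
import numpy as np

def sample_rows(mtx, percent=0.1, seed=None):
    """Random sample of a fraction of the rows (hypothetical stand-in for TVMatrix.sample)."""
    rng = np.random.default_rng(seed)
    n = max(1, int(round(mtx.shape[0] * percent)))
    idx = rng.choice(mtx.shape[0], size=n, replace=False)
    return mtx[np.sort(idx), :]

tokvecs = np.ones((12000, 50))   # stand-in for a large token vector matrix
tvsample = sample_rows(tokvecs, percent=0.1)
```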

  11. The file resulting from tokvecs.save() (the same goes for its samples) can be loaded with TVMatrix.load(filename, fmt='pmi'). First it will be necessary to run from typetoken.Matrix import TVMatrix.

  12. (From the API): To save the token level cosine similarity matrix, choose a filename without an extension (if we have several matrices for different subsets, we might want to add the subset to the filename) and feed it to the save() method.

>>> filename = '{}/{}.tvmx.cos'.format(output_path, corpus_name)
>>> tokvecCos.save(filename=filename)
Saving matrix...
Stored in files:
	/home/mariana/COHA-II/COHA.tvmx.cos.meta
	/home/mariana/COHA-II/COHA.tvmx.cos.npy
