
Incorrect number of training sentences read while training syntactic HMM #7

@GoogleCodeExporter

Description

Hello,
I was trying to train a syntactic HMM on my data. My training data contains 10050
parallel sentences with parsed target trees.

wc output of my training data
-------------------------------
   10050   284765  1599230 corpus.en
   10050   804959  4284275 corpus.entrees
   10050   228873  5058993 corpus.ta
   30150  1318597 10942498 total


When I run the alignment, the log file indicates that only 9811 sentences were
read instead of 10050. Here is what I see in the log file. After training, I
also get alignments for only 9811 sentences.

PS: I don't have any test data; my test data directories are empty. I have
attached my config file as well.

main() {
  Execution directory: en-ta/alignment_models/berkeley/lc_tok_10000_S
  Preparing Training Data
  Unknown number of training, 0 test
  Training models: 2 stages {
    Training stage 1: MODEL1 and MODEL1 jointly for 5 iterations {
      Initializing forward model [7.9s, cum. 7.9s]
      Initializing reverse model [5.2s, cum. 13s]
      Joint Train: 9811 sentences, jointly {
        Iteration 1/5 {
          Sentence 1/9811
          Sentence 2/9811
          Sentence 3/9811
          Sentence 169/9811
          Sentence 3304/9811
          Sentence 7650/9811
          Log-likelihood 1 = -1337616.882
          Log-likelihood 2 = -1336443.902
          ... 9805 lines omitted ...
        } [20s, cum. 20s]

Please let me know if I am missing something.
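One common cause of this kind of gap is a maximum-sentence-length filter: many aligners silently drop sentence pairs in which either side exceeds a configured token limit, which would explain 9811 of 10050 pairs surviving. This is a hypothetical diagnostic sketch, not part of the aligner itself; the cutoff `MAX_LEN` and the file names are assumptions, so substitute the limit from your own config.

```python
# Hypothetical diagnostic: count how many sentence pairs survive a
# length cutoff. MAX_LEN is an assumed value -- check your aligner
# config for the actual limit it applies during training.
MAX_LEN = 40

def count_kept(src_path, tgt_path, max_len=MAX_LEN):
    """Return (kept, dropped) counts; a pair is kept only if both
    sides have at most max_len whitespace-separated tokens."""
    kept = dropped = 0
    with open(src_path) as src, open(tgt_path) as tgt:
        for s, t in zip(src, tgt):
            if len(s.split()) <= max_len and len(t.split()) <= max_len:
                kept += 1
            else:
                dropped += 1
    return kept, dropped

if __name__ == "__main__":
    kept, dropped = count_kept("corpus.en", "corpus.ta")
    print(f"kept {kept}, dropped {dropped}")
```

If the `kept` count comes out at 9811 for some cutoff, the missing sentences are being filtered by length rather than lost to a data-format problem.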

Original issue reported on code.google.com by loganath...@gmail.com on 2 Aug 2013 at 10:10
