Exploiting Predictable Document Sub-structure in NMT

Intro

Many documents follow a strong and regular content structure, e.g. encyclopedia articles, scientific papers, and medical manuals. In documents with such predictable substructure we often observe frequent word repetition, similar sentence syntax, and recurring subtopics within particular parts. Translation of such documents can exploit these regularities. Here we focus on Chinese (ZH) to English (EN) translation of Wikipedia biography articles, and compare the translation performance of structure-aware and structure-agnostic models.

Motivation

Wikipedia biography articles are a typical document genre with predictable substructure and a relatively strict writing paradigm: a central entity (the person the biography describes) and subtopics such as early life, career, death, and family life. Each content zone is organised chronologically and shows general patterns in word choice, sentence syntax, etc.

Wikipedia biography writing often follows a template, and the content zone structure can be roughly defined as follows:

  • Overview: a general introduction, or a summary of the person's life.
  • Early years: childhood and youth.
  • Career: professional career.
  • Later years and death
  • Legacy and memorial
  • Family and personal life
  • Influence and evaluation: the person's social influence and distinctive contributions, or criticisms and controversies they caused.
  • Honours and awards
  • Works: publications of any form, such as albums for singers or books for writers.

Therefore, given a block of text (i.e. a section or passage), we can probably find a corresponding subtopic group for it, which is a natural text classification problem. Motivated by this, we incorporate a neural text classifier into NMT and expect NMT to benefit from such a multi-task learning paradigm.
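
As a toy illustration in Python (with hypothetical label names, not necessarily the label set used in this repo), the auxiliary task pairs each source-side section with one subtopic id:

# Hypothetical subtopic labels following the template above; the
# repo's own preprocessing may define a different label set.
ZONES = ["overview", "early_years", "career", "later_years_death",
         "legacy", "family", "influence", "honours", "works"]
ZONE_TO_ID = {zone: i for i, zone in enumerate(ZONES)}

# One auxiliary training example: a source-side (ZH) section, shortened
# to a single sentence here for brevity, paired with its subtopic id.
section = "他于1998年从职业足球退役。"  # "He retired from professional football in 1998."
label = ZONE_TO_ID["career"]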

Model Architecture

The model consists of a Seq2Seq attentional NMT model and a HAN (Hierarchical Attention Network) text classifier. The main task is still sentence-to-sentence neural machine translation; the auxiliary task is text classification (i.e. given a section/passage of the article, predict its subtopic group). The hidden state of the HAN model is involved in computing the logits for predicting the next target word in NMT (see the sketch after the list below). The advantages of this architecture are as follows:

  • By informing NMT with a representation of the whole source section, we widen the context from a single sentence to the entire section, which may yield better lexical cohesion in the translation.
  • A specialised text classifier is better able to extract features from the source sentences.
  • It is also a good way to leverage monolingual ZH data, which can reinforce source-language understanding in the encoder.
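
As a rough sketch of this wiring (plain NumPy with made-up dimension sizes; the actual model is implemented in TensorFlow on top of nematus, and the real computation may differ), the HAN section vector can enter the output layer additively:

import numpy as np

# Hypothetical sizes, not the defaults used in this repo.
vocab_size, dec_size, sec_size = 30000, 1024, 256

rng = np.random.default_rng(0)
W_dec = rng.normal(0.0, 0.01, (vocab_size, dec_size))  # decoder-state projection
W_sec = rng.normal(0.0, 0.01, (vocab_size, sec_size))  # HAN section-vector projection
b = np.zeros(vocab_size)

def next_word_logits(h_dec, h_sec):
    """Combine the per-step decoder state with the section-level HAN
    representation before the softmax over the target vocabulary."""
    return W_dec @ h_dec + W_sec @ h_sec + b

h_dec = rng.normal(size=dec_size)  # decoder hidden state at one time step
h_sec = rng.normal(size=sec_size)  # HAN representation of the source section
probs = np.exp(next_word_logits(h_dec, h_sec))
probs /= probs.sum()               # distribution over the next target word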

Usage

You may run ./src/train.py -h for detailed help, or check the experimental shell scripts in the scripts directory. See the following examples for a quick start:

Train the baseline, i.e. a Seq2Seq attentional NMT model.

./src/train.py \
    --mode baseline-nmt \
    --batch_size 100 \
    --dictionaries /path/to/source/vocab.json /path/to/target/vocab.json \
    --datasets /path/to/source/corpus /path/to/target/corpus \
    --embedding_size 500 \
    --word_rnn_enc_depth 2 \
    --word_rnn_state_size 1024 \
    --dec_depth 4

Train a HAN model for text classification.

./src/train.py \
    --mode clf \
    --batch_size 24 \
    --dictionaries /path/to/source/vocab.json /path/to/cls/to/int.json \
    --datasets /path/to/corpus /path/to/cls/label \
    --embedding_size 500 \
    --word_rnn_enc_depth 2 \
    --word_rnn_state_size 1024 \
    --sent_rnn_enc_depth 2 \
    --sent_rnn_state_size 512 \
    --sec_repr_size 256 
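
The exact file formats are defined by the preprocessing scripts in scripts; as an assumed example (hypothetical contents and path), the second --dictionaries argument (/path/to/cls/to/int.json) could simply map each subtopic label to an integer id, with /path/to/cls/label presumably holding one label per section:

import json

# Hypothetical contents for /path/to/cls/to/int.json; the real mapping
# is produced by this repo's preprocessing scripts.
cls_to_int = {"overview": 0, "early_years": 1, "career": 2,
              "later_years_death": 3, "legacy": 4, "family": 5,
              "influence": 6, "honours": 7, "works": 8}
with open("cls_to_int.json", "w", encoding="utf-8") as f:
    json.dump(cls_to_int, f, ensure_ascii=False, indent=2)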

Train a joint multi-task learning model.

./src/train.py \
    --mode joint \
    --batch_size 24 \
    --dictionaries /path/to/source/vocab.json /path/to/target/vocab.json \
    --aux_dictionary /path/to/cls/to/int.json \
    --datasets /path/to/source/corpus /path/to/target/corpus \
    --aux_datasets /path/to/corpus /path/to/cls/label \
    --embedding_size 500 \
    --word_rnn_enc_depth 2 \
    --word_rnn_state_size 1024 \
    --sent_rnn_enc_depth 2 \
    --sent_rnn_state_size 512 \
    --sec_repr_size 256 
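
In joint mode the two tasks presumably share the source embeddings and the word-level encoder (note that the --word_rnn_* options appear in all three modes), with --aux_dictionary and --aux_datasets supplying the classification side; the usual multi-task formulation optimises a weighted sum of the translation and classification losses.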

Translate using a trained model.

./src/translate.py \
    -m /path/to/trained/nmt/model \
    -i /path/to/your/source/corpus \
    -o translation.txt \
    -k 12 \
    -v
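
Here -m points to the trained model, -i and -o name the input and output files, -k sets the beam size, and -v enables verbose output, following the usual nematus translate.py options.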

Performance

TBC

Nematus

This project is based on nematus v0.4, specifically commit 75a168d247e50a746a717be0ac514e7c314246d3. Support for the Transformer model, the server translator, MAP training, rescoring, and ensembling has been removed for simplicity.
