Exploiting Predictable Document Sub-structure in NMT

Intro

Many documents follow a strong and regular content structure, e.g. encyclopedia articles, scientific papers, and medical manuals. In documents with such predictable substructure we often observe frequent word repetition, similar sentence syntax, and recurring subtopics within particular parts. Translation of such documents can exploit these regularities. Here we focus on Chinese (ZH) to English (EN) translation of Wikipedia biography articles, and compare the translation performance of structure-aware and structure-agnostic models.

Motivation

Wikipedia biography articles are a typical document genre with predictable substructure and a relatively strict writing paradigm: a central entity (the person the biography describes) and subtopics such as early life, career, death, and family life. Each content zone is organised chronologically and shows general patterns in word choice, sentence syntax, etc.

Wikipedia biography writing often follows a template, and the content zone structure can be roughly defined as follows:

  • Overview: a general introduction, or a summary of the person's life.
  • Early years: childhood and youth.
  • Career: professional career.
  • Later years and death
  • Legacy and memorial
  • Family and personal life
  • Influence and evaluation: the person's social influence and distinctive contributions, or criticisms and controversies they caused.
  • Honours and awards
  • Works: publications of any form, such as albums for singers or books for writers.

Therefore, given a block of text (i.e. a section or passage), we can probably find a corresponding subtopic group for it, which is a natural text classification problem. Motivated by this, we incorporate a neural text classifier into NMT and expect NMT to benefit from such a multi-task learning paradigm.
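
As a toy illustration in Python (with hypothetical label names, not necessarily the label set used in this repo), the auxiliary task pairs each source-side section with one subtopic id:

# Hypothetical subtopic labels following the template above; the
# repo's own preprocessing may define a different label set.
ZONES = ["overview", "early_years", "career", "later_years_death",
         "legacy", "family", "influence", "honours", "works"]
ZONE_TO_ID = {zone: i for i, zone in enumerate(ZONES)}

# One auxiliary training example: a source-side (ZH) section, shortened
# to a single sentence here for brevity, paired with its subtopic id.
section = "他于1998年从职业足球退役。"  # "He retired from professional football in 1998."
label = ZONE_TO_ID["career"]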

Model Architecture

The model consists of a Seq2Seq attentional NMT model and a HAN (Hierarchical Attention Network) text classifier. The main task is still sentence-to-sentence neural machine translation; the auxiliary task is text classification (i.e. given a section/passage of the article, predict its subtopic group). The hidden state of the HAN model is involved in computing the logits for predicting the next target word in NMT (see the sketch after the list below). The advantages of this architecture are as follows:

  • By informing NMT with a representation of the whole source section, we widen the context from a single sentence to the entire section, which may yield better lexical cohesion in the translation.
  • A specialised text classifier is better able to extract features from the source sentences.
  • It is also a good way to leverage monolingual ZH data, which can reinforce source-language understanding in the encoder.
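
As a rough sketch of this wiring (plain NumPy with made-up dimension sizes; the actual model is implemented in TensorFlow on top of nematus, and the real computation may differ), the HAN section vector can enter the output layer additively:

import numpy as np

# Hypothetical sizes, not the defaults used in this repo.
vocab_size, dec_size, sec_size = 30000, 1024, 256

rng = np.random.default_rng(0)
W_dec = rng.normal(0.0, 0.01, (vocab_size, dec_size))  # decoder-state projection
W_sec = rng.normal(0.0, 0.01, (vocab_size, sec_size))  # HAN section-vector projection
b = np.zeros(vocab_size)

def next_word_logits(h_dec, h_sec):
    """Combine the per-step decoder state with the section-level HAN
    representation before the softmax over the target vocabulary."""
    return W_dec @ h_dec + W_sec @ h_sec + b

h_dec = rng.normal(size=dec_size)  # decoder hidden state at one time step
h_sec = rng.normal(size=sec_size)  # HAN representation of the source section
probs = np.exp(next_word_logits(h_dec, h_sec))
probs /= probs.sum()               # distribution over the next target word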

Usage

You may run ./src/train.py -h for detailed help, or check the experimental shell scripts in the scripts directory. See the following examples for a quick start:

Train the baseline, i.e. a Seq2Seq attentional NMT model.

./src/train.py \
    --mode baseline-nmt \
    --batch_size 100 \
    --dictionaries /path/to/source/vocab.json /path/to/target/vocab.json \
    --datasets /path/to/source/corpus /path/to/target/corpus \
    --embedding_size 500 \
    --word_rnn_enc_depth 2 \
    --word_rnn_state_size 1024 \
    --dec_depth 4

Train a HAN model for text classification.

./src/train.py \
    --mode clf \
    --batch_size 24 \
    --dictionaries /path/to/source/vocab.json /path/to/cls/to/int.json \
    --datasets /path/to/corpus /path/to/cls/label \
    --embedding_size 500 \
    --word_rnn_enc_depth 2 \
    --word_rnn_state_size 1024 \
    --sent_rnn_enc_depth 2 \
    --sent_rnn_state_size 512 \
    --sec_repr_size 256 
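
The exact file formats are defined by the preprocessing scripts in scripts; as an assumed example (hypothetical contents and path), the second --dictionaries argument (/path/to/cls/to/int.json) could simply map each subtopic label to an integer id, with /path/to/cls/label presumably holding one label per section:

import json

# Hypothetical contents for /path/to/cls/to/int.json; the real mapping
# is produced by this repo's preprocessing scripts.
cls_to_int = {"overview": 0, "early_years": 1, "career": 2,
              "later_years_death": 3, "legacy": 4, "family": 5,
              "influence": 6, "honours": 7, "works": 8}
with open("cls_to_int.json", "w", encoding="utf-8") as f:
    json.dump(cls_to_int, f, ensure_ascii=False, indent=2)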

Train a joint multi-task learning model.

./src/train.py \
    --mode joint \
    --batch_size 24 \
    --dictionaries /path/to/source/vocab.json /path/to/target/vocab.json \
    --aux_dictionary /path/to/cls/to/int.json \
    --datasets /path/to/source/corpus /path/to/target/corpus \
    --aux_datasets /path/to/corpus /path/to/cls/label \
    --embedding_size 500 \
    --word_rnn_enc_depth 2 \
    --word_rnn_state_size 1024 \
    --sent_rnn_enc_depth 2 \
    --sent_rnn_state_size 512 \
    --sec_repr_size 256 
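
In joint mode the two tasks presumably share the source embeddings and the word-level encoder (note that the --word_rnn_* options appear in all three modes), with --aux_dictionary and --aux_datasets supplying the classification side; the usual multi-task formulation optimises a weighted sum of the translation and classification losses.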

Translate using a trained model.

./src/translate.py \
    -m /path/to/trained/nmt/model \
    -i /path/to/your/source/corpus \
    -o translation.txt \
    -k 12 \
    -v
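
Here -m points to the trained model, -i and -o name the input and output files, -k sets the beam size, and -v enables verbose output, following the usual nematus translate.py options.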

Performance

TBC

Nematus

This project is based on nematus v0.4, specifically commit 75a168d247e50a746a717be0ac514e7c314246d3. Support for the Transformer model, the server translator, MAP training, rescoring, and ensembling has been removed for simplicity.
