Basic Statistical Machine Translation

Overview

This project implements a basic SMT system for Burmese grapheme-to-phoneme (G2P) and phoneme-to-grapheme (P2G) translation using established SMT toolkits such as Moses, GIZA++, and MGIZA, along with custom preprocessing scripts for data normalization and preparation.

The workflow covers the full SMT pipeline from data normalization and SGM file generation to training, decoding, and evaluation. Multiple experiments are documented with varying settings. BLEU scores and translation outputs are systematically logged and summarized for comparative analysis.

Error-Free PBSMT Workflow

Use Ubuntu-native Perl v5.38
Install Moses, GIZA++, and MGIZA
Optionally configure paths in
- config.baseline, generate_configs.pl,
- generate_sgms.pl, src2sgm.pl, and ref2sgm.pl
Create config.baseline for two translation directions using generate_configs.pl
Create SGMs using generate_sgms.pl
Copy the mkcls file from the giza-pp/mkcls-v2/ folder to the giza-pp/GIZA++-v2/ folder
Install ImageMagick and Graphviz using sudo

Edit ubuntu-17.04/moses/scripts/generic/mteval-v13a.pl:

line 950: \p{Line_Break} # replace this
line 950: \p{Line_Break=Hyphen} # with this

Experiments

SMT run1 (run1,2,3 in pbsmt_v1 and pbsmt_v2 => baseline/run 1+2+3): syl-normalization, 5-gram, max-phrase 5, prune 001, GIZA
SMT run2 (run4 in pbsmt_v3 => baseline2/run 1): new syl-normalization, 5-gram, max-phrase 5, prune 001, GIZA
SMT run3 (run5 in pbsmt_v3 => baseline2/run 2): new syl-normalization, 3-gram, max-phrase 5, prune 001, GIZA
SMT run4 (run6 in pbsmt_v4 => baseline2/run 3): new syl-normalization, 3-gram, max-phrase 3, prune 001, GIZA
SMT run5 (run7 in pbsmt_v4 => baseline2/run 4): new syl-normalization, 3-gram, max-phrase 3, prune 000, GIZA
SMT run6 (run8 in pbsmt_v4 => baseline2/run 5): new syl-normalization, 3-gram, max-phrase 3, prune 001, MGIZA

Summary: presentation slides

Results:

Dataset

Sayar's G2P dataset

File Structure

/
...
├── img/
├── notebooks/
├── summary/
│
├── baseline/           # run1 to run3 with data v1
│   ├── logs/
│   ├── my-ph/
│   │   ├── corpus/
│   │   ├── evaluation/
│   │   ├── lm/
│   │   ├── model/
│   │   ├── steps/
│   │   ├── training/
│   │   └── tuning/
│   └── ph-my/
│       └── ...
├── baseline2/          # run4 to run6 with data v2
│   └── ...
│
├── data/
│   ├── g2p-par/                    # originally Sayar's
│   ├── cleaned/                    # preprocessed data (version 1)
│   │   ├── ...
│   │   └── test-sgm/
│   │       ├── generate_sgms.pl    # originally Sayar's
│   │       ├── src2sgm.pl          # originally Sayar's
│   │       └── ref2sgm.pl          # originally Sayar's
│   ├── cleaned_2/                  # preprocessed data (version 2)
│   │   └── ...
│   └── logs/
│
├── syl-normalizer/     # originally Sayar's # modified to merge with previous token for athat (်) cases
│
├── config.baseline     # originally Sayar's # modified paths # uncomment multi-bleu
├── generate_configs.pl # originally Sayar's # modified paths
└── run-baseline.pl     # originally Sayar's # modified paths

References

Note

This project was done for educational purposes as an assignment for the AI Engineering Fundamentals class taught by Sayar Ye Kyaw Thu.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Basic Statistical Machine Translation

Overview

Error-Free PBSMT Workflow

Experiments

Dataset

File Structure

References

Note

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
baseline		baseline
baseline2		baseline2
data		data
img		img
notebooks		notebooks
summary		summary
syl-normalizer		syl-normalizer
.gitignore		.gitignore
README.md		README.md
config.baseline		config.baseline
config.baseline2		config.baseline2
generate_configs.pl		generate_configs.pl
presentation_slides.pdf		presentation_slides.pdf
requirements.txt		requirements.txt
run-baseline.pl		run-baseline.pl

Folders and files

Latest commit

History

Repository files navigation

Basic Statistical Machine Translation

Overview

Error-Free PBSMT Workflow

Experiments

Dataset

File Structure

References

Note

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages