Alternate Approach #171

@subins2000

Description

Varnam has a tokenizer that converts Malayalam (or any other Indian-language) text to Manglish patterns. While learning, Varnam builds a database of such pattern -> word mappings:

Pattern     | Word ID | Learned
----------- | ------- | -------
mal         | 77156   | 0
mala        | 228     | 1
mala        | 1586    | 1
mala        | 5434    | 1
mala        | 50134   | 1
mala        | 57521   | 0
malaa       | 50134   | 1
malaa       | 57521   | 0
malaagha    | 7784    | 1
malaaghama  | 82823   | 0
malaaghamaa | 82823   | 0
malaaghamaar | 25013  | 1
malaaghamar | 25013   | 1
malaak      | 102229  | 1
malaaka     | 24048   | 1
malaaka     | 43013   | 1

This makes the database huge. Varnam generates Malayalam suggestions by looking up the pattern in this database (the learnings database); if no match is found, it falls back to the tokenizer to construct a word.
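The current flow can be sketched roughly like this (a minimal illustration, not the actual Varnam code: the map stands in for the SQLite learnings table, and `tokenize` is a stand-in for the rule-based VST tokenizer):

```go
package main

import "fmt"

// learnings stands in for the pattern -> word learnings table shown above.
// In Varnam this is an SQLite database; a map is used here for illustration.
var learnings = map[string][]string{
	"mala":  {"മല", "മാല"}, // several word IDs can share one pattern
	"malaa": {"മാല"},
}

// tokenize is a hypothetical stand-in for Varnam's VST-based tokenizer,
// which maps Manglish syllables to Malayalam letters by rule.
func tokenize(input string) string {
	return "<tokenized:" + input + ">"
}

// suggest mirrors the current flow: look up the learnings DB first,
// and fall back to the tokenizer only when no learned pattern matches.
func suggest(input string) []string {
	if words, ok := learnings[input]; ok {
		return words
	}
	return []string{tokenize(input)}
}

func main() {
	fmt.Println(suggest("mala"))    // hit: served from the learnings DB
	fmt.Println(suggest("cricket")) // miss: falls back to the tokenizer
}
```

Every learned word costs one row per pattern variant (mala, malaa, malaagha, ...), which is why the table above grows so quickly.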

I want to know why the following approach wasn't chosen, @navaneeth:

  • No need for a pattern => word DB (the learnings file). Instead, just a word dictionary is needed.
  • Add more patterns to the VST (Varnam Symbol Table), with prioritized letters: n => ന, ണ. A capitalized N will always give ണ. So pani will give suggestions in priority order: പനി, പണി. Currently, both outputs appear only if the learnings DB happens to have pani assigned to both words.
  • When an input, say pani, is given to Varnam, it should tokenize it to പനി and പണി using just the VST, then look up the word dictionary for words starting with പനി and പണി to give additional suggestions.
  • For English words like "Cricket", tokenization will give bad results; in such cases we could still keep a pattern => word DB like the current learnings DB.
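The proposed flow in the bullets above can be sketched as follows (a rough sketch under stated assumptions: `tokenizeAll` and the `dictionary` slice are made-up stand-ins; a real implementation would consult the VST and use a trie or an indexed SQLite query for the prefix search):

```go
package main

import (
	"fmt"
	"strings"
)

// tokenizeAll is a hypothetical stand-in for VST tokenization. For "pani"
// the VST would yield both expansions, ordered by letter priority (n => ന, ണ).
func tokenizeAll(input string) []string {
	fake := map[string][]string{
		"pani": {"പനി", "പണി"}, // priority order from the VST
	}
	return fake[input]
}

// dictionary stands in for the plain word dictionary that would replace
// the pattern => word learnings table.
var dictionary = []string{"പനി", "പനിനീർ", "പണി", "പണിക്കാരൻ"}

// suggest tokenizes with the VST alone, then prefix-searches the word
// dictionary for each tokenization, keeping the VST priority order.
func suggest(input string) []string {
	var out []string
	for _, tok := range tokenizeAll(input) {
		for _, w := range dictionary {
			if strings.HasPrefix(w, tok) {
				out = append(out, w)
			}
		}
	}
	return out
}

func main() {
	// പനി-prefixed words come first, then പണി-prefixed ones.
	fmt.Println(suggest("pani"))
}
```

The key storage win: the dictionary holds one row per word, instead of one row per (pattern, word) pair as in the learnings table.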

By doing so, the size of the learnings database can be greatly reduced.
