
Unsupervised Segmentation Learning #255

@akolonin

The goals of Unsupervised Segmentation Learning (USL) are:

  1. Unsupervised learning of a lexicon and tokenisation for languages like Chinese
  2. Unsupervised learning of sentence splitting for languages like Chinese
  3. Identification of primary (elementary) patterns in symbolic streams of data

Study setup:

  1. Collect a training set of either A) Chinese texts OR B) numeric and symbolic streams of data for a specific domain
  2. Set up a reference lexicon (library of compound symbols) and reference sentence/series breaking patterns for either A) Chinese texts OR B) numeric and symbolic streams of data for a specific domain
  3. Implement a proof of concept (POC) of an unsupervised tokeniser capable of inferring a lexicon (library of compound symbols) from the training set, and assess the F1 score of the inferred data against the reference data
  4. Implement a POC of an unsupervised sentence/series splitter capable of inferring breaking patterns and chunking the stream of tokens/symbols using both the inferred and the reference patterns, and assess the F1 score of the resulting breakdowns for both.
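
The F1 assessment in steps 3 and 4 could be sketched as follows. This is only a minimal illustration: the issue does not specify how matches are counted, so the sketch assumes that a token counts as correct when its character span coincides exactly with a reference span. The helper names `token_spans` and `f1_score` are hypothetical.

```python
def token_spans(tokens):
    """Convert a token sequence into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def f1_score(inferred, reference):
    """F1 between two collections of items (spans, lexicon entries, breaks)."""
    inferred, reference = set(inferred), set(reference)
    true_positives = len(inferred & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(inferred)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the reference splits the text into 3 tokens,
# the inferred tokenisation wrongly splits the middle token in two.
reference_tokens = ["我们", "喜欢", "学习"]
inferred_tokens = ["我们", "喜", "欢", "学习"]

tokenisation_f1 = f1_score(token_spans(inferred_tokens),
                           token_spans(reference_tokens))

# The same function can compare an inferred lexicon with a reference lexicon.
lexicon_f1 = f1_score(set(inferred_tokens), set(reference_tokens))
```

Here two of the four inferred spans match two of the three reference spans, so precision is 0.5 and recall is 2/3. The same `f1_score` applies unchanged to sentence/series break positions, which would suit a single solution covering both tokenisation and splitting.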

Note: "tokenisation" and "sentence splitting" may be considered parts of the same "segmentation" problem, so there should be just one solution that solves both, depending on the specified number of "segmentation layers".
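
One way to read the "segmentation layers" idea is as a single grouping routine applied repeatedly, each layer grouping the units produced by the previous one. The sketch below is an assumption about the shape of such a solution, not the proposed method: `split_layer`, `segment`, and the two hand-written boundary rules are hypothetical stand-ins for boundary models that would be learned without supervision.

```python
def split_layer(units, is_boundary):
    """Group adjacent units, starting a new group where is_boundary(left, right)."""
    groups, current = [], []
    for unit in units:
        if current and is_boundary(current[-1], unit):
            groups.append(current)
            current = []
        current.append(unit)
    if current:
        groups.append(current)
    return groups


def segment(stream, boundaries):
    """Apply one boundary rule per layer; len(boundaries) is the layer count."""
    units = list(stream)
    for is_boundary in boundaries:
        units = split_layer(units, is_boundary)
    return units


def char_class(ch):
    """Coarse symbol class used by the placeholder token-boundary rule."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"


# Placeholder rules (assumptions): a token ends where the symbol class changes,
# a series ends after a ";" token. Learned models would replace both.
def token_boundary(left, right):
    return char_class(left) != char_class(right)


def series_boundary(left, right):
    return left == [";"]


one_layer = segment("12ab;34cd;", [token_boundary])
two_layers = segment("12ab;34cd;", [token_boundary, series_boundary])
```

With one layer the stream is chunked into six tokens. With two layers those tokens are further grouped into two series, each ending with the `;` token.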

Labels: doing (In progress), enhancement (New feature or request)