Continuous Parsing of Unsegmented Streams of Symbols #256

@akolonin

Description

Specification
This task addresses a complication of the Unsupervised Language Learning (ULL) problem (specifically, Unsupervised Text Parsing) under the following conditions:

  1. The input (training) stream of symbols is not segmented into sentences; rather, it is a continuous stream of symbols (complication).
  2. Each symbol in the stream is assumed to be like a word in English or a glyph in Chinese, so we assume that English word tokenisation is already done and there is no need to assemble words from series of Chinese glyphs (simplification).
  3. Output parses may be limited by a maximum number of symbols, bounding the size of the chunk of the symbolic sequence covered by a parse; the limit may vary from 3 (the minimum reasonable parsed phrase) to the entire number of symbols in the whole sequence (complication).
  4. The parses must not have crossing links, so the phrases corresponding to the parse trees must not overlap (see the sketch after this list).
  5. The parses should be adjacent: it is preferable not to have gaps between the phrases corresponding to the parses. Still, gaps may be present, so the parses may cover the entire sequence loosely, and some symbols may not be included in any parse.
  6. The specific parsing algorithm is not pre-determined by this specification: it may be the MST-Parser based on context-unspecific mutual information (MI) or context-specific contextual information (CI), or it can be any unsupervised/statistical parser capable of building parse trees.
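For concreteness, here is a minimal sketch of the data structures implied by conditions 4-5, assuming a dependency-style parse over absolute symbol positions in the stream; the names (`Link`, `Parse`, `links_cross`, `parses_overlap`) are hypothetical and not taken from any existing ULL code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    """Dependency link between two symbol positions; assumes left < right."""
    left: int
    right: int

@dataclass(frozen=True)
class Parse:
    """A parse tree over a contiguous chunk of the symbol stream."""
    links: tuple  # non-empty tuple of Link

    @property
    def start(self):
        return min(l.left for l in self.links)

    @property
    def end(self):
        return max(l.right for l in self.links)

def links_cross(a, b):
    """Condition 4: two links cross iff their spans interleave, i.e. one
    starts strictly inside the other and ends strictly outside it."""
    return (a.left < b.left < a.right < b.right or
            b.left < a.left < b.right < a.right)

def parses_overlap(p, q):
    """Conditions 4-5: the phrases (spans) of two parses overlap iff
    their [start, end] intervals intersect."""
    return p.start <= q.end and q.start <= p.end
```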

Conceptual Design Proposal

  1. Parsing is done in a sliding window, so each position of the sliding window over the input stream of symbols may provide its own parse (which may cover all words in the window, part of them, or none at all), which may be selected from multiple candidate parses (as in the case of the MST-Parser).
  2. Any parser (including the MST-Parser) may be used to get the parse within the window, relying either on MI collected across the entire stream or on CI specific to the given position of the sliding window.
  3. The parser should provide one parse (which may be the best across multiple candidates for the given position of the sliding window) along with a metric of the parse fitness.
  4. The parse fitness metric should primarily account for the percentage of words parsed within the window, so parses covering more words are unconditional winners over parses covering fewer words. Beyond that, parses covering the same number of words could have the metric adjusted by other factors (like discounting by a parse cost normalised to the range 0.0-1.0), as sketched after this list.
  5. During the parsing process, for every window position, advancing the positions from the beginning, keep recording the fitness metric of each parse.
  6. Given that each position of the sliding window provides its own parse, parses from different positions will overlap, so there should be some process of selecting the best non-crossing parses. There are at least two strategies for that, call them hasty and reluctant, as discussed below and sketched after this list.
  7. The hasty strategy assumes that we want to select a parse as soon as possible, trying to avoid leaving gaps, potentially sacrificing the reliability of the parses. That means, during parsing: if the fitness of the next successive parse is greater than or equal to that of the previous parse, add this parse to the result set; if the two parses overlap, discard the previous parse from the result set.
  8. The reluctant strategy instead assumes that we want the parser to be as reliable as possible even if gaps are left, sacrificing coverage of the input stream and leaving more gaps. That means, during parsing: if the fitness of the next successive parse is greater than that of the previous parse and the parses do not overlap, add this parse to the result set; if the two parses overlap, skip the current successive parse.
  9. Leaving gaps may be acceptable if the stream is known to contain noisy segments of nonsense series of symbols.
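One possible reading of the fitness metric in item 4 is a lexicographic score where word coverage dominates unconditionally and a normalised parse cost only breaks ties; the exact tie-breaking term is an assumption, since this proposal leaves it open:

```python
def parse_fitness(words_parsed, cost):
    """Lexicographic fitness: more covered words always wins (item 4);
    `cost` is assumed to be a parser-specific parse cost already
    normalised to 0.0-1.0, with lower cost being better."""
    return (words_parsed, 1.0 - cost)  # Python compares tuples lexicographically
```

Returning a tuple keeps the "unconditional winner" property exact: the cost term is only consulted when two parses cover the same number of words.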
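And a sketch of the hasty and reluctant selection loops from items 7-8, reusing `parses_overlap` and `parse_fitness` from the sketches above, with `fitness` being a callable mapping a parse to its recorded metric (item 5); the behaviour in cases the items leave unspecified (e.g. a non-overlapping parse with lower fitness) is one possible reading, not fixed by this proposal:

```python
def select_parses(candidates, fitness, strategy="hasty"):
    """Scan per-window parses in window-position order, maintaining a
    result set of accepted parses (item 6)."""
    result = []
    for parse in candidates:
        if not result:
            result.append(parse)
            continue
        prev = result[-1]
        if strategy == "hasty":
            # Item 7: accept if fitness is greater or equal; an overlapping
            # newcomer displaces the previously accepted parse.
            if fitness(parse) >= fitness(prev):
                if parses_overlap(prev, parse):
                    result.pop()  # discard the previous parse
                result.append(parse)
        else:
            # Item 8 (reluctant): never accept an overlapping parse, and
            # accept a non-overlapping one only if it improves the fitness.
            if (not parses_overlap(prev, parse)
                    and fitness(parse) > fitness(prev)):
                result.append(parse)
    return result
```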

Implementation and Testing Roadmap

  1. Try to implement this approach for any of the corpora currently used in the ULL project, such as POC-English ( http://langlearn.singularitynet.io/data/cleaned/English/EnglishPOC/EnglishPOC.txt ), Child-Directed Speech ( http://langlearn.singularitynet.io/data/cleaned/English/ChildDirectedSpeech/capital/ ) and Gutenberg Children ( http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_tokenized_LG5.5.1/ or http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/MSL25-2019JUL01/ - TBD).
  2. Implement a parser option to ignore line breaks between sentences so that the tokenised stream of words is fed into the parser (and the mutual information counter/extractor, if needed) continuously (see the sketch after this list).
  3. Evaluate the parses by feeding them into the Grammar Learner and evaluating the PA & F1 scores of the obtained grammar against known baseline numbers.
  4. All of the above can be done independently for at least four different kinds of parsers: A) LG-English (Link Grammar parses based on the manually created English Link Grammar dictionary), B) the conventional plain-vanilla OpenCog MST-Parser, C) DNN-based CI counting with the OpenCog MST-Parser, D) DNN-based constituency parses somehow converted to dependency parses.
  5. Later on, it may be applied to non-English streams of symbols from other sources.
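A minimal sketch of roadmap item 2, assuming the corpora above are plain tokenised text files with one sentence per line: line breaks are simply ignored and tokens are yielded as one continuous stream, then re-chunked by the sliding window. All function names and the `parser.best_parse` call are hypothetical:

```python
from collections import deque

def token_stream(path):
    """Yield whitespace-separated tokens continuously, ignoring line
    breaks between sentences (roadmap item 2)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield from line.split()

def sliding_windows(tokens, size):
    """Yield every position of a fixed-size sliding window over the stream."""
    window = deque(maxlen=size)
    for token in tokens:
        window.append(token)
        if len(window) == size:
            yield tuple(window)

# Hypothetical usage: parse every window position and record its fitness
# (conceptual design item 5).
# for window in sliding_windows(token_stream("EnglishPOC.txt"), size=7):
#     parse, fit = parser.best_parse(window)
```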
