forked from opencog/language-learning
Description
The goal of the challenge is to have an unsupervisedly trained parser produce parses that approximate the "expected" English parses as closely as possible, using cleaned Gutenberg Children corpus data as input and Link Grammar English parses in three forms as references.
Input:
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
(this is the "cleaned" Gutenberg Children corpus data, tokenized with the Link Grammar English tokenization rules)
References:
- http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/LG5.5.1/capital/parses/
(the above is the "bronze standard": the input corpus parsed with the Link Grammar English dictionary; the tokenization is done in a slightly different way, which can be ignored when comparing results)
- http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GC_LGEnglish_noQuotes_fullyParsed.ull
(the above is the "silver standard": the previous parses gathered in one file, keeping only sentences that are 100% parsed with the Link Grammar English dictionary and contain no direct speech fragments)
- http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GC_LGEnglish_noQuotes_manual.ull
(the above is the "gold standard": 200+ sentences randomly selected from the previous parses and reviewed by a human, with the links validated)
Requirements:
- The unsupervisedly trained parser should be trained on the input corpus using the same tokenization, assuming a space is the word separator and a double linefeed is the sentence separator.
- The unsupervisedly trained parser should be trained on a per-sentence basis, with no mutual influence between adjacent sentences.
- The output parses for each of the reference files should have file names identical to those in the reference data.
- Lower/upper case may be ignored, because the evaluation process ignores case.
- If the parser produces parses in a "phrase structure grammar" (PSG) structure (linking words as well as compound phrases, like http://demo.chaoticlanguage.com/), unlike the "link grammar" structure (linking only words), those parses should be converted to the "link grammar" structure.
- Sample code for writing parses in the ULL format used by the reference parses:
- Scheme: https://github.com/singnet/learn/blob/1b7220f066866e9ada13c96376ab7f87ee53a1aa/run-poc/redefine-mst-parser.scm#L148
- Java: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/gram/main/LexStructor.java#L548
- The links from LEFT-WALL in the expected parses may be ignored and need not be produced, because links from LEFT-WALL and links to the ending period will not be involved in the evaluation of the results.
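
To make the input and output requirements concrete, here is a minimal Python sketch (not taken from the linked Scheme or Java code). `split_sentences` follows the separators stated above: a double linefeed between sentences and whitespace between words. `format_ull` writes one parse under an assumed ULL layout: the sentence on its own line, followed by one `left_index left_word right_index right_word` line per link, with 1-based word indices. The function names, the `###LEFT-WALL###` token, and the exact layout are assumptions and should be checked against the reference `.ull` files.

```python
# Assumed wall token for index 0; verify against the reference parses.
LEFT_WALL = "###LEFT-WALL###"


def split_sentences(text):
    """Split corpus text into sentences, each a list of tokens.

    Sentences are separated by a double linefeed; tokens inside a
    sentence are separated by whitespace. Empty chunks are skipped.
    """
    sentences = []
    for chunk in text.split("\n\n"):
        tokens = chunk.split()
        if tokens:
            sentences.append(tokens)
    return sentences


def format_ull(tokens, links):
    """Format one parse in the assumed ULL layout.

    `links` is an iterable of (left, right) index pairs, where index 0
    stands for LEFT-WALL and index k (k >= 1) for the k-th token.
    """
    words = [LEFT_WALL] + list(tokens)
    lines = [" ".join(tokens)]
    for left, right in sorted(links):
        lines.append(f"{left} {words[left]} {right} {words[right]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    corpus = "a dog barks .\n\nthe cat sleeps ."
    for sentence in split_sentences(corpus):
        # Hypothetical parse: link each word to its right neighbour,
        # leaving out LEFT-WALL and the ending period.
        chain = [(i, i + 1) for i in range(1, len(sentence) - 1)]
        print(format_ull(sentence, chain))
```

Each sentence is handled independently, which matches the requirement that adjacent sentences have no influence on one another.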
Other information:
- Sample parser code in Scheme: https://github.com/singnet/learn
- Sample parser code in Java: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/gram/main/LexStructor.java#L649
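
The evaluation rules stated in the requirements (case is ignored; links from LEFT-WALL and links to the ending period are not counted) can be illustrated with a short Python sketch. The issue does not specify the scoring metric, so the precision/recall computation below is only an assumption, as are the function names and the representation of links as `(left, right)` index pairs with 0 standing for LEFT-WALL.

```python
def comparable_links(tokens, links):
    """Return the set of links that would take part in a comparison.

    Links touching index 0 (LEFT-WALL) and links to a sentence-final
    period are dropped; words are lower-cased so case is ignored.
    """
    period_index = len(tokens) if tokens and tokens[-1] == "." else None
    kept = set()
    for left, right in links:
        lo, hi = min(left, right), max(left, right)
        if lo == 0:
            continue  # link from LEFT-WALL
        if period_index is not None and hi == period_index:
            continue  # link to the ending period
        kept.add((lo, tokens[lo - 1].lower(), hi, tokens[hi - 1].lower()))
    return kept


def link_overlap(ref_tokens, ref_links, test_tokens, test_links):
    """Return (precision, recall) of test links against reference links.

    This metric is an assumed stand-in, not the challenge's official one.
    """
    ref = comparable_links(ref_tokens, ref_links)
    test = comparable_links(test_tokens, test_links)
    shared = len(ref & test)
    precision = shared / len(test) if test else 0.0
    recall = shared / len(ref) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    ref_tokens = ["A", "dog", "barks", "."]
    ref_links = [(0, 3), (1, 2), (2, 3), (3, 4)]
    test_tokens = ["a", "dog", "barks", "."]
    test_links = [(1, 2), (1, 3)]
    print(link_overlap(ref_tokens, ref_links, test_tokens, test_links))
```

Because the wall and period links are filtered out before comparison, a parser that never emits them is not penalized under this sketch.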