Hybrid sequential-MST parser #217

@akolonin

Implement a "hybrid" parser blending sequential information and MI, so that the extent of blending is configurable, with "maximum sequential" mode producing a "sequential parse" and "maximum MI" mode producing "plain MST-Parses with no account for distance".

There are two perspectives:
A) As Ben Goertzel has suggested ("use both the sequential parse and some fancier hierarchical parse as inputs to clustering and grammar learning? I.e. don't throw out the information of simple before-and-after co-occurrence, but augment it with information from the statistically inferred dependency parse tree"), we could (I guess) simply implement this in the existing MST-Parser, given the changes that @glicerico and Claudia made a year ago. That could be tried with a "distance_vs_MI" blending parameter in the MST-Parser code, which accounts for word-to-word distance. With distance_vs_MI=1.0 we would get "sequential parses", distance_vs_MI=0.0 would produce "pure MST-Parses", and, say, distance_vs_MI=0.7 would provide "English parses" while distance_vs_MI=0.5 would provide "Russian parses".
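
To make approach A) concrete, here is a minimal sketch of one way the blend could work. It is not the existing MST-Parser code: `blended_score`, `mst_parse` and `mi_table` are hypothetical names, the use of 1/distance as the "sequential" term is an assumption, and the spanning tree is built with plain Kruskal without any no-crossing-links constraint.

```python
from itertools import combinations


def blended_score(mi, distance, distance_vs_mi):
    """Linear blend (assumed form): 1.0 -> distance only, 0.0 -> MI only."""
    return distance_vs_mi * (1.0 / distance) + (1.0 - distance_vs_mi) * mi


def mst_parse(words, mi_table, distance_vs_mi):
    """Maximum spanning tree over word positions using the blended score.

    mi_table maps (left_word, right_word) -> MI; missing pairs count as 0.0.
    Returns a sorted list of (i, j) position pairs with i < j.
    """
    n = len(words)
    edges = []
    for i, j in combinations(range(n), 2):
        mi = mi_table.get((words[i], words[j]), 0.0)
        edges.append((blended_score(mi, j - i, distance_vs_mi), i, j))
    # Highest score first; ties broken by position for determinism.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))

    parent = list(range(n))  # union-find forest for Kruskal

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    links = []
    for _score, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two components, keep it
            parent[root_i] = root_j
            links.append((i, j))
    return sorted(links)


words = ["a", "b", "c", "d"]
mi_table = {("a", "b"): 1.0, ("a", "c"): 3.0, ("a", "d"): 5.0,
            ("b", "c"): 0.5, ("c", "d"): 2.0}
sequential = mst_parse(words, mi_table, distance_vs_mi=1.0)
pure_mst = mst_parse(words, mi_table, distance_vs_mi=0.0)
```

With distance_vs_MI=1.0 the adjacent pairs all score 1.0 and everything else scores less, so the tree degenerates into the left-to-right chain; with 0.0 only the MI values matter. For intermediate values, MI and 1/distance are on different scales, so some normalization of MI would probably be needed before the parameter behaves intuitively.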

B) As Ben Goertzel further wrote:

I don't think we want an arithmetic average of distance and MI, maybe more like

f(1) = C >1
f(1) > f(2) > f(3) > f(4)
f(4) = f(5) = ... = 1

and then

f(distance) * MI

i.e. maybe we count the MI significantly more if the distance is
small... but if MI is large and distance is large, we still count the
MI a lot...

(of course the decreasing function f becomes the thing to tune here...)
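
A minimal sketch of one function f that satisfies the constraints quoted above. The linear decay, the default C=2.0 and the names `distance_weight` and `weighted_mi` are assumptions chosen for illustration only; any decreasing shape that reaches 1 at distance 4 would fit the description equally well.

```python
def distance_weight(distance, c=2.0, cutoff=4):
    """f(1) = c > 1, strictly decreasing up to the cutoff, then constant 1."""
    if c <= 1.0:
        raise ValueError("c must be greater than 1")
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if distance >= cutoff:
        return 1.0
    # Linear interpolation between f(1) = c and f(cutoff) = 1.
    return 1.0 + (c - 1.0) * (cutoff - distance) / (cutoff - 1)


def weighted_mi(mi, distance, c=2.0, cutoff=4):
    """Score used for link selection: f(distance) * MI."""
    return distance_weight(distance, c, cutoff) * mi


weights = [distance_weight(d) for d in range(1, 7)]
```

Here C and the cutoff become the tunable parameters. One thing to watch: if MI can be negative, multiplying by f > 1 pushes close pairs with negative MI further down, which may or may not be what we want.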

The task can be broken down into subtasks:
1) Implement configurable blending of sequential and MI information using approach A) or B) or a combination of the two above.
2) Implement a unit test ensuring that it can provide either sequential or MST or hybrid parses on a small corpus like POC-English or POC-Turtle.
3) Study F1 of the parses based on the Gutenberg Children corpus and see if we can find a configuration outperforming "sequential parses".
4) Extend study 3) using both traditional MI and DNN-MI.
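
For the F1 studies in 3) and 4), a minimal sketch of a link-level F1 between a produced parse and a reference parse, assuming each parse is given as a collection of undirected word-position pairs. The function name `link_f1` and this input format are assumptions for illustration, not the project's existing evaluation code.

```python
def link_f1(predicted, reference):
    """F1 over undirected links; each link is a pair of word positions."""
    pred = {tuple(sorted(link)) for link in predicted}
    ref = {tuple(sorted(link)) for link in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0  # also covers empty predicted or reference sets
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Sequential chain compared against a reference where word 1 is the hub.
score = link_f1([(0, 1), (1, 2), (2, 3)], [(1, 0), (1, 2), (1, 3)])
```

Averaging this score over the sentences of a corpus for each blending setting would give the curve to compare against the "sequential parses" baseline.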
