Incorrect identification of sentence boundaries, if a sentence stops with a term that ends with a DOT #13916
Unanswered
dfch
asked this question in
Help: Other Questions
Replies: 1 comment 1 reply
I examined the structure of the incorrectly divided sentences and found this procedure:

Below you find some rough code that identifies the abbreviation that is an "end-of-sentence". This also works when spaCy identifies the […]

Identify if an abbreviation is an "end-of-sentence" token:

    from __future__ import annotations

    from spacy.tokens import Doc, Span, Token


    def is_possible_eos(abbrev: Token, min_token: int, max_token: int) -> bool:
        assert isinstance(abbrev, Token), type(abbrev)
        assert isinstance(min_token, int) and 0 <= min_token
        assert isinstance(max_token, int) and 0 <= max_token
        # When the abbreviation is not within both tokens, it is not end-of-sentence.
        # Note: `False == a < b < c` is a chained comparison and does not negate
        # the range check, so the check is written with `not (...)`.
        if not (min_token < abbrev.i < max_token):
            return False
        # When there is a token dependency after our abbreviation, it is not end-of-sentence.
        return abbrev.head.i <= abbrev.i


    def find_eos_token_index(abbrevs: list[Token], min_token: int, max_token: int) -> int | None:
        assert isinstance(abbrevs, list), type(abbrevs)
        assert isinstance(min_token, int) and 0 <= min_token
        assert isinstance(max_token, int) and 0 <= max_token
        candidates: dict[int, bool] = {}
        for abbrev in abbrevs:
            candidates[abbrev.i] = is_possible_eos(abbrev, min_token, max_token)
        if not any(candidates.values()):
            print("Nothing found. Trying reverse order.")
            candidates.clear()
            for abbrev in abbrevs:
                candidates[abbrev.i] = is_possible_eos(abbrev, max_token, min_token)
            if not any(candidates.values()):
                print("No candidates found.")
                return None
        # Return the index of the first abbreviation that qualified.
        return max(candidates, key=candidates.get)

Get root, new_root and abbreviations:

    # Wrapped in a function (the name is only illustrative) so that the
    # early `return None` statements are valid.
    def find_eos_token(doc: Doc, sent: Span) -> Token | None:
        # `doc` is the spaCy Doc object.
        # `sent` is an incorrectly identified sentence and is part of the `doc`.
        DOT = "."
        abbrevs = [token for token in sent if token.text != DOT and token.text.endswith(DOT)]
        if not abbrevs:
            return None
        roots = [token for token in sent if token.dep_ == "ROOT"]
        if 1 != len(roots):
            return None
        root = roots[0]
        assert isinstance(root, Token), type(root)
        # The new root is a verb or auxiliary that depends directly on the root.
        candidates = [t for t in sent if t.head.i == root.i and t.pos_ in ("VERB", "AUX") and t.i != root.i]
        if 1 != len(candidates):
            return None
        candidate = candidates[0]
        print(f"NewRoot: '{(candidate.i, candidate.text)}'")
        print(f"Root: '{(root.i, root.text)}'")
        print(f"Abbreviations: '{[(t.i, t.text) for t in abbrevs]}'.")
        eos_token_i = find_eos_token_index(abbrevs, min_token=candidate.i, max_token=root.i)
        if eos_token_i is None:
            return None
        eos_token = doc[eos_token_i]
        print(f"Found eos: [{eos_token_i}] '{eos_token.text}'")
        return eos_token
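Once the end-of-sentence index is known, the wrongly merged sentence can be cut into two token ranges. The following is a minimal sketch, assuming half-open `(start, end)` token offsets as used by `Span.start` and `Span.end`. The function name `split_at_eos` is hypothetical and not part of spaCy.

```python
def split_at_eos(start: int, end: int, eos_i: int) -> list[tuple[int, int]]:
    """Split the half-open token range [start, end) after token index eos_i.

    Hypothetical helper: returns one range if eos_i is the last token
    (nothing to split), otherwise two ranges.
    """
    if not start <= eos_i < end:
        raise ValueError(f"eos_i={eos_i} is outside [{start}, {end})")
    if eos_i == end - 1:
        return [(start, end)]
    return [(start, eos_i + 1), (eos_i + 1, end)]


# "The store opens at 9 a.m. And it closes at 4 p.m." has 12 tokens if
# "a.m." and "p.m." are single tokens; "a.m." is then token 5.
print(split_at_eos(0, 12, 5))  # [(0, 6), (6, 12)]
```

With spaCy, the ranges map back to spans by slicing the document, for example `[doc[s:e] for s, e in split_at_eos(sent.start, sent.end, eos_token_i)]`.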
Hi there. I have a question regarding sentence boundary detection in spaCy v3.8.0. Thank you for your help and support.
(This is my first post to this community. I try to follow the rules, guidelines and common sense, but if there is anything that I can improve, please let me know and I will try to improve. Thank you for your understanding.)
Introduction
- I use the `en_core_web_trf` model. But I see the same behaviour with the `en_core_web_sm` model.

Observation
- The sentence stops with a term that ends with a `.`. There is no separate punctuation mark that identifies the end of a sentence. This is correct "American English" (if I understand CMOS, AP correctly).
- `but` and `and` identify as `CCONJ`; only when a `PRON` follows an `and` (but not a `but`) I see the incorrect identification. See these examples that follow.
- With a `,` after the conjunction, spaCy sometimes identifies the sentences correctly.

Correct sentence boundaries with these examples
Two sentences:
- The store opens at 9 a.m. The store closes at 4 p.m.
- The store opens at 9 a.m. And the store closes at 4 p.m.
- The store opens at 9 a.m. It closes at 4 p.m.
- London is the capital of the U.K. They have the Tower of London.
- London is the capital of the U.K. It has the Tower of London.
- London is the capital of the U.K. London has the Tower of London.
- London is the capital of the U.K. But it is not the capital of Scotland.
- London is the capital of the U.K. And, it has the Tower of London.
- London is the capital of the U.K. The U.K. has many inhabitants.
- London is the capital of the U.K. And the U.K. has many inhabitants.

Incorrect sentence boundaries with these examples
One sentence:
- The store opens at 9 a.m. And it closes at 4 p.m.
- London is the capital of the U.K. And they have the Tower of London.
- London is the capital of the U.K. And it has the Tower of London.
- London is the capital of the U.K. Very often, it is rainy there.
- London is the capital of the U.K. Very often, it rains there.
- London is the capital of the U.K. The U.K. is very rainy.
- London is the capital of the U.K. And the U.K. is very rainy.

Questions
Thank you for your help and feedback!
Detailed Output
Here is the detailed output with tokenisation and dependencies with the examples I gave before.
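Output of this kind can be produced by reading a few attributes from every token. The following is a minimal sketch, assuming only the standard `Token` attributes `i`, `text`, `pos_`, `dep_`, `head` and `is_sent_start`. The function names `token_rows` and `format_rows` are hypothetical.

```python
def token_rows(doc) -> list[tuple]:
    """Return (index, text, POS, dependency, head text, is_sent_start) per token.

    `doc` is expected to be a spaCy Doc, but any iterable of objects with
    these attributes works.
    """
    return [
        (t.i, t.text, t.pos_, t.dep_, t.head.text, t.is_sent_start)
        for t in doc
    ]


def format_rows(rows: list[tuple]) -> str:
    """Render the rows as tab-separated lines for printing."""
    return "\n".join("\t".join(str(cell) for cell in row) for row in rows)
```

With a loaded pipeline `nlp`, `print(format_rows(token_rows(nlp(text))))` prints one line per token. A value of `True` in the last column marks the start of a sentence.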
INCORRECT: 'The store opens at 9 a.m. And it closes at 4 p.m.'
spaCy identifies:
- `a.m.` as `NOUN`
- `And` as `CCONJ`
- `it` as `PRON`.

CORRECT: 'The store opens at 9 a.m. And the store closes at 4 p.m.'
spaCy identifies:
- `a.m.` as `NOUN`
- `And` as `CCONJ`
- `store` as `NOUN`.

I use the term `the store` from the first sentence one more time.

CORRECT: 'The store opens at 9 a.m. The store closes at 4 p.m.'
spaCy identifies:
- `a.m.` as `NOUN`
- `store` as `NOUN`.

I use the term `the store` from the first sentence one more time.

CORRECT: 'The store opens at 9 a.m. It closes at 4 p.m.'
spaCy identifies:
- `a.m.` as `NOUN`
- `It` as `PRON`.

CORRECT: 'London is the capital of the U.K. They have the Tower of London.'
spaCy identifies:
- `U.K.` as `PROPN`
- `They` as `PRON`.

INCORRECT: 'London is the capital of the U.K. And they have the Tower of London.'
spaCy identifies:
- `U.K.` as `PROPN`
- `And` as `CCONJ`
- `they` as `PRON`.

CORRECT: 'London is the capital of the U.K. London has the Tower of London.'
spaCy identifies:
- `U.K.` as `PROPN`
- `London` as `PROPN`.

I use `London` from the first sentence one more time.

CORRECT: 'London is the capital of the U.K. It has the Tower of London.'
spaCy identifies:
- `U.K.` as `PROPN`
- `It` as `PRON`.

INCORRECT: 'London is the capital of the U.K. And it has the Tower of London.'
spaCy identifies:
- `U.K.` as `PROPN`
- `And` as `CCONJ`
- `it` as `PRON`.

CORRECT: 'London is the capital of the U.K. But it is not the capital of Scotland.'
spaCy identifies:
- `U.K.` as `PROPN`
- `But` as `CCONJ`
- `it` as `PRON`.

Additional Information
spaCy info
Initialization and Helper
This is the cell for the initialization of spaCy:
This is the helper function that shows the sentence boundaries:
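A minimal sketch of such a helper follows. It is an assumption and not the original cell: it relies only on `nlp(text).sents`, which spaCy provides once a parser or sentencizer has run. The names `sentence_texts` and `show_sentences` are hypothetical.

```python
def sentence_texts(nlp, text: str) -> list[str]:
    """Run the pipeline on `text` and return the text of every sentence."""
    doc = nlp(text)
    return [sent.text for sent in doc.sents]


def show_sentences(nlp, text: str) -> None:
    """Print how many sentences were detected and the text of each one."""
    sentences = sentence_texts(nlp, text)
    print(f"{len(sentences)} sentence(s) in {text!r}")
    for number, sentence in enumerate(sentences, start=1):
        print(f"  [{number}] {sentence}")
```

Here `nlp` would be the result of `spacy.load("en_core_web_trf")` or `spacy.load("en_core_web_sm")`. A call such as `show_sentences(nlp, "The store opens at 9 a.m. And it closes at 4 p.m.")` then prints one line per detected sentence.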