Incorrect identification of sentence boundaries, if a sentence stops with a term that ends with a DOT #13916

dfch · 2026-02-02T09:22:58Z

Hi there. I have a question regarding sentence boundary detection in spaCy v3.8.0. Thank you for your help and support.

(This is my first post to this community. I try to follow the rules, guidelines and common sense, but if there is anything that I can improve, please let me know and I will try to improve. Thank you for your understanding.)

Introduction

I use spaCy with the en_core_web_trf model. But I see the same behaviour with the en_core_web_sm model.
I examine text that is in "American English".

Observation

I have text with two sentences. The last word of the first sentence is a term that ends with a "." (for example "a.m." or "U.K."). There is no separate punctuation mark that identifies the end of the sentence. This is correct "American English" (if I understand CMOS and AP correctly).
In some conditions (but not always), spaCy incorrectly divides the tokens into sentences.
This occurs especially when a coordinating conjunction follows the first sentence. spaCy tags both "but" and "and" as CCONJ, but I see the incorrect identification when a PRON follows an "and", not when it follows a "but". See the examples that follow.
When I put a "," after the conjunction, spaCy sometimes identifies the sentences correctly.
spaCy also incorrectly identifies sentences when a term other than a conjunction follows the first sentence, for example "Very often" or "The U.K." in the examples listed under "Incorrect sentence boundaries".
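The text shape described here can be recognised with a small check. This is only a rough sketch that uses the standard library, not spaCy: the function name `candidate_boundaries`, the whitespace split, and the regular expression for dotted abbreviations are assumptions made for illustration, and they do not show how spaCy decides on sentence boundaries.

```python
import re

# Assumption: a "dotted abbreviation" is two or more single letters,
# each followed by a dot (for example "a.m." or "U.K.").
DOTTED_ABBREV = re.compile(r"(?:[A-Za-z]\.){2,}")


def candidate_boundaries(text: str) -> list[int]:
    """Return indexes of whitespace-separated words after which a sentence may end.

    A word is a candidate when it is a dotted abbreviation and the next
    word starts with an uppercase letter.
    """
    words = text.split()
    return [
        i
        for i, word in enumerate(words[:-1])
        if DOTTED_ABBREV.fullmatch(word) and words[i + 1][:1].isupper()
    ]


print(candidate_boundaries("The store opens at 9 a.m. And it closes at 4 p.m."))
print(candidate_boundaries("London is the capital of the U.K. And the U.K. is very rainy."))
```

The check only marks places where a boundary is possible. An abbreviation in the middle of a sentence that is followed by a proper noun is also marked, so the result is a list of candidates and not a decision.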

Correct sentence boundaries with these examples

Two sentences:

The store opens at 9 a.m. The store closes at 4 p.m.
The store opens at 9 a.m. And the store closes at 4 p.m.
The store opens at 9 a.m. It closes at 4 p.m.
London is the capital of the U.K. They have the Tower of London.
London is the capital of the U.K. It has the Tower of London.
London is the capital of the U.K. London has the Tower of London.
London is the capital of the U.K. But it is not the capital of Scotland.
London is the capital of the U.K. And, it has the Tower of London.
London is the capital of the U.K. The U.K. has many inhabitants.
London is the capital of the U.K. And the U.K. has many inhabitants.

Incorrect sentence boundaries with these examples

One sentence:

The store opens at 9 a.m. And it closes at 4 p.m.
London is the capital of the U.K. And they have the Tower of London.
London is the capital of the U.K. And it has the Tower of London.
London is the capital of the U.K. Very often, it is rainy there.
London is the capital of the U.K. Very often, it rains there.
London is the capital of the U.K. The U.K. is very rainy.
London is the capital of the U.K. And the U.K. is very rainy.

Questions

Can you tell me if I must expect this behaviour? Is this a known issue?
Is there a procedure to prevent this behaviour?
Do you need more information?

Thank you for your help and feedback!

Detailed Output

Here is the detailed output with tokenisation and dependencies for the examples I gave before.

INCORRECT: 'The store opens at 9 a.m. And it closes at 4 p.m.'

spaCy identifies:

one sentence
a.m. as NOUN
And as CCONJ
it as PRON.

Original text: 'The store opens at 9 a.m. And it closes at 4 p.m.'.

--- Sentence Boundaries ---

Sentence [0]: 'The store opens at 9 a.m. And it closes at 4 p.m.'
  Start token: 0, End token: 11
  Tokens: ['The', 'store', 'opens', 'at', '9', 'a.m.', 'And', 'it', 'closes', 'at', '4', 'p.m.']
[0] [0] 'The' [DET] [the] <--'det'-- 'store' [NOUN].
[0] [1] 'store' [NOUN] [store] <--'nsubj'-- 'opens' [VERB].
	'The' [DET]
[0] [2] 'opens' [VERB] [open] <--'ROOT'-- 'opens' [VERB].
	'store' [NOUN]
	'at' [ADP]
	'closes' [VERB]
[0] [3] 'at' [ADP] [at] <--'prep'-- 'opens' [VERB].
	'a.m.' [NOUN]
[0] [4] '9' [NUM] [9] <--'nummod'-- 'a.m.' [NOUN].
[0] [5] 'a.m.' [NOUN] [a.m.] <--'pobj'-- 'at' [ADP].
	'9' [NUM]
[0] [6] 'And' [CCONJ] [and] <--'cc'-- 'closes' [VERB].
[0] [7] 'it' [PRON] [it] <--'nsubj'-- 'closes' [VERB].
[0] [8] 'closes' [VERB] [close] <--'conj'-- 'opens' [VERB].
	'And' [CCONJ]
	'it' [PRON]
	'at' [ADP]
[0] [9] 'at' [ADP] [at] <--'prep'-- 'closes' [VERB].
	'p.m.' [NOUN]
[0] [10] '4' [NUM] [4] <--'nummod'-- 'p.m.' [NOUN].
[0] [11] 'p.m.' [NOUN] [p.m.] <--'pobj'-- 'at' [ADP].
	'4' [NUM]

--- Token-level Sentence Information ---
Sentence start at token [0]: 'The'.

CORRECT: 'The store opens at 9 a.m. And the store closes at 4 p.m.'

spaCy identifies:

two sentences
a.m. as NOUN
And as CCONJ
store as NOUN.

I use the term "the store" from the first sentence one more time.

Original text: 'The store opens at 9 a.m. And the store closes at 4 p.m.'.

--- Sentence Boundaries ---

Sentence [0]: 'The store opens at 9 a.m.'
  Start token: 0, End token: 5
  Tokens: ['The', 'store', 'opens', 'at', '9', 'a.m.']
[0] [0] 'The' [DET] [the] <--'det'-- 'store' [NOUN].
[0] [1] 'store' [NOUN] [store] <--'nsubj'-- 'opens' [VERB].
	'The' [DET]
[0] [2] 'opens' [VERB] [open] <--'ROOT'-- 'opens' [VERB].
	'store' [NOUN]
	'at' [ADP]
[0] [3] 'at' [ADP] [at] <--'prep'-- 'opens' [VERB].
	'a.m.' [NOUN]
[0] [4] '9' [NUM] [9] <--'nummod'-- 'a.m.' [NOUN].
[0] [5] 'a.m.' [NOUN] [a.m.] <--'pobj'-- 'at' [ADP].
	'9' [NUM]

Sentence [1]: 'And the store closes at 4 p.m.'
  Start token: 6, End token: 12
  Tokens: ['And', 'the', 'store', 'closes', 'at', '4', 'p.m.']
[1] [0] 'And' [CCONJ] [and] <--'cc'-- 'closes' [VERB].
[1] [1] 'the' [DET] [the] <--'det'-- 'store' [NOUN].
[1] [2] 'store' [NOUN] [store] <--'nsubj'-- 'closes' [VERB].
	'the' [DET]
[1] [3] 'closes' [VERB] [close] <--'ROOT'-- 'closes' [VERB].
	'And' [CCONJ]
	'store' [NOUN]
	'at' [ADP]
[1] [4] 'at' [ADP] [at] <--'prep'-- 'closes' [VERB].
	'p.m.' [NOUN]
[1] [5] '4' [NUM] [4] <--'nummod'-- 'p.m.' [NOUN].
[1] [6] 'p.m.' [NOUN] [p.m.] <--'pobj'-- 'at' [ADP].
	'4' [NUM]

--- Token-level Sentence Information ---
Sentence start at token [0]: 'The'.
Sentence start at token [6]: 'And'.

CORRECT: 'The store opens at 9 a.m. The store closes at 4 p.m.'

spaCy identifies:

two sentences
a.m. as NOUN
store as NOUN.

I use the term "the store" from the first sentence one more time.

Original text: 'The store opens at 9 a.m. The store closes at 4 p.m.'.

--- Sentence Boundaries ---

Sentence [0]: 'The store opens at 9 a.m.'
  Start token: 0, End token: 5
  Tokens: ['The', 'store', 'opens', 'at', '9', 'a.m.']
[0] [0] 'The' [DET] [the] <--'det'-- 'store' [NOUN].
[0] [1] 'store' [NOUN] [store] <--'nsubj'-- 'opens' [VERB].
	'The' [DET]
[0] [2] 'opens' [VERB] [open] <--'ROOT'-- 'opens' [VERB].
	'store' [NOUN]
	'at' [ADP]
[0] [3] 'at' [ADP] [at] <--'prep'-- 'opens' [VERB].
	'a.m.' [NOUN]
[0] [4] '9' [NUM] [9] <--'nummod'-- 'a.m.' [NOUN].
[0] [5] 'a.m.' [NOUN] [a.m.] <--'pobj'-- 'at' [ADP].
	'9' [NUM]

Sentence [1]: 'The store closes at 4 p.m.'
  Start token: 6, End token: 11
  Tokens: ['The', 'store', 'closes', 'at', '4', 'p.m.']
[1] [0] 'The' [DET] [the] <--'det'-- 'store' [NOUN].
[1] [1] 'store' [NOUN] [store] <--'nsubj'-- 'closes' [VERB].
	'The' [DET]
[1] [2] 'closes' [VERB] [close] <--'ROOT'-- 'closes' [VERB].
	'store' [NOUN]
	'at' [ADP]
[1] [3] 'at' [ADP] [at] <--'prep'-- 'closes' [VERB].
	'p.m.' [NOUN]
[1] [4] '4' [NUM] [4] <--'nummod'-- 'p.m.' [NOUN].
[1] [5] 'p.m.' [NOUN] [p.m.] <--'pobj'-- 'at' [ADP].
	'4' [NUM]

--- Token-level Sentence Information ---
Sentence start at token [0]: 'The'.
Sentence start at token [6]: 'The'.

CORRECT: 'The store opens at 9 a.m. It closes at 4 p.m.'

spaCy identifies:

two sentences
a.m. as NOUN
It as PRON.

Original text: 'The store opens at 9 a.m. It closes at 4 p.m.'.

--- Sentence Boundaries ---

Sentence [0]: 'The store opens at 9 a.m.'
  Start token: 0, End token: 5
  Tokens: ['The', 'store', 'opens', 'at', '9', 'a.m.']
[0] [0] 'The' [DET] [the] <--'det'-- 'store' [NOUN].
[0] [1] 'store' [NOUN] [store] <--'nsubj'-- 'opens' [VERB].
	'The' [DET]
[0] [2] 'opens' [VERB] [open] <--'ROOT'-- 'opens' [VERB].
	'store' [NOUN]
	'at' [ADP]
[0] [3] 'at' [ADP] [at] <--'prep'-- 'opens' [VERB].
	'a.m.' [NOUN]
[0] [4] '9' [NUM] [9] <--'nummod'-- 'a.m.' [NOUN].
[0] [5] 'a.m.' [NOUN] [a.m.] <--'pobj'-- 'at' [ADP].
	'9' [NUM]

Sentence [1]: 'It closes at 4 p.m.'
  Start token: 6, End token: 10
  Tokens: ['It', 'closes', 'at', '4', 'p.m.']
[1] [0] 'It' [PRON] [it] <--'nsubj'-- 'closes' [VERB].
[1] [1] 'closes' [VERB] [close] <--'ROOT'-- 'closes' [VERB].
	'It' [PRON]
	'at' [ADP]
[1] [2] 'at' [ADP] [at] <--'prep'-- 'closes' [VERB].
	'p.m.' [NOUN]
[1] [3] '4' [NUM] [4] <--'nummod'-- 'p.m.' [NOUN].
[1] [4] 'p.m.' [NOUN] [p.m.] <--'pobj'-- 'at' [ADP].
	'4' [NUM]

--- Token-level Sentence Information ---
Sentence start at token [0]: 'The'.
Sentence start at token [6]: 'It'.

CORRECT: 'London is the capital of the U.K. They have the Tower of London.'

spaCy identifies:

two sentences
U.K. as PROPN
They as PRON.

Original text: 'London is the capital of the U.K. They have the Tower of London.'.

--- Sentence Boundaries ---

Sentence [0]: 'London is the capital of the U.K.'
  Start token: 0, End token: 6
  Tokens: ['London', 'is', 'the', 'capital', 'of', 'the', 'U.K.']
[0] [0] 'London' [PROPN] [London] <--'nsubj'-- 'is' [AUX].
[0] [1] 'is' [AUX] [be] <--'ROOT'-- 'is' [AUX].
	'London' [PROPN]
	'capital' [NOUN]
[0] [2] 'the' [DET] [the] <--'det'-- 'capital' [NOUN].
[0] [3] 'capital' [NOUN] [capital] <--'attr'-- 'is' [AUX].
	'the' [DET]
	'of' [ADP]
[0] [4] 'of' [ADP] [of] <--'prep'-- 'capital' [NOUN].
	'U.K.' [PROPN]
[0] [5] 'the' [DET] [the] <--'det'-- 'U.K.' [PROPN].
[0] [6] 'U.K.' [PROPN] [U.K.] <--'pobj'-- 'of' [ADP].
	'the' [DET]

Sentence [1]: 'They have the Tower of London.'
  Start token: 7, End token: 13
  Tokens: ['They', 'have', 'the', 'Tower', 'of', 'London', '.']
[1] [0] 'They' [PRON] [they] <--'nsubj'-- 'have' [VERB].
[1] [1] 'have' [VERB] [have] <--'ROOT'-- 'have' [VERB].
	'They' [PRON]
	'Tower' [PROPN]
	'.' [PUNCT]
[1] [2] 'the' [DET] [the] <--'det'-- 'Tower' [PROPN].
[1] [3] 'Tower' [PROPN] [Tower] <--'dobj'-- 'have' [VERB].
	'the' [DET]
	'of' [ADP]
[1] [4] 'of' [ADP] [of] <--'prep'-- 'Tower' [PROPN].
	'London' [PROPN]
[1] [5] 'London' [PROPN] [London] <--'pobj'-- 'of' [ADP].
[1] [6] '.' [PUNCT] [.] <--'punct'-- 'have' [VERB].

--- Token-level Sentence Information ---
Sentence start at token [0]: 'London'.
Sentence start at token [7]: 'They'.

INCORRECT: 'London is the capital of the U.K. And they have the Tower of London.'

spaCy identifies:

one sentence
U.K. as PROPN
And as CCONJ
they as PRON.

Original text: 'London is the capital of the U.K. And they have the Tower of London.'.

--- Sentence Boundaries ---

Sentence [0]: 'London is the capital of the U.K. And they have the Tower of London.'
  Start token: 0, End token: 14
  Tokens: ['London', 'is', 'the', 'capital', 'of', 'the', 'U.K.', 'And', 'they', 'have', 'the', 'Tower', 'of', 'London', '.']
[0] [0] 'London' [PROPN] [London] <--'nsubj'-- 'is' [AUX].
[0] [1] 'is' [AUX] [be] <--'ccomp'-- 'have' [VERB].
	'London' [PROPN]
	'capital' [NOUN]
[0] [2] 'the' [DET] [the] <--'det'-- 'capital' [NOUN].
[0] [3] 'capital' [NOUN] [capital] <--'attr'-- 'is' [AUX].
	'the' [DET]
	'of' [ADP]
[0] [4] 'of' [ADP] [of] <--'prep'-- 'capital' [NOUN].
	'U.K.' [PROPN]
[0] [5] 'the' [DET] [the] <--'det'-- 'U.K.' [PROPN].
[0] [6] 'U.K.' [PROPN] [U.K.] <--'pobj'-- 'of' [ADP].
	'the' [DET]
[0] [7] 'And' [CCONJ] [and] <--'cc'-- 'have' [VERB].
[0] [8] 'they' [PRON] [they] <--'nsubj'-- 'have' [VERB].
[0] [9] 'have' [VERB] [have] <--'ROOT'-- 'have' [VERB].
	'is' [AUX]
	'And' [CCONJ]
	'they' [PRON]
	'Tower' [PROPN]
	'.' [PUNCT]
[0] [10] 'the' [DET] [the] <--'det'-- 'Tower' [PROPN].
[0] [11] 'Tower' [PROPN] [Tower] <--'dobj'-- 'have' [VERB].
	'the' [DET]
	'of' [ADP]
[0] [12] 'of' [ADP] [of] <--'prep'-- 'Tower' [PROPN].
	'London' [PROPN]
[0] [13] 'London' [PROPN] [London] <--'pobj'-- 'of' [ADP].
[0] [14] '.' [PUNCT] [.] <--'punct'-- 'have' [VERB].

--- Token-level Sentence Information ---
Sentence start at token [0]: 'London'.

CORRECT: 'London is the capital of the U.K. London has the Tower of London.'

spaCy identifies:

two sentences
U.K. as PROPN
London as PROPN.

I use London from the first sentence one more time.

Original text: 'London is the capital of the U.K. London has the Tower of London.'.

--- Sentence Boundaries ---

Sentence [0]: 'London is the capital of the U.K.'
  Start token: 0, End token: 6
  Tokens: ['London', 'is', 'the', 'capital', 'of', 'the', 'U.K.']
[0] [0] 'London' [PROPN] [London] <--'nsubj'-- 'is' [AUX].
[0] [1] 'is' [AUX] [be] <--'ROOT'-- 'is' [AUX].
	'London' [PROPN]
	'capital' [NOUN]
[0] [2] 'the' [DET] [the] <--'det'-- 'capital' [NOUN].
[0] [3] 'capital' [NOUN] [capital] <--'attr'-- 'is' [AUX].
	'the' [DET]
	'of' [ADP]
[0] [4] 'of' [ADP] [of] <--'prep'-- 'capital' [NOUN].
	'U.K.' [PROPN]
[0] [5] 'the' [DET] [the] <--'det'-- 'U.K.' [PROPN].
[0] [6] 'U.K.' [PROPN] [U.K.] <--'pobj'-- 'of' [ADP].
	'the' [DET]

Sentence [1]: 'London has the Tower of London.'
  Start token: 7, End token: 13
  Tokens: ['London', 'has', 'the', 'Tower', 'of', 'London', '.']
[1] [0] 'London' [PROPN] [London] <--'nsubj'-- 'has' [VERB].
[1] [1] 'has' [VERB] [have] <--'ROOT'-- 'has' [VERB].
	'London' [PROPN]
	'Tower' [PROPN]
	'.' [PUNCT]
[1] [2] 'the' [DET] [the] <--'det'-- 'Tower' [PROPN].
[1] [3] 'Tower' [PROPN] [Tower] <--'dobj'-- 'has' [VERB].
	'the' [DET]
	'of' [ADP]
[1] [4] 'of' [ADP] [of] <--'prep'-- 'Tower' [PROPN].
	'London' [PROPN]
[1] [5] 'London' [PROPN] [London] <--'pobj'-- 'of' [ADP].
[1] [6] '.' [PUNCT] [.] <--'punct'-- 'has' [VERB].

--- Token-level Sentence Information ---
Sentence start at token [0]: 'London'.
Sentence start at token [7]: 'London'.

CORRECT: 'London is the capital of the U.K. It has the Tower of London.'

spaCy identifies:

two sentences
U.K. as PROPN
It as PRON.

Original text: 'London is the capital of the U.K. It has the Tower of London.'.

--- Sentence Boundaries ---

Sentence [0]: 'London is the capital of the U.K.'
  Start token: 0, End token: 6
  Tokens: ['London', 'is', 'the', 'capital', 'of', 'the', 'U.K.']
[0] [0] 'London' [PROPN] [London] <--'nsubj'-- 'is' [AUX].
[0] [1] 'is' [AUX] [be] <--'ROOT'-- 'is' [AUX].
	'London' [PROPN]
	'capital' [NOUN]
[0] [2] 'the' [DET] [the] <--'det'-- 'capital' [NOUN].
[0] [3] 'capital' [NOUN] [capital] <--'attr'-- 'is' [AUX].
	'the' [DET]
	'of' [ADP]
[0] [4] 'of' [ADP] [of] <--'prep'-- 'capital' [NOUN].
	'U.K.' [PROPN]
[0] [5] 'the' [DET] [the] <--'det'-- 'U.K.' [PROPN].
[0] [6] 'U.K.' [PROPN] [U.K.] <--'pobj'-- 'of' [ADP].
	'the' [DET]

Sentence [1]: 'It has the Tower of London.'
  Start token: 7, End token: 13
  Tokens: ['It', 'has', 'the', 'Tower', 'of', 'London', '.']
[1] [0] 'It' [PRON] [it] <--'nsubj'-- 'has' [VERB].
[1] [1] 'has' [VERB] [have] <--'ROOT'-- 'has' [VERB].
	'It' [PRON]
	'Tower' [PROPN]
	'.' [PUNCT]
[1] [2] 'the' [DET] [the] <--'det'-- 'Tower' [PROPN].
[1] [3] 'Tower' [PROPN] [Tower] <--'dobj'-- 'has' [VERB].
	'the' [DET]
	'of' [ADP]
[1] [4] 'of' [ADP] [of] <--'prep'-- 'Tower' [PROPN].
	'London' [PROPN]
[1] [5] 'London' [PROPN] [London] <--'pobj'-- 'of' [ADP].
[1] [6] '.' [PUNCT] [.] <--'punct'-- 'has' [VERB].

--- Token-level Sentence Information ---
Sentence start at token [0]: 'London'.
Sentence start at token [7]: 'It'.

INCORRECT: 'London is the capital of the U.K. And it has the Tower of London.'

spaCy identifies:

one sentence
U.K. as PROPN
And as CCONJ
it as PRON.

Original text: 'London is the capital of the U.K. And it has the Tower of London.'.

--- Sentence Boundaries ---

Sentence [0]: 'London is the capital of the U.K. And it has the Tower of London.'
  Start token: 0, End token: 14
  Tokens: ['London', 'is', 'the', 'capital', 'of', 'the', 'U.K.', 'And', 'it', 'has', 'the', 'Tower', 'of', 'London', '.']
[0] [0] 'London' [PROPN] [London] <--'nsubj'-- 'is' [AUX].
[0] [1] 'is' [AUX] [be] <--'ccomp'-- 'has' [VERB].
	'London' [PROPN]
	'capital' [NOUN]
[0] [2] 'the' [DET] [the] <--'det'-- 'capital' [NOUN].
[0] [3] 'capital' [NOUN] [capital] <--'attr'-- 'is' [AUX].
	'the' [DET]
	'of' [ADP]
[0] [4] 'of' [ADP] [of] <--'prep'-- 'capital' [NOUN].
	'U.K.' [PROPN]
[0] [5] 'the' [DET] [the] <--'det'-- 'U.K.' [PROPN].
[0] [6] 'U.K.' [PROPN] [U.K.] <--'pobj'-- 'of' [ADP].
	'the' [DET]
[0] [7] 'And' [CCONJ] [and] <--'cc'-- 'has' [VERB].
[0] [8] 'it' [PRON] [it] <--'nsubj'-- 'has' [VERB].
[0] [9] 'has' [VERB] [have] <--'ROOT'-- 'has' [VERB].
	'is' [AUX]
	'And' [CCONJ]
	'it' [PRON]
	'Tower' [PROPN]
	'.' [PUNCT]
[0] [10] 'the' [DET] [the] <--'det'-- 'Tower' [PROPN].
[0] [11] 'Tower' [PROPN] [Tower] <--'dobj'-- 'has' [VERB].
	'the' [DET]
	'of' [ADP]
[0] [12] 'of' [ADP] [of] <--'prep'-- 'Tower' [PROPN].
	'London' [PROPN]
[0] [13] 'London' [PROPN] [London] <--'pobj'-- 'of' [ADP].
[0] [14] '.' [PUNCT] [.] <--'punct'-- 'has' [VERB].

--- Token-level Sentence Information ---
Sentence start at token [0]: 'London'.

CORRECT: 'London is the capital of the U.K. But it is not the capital of Scotland.'

spaCy identifies:

two sentences
U.K. as PROPN
But as CCONJ
it as PRON.

Original text: 'London is the capital of the U.K. But it is not the capital of Scotland.'.

--- Sentence Boundaries ---

Sentence [0]: 'London is the capital of the U.K.'
  Start token: 0, End token: 6
  Tokens: ['London', 'is', 'the', 'capital', 'of', 'the', 'U.K.']
[0] [0] 'London' [PROPN] [London] <--'nsubj'-- 'is' [AUX].
[0] [1] 'is' [AUX] [be] <--'ROOT'-- 'is' [AUX].
	'London' [PROPN]
	'capital' [NOUN]
[0] [2] 'the' [DET] [the] <--'det'-- 'capital' [NOUN].
[0] [3] 'capital' [NOUN] [capital] <--'attr'-- 'is' [AUX].
	'the' [DET]
	'of' [ADP]
[0] [4] 'of' [ADP] [of] <--'prep'-- 'capital' [NOUN].
	'U.K.' [PROPN]
[0] [5] 'the' [DET] [the] <--'det'-- 'U.K.' [PROPN].
[0] [6] 'U.K.' [PROPN] [U.K.] <--'pobj'-- 'of' [ADP].
	'the' [DET]

Sentence [1]: 'But it is not the capital of Scotland.'
  Start token: 7, End token: 15
  Tokens: ['But', 'it', 'is', 'not', 'the', 'capital', 'of', 'Scotland', '.']
[1] [0] 'But' [CCONJ] [but] <--'cc'-- 'is' [AUX].
[1] [1] 'it' [PRON] [it] <--'nsubj'-- 'is' [AUX].
[1] [2] 'is' [AUX] [be] <--'ROOT'-- 'is' [AUX].
	'But' [CCONJ]
	'it' [PRON]
	'not' [PART]
	'capital' [NOUN]
	'.' [PUNCT]
[1] [3] 'not' [PART] [not] <--'neg'-- 'is' [AUX].
[1] [4] 'the' [DET] [the] <--'det'-- 'capital' [NOUN].
[1] [5] 'capital' [NOUN] [capital] <--'attr'-- 'is' [AUX].
	'the' [DET]
	'of' [ADP]
[1] [6] 'of' [ADP] [of] <--'prep'-- 'capital' [NOUN].
	'Scotland' [PROPN]
[1] [7] 'Scotland' [PROPN] [Scotland] <--'pobj'-- 'of' [ADP].
[1] [8] '.' [PUNCT] [.] <--'punct'-- 'is' [AUX].

--- Token-level Sentence Information ---
Sentence start at token [0]: 'London'.
Sentence start at token [7]: 'But'.

Additional Information

spaCy info

============================== Info about spaCy ==============================

spaCy version    3.8.11
Location         C:\src\spacey\venv\Lib\site-packages\spacy
Platform         Windows-11-10.0.26100-SP0
Python version   3.13.10
Pipelines        de_dep_news_trf (3.8.0), en_core_web_sm (3.8.0), en_core_web_trf (3.8.0), fr_dep_news_trf (3.8.0)

Initialization and Helper

This is the cell for the initialization of spaCy:

import spacy
from spacy import displacy
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_trf")

This is the helper function that shows the sentence boundaries:

def print_doc_info(doc) -> None:
    print(f"Original text: '{doc.text}'.\n")
    print("--- Sentence Boundaries ---")
    for i, sent in enumerate(doc.sents):
        print()
        print(f"Sentence [{i}]: '{sent.text}'")
        print(f"  Start token: {sent[0].i}, End token: {sent[-1].i}")
        print(f"  Tokens: {[token.text for token in sent]}")
        sent_tokens = [token for token in sent]
        for c, t in enumerate(sent_tokens):
            print(f"[{i}] [{c}] '{t.text}' [{t.pos_}] [{t.lemma_}] <--'{t.dep_}'-- '{t.head.text}' [{t.head.pos_}].")
            for child_token in t.children:
                print(f"\t'{child_token.text}' [{child_token.pos_}]")

    print()
    print("--- Token-level Sentence Information ---")
    for token in doc:
        if token.is_sent_start:
            print(f"Sentence start at token [{token.i}]: '{token.text}'.")

    displacy.render(doc, style="dep", jupyter=True, page=False)
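The two parts of the helper output are related: the "Sentence start at token" lines fix the "Start token" and "End token" values of each sentence. A minimal sketch of that relation in plain Python follows. The function name `spans_from_starts` is hypothetical, and the token counts are taken from the outputs shown earlier in this post.

```python
def spans_from_starts(starts: list[int], n_tokens: int) -> list[tuple[int, int]]:
    """Convert sentence-start token indexes into inclusive (start, end) token ranges."""
    ordered = sorted(starts)
    # Each sentence ends one token before the next sentence starts;
    # the last sentence ends at the last token of the text.
    ends = [start - 1 for start in ordered[1:]] + [n_tokens - 1]
    return list(zip(ordered, ends))


# 'The store opens at 9 a.m. And the store closes at 4 p.m.' (13 tokens, two starts)
print(spans_from_starts([0, 6], 13))

# 'The store opens at 9 a.m. And it closes at 4 p.m.' (12 tokens, one start)
print(spans_from_starts([0], 12))
```

A missing sentence start therefore shows up as one long range, which is what the INCORRECT outputs show.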

dfch · 2026-02-03T07:25:26Z

I examined the structure of the incorrectly divided sentences and found this procedure:

We process each sentence that spaCy identified.
We get a list of abbreviations: tokens that end with a "." but are not a "." themselves. If there are none, there is nothing to do.
Identify the ROOT token and record its index.
Identify a VERB or AUX token (new_root) whose head is the ROOT token, and record its index.
If more than one token meets this condition, we cannot identify the sentence boundary.
We process each abbreviation.
If the token index of the abbreviation is not between the root index and the new_root index (in either order), then this abbreviation is not an "end-of-sentence" token.
If the head of the abbreviation is a token with a higher token index, then this abbreviation is not an "end-of-sentence" token.
We select the abbreviation with the largest token index that meets these conditions.

Below is some rough code that identifies the abbreviation that is an "end-of-sentence" token. This also works when spaCy identifies the CCONJ as part of the first sentence:

text = bad = """L.C. is the capital of the U.K. And the U.K. is very rainy."""

Sentence [0]: 'L.C. is the capital of the U.K. And the U.K. is very rainy.'
  Start token: 0, End token: 13
  Tokens: ['L.C.', 'is', 'the', 'capital', 'of', 'the', 'U.K.', 'And', 'the', 'U.K.', 'is', 'very', 'rainy', '.']
[0] [0] 'L.C.' [PROPN] [L.C.] <--'nsubj'-- 'is' [AUX].
[0] [1] 'is' [AUX] [be] <--'ROOT'-- 'is' [AUX].
	'L.C.' [PROPN]
	'capital' [NOUN]
	'And' [CCONJ]
	'is' [AUX]
[0] [2] 'the' [DET] [the] <--'det'-- 'capital' [NOUN].
[0] [3] 'capital' [NOUN] [capital] <--'attr'-- 'is' [AUX].
	'the' [DET]
	'of' [ADP]
[0] [4] 'of' [ADP] [of] <--'prep'-- 'capital' [NOUN].
	'U.K.' [PROPN]
[0] [5] 'the' [DET] [the] <--'det'-- 'U.K.' [PROPN].
[0] [6] 'U.K.' [PROPN] [U.K.] <--'pobj'-- 'of' [ADP].
	'the' [DET]
[0] [7] 'And' [CCONJ] [and] <--'cc'-- 'is' [AUX].
[0] [8] 'the' [DET] [the] <--'det'-- 'U.K.' [PROPN].
[0] [9] 'U.K.' [PROPN] [U.K.] <--'nsubj'-- 'is' [AUX].
	'the' [DET]
[0] [10] 'is' [AUX] [be] <--'conj'-- 'is' [AUX].
	'U.K.' [PROPN]
	'rainy' [ADJ]
	'.' [PUNCT]
[0] [11] 'very' [ADV] [very] <--'advmod'-- 'rainy' [ADJ].
[0] [12] 'rainy' [ADJ] [rainy] <--'acomp'-- 'is' [AUX].
	'very' [ADV]
[0] [13] '.' [PUNCT] [.] <--'punct'-- 'is' [AUX].

Identify if an abbreviation is an "end-of-sentence" token:

from spacy.tokens import Token


def is_possible_eos(abbrev: Token, min_token: int, max_token: int) -> bool:

    assert isinstance(abbrev, Token), type(abbrev)
    assert isinstance(min_token, int) and 0 <= min_token
    assert isinstance(max_token, int) and 0 <= max_token

    # When the abbreviation is not strictly between both tokens, it is not end-of-sentence.
    if not (min_token < abbrev.i < max_token):
        return False

    # When the head of the abbreviation comes after the abbreviation, it is not end-of-sentence.
    return abbrev.head.i <= abbrev.i

def find_eos_token_index(abbrevs: list[Token], min_token: int, max_token: int) -> int | None:

    assert isinstance(abbrevs, list), type(abbrevs)
    assert isinstance(min_token, int) and 0 <= min_token
    assert isinstance(max_token, int) and 0 <= max_token

    # Try the given order first; new_root can be before or after root.
    candidates = [a.i for a in abbrevs if is_possible_eos(a, min_token, max_token)]

    if not candidates:
        print("Nothing found. Trying reverse order.")
        candidates = [a.i for a in abbrevs if is_possible_eos(a, max_token, min_token)]

    if not candidates:
        print("No candidates found.")
        return None

    # Select the candidate with the largest token index.
    return max(candidates)

Get root, new_root and abbreviations:

# Assumes `from spacy.tokens import Doc, Span, Token`. The function name is illustrative.
def find_eos_token(doc: Doc, sent: Span) -> Token | None:
    # `doc` is the spaCy Doc object.
    # `sent` is an incorrectly identified sentence and is part of the `doc`.

    DOT = "."
    abbrevs = [token for token in sent if token.text != DOT and token.text.endswith(DOT)]
    if not abbrevs:
        return None

    # There must be exactly one ROOT token.
    roots = [token for token in sent if token.dep_ == "ROOT"]
    if 1 != len(roots):
        return None
    root = roots[0]
    assert isinstance(root, Token), type(root)

    # There must be exactly one VERB or AUX token whose head is the ROOT token.
    candidates = [t for t in sent if t.head.i == root.i and t.pos_ in ("VERB", "AUX") and t.i != root.i]
    if 1 != len(candidates):
        return None
    candidate = candidates[0]
    print(f"NewRoot: '{(candidate.i, candidate.text)}'")
    print(f"Root: '{(root.i, root.text)}'")

    print(f"Abbreviations: '{[(t.i, t.text) for t in abbrevs]}'.")

    eos_token_i = find_eos_token_index(abbrevs, min_token=candidate.i, max_token=root.i)
    if eos_token_i is None:
        return None
    eos_token = doc[eos_token_i]
    print(f"Found eos: [{eos_token_i}] '{eos_token.text}'")
    return eos_token
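The procedure can also be checked without loading a model. The sketch below is not part of the code above: it uses a hypothetical `Tok` record in place of a spaCy Token, and the values in `TOKENS` are transcribed by hand from the output for the 'L.C. is the capital ...' example. The name `find_eos_index` is also an assumption. Sorting the two indexes replaces the "try the reverse order" step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tok:
    i: int      # token index
    text: str
    pos: str    # coarse part-of-speech tag
    dep: str    # dependency label
    head: int   # index of the head token


# Transcribed from the output for
# 'L.C. is the capital of the U.K. And the U.K. is very rainy.'
TOKENS = [
    Tok(0, "L.C.", "PROPN", "nsubj", 1),
    Tok(1, "is", "AUX", "ROOT", 1),
    Tok(2, "the", "DET", "det", 3),
    Tok(3, "capital", "NOUN", "attr", 1),
    Tok(4, "of", "ADP", "prep", 3),
    Tok(5, "the", "DET", "det", 6),
    Tok(6, "U.K.", "PROPN", "pobj", 4),
    Tok(7, "And", "CCONJ", "cc", 1),
    Tok(8, "the", "DET", "det", 9),
    Tok(9, "U.K.", "PROPN", "nsubj", 10),
    Tok(10, "is", "AUX", "conj", 1),
    Tok(11, "very", "ADV", "advmod", 12),
    Tok(12, "rainy", "ADJ", "acomp", 10),
    Tok(13, ".", "PUNCT", "punct", 10),
]


def find_eos_index(tokens: list[Tok]) -> Optional[int]:
    """Return the index of the abbreviation that ends the first sentence, if any."""
    abbrevs = [t for t in tokens if t.text != "." and t.text.endswith(".")]
    roots = [t for t in tokens if t.dep == "ROOT"]
    if not abbrevs or len(roots) != 1:
        return None
    root = roots[0]

    # Exactly one other VERB or AUX token must have the ROOT token as its head.
    new_roots = [
        t for t in tokens
        if t.head == root.i and t.i != root.i and t.pos in ("VERB", "AUX")
    ]
    if len(new_roots) != 1:
        return None

    low, high = sorted((root.i, new_roots[0].i))
    # Keep abbreviations strictly between both tokens whose head does not come later.
    candidates = [t.i for t in abbrevs if low < t.i < high and t.head <= t.i]
    return max(candidates, default=None)


eos = find_eos_index(TOKENS)
print(eos)
if eos is not None:
    print([t.text for t in TOKENS[: eos + 1]])
    print([t.text for t in TOKENS[eos + 1 :]])
```

For this parse, the first 'U.K.' (token 6) is selected: 'L.C.' is not between the two verbs, and the second 'U.K.' has its head after it.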

1 reply

dfch Feb 12, 2026
Author

Here is an update to what I documented before. The models en_core_web_sm and en_core_web_trf give different results when they find sentence boundaries after abbreviations. The last number in each tuple is the number of sentences that spaCy actually detects.

For these example texts, en_core_web_sm (the smaller model) has fewer problems: it more often correctly identifies two sentences instead of only one.

`en_core_web_trf`

@parameterized.expand([
    ("abbrev_noun", "The store opens at 9 a.m. The store closes at 4 p.m.", 2),
    ("abbrev_cconj_noun", "The store opens at 9 a.m. And the store closes at 4 p.m.", 2),
    ("abbrev_pron", "The store opens at 9 a.m. It closes at 4 p.m.", 2),
    ("propn_pron_pl", "London is the capital of the U.K. They have the Tower of London.", 2),
    ("propn_pron_sg", "London is the capital of the U.K. It has the Tower of London.", 2),
    ("propn_propn", "London is the capital of the U.K. London has the Tower of London.", 2),
    ("propn_cconj_pron_but", "London is the capital of the U.K. But it is not the capital of Scotland.", 2),
    ("propn_cconj_pron_and", "London is the capital of the U.K. And, it has the Tower of London.", 2),
    ("propn_propn_uk", "London is the capital of the U.K. The U.K. has many inhabitants.", 2),
    ("propn_cconj_propn", "London is the capital of the U.K. And the U.K. has many inhabitants.", 2),
])
def test_sentences_correct(self, name, text, expected):

    doc = self.nlp(text)

    self.assertEqual(expected, len(list(doc.sents)), name)
    print([(token.text, token.pos_, token.dep_) for token in doc])

@parameterized.expand([
    ("abbrev_cconj_pron", "The store opens at 9 a.m. And it closes at 4 p.m.", 1),
    ("propn_cconj_pron1_pl", "London is the capital of the U.K. And they have the Tower of London.", 1),
    ("propn_cconj_pron1_sg", "London is the capital of the U.K. And it has the Tower of London.", 1),
    ("propn_adv1", "London is the capital of the U.K. Very often, it is rainy there.", 1),
    ("propn_adv2", "London is the capital of the U.K. Very often, it rains there.", 1),
    ("propn_det_propn", "London is the capital of the U.K. The U.K. is very rainy.", 1),
    ("propn_cconj_propn", "London is the capital of the U.K. And the U.K. is very rainy.", 1),
])
def test_sentence_incorrect(self, name, text, expected):

    doc = self.nlp(text)

    self.assertEqual(expected, len(list(doc.sents)), name)
    print([(token.text, token.pos_, token.dep_) for token in doc])

`en_core_web_sm`

@parameterized.expand([
    ("abbrev_noun", "The store opens at 9 a.m. The store closes at 4 p.m.", 2),
    ("abbrev_cconj_noun", "The store opens at 9 a.m. And the store closes at 4 p.m.", 1),
    ("abbrev_pron", "The store opens at 9 a.m. It closes at 4 p.m.", 1),
    ("propn_pron_pl", "London is the capital of the U.K. They have the Tower of London.", 2),
    ("propn_pron_sg", "London is the capital of the U.K. It has the Tower of London.", 2),
    ("propn_propn", "London is the capital of the U.K. London has the Tower of London.", 1),
    ("propn_cconj_pron_but", "London is the capital of the U.K. But it is not the capital of Scotland.", 2),
    ("propn_cconj_pron_and", "London is the capital of the U.K. And, it has the Tower of London.", 2),
    ("propn_propn_uk", "London is the capital of the U.K. The U.K. has many inhabitants.", 2),
    ("propn_cconj_propn", "London is the capital of the U.K. And the U.K. has many inhabitants.", 2),
])
def test_sentences_correct(self, name, text, expected):

    doc = self.nlp(text)

    self.assertEqual(expected, len(list(doc.sents)), name)
    print([(token.text, token.pos_, token.dep_) for token in doc])

@parameterized.expand([
    ("abbrev_cconj_pron", "The store opens at 9 a.m. And it closes at 4 p.m.", 2),
    ("propn_cconj_pron1_pl", "London is the capital of the U.K. And they have the Tower of London.", 2),
    ("propn_cconj_pron_sg", "London is the capital of the U.K. And it has the Tower of London.", 2),
    ("propn_adv1", "London is the capital of the U.K. Very often, it is rainy there.", 1),
    ("propn_adv2", "London is the capital of the U.K. Very often, it rains there.", 1),
    ("propn_det_propn", "London is the capital of the U.K. The U.K. is very rainy.", 2),
    ("propn_cconj_propn", "London is the capital of the U.K. And the U.K. is very rainy.", 2),
])
def test_sentence_incorrect(self, name, text, expected):

    doc = self.nlp(text)

    self.assertEqual(expected, len(list(doc.sents)), name)
    print([(token.text, token.pos_, token.dep_) for token in doc])
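The detected counts in the test parameters can be tallied per model. This is a small sketch in plain Python; the names `DETECTED` and `count_correct` are assumptions, and the numbers are transcribed in order from the tuples above (ten cases from the first list and seven from the second list for each model). Every example text contains two sentences.

```python
EXPECTED = 2  # every example text contains two sentences

# Detected sentence counts, in the order of the test parameters.
DETECTED = {
    "en_core_web_trf": [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    "en_core_web_sm": [2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2],
}


def count_correct(counts: list[int]) -> int:
    """Count the texts for which the detected number of sentences is the expected one."""
    return sum(1 for count in counts if count == EXPECTED)


for model, counts in DETECTED.items():
    print(f"{model}: {count_correct(counts)} of {len(counts)} texts split into {EXPECTED} sentences")
```

With these numbers, en_core_web_trf splits 10 of 17 texts correctly and en_core_web_sm splits 12 of 17, but the two models do not fail on the same texts.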