Some Heading
This is a simple HTML article.
Text can be nested in
HTML tags. Runs of whitespace
are collapsed, but we try to keep line breaks (\n). Sentences can
span multiple
lines.
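The whitespace handling described above could be sketched roughly as follows. This is only an illustration in Python: the function name collapse_whitespace is hypothetical and not taken from the article, and it assumes that "whitespace" means spaces and tabs inside a line.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs inside each line but keep the line breaks.

    Hypothetical helper for illustration; the real extraction code may differ.
    """
    cleaned_lines = []
    for line in text.split("\n"):
        # Replace every run of spaces, tabs, form feeds or vertical tabs
        # with a single space, then trim the ends of the line.
        cleaned_lines.append(re.sub(r"[ \t\f\v]+", " ", line).strip())
    # Re-join with "\n" so the original line structure survives.
    return "\n".join(cleaned_lines)
```

Working line by line is what keeps the \n characters intact; a single substitution over \s+ would swallow them as well.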
We should be able to split sentences that contain: 1. multiple dots
and punctuation 2. lists and other things ... but still be just
one sentence.
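One simple way to get this behaviour, shown here only as a sketch, is to break after sentence-ending punctuation only when the next word starts with a capital letter. The function name split_sentences and the capital-letter rule are assumptions made for this example, not the article's actual splitter.

```python
import re

# Break after ".", "!" or "?" only if whitespace and then an uppercase
# ASCII letter follow. This is a heuristic assumption, not a full splitter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> list[str]:
    """Split text into candidate sentences with a naive boundary rule."""
    # Join hard-wrapped lines first so a sentence spanning lines stays whole.
    flat = " ".join(text.split())
    if not flat:
        return []
    return _SENTENCE_BOUNDARY.split(flat)
```

With this rule, "1. multiple dots" and "... but" do not trigger a break because a lowercase letter follows. An abbreviation followed by a capitalised word (such as "e.g. POS") would still be split, which is the kind of artifact a later filtering step has to catch.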
However, this sentence splitting does not have to be perfect, e.g.
detecting some artifacts as sentences should still be OK.
With the TextRank algorithm and other plausibility checks
(e.g. POS checking with spaCy) we should be able to filter
these out.
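As a rough illustration of such a plausibility check, the sketch below uses a much cruder stand-in than TextRank or spaCy POS tagging: it keeps a candidate only if it has a minimum number of words and ends with terminal punctuation. The names looks_like_sentence and filter_candidates and the threshold of three words are assumptions made for this example.

```python
import re


def looks_like_sentence(candidate: str, min_words: int = 3) -> bool:
    """Crude plausibility check; a stand-in for POS-based checks, not a replacement."""
    # Count alphabetic word tokens (allowing an inner apostrophe, e.g. "let's").
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", candidate)
    ends_properly = candidate.rstrip().endswith((".", "!", "?"))
    return len(words) >= min_words and ends_properly


def filter_candidates(candidates: list[str]) -> list[str]:
    """Keep only the candidates that pass the plausibility check."""
    return [c for c in candidates if looks_like_sentence(c)]
```

A heading such as "Some Heading" or a stray fragment such as "e.g." would be dropped by this check, while a full sentence passes. A POS-based check could go further, for instance by requiring a verb.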
Here's a list of things, let's see how this splits: