The goal of text segmentation is to identify the boundaries at which the topic shifts within a document. In the legal domain, text segmentation makes it possible to identify and link the parts of documents that deal with a given topic, rather than linking documents as a whole.
Studies of discourse structure have shown that a document is usually a mixture of topics and sub-topics. A shift in topic can be detected through changes in patterns of vocabulary usage. The text units (sentences or paragraphs) that make up a segment have to be coherent, i.e., they must exhibit strong grammatical, lexical and semantic cohesion.
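As a minimal sketch of the vocabulary-shift idea, the following Python code scores each gap between adjacent sentences by the cosine similarity of the word counts on either side; a low score suggests a candidate boundary. This is a generic lexical-cohesion illustration, not the method described in this document, and the function names, the simple tokenizer, the window size and the threshold are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_scores(sentences, window=2):
    """Similarity between the `window` sentences before and after each gap."""
    bags = [Counter(tokenize(s)) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def candidate_boundaries(sentences, window=2, threshold=0.1):
    """Return indices i such that a boundary is proposed before sentence i."""
    scores = gap_scores(sentences, window)
    # scores[k] describes the gap before sentence k + 1 (assumed threshold).
    return [k + 1 for k, score in enumerate(scores) if score < threshold]
```

For a toy input of two sentences about a tenancy contract followed by two sentences about personal data, the only gap with no shared vocabulary is the one before the third sentence, so that gap is the single candidate boundary. In practice, stop-word removal and stemming would usually be applied before counting.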
The University of Luxembourg, together with the University of Turin, has developed an unsupervised approach to text segmentation that uses topics obtained from Latent Dirichlet Allocation (LDA) topic modeling of the documents (Adebayo et al., 2016). The method also incorporates entity coherence techniques, which allow heuristic rules to be introduced for the boundary decision. Entity mapping is performed across a window of words in order to track the transitions of entities between sentences. The information obtained is used to support the LDA-based boundary detection and to adjust the proposed boundaries. The reported results indicate that the approach outperforms state-of-the-art systems.
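The way entity information can support a topic-based boundary may be sketched as follows. The code assumes that a candidate boundary has already been proposed by some topic model and that each sentence has been reduced to a set of entity mentions; it then moves the boundary to the nearby gap where the fewest entities carry over. The overlap measure, the search radius and the names `entity_overlap` and `adjust_boundary` are hypothetical and are not the heuristic rules of Adebayo et al. (2016).

```python
def entity_overlap(entities, gap, window=2):
    """Jaccard overlap of the entities in the windows before and after a gap.

    `entities` is a list with one set of entity strings per sentence;
    `gap` is the index of the sentence that would start the new segment.
    """
    left = set().union(*entities[max(0, gap - window):gap])
    right = set().union(*entities[gap:gap + window])
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)


def adjust_boundary(entities, candidate, window=2, search=1):
    """Shift a candidate boundary to the nearby gap with the least overlap.

    Gaps within `search` positions of the candidate are compared (an
    assumed heuristic); ties are broken in favour of the gap closest to
    the original candidate.
    """
    first = max(1, candidate - search)
    last = min(len(entities) - 1, candidate + search)
    return min(
        range(first, last + 1),
        key=lambda g: (entity_overlap(entities, g, window), abs(g - candidate)),
    )
```

For example, if the first three sentences all mention a landlord and the last two mention only data-related entities, a candidate boundary placed before the third sentence is moved to the gap before the fourth, where no entity is shared across the gap.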