Abstract
We present NUPunkt and CharBoundary, two sentence boundary detection libraries optimized for high-precision, high-throughput processing of legal text in large-scale applications such as due diligence, e-discovery, and legal research. These libraries address the critical challenges posed by legal documents containing specialized citations, abbreviations, and complex sentence structures that confound general-purpose sentence boundary detectors. Our experimental evaluation on five diverse legal datasets comprising over 25,000 documents and 197,000 annotated sentence boundaries demonstrates that NUPunkt achieves 91.1% precision while processing 10 million characters per second with modest memory requirements (432 MB). CharBoundary models offer balanced and adjustable precision-recall tradeoffs, with the large model achieving the highest F1 score (0.782) among all tested methods. Notably, NUPunkt provides a 29-32% precision improvement over general-purpose tools while maintaining exceptional throughput, processing multi-million-document collections in minutes rather than hours. Both libraries run efficiently on standard CPU hardware without requiring specialized accelerators. NUPunkt is implemented in pure Python with zero external dependencies, while CharBoundary depends only on scikit-learn, with optional ONNX Runtime integration for improved performance. Both libraries are available under the MIT license, can be installed via PyPI, and can be interactively tested at this https URL. They also address critical precision issues in retrieval-augmented generation systems by preserving coherent legal concepts across sentence boundaries: each percentage-point improvement in precision yields exponentially greater reductions in context fragmentation, creating cascading benefits throughout retrieval pipelines and significantly enhancing downstream reasoning quality.
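For intuition about the modeling approach, the following is a minimal sketch of character-level sentence boundary detection with a scikit-learn classifier, in the spirit the abstract attributes to CharBoundary: candidate boundary characters are classified from a fixed-width character window. The window size, feature encoding, model choice, and toy legal snippets below are illustrative assumptions and do not reflect the library's actual API, features, or training data.

```python
# Illustrative sketch only: character-window boundary classification with
# scikit-learn, in the spirit of CharBoundary. All choices below (window size,
# ordinal features, toy data) are assumptions, NOT the library's implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

CANDIDATES = {".", "!", "?"}   # characters at which a sentence might end
WINDOW = 5                     # context characters on each side (assumed)


def featurize(text: str, i: int, window: int = WINDOW) -> list[int]:
    """Encode a fixed-width character window around position i as ordinals."""
    left = text[max(0, i - window):i].rjust(window)
    right = text[i + 1:i + 1 + window].ljust(window)
    return [ord(c) for c in left + text[i] + right]


def candidate_positions(text: str) -> list[int]:
    """Positions of characters that could terminate a sentence."""
    return [i for i, c in enumerate(text) if c in CANDIDATES]


# Toy training data: documents given as lists of gold-standard sentences,
# so true boundary offsets can be computed rather than hand-labeled.
train_docs = [
    ["See 28 U.S.C. § 1331.", "The court had jurisdiction."],
    ["Plaintiff cites Smith v. Jones.", "Defendant disagrees."],
]

X, y = [], []
for sentences in train_docs:
    text = " ".join(sentences)
    gold, pos = set(), 0
    for sent in sentences:
        pos = text.index(sent, pos) + len(sent)
        gold.add(pos - 1)              # offset of each sentence-final character
    for i in candidate_positions(text):
        X.append(featurize(text, i))
        y.append(1 if i in gold else 0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(np.array(X), y)


def split_sentences(text: str) -> list[str]:
    """Split text at candidate characters the classifier labels as boundaries."""
    out, start = [], 0
    for i in candidate_positions(text):
        if clf.predict([featurize(text, i)])[0] == 1:
            out.append(text[start:i + 1].strip())
            start = i + 1
    if start < len(text):
        out.append(text[start:].strip())
    return out


print(split_sentences("The case cites 42 U.S.C. § 1983. Summary judgment was denied."))
```

The actual libraries are installable from PyPI as stated in the abstract; this sketch only conveys the character-window classification idea behind such a pipeline, not their interfaces or trained models.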
URL
https://arxiv.org/abs/2504.04131