Splitting Performance

Performance changelog for the Splitting processor. This section covers semantic splitting, layout detection, multi-page handling, language support, and performance optimizations for document preprocessing.

What’s Included

Semantic Splitting: Section boundary detection based on natural language processing (NLP)
Layout Detection: Visual structure analysis for mixed-format documents
Multi-Page Handling: Optimization for large documents and batches
Language Support: Improvements for multi-language and mixed-language documents

Recent Updates

2024-12-06 — Semantic Splitter v2

Deployed an updated semantic splitting model with improved boundary detection for contracts and legal documents. Section accuracy improved from 91% to 96% on the legal benchmark.

Impact: Accuracy

2024-11-24 — Vector Cache Optimization

Implemented vector cache warmup, which cuts average splitting latency by a factor of 2.3 for repeated document structures. Cache invalidation is configurable per workflow.

Impact: Latency
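
The changelog does not say how the cache is keyed or invalidated. As a rough illustration only, the sketch below assumes entries are keyed by a structure fingerprint, preloaded at startup (warmup), and expired by a per-workflow time-to-live. The class name, the fingerprint strings, and the TTL policy are all hypothetical.

```python
import time


class StructureCache:
    """Hypothetical cache keyed by a document-structure fingerprint."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds      # assumed per-workflow invalidation setting
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # fingerprint -> (stored_at, value)

    def put(self, fingerprint, value):
        self._entries[fingerprint] = (self._clock(), value)

    def warm(self, known_entries):
        """Preload entries for structures seen before (the warmup step)."""
        for fingerprint, value in known_entries.items():
            self.put(fingerprint, value)

    def get(self, fingerprint):
        """Return the cached value, or None if missing or older than the TTL."""
        entry = self._entries.get(fingerprint)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[fingerprint]  # expired: drop and report a miss
            return None
        return value
```

A workflow that sees the same layouts repeatedly would call warm() once, then try get() before running the full splitter and fall back to it on a miss.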

2024-11-10 — Scanned PDF Robustness

Improved handling of scanned PDFs with skewed pages, variable DPI (dots per inch), and watermarks. The failure rate on degraded scans dropped from 8% to 2%.

Impact: Reliability

2024-10-28 — Multi-Language Boundary Detection

Added support for mixed-language documents with automatic language detection per section. Supports 14 European languages plus Japanese and Chinese.

Impact: Accuracy

2024-10-14 — Per-Segment Telemetry

Added confidence scores and processing time metrics per segment in workflow analytics. Enables identification of problematic document sections.

Impact: UX
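
One way to use per-segment metrics is to flag segments with low confidence or unusually long processing time. The sketch below is a minimal illustration: the record fields (segment_id, confidence, processing_ms) and the thresholds are assumptions, not the actual analytics schema.

```python
from dataclasses import dataclass


@dataclass
class SegmentMetrics:
    """Hypothetical shape of one per-segment telemetry record."""
    segment_id: str
    confidence: float      # assumed range 0.0 to 1.0
    processing_ms: float


def flag_segments(metrics, min_confidence=0.8, max_ms=500.0):
    """Return IDs of segments that are low-confidence or slow to process."""
    return [
        m.segment_id
        for m in metrics
        if m.confidence < min_confidence or m.processing_ms > max_ms
    ]


records = [
    SegmentMetrics("s1", 0.97, 120.0),
    SegmentMetrics("s2", 0.55, 140.0),   # low confidence
    SegmentMetrics("s3", 0.91, 900.0),   # slow
]
flagged = flag_segments(records)
```

Here flagged contains "s2" and "s3", the two segments worth inspecting first.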

2024-09-30 — Retry Logic Enhancement

Improved automatic retry for PDF parsing failures under heavy load. Maximum retry attempts increased from 2 to 5, with exponential backoff between attempts.

Impact: Reliability
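
The retry pattern described above can be sketched as follows. This is a generic illustration of exponential backoff, not the processor's implementation: the exception type, the delay values, and the reading of "5" as five retries after the first attempt are assumptions.

```python
import time


class PdfParseError(Exception):
    """Hypothetical error raised when PDF parsing fails."""


def backoff_delays(max_retries=5, base_delay=0.5, max_delay=8.0):
    """Delay before each retry: doubles every time, capped at max_delay."""
    return [min(max_delay, base_delay * 2 ** i) for i in range(max_retries)]


def parse_with_retry(parse, max_retries=5, base_delay=0.5, max_delay=8.0,
                     sleep=time.sleep):
    """Call parse(); on PdfParseError, wait and retry up to max_retries times."""
    delays = backoff_delays(max_retries, base_delay, max_delay)
    for attempt in range(max_retries + 1):
        try:
            return parse()
        except PdfParseError:
            if attempt == max_retries:
                raise              # retries exhausted: surface the last error
            sleep(delays[attempt])
```

Production retry loops often add random jitter to each delay so that many clients do not retry at the same instant; it is omitted here to keep the delays predictable.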

2024-09-18 — Table Preservation Mode

Added a mode that preserves table boundaries during splitting. Tables spanning section breaks are kept intact rather than split mid-row.

Impact: Accuracy
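
A minimal sketch of the idea, assuming section breaks and tables are both expressed as line indices and that a break landing inside a table is pushed to the end of that table. The function name, the half-open span convention, and the choice to move the break after (rather than before) the table are all assumptions.

```python
def preserve_tables(boundaries, table_spans):
    """Move any section break that falls inside a table to the table's end.

    boundaries:  line indices where a new section would start
    table_spans: non-overlapping (start, end) pairs, end exclusive
    """
    adjusted = set()
    for boundary in boundaries:
        for start, end in table_spans:
            if start < boundary < end:
                boundary = end   # push the break past the table
                break
        adjusted.add(boundary)   # a set merges breaks that now coincide
    return sorted(adjusted)


breaks = preserve_tables([10, 25, 40], [(20, 30)])
```

With a table on lines 20 to 29, the break proposed at line 25 moves to line 30, so breaks is [10, 30, 40]. A break exactly at the table's first line is left alone, since it does not cut through any row.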

2024-09-04 — Parallel Page Processing

Enabled parallel page processing for documents longer than 20 pages. Reduces total splitting time by up to 60% for large documents.

Impact: Latency
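
The threshold behaviour can be illustrated with Python's standard concurrent.futures module. This is a sketch under assumptions: analyze_page is a hypothetical stand-in for the real per-page work, and the worker count is arbitrary. Only the 20-page threshold comes from the entry above.

```python
from concurrent.futures import ThreadPoolExecutor

PARALLEL_THRESHOLD = 20  # pages, per the changelog entry


def analyze_page(page_text):
    """Hypothetical per-page step; here it only counts words."""
    return len(page_text.split())


def analyze_document(pages, max_workers=4):
    """Process pages sequentially, or in parallel above the threshold."""
    if len(pages) <= PARALLEL_THRESHOLD:
        return [analyze_page(page) for page in pages]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so page order is kept.
        return list(pool.map(analyze_page, pages))
```

Threads suit I/O-bound page work such as fetching or OCR service calls. For CPU-bound work in CPython, a ProcessPoolExecutor would be the more likely choice.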

Compatibility Notes

Semantic Splitter v2 is the default for new workflows; v1 remains available via a parameter
The vector cache requires a minimum memory allocation of 2 GB
Multi-language detection adds ~50 ms of overhead per document

Roadmap (Next Quarter)

Custom section delimiter support (regex-based)
Streaming output for documents over 200 pages
Template-based splitting for standardized document formats