Splitting Performance
Performance changelog for the Splitting processor. This section covers semantic splitting, layout detection, multi-page handling, and performance optimizations for document preprocessing.
What’s Included
- Semantic Splitting: Section boundary detection based on Natural Language Processing (NLP)
- Layout Detection: Visual structure analysis for mixed-format documents
- Multi-Page Handling: Optimization for large documents and batches
- Language Support: Improvements for multi-language and mixed-language documents
Recent Updates
2024-12-06 — Semantic Splitter v2
Deployed an updated semantic splitting model with improved boundary detection for contracts and legal documents. Section accuracy improved from 91% to 96% on the legal benchmark.
- Impact: Accuracy
2024-11-24 — Vector Cache Optimization
Implemented vector cache warmup, cutting average splitting latency by a factor of 2.3 for repeated document structures. Cache invalidation is configurable per workflow.
- Impact: Latency
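The idea behind warmup and per-workflow invalidation can be illustrated with a minimal sketch. Everything named here (`SplitCache`, `structure_key`, the per-page layout labels) is hypothetical and chosen for illustration; the sketch assumes the cache is keyed on a fingerprint of the document structure, which the processor's actual implementation may not do.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def structure_key(page_layouts: List[str]) -> str:
    """Fingerprint a document by its layout labels (assumed: one label per page)."""
    return hashlib.sha256("|".join(page_layouts).encode("utf-8")).hexdigest()


class SplitCache:
    """Caches split boundaries per (workflow, structure fingerprint) pair."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], List[int]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self,
        workflow: str,
        page_layouts: List[str],
        compute: Callable[[List[str]], List[int]],
    ) -> List[int]:
        key = (workflow, structure_key(page_layouts))
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: run the (expensive) splitter and remember the result.
        self.misses += 1
        boundaries = compute(page_layouts)
        self._store[key] = boundaries
        return boundaries

    def warm_up(
        self,
        workflow: str,
        known_layouts: List[List[str]],
        compute: Callable[[List[str]], List[int]],
    ) -> None:
        """Pre-populate the cache with structures expected to recur."""
        for layouts in known_layouts:
            self.get_or_compute(workflow, layouts, compute)

    def invalidate(self, workflow: str) -> None:
        """Drop every cached entry that belongs to one workflow."""
        for key in [k for k in self._store if k[0] == workflow]:
            del self._store[key]


def find_title_pages(layouts: List[str]) -> List[int]:
    """Stand-in splitter: a new section starts on every 'title' page."""
    return [i for i, label in enumerate(layouts) if label == "title"]
```

After `warm_up`, the first real request for a known structure is served from the cache instead of recomputing boundaries, which is where a latency gain for repeated structures would come from.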
2024-11-10 — Scanned PDF Robustness
Enhanced handling of scanned PDFs with skewed pages, variable DPI (Dots Per Inch), and watermarks. Failure rate on degraded scans reduced from 8% to 2%.
- Impact: Reliability
2024-10-28 — Multi-Language Boundary Detection
Added support for mixed-language documents with automatic language detection per section. Supports 14 European languages plus Japanese and Chinese.
- Impact: Accuracy
2024-10-14 — Per-Segment Telemetry
Added confidence scores and processing time metrics per segment in workflow analytics. Enables identification of problematic document sections.
- Impact: UX
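As a rough illustration of how per-segment metrics can surface problematic sections, the sketch below flags segments with low confidence or long processing time. The record shape (`SegmentMetric`), its field names, and the thresholds are assumptions made for this example, not the analytics schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentMetric:
    segment_id: str
    confidence: float      # assumed range: 0.0 to 1.0
    processing_ms: float   # time spent splitting this segment


def flag_problem_segments(
    metrics: List[SegmentMetric],
    min_confidence: float = 0.8,   # illustrative threshold
    max_ms: float = 500.0,         # illustrative threshold
) -> List[str]:
    """Return IDs of segments that are low-confidence or slow to process."""
    return [
        m.segment_id
        for m in metrics
        if m.confidence < min_confidence or m.processing_ms > max_ms
    ]


sample = [
    SegmentMetric("s1", 0.95, 120.0),
    SegmentMetric("s2", 0.62, 140.0),   # low confidence
    SegmentMetric("s3", 0.91, 900.0),   # slow
]
flagged = flag_problem_segments(sample)  # ["s2", "s3"]
```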
2024-09-30 — Retry Logic Enhancement
Improved automatic retry for PDF parsing failures under heavy load. Retry attempts increased from 2 to 5 with exponential backoff.
- Impact: Reliability
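The retry behavior described above follows the general exponential backoff pattern, sketched here. The names `ParseError` and `retry_with_backoff`, the 0.5-second base delay, and the reading of "5" as retries after the initial attempt are all assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ParseError(Exception):
    """Hypothetical error raised when PDF parsing fails."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on ParseError with doubling delays.

    With the defaults, the waits between tries are 0.5, 1, 2, 4, and 8 seconds.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ParseError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last failure
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Production implementations commonly add random jitter to each delay so that many workers retrying under heavy load do not all retry at the same instant.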
2024-09-18 — Table Preservation Mode
Added mode to preserve table boundaries during splitting. Tables spanning section breaks are kept intact rather than split mid-row.
- Impact: Accuracy
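One way to picture table preservation is as a post-processing step on candidate boundaries: any boundary that lands inside a table is moved to just after it. The sketch below assumes boundaries and tables are expressed as line indices and that table ranges do not overlap; the function name and data shapes are hypothetical.

```python
from typing import List, Tuple


def preserve_tables(
    boundaries: List[int],
    tables: List[Tuple[int, int]],
) -> List[int]:
    """Shift boundaries so that no table is split.

    A boundary `b` means "a new section starts at line b". Each table covers
    lines `start..end` inclusive. A boundary with start < b <= end would cut
    the table, so it is moved to end + 1. A boundary exactly at `start` is
    kept, because the table then simply opens the new section.
    """
    adjusted = set()
    for b in boundaries:
        for start, end in tables:
            if start < b <= end:
                b = end + 1
                break
        adjusted.add(b)
    # A set removes duplicates created when two boundaries hit the same table.
    return sorted(adjusted)


# The boundary at line 12 falls inside the table on lines 10..15 and moves to 16.
result = preserve_tables([5, 12, 30], [(10, 15)])  # [5, 16, 30]
```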
2024-09-04 — Parallel Page Processing
Enabled parallel processing for documents longer than 20 pages. Reduces total splitting time by up to 60% for large documents.
- Impact: Latency
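A threshold-gated parallel path of this kind might look like the following sketch. The function name, the worker count, and the use of a thread pool are assumptions; only the 20-page threshold comes from the entry above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

P = TypeVar("P")
R = TypeVar("R")

PARALLEL_THRESHOLD = 20  # documents longer than this are processed in parallel


def process_pages(
    pages: List[P],
    process_page: Callable[[P], R],
    max_workers: int = 4,  # illustrative default
) -> List[R]:
    """Process pages sequentially for short documents, concurrently otherwise."""
    if len(pages) <= PARALLEL_THRESHOLD:
        return [process_page(page) for page in pages]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so page order is kept.
        return list(pool.map(process_page, pages))
```

Threads suit I/O-bound page work such as fetching or OCR service calls. For CPU-bound parsing in Python, a `ProcessPoolExecutor` is usually the better fit because threads share one interpreter lock.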
Compatibility Notes
- Semantic Splitter v2 is the default for new workflows; v1 remains available via a parameter
- Vector cache requires a minimum memory allocation of 2 GB
- Multi-language detection adds ~50 ms of overhead per document
Roadmap (Next Quarter)
- Custom section delimiter support (regex-based)
- Streaming output for documents over 200 pages
- Template-based splitting for standardized document formats
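Custom delimiter support is not yet released, so the following is only a sketch of what regex-based section splitting generally looks like. The function name and the sample pattern are hypothetical and do not describe the planned interface.

```python
import re
from typing import List


def split_on_delimiter(text: str, pattern: str) -> List[str]:
    """Split text so that each match of `pattern` starts a new section."""
    starts = [m.start() for m in re.finditer(pattern, text, flags=re.MULTILINE)]
    # Keep any text before the first delimiter as its own leading section.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    # Skip sections that contain only whitespace.
    return [text[s:e] for s, e in zip(starts, ends) if text[s:e].strip()]


document = "Preamble\nArticle 1\nfoo\nArticle 2\nbar\n"
sections = split_on_delimiter(document, r"^Article \d+")
# sections == ["Preamble\n", "Article 1\nfoo\n", "Article 2\nbar\n"]
```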