Performance

Benchmarks

Test       Size    Rows    Time    Rows/s  MB/s  RSS
Stream     10 MB   9,010   0.043s  211 K   234   22 MB
Stream     50 MB   45,328  0.223s  203 K   224   45 MB
Stream     100 MB  90,384  0.418s  216 K   239   75 MB
To list    10 MB   9,010   0.052s  174 K   192   32 MB
To list    50 MB   45,328  0.249s  182 K   201   98 MB
To list    100 MB  90,384  0.478s  189 K   209   181 MB
Pipeline   10 MB   9,010   0.060s  150 K   166   32 MB
Pipeline   50 MB   45,328  0.295s  154 K   169   96 MB
Pipeline   100 MB  90,384  0.579s  156 K   173   176 MB
DataFrame  10 MB   9,010   0.320s  28 K    31    86 MB
DataFrame  50 MB   45,328  0.538s  84 K    93    152 MB
DataFrame  100 MB  90,384  0.829s  109 K   121   234 MB

pandas is imported lazily, so the RSS and time figures for DataFrame mode include pandas overhead.
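As a rough illustration of how figures like these could be collected, here is a minimal sketch of a timing harness. It is not the harness used for the table above: `benchmark` and the `parse` callable are hypothetical names, and `parse` is only assumed to return an iterable of rows. The `resource` module is Unix-only.

```python
import resource
import sys
import time


def benchmark(parse, path, size_bytes):
    """Time one pass over parse(path) and report throughput and peak RSS.

    `parse` is a hypothetical callable that returns an iterable of rows.
    """
    start = time.perf_counter()
    rows = sum(1 for _ in parse(path))
    # Guard against a zero interval on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)

    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_mb = peak / 1e6 if sys.platform == "darwin" else peak / 1e3

    return {
        "rows": rows,
        "seconds": elapsed,
        "rows_per_s": rows / elapsed,
        "mb_per_s": size_bytes / 1e6 / elapsed,
        "peak_rss_mb": peak_mb,
    }
```

Because `ru_maxrss` is a process-wide high-water mark, each mode would need to run in a fresh process for its RSS figure to be meaningful.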

Where time goes

  • I/O: ~15% of parse time (kernel read syscalls)
  • XML parsing: ~40% (quick-xml)
  • Dict construction: ~30% (PyDict creation, key/value allocation)
  • Python overhead: ~15% (yield, iteration boundary)

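XML parsing is the largest single item, but dict construction and Python overhead together account for about 45% of the time. The dominant cost is therefore on the Python side: allocating a Python str object for each field value and constructing a dict for each row. This overhead is inherent to the Python/Rust boundary.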
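To get an order-of-magnitude feel for the cost of building row dicts, a micro-benchmark along the following lines may help. It is only a sketch: it builds dicts in pure Python rather than through the Rust-side PyDict path, and the helper names, the 20-field row shape, and the value format are all assumptions made for illustration.

```python
import timeit


def build_rows(n_rows, n_fields):
    """Build n_rows dicts, each with n_fields freshly allocated str values."""
    keys = [f"field_{j}" for j in range(n_fields)]
    return [{k: f"value-{i}-{k}" for k in keys} for i in range(n_rows)]


def dict_cost_per_row(n_rows=10_000, n_fields=20, repeat=5):
    """Return the best-of-`repeat` time in seconds to build one row dict."""
    timings = timeit.repeat(
        lambda: build_rows(n_rows, n_fields), number=1, repeat=repeat
    )
    return min(timings) / n_rows


if __name__ == "__main__":
    print(f"{dict_cost_per_row() * 1e6:.2f} microseconds per row")
```

Multiplying the per-row figure by a file's row count gives a rough estimate of the time spent on allocation alone, which can then be compared against the measured parse time.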

Memory breakdown

┌──────────────────┐
│  I/O buffer      │  2 MB  (reused Vec<u8>)
│  Row dict        │  ~2 KB per row (temporary)
│  RSS floor       │  15 MB (Python interpreter, no pandas)
│  Pipeline stages │  ~20 MB (stage objects, fusion)
└──────────────────┘

In Stream mode, RSS grows with file size: 22 MB for a 10 MB file, 75 MB for a 100 MB file.

Parallel mode

For files >200 MB, parallel mode distributes batches across N workers:

Workers  Time (500 MB file)  Speedup
1        3.2 s               1.0×
2        1.9 s               1.7×
4        1.1 s               2.9×
8        0.9 s               3.6×

Returns diminish past 4 workers because of IPC overhead: doubling from 4 to 8 workers raises the speedup only from 2.9× to 3.6×.
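The speedup column, and the per-worker efficiency it implies, can be recomputed from the timings in the table above. The sketch below uses only those four numbers; the `scaling` helper is a name introduced here for illustration.

```python
# Timings in seconds for the 500 MB file, copied from the table above.
timings = {1: 3.2, 2: 1.9, 4: 1.1, 8: 0.9}


def scaling(timings):
    """Return {workers: (speedup, efficiency)} relative to the 1-worker run."""
    base = timings[1]
    return {n: (base / t, base / t / n) for n, t in timings.items()}


for n, (speedup, efficiency) in scaling(timings).items():
    print(f"{n} workers: {speedup:.1f}x speedup, {efficiency:.0%} efficiency")
```

Efficiency here is speedup divided by worker count. It falls from about 73% at 4 workers to about 44% at 8.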

Bottleneck analysis

  • Disk I/O: Not the bottleneck for files <1 GB on modern NVMe SSDs
  • Single-threaded Python: Dict allocation is the bottleneck
  • XML parsing: quick-xml is fast; the bottleneck is Python dict construction
  • Rust allocator: Uses system allocator with stable RSS

Recommendations

File size      Strategy
< 50 MB        Sequential (no parallel overhead)
50–200 MB      Sequential or parallel(2)
200 MB–1 GB    parallel(4)
> 1 GB         parallel(N_CPU), chunksize=20000
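These thresholds could be encoded in a small helper, sketched below. This is not the library's API: `choose_strategy`, its `prefer_parallel` flag, and the returned dict keys are hypothetical, and the sketch assumes the sizes in the table are decimal megabytes. Only the thresholds, worker counts, and chunksize come from the table.

```python
import os

MB = 1_000_000  # assumption: the table's sizes are decimal megabytes
GB = 1_000 * MB


def choose_strategy(size_bytes, n_cpu=None, prefer_parallel=False):
    """Map a file size to the strategy suggested in the table above.

    Returns a plain dict; the keys are hypothetical, not the library's API.
    """
    if n_cpu is None:
        n_cpu = os.cpu_count() or 1
    if size_bytes < 50 * MB:
        return {"mode": "sequential"}
    if size_bytes <= 200 * MB:
        # The table allows either option in this range.
        if prefer_parallel:
            return {"mode": "parallel", "workers": 2}
        return {"mode": "sequential"}
    if size_bytes <= GB:
        return {"mode": "parallel", "workers": 4}
    return {"mode": "parallel", "workers": n_cpu, "chunksize": 20_000}
```

For example, `choose_strategy(os.path.getsize(path))` would pick a configuration from the size on disk, which the caller then translates into whatever call the library actually exposes.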