Basic Parsing

Opening a file

CrystalXMLSource accepts a file path (string or pathlib.Path):

from crxml import CrystalXMLSource

# path string
src = CrystalXMLSource("report.xml")

# pathlib.Path
from pathlib import Path
src = CrystalXMLSource(Path("report.xml"))

Parameters

Param     Type         Default   Description
source    str | Path   (none)    Path to CR XML file
row_tag   str          "Row"     XML tag for each record row

The row_tag parameter lets you target a different repeating element if your CR XML uses a non-standard tag name.

Iteration

CrystalXMLSource is iterable. Each row is a dict[str, str]:

for row in CrystalXMLSource("report.xml"):
    print(row["{Report.InvoiceNo}"], row["{Report.Amount}"])

Keys are the FieldName attribute values from the CR XML (e.g. {Report.InvoiceNo}). Values are the raw text of the first <FormattedValue> or <Value> child element.
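The key/value rule can be illustrated with the standard library alone. This is a minimal sketch, not crxml's implementation: the helper name row_to_dict and the sample XML are hypothetical, and it assumes that "first" means whichever of <FormattedValue> or <Value> appears first among a field's children.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample row in attribute style.
SAMPLE = """
<Row>
  <Field FieldName="{Report.InvoiceNo}"><FormattedValue>INV-001</FormattedValue><Value>1</Value></Field>
  <Field FieldName="{Report.Amount}"><Value>123.45</Value></Field>
</Row>
""".strip()


def row_to_dict(row):
    """Hypothetical helper: map FieldName -> text of the first value child."""
    result = {}
    for field in row.iter("Field"):
        key = field.get("FieldName")
        if key is None:
            continue  # this sketch only handles attribute-style fields
        for child in field:
            # Assumption: take whichever value element comes first.
            if child.tag in ("FormattedValue", "Value"):
                result[key] = child.text or ""
                break
    return result


row = row_to_dict(ET.fromstring(SAMPLE))
print(row["{Report.InvoiceNo}"], row["{Report.Amount}"])
```

Both values come back as strings, matching the dict[str, str] row type; any numeric conversion is left to the caller.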

Schema inspection

Call .schema() to discover fields without consuming the stream:

src = CrystalXMLSource("report.xml")
fields = src.schema()  # list of (key, sample_value) tuples

To build the schema, the source reads rows internally and caches them, so the rows consumed during inspection are still yielded when you iterate afterwards. This makes .schema() safe to call before building a pipeline.
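One way to get this "peek without losing rows" behavior is to buffer whatever the inspection step reads and replay it on iteration. The class below is a hypothetical sketch of that pattern, not the library's code; the name PeekableRows and the sample_size parameter are assumptions made for illustration.

```python
from itertools import chain, islice


class PeekableRows:
    """Hypothetical wrapper: inspect leading rows without losing them."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self._cache = []  # rows already pulled from the stream by schema()

    def schema(self, sample_size=1):
        # Pull only as many rows as are still missing from the cache.
        needed = max(sample_size - len(self._cache), 0)
        self._cache.extend(islice(self._rows, needed))
        fields = {}
        for row in self._cache:
            for key, value in row.items():
                fields.setdefault(key, value)  # keep the first sample seen
        return list(fields.items())

    def __iter__(self):
        # Replay the cached rows first, then continue with the live stream.
        cached, self._cache = self._cache, []
        return chain(cached, self._rows)


rows = PeekableRows([{"a": "1"}, {"a": "2"}])
fields = rows.schema()
all_rows = list(rows)
print(fields, all_rows)
```

The first row is consumed by schema() but still appears in all_rows, because iteration starts from the cache.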

Memory model

The parser streams the file instead of loading it whole. The Rust backend reuses internal buffers across rows and never materializes the full document. RSS still grows with file size, but much more slowly (about 22 MB for a 10 MB file, 75 MB for a 100 MB file), so for large inputs it stays below the file size. pandas is imported lazily, so its memory cost is paid only when to_dataframe is called.
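For readers unfamiliar with streaming XML parsing, the idea can be sketched in pure Python with xml.etree.ElementTree.iterparse. This is only an analogy for the approach described above, not the Rust backend; the function name stream_rows and the sample document are hypothetical, and the sketch reads attribute-style fields with a <Value> child only.

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source, row_tag="Row"):
    """Yield one dict per row element, releasing each row after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield {
                field.get("FieldName"): field.findtext("Value", default="")
                for field in elem.iter("Field")
            }
            # Drop the row's children so they can be garbage collected.
            # The root still keeps the emptied row shells, so memory is
            # reduced rather than strictly constant in this sketch.
            elem.clear()


# Hypothetical two-row document.
data = (
    b"<Report>"
    b"<Row><Field FieldName='{Report.Amount}'><Value>1.00</Value></Field></Row>"
    b"<Row><Field FieldName='{Report.Amount}'><Value>2.50</Value></Field></Row>"
    b"</Report>"
)
streamed = list(stream_rows(io.BytesIO(data)))
print(streamed)
```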

CR XML layout detection

Crystal Reports XML stores field values in two styles, and a single file may mix them:

  • Attribute style: <Field FieldName="{Report.Amount}"><Value>123.45</Value></Field>
  • Element style: <Field><FieldName>{Report.Amount}</FieldName><Value>123.45</Value></Field>
  • Mixed: some fields use attributes, others use child elements

The parser detects the style of each field automatically; no configuration is needed.
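Per-field detection of this kind can be sketched as "try the attribute, then fall back to the child element". The helpers field_name and parse_row below are hypothetical and use only the standard library; how crxml actually performs the detection is not documented here.

```python
import xml.etree.ElementTree as ET


def field_name(field):
    """Hypothetical helper: attribute style first, then element style."""
    name = field.get("FieldName")
    if name is None:
        name = field.findtext("FieldName")  # None if the child is absent too
    return name


def parse_row(row):
    """Build a dict from a row whose fields may use either style."""
    out = {}
    for field in row.findall("Field"):
        name = field_name(field)
        if name is not None:
            out[name] = field.findtext("Value", default="")
    return out


# Hypothetical mixed-style row: one field per style.
MIXED = ET.fromstring(
    "<Row>"
    '<Field FieldName="{Report.Amount}"><Value>123.45</Value></Field>'
    "<Field><FieldName>{Report.InvoiceNo}</FieldName><Value>INV-001</Value></Field>"
    "</Row>"
)
parsed = parse_row(MIXED)
print(parsed)
```

Because the check runs per field rather than per file, the mixed case needs no special handling.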