Getting Started
This guide walks through a complete round-trip: installation, first parse, schema inspection, a simple pipeline, and DataFrame conversion.
Install
See Installation for details on building from source and platform support.
Your first source
Create a small Crystal Reports XML file and point CrystalXMLSource at it:
Each row is a dict[str, str]. The keys are field names from the CR XML
(e.g. {Report.FieldName}) and the values are the raw text content.
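To make the row shape concrete, here is a hand-written dictionary standing in for one parsed row. The field names and values are made up for illustration and do not come from a real report. The point is only that every value arrives as text and needs explicit conversion before arithmetic.

```python
# Hypothetical row, shaped like the dict[str, str] described above.
row: dict[str, str] = {
    "{Report.InvoiceNo}": "1001",
    "{Report.Customer}": "Acme Ltd",
    "{Report.Amount}": "249.90",
}

# Values are raw text, so convert before doing arithmetic.
amount = float(row["{Report.Amount}"])
amount_with_fee = amount + 10.0

print(f"{amount_with_fee:.2f}")
```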
Inspect the schema
Use .schema() to see the fields without consuming the stream:
from crxml import CrystalXMLSource
src = CrystalXMLSource("report.xml")
fields = src.schema()  # list of (key, sample_value) tuples
for key, sample in fields:
    print(f"{key}: {sample!r}")
This is useful for building dynamic pipelines.
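As one possible illustration of a dynamic pipeline, the sketch below derives a rename mapping and a cast mapping from a list of (key, sample_value) tuples. The `fields` list is a hard-coded stand-in for the output of `.schema()`, and `short_name` and `looks_numeric` are hypothetical helpers, not part of crxml.

```python
# Stand-in for the list returned by src.schema(): (key, sample_value) tuples.
# These field names and sample values are hypothetical.
fields = [
    ("{Report.InvoiceNo}", "1001"),
    ("{Report.Customer}", "Acme Ltd"),
    ("{Report.Amount}", "249.90"),
]

def short_name(key: str) -> str:
    """Turn '{Report.InvoiceNo}' into 'invoiceno'."""
    inner = key.strip("{}")              # drop the surrounding braces
    return inner.split(".")[-1].lower()  # keep the part after the last dot

def looks_numeric(sample: str) -> bool:
    """Return True if the sample text parses as a float."""
    try:
        float(sample)
    except ValueError:
        return False
    return True

# Map every original key to a short name.
rename_map = {key: short_name(key) for key, _ in fields}

# Guess which (renamed) fields should be cast to float, based on the sample.
cast_map = {
    short_name(key): float
    for key, sample in fields
    if looks_numeric(sample)
}
```

The two dictionaries have the same shape as the arguments passed to RenameFields and CastTypes in the pipeline example. A sample-based guess is crude: here it also flags the invoice number as numeric, so the result is worth reviewing before use.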
Simple pipeline
The | operator chains transformation stages. Nothing executes until you
iterate or sink the result:
from crxml import CrystalXMLSource, RenameFields, CastTypes, DropFields
pipe = (
    CrystalXMLSource("report.xml")
    | RenameFields({
        "{Report.InvoiceNo}": "invoice",
        "{Report.Customer}": "customer",
        "{Report.Amount}": "amount",
    })
    | CastTypes({"amount": float})
    | DropFields("{Report.TaxRate}")
)
for row in pipe:
    print(row["invoice"], row["amount"])
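To show how a `|` operator can chain stages without running them, here is a minimal stand-in written with plain Python. It is a sketch of the general technique (overloading `__or__` and composing generators), not crxml's implementation. `Pipe`, `Stage`, `rename`, and `source` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator

Row = dict[str, str]

class Stage:
    """Hypothetical stage: wraps a function from row iterator to row iterator."""
    def __init__(self, transform: Callable[[Iterator[Row]], Iterator[Row]]):
        self.transform = transform

class Pipe:
    """Hypothetical lazy pipeline over an iterable of rows."""
    def __init__(self, rows: Iterable[Row], stages: tuple[Stage, ...] = ()):
        self.rows = rows
        self.stages = stages

    def __or__(self, stage: Stage) -> "Pipe":
        # Chaining only records the stage; no rows are read here.
        return Pipe(self.rows, self.stages + (stage,))

    def __iter__(self) -> Iterator[Row]:
        # Stages are wired together only when iteration is requested.
        it = iter(self.rows)
        for stage in self.stages:
            it = stage.transform(it)
        return it

def rename(mapping: dict[str, str]) -> Stage:
    """Build a stage that renames keys according to `mapping`."""
    def run(rows: Iterator[Row]) -> Iterator[Row]:
        for row in rows:
            yield {mapping.get(k, k): v for k, v in row.items()}
    return Stage(run)

calls: list[str] = []

def source() -> Iterator[Row]:
    """Generator source that records when it is actually read."""
    calls.append("read")
    yield {"{Report.Amount}": "249.90"}

pipe = Pipe(source()) | rename({"{Report.Amount}": "amount"})
reads_before = list(calls)   # snapshot taken before any iteration
result = list(pipe)          # iteration pulls rows through every stage
reads_after = list(calls)
```

Because the source here is a one-shot generator, iterating this sketch a second time yields nothing. Whether a crxml pipeline can be iterated more than once is not stated in this guide.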
Convert to DataFrame
This collects all rows into a pandas DataFrame. For large files use
chunksize= to build the DataFrame incrementally (see Sinks).
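The idea behind incremental building can be illustrated with plain Python batching. This is a sketch of the general technique only; how chunksize= behaves in crxml is covered under Sinks. `in_chunks` is a hypothetical helper, and the rows are made-up data.

```python
from itertools import islice
from typing import Iterable, Iterator

def in_chunks(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` rows, reading lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))  # pull up to `size` rows
        if not chunk:                   # source exhausted
            return
        yield chunk

# Hypothetical stream of five rows; a generator, so nothing is held in memory.
rows = ({"invoice": str(n), "amount": "1.50"} for n in range(5))

chunk_sizes = [len(chunk) for chunk in in_chunks(rows, 2)]
```

Each chunk is a list of dicts, which is a shape that `pandas.DataFrame` accepts directly, so only one batch of rows needs to be held as Python objects at a time.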
Next steps
- Usage guide, deeper topics: custom stages, parallel mode, branching
- Pipeline API, how | and lazy evaluation work
- Built-in stages, reference for all four stage types
- Performance, benchmarks, memory model, bottlenecks