Round-tripping pandas / polars DataFrames via ADBC
ADBC keeps your data in Arrow from XTDB through to a DataFrame, with no row-by-row decode and no type widening through the Postgres text wire. This guide shows both directions: reading query results straight into pandas / polars, and writing DataFrames back to XTDB.
Type fidelity is the main reason to use this path.
TIMESTAMP WITH TIME ZONE, nullable columns, and dictionary-encoded categoricals all survive intact.
The same round-trip through psycopg or JDBC widens timestamps, turns nullable Arrow arrays into NaN-polluted float columns, and discards the category encoding.
Prerequisites
Section titled “Prerequisites”pip install adbc-driver-flightsql pyarrow pandas polarsdocker run --rm -d --name xtdb -p 5432:5432 -p 9832:9832 ghcr.io/xtdb/xtdb:nightlyAll examples below connect to grpc://localhost:9832.
The runnable companion script exercises every code block in order.
The dataset
Section titled “The dataset”A small events table that stress-tests type fidelity:
_id: required; XTDB’s document key.ts:TIMESTAMP WITH TIME ZONE. psycopg hands this back as a tz-awaredatetime; ADBC hands it back astimestamp[us, tz=UTC], Arrow-native, no intermediate decode.severity: string, stored as a small-cardinality column. Ingested via pandas asdictionary<int8, large_utf8>; polars ingests asdictionary<uint32, large_utf8>. psycopg would widen to plain strings; ADBC returns it asutf8, which clients can re-encode as dictionary if needed (see the dictionary callout).value:float64, always present.label:utf8, nullable. Some rows havenullhere. Through psycopg the column often surfaces asobjectdtype withNonesprinkled in; Arrow models it as a validity bitmap alongside the data buffer.
XTDB → Arrow (reading)
Section titled “XTDB → Arrow (reading)”Connect and fetch a query result as an Arrow table:
import pyarrow as paimport adbc_driver_flightsql.dbapi as flight_sql
with flight_sql.connect("grpc://localhost:9832") as conn: with conn.cursor() as cur: cur.execute("SELECT _id, ts, severity, value, label FROM events ORDER BY ts") arrow_table = cur.fetch_arrow_table()
# arrow_table is a pyarrow.Table; Arrow buffers, no copies yet.print(arrow_table.schema)# _id: string# ts: timestamp[us, tz=UTC]# severity: string# value: double# label: string (nullable=True)Query results stream from XTDB as Arrow batches over gRPC.
fetch_arrow_table() collects them into a single pyarrow.Table, memory-bound by the client; use the raw batch-by-batch path for very large result sets.
Timezone normalisation
Section titled “Timezone normalisation”XTDB FlightSQL currently returns timestamps with timezone key "Z" (Zulu = UTC).
The "Z" key is valid Arrow but not a recognised IANA timezone string, so pandas will raise a ZoneInfoNotFoundError when printing or converting these columns.
Cast to "UTC" before handing the table to pandas or polars:
def normalise_tz(table: pa.Table) -> pa.Table: """Cast timestamp[us, tz=Z] → timestamp[us, tz=UTC] for client compatibility.""" for i, field in enumerate(table.schema): t = field.type if pa.types.is_timestamp(t) and t.tz == "Z": table = table.set_column( i, field.name, table.column(field.name).cast(pa.timestamp(t.unit, tz="UTC")), ) return table
arrow_table = normalise_tz(arrow_table)Arrow → pandas
Section titled “Arrow → pandas”With types_mapper=pd.ArrowDtype (recommended)
Section titled “With types_mapper=pd.ArrowDtype (recommended)”import pandas as pd
df = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)print(df.dtypes)# _id string[pyarrow]# ts timestamp[us, tz=UTC][pyarrow]# severity string[pyarrow]# value double[pyarrow]# label string[pyarrow]types_mapper=pd.ArrowDtype keeps all columns Arrow-backed:
- Zero-copy for all column types including strings: the pandas
ArrowDtypecolumn wraps the Arrow buffer directly, no allocation. - Nulls surface as
pd.NA(notfloat('nan')):df["label"].isna().sum()counts them correctly. - Timestamps carry the UTC timezone as an
ArrowDtype: no lossy intermediate.
Without types_mapper (classic path)
Section titled “Without types_mapper (classic path)”df_classic = arrow_table.to_pandas()print(df_classic.dtypes)# _id object ← strings decoded into Python objects# ts datetime64[us, UTC] ← zero-copy from the Arrow buffer# severity object ← strings decoded, dictionary encoding gone# value float64 ← zero-copy from the Arrow buffer# label object ← nulls decoded as Python NoneNumeric and timestamp columns are zero-copy in the classic path too.
Strings allocate: pandas’s object dtype is Python objects, not Arrow buffers.
Arrow → polars
Section titled “Arrow → polars”polars is Arrow-native, so pl.from_arrow() wraps the Arrow buffers without copying them:
import polars as pl
df = pl.from_arrow(arrow_table)print(df.schema)# Schema({'_id': String, 'ts': Datetime(time_unit='us', time_zone='UTC'),# 'severity': String, 'value': Float64, 'label': String})The polars path is genuinely zero-copy for all column types including strings.
polars stores strings as Arrow LargeUtf8 internally; it wraps the server-side Utf8 buffer directly.
Nulls are modelled as Arrow nulls, not NaN:
df["label"].null_count() # 2; not NaN, not None-in-objectDataFrame → XTDB (ingesting)
Section titled “DataFrame → XTDB (ingesting)”The clean bulk-ingest path (one gRPC round trip, no row shredding) uses adbc_ingest:
import pyarrow as pa
# pandas → Arrowarrow_from_pd = pa.Table.from_pandas(df, preserve_index=False)
# polars → Arrow (near-zero-copy)arrow_from_pl = df_pl.to_arrow()
with flight_sql.connect("grpc://localhost:9832") as conn: with conn.cursor() as cur: cur.adbc_ingest("events", arrow_from_pd, mode="create_append") cur.adbc_ingest("events", arrow_from_pl, mode="create_append")adbc_ingest maps to the FlightSQL CommandStatementIngest command: the whole table travels as one Arrow stream, no per-row round trips.
It’s available on the nightly image and on stable images from 2.2.0 onwards.
Building an ingest-ready Arrow table from pandas
Section titled “Building an ingest-ready Arrow table from pandas”Regardless of the ingest path, here is how to build a clean Arrow table from a pandas DataFrame:
import pandas as pdimport pyarrow as pa
df = pd.DataFrame({ "_id": ["evt-001", "evt-002", "evt-003", "evt-004"], "ts": pd.to_datetime( ["2024-01-15T08:30:00Z", "2024-01-15T09:45:00Z", "2024-01-15T11:00:00Z", "2024-01-15T14:20:00Z"], utc=True, ), # Categorical → Arrow dictionary<int8, large_utf8>. "severity": pd.Categorical( ["INFO", "WARN", "ERROR", "INFO"], categories=["INFO", "WARN", "ERROR"], ), "value": [1.23, 4.56, 7.89, 2.34], # StringDtype with None → Arrow utf8 with validity bitmap (not NaN). "label": pd.array(["startup", None, "threshold", None], dtype=pd.StringDtype()),})
arrow_table = pa.Table.from_pandas(df, preserve_index=False)# Schema:# _id: large_string# ts: timestamp[us, tz=UTC]# severity: dictionary<values=large_string, indices=int8, ordered=0># value: double# label: large_stringKey points:
- Use
pd.StringDtype()for nullable string columns soNonemaps to an Arrow null, not aNaN. Plain Pythonstrin anobjectcolumn would produceNoneorfloat('nan'), whichfrom_pandashandles heuristically. - Use
pd.Categoricalfor low-cardinality string columns:from_pandasconverts it to Arrowdictionary<int8, large_utf8>preserving the encoding. preserve_index=Falsedrops the pandas index from the schema; the resulting table is clean Arrow with no pandas metadata bleeding in.
Building an ingest-ready Arrow table from polars
Section titled “Building an ingest-ready Arrow table from polars”polars .to_arrow() is near-zero-copy:
import polars as plfrom datetime import datetime, timezone
df = pl.DataFrame({ "_id": ["evt-005", "evt-006"], "ts": pl.Series( [datetime(2024, 1, 16, 8, 0, tzinfo=timezone.utc), datetime(2024, 1, 16, 10, 30, tzinfo=timezone.utc)], dtype=pl.Datetime("us", "UTC"), ), "severity": pl.Series(["WARN", "ERROR"], dtype=pl.Categorical), "value": [9.99, 0.01], "label": pl.Series([None, "manual"], dtype=pl.Utf8),})
arrow_table = df.to_arrow()# Schema:# _id: large_string# ts: timestamp[us, tz=UTC]# severity: dictionary<values=large_string, indices=uint32, ordered=0># value: double# label: large_stringpolars pl.Categorical maps to dictionary<uint32, large_utf8>, a wider index than pandas’s int8, but both are valid Arrow dictionary types.
Type fidelity: what survives, what doesn’t
Section titled “Type fidelity: what survives, what doesn’t”| Type | Via pgwire (psycopg) | Via ADBC |
|---|---|---|
TIMESTAMP WITH TIME ZONE | Python datetime (tz-aware) | timestamp[us, tz=UTC], Arrow native, zero-copy into polars / pandas with ArrowDtype |
| Nullable column | object dtype with None | Arrow validity bitmap, proper null_count, no NaN |
DOUBLE / FLOAT | Python float | float64 Arrow buffer, zero-copy into numpy / polars |
INTEGER | Python int | int32 / int64 Arrow buffer, zero-copy |
VARCHAR / TEXT | Python str | utf8 / large_utf8, zero-copy into polars, copy into classic pandas |
Dictionary round-trip caveat
Section titled “Dictionary round-trip caveat”XTDB’s query engine returns utf8 for string columns regardless of how they were ingested.
If you ingest a dictionary<int8, utf8> column and query it back, you’ll get utf8: the data is intact but the encoding is gone.
Re-encode client-side if the downstream computation depends on it:
col = arrow_table.column("severity")if not pa.types.is_dictionary(col.type): arrow_table = arrow_table.set_column( arrow_table.schema.get_field_index("severity"), "severity", col.dictionary_encode(), )Null handling on the pandas ingest path
Section titled “Null handling on the pandas ingest path”pandas uses NaN for missing float values and None for object dtype columns.
pa.Table.from_pandas() maps both to Arrow nulls, but only if the column uses a pandas nullable extension type.
Use pd.StringDtype(), pd.Int64Dtype(), etc. so from_pandas can distinguish a real None from a missing sentinel:
# Correct: nullable Int64Dtype maps None → Arrow null.df["count"] = pd.array([1, None, 3], dtype=pd.Int64Dtype())
# Risky: plain int list with None becomes float64 with NaN.df["count"] = [1, None, 3]The pgwire comparison
Section titled “The pgwire comparison”Same round-trip via psycopg for contrast:
import psycopg
with psycopg.connect("postgresql://xtdb@localhost:5432/xtdb") as pg: # Row-at-a-time INSERT; one round trip per row. pg.autocommit = True with pg.cursor() as cur: for _, row in df.iterrows(): cur.execute( "INSERT INTO events (_id, ts, severity, value) VALUES (%s, %s, %s, %s)", (row["_id"], row["ts"], str(row["severity"]), row["value"]), )
# SELECT returns rows decoded one at a time. cur.execute("SELECT _id, ts, severity, value, label FROM events ORDER BY ts") rows = cur.fetchall() # rows is a list of tuples; ts is a datetime, severity is a str, # nulls are None, no Arrow anywhere, no zero-copy.The ADBC path:
- Ingests the whole table in a single gRPC call via
adbc_ingestvs oneINSERTper row. - Returns Arrow batches vs decoded Python tuples.
- Preserves
timestamp[us, tz=UTC]vs widening todatetime. - Keeps nulls as Arrow null rather than Python
Nonein anobjectcolumn.
For analytical and bulk workloads, the Arrow path avoids a decode-and-re-encode step on every row. For interactive queries and existing Postgres tooling, pgwire is still the right answer.
Summary
Section titled “Summary”- Use
cur.fetch_arrow_table()to collect query results as apyarrow.Table. Callnormalise_tz()to converttz=Z→tz=UTCbefore pandas/polars conversion. - Use
.to_pandas(types_mapper=pd.ArrowDtype)for zero-copy across all column types including strings. - Use
pl.from_arrow(arrow_table)for polars: genuinely zero-copy, nulls modelled correctly. - Use
pa.Table.from_pandas(df, preserve_index=False)to build an Arrow table for ingest. Use nullable extension types (pd.StringDtype(),pd.Int64Dtype()) to preserve nulls. - Use
df.to_arrow()for polars: near-zero-copy. - Use
cur.adbc_ingest(table, arrow_table, mode="create_append")for bulk Arrow ingest over FlightSQL: one round trip, no row shredding. Available onnightlyand stable images from 2.2.0 onwards.