Skip to content

Point-in-time feature extraction for ML training sets

Training models on data that changes over time runs into the as-of-join problem. You label an event at time T, then join feature rows on the entity key. A naive join pulls each entity’s current feature values, including everything that happened after T. The model trains on information from after the label time, scores well offline, and then degrades in production.

Much of what feature stores (Feast, Tecton, Hopsworks) do is to perform this join correctly. In XTDB the join is one SQL query, because every row already carries its valid-time interval. adbc_driver_flightsql returns the result as Arrow, and pyarrow.Table.to_pandas() gives you a model-ready dataframe.

The runnable example for this guide is in docs/.../examples/point-in-time-feature-extraction/main.py.

Loan-default prediction.

Labels. One row per event we want to train on: (customer_id, observed_at, did_default). A label is an event, observed at a single instant, not a state that holds over a span. So we model its valid-time as the instantaneous chronon [observed_at, observed_at + 1µs): observed_at becomes the label’s own _valid_from, and the as-of join below is a clean period-vs-period one.

Features. Per-customer attributes that drift: account_balance, credit_score, employment_status. The truth about a customer on 2024-08-15 is different from the truth about that customer on 2024-11-01, and the model only ever gets to see the former.

We seed three customers with deliberately-shifting histories. c1 starts the year solvent, drifts down through summer, and bottoms out post-default in October. c3 defaults, then recovers in November. A naive join would feed both of those post-default snapshots into training rows for a default that happened in August.

Start a node with the FlightSQL listener:

Terminal window
docker run --rm -d --name xtdb -p 5432:5432 -p 9832:9832 ghcr.io/xtdb/xtdb:nightly
pip install adbc-driver-flightsql pyarrow pandas

Connect:

import adbc_driver_flightsql.dbapi as flight_sql
# autocommit=True so each seed INSERT commits and the joins below read their
# own writes. The dbapi defaults to manual-commit (PEP 249) now that the
# server advertises transaction support, so without this the uncommitted
# seed rows would be invisible to subsequent queries on the same connection.
conn = flight_sql.connect("grpc://localhost:9832", autocommit=True)

Every XTDB row carries a valid-time interval, exposed as _valid_from / _valid_to. Setting _valid_from on INSERT records when each version became true, not when the database happened to learn it.

cur.execute("""
INSERT INTO customer_features
(_id, _valid_from, account_balance, credit_score, employment_status)
VALUES (?, ?, ?, ?, ?)
""", parameters=["c1", date(2024, 1, 1), 1500.00, 680, "employed"])

Re-inserting _id = "c1" with a later _valid_from doesn’t overwrite the prior version: it closes off the previous interval and starts a new one. After ingesting the seed, c1’s history is a step function over valid time:

c1: [2024-01-01 .. 2024-04-01) balance=1500 score=680 employed
[2024-04-01 .. 2024-07-01) balance=2100 score=700 employed
[2024-07-01 .. 2024-10-01) balance=400 score=640 employed
[2024-10-01 .. ∞ ) balance=50 score=580 unemployed

No audit table, no effective_from column, no triggers, no manual interval bookkeeping.

This is what most pipelines start with:

SELECT l._id AS label_id,
l.customer_id,
l._valid_from AS observed_at,
l.did_default,
f.account_balance,
f.credit_score,
f.employment_status
FROM loan_labels AS l
JOIN customer_features AS f
ON f._id = l.customer_id

Without a temporal qualifier, XTDB returns each customer’s current row, so this query gives every label the customer’s feature values as they are today, regardless of when the label was observed.

For our seed:

label_id customer_id observed_at did_default account_balance credit_score employment_status
L1 c1 2024-08-15 True 50.0 580 unemployed
L2 c2 2024-08-15 False 10200.0 770 employed
L3 c3 2024-08-15 True 5500.0 700 employed

Look at L1. The label says c1 defaulted on 2024-08-15. The feature row says c1 has $50 in the bank and is unemployed. Those numbers describe c1 after the default has already destroyed their finances. The model “learns” that low balance + unemployed predicts default: true but tautological. Look at L3. The features for c3 show $5,500 and a 700 credit score: c3’s recovery a quarter after the default we’re trying to predict. Train on this and the model learns the aftermath of a default, not the signals that precede one, so it has nothing useful to go on in production.

The fix is one SQL clause: FOR ALL VALID_TIME on the features table, joined against the label’s valid-time period:

SELECT l._id AS label_id,
l.customer_id,
l._valid_from AS observed_at,
l.did_default,
f.account_balance,
f.credit_score,
f.employment_status
FROM loan_labels FOR ALL VALID_TIME AS l
LEFT JOIN customer_features FOR ALL VALID_TIME AS f
ON f._id = l.customer_id
AND f._valid_time CONTAINS l._valid_time

FOR ALL VALID_TIME opens up every historical version of customer_features. Both sides are valid-time periods, so the join is period-against-period: f._valid_time CONTAINS l._valid_time picks the one feature version per customer whose period contains the label’s event chronon, without spelling out the interval bounds by hand. (observed_at is the label’s _valid_from, so we project it from there.)

Same data, run again:

label_id customer_id observed_at did_default account_balance credit_score employment_status
L1 c1 2024-08-15 True 400.0 640 employed
L2 c2 2024-08-15 False 9500.0 760 employed
L3 c3 2024-08-15 True 10.0 540 unemployed

c1’s August features are now $400 / 640 / employed: the state the customer was actually in before the default. c3’s features are $10 / 540 / unemployed: the dire state preceding the default, not the post-recovery snapshot.

This is one query covering every label, with no Python loop or per-row round trip. XTDB applies the FOR VALID_TIME AS OF bound during the table scan, so each row is read at its as-of-label-time version directly.

Why this is hard without bitemporal storage

Section titled “Why this is hard without bitemporal storage”

The standard alternative is event sourcing: store every change to a customer as an immutable event with a timestamp, then for each label scan the event log up to observed_at and fold the latest state per attribute. You’ll either write that fold in application code (slow, hard to vectorise), bake it into a Spark/Flink job (complex, brittle, your own join-engine to debug), or rent a feature store (Feast/Tecton/Hopsworks) whose primary value-add is doing exactly this.

Each of those routes rebuilds part of what a bitemporal database already does. XTDB is one, with the temporal logic built in and queried through standard SQL:2011 syntax.

fetch_arrow_table() gives a pyarrow.Table; to_pandas() gives the dataframe sklearn wants:

cur.execute(ASOF_SQL)
features_df = cur.fetch_arrow_table().to_pandas()
X = features_df[["account_balance", "credit_score", "employment_status"]]
y = features_df["did_default"].astype(int)

The Arrow path is zero-copy for numeric columns and one allocation for strings. For a one-million-label training set the conversion stays in seconds, not minutes.

If you’d rather skip pandas:

import polars as pl
features_df = pl.from_arrow(cur.fetch_arrow_table())

polars consumes the Arrow batches directly with no intermediate copy.

Per-label valid-time. Labels themselves don’t have to live in XTDB. Land them as an in-memory pyarrow.Table and ingest, or INSERT them directly; the AS-OF join works the same.

Multi-table join. Multiple feature tables, each with their own timeline:

SELECT l._id, l.customer_id, l._valid_from AS observed_at, l.did_default,
cf.credit_score,
acc.account_balance,
emp.status AS employment_status
FROM loan_labels FOR ALL VALID_TIME AS l
LEFT JOIN credit_scores FOR ALL VALID_TIME AS cf ON cf._id = l.customer_id AND cf._valid_time CONTAINS l._valid_time
LEFT JOIN accounts FOR ALL VALID_TIME AS acc ON acc._id = l.customer_id AND acc._valid_time CONTAINS l._valid_time
LEFT JOIN employment_history FOR ALL VALID_TIME AS emp ON emp._id = l.customer_id AND emp._valid_time CONTAINS l._valid_time

Each feature table is joined at its own correct point in time. You can collapse the repetition with a SQL view; the optimiser still pushes each temporal predicate into its respective scan.

Reproducible re-runs. Pin the whole query to a system-time basis:

SETTING DEFAULT SYSTEM_TIME AS OF TIMESTAMP '2024-12-01T00:00:00Z'
SELECT ...

Re-run the training-set extraction six months later, get bit-identical output even if features were corrected in the meantime. The FOR VALID_TIME clause picks the right version in the timeline; SETTING DEFAULT SYSTEM_TIME picks the right timeline. This is the bitemporal pair; see Time in XTDB for the underlying model.

The example script seeds each table in a single adbc_ingest call, including a _valid_from column so every row sets its own valid-time and the customers build a real timeline. _valid_from must be a timestamp column in the Arrow table, not a date. For bulk loads of labels or feature timelines from Parquet or pandas, see Bulk-load Parquet into XTDB via ADBC.

Training-set extraction is a single read query, so transactions don’t enter into it. But if you combine ingestion and querying on the same connection, see Transactions in the reference.