What is the difference between temporal aggregation and window mapping?

Temporal aggregation collapses high-frequency observations into discrete time intervals (e.g. 5-minute bins), reducing cardinality and smoothing noise. Window mapping then binds those aggregated intervals to spatial reference frames — H3 hexagons, road segments, or administrative boundaries — producing a spatiotemporal matrix where each cell has both a geographic footprint and a time-bound metric.

How do I prevent boundary artifacts in fixed temporal windows?

Anchor all windows to a consistent UTC epoch (e.g. midnight) using floor/truncation, not rounding. Apply fractional weighting or trajectory interpolation for events that straddle a window edge. In streaming systems, configure explicit late-event watermarks to avoid double-counting.
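The fractional weighting mentioned above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed implementation: the helper names `floor_to_window` and `split_across_windows` are hypothetical, and the event is assumed to have a known start and end time (for example a dwell period).

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def floor_to_window(ts: datetime, every: timedelta) -> datetime:
    """Truncate ts down to the start of its window (a multiple of `every` since the epoch)."""
    return ts - ((ts - EPOCH) % every)


def split_across_windows(
    start: datetime, end: datetime, every: timedelta
) -> dict[datetime, float]:
    """Return {window_start: share of the event's duration falling in that window}."""
    total = end - start
    if total <= timedelta(0):
        # Instantaneous (or malformed) event: assign it wholly to one window.
        return {floor_to_window(start, every): 1.0}
    shares: dict[datetime, float] = {}
    cursor = floor_to_window(start, every)
    while cursor < end:
        # Overlap between the event and the window [cursor, cursor + every).
        overlap = min(end, cursor + every) - max(start, cursor)
        shares[cursor] = overlap / total
        cursor += every
    return shares


# A 10-minute dwell that straddles the 10:15 boundary of 15-minute windows.
dwell = split_across_windows(
    datetime(2024, 3, 1, 10, 10, tzinfo=timezone.utc),
    datetime(2024, 3, 1, 10, 20, tzinfo=timezone.utc),
    timedelta(minutes=15),
)
```

Because the shares always sum to 1, multiplying any additive metric (dwell seconds, distance travelled) by them distributes it across windows without double-counting.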

Which spatial indexing system — H3, S2, or quadkey — is best for mobility density grids?

H3 is preferred for most mobility workloads because its hexagonal cells have consistent area, six equal-distance neighbors, and fast parent/child resolution. S2 provides finer geographic subdivision for global logistics. Quadkeys (Web Mercator tiles) are best when output must align with tile-based mapping APIs. The key is to choose one system and use it consistently throughout the pipeline.

How should I handle devices with wildly different reporting rates in the same pipeline?

Normalize by reporting frequency before aggregation. Compute a per-device expected ping count for each window based on its historical sample rate, then apply inverse-probability weighting so high-frequency devices do not swamp counts. Flag and quarantine devices whose instantaneous rate deviates more than 3× from their baseline.

Temporal Aggregation & Window Mapping for Movement Data Pipelines

Raw GPS pings, cellular handoffs, and telematics streams are inherently asynchronous: timestamps are irregular, reporting rates vary by hardware, and spatial precision fluctuates with signal conditions. Without a disciplined aggregation and windowing layer, every downstream analysis — congestion scoring, OD matrix construction, dwell detection — produces results that are non-comparable across time periods, non-reproducible across pipeline runs, and non-joinable with road networks or administrative boundaries. This guide details the architectural patterns, implementation strategies, and production-grade considerations required to build robust temporal windowing pipelines for movement data.

Prerequisites & Scope

This page assumes you are working with a Python 3.10+ stack and trajectory data in tabular form (Parquet, CSV, or database tables with device_id, ts_utc (UTC datetime), lat, lon, and optionally speed_kmh). Familiarity with spatiotemporal data foundations — specifically how raw GPS observations are modelled as trajectory objects before aggregation — is assumed.

Core library versions tested:

Library	Minimum version	Role
`polars`	0.20	Lazy temporal windowing, `group_by_dynamic`
`geopandas`	0.14	Spatial joins, GeoDataFrame construction
`h3` (h3-py)	4.0	Hexagonal grid indexing
`pyproj`	3.6	CRS transformations, UTM zone detection
`movingpandas`	0.17	TrajectoryCollection helpers

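All distance and velocity calculations throughout this guide use a metric projected CRS, preferably the local UTM zone. EPSG:3857 (Web Mercator) is metric but inflates distances by roughly 1/cos(latitude), so reserve it for display rather than measurement. Never compute movement metrics directly from WGS84 (EPSG:4326) degree differences: the ground length of a degree of longitude shrinks with latitude.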
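To see why degree differences are not a usable distance unit, the sketch below compares the ground length of one degree of longitude at the equator and at 60° N using the haversine formula. It is an illustration only: the spherical Earth radius is an approximation, and production pipelines would use a projected CRS or a geodesic library instead.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The same one-degree longitude step covers very different ground distances.
at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
at_60_north = haversine_km(60.0, 0.0, 60.0, 1.0)
ratio = at_60_north / at_equator  # close to cos(60 degrees) = 0.5
```

A speed derived from raw degree deltas would therefore be overstated by about a factor of two at 60° N relative to the equator for east-west movement.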

Core Conceptual Model

Temporal aggregation and window mapping are two complementary operations. Understanding their distinct roles prevents the most common pipeline design errors.

Temporal Aggregation

Temporal aggregation collapses a stream of point observations into discrete, non-overlapping (or deliberately overlapping) time intervals. The output is a set of interval-keyed rows, each summarizing the observations that fell within that window: device counts, speed statistics, ping totals, dwell durations. The two primary window types are:

Fixed (tumbling) windows: All intervals have equal duration and share boundaries across the entire dataset. A 15-minute fixed window anchored at UTC midnight produces the same boundaries for every device and every day. Fixed windows are reproducible and easy to join, but they cut cleanly through events that span a boundary.
Adaptive windows: Boundaries shift based on data density, velocity, or state transitions. Dynamic time-binning strategies shorten windows during high-activity bursts and widen them during slow or stationary periods, preserving signal fidelity at both extremes. Adaptive windows require careful design to remain deterministic across re-runs.

A third variant — sliding (rolling) windows — produces overlapping intervals for moving-average style computations. These are handled by rolling statistics for mobility metrics and are distinct from aggregation windows used for spatial joining.
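For fixed windows, the boundary rule is floor (truncation) toward a shared epoch. The sketch below contrasts it with rounding to show why rounding is unsafe; the function names are illustrative and the rounding variant exists only for the comparison.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start_floor(ts: datetime, every: timedelta) -> datetime:
    """Correct rule: truncate down to the nearest multiple of `every` since the epoch."""
    return ts - ((ts - EPOCH) % every)


def window_start_round(ts: datetime, every: timedelta) -> datetime:
    """Incorrect rule, shown for contrast: snap to the nearest boundary."""
    return window_start_floor(ts + every / 2, every)


every = timedelta(minutes=15)
ping = datetime(2024, 3, 1, 0, 14, 59, tzinfo=timezone.utc)

floored = window_start_floor(ping, every)  # stays in the window the ping occurred in
rounded = window_start_round(ping, every)  # jumps into a window that has not started yet
```

With rounding, a ping recorded at 00:14:59 is credited to the 00:15 window, so every window silently absorbs the second half of its predecessor.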

Spatial Window Mapping

Once the temporal dimension is collapsed, each aggregate row carries a representative location (usually a centroid or the first ping in the interval). Window mapping binds that location to a spatial reference frame — a grid cell, a road segment identifier, or an administrative zone polygon. The result is a spatiotemporal matrix: a table indexed by (cell_id, window_start) where every cell has both a geographic footprint and a time-bound metric.

The choice of spatial reference frame has cascading consequences for join performance, hierarchical rollup capability, and interoperability with downstream systems. The three dominant systems for mobility work are:

H3 (Uber): Hexagonal hierarchical grid, resolutions 0–15. Hexagons have six equal-distance neighbors, which makes density smoothing and neighborhood queries fast and geometrically consistent. Preferred for most city-scale mobility work.
S2 (Google): Spherical quadtree subdivision. Cells at any level are approximately square and cover the full globe, poles included, with relatively low distortion. Better than H3 for global logistics pipelines where polar regions matter.
Quadkeys / Web Mercator tiles: Integer-encoded XYZ tile addresses. Fast to compute, universally supported by mapping APIs, but cells grow distorted at high latitudes.
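As a concrete example of one of these reference frames, the quadkey for a point can be computed with a few lines of arithmetic. This sketch follows the commonly documented Web Mercator tile scheme; the function names are illustrative, and the string (base-4 digit) form of the key is used here rather than an integer encoding.

```python
import math

MAX_LAT = 85.05112878  # Web Mercator is undefined beyond this latitude


def latlon_to_tile(lat: float, lon: float, zoom: int) -> tuple[int, int]:
    """Return the XYZ tile (x, y) containing a WGS84 point at the given zoom."""
    lat = max(min(lat, MAX_LAT), -MAX_LAT)
    n = 1 << zoom
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    # Clamp so that lon = 180 or extreme latitudes stay inside the tile grid.
    tx = min(max(int(x * n), 0), n - 1)
    ty = min(max(int(y * n), 0), n - 1)
    return tx, ty


def tile_to_quadkey(tx: int, ty: int, zoom: int) -> str:
    """Interleave the bits of x and y into one base-4 digit per zoom level."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = (1 if tx & mask else 0) + (2 if ty & mask else 0)
        digits.append(str(digit))
    return "".join(digits)


def latlon_to_quadkey(lat: float, lon: float, zoom: int) -> str:
    return tile_to_quadkey(*latlon_to_tile(lat, lon, zoom), zoom)


coarse = latlon_to_quadkey(48.8566, 2.3522, 10)
fine = latlon_to_quadkey(48.8566, 2.3522, 12)
```

A parent tile's quadkey is a prefix of its children's, which is what makes hierarchical rollup a simple string truncation.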

[Figure: a stream of raw GPS pings flows through temporal aggregation, then spatial mapping, to produce a spatiotemporal matrix.]

Architecture Decision Map

The biggest design choices in a temporal windowing pipeline are not library selections — they are structural decisions that determine correctness and re-computability. The table below captures the main trade-offs:

Decision	Option A	Option B	When to choose A	When to choose B
Window boundary type	Fixed (tumbling)	Adaptive / event-triggered	Joins against fixed schedules, OD matrices, shift-aligned KPIs	High-variance mobility data where fixed bins mask bursts; dynamic time-binning strategies page covers this pattern
Spatial reference frame	H3 hexagonal grid	Road network segments	Density mapping, heatmaps, neighborhood queries	Route-level analytics, speed-per-link, map-matching output
Execution model	Lazy batch (Polars/DuckDB)	Streaming (Flink/Kafka Streams)	Retrospective reprocessing, ML feature generation	Real-time dashboards with sub-minute latency requirements
Timestamp anchor	UTC midnight	Rolling epoch from first record	Deterministic cross-device joins	Single-device debugging and exploratory analysis only
Gap treatment	Leave gaps as NULL	Forward-fill / interpolate	Downstream models can tolerate sparse inputs	Strict continuity required; see gap filling in sparse trajectories
Window overlap	None (tumbling)	Sliding / hopping	Aggregation for spatial joins	Smoothing, trend detection; see rolling statistics for mobility metrics

Pipeline Integration

A production movement analytics stack layers temporal windowing between raw ingestion and spatial enrichment. The canonical stage sequence is:

1. Ingestion & schema validation: enforce device_id (string), ts_utc (UTC datetime, microsecond precision), lat/lon (float64 WGS84), speed_kmh (float32, nullable). Reject or quarantine rows that fail type constraints.
2. Timestamp normalization: convert to UTC, strip DST ambiguities, enforce ISO 8601. Time-series synchronization strategies covers multi-sensor clock alignment that must precede this step.
3. Sampling rate optimization: downsample or regularize reporting intervals so that high-frequency devices (1 Hz telematics) do not dominate aggregate counts. This step is upstream of windowing, not inside it.
4. Window assignment: apply floor/truncation to assign each ping to its temporal bin. Use group_by_dynamic in Polars or DATE_TRUNC in SQL; never compute window IDs from row_number() or positional offsets.
5. Interval aggregation: compute per-window statistics: n_unique(device_id), mean(speed_kmh), max(speed_kmh), count(*). Filter windows with fewer than the minimum ping threshold (typically 2) to suppress noise.
6. CRS projection: before any distance or speed computation, project from EPSG:4326 to a metric CRS. See coordinate reference system mapping for the transformation pipeline and zone selection logic.
7. Spatial grid assignment: convert window centroids to H3 (or S2/quadkey) indices. These grids index geographic latitude/longitude, so pass WGS84 coordinates here rather than the projected coordinates from step 6. Use vectorized batch conversion, not row-wise iteration.
8. Metric computation: group by (cell_id, window_start) and compute the final spatiotemporal matrix.
9. Export: write to GeoParquet, Delta Lake, or a PostGIS table partitioned by date.

Seasonal and cyclical alignment is applied after step 8, enriching the matrix with day-of-week, holiday, and shift-period flags before the output is consumed by routing or forecasting models.
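The calendar enrichment described above amounts to deriving a few flags from each window_start. The sketch below is a stdlib-only illustration: the holiday set and the shift boundaries are hypothetical placeholders, and flags are derived in UTC for simplicity, whereas an operational calendar would normally be evaluated in the fleet's local time zone.

```python
from __future__ import annotations

from datetime import date, datetime, timezone

# Hypothetical calendar inputs; replace with your operational calendar.
HOLIDAYS: set[date] = {date(2024, 1, 1)}


def shift_period(hour: int) -> str:
    """Map an hour of day to an assumed three-shift pattern."""
    if 6 <= hour < 14:
        return "morning"
    if 14 <= hour < 22:
        return "afternoon"
    return "night"


def calendar_flags(window_start: datetime) -> dict[str, object]:
    """Derive day-of-week, weekend, holiday, and shift flags for one window."""
    weekday = window_start.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "day_of_week": weekday,
        "is_weekend": weekday >= 5,
        "is_holiday": window_start.date() in HOLIDAYS,
        "shift": shift_period(window_start.hour),
    }


new_year = calendar_flags(datetime(2024, 1, 1, 7, 15, tzinfo=timezone.utc))
saturday_night = calendar_flags(datetime(2024, 3, 2, 23, 0, tzinfo=timezone.utc))
```

Keeping these flags as separate columns, rather than baking them into the window key, leaves the (cell_id, window_start) index unchanged for joins.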

Implementation: Production-Ready Python Stack

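The following implementation demonstrates a complete temporal windowing and spatial mapping pipeline. Every function has a typed signature; assign_temporal_windows validates required columns, and map_to_h3 and build_spatiotemporal_matrix guard against empty inputs. The code performs no distance or speed calculations itself: speed_kmh is assumed to have been computed upstream in a metric projected CRS, and the WGS84 coordinates are used only for centroids and H3 indexing.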

PYTHON

from __future__ import annotations

import polars as pl
import pandas as pd
import geopandas as gpd
import h3
import pyproj
from shapely.geometry import Point

# ---------------------------------------------------------------------------
# Stage 1: Timestamp normalization
# ---------------------------------------------------------------------------

def normalize_timestamps(df: pl.LazyFrame, ts_col: str = "ts_utc") -> pl.LazyFrame:
    """Cast ts_col to a UTC, microsecond-precision datetime (no null check is performed)."""
    return df.with_columns(
        pl.col(ts_col)
        .cast(pl.Datetime(time_unit="us", time_zone="UTC"))
        .alias(ts_col)
    )


# ---------------------------------------------------------------------------
# Stage 2: Fixed temporal window assignment (15-minute tumbling windows)
# ---------------------------------------------------------------------------

def assign_temporal_windows(
    df: pl.LazyFrame,
    ts_col: str = "ts_utc",
    every: str = "15m",
) -> pl.LazyFrame:
    """
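    Assign each row to a fixed tumbling window aligned to the UTC epoch.

    'every' and 'period' are set equal for non-overlapping (tumbling) windows.
    offset="0ns" with the default start_by="window" aligns window starts to
    multiples of 'every' (00:00, 00:15, ... UTC), not to the first record.
    group_by=None pools all devices, so each window yields a single row whose
    centroid is the mean position of every ping in that window.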
    """
    required = {ts_col, "device_id", "speed_kmh", "lat", "lon"}
    schema_cols = set(df.collect_schema().names())
    missing = required - schema_cols
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    return (
        df
        .sort(ts_col)
        .group_by_dynamic(
            ts_col,
            every=every,
            period=every,
            offset="0ns",
            group_by=None,
        )
        .agg(
            pl.col("device_id").n_unique().alias("unique_devices"),
            # NOTE: speed_kmh computed from projected coordinates upstream;
            # raw WGS84 degree distances must never be used here.
            pl.col("speed_kmh").mean().alias("avg_speed_kmh"),
            pl.col("speed_kmh").max().alias("max_speed_kmh"),
            pl.col("lat").mean().alias("centroid_lat"),
            pl.col("lon").mean().alias("centroid_lon"),
            pl.len().alias("ping_count"),
        )
    )


# ---------------------------------------------------------------------------
# Stage 3: Filter low-confidence windows and map to H3 grid
# ---------------------------------------------------------------------------

def map_to_h3(
    windowed: pl.DataFrame,
    resolution: int = 7,
    min_pings: int = 2,
) -> gpd.GeoDataFrame:
    """
    Filter sparse windows and assign H3 cell indices.

    Requires h3-py >= 4.0 (uses h3.latlng_to_cell).
    For h3-py 3.x, replace with h3.geo_to_h3(lat, lon, resolution).
    """
    if windowed.is_empty():
        return gpd.GeoDataFrame(
            columns=["ts_utc", "h3_index", "unique_devices",
                     "avg_speed_kmh", "max_speed_kmh", "ping_count"],
            geometry=[],
            crs="EPSG:4326",
        )

    filtered = windowed.filter(pl.col("ping_count") >= min_pings)
    pdf = filtered.to_pandas()

    pdf["h3_index"] = [
        h3.latlng_to_cell(row.centroid_lat, row.centroid_lon, resolution)
        for row in pdf.itertuples()
    ]

    gdf = gpd.GeoDataFrame(
        pdf,
        geometry=[
            Point(row.centroid_lon, row.centroid_lat)
            for row in pdf.itertuples()
        ],
        crs="EPSG:4326",
    )
    return gdf


# ---------------------------------------------------------------------------
# Stage 4: Final spatiotemporal matrix aggregation
# ---------------------------------------------------------------------------

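def build_spatiotemporal_matrix(gdf: gpd.GeoDataFrame) -> pd.DataFrame:
    """
    Group by (h3_index, ts_utc) to produce the final spatiotemporal matrix.

    Returns one row per (cell, window_start) pair as a plain pandas DataFrame:
    the groupby drops the geometry column, so rebuild geometry from h3_index
    if a GeoDataFrame is needed downstream.
    """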
    if gdf.empty:
        return gdf

    matrix = (
        gdf.groupby(["h3_index", "ts_utc"], as_index=False)
        .agg(
            unique_devices=("unique_devices", "sum"),
            avg_speed_kmh=("avg_speed_kmh", "mean"),
            max_speed_kmh=("max_speed_kmh", "max"),
            ping_count=("ping_count", "sum"),
        )
    )
    return matrix


# ---------------------------------------------------------------------------
# Usage example
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    raw = pl.scan_parquet("data/telematics_stream.parquet")

    normalized = normalize_timestamps(raw)
    windowed_lazy = assign_temporal_windows(normalized, every="15m")
    windowed = windowed_lazy.collect()

    gdf = map_to_h3(windowed, resolution=7, min_pings=2)
    matrix = build_spatiotemporal_matrix(gdf)

    matrix.to_parquet("output/spatiotemporal_matrix.parquet", index=False)

Key design choices in this implementation:

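pl.scan_parquet builds a lazy query: nothing is read until .collect() is called on the windowed result, so Polars can load only the columns the query needs and optimize the aggregation as a whole.
offset="0ns" in group_by_dynamic, combined with the default start_by="window", aligns window starts to multiples of the window length since the Unix epoch (:00, :15, :30, :45 UTC for 15-minute windows), not to the first record timestamp. Inconsistent anchoring is the most common source of non-reproducible window boundaries in team pipelines.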
min_pings >= 2 removes single-ping windows before spatial mapping, preventing phantom H3 cells from device wake-up events.
The map_to_h3 empty-guard returns a correctly typed empty GeoDataFrame, which prevents downstream groupby crashes on no-data partitions.

Architecture Decision: H3 Resolution Selection

The H3 resolution controls both spatial granularity and computational cost. The table below shows the trade-offs for mobility work:

H3 Resolution	Avg cell area	Typical use case	Neighbors within 1 hop
5	~252 km²	Regional freight flow, city-level density	6
6	~36 km²	District-level commute patterns	6
7	~5.2 km²	Neighbourhood traffic, parking demand	6
8	~0.74 km²	Intersection-level congestion	6
9	~0.1 km²	Individual road segments, parking lots	6

Resolution 7 is a reliable starting point for city-scale mobility analytics: cells are large enough to aggregate multiple vehicles per 15-minute window under normal conditions, but fine enough to separate neighbourhoods and major corridors. Increase to resolution 8 or 9 only when intersection-level or road-link attribution is required, and ensure that your minimum ping threshold scales accordingly (higher resolution = fewer pings per cell = noisier counts).
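The trade-off between resolution and ping sparsity can be turned into a rough selection rule. The sketch below uses the average cell areas from the table above and assumes pings are spread evenly over the study area, which real traffic is not, so treat the result as a starting point rather than a guarantee. The function name and the density input are hypothetical.

```python
from __future__ import annotations

# Average cell areas (km^2) copied from the resolution table above.
AVG_CELL_AREA_KM2 = {5: 252.0, 6: 36.0, 7: 5.2, 8: 0.74, 9: 0.1}


def finest_viable_resolution(
    pings_per_km2_per_window: float, min_pings: int = 2
) -> int | None:
    """Return the finest resolution whose average cell is expected to hold
    at least `min_pings` pings per window, or None if even the coarsest
    resolution is too sparse."""
    for resolution in sorted(AVG_CELL_AREA_KM2, reverse=True):
        expected = pings_per_km2_per_window * AVG_CELL_AREA_KM2[resolution]
        if expected >= min_pings:
            return resolution
    return None


suburban = finest_viable_resolution(1.0)     # about 1 ping per km^2 per window
downtown = finest_viable_resolution(30.0)    # dense urban core
rural = finest_viable_resolution(0.001)      # very sparse coverage
```

If the function returns None, widening the temporal window is usually a better remedy than dropping below resolution 5.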

Engineering Pitfalls & Production Gotchas

1. Window Boundary Drift Across Distributed Nodes

When a distributed job (Spark, Dask, or Polars run once per partition) assigns windows partition by partition, each partition can anchor its windows to its own first record unless the window origin is fixed globally. A partition whose first record is at 2024-03-01 00:07:00 will then produce windows starting at 00:07, 00:22, 00:37 rather than 00:00, 00:15, 00:30. Fix: always align windows to the UTC epoch (in Polars group_by_dynamic, offset="0ns" with the default start_by="window"; in other engines, floor the epoch timestamp to the window length), and validate boundary alignment with a post-run continuity check before downstream joins.

Diagnostic signal: OD matrix join rates drop below 100% on a day-of-week that should be complete; window start_time modulo window_duration is non-zero.
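The modulo diagnostic can be automated in a few lines. This is a minimal stdlib sketch with an illustrative function name; it assumes window starts are timezone-aware UTC datetimes.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def misaligned_windows(
    window_starts: list[datetime], every: timedelta
) -> list[datetime]:
    """Return the window starts that are not a whole multiple of `every` since the epoch."""
    return [ts for ts in window_starts if (ts - EPOCH) % every != timedelta(0)]


starts = [
    datetime(2024, 3, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 3, 1, 0, 15, tzinfo=timezone.utc),
    datetime(2024, 3, 1, 0, 22, tzinfo=timezone.utc),  # drifted partition
]
drifted = misaligned_windows(starts, timedelta(minutes=15))
```

Running this on the distinct window_start values of each partition before a join makes drift visible as a non-empty list rather than as a silent drop in join rate.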

2. DST Ambiguity Corrupting Time-Series Joins

Storing timestamps in local time causes duplicate hour labels during the fall-back transition and a skipped hour during spring-forward. A pipeline that ingests Europe/London data in winter GMT (UTC+0) and summer BST (UTC+1) will misalign windows by exactly one hour for half the year.

Fix: enforce UTC at ingestion. Apply timezone offsets only at the reporting layer. Use pytz.utc or zoneinfo.ZoneInfo("UTC") explicitly — never rely on system locale defaults. See handling timezone shifts in cross-border mobility data for the full mitigation pattern.

Diagnostic signal: hourly profiles for affected regions are shifted by one hour during the summer-time months, and daily totals differ from expectation by about one hour of traffic (roughly 4%, or 1/24 of a 24-hour window).
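The fall-back ambiguity is easy to reproduce. The sketch below uses fixed UTC offsets for BST and GMT so that it runs without a time zone database; those hard-coded offsets are an assumption made for illustration, and real code would resolve them through zoneinfo. On 2024-10-27, the local label 01:30 occurred twice in Europe/London.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stand in for Europe/London before and after the fall-back transition.
BST = timezone(timedelta(hours=1), "BST")
GMT = timezone(timedelta(0), "GMT")

first_pass = datetime(2024, 10, 27, 1, 30, tzinfo=BST)   # before clocks go back
second_pass = datetime(2024, 10, 27, 1, 30, tzinfo=GMT)  # one hour later, same wall-clock label

# Stored as naive local time, the two pings collapse onto one label.
local_labels = {first_pass.replace(tzinfo=None), second_pass.replace(tzinfo=None)}

# Stored as UTC, they remain two distinct instants one hour apart.
utc_instants = {
    first_pass.astimezone(timezone.utc),
    second_pass.astimezone(timezone.utc),
}
elapsed = second_pass - first_pass
```

Any window keyed on the local label would merge an hour of traffic into the wrong bin, which is why the conversion to UTC has to happen at ingestion.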

3. Sampling Rate Inequality Inflating Counts

A fleet with mixed hardware — 1 Hz high-end telematics alongside 0.1 Hz consumer GPS — will produce count distributions where high-frequency devices appear ten times more “present” in a given cell than low-frequency ones. This distorts vehicle density estimates and OD flows.

Fix: apply sampling rate optimization upstream to regularize reporting intervals to a common baseline (e.g. one ping per 30 seconds) before windowing. If regularization is not feasible, weight each ping by the inverse of its device’s sample rate within the window.

Diagnostic signal: The top 5% of devices by ping_count account for more than 40% of total pings in a fleet that should be homogeneous.
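Inverse sample-rate weighting can be sketched as follows. Each ping is weighted by one over the number of pings its device is expected to emit in a full window, so a device present for the whole window contributes about 1.0 regardless of hardware. The function name and input shapes are hypothetical, and per-device sample rates are assumed to be known in advance.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Hashable, Iterable


def weighted_presence(
    pings: Iterable[tuple[str, Hashable]],
    sample_rate_hz: dict[str, float],
    window_seconds: float,
) -> dict[Hashable, float]:
    """Sum inverse-rate weights per window.

    pings: (device_id, window_key) pairs, one per observation.
    Returns {window_key: device-equivalents present in that window}.
    """
    totals: dict[Hashable, float] = defaultdict(float)
    for device_id, window_key in pings:
        expected_pings = sample_rate_hz[device_id] * window_seconds
        totals[window_key] += 1.0 / expected_pings
    return dict(totals)


# One 15-minute window: a 1 Hz unit and a 0.1 Hz unit, both present throughout.
pings = [("telematics", "w0")] * 900 + [("consumer_gps", "w0")] * 90
rates = {"telematics": 1.0, "consumer_gps": 0.1}
presence = weighted_presence(pings, rates, window_seconds=900)
```

Raw counts would report 900 versus 90 pings; the weighted total is approximately 2.0, one unit of presence per device.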

4. CRS Mismatch in Spatial Joins After Windowing

geopandas.sjoin does not reproject its inputs. When the left and right GeoDataFrames have different CRS (for example one in EPSG:4326 and the other in a projected UTM zone), it emits only a CRS-mismatch warning and then compares the raw coordinate values, which typically yields zero matches and occasionally false matches. In batch jobs where warnings are suppressed or never surfaced, the failure is effectively silent.

Fix: always call gdf.to_crs(epsg=32632) (or the appropriate UTM zone) before any sjoin, buffer, or distance operation. After the join, re-project back to EPSG:4326 for storage. The coordinate reference system mapping guide covers UTM zone selection for arbitrary bounding boxes.

Diagnostic signal: sjoin returns a DataFrame with fewer rows than the left input on a dataset where every point is known to fall within a zone polygon.

5. Late-Arriving Events in Streaming Pipelines

In real-time streaming systems (Flink, Kafka Streams), GPS pings from devices in poor connectivity areas can arrive minutes or hours after their recorded timestamp. Without explicit watermarks, these late events either get dropped or extend open windows indefinitely.

Fix: set a watermark of 5–15 minutes (depending on your fleet’s typical connectivity lag) and route late events to a dedicated correction stream rather than silently discarding them. Apply gap filling in sparse trajectories to retrospectively patch affected windows during nightly reprocessing.

Diagnostic signal: Window closure rate (percentage of windows that close on time) drops below 98% for fleet segments operating in rural or underground environments.
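The watermark logic can be illustrated without a streaming framework. The class below is a simplified stdlib sketch and not the Flink or Kafka Streams API: the watermark trails the largest event time seen so far by a fixed allowed lateness, and an event whose window has already closed is routed to a correction stream instead of being dropped. All names are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class WatermarkRouter:
    """Route events to 'on_time' or 'late' based on an event-time watermark."""

    def __init__(self, every: timedelta, allowed_lateness: timedelta) -> None:
        self.every = every
        self.allowed_lateness = allowed_lateness
        self.max_event_time: datetime | None = None

    def route(self, event_time: datetime) -> str:
        # The watermark only ever moves forward with the newest event time seen.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - self.allowed_lateness
        window_start = event_time - ((event_time - EPOCH) % self.every)
        window_end = window_start + self.every
        # A window is closed once the watermark has reached its end.
        return "late" if window_end <= watermark else "on_time"


def at(hour: int, minute: int) -> datetime:
    return datetime(2024, 3, 1, hour, minute, tzinfo=timezone.utc)


router = WatermarkRouter(every=timedelta(minutes=15), allowed_lateness=timedelta(minutes=10))
decisions = [router.route(t) for t in (at(10, 1), at(10, 40), at(10, 5), at(10, 35))]
```

Here the 10:05 ping arrives after the 10:40 ping has pushed the watermark to 10:30, so its 10:00 to 10:15 window is already closed and it goes to the correction stream.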

Python Tooling Landscape

Library	Key capability	When to use it in this pipeline
`polars`	Lazy `group_by_dynamic`, vectorized aggregation, Arrow-native	Primary temporal windowing engine for batch pipelines
`duckdb`	SQL `DATE_TRUNC` windowing, Parquet-native, zero-copy Arrow exchange	SQL-first teams or pipelines that mix SQL and Python
`geopandas`	`sjoin`, `to_crs`, geometry operations on DataFrames	Spatial join stage after temporal aggregation
`h3` (h3-py ≥ 4.0)	`latlng_to_cell`, `cell_to_latlng`, `grid_ring`	Hexagonal grid assignment and neighborhood queries
`movingpandas`	`TrajectoryCollection`, `Trajectory.add_speed()`, temporal resampling	Pre-aggregation trajectory cleaning; connects to trajectory object design patterns
`pyproj`	`Transformer.from_crs`, UTM zone detection, CRS pipeline definitions	All metric-CRS projections; mandatory before speed/distance calc
`scipy.signal`	Savitzky-Golay smoothing on speed series	Post-aggregation noise reduction in speed profiles
`apache-flink` / `pyflink`	Watermarks, tumbling/sliding window operators, exactly-once semantics	Real-time streaming pipelines with sub-minute latency

Speed and acceleration profiling relies on the metric-projected speed values produced by this pipeline, so correctness here propagates directly to kinematic feature quality.

Validation & Testing Patterns

A temporal windowing pipeline is only as reliable as its test coverage. The following checks should be automated as part of your CI pipeline or dbt test suite:

Temporal Continuity Check

PYTHON

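from datetime import timedelta

# Map the unit suffix of a simple duration string ("15m", "1h") to a timedelta keyword.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def assert_window_continuity(
    matrix: pl.DataFrame,
    every: str = "15m",
    ts_col: str = "ts_utc",
) -> None:
    """Verify that distinct window start times are evenly spaced with no gaps.

    Only single-unit duration strings such as "15m" or "1h" are parsed here.
    """
    times = matrix.select(pl.col(ts_col).unique().sort())[ts_col].to_list()
    if len(times) < 2:
        return
    # Compare Python timedeltas; a pl.duration expression cannot be compared to them.
    expected_gap = timedelta(**{_UNITS[every[-1]]: int(every[:-1])})
    gaps = [times[i + 1] - times[i] for i in range(len(times) - 1)]
    bad = [g for g in gaps if g != expected_gap]
    assert not bad, f"Window boundary gaps detected: {bad[:5]}"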

Metric Consistency Check

Cross-validate aggregated totals against raw ping counts. A correctly built pipeline must satisfy:

TEXT

sum(windowed["ping_count"]) == raw_df.height                                     (before the min_pings filter; rows with non-null ts_utc)
sum(matrix["ping_count"])   == raw_df.height - pings in windows below min_pings  (after the filter)

Any discrepancy indicates dropped intervals, duplicate rows in the raw source, or misconfigured group_by_dynamic period/offset parameters.
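The conservation check can be expressed as a small helper. This is a stdlib sketch with a hypothetical function name: it takes the raw row count and the per-window ping counts before filtering, verifies that no pings were lost or duplicated during windowing, and reports how many the min_pings filter removed.

```python
from __future__ import annotations


def check_ping_conservation(
    raw_rows: int, window_ping_counts: list[int], min_pings: int = 2
) -> tuple[int, int]:
    """Return (kept, dropped) ping totals after the min_pings filter.

    Raises AssertionError if the unfiltered windows do not account for
    every raw row exactly once.
    """
    total = sum(window_ping_counts)
    assert total == raw_rows, (
        f"Windowing changed the ping total: {total} in windows vs {raw_rows} raw rows"
    )
    kept = sum(count for count in window_ping_counts if count >= min_pings)
    return kept, total - kept


# Five windows built from 20 raw pings; two windows hold a single ping each.
kept, dropped = check_ping_conservation(20, [7, 1, 6, 1, 5], min_pings=2)
```

The filtered matrix should then sum to the kept value; any other total points to a second, unintended source of row loss.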

Spatial Coverage Audit

PYTHON

def assert_coordinates_in_bounds(
    gdf: gpd.GeoDataFrame,
    lat_bounds: tuple[float, float] = (-90.0, 90.0),
    lon_bounds: tuple[float, float] = (-180.0, 180.0),
) -> None:
    """Flag rows with out-of-range centroids — common symptom of CRS mismatch."""
    lats = gdf.geometry.y
    lons = gdf.geometry.x
    bad_lat = ((lats < lat_bounds[0]) | (lats > lat_bounds[1])).sum()
    bad_lon = ((lons < lon_bounds[0]) | (lons > lon_bounds[1])).sum()
    assert bad_lat == 0, f"{bad_lat} rows with out-of-range latitude"
    assert bad_lon == 0, f"{bad_lon} rows with out-of-range longitude"

Performance Benchmarking

Profile memory usage and join latency on a representative data slice (one day, one major city) before deploying to production. Expected behavior for a well-tuned Polars pipeline:

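normalize_timestamps + assign_temporal_windows: roughly linear in row count, plus an O(n log n) sort (vectorized Arrow operations).
map_to_h3 row-wise loops: the only Python-level loops in the pipeline, running once per window row rather than once per ping. Wrapping h3.latlng_to_cell in numpy.vectorize tidies the call but is still a Python-level loop, so it will not remove the per-row cost if throughput becomes a bottleneck.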
build_spatiotemporal_matrix groupby: linear with cardinality of (cell_id, window_start) unique pairs.

Quadratic scaling in any stage indicates unindexed spatial joins or accidental cross-joins from a non-unique window boundary column.

Dynamic Time-Binning Strategies — adaptive window sizing based on velocity and density signals
Gap Filling in Sparse Trajectories — interpolation and Kalman-filter approaches for missing intervals
Rolling Statistics for Mobility Metrics — sliding-window aggregation for moving-average speed and congestion scoring
Seasonal & Cyclical Alignment — aligning aggregated windows to operational calendars and recurring mobility patterns
Coordinate Reference System Mapping — CRS transformation pipelines required before spatial joins
