Why can't I compute distances directly in EPSG:4326?

EPSG:4326 stores angular degrees, not metres. Treating degrees as planar units is only approximately valid near the equator: at 52° latitude a degree of longitude covers about 38% less ground than a degree of latitude, so a naive degree-based distance overstates east-west distances by roughly 60%. Always reproject to a metric CRS before computing speed, distance, or area.

What sampling rate should I use for urban fleet telematics?

1 Hz (one fix per second) captures lane-change and turn behaviour. 0.1 Hz (one fix per 10 seconds) is adequate for origin-destination studies. Adaptive logging—higher frequency during manoeuvres, lower on straight highway segments—can cut storage 60-80% versus fixed 1 Hz without losing analytical fidelity.

How do I handle GPS dropouts without inventing false positions?

Dead-reckoning forward from the last known velocity and heading gives a plausible corridor. Represent imputed segments as uncertainty ellipses or confidence corridors rather than deterministic points—downstream models can then weight observations probabilistically.

When should I use H3 cells versus a road-network model?

H3 hexagons excel for density estimation, exposure analysis, and any aggregation where topology is irrelevant. A network model is mandatory whenever you need routing feasibility, travel-time estimation, or capacity constraints—grids cannot represent one-way streets or turn restrictions.

Spatiotemporal Data Foundations & Structures

Without a correct data model beneath them, even sophisticated movement analytics produce silent, hard-to-diagnose errors: projected velocities that are 40% too high, trajectory gaps that are silently interpolated as straight lines, timestamps that drift by hours across device fleets. Mobility data scientists, urban analysts, and logistics engineering teams lose weeks tracing these failures back to foundational misconfigurations in coordinate systems, sampling architecture, and storage layout.

Prerequisites & Scope

This guide assumes Python 3.10+ and familiarity with pandas DataFrames and basic GIS concepts. The primary libraries referenced are:

geopandas ≥ 0.14 — spatial DataFrame operations and CRS management
pyproj ≥ 3.6 — coordinate reference system transformations via PROJ 9
movingpandas ≥ 0.17 — trajectory object construction and movement semantics
shapely ≥ 2.0 — geometry operations (vectorized via GEOS)
pyarrow / geoparquet — columnar storage for large trajectory datasets
scipy — signal filtering (Savitzky-Golay, Kalman state estimation)

Data formats covered: raw GNSS NMEA, telematics JSON streams, AVL CSV exports, and GeoParquet archives. If you are new to the broader movement analytics stack, start with Movement Pattern Extraction & Trajectory Analysis to understand how the structures built here are consumed downstream.

Core Conceptual Model

Point Sequences and Trajectory Objects

Every movement dataset starts as a time-ordered sequence of coordinate-timestamp tuples (x, y, t). Raw sequences are analytically unusable at scale: they carry no entity context, bounding-box metadata, or schema guarantees. The first structural step is elevating them into trajectory objects, which encapsulate the entity identifier, temporal extent, spatial bounding box, transport mode, and confidence scores alongside the geometry. movingpandas.Trajectory is the standard Python representation; it wraps a GeoDataFrame with a DatetimeIndex and exposes vectorized spatial operations without leaving the pandas ecosystem.

Trajectory objects should support lazy evaluation and be backed by columnar storage (GeoParquet via pyarrow) to minimize memory overhead during batch processing. Crucially, the CRS must be embedded in the object schema and validated on construction—silent CRS mismatches are the single most common source of incorrect velocity calculations.

Space-Time Prisms and Reachability Envelopes

Between any two sampled positions, the true path is unknown. A space-time prism defines the feasible geographic region an entity could have occupied during that gap, bounded by maximum physically plausible velocity and the elapsed time. Rather than assuming a straight-line interpolation, the prism provides a mathematically rigorous uncertainty envelope—essential for privacy-preserving analytics, epidemiological exposure modelling, and probabilistic map-matching. When generating imputed segments after GPS dropout, represent the gap as a prism corridor, not a deterministic line.
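
As a rough illustration of the prism idea, the sketch below tests whether a candidate location lies inside the prism's spatial footprint, which is the ellipse whose foci are the two fixes and whose major axis is the maximum speed multiplied by the gap duration. The function name and the tuple-based inputs are assumptions made for this example, and coordinates are assumed to be in a metric CRS.

```python
import math

def in_space_time_prism(p, a, b, t_a, t_b, v_max):
    """Return True if planar point p is reachable on some path from fix a to fix b.

    a, b, p are (x, y) tuples in metres; t_a and t_b are timestamps in seconds;
    v_max is the assumed maximum speed in m/s (a modelling assumption).
    """
    if t_b <= t_a:
        raise ValueError("t_b must be later than t_a")
    # Longest path the entity could have travelled during the gap.
    budget_m = v_max * (t_b - t_a)
    # Shortest path that visits p on the way from a to b.
    detour_m = math.dist(a, p) + math.dist(p, b)
    return detour_m <= budget_m
```

This checks only the footprint over the whole gap. A time-specific test (could the entity be at p at time t?) would compare each leg of the detour against its own time budget.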

Grid-Based and Network-Referenced Representations

Choosing between a grid and a network model is one of the most consequential architecture decisions in movement data engineering. Grid models discretize continuous coordinates into uniform cells—H3 hexagons at resolution 8 cover roughly 0.74 km² and enable O(1) spatial joins by cell key. This makes them ideal for density estimation, origin-destination matrices, and exposure analysis. Network models map coordinates to road segments using linear referencing systems, preserving the topological constraints—one-way streets, turn restrictions, grade separations—that grids cannot express. Use grids for aggregation and exposure; use networks for routing and capacity modelling.

Architecture Decision Map

Design choice	Option A	Option B	When to choose A	When to choose B
Spatial discretization	H3 hexagonal grid	Road-network LRS	Density, heatmaps, exposure	Routing, travel time, capacity
CRS strategy	Normalize at ingestion	Transform at query	Batch pipelines, predictable accuracy	Exploratory work with mixed regions
Sampling regime	Fixed interval (e.g. 1 Hz)	Adaptive (event-triggered)	Regulated telematics, model inputs	Cost-sensitive IoT, consumer apps
Trajectory storage	GeoParquet (columnar)	PostGIS (row)	Analytical batch jobs	Low-latency point lookups
Gap handling	Dead-reckoning + prism	Straight-line interpolation	Safety, insurance, epidemiology	Coarse OD studies where gaps are rare
Timestamp storage	UTC + explicit tz metadata	Local wall clock	Cross-border, multi-timezone fleets	Single-timezone, controlled deployments

Pipeline Integration

Spatiotemporal data foundations sit at the ingestion and cleaning layer of a broader movement analytics stack. The pipeline stages below are roughly sequential; each stage’s output is the next stage’s prerequisite.

Stage 1 — Ingestion. Raw streams arrive as NMEA sentences, JSON telemetry records, or CSV AVL exports. The schema contract (entity_id, timestamp, latitude, longitude, optional hdop, speed_ms, heading_deg) must be enforced at this boundary using Pydantic validators or great_expectations suites. Reject or quarantine records that fail rather than letting malformed data propagate silently.
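
The quarantine pattern can be sketched with a standard-library dataclass standing in for a Pydantic model. The class name Fix, the helper partition_records, and the specific range checks are assumptions for illustration, not part of any library.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Fix:
    """One validated position record (hypothetical class for this sketch)."""
    entity_id: str
    timestamp: datetime
    latitude: float
    longitude: float
    hdop: Optional[float] = None
    speed_ms: Optional[float] = None
    heading_deg: Optional[float] = None

    def __post_init__(self) -> None:
        if not isinstance(self.entity_id, str) or not self.entity_id:
            raise ValueError("entity_id must be a non-empty string")
        if not isinstance(self.timestamp, datetime) or self.timestamp.tzinfo is None:
            raise ValueError("timestamp must be a timezone-aware datetime")
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")
        if self.hdop is not None and self.hdop <= 0:
            raise ValueError(f"hdop must be positive: {self.hdop}")
        if self.speed_ms is not None and self.speed_ms < 0:
            raise ValueError(f"speed_ms must be non-negative: {self.speed_ms}")
        if self.heading_deg is not None and not 0.0 <= self.heading_deg <= 360.0:
            raise ValueError(f"heading_deg out of range: {self.heading_deg}")


def partition_records(raw_records):
    """Split raw dicts into validated Fix objects and quarantined (record, reason) pairs."""
    accepted, quarantined = [], []
    for record in raw_records:
        try:
            accepted.append(Fix(**record))
        except (TypeError, ValueError) as exc:
            # TypeError covers missing or unexpected fields and non-numeric values.
            quarantined.append((record, str(exc)))
    return accepted, quarantined


good = {"entity_id": "bus-17", "timestamp": datetime(2024, 5, 1, 8, 0, tzinfo=timezone.utc),
        "latitude": 52.52, "longitude": 13.40, "hdop": 1.2}
bad = {**good, "latitude": 123.0}
accepted, quarantined = partition_records([good, bad])
```

Keeping the rejection reason next to each quarantined record makes it easier to audit why data was dropped.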

Stage 2 — CRS normalization. Immediately after schema validation, reproject all coordinates to the analytical CRS using pyproj.Transformer. For regional datasets, UTM zone selection is deterministic from the centroid (zone from longitude, hemisphere from latitude). For global datasets, use an equal-area projection like EPSG:6933 (WGS 84 / NSIDC EASE-Grid 2.0 Global) for area-based statistics; because an equal-area projection does not preserve distances, compute global distances geodesically (for example with pyproj.Geod). Store both the original WGS84 coordinates and the projected metric coordinates. Full guidance is in Coordinate Reference System Mapping.
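
The UTM selection rule can be written down in a few lines. This is a minimal sketch that assumes the standard 6° zone layout on the WGS 84 datum (EPSG:326xx for the northern hemisphere, EPSG:327xx for the southern) and ignores the special zones around Norway and Svalbard. The function name is illustrative.

```python
import math

def utm_epsg_for(lon_deg: float, lat_deg: float) -> int:
    """Return the EPSG code of the WGS 84 / UTM zone containing the given centroid."""
    # Zones are 6 degrees wide and numbered 1..60 eastward from 180° W.
    zone = int(math.floor((lon_deg + 180.0) / 6.0)) % 60 + 1
    # 326xx = northern hemisphere, 327xx = southern hemisphere.
    base = 32600 if lat_deg >= 0 else 32700
    return base + zone
```

For a GeoDataFrame that already carries a CRS, geopandas also offers estimate_utm_crs(), which avoids hand-rolling this logic.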

Stage 3 — Temporal alignment. Normalize timestamps to UTC, detect and compensate for clock drift, and resample multi-source streams to a common interval. See Time-Series Synchronization Strategies for implementation patterns including NTP-based drift correction and event-driven bucketing.

Stage 4 — Noise filtering. Apply GPS Precision & Error Handling techniques: HDOP threshold filtering (reject fixes with hdop > 4.0), kinematic plausibility checks (reject points implying speed > mode-specific ceiling), and Kalman or Savitzky-Golay smoothing for trajectory continuity.
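
A minimal sketch of the first two filters follows, assuming time-ordered fixes as dicts with keys t (seconds), x and y (metres in a projected CRS), and hdop. The function name and record layout are assumptions for this example. The greedy pass also assumes that the first fix that clears the HDOP check is trustworthy, which a production filter would not take for granted.

```python
import math

HDOP_MAX = 4.0  # threshold quoted in the text

def filter_fixes(fixes, max_speed_ms, hdop_max=HDOP_MAX):
    """Drop fixes with poor HDOP or that imply an implausible speed from the last kept fix."""
    kept = []
    for fix in fixes:
        if fix["hdop"] > hdop_max:
            continue  # poor satellite geometry
        if kept:
            prev = kept[-1]
            dt = fix["t"] - prev["t"]
            if dt <= 0:
                continue  # duplicate or out-of-order timestamp
            speed = math.hypot(fix["x"] - prev["x"], fix["y"] - prev["y"]) / dt
            if speed > max_speed_ms:
                continue  # position jump exceeds the mode-specific ceiling
        kept.append(fix)
    return kept
```

Smoothing (Kalman or Savitzky-Golay) would then run on the surviving fixes.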

Stage 5 — Analytical layer. Clean trajectory objects flow into Movement Pattern Extraction & Trajectory Analysis for segmentation, stay-point detection, and kinematic feature extraction.

Engineering Pitfalls and Production Gotchas

1. Distance calculations in geographic coordinates

Computing Euclidean distance directly on EPSG:4326 latitude/longitude values is the most widespread accuracy bug in movement data pipelines. At 52° N, one degree of longitude spans roughly 69 km while one degree of latitude spans about 111 km, so treating the two as equal overstates east-west distances by roughly 60% (the true distance is about 38% shorter than the naive figure). Any function that accepts a GeoDataFrame and computes distances, speeds, or areas must assert gdf.crs.is_projected before proceeding.

PYTHON

import geopandas as gpd

def assert_metric_crs(gdf: gpd.GeoDataFrame) -> None:
    """Raise ValueError if the GeoDataFrame is not in a projected CRS."""
    if gdf.crs is None:
        raise ValueError("GeoDataFrame has no CRS assigned.")
    # is_projected rules out lat/lon degrees; it does not by itself guarantee metre units.
    if not gdf.crs.is_projected:
        raise ValueError(
            f"CRS {gdf.crs.srs!r} is geographic, not metric. "
            "Reproject to a local UTM zone or EPSG:6933 before computing distances."
        )
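
To make the size of the error concrete, the following sketch compares a spherical great-circle (haversine) distance with a naive planar distance computed on degrees, for two points one degree of longitude apart at 52° N. The haversine formula uses a mean Earth radius, so it is itself an approximation, and the helper names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def naive_degree_distance_m(lat1, lon1, lat2, lon2):
    """Euclidean distance on raw degrees, scaled as if every degree were a degree of latitude."""
    metres_per_degree = math.radians(1.0) * EARTH_RADIUS_M
    return math.hypot(lat2 - lat1, lon2 - lon1) * metres_per_degree

true_m = haversine_m(52.0, 13.0, 52.0, 14.0)
naive_m = naive_degree_distance_m(52.0, 13.0, 52.0, 14.0)
ratio = true_m / naive_m  # close to cos(52°)
```

The ratio comes out near 0.62, the cosine of 52°: the true east-west distance is about 38% shorter than the naive figure. For a purely north-south pair the two functions agree.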

2. Projection mismatches across pipeline stages

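Joining two GeoDataFrame objects in different CRSs returns geometrically incorrect results, because geopandas compares raw coordinate values and does not reproject for you. Recent releases emit a UserWarning about the CRS mismatch during sjoin, but the join still runs and the warning is easy to lose in pipeline logs. Always validate CRS equality before spatial joins: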

PYTHON

if left.crs != right.crs:
    right = right.to_crs(left.crs)

3. Clock drift corrupting multi-entity correlation

Device clocks drift by 0.5–5 seconds per hour in unmanaged embedded systems. Over a 12-hour fleet shift, a 2 s/h drift accumulates 24 seconds of offset. When correlating two vehicles that were co-located at a junction, 24-second drift makes them appear 600 m apart at 90 km/h. Apply NTP-anchored drift correction at the ingestion layer; store the correction offset as a column for audit purposes.
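
The arithmetic above, together with a first-order linear correction, can be sketched as follows. The sketch assumes a constant drift rate since the last trusted synchronisation and takes positive drift to mean the device clock runs fast. Real drift varies with temperature and hardware, and the function names are illustrative.

```python
def drift_offset_s(elapsed_h: float, drift_s_per_h: float) -> float:
    """Accumulated clock offset after elapsed_h hours at a constant drift rate."""
    return elapsed_h * drift_s_per_h

def apparent_separation_m(offset_s: float, speed_kmh: float) -> float:
    """Distance a vehicle covers during the offset, i.e. the apparent position error."""
    return offset_s * speed_kmh / 3.6

def correct_timestamp(device_ts: float, sync_ts: float, drift_s_per_h: float):
    """Return (corrected epoch seconds, applied offset) using a linear drift model.

    sync_ts is the device time of the last trusted (e.g. NTP) synchronisation.
    """
    elapsed_h = (device_ts - sync_ts) / 3600.0
    offset = drift_offset_s(elapsed_h, drift_s_per_h)
    return device_ts - offset, offset

offset = drift_offset_s(12, 2.0)             # 12-hour shift at 2 s/h
gap_m = apparent_separation_m(offset, 90.0)  # vehicle travelling at 90 km/h
```

Returning the applied offset alongside the corrected timestamp lets the pipeline store it as the audit column mentioned above.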

4. Aliasing from under-sampling dynamic manoeuvres

A 30-second sampling interval cannot reconstruct a 10-second turn. The trajectory appears as a straight line through the intersection, which map-matching algorithms can then snap to the wrong road segments. For urban fleet analytics, a sampling interval of 5 seconds or shorter is needed to capture turn geometry. Use adaptive logging that increases frequency during deceleration events (detectable from speed_delta between consecutive fixes).

5. Silent trajectory splits from storage partitioning

Time-based table partitioning can split a single long trajectory across partition boundaries, causing queries filtered to one day to return incomplete entities. Always store a trajectory_id that spans partitions, and implement cross-partition trajectory reassembly in your query layer or materialise complete trajectories before downstream use.
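
One way to picture cross-partition reassembly is the sketch below, which assumes each partition is available as a list of row dicts carrying trajectory_id and timestamp. The function name and the in-memory representation are assumptions; in practice this step would more likely be a query over the partitioned store.

```python
from collections import defaultdict

def reassemble_trajectories(partitions):
    """Merge rows from several partitions into complete, time-ordered trajectories.

    partitions: iterable of row lists; each row is a dict with at least
    'trajectory_id' and 'timestamp' keys.
    """
    by_id = defaultdict(list)
    for rows in partitions:
        for row in rows:
            by_id[row["trajectory_id"]].append(row)
    # Sort each trajectory by time so partition order does not matter.
    return {
        traj_id: sorted(rows, key=lambda r: r["timestamp"])
        for traj_id, rows in by_id.items()
    }
```

A trip that crosses midnight then appears as one trajectory even though its rows live in two daily partitions.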

Python Tooling Landscape

Library	Version	Role in the pipeline	Key notes
`geopandas`	≥ 0.14	Spatial DataFrames, CRS management, spatial joins	Use `geopandas.sjoin` for trajectory-to-zone matching; see Optimizing Spatial Joins
`pyproj`	≥ 3.6	CRS definition, datum shifts, transformer pipelines	Pipeline-style `Transformer.from_pipeline` for multi-step projections
`movingpandas`	≥ 0.17	Trajectory objects, segmentation, speed/direction computation	`TrajectoryCollection` wraps multi-entity datasets; integrates with GeoPandas
`shapely`	≥ 2.0	Geometry construction, buffering, intersection	Vectorized operations via GEOS 3.11; prefer the top-level array functions (e.g. `shapely.contains_xy`, `shapely.distance`) over the legacy `shapely.vectorized` module
`scipy.signal`	≥ 1.11	Savitzky-Golay smoothing, Butterworth filtering	`savgol_filter` for position smoothing; use `filterpy` for full Kalman state
`pyarrow` / `geoarrow`	≥ 14	GeoParquet read/write, Arrow-native columnar ops	Preferred storage format for batch analytical workloads
`h3-py`	≥ 4.0	H3 hexagonal indexing, resolution selection	`h3.latlng_to_cell` for coordinate-to-cell assignment; `h3.grid_disk` for neighbours (in 3.x these were `geo_to_h3` and `k_ring`)
`duckdb`	≥ 0.9	In-process OLAP queries on GeoParquet	Supports spatial extensions; outperforms PostGIS for read-heavy batch analytics

Sampling Rate Optimization

Sampling Rate Optimization is a first-order cost and quality decision. The trade-off is not simply “more points = better accuracy”:

1 Hz fixed: captures lane changes and tight turns; appropriate for insurance telematics and safety analytics; storage-intensive (~86,400 points/device/day).
0.1 Hz fixed: sufficient for origin-destination studies and daily route reconstruction; misses short stops and sharp manoeuvres.
Adaptive: logs at 1 Hz while decelerating (speed falling faster than 0.5 m/s², measured over a 2 s window) and at 0.05 Hz during constant highway cruising. Reduces raw point count by 60–80% against fixed 1 Hz. Requires the receiver to buffer and timestamp events locally.
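
The adaptive rule can be sketched as a small decision function. It uses the thresholds quoted in the list (0.5 m/s² deceleration, 1 Hz versus 0.05 Hz). The function name and parameters are illustrative, and a real logger would probably also react to heading changes.

```python
def next_interval_s(prev_speed_ms: float, curr_speed_ms: float, window_s: float = 2.0,
                    decel_threshold: float = 0.5,
                    fast_interval_s: float = 1.0, slow_interval_s: float = 20.0) -> float:
    """Choose the next logging interval from the speed change over the last window.

    fast_interval_s = 1 s corresponds to 1 Hz; slow_interval_s = 20 s to 0.05 Hz.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    accel = (curr_speed_ms - prev_speed_ms) / window_s  # m/s², negative when braking
    return fast_interval_s if accel < -decel_threshold else slow_interval_s
```

For example, dropping from 20 m/s to 18 m/s over 2 s is a 1 m/s² deceleration and selects the 1 s interval, while steady cruising selects 20 s.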

For Douglas-Peucker simplification applied post-collection, a spatial tolerance of 5–10 m (with the trace in a metric CRS) preserves the analytically meaningful shape of urban traces while eliminating 50–70% of redundant, near-collinear points. Implement with shapely.simplify(line, tolerance=8.0, preserve_topology=True), where line is the trajectory's LineString.
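
For readers who want to see what the simplification does, here is a plain-Python sketch of the classic Douglas-Peucker algorithm on (x, y) tuples in metres. It is for illustration only and is not how shapely or GEOS implement it, and it ignores timestamps, which a trajectory-aware variant would need to keep.

```python
import math

def _point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def douglas_peucker(points, tolerance):
    """Return the subset of points that keeps the line within tolerance of the original."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]  # explicit stack avoids recursion limits on long traces
    while stack:
        first, last = stack.pop()
        max_dist, index = 0.0, None
        for i in range(first + 1, last):
            d = _point_segment_distance(points[i], points[first], points[last])
            if d > max_dist:
                max_dist, index = d, i
        if index is not None and max_dist > tolerance:
            keep[index] = True  # farthest point is significant; split and repeat
            stack.append((first, index))
            stack.append((index, last))
    return [p for p, k in zip(points, keep) if k]
```

With an 8 m tolerance, a trace that wobbles by a metre collapses to its two endpoints, while a 20 m deviation is preserved.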

Further detail, including downsampling high-frequency GPS tracks without losing path integrity, is covered in the Sampling Rate Optimization section.

Validation and Testing Patterns

Spatiotemporal pipelines fail silently more often than they crash. Build these checks into your CI/CD data quality gates:

Inverse-transform round-trip. After projecting coordinates from EPSG:4326 to a metric CRS and back, the reprojected result should match the original to within 1 mm. Any larger residual indicates a datum shift misconfiguration.

PYTHON

import numpy as np
from pyproj import Transformer

def validate_roundtrip(lats: np.ndarray, lons: np.ndarray, target_epsg: int, tol_m: float = 0.001) -> None:
    """Assert that forward + inverse CRS transform returns to within tol_m metres."""
    lats, lons = np.asarray(lats, dtype=float), np.asarray(lons, dtype=float)
    fwd = Transformer.from_crs(4326, target_epsg, always_xy=True)
    inv = Transformer.from_crs(target_epsg, 4326, always_xy=True)
    xs, ys = fwd.transform(lons, lats)
    lons_rt, lats_rt = inv.transform(xs, ys)
    # Convert degree residuals to approximate metres: ~110.6 km per degree of latitude,
    # ~111.3 km * cos(latitude) per degree of longitude.
    dlon_m = (lons - lons_rt) * 111_320 * np.cos(np.radians(lats))
    dlat_m = (lats - lats_rt) * 110_574
    residual_m = np.hypot(dlon_m, dlat_m)
    if residual_m.max() > tol_m:
        raise AssertionError(f"CRS round-trip residual {residual_m.max():.6f} m exceeds tolerance {tol_m} m")

Velocity plausibility bounds. After noise filtering, assert that no trajectory segment exceeds a physically plausible ceiling for the declared transport mode.

PYTHON

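import geopandas as gpd

# Speed ceilings in m/s (55.6 m/s ≈ 200 km/h, 97.2 m/s ≈ 350 km/h).
MAX_SPEED_MS = {"pedestrian": 3.5, "cyclist": 12.0, "car": 55.6, "train": 97.2}

def assert_velocity_bounds(traj_gdf: gpd.GeoDataFrame, mode: str) -> None:
    """Raise AssertionError if any point's speed_ms exceeds the ceiling for the mode."""
    ceiling = MAX_SPEED_MS.get(mode, 55.6)  # unknown modes fall back to the car ceiling
    n_violations = int((traj_gdf["speed_ms"] > ceiling).sum())
    if n_violations:
        raise AssertionError(f"{n_violations} points exceed {ceiling} m/s ceiling for mode {mode!r}")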

Temporal monotonicity. Timestamps must be strictly increasing within each trajectory. Backward jumps indicate either duplicate records or clock-reset events; both corrupt kinematic derivations.

PYTHON

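import pandas as pd

def assert_temporal_monotonicity(ts: pd.DatetimeIndex, entity_id: str) -> None:
    """Raise AssertionError unless timestamps are strictly increasing."""
    deltas = ts[1:] - ts[:-1]  # element-wise gaps between consecutive timestamps
    if (deltas.total_seconds() <= 0).any():
        raise AssertionError(f"Non-monotonic timestamps detected for entity {entity_id!r}")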

Bounding-box containment. All points in a trajectory should fall within the expected geographic bounding box for the deployment region. Points outside it are almost always projection failures or hardware faults, not genuine observations.
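
A containment check in the same spirit as the validators above might look like the following sketch. It assumes points are (x, y) tuples and the box is given as (min_x, min_y, max_x, max_y) in the same CRS; the function names and the example box around Berlin are illustrative.

```python
def points_outside_bbox(points, bbox):
    """Return the indices of points that fall outside bbox = (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bbox
    return [
        i for i, (x, y) in enumerate(points)
        if not (min_x <= x <= max_x and min_y <= y <= max_y)
    ]

def assert_within_bbox(points, bbox, entity_id):
    """Raise AssertionError if any point lies outside the deployment bounding box."""
    outside = points_outside_bbox(points, bbox)
    if outside:
        raise AssertionError(
            f"{len(outside)} points outside bbox for entity {entity_id!r}; first index {outside[0]}"
        )

# Hypothetical deployment region (lon/lat) and a trace containing two classic faults:
# a (0, 0) "null island" fix and a fix with latitude and longitude swapped.
berlin_bbox = (13.0, 52.3, 13.8, 52.7)
trace = [(13.4, 52.5), (0.0, 0.0), (52.5, 13.4)]
bad_indices = points_outside_bbox(trace, berlin_bbox)
```

Reporting the offending indices rather than only a pass/fail flag makes it quicker to tell a swapped-axis bug from a single faulty receiver.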

GPS Precision & Error Handling — noise filtering, HDOP thresholding, and Kalman smoothing for raw GNSS streams
Coordinate Reference System Mapping — CRS selection, PROJ transformation pipelines, and spatial join optimization
Time-Series Synchronization Strategies — aligning asynchronous device clocks, UTC normalization, and resampling
Sampling Rate Optimization — adaptive logging strategies and Douglas-Peucker simplification
Trajectory Object Design Patterns — movingpandas schema design, serialization, and GeoParquet storage
Movement Pattern Extraction & Trajectory Analysis — the downstream stage that consumes the structures built here
