Gap Filling in Sparse Trajectories

Gap filling is the systematic reconstruction of missing spatiotemporal points between observed GPS fixes. It lets Temporal Aggregation & Window Mapping pipelines operate on continuous, uniformly sampled movement sequences.

Prerequisites Checklist

Complete these setup steps before implementing any interpolation stage. Most of these omissions do not raise errors; the pipeline runs and silently produces wrong output.

Python packages: pandas>=2.0, geopandas>=0.13, numpy, scipy>=1.11, shapely>=2.0, pyproj>=3.5
Projected CRS: All geometries must be in a metric coordinate system before distance or velocity calculations. Prefer a local UTM zone; EPSG:3857 (Web Mercator) is in metres but stretches distances by roughly 1/cos(latitude), so it inflates speeds away from the equator. Never interpolate in WGS84. See coordinate reference system mapping for projection patterns.
UTC-normalized timestamps: Parse to datetime64[ns, UTC] and enforce strict monotonic ordering per entity. Timestamps left in local time show apparent 1-hour gaps or overlaps at daylight saving transitions, which corrupt gap classifiers.
Sampling metadata: Know the device’s expected reporting interval (e.g., 1 s, 5 s, 30 s) so the pipeline can distinguish sensor jitter from genuine signal loss.
Domain constraints: Maximum plausible speed, acceleration limits, and road-network topology if map-matching follows interpolation.
Upstream stage: Raw trace ingestion and GPS precision and error handling must complete before gap filling. Interpolating over unfiltered noise locks artifacts into the cleaned output.
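
As a minimal sketch of the timestamp prerequisite above, the helper below localizes naive local-time strings and converts them to UTC. The function name normalize_timestamps, the column names vehicle_id and ts, and the Europe/Berlin source zone are illustrative assumptions rather than part of the pipeline described here.

```python
import pandas as pd


def normalize_timestamps(df, entity_col="vehicle_id", time_col="ts",
                         source_tz="Europe/Berlin"):
    """Convert timestamps to UTC and enforce strict ordering per entity."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col])
    if ts.dt.tz is None:
        # Times that fall inside a DST transition cannot be resolved: mark as NaT
        ts = ts.dt.tz_localize(source_tz, ambiguous="NaT", nonexistent="NaT")
    out[time_col] = ts.dt.tz_convert("UTC")
    out = out.dropna(subset=[time_col])
    out = out.sort_values([entity_col, time_col])
    out = out.drop_duplicates(subset=[entity_col, time_col])
    # Guard: timestamps must be increasing within every entity
    ordered = out.groupby(entity_col)[time_col].apply(
        lambda s: s.is_monotonic_increasing
    )
    assert ordered.all()
    return out.reset_index(drop=True)


# Three fixes recorded in local time across the 2024 spring-forward transition
raw = pd.DataFrame({
    "vehicle_id": ["a", "a", "a"],
    "ts": ["2024-03-31 01:59:55", "2024-03-31 03:00:00", "2024-03-31 03:00:05"],
})
clean = normalize_timestamps(raw)
gaps = clean["ts"].diff().dt.total_seconds().tolist()
```

Read as naive local time, the first two fixes look 1 h 5 s apart. After conversion to UTC they are 5 s apart, so a gap classifier no longer sees a phantom dropout.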

Gap Taxonomy

Understanding gap origins is prerequisite to choosing the right fill strategy. Applying the wrong method amplifies rather than removes error.

Error Source	Mechanism	Typical Duration	Impact on Downstream	Mitigation
Urban canyon multipath	Signal blockage by dense buildings; receiver loses lock	10–120 s	Velocity spikes at re-acquisition; phantom stops	PCHIP or Kalman fill; snap to road network post-fill
Battery power-saving mode	OS-level GPS duty-cycling at low-battery states	60–600 s	Straight-line phantom paths across impassable terrain	Flag if > 10× interval; inject NaN sentinel
Tunnel transit	Complete satellite occlusion; predictable re-acquisition at exit	30–300 s	Missing route segment; break in trajectory continuity	DR (dead reckoning) with last-known heading + speed if available
Device reboot or hard reset	No telemetry during boot sequence	60–300 s	Trajectory split into disconnected segments	Mark as session boundary; never fill across
Cellular handoff packet loss	MQTT or HTTP retransmit drops a ping during network switch	5–30 s	Micro-gap in otherwise dense trace	Linear or PCHIP micro-fill
Manual flight mode	User-initiated shutdown; no kinematics available	Hours	Route abandonment; do not infer path	Mark as `unrecoverable`; split entity session
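
The tunnel-transit row suggests dead reckoning from the last known heading and speed. The sketch below shows the simplest form of that idea, assuming constant heading and speed across the gap and metric (for example UTM) coordinates; the function name dead_reckon and the sample values are hypothetical. Position error grows with elapsed time, so this only suits short, predictable gaps.

```python
import numpy as np


def dead_reckon(x0, y0, heading_deg, speed_ms, elapsed_s):
    """Propagate a position along a constant heading and speed.

    heading_deg is measured clockwise from grid north; x is easting and
    y is northing in metres.
    """
    theta = np.radians(heading_deg)
    t = np.asarray(elapsed_s, dtype=float)
    x = x0 + speed_ms * t * np.sin(theta)
    y = y0 + speed_ms * t * np.cos(theta)
    return np.column_stack([x, y])


# Last fix before the tunnel: heading due east at 20 m/s
path = dead_reckon(500_000.0, 4_649_776.0, heading_deg=90.0,
                   speed_ms=20.0, elapsed_s=[5, 10, 15])
```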

Pipeline Overview

The seven stages below are deterministic and must execute in sequence. Apart from the explicit sentinel rows that Stage 3 inserts at unrecoverable breaks, no stage may pass NaN coordinates or non-monotonic timestamps to the next.

Stage 1 — Ingest and Normalize

Parse raw CSV or Parquet, parse timestamps to datetime64[ns, UTC], convert to GeoDataFrame, and project to a metric CRS. Drop exact-duplicate timestamps per entity, sort by timestamp, and assert that geometry contains valid Point objects with no NaN coordinates. Fail early — do not let corrupt rows propagate.

Stage 2 — Gap Detection and Classification

Compute per-entity temporal deltas with diff(). Flag intervals exceeding a configurable multiple of expected_interval:

Micro-gaps (1–3×): Sensor jitter or brief packet loss.
Standard gaps (3–10×): Urban canyon dropout, tunnel, or cellular handoff.
Extended gaps (>10×): Device sleep, manual shutdown, or route abandonment.

Stage 3 — Plausibility Filter

Discard gaps whose implied straight-line speed exceeds the domain maximum, or whose duration exceeds the transport-mode recovery threshold. Flag these as unrecoverable, inject an explicit NaN sentinel row, and split the entity’s trajectory at the gap boundary. Never attempt to interpolate across an unrecoverable break.
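
One way to realize the sentinel injection described above is sketched here. It assumes a single-entity frame sorted by time with metric x and y columns, and a boolean mask that is True on the first fix after each unrecoverable gap (never on the very first row); inject_sentinels, is_sentinel, and session are illustrative names.

```python
import numpy as np
import pandas as pd


def inject_sentinels(df, time_col, unrecoverable, coord_cols=("x", "y")):
    """Insert one NaN-coordinate row at the midpoint of each flagged gap."""
    df = df.reset_index(drop=True)
    mask = np.asarray(unrecoverable, dtype=bool)
    prev_t = df[time_col].shift(1)
    midpoint = prev_t + (df[time_col] - prev_t) / 2
    sentinels = pd.DataFrame({time_col: midpoint[mask]})
    for col in coord_cols:
        sentinels[col] = np.nan
    sentinels["is_sentinel"] = True
    out = pd.concat([df.assign(is_sentinel=False), sentinels], ignore_index=True)
    out = out.sort_values(time_col, kind="stable").reset_index(drop=True)
    # The session id increments at each sentinel row, so the sentinel opens
    # the new session and no session spans a break
    out["session"] = out["is_sentinel"].cumsum()
    return out


fixes = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:05", "2024-01-01 00:10:05"],
        utc=True,
    ),
    "x": [0.0, 50.0, 900.0],
    "y": [0.0, 0.0, 0.0],
})
gap_s = fixes["ts"].diff().dt.total_seconds()
flagged = inject_sentinels(fixes, "ts", gap_s > 50)  # 10 x a 5 s interval
```

Downstream interpolation can then group by session and drop sentinel rows, which guarantees that no fill crosses the break.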

Stage 4 — Temporal Interpolation

Resample the time axis to the target frequency, for example by building a uniform per-entity grid with pd.date_range as the walkthrough below does. For smooth, physically plausible paths, prefer PCHIP (Piecewise Cubic Hermite Interpolating Polynomial), which preserves local monotonicity and avoids the derivative-matching overshoot of cubic splines. For noisy or multi-modal movement, use Kalman filter interpolation, which propagates positional uncertainty across the gap instead of imposing geometric constraints.
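
The overshoot difference is easy to reproduce on synthetic data. The sketch below uses a made-up one-dimensional trace in which a vehicle drives at 10 m/s, stops at 100 m, and then reports nothing for 30 s; the values are illustrative only.

```python
import numpy as np
from scipy.interpolate import CubicSpline, PchipInterpolator

# Position along one axis (metres): steady motion, then a stop at 100 m
# with a 30 s reporting gap between t=10 and t=40
t = np.array([0.0, 5.0, 10.0, 40.0, 45.0, 50.0])
x = np.array([0.0, 50.0, 100.0, 100.0, 100.0, 100.0])

grid = np.linspace(0.0, 50.0, 51)  # 1 s output grid
spline_x = CubicSpline(t, x)(grid)
pchip_x = PchipInterpolator(t, x)(grid)

# The spline carries the pre-stop velocity into the gap and overshoots the
# stop position; PCHIP stays within the range of the observed values
print("cubic spline max:", spline_x.max())
print("pchip max:", pchip_x.max())
```

The spline's excursion past 100 m inside the gap is exactly the kind of artifact that later shows up as a phantom speed spike when velocity is differenced from the filled positions.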

Stage 5 — Kinematic Clipping

Compute instantaneous velocity and acceleration from interpolated metric coordinates. Clip against domain thresholds (e.g., v_max = 120 km/h, a_max = 3.5 m/s²). Retain both raw interpolated and clipped values for auditability. Clipping is necessary because cubic methods can produce phantom speed spikes at gap boundaries when anchor-point velocities differ sharply.
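
The walkthrough later in this article clips speed only. A minimal sketch that also derives acceleration and keeps raw and clipped values side by side might look like the following; the function name kinematics and the output column names are assumptions, and the defaults reuse the example thresholds above.

```python
import numpy as np
import pandas as pd


def kinematics(df, time_col="ts", x_col="x", y_col="y", v_max=33.3, a_max=3.5):
    """Add raw and clipped speed/acceleration columns from metric coordinates."""
    out = df.copy()
    dt = out[time_col].diff().dt.total_seconds()
    speed = np.hypot(out[x_col].diff(), out[y_col].diff()) / dt
    accel = speed.diff() / dt
    out["speed_raw_ms"] = speed
    out["accel_raw_ms2"] = accel
    out["speed_clipped_ms"] = speed.clip(upper=v_max)
    out["accel_clipped_ms2"] = accel.clip(lower=-a_max, upper=a_max)
    # Flag rather than drop, so the audit trail stays complete
    out["kinematic_violation"] = (speed > v_max) | (accel.abs() > a_max)
    return out


trace = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=4, freq="5s", tz="UTC"),
    "x": [0.0, 50.0, 100.0, 400.0],  # last step implies 60 m/s
    "y": [0.0, 0.0, 0.0, 0.0],
})
checked = kinematics(trace)
```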

Stage 6 — Spatial Smoothing and Topological Correction

Apply a Savitzky-Golay filter to reduce micro-jitter introduced during resampling. If a road network is available, snap interpolated points to the nearest valid edge within a tolerance radius (typically 15 m for urban environments). This prevents topologically impossible cross-block shortcuts or off-road segments that violate drivable infrastructure.
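
A minimal sketch of the smoothing step with scipy.signal.savgol_filter is shown below. The window length of 7 samples and polynomial order 2 are illustrative choices, not recommendations, and the jitter is a synthetic alternating offset so that the example is deterministic.

```python
import numpy as np
from scipy.signal import savgol_filter

t = np.arange(0.0, 120.0, 5.0)          # uniform 5 s grid, 24 samples
x_true = 12.0 * t                       # straight-line motion at 12 m/s
jitter = 1.5 * (-1.0) ** np.arange(t.size)
x_jitter = x_true + jitter              # +/- 1.5 m alternating micro-jitter

# Fit a quadratic in each 7-sample window; apply to x and y independently
x_smooth = savgol_filter(x_jitter, window_length=7, polyorder=2)

err_before = np.abs(x_jitter - x_true).mean()
err_after = np.abs(x_smooth - x_true).mean()
```

Because the filter fits a low-order polynomial per window, it leaves an exactly linear path unchanged while damping sample-to-sample jitter. The window must stay short relative to real manoeuvres, or genuine turns get flattened.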

Stage 7 — Validate and Export

Recompute temporal deltas to verify uniform spacing. Assert no NaN coordinates remain in filled segments. Attach metadata columns: is_interpolated (bool), gap_duration_s (float), interpolation_method (str). Export to Parquet with explicit schema enforcement. This output becomes the reliable input for Rolling Statistics for Mobility Metrics.
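
The checks in this stage can be sketched as a function that returns a list of problems instead of raising on the first one. The name validate_filled and its parameters are hypothetical; group_cols should include a session identifier if trajectories were split at unrecoverable breaks, since spacing is only uniform within a session.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_float_dtype


def validate_filled(df, group_cols, time_col="ts",
                    coord_cols=("lat", "lon"), target_freq="5s"):
    """Return a list of contract violations; an empty list means valid."""
    problems = []
    step = pd.Timedelta(target_freq)
    for key, grp in df.groupby(list(group_cols)):
        deltas = grp[time_col].diff().dropna()
        if not deltas.eq(step).all():
            problems.append(f"{key}: non-uniform spacing")
    if not is_bool_dtype(df["is_interpolated"]):
        problems.append("is_interpolated is not bool")
    elif df.loc[df["is_interpolated"], list(coord_cols)].isna().any().any():
        problems.append("NaN coordinates in filled rows")
    if not is_float_dtype(df["gap_duration_s"]):
        problems.append("gap_duration_s is not float")
    return problems


good = pd.DataFrame({
    "vehicle_id": ["a"] * 3,
    "ts": pd.date_range("2024-01-01", periods=3, freq="5s", tz="UTC"),
    "lat": [52.0, 52.0001, 52.0002],
    "lon": [13.0, 13.0001, 13.0002],
    "is_interpolated": [False, True, False],
    "gap_duration_s": [0.0, 10.0, 10.0],
    "interpolation_method": ["original", "pchip", "original"],
})
bad = good.drop(index=1)  # removes one grid step, breaking uniform spacing

good_problems = validate_filled(good, ["vehicle_id"])
bad_problems = validate_filled(bad, ["vehicle_id"])
```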

Implementation Walkthrough

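The function below implements stages 1–5 in one function, looping over entities and using vectorized operations within each one. It handles empty frames, single-fix entities, and duplicate timestamps explicitly, and assumes the input coordinates are WGS84 latitude and longitude.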

PYTHON

import pandas as pd
import numpy as np
import geopandas as gpd
from scipy.interpolate import PchipInterpolator
from shapely.geometry import Point
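
# NOTE: EPSG:3857 (Web Mercator) is in metres but stretches distances by roughly
# 1/cos(latitude). For speed thresholds away from the equator, substitute a local
# UTM zone (for example the result of GeoDataFrame.estimate_utm_crs()).
METRIC_CRS = "EPSG:3857"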


def fill_trajectory_gaps(
    df: pd.DataFrame,
    entity_col: str,
    time_col: str,
    lat_col: str,
    lon_col: str,
    expected_interval_s: float,
    max_gap_multiplier: float = 10.0,
    v_max_ms: float = 33.3,  # 120 km/h in m/s
    target_freq: str = "5s",
) -> pd.DataFrame:
    """
    Detect, classify, and fill gaps in a GPS trajectory DataFrame.

    Parameters
    ----------
    df : Input DataFrame with raw GPS fixes.
    entity_col : Column identifying unique moving entities (e.g. 'vehicle_id').
    time_col : Timestamp column; must be timezone-aware or will be coerced to UTC.
    lat_col, lon_col : WGS84 latitude and longitude column names.
    expected_interval_s : Device's nominal reporting interval in seconds.
    max_gap_multiplier : Gaps exceeding this multiple of expected_interval_s are unrecoverable.
    v_max_ms : Maximum plausible speed in metres per second.
    target_freq : pandas offset alias for the output time grid (e.g. '5s', '10s').

    Returns
    -------
    pd.DataFrame with uniform temporal spacing per entity, added columns:
    is_interpolated (bool), gap_duration_s (float), interpolation_method (str).
    """
    if df.empty:
        return df.copy()

    required = {entity_col, time_col, lat_col, lon_col}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Input DataFrame missing columns: {missing}")

    # --- Stage 1: Normalize timestamps and project to metric CRS ---
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col], utc=True)
    df = df.sort_values([entity_col, time_col])
    # Drop exact duplicate (entity, timestamp) rows
    df = df.drop_duplicates(subset=[entity_col, time_col])

    gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df[lon_col], df[lat_col], crs="EPSG:4326"),
    )
    # All metric calculations MUST use a projected CRS, not raw WGS84
    gdf = gdf.to_crs(METRIC_CRS)
    gdf["_x"] = gdf.geometry.x
    gdf["_y"] = gdf.geometry.y

    max_gap_s = expected_interval_s * max_gap_multiplier

    all_output: list[pd.DataFrame] = []

    for entity_id, group in gdf.groupby(entity_col, sort=False):
        group = group.reset_index(drop=True)

        if len(group) < 2:
            # Cannot interpolate with fewer than 2 fixes
            group["is_interpolated"] = False
            group["gap_duration_s"] = 0.0
            group["interpolation_method"] = "none"
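            # Match the main branch: export in WGS84 without helper columns,
            # otherwise the final concat mixes two different CRSs
            all_output.append(group.to_crs("EPSG:4326").drop(columns=["_x", "_y"]))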
            continue

        # --- Stage 2: Gap detection ---
        # Seconds since the previous fix; the first fix of an entity gets 0
        delta_s = group[time_col].diff().dt.total_seconds().fillna(0.0)
        group["_delta_s"] = delta_s
        group["_gap_tier"] = pd.cut(
            delta_s,
            bins=[-np.inf, expected_interval_s * 3, expected_interval_s * 10, np.inf],
            labels=["micro", "standard", "extended"],
        )

        # --- Stage 3: Mark unrecoverable gaps ---
        # Implied speed check: distance / gap_duration > v_max_ms → unrecoverable
        dx = group["_x"].diff().fillna(0.0)
        dy = group["_y"].diff().fillna(0.0)
        dist_m = np.sqrt(dx**2 + dy**2)
        implied_speed = dist_m / delta_s.replace(0, np.nan)
        unrecoverable = (delta_s > max_gap_s) | (implied_speed > v_max_ms)

        # Build session segments split at unrecoverable boundaries
        group["_session"] = unrecoverable.cumsum()

        # --- Stage 4: Interpolate within each session ---
        session_frames: list[pd.DataFrame] = []
        for session_id, seg in group.groupby("_session", sort=True):
            seg = seg.reset_index(drop=True)
            if len(seg) < 2:
                seg["is_interpolated"] = False
                seg["gap_duration_s"] = 0.0
                seg["interpolation_method"] = "none"
                session_frames.append(seg)
                continue

            t0 = seg[time_col].iloc[0]
            t1 = seg[time_col].iloc[-1]
            new_index = pd.date_range(start=t0, end=t1, freq=target_freq, tz="UTC")

            # Original timestamps as float seconds for interpolator
            orig_t = (seg[time_col] - t0).dt.total_seconds().values
            new_t = (new_index - t0).total_seconds().values

            try:
                interp_x = PchipInterpolator(orig_t, seg["_x"].values)(new_t)
                interp_y = PchipInterpolator(orig_t, seg["_y"].values)(new_t)
                method = "pchip"
            except ValueError:
                # Fallback: linear interpolation when PCHIP preconditions fail
                interp_x = np.interp(new_t, orig_t, seg["_x"].values)
                interp_y = np.interp(new_t, orig_t, seg["_y"].values)
                method = "linear_fallback"

            interp_df = pd.DataFrame({
                time_col: new_index,
                "_x": interp_x,
                "_y": interp_y,
                entity_col: entity_id,
            })

            # Mark which rows were synthesized. Compare against the tz-aware
            # Series directly; a set of naive numpy datetimes may not match.
            interp_df["is_interpolated"] = ~interp_df[time_col].isin(seg[time_col])
            interp_df["interpolation_method"] = np.where(
                interp_df["is_interpolated"], method, "original"
            )

            # Give every output row the duration of the observed interval that
            # contains it, so interpolated rows carry their gap length.
            enclosing = np.searchsorted(orig_t, new_t, side="left")
            interp_df["gap_duration_s"] = seg["_delta_s"].to_numpy()[enclosing]

            session_frames.append(interp_df)

        entity_result = pd.concat(session_frames, ignore_index=True)

        # --- Stage 5: Kinematic clipping (metric CRS) ---
        dt = entity_result[time_col].diff().dt.total_seconds().replace(0, np.nan)
        vx = entity_result["_x"].diff() / dt
        vy = entity_result["_y"].diff() / dt
        speed = np.sqrt(vx**2 + vy**2)
        # Keep both raw and clipped speeds for auditability; rows are never dropped
        entity_result["speed_clipped"] = speed.clip(upper=v_max_ms)
        entity_result["speed_raw_ms"] = speed

        # Reconstruct geometry in metric CRS then reproject to WGS84 for export
        entity_result["geometry"] = [
            Point(x, y) for x, y in zip(entity_result["_x"], entity_result["_y"])
        ]
        gdf_out = gpd.GeoDataFrame(entity_result, geometry="geometry", crs=METRIC_CRS)
        gdf_out = gdf_out.to_crs("EPSG:4326")
        gdf_out[lat_col] = gdf_out.geometry.y.round(6)
        gdf_out[lon_col] = gdf_out.geometry.x.round(6)

        all_output.append(gdf_out.drop(columns=["_x", "_y", "_delta_s", "_gap_tier",
                                                  "_session"], errors="ignore"))

    return pd.concat(all_output, ignore_index=True)

Reliability notes:

The PchipInterpolator call is wrapped in try/except ValueError to catch duplicate timestamps or insufficient anchor points; it degrades gracefully to linear interpolation with an explicit method tag.
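All velocity calculations use metric-projected coordinates. Computing speed from raw EPSG:4326 degrees is wrong because a degree of longitude shrinks with cos(latitude). EPSG:3857 also inflates speeds away from the equator, so switch METRIC_CRS to a local UTM zone when thresholds matter.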
The _session column propagates unrecoverable-gap splits through the entire entity without a second pass.
Schema validation with pandera or pydantic should be applied at the function boundary in production: assert is_interpolated is bool, gap_duration_s is float, and geometry is a valid Point.

Mathematical Grounding

PCHIP interpolation constructs a piecewise cubic polynomial on each sub-interval [t_k, t_{k+1}]. The derivative d_k at each interior knot is chosen from the two adjacent finite-difference slopes: if they share a sign, d_k takes that sign (a weighted harmonic mean of the two); if they differ in sign or either is zero, d_k = 0. This keeps the interpolant monotone wherever the data are monotone and prevents the derivative-matching overshoot of classic cubic splines. Formally, given anchor values (t_k, p_k) and derivatives d_k:

TEXT

H(t) = h₀₀(τ)·p_k + h₁₀(τ)·Δt·d_k + h₀₁(τ)·p_{k+1} + h₁₁(τ)·Δt·d_{k+1}

where Δt = t_{k+1} − t_k, τ = (t − t_k) / Δt, and h₀₀, h₁₀, h₀₁, h₁₁ are the cubic Hermite basis polynomials. Monotonicity-preserving derivative selection was introduced by Fritsch & Carlson (1980); scipy.interpolate.PchipInterpolator implements a scheme of this family.

For spatial coordinates, H(t) is applied independently to the projected X and Y components. The combined path is not geometrically constrained to follow a geodesic — topological correction (Stage 6) handles alignment with the road network.
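
To make the Hermite form concrete, the sketch below writes out the four basis polynomials and checks one segment against scipy. The helper name hermite_segment and the sample knots are illustrative; the knot derivatives are read back from PchipInterpolator rather than recomputed.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def hermite_segment(t, t_k, t_k1, p_k, p_k1, d_k, d_k1):
    """Evaluate the cubic Hermite form H(t) on one interval [t_k, t_k1]."""
    dt = t_k1 - t_k
    tau = (t - t_k) / dt
    h00 = 2 * tau**3 - 3 * tau**2 + 1
    h10 = tau**3 - 2 * tau**2 + tau
    h01 = -2 * tau**3 + 3 * tau**2
    h11 = tau**3 - tau**2
    return h00 * p_k + h10 * dt * d_k + h01 * p_k1 + h11 * dt * d_k1


t_knots = np.array([0.0, 5.0, 10.0, 40.0])
p = np.array([0.0, 50.0, 100.0, 100.0])

pchip = PchipInterpolator(t_knots, p)
d = pchip.derivative()(t_knots)  # knot derivatives chosen by PCHIP

t_eval = np.linspace(5.0, 10.0, 11)
manual = hermite_segment(t_eval, 5.0, 10.0, p[1], p[2], d[1], d[2])
reference = pchip(t_eval)
```

At the knot t = 10 the slope before is 10 m/s and the slope after is 0, so PCHIP sets the derivative there to zero. That is what stops the curve from running past 100 m into the gap.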

Calibration and Parameter Tuning

Threshold values depend heavily on transport mode, device hardware, and analytical requirements. Use this table as a starting point and validate against labeled gaps in your own data.

Transport Mode	Expected Interval	Micro-gap Ceiling	Max Recoverable Gap	v_max (m/s)	a_max (m/s²)
Urban passenger car	5 s	15 s	2 min	33 (120 km/h)	3.5
Long-haul truck	30 s	90 s	15 min	30 (108 km/h)	1.5
Urban cyclist	5 s	15 s	3 min	14 (50 km/h)	2.5
Pedestrian	1 s	3 s	5 min	5 (18 km/h)	1.5
Maritime vessel	60 s	180 s	30 min	15 (54 km/h)	0.5
Delivery drone	1 s	3 s	1 min	25 (90 km/h)	5.0

Tuning guidance:

Start with max_gap_multiplier = 10 and halve it if your domain has reliable, high-frequency hardware. Increase it for maritime or aviation use-cases where extended gaps are operationally normal.
Set v_max_ms conservatively — 10% below the physical maximum — to catch interpolation artifacts at gap edges without flagging legitimate acceleration events.
For compliance-sensitive applications (driver hours-of-service, fleet insurance), always err toward flagging rather than filling. Downstream aggregations can explicitly down-weight or exclude is_interpolated == True rows.
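
The calibration table can be carried as configuration instead of scattered constants. The sketch below encodes the table rows as a frozen dataclass; the class name ModeThresholds, the dictionary keys, and the derived max_gap_multiplier property are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeThresholds:
    expected_interval_s: float
    micro_gap_ceiling_s: float
    max_recoverable_gap_s: float
    v_max_ms: float
    a_max_ms2: float

    @property
    def max_gap_multiplier(self) -> float:
        """Max recoverable gap expressed as a multiple of the interval."""
        return self.max_recoverable_gap_s / self.expected_interval_s


# Values copied from the starting-point table above
THRESHOLDS = {
    "urban_passenger_car": ModeThresholds(5, 15, 120, 33, 3.5),
    "long_haul_truck": ModeThresholds(30, 90, 900, 30, 1.5),
    "urban_cyclist": ModeThresholds(5, 15, 180, 14, 2.5),
    "pedestrian": ModeThresholds(1, 3, 300, 5, 1.5),
    "maritime_vessel": ModeThresholds(60, 180, 1800, 15, 0.5),
    "delivery_drone": ModeThresholds(1, 3, 60, 25, 5.0),
}

car = THRESHOLDS["urban_passenger_car"]
print(car.max_gap_multiplier, car.v_max_ms)
```

Deriving the multiplier from the table makes one thing visible: the table's recoverable gaps correspond to multipliers well above the default of 10 for several modes, so the default and the per-mode ceilings should be calibrated together.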

Integration and Compatibility

Filled trajectories flow directly into adjacent parts of the analytics stack:

Dynamic Time-Binning Strategies: Gap-filled traces with uniform temporal spacing make density-driven bin-edge computation reliable. Without filling, sparse-segment density estimates collapse and produce unrealistically wide bins.
Rolling Statistics for Mobility Metrics: Windowed aggregations over speed, heading variance, or dwell count require a consistent time grid. Gaps in the input break rolling window alignment and produce NaN-inflated statistics.
Stay-point detection (DBSCAN): Pass is_interpolated through to the stay-point algorithm and exclude synthetic points from dwell-time accumulation, otherwise filled segments near frequent locations generate phantom stops.
Kalman-based map matching: PCHIP-filled traces provide a good prior path for Kalman map-matching, but the filter’s own prediction step replaces geometric interpolation across extended gaps — see interpolating missing GPS points with Kalman filters for the fused approach.
Time-series synchronization strategies: Multi-source telemetry fusion (CAN-bus, cellular, Bluetooth) benefits from a shared uniform time grid — gap filling on each source before the join reduces O(n log n) nearest-neighbour matching to O(n).
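
The Kalman item in the list above notes that the filter's prediction step takes over across extended gaps. A minimal sketch of that prediction step alone is given below, assuming a constant-velocity motion model with white-noise acceleration; cv_predict, the state layout [x, y, vx, vy], and accel_std are illustrative, and the measurement update is deliberately omitted.

```python
import numpy as np


def cv_predict(state, cov, dt, accel_std=1.0):
    """One constant-velocity prediction step for state [x, y, vx, vy]."""
    F = np.eye(4)
    F[0, 2] = F[1, 3] = dt
    # Process noise from unmodelled acceleration (assumed accel_std in m/s^2)
    G = np.array([[0.5 * dt**2, 0.0],
                  [0.0, 0.5 * dt**2],
                  [dt, 0.0],
                  [0.0, dt]])
    Q = accel_std**2 * (G @ G.T)
    return F @ state, F @ cov @ F.T + Q


state = np.array([0.0, 0.0, 10.0, 0.0])   # heading east at 10 m/s
cov = np.diag([25.0, 25.0, 1.0, 1.0])     # 5 m position sigma, 1 m/s speed sigma

sigmas = []
for _ in range(6):                         # six 5 s steps across a 30 s gap
    state, cov = cv_predict(state, cov, dt=5.0)
    sigmas.append(float(np.sqrt(cov[0, 0])))
```

The predicted position advances along the last velocity while the position sigma grows at every step. That growing uncertainty is the information a purely geometric fill such as PCHIP cannot provide.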

FAQ

What gap duration is safe to interpolate vs. flag as unrecoverable?

The threshold depends on transport mode. For urban delivery fleets, gaps exceeding 15 minutes typically indicate device sleep or route abandonment and should be flagged. For pedestrian traces, even 5-minute gaps can produce implausible straight-line fills. Use 10× the expected sampling interval as a starting rule, then calibrate against domain-specific maximum sustained speeds.

Should I interpolate in WGS84 or a projected CRS?

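Always project to a metric CRS before any distance-based interpolation or velocity calculation. Treating WGS84 degrees as planar units is badly wrong for east-west distances: a degree of longitude shrinks with cos(latitude), so at 45° it is about 29% shorter than at the equator. Prefer a local UTM zone; EPSG:3857 is in metres but stretches distances by roughly 1/cos(latitude). Project at ingestion, interpolate in metric space, then reproject to WGS84 only for export or visualization.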
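
To get a feel for the size of these distortions, the sketch below uses a spherical-Earth approximation; the function names are hypothetical and the figures are approximate, not survey-grade.

```python
import math

EQUATOR_M_PER_DEG = 111_320.0  # approx. length of one degree at the equator


def lon_degree_length_m(lat_deg):
    """Approximate ground length of one degree of longitude (spherical model)."""
    return EQUATOR_M_PER_DEG * math.cos(math.radians(lat_deg))


def web_mercator_scale(lat_deg):
    """Approximate factor by which EPSG:3857 stretches distances."""
    return 1.0 / math.cos(math.radians(lat_deg))


for lat in (0, 30, 45, 60):
    print(lat, round(lon_degree_length_m(lat)), round(web_mercator_scale(lat), 3))
```

At 60° latitude a degree of longitude covers about half the ground distance it does at the equator, and Web Mercator distances come out about twice too long. A speed threshold applied in either system without correction would be off by the same factor.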

Why does cubic spline interpolation produce speed spikes at gap boundaries?

Cubic splines enforce C² continuity by matching first and second derivatives at knots. When anchor points on either side of a gap have significantly different velocities, which is common after sharp turns or sudden stops, the spline overshoots to satisfy those derivative constraints. PCHIP avoids this by preserving local monotonicity, at the cost of being only C¹ continuous. For noisy traces, Kalman prediction is preferable because it propagates uncertainty rather than imposing geometric constraints.

How do I handle entity IDs with only one or two GPS fixes?

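An entity with a single valid fix cannot be interpolated: route it to a low-confidence output with an explicit flag, because attempting interpolation on a single-fix trace raises scipy errors or yields a degenerate zero-length segment. An entity with exactly two fixes can be filled, but PCHIP reduces to a straight line between them, so treat the result as a linear fill and apply the same gap-duration and speed checks as for any other gap.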
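
A small sketch of that routing step, assuming a long-format frame with one row per fix; split_by_fix_count and the low_confidence column are illustrative names.

```python
import pandas as pd


def split_by_fix_count(df, entity_col, min_fixes=2):
    """Separate entities with enough fixes from those routed to low confidence."""
    counts = df.groupby(entity_col)[entity_col].transform("size")
    usable = df[counts >= min_fixes].copy()
    low_conf = df[counts < min_fixes].copy()
    low_conf["low_confidence"] = True
    return usable, low_conf


fixes = pd.DataFrame({
    "vehicle_id": ["a", "a", "a", "b"],
    "ts": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:05",
         "2024-01-01 00:00:10", "2024-01-01 00:00:00"],
        utc=True,
    ),
})
usable, low_conf = split_by_fix_count(fixes, "vehicle_id")
```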

How does gap filling interact with downstream stay-point detection?

Interpolated segments artificially inflate position counts, which can create phantom stay-points if the filled trajectory concentrates points near a location. Pass is_interpolated through to stay-point algorithms and exclude or down-weight synthetic points when computing dwell-time thresholds.
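
One possible way to exclude synthetic points from dwell-time accumulation is sketched below. It counts an interval toward dwell time only when both of its endpoints are observed fixes inside the radius. The function dwell_seconds, the metric x and y columns, and the candidate centre are assumptions; a real stay-point detector would also have to find the candidate locations.

```python
import numpy as np
import pandas as pd


def dwell_seconds(df, center_xy, radius_m, time_col="ts", skip_interpolated=True):
    """Sum the intervals whose two endpoints both lie within radius_m."""
    dist = np.hypot(df["x"] - center_xy[0], df["y"] - center_xy[1])
    usable = dist <= radius_m
    if skip_interpolated:
        usable = usable & ~df["is_interpolated"]
    dt = df[time_col].diff().dt.total_seconds()
    # An interval counts only if the current and the previous row are usable
    counted = usable & usable.shift(1, fill_value=False)
    return float(dt[counted].sum())


# Two observed fixes, a 50 s filled gap, then one observed fix, all within 20 m
trace = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=13, freq="5s", tz="UTC"),
    "x": np.linspace(0.0, 4.0, 13),
    "y": np.zeros(13),
    "is_interpolated": [False, False] + [True] * 10 + [False],
})
naive = dwell_seconds(trace, (2.0, 0.0), 20.0, skip_interpolated=False)
observed_only = dwell_seconds(trace, (2.0, 0.0), 20.0)
```

With synthetic rows included, the filled gap contributes a full minute of apparent dwell. With them excluded, only the 5 s between the two genuinely observed fixes counts, which would fall below a typical dwell threshold.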
