Why does gpd.sjoin() run out of memory on large trajectory datasets?

gpd.sjoin() is index-backed, but it processes the entire input in a single call: the candidate pairs and the merged attribute table for every row are held in memory at once. On datasets with 1M+ points and hundreds of zones, those intermediates can exhaust RAM before the join returns. The chunked STRtree pattern described here bounds peak memory to O(chunk_size + zone_count) regardless of trajectory length.

Can I use EPSG:4326 for the join if both layers are in WGS84?

You can, but you should not for analytical joins. Bounding-box checks in degree space produce asymmetric extents that grow with latitude, so an STRtree built in EPSG:4326 is less efficient than one built in a local metric CRS. More importantly, any downstream distance or area calculation will be wrong unless you project first.

How do I choose between within and contains for point-in-polygon tests?

within(point, polygon) returns True when the point lies in the polygon's interior; a point exactly on the boundary returns False. contains is the same relationship with the arguments swapped, tested from the polygon's perspective: contains(polygon, point). The two always agree, so for trajectory points against zone polygons within is the natural choice because the points are the left-hand input.
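The relationship between the two predicates can be checked directly. The snippet below is a minimal sketch using Shapely 2.0+; the square zone and the two points are made-up illustration values.

```python
from shapely import Point, box

# Hypothetical 10 x 10 zone and two test points.
zone = box(0, 0, 10, 10)
inside = Point(5, 5)
on_edge = Point(0, 5)   # lies exactly on the zone boundary

# Same relationship, phrased from each side.
inside_within = inside.within(zone)
inside_contains = zone.contains(inside)

# Boundary points satisfy neither predicate.
edge_within = on_edge.within(zone)
edge_contains = zone.contains(on_edge)

print(inside_within, inside_contains, edge_within, edge_contains)
```

Both calls agree for every input; only the argument order differs.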

What chunk_size should I use for a 5M-row trajectory dataset?

Start at 50,000 rows per chunk. On a worker with 16 GB RAM and a zone layer under 10,000 polygons, this holds peak memory below 4 GB. Increase to 100,000 if your worker has 32 GB and the zone layer is small. Decrease to 10,000–20,000 if you observe swap activity or GC pauses in profiling.

Does the STRtree need to be rebuilt when zone boundaries change?

Yes. The STRtree indexes the geometry values it was constructed with, so any change to a zone polygon — addition, deletion, or boundary edit — requires a full rebuild. Tree construction is typically under 2% of total join time for static operational boundaries, so rebuilding on zone update events is the correct pattern.
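One way to rebuild only when geometry actually changes is to fingerprint the zone geometries and compare digests. This is a sketch under assumptions, not part of the pipeline described here: ZoneTreeCache and zone_fingerprint are hypothetical helper names, and hashing WKB is just one possible change detector.

```python
import hashlib

import numpy as np
import shapely
from shapely import STRtree


def zone_fingerprint(zone_geoms) -> str:
    """Hash the WKB of every zone so any geometry edit changes the digest."""
    digest = hashlib.sha256()
    for wkb in shapely.to_wkb(np.asarray(zone_geoms, dtype=object)):
        digest.update(len(wkb).to_bytes(8, "little"))  # length prefix avoids ambiguity
        digest.update(wkb)
    return digest.hexdigest()


class ZoneTreeCache:
    """Hypothetical helper: keeps one STRtree and rebuilds it on geometry change."""

    def __init__(self):
        self._fingerprint = None
        self._tree = None

    def get(self, zone_geoms) -> STRtree:
        fingerprint = zone_fingerprint(zone_geoms)
        if fingerprint != self._fingerprint:
            self._tree = STRtree(zone_geoms)  # full rebuild
            self._fingerprint = fingerprint
        return self._tree


zones_v1 = np.array([shapely.box(0, 0, 1, 1), shapely.box(2, 0, 3, 1)], dtype=object)
zones_v2 = np.array([shapely.box(0, 0, 1, 1), shapely.box(2, 0, 3, 2)], dtype=object)

cache = ZoneTreeCache()
tree_a = cache.get(zones_v1)
tree_b = cache.get(zones_v1)  # unchanged geometry: same tree object is reused
tree_c = cache.get(zones_v2)  # edited boundary: a new tree is built
```

Attribute-only updates leave the fingerprint unchanged, so they do not trigger a rebuild.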

Optimizing spatial joins for trajectory-to-zone matching

Matching GPS trajectory points to operational zones — census tracts, delivery polygons, traffic analysis zones — degenerates into an O(n × m) problem when handled naively. The fix is a four-stage pipeline: project both layers to a metric CRS, build an STRtree once on the zone layer, pre-filter using bounding-box intersection, then apply vectorized exact validation only to the surviving candidates. On multi-million-row mobility datasets, this drops join latency from hours to seconds.

Why naive joins fail at scale

The root cause is unindexed geometry comparisons against geographic coordinates. Coordinate reference system mapping underpins every spatial predicate: when trajectory points and zone polygons live in WGS84 (EPSG:4326), bounding boxes are measured in decimal degrees. A degree of longitude spans ~69 km at 52° N but ~110 km at 10° N, while a degree of latitude stays near 111 km everywhere. That asymmetry makes axis-aligned bounding box (AABB) pre-filtering less efficient: a bounding-box filter never drops a true match, but distorted envelopes admit more false-positive candidates and force more exact geometry calls than necessary.
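The degree lengths quoted above follow from basic spherical geometry. The sketch below assumes a spherical Earth with a mean radius of 6371 km, which is accurate enough for this comparison; km_per_degree_longitude is an illustrative helper name.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean spherical radius; ignores ellipsoidal flattening


def km_per_degree_longitude(lat_deg: float) -> float:
    """East-west length of one degree of longitude at the given latitude."""
    return math.radians(1.0) * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))


for lat in (0, 10, 52):
    km = km_per_degree_longitude(lat)
    print(f"{lat:>2} deg N: {km:6.1f} km per degree of longitude")
```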

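The second failure mode is memory exhaustion. Any join that expands point-zone pairs before filtering pays for the full cross product: 2M trajectory points against 500 zone polygons is 10^9 pairs, or about 8 GB at 8 bytes per pair, before any predicate runs. gpd.sjoin() is index-backed and avoids the full cross product, but it still processes the whole input in one call, so the candidate pairs and the merged attribute table for every row are held in memory at once. Without chunking, this can trigger OOM kills on standard cloud instances.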
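The 8 GB figure is consistent with simple cross-product arithmetic. The estimate below assumes 8 bytes per point-zone pair (one 64-bit value), which is an illustrative lower bound rather than a measured number, and compares the unchunked case with a single 50,000-row chunk.

```python
def pair_bytes(n_points: int, n_zones: int, bytes_per_pair: int = 8) -> int:
    """Worst-case size of a structure holding one value per point-zone pair."""
    return n_points * n_zones * bytes_per_pair


full_join = pair_bytes(2_000_000, 500)  # whole dataset at once
one_chunk = pair_bytes(50_000, 500)     # worst case for a single chunk

print(f"full cross product: {full_join / 1e9:.1f} GB")
print(f"one 50k-row chunk:  {one_chunk / 1e9:.1f} GB")
```

Chunking caps the worst case at the chunk's share of the cross product, independent of total trajectory length.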

These problems sit squarely within GPS precision & error handling as well — trajectory points that carry positional noise should be cleaned before the join, not after, because a drifted point that falls outside a zone boundary will silently produce a null match that is indistinguishable from a genuine non-match.

Core optimization pipeline

Four deterministic steps eliminate both the CRS distortion and memory problems:

Project to a local metric CRS. Reproject both layers to a UTM zone or another metric projection covering your analysis region before any geometry operation. This makes AABB extents uniform across latitude and enables accurate distance-based predicates.
Precompute an STRtree on the zone layer. Build the index once, outside the loop. The Sort-Tile-Recursive tree partitions the 2D plane into balanced rectangles and returns candidate zone indices in O(log m) time per query.
Bounding-box pre-filter with intersects. Query the tree with the intersects predicate to retrieve candidate zones for each trajectory point. The tree compares envelopes first, a fast min/max AABB comparison, and evaluates the predicate only on pairs whose envelopes overlap; only the surviving pairs move on to the exact within() check.
Vectorized exact validation in fixed-size chunks. Apply Shapely 2.0+ within() on NumPy geometry arrays across trajectory batches of fixed row count. This caps peak memory and keeps CPU cache locality high.
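The two filtering stages can be seen in isolation on a toy example. This is a minimal sketch assuming Shapely 2.0+; the zones and points are made up, and the tree is queried without a predicate here so that the envelope-only stage is visible on its own.

```python
import numpy as np
import shapely
from shapely import STRtree

# Hypothetical zones: a unit square and a right triangle whose envelope
# covers twice the triangle's area.
zones = np.array(
    [shapely.box(0, 0, 1, 1), shapely.Polygon([(2, 0), (4, 0), (2, 2)])],
    dtype=object,
)
# Point 0 is inside the square, point 1 is inside the triangle's envelope
# but outside the triangle, point 2 is far from everything.
pts = shapely.points([(0.5, 0.5), (3.8, 1.8), (9.0, 9.0)])

tree = STRtree(zones)

# Stage 1: envelope-only candidates. For array input the result has shape
# (2, N): row 0 indexes the query input (points), row 1 indexes the tree (zones).
pt_idx, zone_idx = tree.query(pts)
candidates = sorted(zip(pt_idx.tolist(), zone_idx.tolist()))

# Stage 2: exact point-in-polygon test on the candidates only.
mask = shapely.within(pts[pt_idx], zones[zone_idx])
matches = sorted(zip(pt_idx[mask].tolist(), zone_idx[mask].tolist()))

print("candidates:", candidates)
print("matches:   ", matches)
```

The second point survives the envelope stage but is rejected by the exact test, which is the kind of false positive the second stage exists to remove.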

The diagram below shows how trajectory points flow through the two-stage filter before producing matched zone assignments.

Production-ready Python implementation

The function below uses GeoPandas and Shapely 2.0+ to execute a chunked, indexed join. It avoids the memory-heavy gpd.sjoin() by manually routing candidates through an STRtree and applying vectorized exact validation. All geometry operations run in a metric CRS; WGS84 inputs are projected before the loop runs.

PYTHON

import geopandas as gpd
import pandas as pd
import numpy as np
from shapely import STRtree, within


def optimized_trajectory_to_zone_join(
    traj_gdf: gpd.GeoDataFrame,
    zones_gdf: gpd.GeoDataFrame,
    chunk_size: int = 50_000,
    target_crs: str = "EPSG:32633",
) -> pd.DataFrame:
    """
    Indexed, chunked spatial join: trajectory points to zone polygons.

    Parameters
    ----------
    traj_gdf   : GeoDataFrame of trajectory GPS points (any CRS, Point geometry)
    zones_gdf  : GeoDataFrame of zone polygons (any CRS, Polygon/MultiPolygon)
    chunk_size : rows per processing batch; tune to available RAM
    target_crs : a metric projected CRS matching the analysis region
                 (NEVER use EPSG:4326 here — degree units break AABB efficiency)

    Returns
    -------
    DataFrame with columns [trajectory_idx, zone_id]
    """
    if traj_gdf.empty or zones_gdf.empty:
        return pd.DataFrame(columns=["trajectory_idx", "zone_id"])

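    # 1. Enforce a consistent projected CRS on both layers.
    #    Do this BEFORE building the index: the tree is tied to the
    #    geometry values it was constructed with.
    if traj_gdf.crs is None or zones_gdf.crs is None:
        raise ValueError("Both layers need a defined CRS before reprojection.")
    # Compare CRS objects rather than their string forms, which can differ
    # ("EPSG:32633", "epsg:32633", WKT) for the same CRS.
    if traj_gdf.crs != target_crs:
        traj_gdf = traj_gdf.to_crs(target_crs)
    if zones_gdf.crs != target_crs:
        zones_gdf = zones_gdf.to_crs(target_crs)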

    # 2. Precompute STRtree on zones (one-time cost, amortized over all chunks).
    zone_geoms = zones_gdf.geometry.values
    zone_tree = STRtree(zone_geoms)
    zone_idx_map = zones_gdf.index.values

    results: list[pd.DataFrame] = []

    # 3. Chunked execution: peak memory = O(chunk_size + zone_count).
    for start in range(0, len(traj_gdf), chunk_size):
        chunk = traj_gdf.iloc[start : start + chunk_size]
        traj_geoms = chunk.geometry.values
        chunk_orig_idx = chunk.index.values

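        # 4. Index pre-filter via STRtree. The tree prunes by bounding box,
        #    then keeps only pairs that pass the intersects predicate.
        #    Returns shape (2, N): row 0 = positions in the query input
        #    (trajectory points), row 1 = positions in the tree (zones).
        candidates = zone_tree.query(traj_geoms, predicate="intersects")

        if candidates.shape[1] == 0:
            continue

        t_pos, z_pos = candidates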

        # 5. Exact point-in-polygon validation (vectorized C-level call).
        exact_mask = within(traj_geoms[t_pos], zone_geoms[z_pos])

        if not exact_mask.any():
            continue

        z_pos = z_pos[exact_mask]
        t_pos = t_pos[exact_mask]

        # Map chunk-relative positions back to original DataFrame indices.
        t_orig_idx = chunk_orig_idx[t_pos]
        matched_zone_ids = zone_idx_map[z_pos]

        results.append(
            pd.DataFrame(
                {"trajectory_idx": t_orig_idx, "zone_id": matched_zone_ids}
            )
        )

    if not results:
        return pd.DataFrame(columns=["trajectory_idx", "zone_id"])

    return pd.concat(results, ignore_index=True)

Validation block

Run these checks immediately after the join to catch silent failures before they propagate downstream.

PYTHON

import logging

def validate_join_output(
    result: pd.DataFrame,
    traj_gdf: gpd.GeoDataFrame,
    zones_gdf: gpd.GeoDataFrame,
) -> None:
    """Sanity-check the join result and log a summary."""
    assert set(result.columns) >= {"trajectory_idx", "zone_id"}, (
        "Output is missing required columns."
    )

    # No trajectory index should be outside the input range
    valid_traj_idx = set(traj_gdf.index)
    rogue = set(result["trajectory_idx"]) - valid_traj_idx
    assert not rogue, f"{len(rogue)} trajectory indices not in input GeoDataFrame"

    # No zone id should be outside the zone layer
    valid_zone_ids = set(zones_gdf.index)
    rogue_zones = set(result["zone_id"]) - valid_zone_ids
    assert not rogue_zones, f"{len(rogue_zones)} zone IDs not in zones GeoDataFrame"

    # Guard against division by zero when the trajectory layer is empty.
    match_rate = result["trajectory_idx"].nunique() / max(len(traj_gdf), 1)
    logging.info(
        "Join complete: %d matches, %.1f%% of trajectory points matched",
        len(result),
        match_rate * 100,
    )

    # Alert if match rate is suspiciously low (possible CRS mismatch).
    # No format args are passed, so the percent sign must not be doubled.
    if match_rate < 0.01 and len(traj_gdf) > 1000:
        logging.warning(
            "Match rate below 1%: verify both layers cover the same geographic extent "
            "and that target_crs is appropriate for the region."
        )

Key post-run checks:

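Output shape: with non-overlapping zones, result has between 0 and len(traj_gdf) rows. More rows than trajectory points means some points matched several zones, which is expected only if zones overlap.
Match rate sanity: A match rate below 1% on a dense urban dataset almost always means a CRS mismatch, such as a layer whose declared CRS does not match its actual coordinates or a target_crs chosen for the wrong region.
Unmatched points: the function returns matched pairs only, so a trajectory point that fell outside all zone boundaries is absent from result rather than carrying a NaN zone_id. NaN values appear only after you left-join result back onto the trajectory table. Unmatched points are legitimate; confirm they are not caused by extent mismatch.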
Duplicate handling: If zones overlap and you need at most one zone per point, add .drop_duplicates(subset="trajectory_idx", keep="first") after the join.
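The duplicate and unmatched-point checks can be combined when attaching zone IDs back to the trajectory table. The sketch below uses pandas only; the small traj and result frames are made-up stand-ins for the real inputs, with result shaped like the join output (columns trajectory_idx and zone_id).

```python
import pandas as pd

# Hypothetical trajectory table indexed by the original point index.
traj = pd.DataFrame({"speed": [4.0, 5.5, 6.1]}, index=[10, 11, 12])

# Hypothetical join output: point 10 matched two overlapping zones,
# point 11 matched nothing, point 12 matched one zone.
result = pd.DataFrame(
    {"trajectory_idx": [10, 10, 12], "zone_id": ["A", "B", "C"]}
)

# Keep at most one zone per point, then left-join onto the trajectory index.
one_zone = result.drop_duplicates(subset="trajectory_idx", keep="first")
labelled = traj.join(one_zone.set_index("trajectory_idx")["zone_id"])

# Points that fell outside every zone now show up as NaN.
unmatched = int(labelled["zone_id"].isna().sum())
print(labelled)
print("unmatched points:", unmatched)
```

Note that keep="first" picks whichever zone happened to be listed first; if the choice matters, sort result by a priority column before dropping duplicates.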

Common mistakes and gotchas

Leaving coordinates in EPSG:4326. Bounding boxes in decimal degrees are latitude-dependent: a 0.01° box spans ~1.1 km east-west at the equator but only ~0.6 km at 55° N, while its north-south span stays at ~1.1 km. This degrades AABB pruning efficiency by up to 40% at mid-latitudes and makes every downstream distance or speed calculation wrong. The best practices for CRS transformations in movement data page covers projection selection in detail.
Rebuilding the STRtree inside the chunk loop. The tree construction cost is proportional to zone count; doing it once outside the loop and reusing it across all batches is the critical performance win. Benchmarks on 10,000-zone layers show rebuilding per chunk adds 25–60 seconds to a join that otherwise runs in under 2 seconds.
Using iterrows() for the join. A row-by-row Python loop with shapely.contains() per point processes roughly 1,000–5,000 points per second. The vectorized STRtree + within() pattern processes 200,000–500,000 points per second on the same hardware.
Forgetting that STRtree is tied to its geometry array. If you reproject zones_gdf after building the tree, the tree still references the pre-projection geometry objects. Always build the tree after the final CRS is set.
Ignoring points on zone boundaries. The within predicate returns False for points that sit exactly on a polygon boundary. If boundary points must match, replace within(point, polygon) with covered_by(point, polygon), which keeps the same argument order and includes the boundary (equivalently, covers(polygon, point)), or apply a 1-centimeter buffer to zone geometries before indexing.
Assuming one match per point when zones overlap. Overlapping delivery zones or administrative boundaries produce one result row per matching zone per point. Downstream aggregations must account for this or you will overcount dwell times and trip counts.
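The boundary behaviour described above is easy to verify on arrays. This is a minimal sketch assuming Shapely 2.0+, comparing within with covered_by for the same point-first argument order; the zone and points are illustrative.

```python
import shapely

zone = shapely.box(0, 0, 10, 10)  # hypothetical zone polygon
# Interior point, edge point, corner point, outside point.
pts = shapely.points([(5, 5), (0, 5), (10, 10), (11, 5)])

strict = shapely.within(pts, zone)         # boundary points excluded
inclusive = shapely.covered_by(pts, zone)  # boundary points included

print("within:    ", strict.tolist())
print("covered_by:", inclusive.tolist())
```

Be aware that with an inclusive predicate, a point on an edge shared by two adjacent zones matches both of them, so the overlap handling described above applies even to non-overlapping zone layers.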

Tuning guidelines

| Parameter | Starting value | Adjust when |
| --- | --- | --- |
| `chunk_size` | `50_000` | Increase to `100_000` on workers with 32 GB+ RAM; decrease to `10_000` if swap activity or GC pauses appear |
| `target_crs` | Local UTM zone | Use a single regional CRS; avoid `EPSG:3857` (Web Mercator distorts area and distance at high latitudes) |
| STRtree rebuild | On zone geometry change only | Rebuild when polygons are added, removed, or edited; not on attribute-only updates |
| Predicate | `within` | Switch to `covered_by` if points lying exactly on a zone boundary must match; use `intersects` only for line or polygon trajectories |
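For the target_crs row, a local UTM zone can be derived from a representative longitude and latitude of the analysis region. The helper below is a sketch using the standard 6-degree zone formula; utm_epsg_for is a hypothetical name, and the function ignores the special-case zones around Norway and Svalbard. GeoPandas also provides GeoDataFrame.estimate_utm_crs() for the same purpose.

```python
import math


def utm_epsg_for(lon: float, lat: float) -> str:
    """Return the WGS84 / UTM EPSG code for a point (standard 6-degree zones only)."""
    zone = int(math.floor((lon + 180.0) / 6.0)) % 60 + 1
    base = 32600 if lat >= 0 else 32700  # 326xx = northern, 327xx = southern hemisphere
    return f"EPSG:{base + zone}"


# Illustrative coordinates (lon, lat) in degrees.
print(utm_epsg_for(15.0, 52.0))    # central Europe
print(utm_epsg_for(-74.0, 40.7))   # New York area
print(utm_epsg_for(151.2, -33.9))  # Sydney area
```

For regions that span several UTM zones, a single regional projection is usually a better fit than switching zones mid-dataset.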

This pattern scales linearly with trajectory volume and roughly logarithmically with zone count. It is well-suited for real-time fleet tracking, historical mobility reconstruction, and high-frequency urban analytics. When cleaning GPS data before the join, follow the handling GPS drift in raw trajectory logs workflow first: noisy points that land outside a zone boundary simply go unmatched, and without drift correction they are impossible to distinguish from genuine non-matches. For downstream speed and acceleration metrics, the sampling rate optimization pipeline ensures the temporal density of matched points is consistent across zones before aggregation.