Project Methodology & Architecture

How we process, verify, and deliver the world's most comprehensive scuba diving dataset.

The Problem with Dive Data

Historically, scuba diving information has been siloed. Dive centers keep their own proprietary lists of sites, community forums hold scattered trip reports, and geographical naming conventions vary wildly between languages and localities.

To build Dive Navigator, we couldn't just scrape one website. We had to build an ingestion engine capable of cross-referencing multiple disparate sources, handling geospatial discrepancies, and using AI to normalize unstructured text into queryable facts.

The Data Pipeline

1. Ingestion & Aggregation

We ingest data from a mix of community submissions, historical dive logs, government open data portals, and verified operator records.
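Regardless of origin, every record is first mapped onto a common envelope before it enters the pipeline. The sketch below is illustrative only; `RawSubmission` and its fields are hypothetical stand-ins, not our production schema:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RawSubmission:
    # Hypothetical common envelope for all four source types;
    # field names are illustrative, not our production schema.
    source_kind: str   # "community" | "dive_log" | "open_data" | "operator"
    site_name: str     # name exactly as the source reports it
    lat: float         # WGS84 latitude
    lon: float         # WGS84 longitude
    raw_text: str = "" # unstructured description, if any
    received_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )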

Read about our Data Sources →

2. Spatial Normalization via PostGIS

Because the same dive site is often known by three different names, we rely heavily on geospatial bounding boxes rather than name matching. Using ST_Distance and ST_Contains, we group submissions into geographic clusters, treating coordinate assertions that fall within a 50-100 meter radius of one another as duplicates and electing a single "Canonical Site" per cluster.
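A minimal sketch of that clustering pass, assuming a `submissions` table with a PostGIS geography column named `geom` (both names are hypothetical). ST_DWithin expresses the same radius test as comparing ST_Distance to a threshold, but it can use a spatial index:

import psycopg2

# Pairs of submissions whose coordinates fall within 100 m of each
# other; assumed table and column names, not our live schema.
NEARBY_PAIRS_SQL = """
    SELECT a.id AS submission_a,
           b.id AS submission_b,
           ST_Distance(a.geom, b.geom) AS meters_apart
    FROM submissions a
    JOIN submissions b
      ON a.id < b.id                        -- evaluate each pair once
    WHERE ST_DWithin(a.geom, b.geom, 100)   -- geography: radius in meters
    ORDER BY meters_apart;
"""

def find_duplicate_candidates(conn):
    # Candidate pairs feed the clustering step that elects a
    # "Canonical Site" per geographic group.
    with conn.cursor() as cur:
        cur.execute(NEARBY_PAIRS_SQL)
        return cur.fetchall()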

3. LLM Fact Extraction

Unstructured text (like "It's a deep wall dive, about 30m max, strong currents usually") is passed through Large Language Models to extract structured, queryable JSON objects:

{
  "max_depth_m": 30,
  "dive_type": ["wall", "deep"],
  "current_strength": "strong"
}
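A sketch of that extraction step. `call_llm` is a hypothetical stand-in for whatever model endpoint is used, and the prompt and bounds check are illustrative only:

import json

# Illustrative prompt; the production prompt and schema differ.
EXTRACTION_PROMPT = (
    "From the dive report below, return only a JSON object with keys "
    "max_depth_m (number), dive_type (list of strings) and "
    "current_strength (string).\n\nReport: {text}"
)

def extract_facts(text: str, call_llm) -> dict:
    # call_llm is a hypothetical callable: prompt string in, reply string out.
    reply = call_llm(EXTRACTION_PROMPT.format(text=text))
    facts = json.loads(reply)  # fail loudly on malformed model output
    depth = facts.get("max_depth_m")
    if depth is not None and not 0 < float(depth) <= 350:
        # Reject implausible depths before they become fact candidates;
        # the 350 m ceiling is an arbitrary sanity bound for this sketch.
        raise ValueError(f"implausible max_depth_m: {depth}")
    return facts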

4. Confidence Scoring & Verification

Facts extracted by AI are assigned a confidence score and enter a `fact_candidates` table. When multiple independent sources corroborate the same depth or hazard, the fact's "Trust Score" increases. Only high-trust or manually verified facts make it to the live maps.
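As a toy model of that rule (the real weights and thresholds are not published here), each additional independent source asserting the same value could close half the remaining gap to a perfect score:

from collections import defaultdict

PUBLISH_THRESHOLD = 0.9  # illustrative bar for "high-trust"

def trust_scores(assertions):
    # assertions: iterable of (source_id, field, value) tuples for one
    # site, e.g. ("src_17", "max_depth_m", 30). Shapes are hypothetical.
    supporters = defaultdict(set)
    for source_id, field, value in assertions:
        supporters[(field, value)].add(source_id)  # independent sources only
    # One source scores 0.5, two 0.75, three 0.875, four 0.9375, so
    # only well-corroborated facts cross the publish threshold.
    return {claim: 1.0 - 0.5 ** len(sources)
            for claim, sources in supporters.items()}

Under this toy rule a single source can never cross the bar on its own, which mirrors the requirement that facts be independently corroborated or manually verified before going live.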

Read about our Verification Process →