DAME — EPC → BigQuery Ingestion & Enrichment

Industrial‑grade, minimal‑ops pipeline that ingests UK EPC (Energy Performance Certificate) data from the Open Data Communities API, lands it as JSON in GCS, and loads it into BigQuery with curated views ready for analytics and GenAI/RAG.

Why this exists

  • JSON in, JSON out: Avoids brittle CSV parsing and schema drift.
  • Idempotent: Checkpoints in GCS prevent double work (see the sketch after this list).
  • Partitioned & clustered: Cheap scans and fast filters in BigQuery.
  • Separation of concerns: HTTP client, loaders, orchestration, SQL views are cleanly split.
  • One‑command local run, or a BigQuery Studio Pipeline scheduled at 08:00 Europe/London.
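
Checkpointing is what makes re-runs safe: each (month, kind) unit of work is processed at most once. A minimal sketch of the pattern, assuming completion is marked with per-unit marker objects in GCS — the real paths and signatures live in state.py, and the bucket name here is hypothetical:

# Illustrative sketch of the GCS checkpoint pattern; see state.py for the real API.
from google.cloud import storage

_client = storage.Client()

def is_done(bucket: str, key: str) -> bool:
    # A checkpoint is simply the existence of a marker object.
    return _client.bucket(bucket).blob(f"epc/state/{key}.done").exists()

def mark_done(bucket: str, key: str) -> None:
    # Write a zero-byte marker once the month has fully landed in BigQuery.
    _client.bucket(bucket).blob(f"epc/state/{key}.done").upload_from_string("")

# Orchestrator loop: skip any (month, kind) pair already checkpointed.
for month in ("2024-01", "2024-02"):
    for kind in ("domestic", "non-domestic"):
        key = f"{kind}/{month}"
        if is_done("my-epc-bucket", key):      # hypothetical bucket name
            continue
        # ... ingest month, write NDJSON.gz to GCS, load into BigQuery ...
        mark_done("my-epc-bucket", key)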

High‑level architecture

flowchart TD
    API["EPC API JSON"] --> Ingest["Ingest Scripts<br>domestic.py | nondomestic.py"]
    Ingest -->|Normalize| Data{"Normalized Data"}

    subgraph Storage
        GCS["GCS Bucket<br>epc/json/..."]
        BQ["BigQuery<br>raw_json tables"]
    end

    Data -->|Load| GCS
    Data -->|Load| BQ

    subgraph Analytics
        Views["Curated Views<br>enr_*_certificates_v"]
        Recs["Recommendations"]
    end

    BQ --> Views
    BQ -.-> Recs
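
The ingest scripts page through the API using search-after pagination with retries (epc_api.py). A minimal sketch, assuming the search-after query parameter, the X-Next-Search-After response header, and a rows key in the JSON body, per the public EPC API documentation — verify the details against epc_api.py:

# Sketch of search-after paging with retried requests; details may differ from epc_api.py.
import base64
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

BASE = "https://epc.opendatacommunities.org/api/v1/domestic/search"

def make_session(email: str, api_key: str) -> requests.Session:
    session = requests.Session()
    token = base64.b64encode(f"{email}:{api_key}".encode()).decode()
    session.headers.update({"Authorization": f"Basic {token}",
                            "Accept": "application/json"})
    retry = Retry(total=5, backoff_factor=1.0,
                  status_forcelist=[429, 500, 502, 503, 504])
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def fetch_all(session: requests.Session, params: dict):
    # Yield rows page by page until the API stops returning a cursor.
    search_after = None
    while True:
        query = dict(params, size=5000)
        if search_after:
            query["search-after"] = search_after
        resp = session.get(BASE, params=query, timeout=60)
        resp.raise_for_status()
        yield from resp.json().get("rows", [])
        search_after = resp.headers.get("X-Next-Search-After")
        if not search_after:
            break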

Repo layout (key paths)

app/dame_epc/
  ├─ settings.py                 # .env + env loader
  ├─ schema.py                   # table names, landing schema, view SQL helpers
  ├─ epc_api.py                  # HTTP-only EPC client (search-after, retries)
  ├─ io_utils.py                 # GCS & BigQuery helpers
  ├─ logging_setup.py            # JSON logging
  ├─ state.py                    # GCS checkpoints (is_done/mark_done/clear)
  ├─ domestic.py                 # monthly domestic ingest (JSON → GCS → BQ)
  ├─ nondomestic.py              # monthly non-domestic ingest
  ├─ bulk_recommendations.py     # per‑LMK or yearly ZIP recommendations
  └─ main.py                     # orchestrator (loop months/kinds, checkpoints)

cloud/dame-prod-473718/
  └─ bigquery/
      ├─ pipeline/pipeline.md    # BQ Studio Pipeline playbook (Prod)
      └─ sql/views/              # curated view DDLs (templated with {{PROJECT}}, {{DATASET}})

scripts/
  ├─ apply_views.py              # create/replace all views from SQL files (sketch below)
  └─ month_ingest_local.sh       # local helper to run orchestrator + apply views

README.md
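
The curated-view DDLs under sql/views/ are plain SQL files with {{PROJECT}} and {{DATASET}} placeholders. A sketch of what scripts/apply_views.py plausibly does — substitute the placeholders and execute each file as a query job; the real script may differ:

# Conceptual sketch of apply_views.py: render each templated DDL and run it.
from pathlib import Path
from google.cloud import bigquery

def apply_views(project: str, dataset: str,
                sql_dir: str = "cloud/dame-prod-473718/bigquery/sql/views") -> None:
    client = bigquery.Client(project=project)
    for path in sorted(Path(sql_dir).glob("*.sql")):
        ddl = (path.read_text()
                   .replace("{{PROJECT}}", project)
                   .replace("{{DATASET}}", dataset))
        client.query(ddl).result()   # CREATE OR REPLACE VIEW ... runs as a query job
        print(f"applied {path.name}")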

Prerequisites

  • Google Cloud project with BigQuery and GCS enabled.
  • EPC API credentials from Open Data Communities.
  • gcloud CLI configured with Application Default Credentials (ADC), e.g. via gcloud auth application-default login.

Quickstart — one command

chmod +x scripts/month_ingest_local.sh

# Run a month (or range), include recommendations, then apply views
./scripts/month_ingest_local.sh \
  --env .env \
  --start 2024-01 --end 2024-01 \
  --kinds domestic,non-domestic \
  --with-recs \
  --apply-views

This will:

  1. Ingest certs month by month.
  2. Optionally fetch recommendations.
  3. Write NDJSON.gz to GCS.
  4. Load into partitioned/clustered BigQuery raw tables.
  5. Apply curated views.
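
For step 4, a sketch of loading NDJSON.gz from GCS into a partitioned, clustered raw table with the google-cloud-bigquery client — the table name and clustering fields are illustrative, and the real helpers live in io_utils.py:

# Sketch of the BigQuery load step; assumes the raw table's landing schema already exists.
from google.cloud import bigquery

def load_month(project: str, dataset: str, gcs_uri: str) -> None:
    client = bigquery.Client(project=project)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        time_partitioning=bigquery.TimePartitioning(field="lodgement_date"),
        clustering_fields=["postcode"],        # assumption: pick the real cluster keys
    )
    # BigQuery reads gzip-compressed NDJSON from GCS without extra configuration.
    job = client.load_table_from_uri(
        gcs_uri,
        f"{project}.{dataset}.raw_json_domestic",   # hypothetical table name
        job_config=job_config,
    )
    job.result()   # block until the load job finishes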

Data model (landing schema)

Column          Type    Mode      Notes
lmk_key         STRING  REQUIRED  Certificate LMK (primary-key surrogate)
lodgement_date  DATE    NULLABLE  Partition field (if present)
postcode        STRING  NULLABLE  —
uprn            STRING  NULLABLE  Treated as STRING to avoid integer pitfalls
payload         JSON    NULLABLE  Full source record
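
Because lodgement_date is the partition column, filtering on it prunes partitions and keeps scans cheap, while anything not promoted to a top-level column stays reachable inside payload. A hypothetical example (table name and JSON path are illustrative):

# Sketch of a partition-pruned query that also extracts a field from the JSON payload.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT lmk_key,
       postcode,
       JSON_VALUE(payload, '$."current-energy-rating"') AS current_energy_rating
FROM `my-project.epc.raw_json_domestic`                      -- hypothetical table
WHERE lodgement_date BETWEEN '2024-01-01' AND '2024-01-31'   -- prunes partitions
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.lmk_key, row.current_energy_rating)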