Data methodology
How we turn hundreds of inconsistent government permit feeds into one clean, queryable, honestly-labeled dataset — and exactly what we will and won't claim about it.
In short: PermitStack ingests U.S. building permits directly from official government open-data portals every day, normalizes every source into one schema, classifies each permit into ~20 categories, rejects impossible (future-dated) date values, and labels every jurisdiction with a data_status so you always know whether its data is current. Today: 49M+ permits, 522 active jurisdictions, 24 historical-archive sources, 50 states.
Where the data comes from
Every permit in PermitStack originates from a government authority having jurisdiction — a city or county building department — and is published through that government's own open-data infrastructure. We do not buy data from a reseller and we do not infer permits from imagery. We read each jurisdiction's official feed directly.
U.S. jurisdictions publish on a handful of common platforms, and we built a dedicated connector for each:
- Socrata (Tyler Data & Insights / SODA API) — the most common open-data platform for large cities.
- ArcGIS (Esri FeatureServer / MapServer layers) — GIS-native permit layers, often county-level.
- CKAN (DataStore API) — used by many mid-size municipal open-data portals.
- CARTO (SQL API) — geospatial open-data feeds.
- Tyler EnerGov (Citizen Self Service) — the public JSON search behind Tyler's permitting product, covering a large swath of cities and counties.
- Accela (Citizen Access portals) — for jurisdictions with no open-data feed, we use their public Citizen Access permit-export, then normalize the result like any other source.
A small number of very large statewide datasets — for example California's NEM solar interconnection data (~2.7M records) and New York's NY-Sun program (~550K) — run through dedicated bulk loaders but land in the exact same schema. You can see every jurisdiction, its platform, and its current freshness on the coverage page.
Daily ingestion cadence
We run ingestion every day, beginning at 03:00 UTC. Each run is incremental: for every jurisdiction we ask its source only for permits added or changed since our last successful pull, which keeps the load light on the upstream portal and lets a newly issued permit appear in our API within days of issuance. Statewide bulk feeds refresh on their own schedule (the largest, California solar, monthly), and we re-run any feed in full when a source migrates or changes shape.
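As a rough illustration of an incremental pull, the sketch below builds a query URL for a Socrata (SODA) source that asks only for rows changed since the last successful pull. It is a minimal sketch, not our actual connector: the function name, the example portal URL, and the choice of the `:updated_at` system field as the change marker are all assumptions made for illustration, and the timestamp literal format is assumed to be accepted by the portal.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlencode, urlsplit


def build_incremental_url(base_url: str, last_pull: datetime,
                          limit: int = 1000, offset: int = 0) -> str:
    """Build a SODA query URL for rows changed since `last_pull` (hypothetical helper)."""
    if last_pull.tzinfo is None:
        raise ValueError("last_pull must be timezone-aware")
    # Assumption: the portal accepts an ISO-8601 literal without an offset, in UTC.
    since = last_pull.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    params = {
        "$where": f":updated_at > '{since}'",  # only rows changed since the last pull
        "$order": ":updated_at",               # stable ordering so paging is repeatable
        "$limit": limit,
        "$offset": offset,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical dataset URL, used only to show the shape of the request.
url = build_incremental_url(
    "https://data.example.gov/resource/abcd-1234.json",
    datetime(2025, 1, 2, 3, 0, tzinfo=timezone.utc),
)
query = parse_qs(urlsplit(url).query)
print(query["$where"][0])
```

Paging with `$limit` and `$offset` over a fixed ordering keeps each request small, which fits the goal of keeping load light on the upstream portal.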
This daily cadence is one of our core advantages. For comparison, Shovels.ai — the category leader — refreshes roughly twice monthly (publicly reported, as of June 2026). We update faster and we publish that fact rather than burying it.
Normalization into one schema
Raw permit feeds disagree about almost everything: field names, date formats, status vocabularies, coordinate projections, even what counts as a "permit." Our loader maps each source's columns onto a single canonical record, so a permit from a Socrata city in Texas and a permit from an ArcGIS county in California come out with the same fields and the same vocabulary in our API. Normalization includes:
- Field mapping — each jurisdiction config maps source columns to canonical fields (permit_number, address, date_filed, date_issued, status, valuation, contractor_name, latitude/longitude, and more).
- Geocoding & coordinate handling — we standardize coordinates to WGS84, transforming sources published in a local state-plane projection so every permit has a consistent lat/lng.
- Status normalization — every source's local labels collapse into our fixed set: filed, issued, in_progress, final, expired, cancelled, revoked, unknown (see permit status).
- De-duplication — records upsert on a stable per-jurisdiction key so re-ingesting a feed never creates duplicates.
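The field mapping, status normalization, and de-duplication steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not our loader: the source column names, the local status labels, and the use of (jurisdiction, permit_number) as the upsert key are hypothetical. Only the canonical field names and the fixed status set come from the list above.

```python
from typing import Any

# Hypothetical per-jurisdiction config: source column -> canonical field.
FIELD_MAP = {
    "PermitNum": "permit_number",
    "OriginalAddress": "address",
    "AppliedDate": "date_filed",
    "IssuedDate": "date_issued",
    "StatusCurrent": "status",
}

# Hypothetical local labels -> the fixed status vocabulary.
STATUS_MAP = {
    "application received": "filed",
    "permit issued": "issued",
    "under inspection": "in_progress",
    "finaled": "final",
    "withdrawn": "cancelled",
}


def normalize(raw: dict[str, Any], jurisdiction: str) -> dict[str, Any]:
    """Map one raw source row onto the canonical record shape."""
    record = {canonical: raw.get(source) for source, canonical in FIELD_MAP.items()}
    label = str(record["status"] or "").strip().lower()
    # Any label the config does not recognize falls back to "unknown".
    record["status"] = STATUS_MAP.get(label, "unknown")
    record["jurisdiction"] = jurisdiction
    return record


def upsert(store: dict[tuple[str, str], dict[str, Any]], record: dict[str, Any]) -> None:
    """Insert or overwrite by key, so re-ingesting a row never duplicates it."""
    key = (record["jurisdiction"], record["permit_number"])  # assumed stable key
    store[key] = record


store: dict[tuple[str, str], dict[str, Any]] = {}
rows = [
    {"PermitNum": "B-100", "OriginalAddress": "1 Main St", "StatusCurrent": "Application Received"},
    {"PermitNum": "B-100", "OriginalAddress": "1 Main St", "StatusCurrent": "Permit Issued"},
]
for row in rows:
    upsert(store, normalize(row, "example_city"))
print(len(store), store[("example_city", "B-100")]["status"])
```

The second row describes the same permit after a status change, so it overwrites the first instead of adding a duplicate.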
Classification into ~20 categories
Permit "types" are free text and wildly inconsistent — a roof replacement might be filed as REROOF, RE-ROOF, or a sentence in a description field (see reroof vs. re-roof). Our classifier reads each permit's type and description and assigns one of ~20 normalized categories, so you can query "all solar permits" or "all roofing permits" across every jurisdiction with a single filter instead of guessing local spellings.
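To make the idea concrete, here is a minimal keyword-rule sketch of how free-text types and descriptions could collapse into normalized categories. It is an assumption for illustration only: the rules, their order, and the first-match-wins behavior are hypothetical and say nothing about how the production classifier is built. The category names are taken from this page.

```python
import re
from typing import Optional

# Hypothetical rules, checked in order; the first matching pattern wins.
RULES = [
    ("SOLAR", re.compile(r"\b(solar|photovoltaic|pv)\b")),
    ("EV_CHARGER", re.compile(r"\b(ev charger|evse|electric vehicle)\b")),
    ("ROOFING", re.compile(r"\b(re[-\s]?)?roof(ing)?\b")),
]


def classify(permit_type: Optional[str], description: Optional[str]) -> str:
    """Return a normalized category for a permit's free-text type and description."""
    text = f"{permit_type or ''} {description or ''}".lower()
    for category, pattern in RULES:
        if pattern.search(text):
            return category
    return "OTHER"  # nothing matched


for permit_type, description in [
    ("REROOF", None),
    ("RE-ROOF", None),
    ("BLDG", "Tear off and replace roof shingles"),
    ("ELEC", "Install rooftop solar array"),
    ("BLDG", "New deck"),
]:
    print(permit_type, "->", classify(permit_type, description))
```

Rule order matters in a scheme like this: checking SOLAR before ROOFING keeps a solar installation from being labeled as roofing work when its description also mentions the roof.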
Current category counts across the dataset:
| Category | Plain meaning | Records |
|---|---|---|
| OTHER | Other | 17,201,635 |
| ELECTRICAL | Electrical | 6,308,600 |
| SOLAR | Solar | 5,935,469 |
| PLUMBING | Plumbing | 4,759,253 |
| INTERIOR_REMODEL | Interior Remodel | 3,215,412 |
| ROOFING | Roofing | 2,388,748 |
| HVAC | HVAC | 2,317,455 |
| MECHANICAL | Mechanical | 2,142,565 |
| RENOVATION | Renovation | 1,789,952 |
| NEW_CONSTRUCTION | New Construction | 1,382,646 |
| ADDITION | Addition | 1,087,590 |
| SIGN | Sign | 673,937 |
| FENCE | Fence | 656,612 |
| POOL | Pool | 579,318 |
| DEMOLITION | Demolition | 566,999 |
| FIRE_ALARM | Fire Alarm | 535,441 |
| FOUNDATION | Foundation | 252,709 |
| GRADING | Grading | 131,545 |
| BATTERY | Battery | 45,168 |
| EV_CHARGER | EV Charger | 31,109 |
Each major category also has a glossary entry — for example solar, roofing, HVAC, electrical, and new construction — and several map to a dedicated landing page like solar permit data.
Future-date rejection & data integrity
Upstream feeds are not clean. Several publish placeholder or typo dates — a permit "issued" in the year 2099, or filed next month. We do not pass those through. During ingestion we reject impossible future-dated values on date_filed, date_issued, and date_completed, nulling the bad date and logging it rather than letting a garbage timestamp pollute a roof-age or freshness calculation. A failed record in a batch never takes down the good records around it.
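The rule described above can be sketched in a few lines. This is a minimal sketch, assuming date fields have already been parsed into `date` or `datetime` objects. The function names and logger name are hypothetical, and the per-record `try`/`except` is one simple way to keep a bad record from failing the rest of its batch, not necessarily the way our pipeline does it.

```python
import logging
from datetime import date, datetime

logger = logging.getLogger("ingest")  # hypothetical logger name
DATE_FIELDS = ("date_filed", "date_issued", "date_completed")


def reject_future_dates(record: dict, today: date) -> dict:
    """Return a copy of `record` with any future-dated field nulled and logged."""
    cleaned = dict(record)
    for field in DATE_FIELDS:
        value = cleaned.get(field)
        if value is None:
            continue
        # Compare on calendar days so date and datetime values are both handled.
        day = value.date() if isinstance(value, datetime) else value
        if day > today:
            logger.warning("nulling future %s=%s on permit %s",
                           field, value, cleaned.get("permit_number"))
            cleaned[field] = None
    return cleaned


def clean_batch(records: list, today: date) -> tuple[list, list]:
    """Clean each record independently; one failure never drops its neighbors."""
    good, failed = [], []
    for record in records:
        try:
            good.append(reject_future_dates(record, today))
        except Exception:
            logger.exception("skipping unreadable record")
            failed.append(record)
    return good, failed


today = date(2025, 6, 1)
batch = [
    {"permit_number": "A-1", "date_filed": date(2025, 5, 1), "date_issued": date(2099, 1, 1)},
    None,  # a malformed row that cannot be processed
    {"permit_number": "A-2", "date_filed": date(2025, 7, 1)},
]
good, failed = clean_batch(batch, today)
print(good, failed)
```

Note that the permit itself is kept: only the impossible date is set to null, so the remaining fields stay queryable.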
Honest labeling: the data_status flag
This is the part we care most about. Coverage counts are easy to inflate by leaving dead cities in the total. We don't. Every jurisdiction carries a data_status flag with one of three values:
- active — the source publishes current data; we ingest it daily. (522 jurisdictions)
- historical_archive — the upstream feed has been permanently discontinued or decommissioned; we keep the historical records for research but expect no new permits. (24 sources)
- frozen — temporarily not updating; we're watching it.
This flag is exposed in API responses and rendered on the coverage page next to every jurisdiction, with a freshness indicator showing how recently each source published. The discipline is simple: a "no permit found" should never silently mean "we quietly stopped getting data here." We'd rather tell you a jurisdiction is archived than let you draw a wrong conclusion from a gap.
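One way a client could act on this flag is sketched below: before trusting an empty search result, check the jurisdiction's data_status. The three values come from the list above. The enum, the function name, and the returned messages are hypothetical, and the sketch makes no assumption about the shape of the API response beyond the flag being available as a string.

```python
from enum import Enum


class DataStatus(str, Enum):
    ACTIVE = "active"
    HISTORICAL_ARCHIVE = "historical_archive"
    FROZEN = "frozen"


def interpret_empty_result(data_status: str) -> str:
    """Explain what "no permit found" means for a jurisdiction with this status."""
    status = DataStatus(data_status)  # raises ValueError on an unrecognized value
    if status is DataStatus.ACTIVE:
        return "Source is current: no matching permit is on record."
    if status is DataStatus.HISTORICAL_ARCHIVE:
        return "Source is archived: newer permits will not appear here."
    return "Source is temporarily not updating: recent permits may be missing."


for value in ("active", "historical_archive", "frozen"):
    print(value, "->", interpret_empty_result(value))
```

Only the first case supports the conclusion that no permit exists. The other two signal a gap in the source, not an absence of construction activity.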
What we will and won't claim
We report live counts, not round marketing numbers: the figures on this page are read from the production database when the page is generated. Permit data is a near-complete but not total record of construction activity: some work is unpermitted, and coverage varies by jurisdiction. We say so, we label it, and we point you to the coverage page so you can verify before you rely on a city. For the vocabulary behind any field referenced here, see the building permit glossary.
Build on honestly-labeled permit data
49M+ U.S. building permits, refreshed daily, with transparent per-jurisdiction data status. Free tier, no credit card.