Scraper Schemas
Connecting Scrapers and Datasets.

Overview

Scrapers are best understood by the Dataset they are scraping. The Datasets database is normalized, so it's not easy to tell at a glance what properties a dataset has (unless you have memorized what agency_id = 73e93439e6bf4ffc8b3f931a86fa3ad0 or data_type = 4 means).
The schema.json file in the root of each Scraper folder gives us two things:
    1. Human- and machine-readable information about each Scraper.
    2. The ability for data collectors to quickly update the Datasets database while submitting or updating Scraper code.

Usage

With the schema.json file populated, we can keep Dataset and Agency information in sync without forcing volunteers to use DoltHub. The schema file also points to exactly where the output files are located and tells the ETL library how to map tabular data.
We currently have one working example, for the /USA/CA/butte_county/college/chico scrapers. Running the etl.py script clones the DoltHub repos, reads the schema.json, makes the changes in a new branch, syncs the database with the schema.json, and prepares a commit.
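Conceptually, the flow looks something like the sketch below. This is a simplified illustration rather than the actual etl.py: the DoltHub remote, branch name, and SQL statement are placeholders.

import json
import subprocess

# Simplified illustration of the etl.py flow. The remote, branch name, and
# query are placeholders, and real code should escape or parameterize values.
DOLT_REMOTE = "pdap/datasets"      # hypothetical DoltHub remote
REPO_DIR = "datasets"
BRANCH = "schema-sync-chico"       # hypothetical branch name

with open("schema.json") as f:
    schema = json.load(f)

# Clone the DoltHub repo and create a new branch for the changes.
subprocess.run(["dolt", "clone", DOLT_REMOTE], check=True)
subprocess.run(["dolt", "checkout", "-b", BRANCH], cwd=REPO_DIR, check=True)

# Sync agency_info from schema.json into the agencies table.
info = schema["agency_info"]
query = "UPDATE agencies SET name = '{}' WHERE id = '{}'".format(
    info["agency_name"], schema["agency_id"]
)
subprocess.run(["dolt", "sql", "-q", query], cwd=REPO_DIR, check=True)

# Stage everything so the change set is ready to review and commit.
subprocess.run(["dolt", "add", "."], cwd=REPO_DIR, check=True)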

Schema example & definitions

{
    "agency_id": "73e93439e6bf4ffc8b3f931a86fa3ad0",
    "agency_info": {
        "agency_name": "Clanton Police Department",
        "agency_coords": {"lat": "32.83853", "lng": "-86.62936"},
        "agency_type": 4,
        "city": "Clanton",
        "state": "AL",
        "zip": "35045",
        "county_fips": "01021"
    },
    "data": [
        {
            "dataset_id": "5740697099a311ebab258c8590d4a7fc",
            "url": "https://cityprotect.com/agency/540048e6ee664a6f88ae0ceb93717e50",
            "full_data_location": "data/cityprotect",
            "source_type": 3,
            "data_type": 10,
            "format_type": 2,
            "update_freq": 3,
            "last_modified": "2021-05-25 21:07:03.793049 +0000 UTC",
            "mapping": {
                "id": "__uuid__",
                "ccn": "ccn",
                "incidentDate": "date",
                ...
                "datasets_id": "__dataset_id__",
                "date_insert": "__date__"
            }
        }
    ]
}
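Before running the ETL script, it can help to sanity-check a schema.json for missing keys. The snippet below is a hypothetical helper (not part of the PDAP tooling), and the required-key lists are assumptions drawn from the field definitions that follow.

import json

# Hypothetical pre-flight check; the required keys below are assumptions
# based on the field definitions, not an official list.
REQUIRED_TOP_LEVEL = {"agency_id", "agency_info", "data"}
REQUIRED_PER_DATASET = {"url", "full_data_location", "source_type", "data_type"}

with open("schema.json") as f:
    schema = json.load(f)

missing = REQUIRED_TOP_LEVEL - schema.keys()
if missing:
    raise ValueError(f"schema.json is missing top-level keys: {missing}")

for i, dataset in enumerate(schema["data"]):
    missing = REQUIRED_PER_DATASET - dataset.keys()
    if missing:
        raise ValueError(f"data[{i}] is missing keys: {missing}")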
agency_id: The agencies.id column.
agency_info: Information changed here will be updated in the datasets table the next time the ETL script is run.
agency_name: The agencies.name column.
agency_coords: The latitude and longitude coordinates of the main agency building.
city: Not applicable for county or state agencies.
state: Two-letter ISO code (e.g. "IN" or "CA").
county_fips: If agency_coords is populated, this will be populated automatically. You can also look it up manually from the counties table.
data: An array of Datasets; most agencies have more than one.
dataset_id: If you're scraping an existing dataset, add its id here.
url: The datasets_url column.
full_data_location: The location of the scraper Extraction. This is typically the data directory at the top level of the Scraper.
source_type: The source_types.id column.
data_type: The data_types.id column.
update_freq: The update_frequency.id column.
last_modified: In most cases, leave this blank and it will be filled in automatically. The script will update either the schema.json file or the datasets db if the value in the other location is more recent.
mapping: If you're scraping tabular data with a data_type that exists in the data_intake database, use these fields to map columns in your Extraction to columns in data_intake. __uuid__, __dataset_id__, and __date__ will be automatically overwritten by the script (see the sketch after this list).
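To make the mapping concrete, here is a sketch of how an Extraction row could be translated into a data_intake row. The direction of the mapping is an assumption (keys treated as data_intake columns, values as the Extraction column to copy or a placeholder), and the helper is purely illustrative, not the ETL library's code.

import uuid
from datetime import datetime, timezone

# Illustration only. Assumes each key is a data_intake column and each value
# is either the Extraction column to copy from or a placeholder the ETL
# script fills on its own.
mapping = {
    "id": "__uuid__",
    "ccn": "ccn",
    "incidentDate": "date",
    "datasets_id": "__dataset_id__",
    "date_insert": "__date__",
}
dataset_id = "5740697099a311ebab258c8590d4a7fc"

def to_intake_row(extraction_row: dict) -> dict:
    row = {}
    for intake_column, source in mapping.items():
        if source == "__uuid__":
            row[intake_column] = uuid.uuid4().hex
        elif source == "__dataset_id__":
            row[intake_column] = dataset_id
        elif source == "__date__":
            row[intake_column] = datetime.now(timezone.utc).isoformat()
        else:
            row[intake_column] = extraction_row[source]
    return row

# One scraped record from the cityprotect Extraction (hypothetical values):
print(to_intake_row({"ccn": "21-001234", "date": "2021-05-20"}))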