Find Datasets to submit

Introduction

Every agency has one homepage and one or more Datasets. A Dataset is a URL we scrape for data. If you want to submit a new Dataset to scrape, this guide is for you!

Quick start

  1. Head to the Datasets repo to see which datasets we already have.
  2. Find a web page on the agency site with potentially useful data.
  3. Use it as a starting point to submit with web tools or the CLI.
For example, take Alameda, California:

Lists of Datasets

This is about to get a little meta.
Some Datasets are lists of other Datasets. They often look like this.
These have a Data Type of "list_of_datasets". Here they are in the database. This may be a good place to start when looking to add every dataset to a particular city.

Scraping Lists of Datasets

Since a list of Datasets is a resource rather than directly useful data, we don't need to prove chain of custody for it or save the scraper.
Code-free tools like Parsehub can be used when you encounter a "List of Datasets", and may be quicker for some people than writing code.
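If you'd rather write a few lines of code than use a point-and-click tool, the standard library is enough to pull links out of a "List of Datasets" page. This is a minimal sketch: the HTML snippet and URLs below are made-up stand-ins for a real agency page, so swap in the page you actually fetched.

```python
# A minimal sketch of scraping a "List of Datasets" page with only the
# Python standard library. The HTML below is a hypothetical stand-in
# for a real agency page.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects every href from <a> tags on the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical snippet of a list-of-datasets page
sample_html = """
<ul>
  <li><a href="https://example.gov/police/incident-reports">Incident Reports</a></li>
  <li><a href="https://example.gov/police/arrest-logs">Arrest Logs</a></li>
</ul>
"""

parser = LinkCollector()
parser.feed(sample_html)
print(parser.links)
```

Each collected link is a candidate Dataset URL to check against the Datasets repo before submitting.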

Deeper Dive

This tutorial will refer to several tables or dolthub links. Here, in one place, are the common tables you will either query in SQL or reference on the website:
  • data_types: Online or select * from data_types
  • statuses: Online or select * from dataset_status
  • source_types: Online or select * from source_types
  • id generator: Online or SELECT REPLACE(uuid(), '-', '');
    • Note: you can also generate one from a Python shell, using the uuid v4 generator (which is better than SQL's default generator).
    • As of 25 May 2021, the id columns have been set up to automatically generate the uuid for you if you do not specify the column in the insert.
# launch into a shell, or use this in a python program
~/ > python3
Python 3.9.4 (default, Apr 5 2021, 01:50:46)
>>> from uuid import uuid4
>>> str(uuid4()).replace('-', '')
'adaf2c5b17404300a5fb5a6501567f61'

# Alternatively, here is a one-liner
python -c "import uuid; print(str(uuid.uuid4()).replace('-',''))"
f1ecbb3c45834da3ba2a79e86b61a398

Agency with Multiple Different Data Types

The best case is when we get a link that is a directory or listing of files, a map with data on it, or an HTML table with data on it. First we need to pick an agency from the agencies table.
Let's look at Alameda, California. First we grab the agency id (either from the web or SQL) for later use.
We will use the Police Department.
So 5c2d0726d183487ba746402872573f42 is our agency id! Now we perform a Google search for Alameda Police Department CA, and we find their homepage. We can also see in the above table that homepage_url is blank, so we can run a quick update to store this homepage!
UPDATE agencies SET homepage_url = 'https://www.alamedaca.gov/Departments/Police-Department' where id = '5c2d0726d183487ba746402872573f42';
On the sidebar they have a button to Review Crime Activity! Perfect! Let's click on that!
5 different types of data!
Awesome! This particular link gives us 5 different types of data! We will want to capture each data type in its own dataset record, since the scraper for each data type will most likely differ a bit, and the table the data goes into will also be different. So let's start cross-referencing the data types with what we see on the page to build our datasets!

Dataset Enumeration

Let's just go down the list of the 5 types we have for this agency.

Alameda Crime Graphics

Crime Graphics
The first accordion on this page has a link for Crime Graphics. When we click the button here, it takes us to a brand-new link with incident reports, arrest logs, missing persons, daily bulletins, and more! Unfortunately, this website does not change the URL when you click any of the links on the sidebar. That means this data_type is a multi, which we can find on our data types reference list as id 27. Since this data is housed on an external source, it is a third party source type, which looks up as 3. It's also a good idea to leave a note for a multi data type. The status_id for our purposes will always be 1, because you are just adding the dataset, not scraping it. Scrapers will update this column to show the status of each dataset later! There are many other columns we can fill out for a dataset, but most of them will be used by data scrapers in their task, so it's not necessary to fill them in now, unless you really want to!
The datasets table will automatically generate the id for you if you do not specify it. However, if you really wish to specify it yourself, you can use the SQL function REPLACE(uuid(), '-', '') (which can be found via this link). This will always be a random string of letters and numbers; for me it returned c813aab6ba3d11ebb082927cb9e1207c. You would also need to specify it in the statement below, such as INSERT INTO datasets (id, url ...).
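If you are building the insert from Python instead of the SQL shell, you can generate an id in the same 32-hex-character format. This is a sketch: the column names follow the guide, but the INSERT string and parameter tuple here are illustrative, not an official client.

```python
# Sketch: generating a datasets id in Python that matches the format of
# SQL's REPLACE(uuid(), '-', '') -- 32 hex characters, no dashes.
from uuid import uuid4

dataset_id = uuid4().hex  # equivalent to str(uuid4()).replace('-', '')
print(dataset_id, len(dataset_id))

# If you specify the id yourself, include it in the column list.
# (Hypothetical parameterized statement; adapt to your SQL client.)
insert_sql = (
    "INSERT INTO datasets (id, url, status_id, data_types_id, "
    "source_type_id, agency_id) VALUES (%s, %s, %s, %s, %s, %s)"
)
params = (
    dataset_id,
    "https://alameda.crimegraphics.com/2013/default.aspx",
    1, 27, 3,
    "5c2d0726d183487ba746402872573f42",
)
```

Using parameters instead of pasting values into the string also avoids quoting mistakes in URLs and notes.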
INSERT INTO datasets
(url, status_id, data_types_id, source_type_id, agency_id, notes)
VALUES
(
  'https://alameda.crimegraphics.com/2013/default.aspx',
  1,
  27,
  3,
  '5c2d0726d183487ba746402872573f42',
  'CrimeGraphics: Incident Reports, Arrest Logs, Missing Persons, Daily Bulletins'
);

Alameda Crime Statistics

Just like above, let's go to the next accordion menu of Crime Statistics:
This provides us with another link, but this time the link is still on the official agency webpage. Looking back at our source types page: we would now use 2 for a direct source type. Now we look again at the data types and we find 24 which is annual_reports. And we are ready to insert!
INSERT INTO datasets
(url, status_id, data_types_id, source_type_id, agency_id)
VALUES
(
  'https://www.alamedaca.gov/Departments/Police-Department/Annual-Crime-Stats',
  1,
  24,
  2,
  '5c2d0726d183487ba746402872573f42'
);

PDFs

The final example I want to touch on is a multi data type for PDFs. We can see it in the 3rd and 5th accordions: Monthly Crime Statistics and Policy and Hate Crimes Reports. The url in the datasets table must be unique, so we cannot simply paste the same url twice for both data types. In this case we use data type 27 again for multi, source type 2 for direct again, and generate a new id. The difference here is that I will leave a note of what the multi entails.
INSERT INTO datasets
(url, status_id, data_types_id, source_type_id, agency_id, notes)
VALUES
(
  'https://www.alamedaca.gov/Departments/Police-Department/Crime-Activity',
  1,
  27,
  2,
  '5c2d0726d183487ba746402872573f42',
  'PDFs for Crime Stats (23), Policy & Hate Crimes Reports'
);
We just need to do the 4th accordion, and then we can push our data in a PR, or start over and pick another agency! You can also verify the records you added in the SQL instance where you're working: select * from datasets where agency_id='5c2d0726d183487ba746402872573f42' and date_insert >= '2021-05-21'. This grabs all of the records for the current agency, and the date_insert filter helps in case the agency you are working with already had a previous record added, but you found more!
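To see how that verification query behaves, here is a sketch run against a throwaway in-memory SQLite copy of a simplified datasets table. The real database lives on DoltHub and has more columns, so treat the schema and sample rows below as illustration only.

```python
# Sketch: the verification query from the guide, run against a
# simplified in-memory SQLite stand-in for the datasets table.
import sqlite3

AGENCY_ID = "5c2d0726d183487ba746402872573f42"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (id TEXT, url TEXT, agency_id TEXT, date_insert TEXT)"
)
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    [
        # record added in this session
        ("a1", "https://alameda.crimegraphics.com/2013/default.aspx",
         AGENCY_ID, "2021-05-21"),
        # hypothetical older record for the same agency, filtered out by date
        ("b2", "https://example.gov/old-record", AGENCY_ID, "2020-01-01"),
    ],
)

rows = conn.execute(
    "SELECT url FROM datasets "
    "WHERE agency_id = ? AND date_insert >= '2021-05-21'",
    (AGENCY_ID,),
).fetchall()
print(rows)  # only the record added on or after the cutoff date
```

The date filter is what separates your new submissions from any records the agency already had.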