Datasets Properties

Schema Overview

Quick tips:
  • One Agency can have many Datasets.
  • Agencies are described primarily by their location.
  • Datasets are primarily described by their url.

Datasets table

id is a UUID without hyphens. If you leave this field blank, it will automatically generate.
url is the web location of the data for us to obtain. It could be a directory of files, a link to an aggregator, a map of incident reports. Our Examples & Best Practices guide will help you determine what to use for the url as well as determining the other fields for this table
status_id is the current status of the dataset
  1. 1.
    not started - we just received this dataset and have not started ingesting data for it
  2. 2.
    invalid URL - this dataset has had its URL change or is no longer working. We need to find
  3. 3.
    scraping in progress - someone is actively figuring out how to download / retrieve / scrape the data for us to store
  4. 4.
    initial data scraped - someone has successfully retrieved the data from the agency
  5. 5.
    initial data loaded - someone has taken the initially scraped data and formatted it in a way we can ingest it in our database (and / or provide a link to the original harvested data to be viewed)
  6. 6.
    done; process automated - we have figured out a way to do a sync with the agency, constantly refreshing with new data
source_type_id is where the data is sourced from
  1. 1.
    court- records coming out of an official court system
  2. 2.
    direct - records hosted by the agency themselves for consumption
  3. 3.
    third-party - records hosted by another provider (ArcGIS) or aggregator (CityProtect / Crimegraphics)
data_types_id is what kind of data this dataset is. Most of these are self explanatory
  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.
  8. 8.
  9. 9.
  10. 10.
  11. 11.
  12. 12.
  13. 13.
  14. 14.
  15. 15.
  16. 16.
  17. 17.
  18. 18.
  19. 19.
  20. 20.
  21. 21.
    video_metadata - metadata on a specific video (location, timestamps)
  22. 22.
    videos - this table in data-intake will store a link to view the actual video
  23. 23.
    crime_statistics - aggregated crime statistics (like the federal UCR or department specific aggregation)
  24. 24.
    annual_reports - some agencies provide annual reports over a plethora of information from the past year. Usually these are in the form of PDF documents
  25. 25.
  26. 26.
  27. 27.
    multi - while we prefer to have a specific data type for each url, our current schema requiring a unique URL may not allow that. Sometimes agencies store multiple types of data on the same URL (like Crimegraphics). Try to use this type sparingly and fill in the notes field on what all data can be found
format_types_id is how the data is structured. As of writing, we do not have format types fully fledged out. There is not quite a standard to how agencies present data unless it comes from an aggregator such as CityProtect or ArcGIS. Usually scrapers will have a better idea of the format_type. Here are our current format_types:
  1. 1.
    NIBRS - Data from the National Incident-Based Reporting System
  2. 2.
    cityprotect - Incident Report data from
  3. 3.
    crimemapping - Incident report data from
  4. 4.
    opendata - the Open Data format / initiative
  5. 5.
    crimegraphics - Data from
agency_id is a relationship to the agencies table. It links this dataset back to a specific agency.
update_frequency how often the data updates (usually the person responsible for retrieving the data will figure this out as they try to get the data). These are self-explanatory.
  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
portal_type this is a legacy field from our original schema. If the data comes from a certain aggregator like CityProtect, Tyler Technologies, ArcGIS, you can specify that here. This field may be merged into format_types_id in the future.
coverage_start if you can find when the agency started producing the data at this link, you can record it here. Some agencies keep historical records, some agencies wipe and start fresh each day.
scraper_id Not yet implemented, okay to leave blank
notes any notes about the dataset we should know. Especially useful if the data_types_id is multi for us to at-a-glance figure out what all it may comprise of.
can_scrape some agencies – and some states – have laws expressly forbidding scraping. We absolutely need to know if we are banned from scraping. Let us know that by setting this value to 0 if scraping is not allowed and 1 if there is no language indicating scraping is prohibited.
date_insert automatically generates when a new dataset is added
last_modified auto generates on insert, will need to be manually updated when you modify
machine_readable not fully implemented. If the dataset contains machine readable output like an API endpoint, JSON, CSV or XLSX files, we consider that machine readable. It is very easy to transform and load into a database, so that would be 1. If the output of the datasets is in PDF, DOCX format that is not easily parsable, it would be considered not machine readable so 0

Agencies Table

id is identical to the
name is the name of the agency such as "California Highway Patrol" or "Greenwood County Sheriff's" Office
agency_type is pretty self explanatory:
  1. 1.
    federal - the big leagues, like FBI, CIA level
  2. 2.
    state - agencies for an entire state
  3. 3.
    county - agencies for a specific county
  4. 4.
    city - agencies for a specific city
  5. 5.
    university - agencies that only serve a campus
  6. 6.
    defunct - this agency no longer exists, it could have been merged with another local agency or completely disbanded for one reason or another
state_iso a link to our states table. A two-digit code representing the state / province the agency is a part of
city is the city the agency is a part of (will not apply to federal / state / county agencies)
zip is the postal code of the agency, may not always apply
county_fips is the Federal Information Processing System code for US counties. The full FIPS is 13 characters long and unique to each geographic area. The first 2 digits are state, and the following 3 digits are county specific. You can use the FCC API with the lat and lng to find the FIPS code.
lat is the latitude of the main HQ of this particular agency (provided some agencies have multiple districts)
lng is the longitude of the main HQ of this particular agency (provided some agencies have multiple districts)
date_insert automatically generated on insert
data_policy we can store a link to Terms and Conditions or a Data Policy page from an agency. Very useful if they expressly forbid scraping
country should always be US for now, but is here in case we decided to start exploring other countries
homepage_url this is the agencies homepage that may be of interest if the dataset URLs break, we may be able to search their homepage again to see if we can find updated information
Copy link
Edit on GitHub