Law enforcement, courts, and corrections. Our focus is on the United States.
This is a scale that we're still defining.
The records should exist somewhere, but we need to locate them.
We know where records can be accessed.
There is historic data available in a stable archive.
data custody / provenance
Who collected and published the data?
agency_described (which agency is the data about?)
originating_entity (who generated the records?)
entity (who is publishing the records?)
Sometimes these are all the same entity; sometimes they are all different.
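As a rough illustration, the three custody fields above could be held in one small record. This is only a sketch: the `Provenance` class, its `distinct_parties` helper, and the example agency names are hypothetical and not part of any published schema; only the three field names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical container for the three custody fields listed above."""
    agency_described: str    # which agency the data is about
    originating_entity: str  # who generated the records
    entity: str              # who is publishing the records

    def distinct_parties(self) -> int:
        """Count how many different parties are involved (1 to 3)."""
        return len({self.agency_described, self.originating_entity, self.entity})


# Made-up values: one department that describes, generates, and publishes its own records.
same = Provenance("Example City PD", "Example City PD", "Example City PD")

# Made-up values: three different parties.
different = Provenance("Example City PD", "Example County Court", "Example State Portal")

print(same.distinct_parties())       # 1
print(different.distinct_parties())  # 3
```

Keeping the three roles as separate fields, even when they hold the same value, makes it possible to record the case where they differ without changing the structure.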
A URL pointing to a place on a police website where public records may be scraped, like "police-agency.com/arrest-reports". Read more here.
A raw, unprocessed HTML archive of a Data Source at a specific time.
Metadata is information about when and how data was collected. It is packaged with the data it describes, such as a Data Source or a Scraper's Extraction.
Some information is required by federal, state, or local law to be public. Governments keep several types of public records, and make them publicly available to different degrees.
A bit of code responsible for collecting an Extraction from a Data Source or Archive. Check out the GitHub repo. For more about our philosophy, start here.
Colloquially, "scraper" may refer to a person writing a Scraper.
The result of running a Scraper is an "Extraction". It is usually intended to parse or process an HTML page or PDF into more usable data.
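To make the relationship between these terms concrete, here is a minimal sketch of a Scraper that turns an Archive into an Extraction with metadata attached. Everything in it is an assumption made for illustration: the `Archive` and `Extraction` classes, the `scrape` function, the metadata keys, and the sample HTML and date are hypothetical, and it uses only Python's standard `html.parser` module. It is not the project's actual scraper code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from html.parser import HTMLParser


@dataclass
class Archive:
    """Raw, unprocessed HTML of a Data Source at a specific time."""
    source_url: str
    captured_at: datetime
    raw_html: str


@dataclass
class Extraction:
    """Parsed rows plus metadata about when and how they were collected."""
    rows: list
    metadata: dict


class _CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows with no <td> cells
                self.rows.append(self._row)
            self._row = None


def scrape(archive: Archive) -> Extraction:
    """Hypothetical Scraper: parse table rows out of an Archive."""
    parser = _CellCollector()
    parser.feed(archive.raw_html)
    parser.close()
    metadata = {
        "source_url": archive.source_url,
        "captured_at": archive.captured_at.isoformat(),
        "method": "html.parser table rows",  # assumed label, not a standard
    }
    return Extraction(rows=parser.rows, metadata=metadata)


# Made-up archive contents; the URL is the example Data Source from this glossary.
archive = Archive(
    source_url="police-agency.com/arrest-reports",
    captured_at=datetime(2021, 1, 1, tzinfo=timezone.utc),
    raw_html="<table><tr><td>2021-01-01</td><td>Example charge</td></tr></table>",
)
extraction = scrape(archive)
print(extraction.rows)      # [['2021-01-01', 'Example charge']]
print(extraction.metadata)
```

Running the Scraper against a stored Archive rather than the live Data Source means the same Extraction can be reproduced later, even if the page changes.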