2021-03-27 database working session
27 Mar 2021
Mitch Miller
Kristin Tynski
Jeff Jockisch
Discussion topics
does not yet include sources from @stabs or form submissions
still need to remove the form and point people to dolthub, if this proof of concept looks good
why is dolt better than a daily backup?
lowers barrier to entry → can work without much automation or maintenance.
documents chain of custody.
overhead—how much time do people spend managing commits and PRs vs. actually scraping?
ease of use for contributors is a primary objective if we’re going to scale
scale—we’re building a unique database of databases in addition to scraped data
dolt can be used to identify & track sources → the translated data could be sent to
bounties only work for data, not scrapers/infrastructure.
bounties allow people to contribute with minimal social engagement.
Using Dolt as a POC for storing data in a version controlled fashion
How do we scale this?
Resolving version control may slow things down
Automatic scrapers pushing into a branch to merge to master becomes a bottleneck; not applicable atm
Edit history is tracked
Why dolt instead of daily backups? Or our own history tables?
Dolt can be used as a little more than an audit log for data sources
Do we need a global table for → we make SQL INDEXes of that based on data_types? Or are we better off creating tables for each data_type? How do we decide?
We should identify which data points are most consistently available across data sources → this gives us targets to hit.
metadata fields always need to be added in data discovery
parent child relationship between types and properties
the query will be ugly but this is how it’s best designed for large code
mongodb may work as a metadata repository
we’d need something to talk to both
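A minimal sketch of the "global table plus index" option and the parent/child relationship between types and properties, raised above. It uses Python's built-in sqlite3 purely as a stand-in for a SQL database; the table names, column names, and sample rows are all hypothetical and are not the project's actual schema.

```python
import sqlite3

# Hypothetical schema for illustration only: one global "records" table,
# indexed by data type, plus a parent/child types -> properties relationship.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data_types (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Child table: each property belongs to exactly one data type.
    CREATE TABLE properties (
        id           INTEGER PRIMARY KEY,
        data_type_id INTEGER NOT NULL REFERENCES data_types(id),
        name         TEXT NOT NULL
    );
    -- Global table holding every record, whatever its type.
    CREATE TABLE records (
        id           INTEGER PRIMARY KEY,
        data_type_id INTEGER NOT NULL REFERENCES data_types(id),
        source_url   TEXT NOT NULL
    );
    CREATE INDEX idx_records_data_type ON records (data_type_id);
    """
)

conn.executemany(
    "INSERT INTO data_types (id, name) VALUES (?, ?)",
    [(1, "arrest_reports"), (2, "traffic_stops"), (3, "incident_reports")],
)
conn.executemany(
    "INSERT INTO properties (data_type_id, name) VALUES (?, ?)",
    [(1, "arrest_date"), (1, "charge"), (2, "stop_date")],
)
conn.executemany(
    "INSERT INTO records (data_type_id, source_url) VALUES (?, ?)",
    [
        (1, "https://example.org/a"),
        (2, "https://example.org/b"),
        (1, "https://example.org/c"),
    ],
)


def records_for(type_name):
    """Return source URLs of all records of one data type, in insert order."""
    rows = conn.execute(
        "SELECT r.source_url FROM records r "
        "JOIN data_types t ON t.id = r.data_type_id "
        "WHERE t.name = ? ORDER BY r.id",
        (type_name,),
    ).fetchall()
    return [row[0] for row in rows]


def properties_for(type_name):
    """Return the property names registered under one data type."""
    rows = conn.execute(
        "SELECT p.name FROM properties p "
        "JOIN data_types t ON t.id = p.data_type_id "
        "WHERE t.name = ? ORDER BY p.id",
        (type_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

The alternative discussed (one table per data_type) would drop the join at the cost of a schema change every time a new type is added; this sketch does not settle that question.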
Define the next steps for keeping structure parity between Datasets, Data, and Scrapers
How should these structures relate to one another?
Scrapers need a config file for where they’re sending data
Dolt repos may need a path back to the scraper
dolthub ≠ where people provide scraper code
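A rough sketch of what a scraper config could hold, assuming a JSON file that records both where the data is sent and the path back to the scraper code. Every field name here (scraper_repo, dolt_repo, dolt_branch, data_type) and both example repositories are hypothetical; nothing in these notes fixes a format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ScraperConfig:
    scraper_repo: str  # where the scraper code lives (the "path back")
    dolt_repo: str     # where the scraped data is sent
    dolt_branch: str   # branch the scraper pushes to
    data_type: str     # e.g. "arrest_reports"


REQUIRED_FIELDS = ("scraper_repo", "dolt_repo", "data_type")


def load_config(text):
    """Parse a JSON config string and fail loudly if a required field is missing."""
    raw = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    if missing:
        raise ValueError("missing config fields: " + ", ".join(missing))
    return ScraperConfig(
        scraper_repo=raw["scraper_repo"],
        dolt_repo=raw["dolt_repo"],
        dolt_branch=raw.get("dolt_branch", "master"),  # assumed default
        data_type=raw["data_type"],
    )


# Hypothetical example values, not real repositories.
EXAMPLE = json.dumps(
    {
        "scraper_repo": "https://github.com/example-org/example-scraper",
        "dolt_repo": "example-org/example-data",
        "data_type": "arrest_reports",
    }
)
config = load_config(EXAMPLE)
```

Storing the same scraper_repo value alongside the data in the Dolt repo would give the reverse link mentioned above.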
Tiered approach for data properties: 1. NIBRS 2. Store CSV…anything in between?
We can decide which tiers we’re actively ingesting based on how often they’re available.
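One way to ground the "how often they're available" decision is to count how many sources expose each property and keep those above a cutoff. The sketch below assumes a hand-built mapping of source to available fields; the agency names, field names, and the 0.6 threshold are made up for illustration.

```python
from collections import Counter

# Hypothetical survey of which fields each source publishes.
SOURCES = {
    "agency_a": {"arrest_date", "charge", "officer_id"},
    "agency_b": {"arrest_date", "charge"},
    "agency_c": {"arrest_date", "location"},
}


def availability(sources):
    """Return the share of sources (0.0 to 1.0) that provide each field."""
    counts = Counter(field for fields in sources.values() for field in fields)
    total = len(sources)
    return {field: count / total for field, count in counts.items()}


def targets(sources, threshold=0.6):
    """Return fields available in at least `threshold` of sources, sorted by name."""
    shares = availability(sources)
    return sorted(field for field, share in shares.items() if share >= threshold)
```

With the sample data, targets(SOURCES) keeps arrest_date (3 of 3 sources) and charge (2 of 3) and drops the fields only one source provides.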
Enterprise features / structure down the road
Risk: github or dolthub fails
We’ll want a backup DB, then an API.
Sooner rather than later we need to back up our data for better disaster recovery.
Documentation: read the docs vs confluence
Richard is writing documentation for backend design & workflows
it’s in markdown
Confluence is a stopgap for collaborating and getting on the same page
What does implementation look like?
Does it work for open source?
People bringing drop-in scrapers
Scrapers (the humans) often bring their own scraper library with them → we should enable them to slot in to the extent it’s possible
Messaging scrapers: what are we targeting?
We need to make sure we’re clear about the breadth / scope of data.
Fragmenting the target makes it hard for us to progress. We want to focus on NIBRS format data to validate it against what’s available from the FBI & broaden context
data to not collect
At one point this was our list, is it accurate?
Consider: not collecting data we don’t know is legal to have.
the government made the data public. If it’s public, it’s not considered personal.
We need to be careful which source the data is coming from. If it’s not a direct source, we may be subject to different restrictions. Will third party aggregators have
With the bounty program we can mandate that they include certain proofs in their submission
Whenever we decide on a property to collect, we need to justify it / provide rationale for the decision
Are we collecting data aggregated by third parties?
Each aggregator has their own format
Do we want to throw it away?
Discrepancies in the data aren’t necessarily “red flags”; agencies may be aggregating according to different criteria, so there may be discrepancies arising from those unknown criteria
We do want aggregate level data for validating the record level data that is being returned to us
For now these are stored as
source_type = “third_party”
Data types to prioritize
arrest reports
traffic stops
incident reports
Scraper approval
Multi-tier approval is possible with github
How does wikipedia do it?
Base scraper approval is most urgent
ETL framework
Should extraction / ETL be a required part of a scraper?
Mitch Miller is using python to create a framework which should be flexible enough to meet our needs
ETL should not go directly into the scraper, but it does need to be closely related.
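To illustrate "closely related but not inside the scraper", here is a small sketch in which the scraper only returns raw rows and a separate transform step normalizes them. This is not the framework mentioned above; the function names, raw column names, and output fields are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]


def example_scraper() -> List[Row]:
    """Stand-in scraper: returns rows exactly as the source publishes them."""
    return [
        {"Arrest Dt": "03/27/2021", "Charge": "TRESPASS "},
        {"Arrest Dt": "03/26/2021", "Charge": "dui"},
    ]


def transform_arrests(rows: Iterable[Row]) -> List[Row]:
    """Separate ETL step: map raw columns to normalized field names and formats."""
    cleaned = []
    for row in rows:
        month, day, year = row["Arrest Dt"].split("/")
        cleaned.append(
            {
                "arrest_date": f"{year}-{month}-{day}",  # ISO 8601 date
                "charge": row["Charge"].strip().upper(),
            }
        )
    return cleaned


def run(
    scrape: Callable[[], Iterable[Row]],
    transform: Callable[[Iterable[Row]], List[Row]],
) -> Tuple[List[Row], List[Row]]:
    """Run a scraper, keep its raw output untouched, and return (raw, transformed)."""
    raw = list(scrape())
    return raw, transform(raw)


raw_rows, clean_rows = run(example_scraper, transform_arrests)
```

Keeping the raw rows alongside the transformed ones means the untouched data can still be stored for audit while the normalized copy is sent elsewhere.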
Action items
Former user (Deleted) find out how much overhead (time) is involved in an end to end dolt scrape → public
Former user (Deleted) add data format (NIBRS) column to dataset catalogue
Former user (Deleted) Identify minimum data properties to meet NIBRS data format; has this somewhere already: https://pdap.atlassian.net/browse/PDAP-118
Former user (Deleted) Document data_types priority in scrapers readme. Arrest Reports, Traffic Stops, Incident reports. What’s most available. What most easily paints the full picture. Tiers. https://pdap.atlassian.net/browse/PDAP-119
Former user (Deleted) validate workflow: localized, raw data is stored in dolt → it could be aggregated / ETL’d to a centralized server. dolt is the audit/transparency.
Former user (Deleted) Expose basic roadmap in documentation with backup DB → API
Josh Chamberlain Draft a policy and rationale for “fields not to collect” → A&P
Josh Chamberlain Draft a policy and rationale for mirroring dataset websites → A&P
Former user (Deleted) import dataset catalogue form submissions and deprecate form
Richard Ji Get @stabs base scraper approved
Mitch Miller is making an ETL framework
@stabs need to be recognized. Let’s be sure to celebrate their hard work