2021-03-27 database working session

Date

27 Mar 2021

Item

Notes

dolt

Review the DoltHub proof of concept for dataset catalogue
- does not yet include sources from @stabs or form submissions
- still need to remove the form and point people to dolthub, if this proof of concept looks good

why is dolt better than a daily backup?
lowers barrier to entry → can work without much automation or maintenance.
documents chain of custody.
overhead—how much time do people spend managing commits and PRs vs. actually scraping?
ease of use for contributors is a primary objective if we’re going to scale
scale—we’re building a unique database of databases in addition to scraped data
free
dolt can be used to identify & track sources → the translated data could be sent to
bounties only work for data, not scrapers/infrastructure.
bounties allow people to contribute with minimal social engagement.
Using Dolt as a POC for storing data in a version controlled fashion
How do we scale this?
Resolving version control may slow things down
Automatic scrapers pushing into a branch to merge to master becomes a bottleneck; not applicable atm
Edit history is tracked
Why dolt instead of daily backups? Or our own history tables?
Dolt can be used as a little more than an audit log for data sources

Do we need a global table for unique_data_properties, with SQL indexes on it based on data_types? Or are we better off creating a table for each data_type? How do we decide?

We should identify which data points are most consistently available across data sources → this gives us targets to hit.
metadata fields always need to be added in data discovery
- parent child relationship between types and properties
- the query will be ugly but this is how it’s best designed for large code
mongodb may work as a metadata repository
- we’d need something to talk to both
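To make the "global table plus index" option concrete, here is a minimal sketch using Python's built-in sqlite3 module. Only unique_data_properties and data_types come from the notes; every column name, the index name, and the sample rows are assumptions for illustration. It shows the parent/child relationship between types and properties, and the join needed to read it back.

```python
import sqlite3


def build_schema(conn):
    """Create a parent table of types and one global child table of properties."""
    conn.executescript(
        """
        CREATE TABLE data_types (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE unique_data_properties (
            id           INTEGER PRIMARY KEY,
            data_type_id INTEGER NOT NULL REFERENCES data_types(id),
            name         TEXT NOT NULL
        );
        -- Index so lookups by data type do not scan the whole global table.
        CREATE INDEX idx_properties_by_type
            ON unique_data_properties (data_type_id);
        """
    )


def properties_for(conn, type_name):
    """Return the property names registered for one data type, sorted."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM unique_data_properties AS p
        JOIN data_types AS t ON t.id = p.data_type_id
        WHERE t.name = ?
        ORDER BY p.name
        """,
        (type_name,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_schema(conn)
# Hypothetical sample rows, not a proposed property list.
conn.execute("INSERT INTO data_types (id, name) VALUES (1, 'arrest_reports')")
conn.execute("INSERT INTO data_types (id, name) VALUES (2, 'traffic_stops')")
conn.executemany(
    "INSERT INTO unique_data_properties (data_type_id, name) VALUES (?, ?)",
    [(1, "charge"), (1, "arrest_date"), (2, "stop_reason")],
)
print(properties_for(conn, "arrest_reports"))
```

The alternative of one table per data_type avoids the join but means a schema change every time a new type is added.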

Define the next steps for keeping structure parity between Datasets, Data, and Scrapers
- How should these structures relate to one another?

Scrapers need a config file for where they’re sending data
Dolt repos may need a path back to the scraper
dolthub ≠ where people provide scraper code
Tiered approach for data properties: 1. NIBRS 2. Store CSV…anything in between?
We can decide which tiers we’re actively ingesting based on how often they’re available.
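A scraper config along these lines could look like the sketch below. The field names (scraper_repo, destination, data_format) and the example values are hypothetical; nothing here was agreed in the session. It records where data is sent, the path back to the scraper code, and which tier of data format the scraper produces.

```python
import json
from dataclasses import dataclass

# Hypothetical required keys; the real config format is still undecided.
REQUIRED_KEYS = ("scraper_repo", "destination", "data_format")


@dataclass(frozen=True)
class ScraperConfig:
    scraper_repo: str  # path back to where the scraper code lives
    destination: str   # where the scraped data is sent (e.g. a Dolt repo)
    data_format: str   # tier of the output, e.g. "nibrs" or "raw_csv"


def load_config(text):
    """Parse a JSON config string and fail loudly if a key is missing."""
    raw = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"config is missing keys: {missing}")
    return ScraperConfig(**{key: raw[key] for key in REQUIRED_KEYS})


# Placeholder values only.
example = """
{
  "scraper_repo": "example-org/example-scraper",
  "destination": "example-org/example-dataset",
  "data_format": "raw_csv"
}
"""
config = load_config(example)
print(config.destination)
```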

Enterprise features / structure down the road

Risk: GitHub or DoltHub fails
We’ll want a backup DB, then an API.
Sooner rather than later we need to back up our data for better disaster recovery.

Documentation: read the docs vs confluence

People bringing drop-in scrapers

Scrapers (the humans) often bring their own scraper library with them → we should enable them to slot in to the extent it’s possible
Messaging scrapers: what are we targeting?
- We need to make sure we’re clear about the breadth / scope of data.
Fragmenting the target makes it hard for us to progress. We want to focus on NIBRS format data to validate it against what’s available from the FBI & broaden context

data to not collect

At one point this was our list. Is it still accurate?

Consider: not collecting data we don’t know is legal to have.

The government made the data public. If it’s public, it’s not considered personal.
We need to be careful which source the data is coming from. If it’s not a direct source, we may be subject to different restrictions. Will third party aggregators have
With the bounty program we can mandate that they include certain proofs in their submission

Whenever we decide on a property to collect, we need to justify it / provide rationale for the decision

Are we collecting data aggregated by third parties?

Each aggregator has their own format
Do we want to throw it away?
Discrepancies in the data aren’t necessarily “red flags”; agencies may be aggregating according to different criteria, so there may be discrepancies that stem from those unknown criteria
We do want aggregate level data for validating the record level data that is being returned to us
For now these are stored as source_type = “third_party”
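One way to use aggregate-level data as a check on record-level data is sketched below. The record shape, the "offense" key, and the function name are assumptions for illustration. It reports mismatches for human review and does not treat them as errors, since the aggregator may be counting by different criteria.

```python
from collections import Counter


def compare_to_aggregate(records, aggregate_totals, key="offense"):
    """Return {category: (record_count, aggregate_count)} where the two differ.

    A non-empty result is a prompt for review, not proof that either side is wrong.
    """
    record_counts = Counter(record[key] for record in records)
    differences = {}
    for category in sorted(set(record_counts) | set(aggregate_totals)):
        from_records = record_counts.get(category, 0)
        from_aggregate = aggregate_totals.get(category, 0)
        if from_records != from_aggregate:
            differences[category] = (from_records, from_aggregate)
    return differences


# Hypothetical data: three scraped records and a third-party summary.
scraped = [{"offense": "burglary"}, {"offense": "burglary"}, {"offense": "assault"}]
third_party = {"burglary": 2, "assault": 3}
print(compare_to_aggregate(scraped, third_party))
```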

Data types to prioritize

Scraper approval

ETL framework

Should extraction / ETL be a required part of a scraper?
Mitch Miller is using python to create a framework which should be flexible enough to meet our needs
ETL should not go directly into the scraper, but it does need to be closely related.
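A minimal sketch of what "closely related but separate" could mean in practice: the scraper only returns raw rows, and a transform step is supplied alongside it. This is not Mitch Miller's framework; all function names and the sample row are hypothetical.

```python
def example_scraper():
    """Stand-in for a real scraper: returns raw rows exactly as found."""
    return [{"Date": "2021-03-27", "Type": " Arrest "}]


def example_transform(row):
    """Stand-in ETL step: normalizes one raw row without touching the original."""
    return {"date": row["Date"], "type": row["Type"].strip().lower()}


def run(scraper, transform):
    """Run a scraper, then apply the transform, keeping the raw rows as well."""
    raw_rows = list(scraper())
    cleaned_rows = [transform(row) for row in raw_rows]
    return raw_rows, cleaned_rows


raw, cleaned = run(example_scraper, example_transform)
print(cleaned)
```

Keeping the raw rows next to the cleaned ones means the raw data can be stored unmodified while the transform evolves separately.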

Former user (Deleted) find out how much overhead (time) is involved in an end-to-end dolt scrape → public
Former user (Deleted) add data format (NIBRS) column to dataset catalogue
Former user (Deleted) Identify minimum data properties to meet NIBRS data format, has this somewhere already https://pdap.atlassian.net/browse/PDAP-118
Former user (Deleted) Document data_types priority in scrapers readme. Arrest Reports, Traffic Stops, Incident reports. What’s most available. What most easily paints the full picture. Tiers. https://pdap.atlassian.net/browse/PDAP-119
Former user (Deleted) validate workflow: localized, raw data is stored in dolt → it could be aggregated / ETL’d to a centralized server. dolt is the audit/transparency.
Former user (Deleted) Expose basic roadmap in documentation with backup DB → API
Josh Chamberlain Draft a policy and rationale for “fields not to collect” → A&P
Josh Chamberlain Draft a policy and rationale for mirroring dataset websites → A&P
Former user (Deleted) import dataset catalogue form submissions and deprecate form
Richard Ji Get @stabs base scraper approved

Mitch Miller is making an ETL framework
