2021-03-27 database working session


27 Mar 2021



Discussion topics




  • Review the DoltHub proof of concept for dataset catalogue

    • does not yet include sources from @stabs or form submissions

    • still need to remove the form and point people to DoltHub, if this proof of concept looks good

  • why is dolt better than a daily backup?

  • lowers barrier to entry → can work without much automation or maintenance.

  • documents chain of custody.

  • overhead—how much time do people spend managing commits and PRs vs. actually scraping?

  • ease of use for contributors is a primary objective if we’re going to scale

  • scale—we’re building a unique database of databases in addition to scraped data

  • free

  • dolt can be used to identify & track sources → the translated data could be sent to

  • bounties only work for data, not scrapers/infrastructure.

  • bounties allow people to contribute with minimal social engagement.

  • Using Dolt as a POC for storing data in a version controlled fashion

  • How do we scale this?

  • Resolving version control may slow things down

  • Automatic scrapers pushing into a branch to merge to master becomes a bottleneck; not applicable at the moment

  • Edit history is tracked

  • Why dolt instead of daily backups? Or our own history tables?

  • Dolt can be used as a little more than an audit log for data sources

  • Do we need a global table for unique_data_properties, with SQL indexes on it based on data_types? Or are we better off creating a table for each data_type? How do we decide?

  • We should identify which data points are most consistently available across data sources → this gives us targets to hit.

  • metadata fields always need to be added in data discovery

    • parent child relationship between types and properties

    • the query will be ugly, but this is the design that works best for a large codebase

  • MongoDB may work as a metadata repository

    • we’d need something to talk to both

  • Define the next steps for keeping structure parity between Datasets, Data, and Scrapers

    • How should these structures relate to one another?

  • Scrapers need a config file for where they’re sending data

  • Dolt repos may need a path back to the scraper

  • DoltHub ≠ where people provide scraper code

  • Tiered approach for data properties: 1. NIBRS 2. Store CSV…anything in between?

  • We can decide which tiers we’re actively ingesting based on how often they’re available.
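
One way to picture the "global table" option raised above, with a parent-child relationship between types and properties, is the minimal sketch below. It uses Python's built-in sqlite3 module purely for illustration. The data_types table, the column names, and the sample properties (arrest_date, charge, stop_date) are assumptions made for this example, not an agreed schema.

```python
import sqlite3

# Hypothetical schema: table, column, and property names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_types (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- One global table of properties; each row points at its parent data type.
CREATE TABLE unique_data_properties (
    id           INTEGER PRIMARY KEY,
    data_type_id INTEGER NOT NULL REFERENCES data_types(id),
    name         TEXT NOT NULL
);
-- Index so lookups by data type do not scan the whole table.
CREATE INDEX idx_properties_by_type ON unique_data_properties (data_type_id);
""")

conn.executemany(
    "INSERT INTO data_types (name) VALUES (?)",
    [("arrest_reports",), ("traffic_stops",)],
)
conn.executemany(
    "INSERT INTO unique_data_properties (data_type_id, name) "
    "SELECT id, ? FROM data_types WHERE name = ?",
    [
        ("arrest_date", "arrest_reports"),
        ("charge", "arrest_reports"),
        ("stop_date", "traffic_stops"),
    ],
)


def properties_for(type_name):
    """Return the property names registered under one data type, sorted."""
    rows = conn.execute(
        "SELECT p.name FROM unique_data_properties AS p "
        "JOIN data_types AS t ON t.id = p.data_type_id "
        "WHERE t.name = ? ORDER BY p.name",
        (type_name,),
    )
    return [name for (name,) in rows]


print(properties_for("arrest_reports"))
```

The alternative discussed, one table per data_type, would avoid the join but requires a schema change whenever a new type is discovered.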

Enterprise features / structure down the road

  • Risk: GitHub or DoltHub fails

  • We’ll want a backup DB, then an API.

  • Sooner rather than later we need to back up our data for better disaster recovery.

Documentation: Read the Docs vs. Confluence

  • Richard is writing documentation for backend design & workflows

  • it’s in markdown

  • Confluence is a stopgap for collaborating and getting on the same page

  • What does implementation look like?

  • https://readthedocs.org/

    • Does it work for open source?

People bringing drop-in scrapers

  • Scrapers (the humans) often bring their own scraper library with them → we should enable them to slot in to the extent it’s possible

  • Messaging scrapers: what are we targeting?

    • We need to make sure we’re clear about the breadth / scope of data.

  • Fragmenting the target makes it hard for us to progress. We want to focus on NIBRS format data to validate it against what’s available from the FBI & broaden context

Data to not collect

At one point this was our list. Is it accurate?

  • CaseNum

  • FirstName

  • MiddleName

  • LastName

  • DOB

  • DefenseAttorney

  • PublicDefender

  • Judge

  • ArrestingOfficer

  • ArrestingOfficerBadgeNumber
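
If the list above is confirmed, a scraper or ETL step could drop these fields before anything is stored. The sketch below is one possible way to do that. The function name strip_excluded and the sample record (including its Charge and ArrestDate fields) are hypothetical, and the field list simply mirrors the unconfirmed list in these notes.

```python
# Field names copied from the list in these notes; the list itself is unconfirmed.
EXCLUDED_FIELDS = frozenset({
    "CaseNum",
    "FirstName",
    "MiddleName",
    "LastName",
    "DOB",
    "DefenseAttorney",
    "PublicDefender",
    "Judge",
    "ArrestingOfficer",
    "ArrestingOfficerBadgeNumber",
})


def strip_excluded(record):
    """Return a copy of the record without any excluded fields."""
    return {key: value for key, value in record.items() if key not in EXCLUDED_FIELDS}


# Hypothetical scraped record, for illustration only.
record = {
    "CaseNum": "A-1",
    "LastName": "Doe",
    "Charge": "Trespass",
    "ArrestDate": "2021-03-01",
}
print(strip_excluded(record))  # {'Charge': 'Trespass', 'ArrestDate': '2021-03-01'}
```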

Consider: not collecting data we don’t know is legal to have.

  • the government made the data public. If it’s public, it’s not considered personal.

  • We need to be careful which source the data is coming from. If it’s not a direct source, we may be subject to different restrictions. Will third party aggregators have

  • With the bounty program we can mandate that they include certain proofs in their submission

Whenever we decide on a property to collect, we need to justify it / provide rationale for the decision

Are we collecting data aggregated by third parties?

  • Each aggregator has their own format

  • Do we want to throw it away?

  • Discrepancies in the data aren’t necessarily “red flags”; agencies may be aggregating according to different criteria, so there may be discrepancies that stem from those unknown criteria

  • We do want aggregate level data for validating the record level data that is being returned to us

  • For now these are stored as source_type = “third_party”
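
To illustrate how aggregate-level data might be used to check record-level data, here is a minimal sketch that counts scraped records per category and reports where the count differs from a third-party total. The function name compare_counts, the "data_type" key, and the sample numbers are all assumptions made for this example. As noted above, a difference is a prompt to investigate, not necessarily a red flag.

```python
from collections import Counter


def compare_counts(records, aggregate_totals, key="data_type"):
    """Return {category: record_count - aggregate_total} for every mismatch."""
    record_counts = Counter(record[key] for record in records)
    differences = {}
    for category in set(record_counts) | set(aggregate_totals):
        diff = record_counts.get(category, 0) - aggregate_totals.get(category, 0)
        if diff != 0:
            differences[category] = diff
    return differences


# Hypothetical inputs: three scraped records and totals from a third-party aggregator.
records = [
    {"data_type": "arrest"},
    {"data_type": "arrest"},
    {"data_type": "traffic_stop"},
]
aggregate_totals = {"arrest": 3, "traffic_stop": 1}
print(compare_counts(records, aggregate_totals))  # {'arrest': -1}
```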

Data types to prioritize

  • arrest reports

  • traffic stops

  • incident reports

Scraper approval

  • Multi-tier approval is possible with GitHub

  • How does Wikipedia do it?

  • Base scraper approval is most urgent

ETL framework

  • Should extraction / ETL be a required part of a scraper?

  • Mitch Miller is using Python to create a framework which should be flexible enough to meet our needs

  • ETL should not go directly into the scraper, but it does need to be closely related.

Action items

Mitch Miller is making an ETL framework




  • @stabs need to be recognized. Let’s be sure to celebrate their hard work