Dolt
- Review the DoltHub proof of concept for the dataset catalogue. It does not yet include sources from @stabs or form submissions. If the proof of concept looks good, we still need to remove the form and point people to DoltHub.
- Why is Dolt better than a daily backup? It lowers the barrier to entry → it can work without much automation or maintenance, and it documents chain of custody.
- Overhead: how much time do people spend managing commits and PRs vs. actually scraping? Ease of use for contributors is a primary objective if we're going to scale.
- Scale: we're building a unique database of databases in addition to scraped data. Dolt can be used to identify & track sources → the translated data could be sent to bounties. Bounties only work for data, not scrapers/infrastructure, but they allow people to contribute with minimal social engagement.
- Using Dolt as a POC for storing data in a version-controlled fashion. Resolving version control may slow things down: automatic scrapers pushing into a branch to merge to master becomes a bottleneck, though that is not applicable at the moment (a sketch of that flow is below).
- Why Dolt instead of daily backups, or our own history tables? Dolt can be used as a little more than an audit log for data sources.
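A minimal sketch of the "scraper pushes to a branch, a human merges to master" flow, assuming the dolt CLI is installed and the scraper has already written a CSV; the repo path, table name, and branch naming below are placeholders rather than an agreed convention.

```python
import subprocess
from datetime import date

REPO_DIR = "datasets"   # hypothetical local clone of the DoltHub catalogue repo
TABLE = "incidents"     # hypothetical table this scraper feeds
BRANCH = f"scraper/{date.today().isoformat()}"

def dolt(*args):
    """Run a dolt CLI command inside the repo and fail loudly on errors."""
    subprocess.run(["dolt", *args], cwd=REPO_DIR, check=True)

# Work on a branch so automated pushes never land directly on master;
# a human reviews the diff and merges via a DoltHub pull request.
dolt("checkout", "-b", BRANCH)
dolt("table", "import", "-u", TABLE, "scraped.csv")  # -u updates an existing table
dolt("add", TABLE)
dolt("commit", "-m", f"Automated scrape of {TABLE} on {date.today()}")
dolt("push", "origin", BRANCH)
```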
Do we need a global table for unique_data_properties, with SQL indexes on it based on data_types? Or are we better off creating tables for each data_type? How do we decide?
- We should identify which data points are most consistently available across data sources → this gives us targets to hit.
- Metadata fields always need to be added during data discovery.
- There is a parent–child relationship between types and properties; the query will be ugly, but this is how it's best designed for a large codebase (see the schema sketch below).
- MongoDB may work as a metadata repository; we'd need something that can talk to both.
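One way to picture the "global properties table with a parent–child link to data types" option, expressed as hypothetical DDL applied through the dolt CLI; the table and column names are illustrative, not a settled schema. The alternative (a table per data_type) avoids the ugly join but multiplies the schemas we would have to keep in parity.

```python
import subprocess

# Hypothetical schema: one global unique_data_properties table with a
# parent-child link to data_types, indexed by type, instead of one table
# per data_type. All names below are illustrative.
DDL = """
CREATE TABLE data_types (
    id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);
CREATE TABLE unique_data_properties (
    id INT PRIMARY KEY,
    data_type_id INT NOT NULL,
    name VARCHAR(100) NOT NULL,
    FOREIGN KEY (data_type_id) REFERENCES data_types (id)
);
CREATE INDEX idx_properties_by_type ON unique_data_properties (data_type_id);
"""

subprocess.run(["dolt", "sql", "-q", DDL], cwd="datasets", check=True)
```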
Define the next steps for keeping structure parity between Datasets, Data, and Scrapers. How should these structures relate to one another?
- Scrapers need a config file for where they're sending data, and Dolt repos may need a path back to the scraper; DoltHub is not where people provide scraper code (a config sketch follows this list).
- Tiered approach for data properties: 1. NIBRS 2. Store CSV…anything in between? We can decide which tiers we're actively ingesting based on how often they're available.
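A sketch of what such a scraper config could contain, assuming JSON and made-up field names; the point is only that the scraper records where it sends data and the Dolt repo can point back to the scraper code.

```python
import json

# Hypothetical scraper config; every field name here is illustrative.
config = {
    "scraper_repo": "https://github.com/example-org/example-scraper",  # path back to the scraper code
    "destination": {
        "dolt_remote": "example-org/datasets",  # DoltHub repo the data is sent to
        "table": "incidents",
        "branch_prefix": "scraper/",
    },
    "data_tier": "NIBRS",  # which property tier this scraper targets
}

with open("scraper_config.json", "w") as f:
    json.dump(config, f, indent=2)
```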
Enterprise features / structure down the road
- Risk: GitHub or DoltHub fails. We'll want a backup DB, then an API. Sooner rather than later we need to back up our data for better disaster recovery.
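A minimal disaster-recovery sketch, under the assumption that mirroring the DoltHub remote to storage we control is enough for now; the remote and backup paths are placeholders.

```python
import os
import subprocess

REMOTE = "example-org/datasets"   # placeholder DoltHub repo
BACKUP_DIR = "/backups/datasets"  # placeholder path on storage we control

# Nightly job: clone the repo once, then keep pulling so a DoltHub outage
# never leaves us without a copy of the catalogue.
if not os.path.isdir(BACKUP_DIR):
    subprocess.run(["dolt", "clone", REMOTE, BACKUP_DIR], check=True)
else:
    subprocess.run(["dolt", "pull"], cwd=BACKUP_DIR, check=True)
```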
Documentation: Read the Docs vs. Confluence
- Richard is writing documentation for backend design & workflows.
- Confluence is a stopgap for collaborating and getting on the same page. What does implementation look like?
People bringing drop-in scrapers
- Scrapers (the humans) often bring their own scraper library with them → we should enable them to slot in to the extent possible (one possible interface is sketched after this list).
- Messaging to scrapers: what are we targeting? We need to make sure we're clear about the breadth / scope of the data. Fragmenting the target makes it hard for us to progress. We want to focus on NIBRS-format data to validate it against what's available from the FBI & broaden context.
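One way "slotting in" could work is a thin wrapper interface around whatever library a contributor already uses. This is a sketch with hypothetical names, not the framework Mitch is building.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable


class Scraper(ABC):
    """Hypothetical drop-in contract: contributors keep their own library
    internally and only expose records in an agreed shape."""

    source_name: str  # the agency or portal being scraped

    @abstractmethod
    def scrape(self) -> Iterable[Dict]:
        """Yield raw records; mapping to NIBRS-style fields happens later."""


class ContributedScraper(Scraper):
    """Example of wrapping a contributor's pre-existing scraper code."""

    source_name = "example-county-portal"

    def scrape(self) -> Iterable[Dict]:
        # their own library's fetch call would go here; hard-coded for illustration
        yield {"incident_id": "123", "offense": "burglary"}
```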
- At one point this was our list; is it accurate? ArrestingOfficerBadgeNumber
- Consider not collecting data we don't know is legal to have. The government made the data public, and if it's public, it's not considered personal. We need to be careful which source the data is coming from: if it's not a direct source, we may be subject to different restrictions. Will third-party aggregators have …? With the bounty program we can mandate that contributors include certain proofs in their submission.
- Whenever we decide on a property to collect, we need to justify it / provide a rationale for the decision (a possible record format is sketched below).
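A sketch of what a per-property justification record could look like, using a hypothetical dataclass; the fields are guesses at what "justify it" might need to capture (rationale, legal basis, source type).

```python
from dataclasses import dataclass


@dataclass
class PropertyJustification:
    """Hypothetical record kept whenever we decide to collect a property."""

    property_name: str  # e.g. "ArrestingOfficerBadgeNumber"
    rationale: str      # why collecting this serves the project
    legal_basis: str    # why we believe it is legal to hold
    source_type: str    # "direct" vs "third_party"; restrictions may differ
    decided_by: str     # who approved the decision


example = PropertyJustification(
    property_name="ArrestingOfficerBadgeNumber",
    rationale="Needed to link incidents to officers across records.",
    legal_basis="Published by the agency in its public arrest log.",
    source_type="direct",
    decided_by="data review meeting",
)
```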
Are we collecting data aggregated by third parties?
- Each aggregator has their own format: do we want to throw it away?
- Discrepancies in the data aren't necessarily "red flags"; agencies may be aggregating according to different criteria, so there may be discrepancies driven by criteria we don't know.
- We do want aggregate-level data for validating the record-level data being returned to us (see the validation sketch below). For now these sources are stored as source_type = "third_party".
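A sketch of the kind of validation aggregate data could support: compare an aggregator's published totals with counts over the record-level rows we scraped. File names and field names are placeholders.

```python
import csv
from collections import Counter


def count_records_by_agency(path="records.csv"):
    """Count record-level rows per (agency, year)."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[(row["agency"], row["year"])] += 1
    return counts


def load_third_party_totals(path="aggregate.csv"):
    """Load the aggregator's published (agency, year) totals."""
    with open(path, newline="") as f:
        return {(row["agency"], row["year"]): int(row["incident_count"])
                for row in csv.DictReader(f)}


record_counts = count_records_by_agency()
for key, published in load_third_party_totals().items():
    scraped = record_counts.get(key, 0)
    if scraped != published:
        # Not automatically a red flag: the aggregator may use different criteria.
        print(f"{key}: scraped {scraped} records vs third_party total {published}")
```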
- Multi-tier approval is possible with GitHub. How does Wikipedia do it? Base scraper approval is most urgent.
- Should extraction / ETL be a required part of a scraper? Mitch Miller is using Python to create a framework which should be flexible enough to meet our needs. ETL should not go directly into the scraper, but it does need to be closely related (a sketch of that separation follows).
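One way to keep ETL out of the scraper while keeping it closely related, reusing the hypothetical Scraper interface sketched earlier; the transform is an illustration, not Mitch's framework.

```python
def to_nibrs_like(record: dict) -> dict:
    """Hypothetical transform step: lives next to, not inside, the scraper."""
    return {
        "incident_id": record["incident_id"],
        "offense_code": record.get("offense", "").upper(),
    }


def run_pipeline(scraper) -> list:
    """Extraction (scraper.scrape) and transformation are separate, swappable stages."""
    return [to_nibrs_like(raw) for raw in scraper.scrape()]


# Usage with the drop-in scraper sketched earlier:
# records = run_pipeline(ContributedScraper())
```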