# 2021-03-27 database working session

## Date <a href="#id-2021-03-27databaseworkingsession-date" id="id-2021-03-27databaseworkingsession-date"></a>

27 Mar 2021

## Participants <a href="#id-2021-03-27databaseworkingsession-participants" id="id-2021-03-27databaseworkingsession-participants"></a>

* [Eddie Brown (Unlicensed)](https://pdap.atlassian.net/wiki/people/5fd63e354d2179006ecbcb80?ref=confluence)
* [Alec Akin](https://pdap.atlassian.net/wiki/people/60319bf02a42cc0069af9ac8?ref=confluence)
* Mitch Miller
* [Josh Lintag](https://pdap.atlassian.net/wiki/people/5f20c61fc9c094001c5d32ca?ref=confluence)
* Kristin Tynski
* Jeff Jockisch
* [Richard Ji](https://pdap.atlassian.net/wiki/people/5f8f95be0e068b00766b6903?ref=confluence)
* [Former user (Deleted)](https://pdap.atlassian.net/wiki/people/5f8f95be40588b0077ed830a?ref=confluence)

## Goals <a href="#id-2021-03-27databaseworkingsession-goals" id="id-2021-03-27databaseworkingsession-goals"></a>

## Discussion topics <a href="#id-2021-03-27databaseworkingsession-discussiontopics" id="id-2021-03-27databaseworkingsession-discussiontopics"></a>

| Item                                                                                                                                                                                                                                                                                                                                                | Notes                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <p>dolt</p><ul><li><p>Review the DoltHub proof of concept for <a href="https://www.dolthub.com/repositories/pdap/datasets">dataset catalogue</a></p><ul><li>does not yet include sources from @stabs or form submissions</li><li>still need to remove the form and point people to dolthub, if this proof of concept looks good</li></ul></li></ul> | <ul><li>why is dolt better than a daily backup?</li><li>lowers barrier to entry → can work without much automation or maintenance.</li><li>documents chain of custody.</li><li>overhead—how much time do people spend managing commits and PRs vs. <strong>actually scraping</strong>?</li><li><strong>ease of use for contributors</strong> is a primary objective if we’re going to scale</li><li>scale—we’re building a unique database of databases in addition to scraped data</li><li>free</li><li>dolt can be used to identify & track sources → the translated data could be sent to</li><li>bounties only work for data, not scrapers/infrastructure.</li><li>bounties allow people to contribute with minimal social engagement.</li><li>Using Dolt as a POC for storing data in a version controlled fashion</li><li>How do we scale this?</li><li>Resolving version control may slow things down</li><li>Automatic scrapers pushing into a branch to merge to master becomes a bottleneck; not applicable atm</li><li>Edit history is tracked</li><li>Why dolt instead of daily backups? Or our own history tables?</li><li>Dolt can be used as a little more than an audit log for data sources</li></ul> |
| <ul><li>Do we need a global table for <code>unique\_data\_properties</code> → we make SQL <code>INDEX</code>es of that based on <code>data\_types</code>? Or are we better off creating <strong>tables</strong> for each <code>data\_type</code>? How do we decide?</li></ul>                                                                       | <ul><li>We should identify which data points are most consistently available across data sources → this gives us targets to hit.</li><li><p>metadata fields always need to be added in data discovery</p><ul><li>parent child relationship between types and properties</li><li>the query will be ugly but this is how it’s best designed for large code</li></ul></li><li><p>mongodb may work as a metadata repository</p><ul><li>we’d need something to talk to both</li></ul></li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| <ul><li><p>Define the next steps for keeping structure parity between Datasets, Data, and Scrapers</p><ul><li>How should these structures relate to one another?</li></ul></li></ul>                                                                                                                                                                | <ul><li><strong>Scrapers need a config file for where they’re sending data</strong></li><li><strong>Dolt repos may need a path back to the scraper</strong></li><li>dolthub ≠ where people provide scraper code</li><li>Tiered approach for data properties: 1. NIBRS 2. Store CSV…anything in between?</li><li>We can decide which tiers we’re actively ingesting based on how often they’re available.</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| Enterprise features / structure down the road                                                                                                                                                                                                                                                                                                       | <ul><li>Risk: github or dolthub fails</li><li>We’ll want a backup DB, then an API.</li><li>Sooner rather than later we need to back up our data for better disaster recovery.</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Documentation: read the docs vs confluence                                                                                                                                                                                                                                                                                                          | <ul><li>Richard is writing documentation for backend design & workflows</li><li>it’s in markdown</li><li>Confluence is a stopgap for collaborating and getting on the same page</li><li>What does implementation look like?</li><li><p><a href="https://readthedocs.org/"><https://readthedocs.org/></a></p><ul><li>Does it work for open source?</li></ul></li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| People bringing drop-in scrapers                                                                                                                                                                                                                                                                                                                    | <ul><li>Scrapers (the humans) often bring their own scraper library with them → we should enable them to slot in to the extent it’s possible</li><li><p>Messaging scrapers: <strong>what are we targeting</strong>?</p><ul><li>We need to make sure we’re clear about the breadth / scope of data.</li></ul></li><li>Fragmenting the target makes it hard for us to progress. We want to focus on NIBRS format data to validate it against what’s available from the FBI & broaden context</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| data to not collect                                                                                                                                                                                                                                                                                                                                 | <p>At one point this was our list, is it accurate?</p><ul><li>CaseNum</li><li>FirstName</li><li>MiddleName</li><li>LastName</li><li>DOB</li><li>DefenseAttorney</li><li>PublicDefender</li><li>Judge</li><li>ArrestingOfficer</li><li>ArrestingOfficerBadgeNumber</li></ul><p><strong>Consider:</strong> not collecting data we don’t know is legal to have.</p><ul><li>the government made the data public. If it’s public, it’s not considered personal.</li><li>We need to be careful <strong>which source</strong> the data is coming from. If it’s not a direct source, we may be subject to different restrictions. Will third party aggregators have</li><li>With the bounty program we can mandate that they include certain proofs in their submission</li></ul><p><em>Whenever we decide on a property to collect, we need to justify it / provide rationale for the decision</em></p><ul><li><a href="https://www.lawfareblog.com/understanding-supreme-courts-carpenter-decision"><https://www.lawfareblog.com/understanding-supreme-courts-carpenter-decision></a></li></ul>                                                                                                                              |
| Are we collecting data aggregated by third parties?                                                                                                                                                                                                                                                                                                 | <ul><li>Each aggregator has their own format</li><li>Do we want to throw it away?</li><li>Discrepancies in the data aren’t necessarily “red flags”; agency’s maybe aggregating according to different criteria so there maybe discrepancies according to that unknown criteria</li><li>We do want aggregate level data for validating the record level data that is being returned to us</li><li>For now these are stored as <code>source\_type = “third\_party”</code></li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| Data types to prioritize                                                                                                                                                                                                                                                                                                                            | <ul><li>arrest reports</li><li>traffic stops</li><li>incident reports</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Scraper approval                                                                                                                                                                                                                                                                                                                                    | <ul><li>Multi-tier approval is possible with github</li><li>How does wikipedia do it?</li><li>Base scraper approval is most urgent</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| ETL framework                                                                                                                                                                                                                                                                                                                                       | <ul><li>Should extraction / ETL be a required part of a scraper?</li><li>Mitch Miller is using python to create a framework which should be flexible enough to meet our needs</li><li>ETL should not go directly into the scraper, but it does need to be closely related.</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |

## Action items <a href="#id-2021-03-27databaseworkingsession-actionitems" id="id-2021-03-27databaseworkingsession-actionitems"></a>

* [Former user (Deleted)](https://pdap.atlassian.net/wiki/people/5f8f95be40588b0077ed830a?ref=confluence) find out how much overhead (time) is involved in an end to end dolt scrape → public
* [Former user (Deleted)](https://pdap.atlassian.net/wiki/people/5f8f95be40588b0077ed830a?ref=confluence) add data format (NIBRS) column to dataset catalogue
* [Former user (Deleted)](https://pdap.atlassian.net/wiki/people/5f8f95be40588b0077ed830a?ref=confluence) Identify minimum data properties to meet NIBRS data format, has this somewhere already<https://pdap.atlassian.net/browse/PDAP-118>
* [Former user (Deleted)](https://pdap.atlassian.net/wiki/people/5f8f95be40588b0077ed830a?ref=confluence) Document data\_types priority [in scrapers readme](https://github.com/Police-Data-Accessibility-Project/Scrapers/pull/43). Arrest Reports, Traffic Stops, Incident reports. What’s most available. What most easily paints the full picture. Tiers. <https://pdap.atlassian.net/browse/PDAP-119>
* [Former user (Deleted)](https://pdap.atlassian.net/wiki/people/5f8f95be40588b0077ed830a?ref=confluence) validate workflow: **localized, raw data is stored in dolt → it could be aggregated / ETL’d to a centralized server.** dolt is the audit/transparency.
* [Former user (Deleted)](https://pdap.atlassian.net/wiki/people/5f8f95be40588b0077ed830a?ref=confluence) Expose basic roadmap in documentation with backup DB → API
* [Josh Chamberlain](https://pdap.atlassian.net/wiki/people/6068f9e790e3950069fbaaf4?ref=confluence) Draft a policy and rationale for “fields not to collect” → **A\&P**
* [Josh Chamberlain](https://pdap.atlassian.net/wiki/people/6068f9e790e3950069fbaaf4?ref=confluence) Draft a policy and rationale for mirroring dataset websites → **A\&P**
* [Former user (Deleted)](https://pdap.atlassian.net/wiki/people/5f8f95be40588b0077ed830a?ref=confluence) import dataset catalogue [form submissions](https://docs.google.com/spreadsheets/d/176f0pTxlIyvBWqXmJCPPlh2zRrcMZNgojB2fJOBBhGw/edit#gid=901781374) and deprecate form
* [Richard Ji](https://pdap.atlassian.net/wiki/people/5f8f95be0e068b00766b6903?ref=confluence) Get @stabs [base scraper](https://github.com/Police-Data-Accessibility-Project/Scrapers/pull/38/files) approved

**Mitch Miller** is making an ETL framework

<https://pdap.atlassian.net/browse/PDAP-113>

<https://pdap.atlassian.net/browse/PDAP-114>

## Decisions <a href="#id-2021-03-27databaseworkingsession-decisions" id="id-2021-03-27databaseworkingsession-decisions"></a>

* @stabs need to be recognized. Let’s be sure to celebrate their hard work


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.pdap.io/meta/operations/staff/meeting-minutes/project-home-2021-03-27-database-working-session.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
