Police Data Access Point Docs
2021-03-27 database working session

Date

27 Mar 2021

Participants

  • Eddie Brown

  • Alec Akin

  • Mitch Miller

  • Josh Lintag

  • Kristin Tynski

  • Jeff Jockisch

  • Richard Ji

  • Former user (Deleted)

Goals

Discussion topics

Item

Notes

dolt

    • does not yet include sources from @stabs or form submissions

    • still need to remove the form and point people to dolthub, if this proof of concept looks good

  • why is dolt better than a daily backup?

  • lowers barrier to entry → can work without much automation or maintenance.

  • documents chain of custody.

  • overhead—how much time do people spend managing commits and PRs vs. actually scraping?

  • ease of use for contributors is a primary objective if we’re going to scale

  • scale—we’re building a unique database of databases in addition to scraped data

  • free

  • dolt can be used to identify & track sources → the translated data could be sent to

  • bounties only work for data, not scrapers/infrastructure.

  • bounties allow people to contribute with minimal social engagement.

  • Using Dolt as a POC for storing data in a version controlled fashion

  • How do we scale this?

  • Resolving version control may slow things down

  • Automatic scrapers pushing into a branch to merge to master becomes a bottleneck; not applicable atm

  • Edit history is tracked

  • Why dolt instead of daily backups? Or our own history tables?

  • Dolt can be used as a little more than an audit log for data sources

  • Do we need a global table for unique_data_properties → we make SQL INDEXes of that based on data_types? Or are we better off creating tables for each data_type? How do we decide?

  • We should identify which data points are most consistently available across data sources → this gives us targets to hit.

  • metadata fields always need to be added in data discovery

    • parent child relationship between types and properties

    • the query will be ugly but this is how it’s best designed for large code

  • mongodb may work as a metadata repository

    • we’d need something to talk to both

  • Define the next steps for keeping structure parity between Datasets, Data, and Scrapers

    • How should these structures relate to one another?

  • Scrapers need a config file for where they’re sending data

  • Dolt repos may need a path back to the scraper

  • dolthub ≠ where people provide scraper code

  • Tiered approach for data properties: 1. NIBRS 2. Store CSV…anything in between?

  • We can decide which tiers we’re actively ingesting based on how often they’re available.
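The schema question above (one global `unique_data_properties` table indexed by data type, with a parent-child relationship between types and properties) can be sketched roughly as follows. This is only an illustration under assumptions, not a design the group agreed on: the `data_types` table, the index name, and the example property names (`arrest_date`, `charge`, `stop_date`) are hypothetical, and SQLite stands in for whatever database is eventually chosen.

```python
import sqlite3

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_types (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- One global table of properties; each property is a child of a data type.
CREATE TABLE unique_data_properties (
    id           INTEGER PRIMARY KEY,
    data_type_id INTEGER NOT NULL REFERENCES data_types(id),
    name         TEXT NOT NULL
);
-- Index so that lookups by data type do not scan the whole table.
CREATE INDEX idx_properties_by_type
    ON unique_data_properties (data_type_id);
""")

conn.executemany(
    "INSERT INTO data_types (name) VALUES (?)",
    [("arrest_reports",), ("traffic_stops",)],
)
# Hypothetical properties, attached to their parent type by name.
conn.executemany(
    "INSERT INTO unique_data_properties (data_type_id, name) "
    "SELECT id, ? FROM data_types WHERE name = ?",
    [
        ("arrest_date", "arrest_reports"),
        ("charge", "arrest_reports"),
        ("stop_date", "traffic_stops"),
    ],
)


def properties_for(conn, data_type):
    """Return the property names recorded for one data type, sorted."""
    rows = conn.execute(
        "SELECT p.name FROM unique_data_properties AS p "
        "JOIN data_types AS t ON t.id = p.data_type_id "
        "WHERE t.name = ? ORDER BY p.name",
        (data_type,),
    ).fetchall()
    return [row[0] for row in rows]


print(properties_for(conn, "arrest_reports"))
```

The alternative raised in the notes, one table per data type, would avoid the join but would require a schema change each time a new data type is added.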

Enterprise features / structure down the road

  • Risk: github or dolthub fails

  • We’ll want a backup DB, then an API.

  • Sooner rather than later we need to back up our data for better disaster recovery.

Documentation: Read the Docs vs. Confluence

  • Richard is writing documentation for backend design & workflows

  • it’s in markdown

  • Confluence is a stopgap for collaborating and getting on the same page

  • What does implementation look like?

    • Does it work for open source?

People bringing drop-in scrapers

  • Scrapers (the humans) often bring their own scraper library with them → we should enable them to slot in to the extent it’s possible

  • Messaging scrapers: what are we targeting?

    • We need to make sure we’re clear about the breadth / scope of data.

  • Fragmenting the target makes it hard for us to progress. We want to focus on NIBRS format data to validate it against what’s available from the FBI & broaden context

data to not collect

At one point this was our list. Is it accurate?

  • CaseNum

  • FirstName

  • MiddleName

  • LastName

  • DOB

  • DefenseAttorney

  • PublicDefender

  • Judge

  • ArrestingOfficer

  • ArrestingOfficerBadgeNumber

Consider: not collecting data we don’t know is legal to have.

  • the government made the data public. If it’s public, it’s not considered personal.

  • We need to be careful which source the data is coming from. If it’s not a direct source, we may be subject to different restrictions. Will third party aggregators have

  • With the bounty program we can mandate that they include certain proofs in their submission

Whenever we decide on a property to collect, we need to justify it / provide rationale for the decision
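If the field list above were adopted, a scraper or ETL step could drop those fields before storing anything. The following is a minimal sketch under that assumption; the list was still under review at this meeting, and the function name and the example record keys `Charge` and `ArrestDate` are hypothetical.

```python
# Field names copied from the "data to not collect" list in these minutes.
# The list was still being questioned, so treat it as provisional.
EXCLUDED_FIELDS = frozenset({
    "CaseNum",
    "FirstName",
    "MiddleName",
    "LastName",
    "DOB",
    "DefenseAttorney",
    "PublicDefender",
    "Judge",
    "ArrestingOfficer",
    "ArrestingOfficerBadgeNumber",
})


def strip_excluded_fields(record):
    """Return a copy of the record without any field on the exclusion list."""
    return {key: value for key, value in record.items()
            if key not in EXCLUDED_FIELDS}


# Hypothetical scraped record, for illustration only.
raw_record = {
    "CaseNum": "2021-000123",
    "LastName": "Doe",
    "Charge": "Trespassing",
    "ArrestDate": "2021-03-27",
}
print(strip_excluded_fields(raw_record))
```

Keeping the exclusions in one named constant also gives a single place to record the rationale for each field, as the notes ask.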

Are we collecting data aggregated by third parties?

  • Each aggregator has their own format

  • Do we want to throw it away?

  • Discrepancies in the data aren’t necessarily “red flags”; agencies may be aggregating according to different criteria, so there may be discrepancies that stem from those unknown criteria

  • We do want aggregate level data for validating the record level data that is being returned to us

  • For now these are stored as source_type = “third_party”
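Using aggregate-level data to validate record-level data, as described above, could look roughly like this. It is a sketch under assumptions: the function name, the `offense` key, and the sample numbers are hypothetical. Differences are reported for review rather than treated as errors, since the aggregator's criteria may be unknown.

```python
from collections import Counter


def compare_to_aggregate(records, aggregate_counts, key="offense"):
    """Return {category: record_count - aggregate_count} where the two differ."""
    record_counts = Counter(record[key] for record in records)
    categories = sorted(set(record_counts) | set(aggregate_counts))
    differences = {}
    for category in categories:
        ours = record_counts.get(category, 0)
        theirs = aggregate_counts.get(category, 0)
        if ours != theirs:
            differences[category] = ours - theirs
    return differences


# Hypothetical record-level rows and third-party aggregate totals.
records = [
    {"offense": "theft"},
    {"offense": "theft"},
    {"offense": "assault"},
]
aggregate = {"theft": 2, "assault": 3}
print(compare_to_aggregate(records, aggregate))
```

A negative difference means the scraped records contain fewer entries for that category than the aggregate reports.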

Data types to prioritize

  • arrest reports

  • traffic stops

  • incident reports

Scraper approval

  • Multi-tier approval is possible with GitHub

  • How does Wikipedia do it?

  • Base scraper approval is most urgent

ETL framework

  • Should extraction / ETL be a required part of a scraper?

  • Mitch Miller is using python to create a framework which should be flexible enough to meet our needs

  • ETL should not go directly into the scraper, but it does need to be closely related.
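One way to keep ETL out of the scraper while keeping the two closely related is to have them share only a simple row format. The sketch below illustrates that idea and nothing more: it is not Mitch Miller's framework, and every name in it (`scrape`, `transform`, `run`, the column names, the date format) is an assumption made for the example.

```python
from datetime import datetime


def scrape():
    """Stand-in for a contributor's scraper: returns rows as found at the source."""
    return [{"Arrest Date": "03/27/2021", "Charge": " Trespassing "}]


def transform(rows):
    """ETL step kept outside the scraper: normalize field names and values."""
    cleaned = []
    for row in rows:
        arrest_date = datetime.strptime(row["Arrest Date"], "%m/%d/%Y").date()
        cleaned.append({
            "arrest_date": arrest_date.isoformat(),
            "charge": row["Charge"].strip(),
        })
    return cleaned


def run(scraper, etl):
    """Run a scraper, then its ETL step, returning both raw and cleaned rows."""
    raw = list(scraper())
    return raw, etl(raw)


raw_rows, clean_rows = run(scrape, transform)
print(clean_rows)
```

Returning the raw rows alongside the cleaned ones leaves the raw data available to be stored unchanged, with the transformed copy sent elsewhere.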

Action items

  • Former user (Deleted) find out how much overhead (time) is involved in an end to end dolt scrape → public

  • Former user (Deleted) add data format (NIBRS) column to dataset catalogue

  • Former user (Deleted) Identify minimum data properties to meet NIBRS data format, has this somewhere already https://pdap.atlassian.net/browse/PDAP-118

  • Former user (Deleted) Document data_types priority in scrapers readme. Arrest Reports, Traffic Stops, Incident reports. What’s most available. What most easily paints the full picture. Tiers. https://pdap.atlassian.net/browse/PDAP-119

  • Former user (Deleted) validate workflow: localized, raw data is stored in dolt → it could be aggregated / ETL’d to a centralized server. dolt is the audit/transparency.

  • Former user (Deleted) Expose basic roadmap in documentation with backup DB → API

  • Josh Chamberlain Draft a policy and rationale for “fields not to collect” → A&P

  • Josh Chamberlain Draft a policy and rationale for mirroring dataset websites → A&P

  • Former user (Deleted) import dataset catalogue form submissions and deprecate form

  • Richard Ji Get @stabs base scraper approved

Mitch Miller is making an ETL framework

https://pdap.atlassian.net/browse/PDAP-113

https://pdap.atlassian.net/browse/PDAP-114

Decisions

  • @stabs needs to be recognized. Let’s be sure to celebrate their hard work


Review the DoltHub proof of concept for the dataset catalogue
https://readthedocs.org/
https://www.lawfareblog.com/understanding-supreme-courts-carpenter-decision