2021-03-27 database working session


Date

27 Mar 2021

Participants

  • Mitch Miller

  • Kristin Tynski

  • Jeff Jockisch

Goals

Discussion topics

dolt

    • does not yet include sources from @stabs or form submissions

    • still need to remove the form and point people to dolthub, if this proof of concept looks good

  • why is dolt better than a daily backup?

  • lowers barrier to entry → can work without much automation or maintenance.

  • documents chain of custody.

  • overhead—how much time do people spend managing commits and PRs vs. actually scraping?

  • ease of use for contributors is a primary objective if we’re going to scale

  • scale—we’re building a unique database of databases in addition to scraped data

  • free

  • dolt can be used to identify & track sources → the translated data could then be sent to a centralized server

  • bounties only work for data, not scrapers/infrastructure.

  • bounties allow people to contribute with minimal social engagement.

  • Using Dolt as a POC for storing data in a version controlled fashion

  • How do we scale this?

  • Resolving version control may slow things down

  • Automatic scrapers pushing into a branch to be merged to master could become a bottleneck; not applicable at the moment

  • Edit history is tracked

  • Why dolt instead of daily backups? Or our own history tables?

  • Dolt can serve as a little more than an audit log for data sources

  • Do we need a global table for unique_data_properties, with SQL indexes on it based on data_types? Or are we better off creating tables for each data_type? How do we decide? (see the schema sketch after this list)

  • We should identify which data points are most consistently available across data sources → this gives us targets to hit.

  • metadata fields always need to be added in data discovery

    • parent-child relationship between types and properties

    • the query will be ugly, but this is how it's best designed for a large codebase

  • mongodb may work as a metadata repository

    • we’d need something to talk to both

  • Define the next steps for keeping structure parity between Datasets, Data, and Scrapers

    • How should these structures relate to one another?

  • Scrapers need a config file for where they’re sending data

  • Dolt repos may need a path back to the scraper

  • dolthub ≠ where people provide scraper code

  • Tiered approach for data properties: 1. NIBRS 2. Store CSV…anything in between?

  • We can decide which tiers we’re actively ingesting based on how often they’re available.
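
A minimal sketch of the two schema options being weighed, assuming hypothetical table and column names (none of this reflects the actual PDAP schema): Option A keeps one global unique_data_properties table with an index on data_type, while Option B gives each data_type its own table and encodes the parent-child relationship in the schema itself.

```python
# Illustrative only; neither option below is the real PDAP schema.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option A: one global properties table shared by every data_type,
# with a SQL index on data_type so per-type queries stay fast.
cur.executescript("""
CREATE TABLE unique_data_properties (
    id        INTEGER PRIMARY KEY,
    data_type TEXT NOT NULL,   -- e.g. 'arrest_report', 'traffic_stop'
    property  TEXT NOT NULL,   -- e.g. 'offense_code', 'stop_reason'
    value     TEXT
);
CREATE INDEX idx_properties_by_type ON unique_data_properties (data_type);
""")

# Option B: one table per data_type, so each type can carry its own columns.
cur.executescript("""
CREATE TABLE arrest_reports (id INTEGER PRIMARY KEY, offense_code TEXT, arrest_date TEXT);
CREATE TABLE traffic_stops  (id INTEGER PRIMARY KEY, stop_reason TEXT, stop_date TEXT);
""")

# Option A answers "everything for one type" with a filtered query on the
# shared table; Option B answers it with a plain SELECT on that type's table.
cur.execute(
    "SELECT property, value FROM unique_data_properties WHERE data_type = ?",
    ("arrest_report",),
)
```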

Enterprise features / structure down the road

  • Risk: github or dolthub fails

  • We’ll want a backup DB, then an API.

  • Sooner rather than later we need to back up our data for better disaster recovery.

Documentation: Read the Docs vs. Confluence

  • Richard is writing documentation for backend design & workflows

  • it’s in markdown

  • Confluence is a stopgap for collaborating and getting on the same page

  • What does implementation look like?

    • Does it work for open source?

People bringing drop-in scrapers

  • Scrapers (the humans) often bring their own scraper library with them → we should enable them to slot in to the extent it’s possible

  • Messaging scrapers: what are we targeting?

    • We need to make sure we’re clear about the breadth / scope of data.

  • Fragmenting the target makes it hard for us to progress. We want to focus on NIBRS format data to validate it against what’s available from the FBI & broaden context

data to not collect

At one point this was our list; is it accurate? (A filtering sketch follows this topic.)

  • CaseNum

  • FirstName

  • MiddleName

  • LastName

  • DOB

  • DefenseAttorney

  • PublicDefender

  • Judge

  • ArrestingOfficer

  • ArrestingOfficerBadgeNumber

Consider: not collecting data we don’t know is legal to have.

  • the government made the data public. If it’s public, it’s not considered personal.

  • We need to be careful which source the data is coming from. If it's not a direct source, we may be subject to different restrictions. Will third-party aggregators have…

  • With the bounty program we can mandate that they include certain proofs in their submission

Whenever we decide on a property to collect, we need to justify it / provide rationale for the decision
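
A rough sketch of how a scraper could honor the draft list above before anything is stored; the field names come straight from that list, while the helper function and record shape are hypothetical.

```python
# Hypothetical helper: drop "do not collect" fields before a record is stored.
DO_NOT_COLLECT = {
    "CaseNum", "FirstName", "MiddleName", "LastName", "DOB",
    "DefenseAttorney", "PublicDefender", "Judge",
    "ArrestingOfficer", "ArrestingOfficerBadgeNumber",
}

def strip_restricted_fields(record: dict) -> dict:
    """Return a copy of a scraped record with restricted fields removed."""
    return {k: v for k, v in record.items() if k not in DO_NOT_COLLECT}

# Example: only the non-restricted columns survive.
row = {"CaseNum": "2021-00123", "Judge": "Doe", "OffenseCode": "13A", "ArrestDate": "2021-03-01"}
print(strip_restricted_fields(row))  # {'OffenseCode': '13A', 'ArrestDate': '2021-03-01'}
```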

Are we collecting data aggregated by third parties?

  • Each aggregator has their own format

  • Do we want to throw it away?

  • Discrepancies in the data aren't necessarily "red flags"; agencies may be aggregating according to different criteria, so there may be discrepancies driven by those unknown criteria

  • We do want aggregate-level data for validating the record-level data that is being returned to us

  • For now these are stored as source_type = “third_party”
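
A small, hypothetical sketch of how that could look in practice: aggregator rows are tagged with source_type = "third_party" and used only to cross-check record-level counts, with mismatches flagged for review rather than treated as errors.

```python
# Hypothetical example; field names and record shapes are illustrative.
records = [
    {"source_type": "direct", "agency": "Example PD", "year": 2020},
    {"source_type": "direct", "agency": "Example PD", "year": 2020},
]
aggregates = [
    {"source_type": "third_party", "agency": "Example PD", "year": 2020, "reported_count": 3},
]

record_count = sum(1 for r in records if r["agency"] == "Example PD" and r["year"] == 2020)
reported = aggregates[0]["reported_count"]

# A mismatch isn't automatically a red flag: the aggregator may have used
# different (unknown) criteria, so flag it for review instead of discarding it.
if record_count != reported:
    print(f"review: {record_count} record-level rows vs {reported} in the aggregate")
```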

Data types to prioritize

  • arrest reports

  • traffic stops

  • incident reports

Scraper approval

  • Multi-tier approval is possible with GitHub

  • How does Wikipedia do it?

  • Base scraper approval is most urgent

ETL framework

  • Should extraction / ETL be a required part of a scraper?

  • Mitch Miller is using Python to create a framework, which should be flexible enough to meet our needs

  • ETL should not go directly into the scraper, but it does need to be closely related (see the sketch below).
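
The sketch below is not the framework mentioned above; it is only a minimal illustration of keeping ETL separate from, but closely tied to, a scraper: the scraper yields raw rows untouched, and distinct transform/load steps handle normalization (for example toward a NIBRS-like shape).

```python
# Illustrative sketch only; field names and the target shape are assumptions.
from typing import Iterable

def scrape() -> Iterable[dict]:
    """Scraper's only job: fetch raw rows and yield them unmodified."""
    yield {"offense": "THEFT", "date": "03/01/2021"}

def transform(row: dict) -> dict:
    """ETL step: normalize field names and formats toward a shared schema."""
    month, day, year = row["date"].split("/")
    return {
        "offense_description": row["offense"].title(),
        "incident_date": f"{year}-{month}-{day}",
    }

def load(rows: Iterable[dict]) -> None:
    """Load step: write normalized rows to storage (printed here as a stand-in)."""
    for row in rows:
        print(row)

if __name__ == "__main__":
    load(transform(r) for r in scrape())
    # {'offense_description': 'Theft', 'incident_date': '2021-03-01'}
```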

Action items

Mitch Miller is making an ETL framework

Decisions

  • @stabs needs to be recognized. Let's be sure to celebrate their hard work

Review the DoltHub proof of concept for

find out how much overhead (time) is involved in an end-to-end dolt scrape → public (a rough sketch follows this list)

add data format (NIBRS) column to dataset catalogue

Identify minimum data properties to meet the NIBRS data format; this may exist somewhere already

Document data_types priority: arrest reports, traffic stops, incident reports. What's most available? What most easily paints the full picture? Tiers.

validate workflow: localized, raw data is stored in dolt → it could be aggregated / ETL'd to a centralized server. dolt is the audit/transparency layer.

Expose basic roadmap in documentation with backup DB → API

Draft a policy and rationale for “fields not to collect” → A&P

Draft a policy and rationale for mirroring dataset websites → A&P

import dataset catalogue and deprecate form

Get @stabs approved
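
To make the "end-to-end dolt scrape → public" overhead question above concrete, here is a rough sketch of the steps such a run might involve: scrape locally, write a CSV, import and commit it in a Dolt repo, and push to DoltHub. The repo, table, and file names are hypothetical, and the exact CLI invocations should be verified against current Dolt documentation.

```python
# Rough timing sketch; table/file/remote names are assumptions, not PDAP's setup.
import csv
import subprocess
import time

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

start = time.time()

# 1. Scrape (stubbed here) and write the raw rows to a CSV.
rows = [{"offense_code": "13A", "incident_date": "2021-03-01"}]
with open("incident_reports.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["offense_code", "incident_date"])
    writer.writeheader()
    writer.writerows(rows)

# 2. Import into the local Dolt repo and commit, so edit history / chain of
#    custody is recorded by Dolt itself.
run("dolt", "table", "import", "-u", "incident_reports", "incident_reports.csv")
run("dolt", "add", ".")
run("dolt", "commit", "-m", "Add incident reports scrape, 2021-03-27")

# 3. Push to the public DoltHub remote (assumes a remote named 'origin').
run("dolt", "push", "origin", "master")

print(f"end-to-end time: {time.time() - start:.1f}s")
```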
