Police Data Access Point Docs

January 5, 2023

Updates from PDAP

  • The new front page (https://pdap.io) points people to the new "request data" workflow.

    • We've been doing this informally already. We'd like to get more requests so we can better understand the kinds of things people are looking for and make more of an impact.

  • We're making good progress on turning our Airtable tools into a home-grown app. There's now a public repo which mirrors our Data Sources database as CSV and JSON.

  • We're working on ways to identify URLs en masse.

    • GitHub issue: https://github.com/Police-Data-Accessibility-Project/planning/issues/196

    • A volunteer wrote a sitemap scraper, which locates potentially useful URLs given a list. It's an open PR still under review. We want to get it merged soon: https://github.com/Police-Data-Accessibility-Project/scrapers/pull/195/files

    • We did a Doccano labeling exercise in an attempt to train a machine learning algorithm to identify content based on URL. We still need to experiment in order to close the loop on this.
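
As a rough illustration of what a sitemap scraper does (this is not the code from the PR under review), the sketch below parses a sitemap XML document and keeps the URLs that contain one of a few keywords. The function name, the keyword list, and the sample sitemap are all hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical keywords; a real list would need to be agreed on in advance.
KEYWORDS = ("police", "arrest", "incident")


def urls_from_sitemap(xml_text, keywords=KEYWORDS):
    """Return <loc> URLs from a sitemap whose text contains any keyword."""
    root = ET.fromstring(xml_text)
    found = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if any(k in url.lower() for k in keywords):
            found.append(url)
    return found


# Made-up sitemap used only to exercise the function.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.gov/police/arrest-logs</loc></url>
  <url><loc>https://example.gov/parks/events</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(urls_from_sitemap(SAMPLE))
```

Plain substring matching keeps the sketch short; a real tool would likely fetch sitemaps over HTTP and follow sitemap index files as well.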

Call notes

  • linklabel regex script: https://github.com/Police-Data-Accessibility-Project/data-source-identification/pull/1

    • (C++) scans massive amounts of URLs for keywords

    • early bottleneck: the Common Crawl URL database is ~4 TB

    • to do:

      • publish the script + regex library

      • generate lists of URLs

      • get a good list of regex keywords set in advance

      • crunch through the URLs

      • ???

  • Elasticsearch

    • Supports stemming

  • Common Crawl storage

    • craeft offered to spin up a 4-5 TB Linux server to hold large numbers of URLs

    • goal: people can get batches of URLs off the server
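
To make the ideas in these notes concrete, here is a minimal Python sketch of keyword-based URL labeling done in batches. It is not the linklabel script itself (which is written in C++), and the label names, regex patterns, batch size, and function names are all assumptions made up for illustration.

```python
import re
from itertools import islice

# Hypothetical label -> pattern table; the real regex library is not shown in these notes.
PATTERNS = {
    "arrests": re.compile(r"arrest|booking", re.IGNORECASE),
    "incidents": re.compile(r"incident|crime[-_]?report", re.IGNORECASE),
}


def batches(urls, size):
    """Yield lists of up to `size` items from any iterable without loading it all."""
    it = iter(urls)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def label_url(url):
    """Return the sorted labels whose pattern matches anywhere in the URL."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(url))


def label_batches(urls, size=2):
    """Scan URLs batch by batch and yield (url, labels) for URLs with a match."""
    for chunk in batches(urls, size):
        for url in chunk:
            labels = label_url(url)
            if labels:
                yield url, labels


if __name__ == "__main__":
    sample = [
        "https://example.gov/police/arrest-log",
        "https://example.gov/parks",
        "https://example.gov/crime-report/2022",
    ]
    for url, labels in label_batches(sample):
        print(url, labels)
```

Because `batches` accepts any iterable, the same loop could read from a file handle one line at a time, which matters when the full URL list is far too large to hold in memory.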

https://pdap.io
public repo
https://github.com/Police-Data-Accessibility-Project/planning/issues/196
https://github.com/Police-Data-Accessibility-Project/scrapers/pull/195/files
https://github.com/Police-Data-Accessibility-Project/data-source-identification/pull/1
stemming