January 5, 2023

Updates from PDAP

  • The new https://pdap.io front page points people to the new "request data" workflow.

    • We've been handling requests informally already. We'd like to get more of them, so we can better understand what people are looking for and make more of an impact.

  • We're making good progress on turning our Airtable tools into a home-grown app. There's now a public repo which mirrors our Data Sources database as CSV and JSON.

  • We're working on ways to identify URLs en masse.

Call notes

  • linklabel regex script: https://github.com/Police-Data-Accessibility-Project/data-source-identification/pull/1

    • written in C++; scans large numbers of URLs for keywords

    • early bottleneck: the Common Crawl URL database is ~4 TB

    • to do:

      • publish the script + regex library

      • generate lists of URLs

      • settle on a good list of regex keywords in advance

      • crunch through the URLs

      • ???

  • Elasticsearch

  • Common Crawl storage

    • craeft offered to spin up a 4-5 TB Linux server to hold large numbers of URLs

    • goal: people can get batches of URLs off the server
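
A rough sketch of the two ideas in the call notes, keyword-scanning URLs with regexes and handing URLs out in batches. It is written in Python for readability, not the C++ of the actual script, and it is not taken from the linked PR. The patterns, labels, and names (KEYWORD_PATTERNS, label_url, labeled_urls, batches) are all hypothetical placeholders.

```python
from __future__ import annotations

import re
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical labels and patterns, for illustration only.
# A real keyword list would be agreed on in advance (see the to-do list).
KEYWORD_PATTERNS = {
    "police": re.compile(r"police|sheriff", re.IGNORECASE),
    "records": re.compile(r"records?|reports?", re.IGNORECASE),
}


def label_url(url: str) -> list[str]:
    """Return every label whose pattern matches somewhere in the URL."""
    return [
        label
        for label, pattern in KEYWORD_PATTERNS.items()
        if pattern.search(url)
    ]


def labeled_urls(urls: Iterable[str]) -> Iterator[tuple[str, list[str]]]:
    """Lazily yield (url, labels) for URLs that match at least one pattern.

    Taking any iterable (for example an open file) means a very large
    URL list never has to be loaded into memory all at once.
    """
    for raw in urls:
        url = raw.strip()
        labels = label_url(url)
        if labels:
            yield url, labels


def batches(items: Iterable, size: int) -> Iterator[list]:
    """Yield lists of up to `size` items until the input runs out."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch
```

Because both generators are lazy, something like batches(labeled_urls(open("urls.txt")), 1000) would stream a URL file in chunks of 1,000 matches; the file name and batch size here are placeholders too.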
