January 5, 2023

Updates from PDAP

  • The new front page points people to the new "request data" workflow.
    • We've been doing this informally already. We'd like to get more requests so we can better understand what people are looking for and make more of an impact.
  • We're making good progress on turning our Airtable tools into a home-grown app. There's now a public repo which mirrors our Data Sources database as CSV and JSON.
  • We're working on ways to identify URLs en masse.

Call notes

    • (C++) scans very large numbers of URLs for keywords
    • early bottleneck: the Common Crawl URL database is ~4 TB
    • to do:
      • publish the script + regex library
      • generate lists of URLs
      • agree on a good list of regex keywords in advance
      • crunch through the URLs
      • ???
  • Elasticsearch
  • Common Crawl storage
    • craeft offered to spin up a 4-5 TB Linux server to hold large numbers of URLs
    • goal: people can get batches of URLs off the server
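
A rough sketch of the "crunch through the URLs" and "batches of URLs" ideas above, written in Python for readability. This is not the C++ script discussed on the call. The keyword patterns and the function names (compile_keywords, matching_urls, batches) are hypothetical placeholders, since the real keyword list has not been agreed on yet. It streams URLs one at a time, so the full list never needs to fit in memory.

```python
import re
from itertools import islice
from typing import Iterable, Iterator, List

# Hypothetical keyword patterns, for illustration only.
# The real regex keyword list is still a to-do item.
KEYWORD_PATTERNS = [r"police", r"sheriff", r"court", r"records?"]


def compile_keywords(patterns: Iterable[str]) -> "re.Pattern[str]":
    """Combine keyword patterns into one case-insensitive regex."""
    combined = "|".join(f"(?:{p})" for p in patterns)
    return re.compile(combined, re.IGNORECASE)


def matching_urls(urls: Iterable[str], keyword_re: "re.Pattern[str]") -> Iterator[str]:
    """Yield each non-empty URL that contains at least one keyword."""
    for url in urls:
        url = url.strip()  # drop trailing newlines from file input
        if url and keyword_re.search(url):
            yield url


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group an iterable into lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Because an open text file is itself an iterable of lines, something like `batches(matching_urls(open("urls.txt"), compile_keywords(KEYWORD_PATTERNS)), 1000)` would hand out matching URLs 1,000 at a time (the file name here is also just an example).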