January 5, 2023
- The new https://pdap.io front page points people to the new "request data" workflow.
- We've been doing this informally already. We'd like to get more requests so we can better understand what people are looking for and make more of an impact.
- We're working on ways to identify URLs en masse.
  - A volunteer wrote a sitemap scraper, which locates potentially useful URLs given a list. It's an open PR still under review, and we want to get it merged soon: https://github.com/Police-Data-Accessibility-Project/scrapers/pull/195/files
  - We did a Doccano labeling exercise to try to train a machine learning model to identify content based on its URL. We still need to experiment to close the loop on this.
- linklabel regex script: https://github.com/Police-Data-Accessibility-Project/data-source-identification/pull/1
  - (C++) scans massive numbers of URLs for keywords
  - early bottleneck: the Common Crawl URL database is ~4TB
  - to do:
    - publish the script + regex library
    - generate lists of URLs
    - settle on a good list of regex keywords in advance
    - crunch through the URLs
- Common Crawl storage
  - craeft offered to spin up a 4-5TB Linux server to hold large numbers of URLs
  - goal: people can get batches of URLs off the server
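
To make the "regex keywords, then batches" idea above concrete, here is a rough Python sketch. It is not the C++ linklabel script from the PR; it only illustrates the general approach of labeling URLs by keyword and handing them out in batches. The patterns in `KEYWORD_PATTERNS`, the function names `label_urls` and `batches`, and the example URLs are all placeholders made up for this sketch, since the real regex library hasn't been published yet.

```python
import re
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical keyword patterns, for illustration only.
# The real regex library is still on the to-do list above.
KEYWORD_PATTERNS = {
    "arrests": re.compile(r"arrest", re.IGNORECASE),
    "incident_reports": re.compile(r"incident[-_]?reports?", re.IGNORECASE),
    "use_of_force": re.compile(r"use[-_]?of[-_]?force", re.IGNORECASE),
}


def label_urls(urls: Iterable[str]) -> Iterator[tuple[str, list[str]]]:
    """Yield (url, labels) for every URL matching at least one pattern."""
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the input list
        labels = [
            name
            for name, pattern in KEYWORD_PATTERNS.items()
            if pattern.search(url)
        ]
        if labels:
            yield url, labels


def batches(items: Iterable, size: int) -> Iterator[list]:
    """Group any iterable into lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


if __name__ == "__main__":
    # Made-up example URLs; a real run would stream lines from a URL dump.
    sample = [
        "https://example.gov/police/arrest-log.csv",
        "https://example.gov/parks/events",
        "https://example.gov/pd/Use-of-Force-2022.pdf",
    ]
    for batch in batches(label_urls(sample), size=2):
        print(batch)
```

Both functions are generators, so the URL list never has to fit in memory at once, which matters when the input is a multi-terabyte dump. In practice `urls` could be an open file handle, since iterating a file yields one line at a time.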