January 5, 2023
Updates from PDAP
The new https://pdap.io front page points people to the new "request data" workflow.
We've been doing this informally already, and we'd like to get more requests so we can better understand the kinds of things people are looking for and make more of an impact.
We're making good progress on turning our Airtable tools into a home-grown app. There's now a public repo which mirrors our Data Sources database as CSV and JSON.
We're working on ways to identify URLs en masse.
A volunteer wrote a sitemap scraper, which locates potentially useful URLs given a list. It's an open PR still under review. We want to get it merged soon: https://github.com/Police-Data-Accessibility-Project/scrapers/pull/195/files
We did a Doccano labeling exercise in an attempt to train a machine learning algorithm to identify content based on URL. We still need to experiment in order to close the loop on this.
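To give a rough idea of what a sitemap scraper does, here is a minimal Python sketch. It is not the code from the PR above. It assumes the sitemap follows the sitemaps.org XML format, and the keyword list, function names, and example.gov URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical keywords, chosen for illustration only.
KEYWORDS = ("police", "arrest", "incident", "crime")


def extract_sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def filter_urls(urls, keywords=KEYWORDS):
    """Keep URLs containing any keyword (case-insensitive substring match)."""
    return [u for u in urls if any(k in u.lower() for k in keywords)]


# Made-up sitemap content so the sketch runs without network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.gov/police/daily-arrest-log</loc></url>
  <url><loc>https://example.gov/parks/events</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(filter_urls(extract_sitemap_urls(SAMPLE)))
```

In practice the XML would be downloaded from each site in the input list (for example from its `/sitemap.xml`, if it has one) rather than read from a string.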
Call notes
linklabel regex script: https://github.com/Police-Data-Accessibility-Project/data-source-identification/pull/1
(C++) scans massive amounts of URLs for keywords
early bottleneck: commoncrawl URL database is ~4TB
to do:
publish the script + regex library
generate lists of URLs
get a good list of regex keywords set in advance
crunch through the URLs
???
Elasticsearch
Supports stemming
Commoncrawl storage
craeft offered to spin up a 4-5TB Linux server to hold large numbers of URLs
goal: people can get batches of URLs off the server
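The "crunch through the URLs" and "batches of URLs" ideas in these notes could look something like the Python sketch below. The actual linklabel script is C++ and may work quite differently. The pattern library, labels, function names, and sample URLs here are placeholders, since the real keyword list is still a to-do item.

```python
import re
from itertools import islice

# Hypothetical label -> pattern library, for illustration only.
PATTERNS = {
    "arrests": re.compile(r"arrest|booking", re.IGNORECASE),
    "incidents": re.compile(r"incident|call[-_]?log", re.IGNORECASE),
}


def label_urls(lines, patterns=PATTERNS):
    """Yield (url, labels) for each URL that matches at least one pattern."""
    for line in lines:
        url = line.strip()
        if not url:
            continue
        labels = [name for name, rx in patterns.items() if rx.search(url)]
        if labels:
            yield url, labels


def batched(iterable, size):
    """Yield lists of up to `size` items without loading everything at once."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Made-up input; a real run would pass an open file of URLs instead.
sample = [
    "https://example.gov/sheriff/arrest-reports\n",
    "https://example.gov/parks\n",
    "https://example.gov/pd/call_log/2023\n",
]

if __name__ == "__main__":
    for batch in batched(label_urls(sample), 2):
        print(batch)
```

Both functions are generators, so passing an open file as `lines` reads one URL at a time. That matters when the input is a multi-terabyte list that cannot fit in memory.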