January 5, 2023
The new front page points people to the new "request data" workflow.
We've been doing this informally already; we'd like to get more requests so we can better understand the kinds of things people are looking for and make more of an impact.
We're making good progress on turning our Airtable tools into a home-grown app. There's now a mirror of our Data Sources database as CSV and JSON.
We're working on ways to identify URLs en masse.
GitHub issue:
A volunteer wrote a sitemap scraper that locates potentially useful URLs given a list. The PR is still open and under review; we want to get it merged soon:
We ran a Doccano labeling exercise in an attempt to train a machine learning model to identify content based on URL. We still need to experiment further to close the loop on this.
linklabel regex script:
(C++) scans massive numbers of URLs for keywords
early bottleneck: the Common Crawl URL database is ~4TB
to do:
publish the script + regex library
generate lists of URLs
get a good list of regex keywords set in advance
crunch through the URLs
???
Elasticsearch
Supports
Common Crawl storage
craeft offered to spin up a 4-5TB Linux server to hold massive numbers of URLs
goal: people can get batches of URLs off the server