Web scraping

Our approach to web scraping

Which data are we scraping?

Public records about the police system, which can include sources from police, courts, and jails.

Contributing Philosophy

Scraping can turn cumbersome records into useful data. When someone wants to use records but they're in a difficult format, scraping is often the answer.

Typically, it's only worth writing a scraper if you have a use case for the data already, and you can't easily download what you need. What question are you trying to answer?
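
As a concrete example, here is a minimal sketch of that kind of extraction in Python, assuming a hypothetical county site that publishes a daily incident log as an HTML table (the URL, table layout, and column names are all illustrative, not a real agency page):

```python
# A minimal scraping sketch: turn an HTML table of records into a CSV.
# The URL and table structure are hypothetical -- adapt them to the page you
# actually care about, and check the site's terms and robots.txt first.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example-county.gov/police/daily-incident-log"  # hypothetical

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

table = soup.find("table")  # assumes the records live in the page's first <table>
rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

with open("incidents.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "incident_type", "location"])  # hypothetical columns
    writer.writerows(rows)

print(f"Wrote {len(rows)} rows to incidents.csv")
```

Starting from a specific page and a specific question keeps the scraper small and the output immediately usable.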

Where the community fits in

Our target users are the thousands of people already using police data. We can support their work by connecting the community of PDAP volunteer scrapers with real, impactful projects. If you don't have your own ideas about what to scrape, head to our volunteer page to see if any open requests catch your interest. You can also find local groups working on the criminal legal system.

Immediate goals

After hundreds of hours of user research, we have determined that these are the ways we will add value in the police data landscape:

  • Track independently scraped data in our database. Prevent duplication of effort by showing people what's already out there. To submit data you've scraped, start here.

  • Connect people with web scraping skills to community members trying to make better use of police data without technical expertise. Volunteer to respond to requests here.

  • Build open-source tools in the Scrapers repo (https://github.com/Police-Data-Accessibility-Project/scrapers) to make running a scraper on-demand easier for people who don't know what "CLI" means.

  • Scrape data sources and agency metadata via our Data Source Identification pipeline. Especially important are Data Sources with a record type of "List of Data Sources." A sketch of this kind of scrape appears below.
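
A minimal sketch of that kind of scrape, assuming a hypothetical state index page that links out to agency transparency portals (the URL and CSS selector are illustrative, and this is not the pipeline's own code):

```python
# Sketch: collect candidate Data Sources (link text + URL) from a hypothetical
# "List of Data Sources" page, such as a state index of agency transparency portals.
# The URL and markup are assumptions -- inspect the real page before relying on them.
import json
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example-state.gov/open-data/police-portals"  # hypothetical

resp = requests.get(INDEX_URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

candidates = []
for link in soup.select("main a[href]"):  # assumes the links of interest sit inside <main>
    candidates.append({
        "name": link.get_text(strip=True),
        "source_url": urljoin(INDEX_URL, link["href"]),
        "found_on": INDEX_URL,
    })

# Save as JSON so the list can be reviewed before anything is submitted as a Data Source.
with open("candidate_data_sources.json", "w", encoding="utf-8") as f:
    json.dump(candidates, f, indent=2)

print(f"Found {len(candidates)} candidate sources")
```

You would still review a list like this by hand; index pages mix real data sources with navigation and other unrelated links.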

How to contribute

We're still in the iteration and case study phase. If you want to learn something about the police, you can write a scraper to parse, normalize, or get deeper information from our Data Sources.
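
Whatever you extract, a little provenance makes it much easier for others to evaluate and reuse. Here is a minimal sketch of a metadata sidecar you might share alongside your data (the field names are suggestions, not an official PDAP schema):

```python
# Sketch: write a small provenance "sidecar" next to your extracted data so others
# can tell where it came from and how it was produced.
# The field names are suggestions for what to include, not an official PDAP schema.
import json
from datetime import datetime, timezone

provenance = {
    "source_url": "https://example-county.gov/police/daily-incident-log",  # hypothetical
    "agency": "Example County Sheriff's Office",                           # hypothetical
    "record_type": "Incident Reports",
    "scraped_at": datetime.now(timezone.utc).isoformat(),
    "scraper_code": "https://github.com/your-username/your-scraper",       # hypothetical
    "notes": "Parsed the daily incident table; dates normalized to ISO 8601.",
}

with open("extraction_metadata.json", "w", encoding="utf-8") as f:
    json.dump(provenance, f, indent=2)
```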

  1. Share your extraction and what you learned in Discord.

  2. We'll all learn about the criminal legal system from the experience, and brainstorm ways our tools could better facilitate your work.

  3. Repeat!

Not immediate priorities

Aggregation and hosting of scraped data

  • Data is most often useful in its own context, and scraped data is usually small enough to fit on free-tier hosting. After you publish a dataset, we can list it in our database!

  • It's not an immediate priority to build a big database that stores scraped data in a normalized format. That's almost everyone's first thought when they hear about our project (ours too), but comparing and combining data is its own research project. Our research tells us that access, organization, sharing, technical skills, and communication are the real bottlenecks for people using the data.

  • Aggregation is incredibly complex, and involves more than just mapping properties. So much context is needed before data from two departments can be compared; see the sketch after this list.

  • Publishing and vouching for extracted data, and documenting its provenance so it can be audited, is a big project. We only want to undertake this work for data we know will be useful.
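
To make the point about mapping properties concrete, here is a tiny illustration with made-up records from two hypothetical departments; renaming columns makes the tables line up, but it can't tell you whether the rows mean the same thing:

```python
# Illustration (made-up data): why aggregation is more than renaming columns.
# Department A logs one row per traffic stop; Department B logs one row per
# person involved and uses its own category codes. A naive merge "works"
# mechanically but silently mixes incompatible units and definitions.
dept_a = [  # one row per stop
    {"stop_date": "2021-03-01", "reason": "SPEED", "outcome": "WARNING"},
]
dept_b = [  # one row per person involved in a stop
    {"date": "2021-03-01", "violation_code": "12-204", "disposition": "REL"},
    {"date": "2021-03-01", "violation_code": "12-204", "disposition": "CITE"},
]

# A naive property mapping: the column names line up, the meanings don't.
mapping_b_to_a = {"date": "stop_date", "violation_code": "reason", "disposition": "outcome"}
merged = dept_a + [
    {mapping_b_to_a[key]: value for key, value in row.items()} for row in dept_b
]

# len(merged) == 3, but counting these as "3 stops" would be wrong: Department B's
# two rows describe one stop, and codes like "12-204" or "REL" can't be reconciled
# with "SPEED" or "WARNING" without each department's codebook.
print(merged)
```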

Automated scraper farms

  • It's not an immediate priority to automate the running of all the scrapers in our shared repo. The main reason: this is not what our users are asking us for. We plan to archive the sources, and facilitate sharing of scraper code. If we have a stable archive, scraping can be done on-demand.

Scrape every data source!

  • Scraping is hard work, and there are hundreds of thousands of potential data sources out there. For many applications, data doesn't even need to be processed to be useful; it just needs to be findable. We don't need to scrape things unless it's clearly adding value.

Run a Scraper you wrote, or one from the Scrapers Repo, to get an extraction. If you don't have scraping skills, you can use the #data-exchange channel in Discord to find someone who may be able to help.
