Legal Data Scraping

Data scraping for an ethical source of truth


Last updated 2 years ago


Questions to ask before you begin:

  • Is the material I intend to scrape protected by copyright?

  • Does the website I intend to scrape require authentication?

  • Will the scraping compromise individual privacy?

  • Will scraping reduce the value of the original data?

  • Am I in violation of the terms of use/service of the website I intend to scrape?

  • Will my scraping activity overload, damage, or otherwise adversely affect a server?

The answer to every one of these questions should be “no”. If you cannot answer a question, or are unsure of the answer, you should not proceed. Each question is discussed in more detail below.

Is the material I intend to scrape protected by copyright?

Copyright, a form of intellectual property law, protects original works of authorship including literary, dramatic, musical, and artistic works, such as poetry, novels, movies, songs, computer software, and architecture. Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed, for example, the creative arrangement of otherwise non-copyrightable facts. The U.S. Copyright Office provides additional information on the scope of copyright protection. When scraping data, limit your scraping to non-copyrightable facts, and collect only the parts of each page that your purpose requires.

Does the website I intend to scrape require authentication?

Scraping should be strictly limited to information that is presumptively accessible to the general public and that does not require authentication, such as a username and password, to access. If a website requires authentication to access data, that data should not be scraped. For further reading on this issue, the U.S. Court of Appeals for the Ninth Circuit’s 2019 decision in hiQ Labs, Inc. v. LinkedIn Corp. provides a helpful analysis.
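One way a scraper could act on this rule is to stop whenever a response looks like an authentication challenge. The sketch below is a rough heuristic rather than a legal test, and the function names `requires_authentication` and `should_scrape` are hypothetical. It inspects only the HTTP status code and headers, so it will not notice a login form served with a 200 status; a human should still review each source.

```python
# Status codes that signal the server expects credentials or is refusing access.
AUTH_STATUS_CODES = {401, 403, 407}


def requires_authentication(status_code, headers):
    """Return True if a response suggests the page is not open to the public.

    `headers` is any mapping of header names to values (hypothetical input
    shape; most HTTP client libraries expose something similar).
    """
    if status_code in AUTH_STATUS_CODES:
        return True
    # A WWW-Authenticate header means the server is issuing a login challenge.
    header_names = {name.lower() for name in headers}
    return "www-authenticate" in header_names


def should_scrape(status_code, headers):
    """Proceed only for a plain successful response with no auth challenge."""
    return status_code == 200 and not requires_authentication(status_code, headers)
```

A scraper would call `should_scrape` on each response and skip the page, without retrying using credentials, whenever it returns `False`.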

Will the scraping compromise individual privacy?

While information available in public records is public by its nature, be aware of whether the information being scraped reveals personal data that is generally understood to be private, such as social security numbers and bank or credit card information. This may also include personal data that can identify an individual person, such as name, email, phone number, and address. If in doubt, omit such personal data from the scope of the scrape.
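As an illustration of omitting personal data, the sketch below drops sensitive fields from a scraped record and masks anything in free text that is formatted like a U.S. social security number. The field names in `SENSITIVE_FIELDS` and the function `strip_personal_data` are assumptions made for this example, not part of any PDAP schema, and a single pattern will not catch every form of private data.

```python
import re

# Hypothetical field names; adjust to the records actually being scraped.
SENSITIVE_FIELDS = {
    "person_name", "email", "phone", "address",
    "ssn", "credit_card", "bank_account",
}

# Matches the common NNN-NN-NNNN layout of a U.S. social security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def strip_personal_data(record):
    """Return a copy of `record` without sensitive fields or SSN-like text."""
    cleaned = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_FIELDS:
            continue  # omit the whole field rather than store it
        if isinstance(value, str):
            value = SSN_PATTERN.sub("[REDACTED]", value)
        cleaned[key] = value
    return cleaned
```

Filtering before the data is written anywhere keeps private values out of logs and intermediate files as well as the final dataset.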

Will scraping reduce the value of the original data?

Again, scraping should be strictly limited to information that is presumptively accessible to the general public. This limitation helps ensure that the original data is not diminished in value as a result of being scraped.

Am I in violation of the terms of use/service of the website I intend to scrape?

When you log in (e.g., with a username and password) and/or expressly agree to the terms of use/service of a website, you are entering into a contract with the website owner, thereby agreeing to their rules. These rules may explicitly state that data may not be scraped from the website, and failing to adhere to such terms may put you in breach of that contract. As noted, however, your scraping should be strictly limited to information that is presumptively accessible to the general public and that does not require authentication.

Recent court rulings have confirmed the legality of web scraping regardless of Terms of Service. Ethics dictate that following Terms of Service is still best practice.

Will my scraping activity overload, damage, or otherwise adversely affect a server?

Do not harm the website you are scraping. This includes, for example, using a reasonable crawl rate and ensuring that the volume and frequency of your queries do not burden the website’s servers or interfere with the website’s normal operations. Respect the delay that crawlers should wait between requests by following the crawl-delay directive in the website’s robots.txt file. Where possible, limit scraping to a time of day when the website is unlikely to be experiencing heavy traffic, for example, early in the morning or late at night. All PDAP scrapers and crawlers must also abide by State and Federal computer crimes statutes, including those collected here.
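The crawl-delay guidance above can be sketched with Python's standard `urllib.robotparser` module. The helper `crawl_policy`, the bot name in the usage note, and the 5-second fallback are assumptions for illustration, not PDAP-mandated values. The function takes the text of a robots.txt file that has already been downloaded, so the sketch itself makes no network requests.

```python
from urllib.robotparser import RobotFileParser

# Assumed conservative fallback when robots.txt gives no crawl-delay.
DEFAULT_DELAY_SECONDS = 5.0


def crawl_policy(robots_txt, user_agent, url):
    """Return (allowed, delay_seconds) for `url` according to robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    # crawl_delay() returns None when no Crawl-delay applies to this agent.
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = DEFAULT_DELAY_SECONDS
    return allowed, float(delay)
```

A scraper might call `crawl_policy(robots_txt, "pdap-example-bot", url)` before each request, skip the URL when `allowed` is `False`, and otherwise call `time.sleep(delay)` before fetching the next page.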
