PII (WIP / Discussion)


We want people to feel sound ethically and legally when they contribute to PDAP. To that end, we need to do these things:
  1. Draw a “bright line” between the source material and the published data; i.e. “we are not changing the data, we are a mirror”. [a]
  2. Avoid aggregating already-public PII in a way that would make it more public than it already is [b].
  3. Strive for our stated mission [c] of being a complete source of truth, i.e. “we could not be accused of editorializing by omission”. Our current policy is that we're allowed to collect PII [d].
Our existing policy about accepting PII is legally the right one, but we do need to discuss how we aggregate and what we omit. Discuss in the thread for the next 3 days, then we'll do a survey or something.

How sure are we that omitting PII is against the rules?

We have a policy that allows PII; do we need to require it? Do we need to allow censorship?
Is PDAP the one doing the scraping, or the scraper community?
We can always get a second opinion. Maybe we need to define what PII is?
How can we prevent PDAP from being
Can we choose which data we aggregate? Sounds reasonable to Eddie.
Richard says there could be a security/know-your-user layer: make access more difficult than a search bar, e.g. require people to register.
PDAP could be considered a Wikipedia-like source of good sources.

Actions (polled in Discord)

  • Should we define "PII that we must not make more public" as "Name"? 3 yes, 0 no, 1 yes + address
  • Should we show PII in the mirrored source material? 2 yes, 1 no
  • Should we aggregate data on whether a field is included in the source data? 3 yes, 0 no
  • Should we restrict data usage to non-commercial in our published policy? 4 yes, 0 no
  • Should we redact PII when we provide data at scale (i.e. via API or queryable database)? 4 yes, 0 no
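
To make two of the polled actions concrete (redacting PII when serving data at scale, and aggregating whether a field is included in the source data), here is a rough sketch. Everything in it is an assumption for discussion, not an agreed design: the field names in PII_FIELDS, the "[REDACTED]" placeholder, and the helpers redact_record and field_presence are all hypothetical, and the sample records are made up.

```python
from collections import Counter

# Hypothetical: the poll settled on "name"; one vote also wanted "address".
PII_FIELDS = {"name", "address"}
REDACTED = "[REDACTED]"  # placeholder text is an assumption


def is_present(value):
    """Treat None and the empty string as 'field not included'."""
    return value not in (None, "")


def redact_record(record, pii_fields=PII_FIELDS):
    """Return a copy of the record with non-empty PII values replaced.

    The mirrored source record itself is left untouched.
    """
    return {
        key: REDACTED if key in pii_fields and is_present(value) else value
        for key, value in record.items()
    }


def field_presence(records):
    """Count how many records include a non-empty value for each field."""
    counts = Counter()
    for record in records:
        for key, value in record.items():
            if is_present(value):
                counts[key] += 1
    return dict(counts)


# Made-up sample records, for illustration only.
records = [
    {"name": "Jane Doe", "address": "1 Main St", "charge": "speeding"},
    {"name": "", "address": None, "charge": "trespass"},
]

redacted = [redact_record(r) for r in records]  # what an API might serve
presence = field_presence(records)              # aggregate without exposing values
```

The idea is that the source mirror keeps the original records, while anything served in bulk goes through redact_record, and field_presence reports only whether a field was filled in, never its value.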


[...private data] may also include personal data that can identify an individual person, such as name, email, phone number, and address. If in doubt, omit such personal data from the scope of the scrape.