You Are Not Anonymous
Jessica Su
Santa Clara University
April 13, 2018
Anonymity is important
There are many circumstances where you'd like to be anonymous
When searching on Google
In 2006, AOL released anonymous search logs from 650,000 users
Some users were quickly deanonymized
Anonymity can be broken
(or "How Latanya Sweeney accessed the medical records of the governor of Massachusetts")
You can break anonymity by linking anonymous datasets to datasets with PII
(personally identifiable information)
Latanya Sweeney had anonymous health insurance data
She bought publicly available voter registration data for $20 that contained some of the same fields
Health insurance data only: Ethnicity, Visit date, Diagnosis, Procedure, Medication
Both datasets: Zip code, Birthdate, Gender
Voter registration data only: Name, Address, Party affiliation
Fact: 87% of Americans are uniquely identifiable based on their zip code, gender, and birthdate
This means you can link people's real names to their sensitive medical histories!
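The linking step can be sketched in a few lines of Python. This is a minimal illustration, not Sweeney's actual procedure: the records, field names, and the `link` helper are all made up for the example, and a match is only reported when the (zip, birthdate, gender) tuple is unique in both datasets.

```python
from collections import defaultdict

# Hypothetical toy records; every value here is invented for illustration.
health_records = [
    {"zip": "12345", "birthdate": "1960-05-17", "gender": "M", "diagnosis": "hypertension"},
    {"zip": "12346", "birthdate": "1982-03-14", "gender": "F", "diagnosis": "asthma"},
    {"zip": "12346", "birthdate": "1982-03-14", "gender": "F", "diagnosis": "migraine"},
]
voter_records = [
    {"name": "Alice Example", "zip": "12346", "birthdate": "1990-01-01", "gender": "F"},
    {"name": "Bob Example", "zip": "12345", "birthdate": "1960-05-17", "gender": "M"},
]

# Fields that appear in both datasets.
QUASI_IDENTIFIERS = ("zip", "birthdate", "gender")


def key(record):
    """Project a record onto the shared fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(health, voters):
    """Return (name, health_record) pairs whose shared fields are unique in both datasets."""
    voters_by_key = defaultdict(list)
    for voter in voters:
        voters_by_key[key(voter)].append(voter)
    health_by_key = defaultdict(list)
    for record in health:
        health_by_key[key(record)].append(record)

    matches = []
    for k, records in health_by_key.items():
        candidates = voters_by_key.get(k, [])
        # Only an unambiguous one-to-one match reveals an identity.
        if len(records) == 1 and len(candidates) == 1:
            matches.append((candidates[0]["name"], records[0]))
    return matches


linked = link(health_records, voter_records)
```

In this toy data only "Bob Example" is linked to a diagnosis; the two health records that share one tuple stay ambiguous.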
Natural response:
Treat (ZIP, gender, birthdate) tuple as personally identifying information
Make sure each combination of personally identifying attributes appears at least twice
Problem: No clear separation between personally identifying attributes and non-identifying attributes
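The "appears at least twice" rule is the k = 2 case of the k-anonymity model cited in the references. A minimal sketch of the check follows; the function names, the generalized field values, and the toy rows are assumptions made for illustration.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of identifying attributes."""
    return Counter(tuple(r[f] for f in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k=2):
    """True if every combination of identifying attributes appears at least k times."""
    return all(size >= k for size in group_sizes(records, quasi_identifiers).values())


# Hypothetical rows where ZIP and birthdate have already been coarsened.
rows = [
    {"zip": "123**", "birth_year": 1982, "gender": "F", "diagnosis": "asthma"},
    {"zip": "123**", "birth_year": 1982, "gender": "F", "diagnosis": "migraine"},
    {"zip": "123**", "birth_year": 1960, "gender": "M", "diagnosis": "hypertension"},
]
fields = ("zip", "birth_year", "gender")

all_rows_ok = is_k_anonymous(rows, fields, k=2)         # third row is alone in its group
first_two_ok = is_k_anonymous(rows[:2], fields, k=2)    # both rows share one group
```

The check depends entirely on which fields are passed as `quasi_identifiers`, which is exactly the weak point named in the "Problem" line above.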
The Netflix deanonymization study
(or "How to make sensitive inferences from boring data")
The Netflix challenge
Netflix has a service that recommends movies to people
One key part of this was their algorithm to predict how users would rate movies
In 2006, Netflix announced a $1 million prize for the first team that could improve their algorithm's performance by 10%
As part of the contest, Netflix released a dataset of anonymous movie ratings
Arvind Narayanan, anonymity expert: "Can we figure out who these ratings belong to?"
Two key ingredients of deanonymization
1) Users must have distinctive, uniquely identifying attributes
(e.g. ZIP code, gender, birthdate)
For Netflix: 99% of users are uniquely identifiable if you know a randomly selected subset of 8 of their movie ratings
2) Those attributes must also appear in a less anonymous dataset
Ratings on the Internet Movie Database are attached to people's online identities
Deanonymization results
Two Netflix users were linked to their IMDB profiles
Movies viewed included "Jesus of Nazareth", "Power and Terror: Noam Chomsky in Our Times", and "Fahrenheit 9/11"
How did they do it?
Suppose Mary is an IMDB user.
Naive approach: search for a Netflix user who has rated all of the movies that Mary reviewed.
Problem: there is a lot of noise in the dataset, and IMDB and Netflix records do not perfectly correspond.
Instead, use a scoring function that softly penalizes a Netflix user for deviating from Mary's IMDB ratings
If the highest score is much higher than the second-highest score, return the highest-scoring Netflix user
Otherwise, there is no match
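The score-and-compare idea can be sketched as below. This is a simplified stand-in, not the scoring function from the Narayanan and Shmatikov paper: the linear penalty, the shared 1 to 5 rating scale, the `min_gap` threshold, and all toy data are assumptions chosen for illustration.

```python
def score(known_ratings, candidate, max_diff=4.0):
    """Soft similarity: each shared movie adds up to 1, less as the ratings diverge."""
    total = 0.0
    for movie, rating in known_ratings.items():
        if movie in candidate:
            total += 1.0 - abs(rating - candidate[movie]) / max_diff
    return total


def best_match(known_ratings, netflix_users, min_gap=1.0):
    """Return the top-scoring user only if it beats the runner-up by at least min_gap."""
    scored = sorted(
        ((score(known_ratings, ratings), user) for user, ratings in netflix_users.items()),
        reverse=True,
    )
    if not scored:
        return None
    best_score, best_user = scored[0]
    second_score = scored[1][0] if len(scored) > 1 else 0.0
    return best_user if best_score - second_score >= min_gap else None


# Hypothetical data: Mary's public ratings and three anonymous records.
mary = {"A": 5, "B": 1, "C": 4}
netflix = {
    "u1": {"A": 5, "B": 2, "C": 4, "D": 3},  # close to Mary, with a little noise
    "u2": {"A": 3, "C": 2},
    "u3": {"B": 1},
}
match = best_match(mary, netflix)
```

Refusing to answer when the top two scores are close is what keeps false matches down: a record that is merely the best of several similar candidates is not treated as an identification.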
What have we learned?
We can't divide the data into "public" and "sensitive" attributes
All movie ratings are sensitive when combined with other movie ratings
This is a general problem with sparse, high-dimensional data
Anonymous data is not safe to release
Identifying authors on the Internet
People make "anonymous" posts on the Internet
Can we figure out who they are based on their writing style?
Experiment design
Compare the anonymous posts to posts that are written under people's names
Koppel et al:
10,000 blogs from blogger.com
Divide each blog into 2000 words of "known text" and a 500-word anonymous "snippet"
Match the anonymous snippets to the authors of the known texts
Feature selection
Each post is represented by a vector of numerical features
Question: What are some examples of good features?
Koppel et al. used the counts of "space-free character 4-grams"
Space-free character 4-grams
Example: Always buy rugs not drugs
Space-free character 4-grams: Alwa, lway, ways, buy, rugs, not, drug, rugs
Feature vector: [1 0 1 1 0 1 2 1]
Vector entries count, in order: Alwa, bugs, drug, lway, mugs, not, rugs, ways
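The example above can be reproduced with a short sketch. The rule that words of four characters or fewer are kept whole is inferred from the slide's example ("buy", "not"), and the function names and the eight-entry vocabulary are assumptions read off the slide rather than the authors' code.

```python
from collections import Counter


def space_free_4grams(text):
    """Split on whitespace, then slide a 4-character window over each word."""
    grams = []
    for word in text.split():
        if len(word) <= 4:
            # Assumed rule: short words are kept as a single feature.
            grams.append(word)
        else:
            grams.extend(word[i:i + 4] for i in range(len(word) - 3))
    return grams


def feature_vector(text, vocabulary):
    """Count how often each vocabulary 4-gram occurs in the text."""
    counts = Counter(space_free_4grams(text))
    return [counts[gram] for gram in vocabulary]


# Vocabulary in the order shown on the slide.
vocabulary = ["Alwa", "bugs", "drug", "lway", "mugs", "not", "rugs", "ways"]
vector = feature_vector("Always buy rugs not drugs", vocabulary)
print(vector)  # [1, 0, 1, 1, 0, 1, 2, 1]
```

"buy" is extracted as a 4-gram but does not appear in the slide's vocabulary, so it contributes nothing to the vector; "rugs" is counted twice because it occurs both as a word and inside "drugs".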
How to deanonymize
Cosine similarity measures how similar two vectors are:
$$\mathrm{similarity}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$$
where $u \cdot v$ is the dot product and $\|u\|$ is the magnitude of the vector $u$
Find the cosine similarities between the feature vector of the anonymous document and the feature vectors of the named documents
Pick the author who wrote the document with the highest cosine similarity
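A minimal sketch of this matching step in plain Python. The author names and feature vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here to avoid dividing by zero.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty feature vectors
    return dot / (norm_u * norm_v)


def most_similar_author(anonymous_vector, known_vectors):
    """Pick the named author whose feature vector is closest to the anonymous one."""
    return max(
        known_vectors,
        key=lambda author: cosine_similarity(anonymous_vector, known_vectors[author]),
    )


# Hypothetical 4-gram count vectors for two named authors and one anonymous snippet.
known = {
    "alice": [1, 0, 1, 1, 0, 1, 2, 1],
    "bob": [0, 3, 0, 0, 2, 1, 0, 0],
}
snippet = [2, 0, 1, 2, 0, 1, 3, 2]
guess = most_similar_author(snippet, known)
```

Because the score is normalized by vector length, a long document and a short one by the same author can still score as similar.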
Improvements
46% of the snippets were correctly assigned
To improve this, run the deanonymization algorithm using only a randomly sampled subset of the features
Do this on many different subsets
If enough of the subsets agree that the snippet was written by someone, return that person
Idea: It's harder to be #1 across many subsets of the feature set than it is to be #1 on the full feature set
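The voting scheme might look roughly like this. It is a sketch under assumptions: the number of rounds, the fraction of features sampled, the agreement threshold, and the toy vectors are illustrative choices, not the parameters used by Koppel et al.

```python
import math
import random
from collections import Counter


def cosine_similarity(u, v):
    """Dot product divided by the product of magnitudes (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute_with_subsets(snippet, known, n_rounds=100, subset_fraction=0.4,
                           min_agreement=0.9, seed=0):
    """Vote over random feature subsets; return an author only if enough rounds agree."""
    rng = random.Random(seed)
    n_features = len(snippet)
    subset_size = max(1, int(subset_fraction * n_features))
    votes = Counter()
    for _ in range(n_rounds):
        idx = rng.sample(range(n_features), subset_size)
        sub_snippet = [snippet[i] for i in idx]
        # Winner of this round: highest cosine similarity on the sampled features.
        winner = max(
            known,
            key=lambda author: cosine_similarity(sub_snippet, [known[author][i] for i in idx]),
        )
        votes[winner] += 1
    author, count = votes.most_common(1)[0]
    return author if count / n_rounds >= min_agreement else None


# Hypothetical vectors: the snippet matches "alice" on every feature.
known = {"alice": [1, 2, 3, 4, 5, 6], "bob": [6, 5, 4, 3, 2, 1]}
consistent = attribute_with_subsets([1, 2, 3, 4, 5, 6], known)

# Here each author matches only one feature, so the votes split and no one is returned.
split_known = {"alice": [1, 0], "bob": [0, 1]}
undecided = attribute_with_subsets([1, 1], split_known, subset_fraction=0.5)
```

Returning no answer when the subsets disagree trades coverage for precision: fewer snippets get an author, but the ones that do are less likely to be wrong.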
Another idea
Narayanan et al: Can we train a machine learning classifier to predict which author wrote a document?
In conclusion
There are a whole bunch of these studies
Can deanonymize location data, credit card data, web browsing history data, etc.
Lesson: be careful when releasing "anonymous" data
Often you can link it back to people's real identities
Thanks for listening
• Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
• Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pages 111–125. IEEE, 2008.
• Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94, 2011.
• Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Emil Stefanov, Eui Chul Richard Shin, and Dawn Song. On the feasibility of internet-scale author identification. In IEEE Symposium on Security and Privacy, 2012.