Protecting privacy in practice


Transcript of Protecting privacy in practice

Page 1: Protecting privacy in practice

Protecting privacy in practice

2017-06-13
Lars Albertsson

www.mapflat.com


Page 2: Protecting privacy in practice

Who’s talking?

● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)

Page 3: Protecting privacy in practice

Privacy protection resources

(Slide shows illustrative captions:)
● "All of this might go wrong. Large fine."
● "Pour your data into our product."
● 404

Page 4: Protecting privacy in practice

Privacy-driven design

● Technical scope
  ○ Toolbox
  ○ Not complete solutions
● Assuming that you solve:
  ○ Legal requirements
  ○ Security primitives
  ○ ...
● Not a description of any company

(Surrounding concept cloud: Architecture, Statistics, Legal, Organisation, Security, Process, Privacy UX, Culture)

Page 5: Protecting privacy in practice

Requirements, engineer’s perspective

● Right to be forgotten
● Limited collection
● Limited retention
● Limited access
  ○ From employees
  ○ In case of security breach
● Right to explanations
● User data enumeration
● User data export

Page 6: Protecting privacy in practice

Ancient data-centric systems

● The monolith
● All data in one place
● Analytics + online serving from single database
● Current state, mutable

- Please delete me?
- What data have you got on me?
- Sure, no problem!

(Diagram: Presentation / Logic / Storage layers over a single DB)

Page 7: Protecting privacy in practice

Event oriented systems

(Diagram: every event -> all events, ever, raw, unprocessed -> refinement pipeline -> artifact of value)

Page 8: Protecting privacy in practice

Event oriented systems

(Diagram: every event -> all events, ever, raw, unprocessed -> refinement pipeline -> artifact of value)

● Motivated by
  ○ New types of data-driven (AI) features
  ○ Quicker product iterations
    ■ Data-driven product feedback (A/B tests)
    ■ Fewer teams involved in changes
  ○ Robustness - scales to more complex business logic

Enable disruption

Page 9: Protecting privacy in practice

Data processing at scale

(Diagram: services and DBs feed an ingress step; offline processing runs pipelines of jobs over datasets in a data lake on cluster storage; egress exports to services, DBs, and business intelligence)

Page 10: Protecting privacy in practice

Workflow manager

● Dataset "build tool"
● Run job instance when
  ○ input is available
  ○ output missing
  ○ resources are available
● Backfill for previous failures
● DSL describes DAG
  ○ Includes ingress & egress

Recommended: Luigi / Airflow
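The "run job instance when input is available, output missing" rule above can be sketched in a few lines of plain Python (this is an illustration of the idea, not Luigi or Airflow; all names are invented):

```python
# Existing datasets, keyed by path-like names (illustrative contents).
datasets = {
    "users/2017-06-13": ["alice"],
    "orders/2017-06-13": ["order-1"],
}

def run_if_ready(output, inputs, job):
    """Run job only when all inputs exist and the output is missing."""
    if output in datasets:
        return "skipped: output exists"
    if not all(i in datasets for i in inputs):
        return "deferred: input missing"
    datasets[output] = job(*(datasets[i] for i in inputs))
    return "ran"
```

Re-invoking the scheduler is what makes backfills cheap: a job whose output already exists is skipped, one with missing inputs is simply deferred until a later run.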

Page 11: Protecting privacy in practice

Factors of success

● Event-oriented - append only
● Immutability
● At-least-once semantics
● Reproducibility
  ○ Through 1000s of copies
● Redundancy

- Please delete me?
- What data have you got on me?
- Hold on a second...

Page 12: Protecting privacy in practice

Solution space

● Technical feasibility
● Easy to do the right thing
● Awareness culture

Page 13: Protecting privacy in practice

Personal information (PII) classification

● Red - sensitive data
  ○ Messages
  ○ GPS location
  ○ Views, preferences
● Yellow - personal data
  ○ IDs (user, device)
  ○ Name, email, address
  ○ IP address
● Green - insensitive data
  ○ Not related to persons
  ○ Aggregated numbers
● Grey zone
  ○ Birth date, zip code
  ○ Recommendation / ads models?

Nota bene: This is only an example classification

Page 14: Protecting privacy in practice

PII arithmetic

Red + green = red
Red + yellow = red
Yellow + green = yellow

Aggregate(red/yellow) = green?
Green + green + green = yellow?
Yellow + yellow + yellow = red?

Machine_learning_model(yellow) = yellow?
Overfitting => persons could be identified
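The basic rule above (combined data inherits the most sensitive classification involved) is a simple maximum over an ordering; a minimal sketch, using the example red/yellow/green levels:

```python
# Sensitivity ordering from the example classification (green < yellow < red).
LEVELS = {"green": 0, "yellow": 1, "red": 2}

def combine(*classes):
    """Combining data takes the most sensitive classification involved."""
    return max(classes, key=LEVELS.__getitem__)
```

Note that this captures only the base rule; the question-marked cases above (aggregation downgrading to green, or many green/yellow fields together escalating) need human judgement and are not expressible as a simple maximum.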

Page 15: Protecting privacy in practice

Make privacy visible at ground level

● In dataset names
  ○ hdfs://red/crm/received_messages/year=2017/month=6/day=13
  ○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13
● In field names
  ○ response.y_text = "Dear " + user.y_name + ", thanks for contacting us …"
● In credential / service / table / ... names

● Spreads awareness
● Catch mistakes in code review
● Enables custom tooling for violation warnings
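The "custom tooling for violation warnings" point can be illustrated with a small lint over the field-name prefixes above (the r_/y_ prefixes follow the slide's convention; the lint itself and its input shape are assumptions for this sketch):

```python
# Illustrative check: flag assignments that flow a more sensitive field
# (by naming convention) into a less sensitive one.
ORDER = {"green": 0, "yellow": 1, "red": 2}

def classify_field(name):
    """Classify by the naming convention: r_ = red, y_ = yellow, else green."""
    if name.startswith("r_"):
        return "red"
    if name.startswith("y_"):
        return "yellow"
    return "green"

def violations(assignments):
    """assignments: list of (target_field, [source_fields]); return bad targets."""
    bad = []
    for target, sources in assignments:
        if any(ORDER[classify_field(s)] > ORDER[classify_field(target)] for s in sources):
            bad.append(target)
    return bad
```

Such a check is cheap precisely because the classification is visible in the names; without the prefixes, the same tooling would need full lineage tracking.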

Page 16: Protecting privacy in practice

Eye of the needle tool

● Provide data access through gateway tool
  ○ Thin wrapper around Spark/Hadoop/S3/...
  ○ Hard-wired configuration
● Governance
  ○ Access audit, verification
  ○ Policing/retention for experiment data

Page 17: Protecting privacy in practice

Eye of the needle tool

● Easy to do the right thing
  ○ Right resource choice, e.g. "allocate temporary cluster/storage"
  ○ Enforce practices, e.g. run jobs from central repository code
  ○ No command for data download
● Enabling for data scientists
  ○ Empowered without operations
  ○ Directory of resources

Page 18: Protecting privacy in practice

Towards oblivion

● Left to its own devices, personal (PII) data spreads like weeds
● PII data needs to be governed, collared, or discarded
  ○ Discard what you can

Page 19: Protecting privacy in practice

Discard: Anonymisation

● Discard all PII
  ○ User id in example
● No link between records or datasets
● Replace with non-PII
  ○ E.g. age, gender, country
● Still no link
  ○ Beware: rare combination => not anonymised

(Diagram: drop user id, useful for metrics; replace user id with demographics, useful for business insights)
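Both discard options above amount to simple record transformations; a minimal sketch with invented field names:

```python
def anonymise(record):
    """Drop the user id entirely: no link between records remains."""
    return {k: v for k, v in record.items() if k != "user_id"}

def demographise(record, demographics):
    """Replace the user id with coarse non-PII attributes (e.g. age band, country)."""
    out = anonymise(record)
    out.update(demographics[record["user_id"]])
    return out
```

The caveat from the slide applies directly to `demographise`: if the demographic combination is rare, the output is not truly anonymised.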

Page 20: Protecting privacy in practice

Partial discard: Pseudonymisation

● Hash PII
● Records are linked
  ○ Across datasets
  ○ Still PII, GDPR applies
  ○ Persons can be identified (with additional data)
  ○ Hash recoverable from PII
● Hash PII + salt
  ○ Hash not recoverable
● Records are still linked
  ○ Across datasets if salt is constant

(Diagram: hash user id, useful for recommendations; hash user id + salt, useful for product insights)
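The two variants above differ only in whether a secret salt is mixed in; a minimal sketch:

```python
import hashlib

def pseudonymise(user_id, salt=b""):
    """Hash a user id. Without salt, the hash is recoverable by hashing
    candidate ids; with a secret salt it is not. Either way, the same
    (id, salt) pair always yields the same value, so records stay linked
    across datasets as long as the salt is constant."""
    return hashlib.sha256(salt + user_id.encode()).hexdigest()
```

Rotating the salt per dataset would break cross-dataset linking, which is exactly the trade-off between recommendations (need the link) and product insights (often do not).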

Page 21: Protecting privacy in practice

Governance: Recomputation

● Push reruns with workflow manager

- No versioning support in tools
- Computationally expensive
- Easy to miss datasets
+ No data model changes required

Page 22: Protecting privacy in practice

Ejected record pattern

● Fields reference PII table
● Clear single record => oblivion

- PII table injection needed
  - Key with UUID or hash
- Extra join
- Multiple or wide PII tables
+ Small PII leak risk

Page 23: Protecting privacy in practice

Record removal in pipelines

● Datasets are immutable
● Version n+1 of raw dataset lacks record
● Short retention of old versions
● Always depend on latest version

(Diagram: user keys datasets for 2017-06-12 and 2017-06-13)

    class Purchases(Task):
        date = DateParameter()

        def requires(self):
            return [Users(self.date), Orders(self.date), UserKeys.latest()]

Page 24: Protecting privacy in practice

Lost key pattern

● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion

- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Small PII leak risk
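A toy sketch of the pattern above: PII fields are encrypted with a per-user key, and deleting that single key makes all of the user's ciphertexts permanently unreadable. The XOR keystream here is a stand-in for illustration only; a real system would use a proper cipher (e.g. AES):

```python
import hashlib

user_keys = {"u1": b"per-user-secret"}  # the per-user decryption key table

def xor_crypt(key, data):
    """Toy symmetric cipher: XOR with a SHA-256-derived keystream.
    Applying it twice with the same key recovers the plaintext.
    NOT real crypto; illustrative stand-in for e.g. AES."""
    stream = hashlib.sha256(key).digest()
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

def forget_user(user_id):
    """Oblivion: without the key, this user's ciphertexts are garbage."""
    del user_keys[user_id]
```

Because every encrypted field of the user depends on the one key, this gives the multi-field oblivion noted on the slide, at the cost of a join plus a decrypt on each read.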

Page 25: Protecting privacy in practice

Lost key partial oblivion

● Different fields encrypted with different keys
● Partial user oblivion
  ○ E.g. forget my GPS coordinates

Page 26: Protecting privacy in practice

Lost link key

● Encrypt key fields that link datasets
● Ability to join is lost
● No data loss
  ○ Salt => anonymous data
  ○ No salt => pseudonymous data

Page 27: Protecting privacy in practice

Reversible oblivion

● Lost key pattern
● Give ejected record key to third party
  ○ User
  ○ Trusted organisation
● Destroy company copies

Page 28: Protecting privacy in practice

Data model deadly sins

● Using PII data as key
  ○ Username, email
● Publishing entity ids containing PII data
  ○ E.g. user shared resources (favourites, compilations) including username
● Publishing pseudonymised datasets
  ○ They can be de-pseudonymised with external data
  ○ E.g. AOL, Netflix, ...

Page 29: Protecting privacy in practice

Tombstone line

● Produce dataset/stream of forgotten users
● Egress components, e.g. online service databases, may need push for removal
  ○ Higher PII leak risk

(Diagram: DB and service consuming the tombstone stream)
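The consuming side of the tombstone line can be sketched as a function that downstream egress stores run over each batch of forgotten user ids (store shape and names are illustrative):

```python
# A serving-side store keyed by user id (stand-in for an online DB).
service_db = {
    "u1": {"email": "a@example.com"},
    "u2": {"email": "b@example.com"},
}

def apply_tombstones(db, tombstones):
    """Push removal into the serving store for each forgotten user id."""
    for user_id in tombstones:
        db.pop(user_id, None)
    return db
```

The slide's caveat is that the tombstone stream itself enumerates forgotten users, so it is a sensitive dataset in its own right and raises the PII leak risk.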

Page 30: Protecting privacy in practice

The art of deletion

● Example: Cassandra
● Deletions == tombstones
● Data remains
  ○ Until compaction
  ○ In disconnected nodes

Component-specific expertise necessary

Page 31: Protecting privacy in practice

Deletion layers

● Every component adds deletion burden
  ○ Minimise number of components
  ○ Ephemeral >> dedicated. Recycle machines.
● Every storage layer adds deletion burden
  ○ Minimise number of storage layers
  ○ Cloud storage requires documented erasure semantics + agreements.
● Invent simple strategies
  ○ Example: Cycle Cassandra machines regularly, erase block devices.

Increasing cost of heterogeneity

Page 32: Protecting privacy in practice

Retention limitation

● Best solved in workflow manager
  ○ Connect creation and destruction
● Short default retention, whitelist exceptions
● In conflict with technical ideal of immutable raw data
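The "short default retention, whitelist exceptions" rule above is a small policy function once the workflow manager knows each dataset's creation date; a minimal sketch with invented names and parameters:

```python
from datetime import date, timedelta

def expired(datasets, today, default_days=30, whitelist=()):
    """Return dataset names past the default retention, unless whitelisted.
    datasets: {name: creation_date}."""
    return [name for name, created in datasets.items()
            if name not in whitelist
            and (today - created) > timedelta(days=default_days)]
```

Connecting this to creation (rather than running it as a separate cleanup) is the slide's point: the component that builds a dataset is the one that knows when to destroy it.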

Page 33: Protecting privacy in practice

Lake promotion

● Remove expired raw dataset, promote derived datasets to lake
● First derived dataset = washed(raw)?
● Workflow DAG still works

(Diagram: before and after, with the data lake and derived datasets on cluster storage)

Page 34: Protecting privacy in practice

Lineage

● Tooling for tracking data flow
● Dataset granularity
  ○ Workflow manager?
● Field granularity
  ○ Framework instrumentation?
● Multiple use cases
  ○ (Discovering data)
  ○ (Pipeline change management)
  ○ Detecting dead-end data flows
  ○ Right to export data
  ○ Explanation of model decisions

Page 35: Protecting privacy in practice

Solicitation: PII & lineage type systems

● Idea: decorate (Scala) types
  ○ PII classification (red/yellow/green)
  ○ Lineage (e.g. processing class id + commit id + dataset revision)
● Assistance with PII arithmetic
  ○ PII[Red, String] + PII[Green, String] => PII[Red, String]
  ○ PII[Red, Int] + PII[Red, Int] => PII[Green, Int]
● Detect unused PII fields
● Assist with recomputation
  ○ For PII cleaning
  ○ Bug fixes
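The slide proposes decorated Scala types; the core idea can be sketched at runtime in Python as a wrapper whose operations propagate the most sensitive classification, mirroring PII[Red, String] + PII[Green, String] => PII[Red, String] (the aggregation-downgrade case like PII[Red, Int] + PII[Red, Int] => PII[Green, Int] is a policy decision this sketch does not attempt):

```python
LEVELS = {"green": 0, "yellow": 1, "red": 2}

class PII:
    """Value tagged with a PII classification that propagates through +."""
    def __init__(self, cls, value):
        self.cls, self.value = cls, value

    def __add__(self, other):
        # Combined value inherits the most sensitive classification.
        cls = max(self.cls, other.cls, key=LEVELS.__getitem__)
        return PII(cls, self.value + other.value)
```

A static (compile-time) version of this, as the slide suggests, would catch classification violations before any data is processed rather than at runtime.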

Page 36: Protecting privacy in practice

Resources

● http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
● http://www.mapflat.com/lands/resources/reading-list
● https://ico.org.uk/
● EU Article 29 Working Party

Credits

● Alexander Kjeldaas, independent
● Lena Sundin, Spotify
● Oscar Söderlund, Spotify
● Oskar Löthberg, Spotify
● Sofia Edvardsen, Sharp Cookie Advisors
● Øyvind Løkling, Schibsted Media Group