DataEngConf SF16 - Data Asserts: Defensive Data Science
Transcript of DataEngConf SF16 - Data Asserts: Defensive Data Science
Data Asserts: Defensive Data Science
Tommy Guy, Microsoft
Observation: Complexity In Pipeline
Our pipeline: DATA!!! → Insight! Direction! Strategy!
Our pipeline in reality: bugs tend to compound
How do Engineers Manage Complexity?
Encapsulate: create functions, classes, and subsystems with clear APIs. This helps isolate complexity.
Integration tests: ensure that the components interact correctly. This helps identify breaking changes.
Data introduces a few complications
• Pipelines take many upstream dependencies.
• Researcher use cases are frequently unknown and unanticipated by data providers.
• Pushing requirements upstream to all producers is Sisyphean.
We are not talking about data pipeline tests
The data pipeline teams:
• Are all rows that are produced stored?
  • Counter fields to ensure no dropped rows
  • Sentinel events to measure join fidelity
• Are availability SLAs being met?
  • Progressive server-client merging
Data Scientists Require Semantic Correctness
Does this field mean what I think it does?
How do Data Scientists identify potential errors?
Some follow-on fact is absurd…
… which leads to investigation …
… which finds a broader problem.

If [potential conclusion], then we must have 3 billion OneDrive users…
… because my user table doesn’t have a primary key …
… so I should aggregate by user.
What are your Assumptions?
If I conclude “Users who upload files to OneDrive are XXX% more likely to buy Office if they also sent mail through Mobile Outlook”, I’m making many silent assumptions:
| Field | Assumptions |
| --- | --- |
| User Id | Logged and PII-encrypted similarly in Outlook and OneDrive. Correctly logging timestamp for Office purchase. User Id isn’t empty or missing. |
| OneDrive activity | Wasn’t automated traffic [identified by a certain flag]. |
| Email activity | Mobile client identifiers are correct. |
| All | Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners. |
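These field assumptions can be made explicit and re-runnable as asserts rather than left silent. A minimal sketch, assuming a list-of-dicts row shape; the field names `user_id` and `is_automated` are illustrative, not from the talk.

```python
# Turn silent field assumptions into explicit checks that run with
# every analysis, so a violated assumption fails loudly instead of
# silently skewing the conclusion.

def assert_user_id_present(rows):
    """User Id must never be empty or missing."""
    for i, row in enumerate(rows):
        uid = row.get("user_id")
        if uid is None or uid == "":
            raise AssertionError(f"Row {i}: empty or missing user_id")

def assert_no_automated_traffic(rows):
    """OneDrive activity must not include rows flagged as automated."""
    bots = sum(1 for row in rows if row.get("is_automated"))
    if bots:
        raise AssertionError(f"{bots} rows flagged as automated traffic")

rows = [
    {"user_id": "u1", "is_automated": False},
    {"user_id": "u2", "is_automated": False},
]
assert_user_id_present(rows)
assert_no_automated_traffic(rows)
```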
What are your Sanity Checks?
• If a column “OfficeId” is really a user id, it has certain known properties.
• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.

| Assumption | Why does it matter? |
| --- | --- |
| Never null/empty | Causes job-breaking data-skew issues. |
| Users are 1:* with Tenants | Logical constraint: a violation is a sign you are missing something. |
| Very high cardinality | If this isn’t true, it’s unlikely to be a user id. |
| All rows in event data join to it | Otherwise, your data is incomplete. |
| Matches a certain regex | Sanity check: if this isn’t true, it’s unlikely to be a user id. |
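Several of the checks in the table above (never null/empty, very high cardinality, matches a regex) can be written as a single function that runs on every pipeline execution rather than only at setup time. The column name “OfficeId” comes from the slide; the regex pattern and cardinality threshold are illustrative assumptions.

```python
import re

def sanity_check_user_id(values, id_pattern=r"^[0-9a-f]{8}$",
                         min_cardinality_ratio=0.9):
    """Re-runnable sanity checks on a purported user-id column."""
    # Never null/empty: avoids job-breaking data-skew issues.
    assert all(v for v in values), "OfficeId contains null/empty values"

    # Very high cardinality: most values should be distinct,
    # otherwise this column is unlikely to be a user id.
    ratio = len(set(values)) / len(values)
    assert ratio >= min_cardinality_ratio, (
        f"Cardinality ratio {ratio:.2f} too low to be a user id"
    )

    # Matches a known format (pattern assumed for illustration).
    pat = re.compile(id_pattern)
    assert all(pat.match(v) for v in values), "OfficeId fails format regex"

sanity_check_user_id(["0a1b2c3d", "4e5f6a7b", "8c9d0e1f"])  # passes
```

The join-fidelity and 1:* constraints would need a second table, so they are omitted here, but they follow the same pattern: encode the assumption once, then assert it on every run.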
Data Asserts: Defensive Data Science
Data Asserts: Maintain Quality
Data Asserts: Clear Trust Boundaries
Data Asserts: Defensive Data Science
These should match!
Data Asserts in Production: A Few Observations
• Most of the analysis-impacting assertion failures we’ve seen were actually errors in our assumptions, not errors in the pipeline.
• Good tests beget good code: we’ve had to modularize our code in order to produce testable chunks that get re-used in pipelines.
• Data asserts are the backbone of data provenance: a data conclusion can directly link to all of the assumptions we made about the input.