Synthetic Data Generation for Realistic Analytics Examples ... · • Software Engineer at Red Hat...
Synthetic Data Generation for Realistic Analytics Examples and Testing
Ronald J. Nowling
Red Hat, Inc. [email protected]
http://rnowling.github.io/
Who Am I?
• Software Engineer at Red Hat
• Data Science Team, Emerging Technologies
  – Evaluate open-source Big Data space
  – Ensure software works for Red Hat customers
  – Promote data science internally through consulting projects
• Apache BigTop PMC
Synthetic Data
• No licensing, privacy, or intellectual property concerns
• Scalable: Laptops to Clusters!
• More reliable than external data sets
• Enable more realistic example applications
• Enable more comprehensive testing than wordcount and TeraSort
Data Transformation and Summarization Pipeline
[Diagram: three parallel branches, each Raw Daily Page Views → Transform Raw Text → Parse → Clean & Validate → Summarize; further elements: Accounts, Aggregate, DailyActivity, CumulativeActivity]
Synthetic Data
• Sensitive Data
  – Real data on cluster for scalability testing and validation
  – Synthetic data for local development and testing
• Needed smaller data sets for checking calculations
  – Checking total aggregation results requires re-running the old pipeline
  – Extra burden on operations team
  – Delay for development team
Validation with Synthetic Data
[Diagram elements: DataGenerator, Accounts, Raw Daily Page Views, Expected Daily Activity, Expected Cumulative Activity, Transformation and Summarization Pipeline, Daily Activity, Cumulative Activity, ValidationScript]
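The validation setup named on this slide can be illustrated with a minimal sketch. It assumes one reading of the diagram: the data generator emits both the pipeline inputs (accounts, raw page views) and the expected outputs, and a validation script compares the pipeline results against those expectations. The record layout (tab-separated date, account, page), the function names, and the `toy_pipeline` stand-in are all hypothetical; the real pipeline and generator are not shown in the slides.

```python
import random
from collections import Counter
from datetime import date, timedelta


def generate(num_accounts, num_days, seed=0, start=date(2015, 1, 1)):
    """Return (accounts, raw_lines, expected_daily, expected_cumulative)."""
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    accounts = [f"acct{i:04d}" for i in range(num_accounts)]
    raw_lines = []
    expected_daily = Counter()       # (account, ISO date) -> page views
    expected_cumulative = Counter()  # account -> total page views
    for day in range(num_days):
        day_str = (start + timedelta(days=day)).isoformat()
        for account in accounts:
            views = rng.randint(0, 5)
            for _ in range(views):
                raw_lines.append(f"{day_str}\t{account}\t/page/{rng.randint(1, 20)}")
            if views:
                expected_daily[(account, day_str)] = views
                expected_cumulative[account] += views
    rng.shuffle(raw_lines)  # the pipeline should not rely on input order
    return accounts, raw_lines, expected_daily, expected_cumulative


def toy_pipeline(accounts, raw_lines):
    """Hypothetical stand-in for the real pipeline: parse, validate, summarize, aggregate."""
    known = set(accounts)
    daily = Counter()
    for line in raw_lines:
        day_str, account, _page = line.split("\t")
        if account in known:  # drop records for unknown accounts
            daily[(account, day_str)] += 1
    cumulative = Counter()
    for (account, _day), views in daily.items():
        cumulative[account] += views
    return daily, cumulative


def validate(actual, expected):
    """Return the keys whose actual and expected values differ."""
    keys = set(actual) | set(expected)
    return sorted(k for k in keys if actual.get(k) != expected.get(k))


accounts, raw, exp_daily, exp_cum = generate(num_accounts=10, num_days=7)
daily, cumulative = toy_pipeline(accounts, raw)
print("daily mismatches:", validate(daily, exp_daily))
print("cumulative mismatches:", validate(cumulative, exp_cum))
```

Both mismatch lists are empty when the pipeline agrees with the generator. Because the generator is seeded and parameterized by size, the same check can run on a laptop with a few accounts or at a larger scale.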
Issues Tackled
• Error in account validation introduced while refactoring code
• Usage of the correct join types
• Validation of date-time operations
• Correct Output Formats
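To illustrate why join types are worth validating, here is a small sketch with made-up account names and counts (none of them from the talk). It shows how an inner join silently drops accounts with no activity, while a left outer join from the accounts table keeps them. Which behavior is correct depends on the pipeline's requirements; the point is that small synthetic inputs with known expected outputs make the difference visible.

```python
accounts = ["alice", "bob", "carol"]     # hypothetical account IDs
daily_views = {"alice": 3, "carol": 1}   # bob had no page views on this day

# Inner join: only accounts that appear in both inputs survive.
inner = {a: daily_views[a] for a in accounts if a in daily_views}

# Left outer join from accounts: every account is kept, missing counts become 0.
left = {a: daily_views.get(a, 0) for a in accounts}

print(inner)  # {'alice': 3, 'carol': 1}
print(left)   # {'alice': 3, 'bob': 0, 'carol': 1}
```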
Apache BigTop BigPetStore Blueprints
• Problem domain: Transactions for a fictional chain of pet stores
• BigPetStore Data Generator simulates customer purchasing behavior to generate realistic transaction data
• Blueprints for big data ecosystem
  – Hadoop: MapReduce / Pig / Hive / Mahout
  – Spark
  – Flink (in progress)
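As a rough illustration of what "simulating customer purchasing behavior" can mean, the toy sketch below draws each customer's purchase times from an exponential distribution and picks a product at random. This is not the BigPetStore data generator or its model; the product list, the parameters, and the function name are assumptions made for the example.

```python
import random

PRODUCTS = ["dog food", "cat litter", "fish flakes"]  # hypothetical catalog


def simulate_transactions(num_customers, num_days, seed=0):
    """Yield (time_in_days, customer_id, product) tuples, in time order per customer."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    for customer_id in range(num_customers):
        # Each customer gets their own average gap (in days) between purchases.
        mean_gap = rng.uniform(2.0, 10.0)
        t = rng.expovariate(1.0 / mean_gap)
        while t < num_days:
            yield (t, customer_id, rng.choice(PRODUCTS))
            t += rng.expovariate(1.0 / mean_gap)


transactions = list(simulate_transactions(num_customers=5, num_days=30))
print(len(transactions), "transactions generated")
```

Scaling the data set up or down is a matter of changing `num_customers` and `num_days`, which is the property the "Laptops to Clusters" point relies on.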
BigPetStore
[Diagram built up across several slides; labels shown: HCFS, Core (RDDs), Spark SQL, MLlib]
Team Cluster
• ~10 nodes
• 40 cores, 400GB RAM per node
Potential Issues
• Infrastructure
• Storage
• Software Installation
• Software Upgrades
• Spark Configuration Tuning
• User Management
Real Stories
• Creating a new user
  – User Gluster permissions incorrect
• Cluster upgrade
  – Spark upgrade didn’t take because of issue with Ansible role configuration
  – Wiped out our spark.conf
  – master / mesos settings wrong
• Gluster mount points disappeared on reboot
  – Not set in fstab
k8petstore
[Architecture diagram; elements shown: Users, Public IP Proxy, Web Application, Redis Master, three Redis Slaves, three BPS DataGenerators]
Use Cases
• Configuration
• Scalability
• Fault Tolerance
k8petstore
• OpenContrail networking solution demo [1]
• Kubernetes JuJu Charm documentation example [2]
• Kubernetes v1.0 launch talk at OSCON [3]
[1] - https://pedrormarques.wordpress.com/2015/04/24/kubernetes-and-opencontrail/
[2] - http://kubernetes.io/v1.0/docs/getting-started-guides/juju.html
[3] - http://www.oscon.com/open-source-2015/public/schedule/detail/45281
APACHE BIGTOP DATA GENERATORS
BigPetStore
BigTop Weatherman
BigTop Bazaar
Vision
• Encourage synthetic data generation for testing and realistic examples
• Serve as a resource for the larger Apache and open source communities
• Emphasis on
  – Flexibility
  – Scalability
  – Realism
• We look forward to collaborating and getting folks involved!
Conclusion
• Synthetic data generators and blueprints are useful!
• Case studies:
  – Data Processing Pipelines
  – Cluster Deployment
  – Kubernetes
• BigPetStore and BigTop Data Generators efforts in Apache BigTop
• Open invitation to get involved and collaborate
Resources
http://bigtop.apache.org/
http://github.com/apache/bigtop
http://rnowling.github.io/
QUESTIONS