poloclub.github.io/#cse6242 CSE6242 / CX4242 Data ......Twitter 104 Million 3.7 Billion Phone call...
poloclub.github.io/#cse6242
CSE6242 / CX4242 Data & Visual Analytics
Duen Horng (Polo) Chau, Associate Professor, College of Computing; Associate Director, MS Analytics, Georgia Tech
Mahdi Roozbahani, Lecturer, Computational Science & Engineering, Georgia Tech; Founder of Filio, a visual asset management platform
Course TAs
Be very very nice to them!
Sam Stentz (Co-Head TA), Dhaval Desai (Co-Head TA), Soo Hyung Park, Edmund Chen, Riya Bakhtiani, Jianyuan Lu
The course focuses on working with big data.
(Also the focus of Polo’s research group)
Internet: 50 Billion Web Pages
www.worldwidewebsize.com www.opte.org
Facebook: 2 Billion Users
Citation Network: 250 Million Articles
www.scirus.com/press/html/feb_2006.html#2 Modified from well-formed.eigenfactor.org
Many More
Twitter: Who-follows-whom (500 million users)
Who-buys-what (120 million users)
Cellphone network: Who-calls-whom (100 million users)
Protein-protein interactions: 200 million possible interactions in human genome
Sources: www.selectscience.net www.phonedog.com www.mediabistro.com www.practicalecommerce.com/
“Big Data” Analyzed

| Graph | Nodes | Edges |
|---|---|---|
| YahooWeb | 1.4 Billion | 6 Billion |
| Symantec Machine-File Graph | 1 Billion | 37 Billion |
| Twitter | 104 Million | 3.7 Billion |
| Phone call network | 30 Million | 260 Million |
We also work with small data. Small data also needs love.
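To get a rough feel for the scale of the graphs listed above, the short sketch below computes edges per node and an estimated raw edge-list size. The node and edge counts come from the slide. The storage model (two 8-byte integer IDs per edge, no compression) and the function names are assumptions made for illustration only.

```python
# Node and edge counts copied from the "Big Data" Analyzed slide.
graphs = {
    "YahooWeb": (1.4e9, 6e9),
    "Symantec Machine-File Graph": (1e9, 37e9),
    "Twitter": (104e6, 3.7e9),
    "Phone call network": (30e6, 260e6),
}


def edges_per_node(nodes: float, edges: float) -> float:
    """Average number of edges per node (a rough density indicator)."""
    return edges / nodes


def edge_list_gigabytes(edges: float, bytes_per_id: int = 8) -> float:
    """Assumed storage model: each edge is stored as two integer IDs."""
    return edges * 2 * bytes_per_id / 1e9


for name, (nodes, edges) in graphs.items():
    print(
        f"{name}: {edges_per_node(nodes, edges):.1f} edges/node, "
        f"~{edge_list_gigabytes(edges):.0f} GB as a raw edge list"
    )
```

Under this assumed model, the largest graphs would not fit comfortably in the memory of a typical laptop, which is one reason data size matters when choosing tools and algorithms.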
7 ± 2
Number of items an average human holds in working memory
George Miller, 1956
Data
Insights
How to do that?
COMPUTATION +
HUMAN INTUITION
Or, to ride the AI wave…
ARTIFICIAL INTELLIGENCE +
HUMAN INTELLIGENCE
How to do that?

| COMPUTATION | INTERACTIVE VIS |
|---|---|
| Automatic | User-driven; iterative |
| Summarization, clustering, classification | Interaction, visualization |
| >Millions of nodes | Thousands of nodes |

Both develop methods for making sense of network data
Our Approach for Big Data Analytics

| DATA MINING | HCI (Human-Computer Interaction) |
|---|---|
| Automatic | User-driven; iterative |
| Summarization, clustering, classification | Interaction, visualization |
| >Millions of items | Thousands of items |

Our research combines the Best of Both Worlds
Our mission & vision:
Scalable, interactive, usable tools for big data analytics
“Computers are incredibly fast, accurate, and stupid.
Human beings are incredibly slow, inaccurate, and brilliant.
Together they are powerful beyond imagination.”
(Einstein might or might not have said this.)
Logistics
Course website (policies, syllabus, schedule, etc.): https://poloclub.github.io/cse6242-2021fall-campus/ (link also available on Canvas)
Discussion, Q&A, find teammates: Piazza (link/tab available on Canvas)
Assignment Submission: Canvas/Gradescope
Make sure you’re in the right Piazza! (CSE-6242-O01, CSE-6242-OAN have their Piazza forums too)
Course Homepage
For syllabus, schedule, projects, datasets, etc.
If you Google “cse6242”, you will see many matches. Make sure you click the correct site!
Important to join Piazza because…
• We will announce events related to this class and data science in general
  • Distinguished lectures, seminars
  • Hackathons
  • Company recruitment events
Since the course is remote, add your photo (Canvas, Piazza)
If you need help cropping your headshot photo into a square, use Magic Crop (https://poloclub.github.io/magic-crop/)
Course Goals
What is Data & Visual Analytics?
No formal definition!
Polo’s definition: the interdisciplinary science of combining computation techniques and interactive visualization to transform and model data to aid discovery, decision making, etc.
What are the “ingredients”?
Need to worry (a lot) about: storage, complex system design, scalability of algorithms, visualization techniques, interaction techniques, statistical tests, etc.
It wasn’t this complex before the big data era. Why?
http://spanning.com/blog/choosing-between-storage-based-and-unlimited-storage-for-cloud-data-backup/
What is big data? Why care?
Many businesses are based on big data.
Search engines: rank webpages, predict what you’re going to type
Advertisement: infer what you like, based on what your friends like; show relevant ads
E-commerce: recommend movies/products (e.g., Netflix, Amazon)
Health IT: patient records (EMR)
Finance
Good news! Many jobs!
Most companies are looking for “data scientists”
“The data scientist role is critical for organizations looking to extract insight from information assets for ‘big data’ initiatives and requires a broad combination of skills that may be fulfilled better as a team.”
Gartner (http://www.gartner.com/it-glossary/data-scientist)
Breadth of knowledge is important. This course helps you learn some important skills.
Course Schedule (Analytics Building Blocks)
Collection
Cleaning
Integration
Visualization
Analysis
Presentation
Dissemination
Building blocks. Not Rigid “Steps”.
Can skip some
Can go back (two-way street)
• Data types inform visualization design
• Data size informs choice of algorithms
• Visualization motivates more data cleaning
• Visualization challenges algorithm assumptions, e.g., user finds that results don’t make sense
Course Goals
• Learn visual and computation techniques and use them in complementary ways
• Gain a breadth of knowledge
• Learn practical know-how by working on real data & problems
Grading
• [50%] 4 homework assignments
  • End-to-end analysis
  • Techniques (computation and vis)
  • “Big data” tools, e.g., Hadoop, Spark, etc.
• [50%] Group project: 4 to 6 people
• [Bonus points] Pop quizzes
  • Multiple over semester; ~10min each during DVA Live (except Q students)
  • 1% course grade each; lowest score dropped
• No Exams
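The grading breakdown can be illustrated with a small arithmetic sketch. The weights (50% homework, 50% project, up to 1% bonus per pop quiz, lowest quiz dropped) come from the slide. Everything else is an assumption made for illustration: that the four homework assignments are weighted equally, that scores are fractions between 0 and 1, and that a partial quiz score earns a proportional share of the 1%. The function name `course_grade` is hypothetical; the official rules are on the course website.

```python
def course_grade(homeworks, project, quizzes):
    """Return an illustrative course grade in percentage points.

    homeworks: homework scores in [0, 1] (equal weighting is an assumption)
    project:   group project score in [0, 1]
    quizzes:   pop quiz scores in [0, 1]; each is worth up to 1 bonus point
    """
    homework_part = 50 * sum(homeworks) / len(homeworks)
    project_part = 50 * project
    # Drop the lowest quiz score before adding the bonus points.
    kept = sorted(quizzes)[1:]
    bonus = sum(kept)  # 1 percentage point per kept quiz, scaled by its score
    return homework_part + project_part + bonus


# Hypothetical student: four homework scores, one project score, three quizzes.
print(course_grade([0.9, 0.8, 1.0, 0.7], 0.85, [1.0, 0.5, 0.0]))
```

Note that under these assumptions a student who takes only one quiz earns no bonus, because the single score is the one that gets dropped.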
Policies. Very Important! (on course website)
Grading, plagiarism, collaboration, late submission, and the “warnings” about the difficulty of this course
From Previous Classes…
• Class projects turned into papers at top conferences (KDD, IUI, etc.)
• Projects as portfolio pieces on CV
• Increased job and internship opportunities
• Former students sent me “thank you” notes
IUI Full conference paper
KDD Workshop paper
IUI Poster paper
“I feel like the concepts from your class are like a rite of passage for an aspiring data scientist. Assignments lead to a feelings of accomplishment and truly progressing in my area of passion.”
“…thank you for the materials taught in DVA. As it was perfectly aligned with the what employers are looking out for. It made less challenging for me to secure this new job [Business Intelligence engineer at Amazon] in this competitive job market.”
“I really get more intuition about how to deal with data with some powerful tools in HW3 [uses AWS]. That feeling is beyond description for me.”
“I would like to say thank you for your class! Thanks to the skills I got from the class and the project, I got the offer.”
What we expect from you
• Actively participate throughout the course!
• If you need help, let us know; the earlier you let us know, the more help we can offer
• Help your fellow classmates out, e.g., help answer questions on Piazza
• Share your ideas! If you have ideas for improving the learning experience, let us know