background - MGNetdouglas/Classes/bigdata/lectures/2019su/backgro… · Schedule •Undergraduates...


Background Material

Craig C. Douglas University of Wyoming

[email protected]


Schedule

• Undergraduates
  o June 30, 2:00-6:00, Background material
  o July 2, 2:00-6:00, Data finding
  o July 4, 2:00-6:00, Data finding and machine learning
  o July 6, 2:00-6:00, Machine learning

• Graduates
  o July 1, 2:30-5:30, Data finding
  o July 3, 8:30-11:30, Data finding and machine learning
  o July 3, 2:30-5:30, Machine learning

Outline

• Introduction to the course
  o Useful references, software, history, and examples

• Mathematical techniques
  o Basics
  o Gradient optimization

• Computer science techniques
  o Hashing, sentences, fracking, and spreadsheets

Introduction


Useful References

• http://www.mgnet.org/~douglas/Classes/bigdata/2019su-index.html

• Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, Mining of Massive Datasets, 2nd ed. (version 2.1), Stanford University, 2014. The most up to date version is online at http://www.mmds.org. I will lecture from the 3rd edition draft as well.

• Andriy Burkov, The Hundred-Page Machine Learning Book, http://themlbook.com/wiki/doku.php, 2019.


• Wooyoung Kim, Parallel Clustering Algorithms: Survey, http://grid.cs.gsu.edu/~wkim/index_files/SurveyParallelClustering.html, 2009.

• Deep Learning exercises using TensorFlow, https://www.coursera.org/learn/intro-to-deep-learning/home/welcome.
  o https://github.com/hse-aml/intro-to-dl

Useful Software

• TensorFlow
  o Version 1.13 is stable. Version 2.0.0-beta is not.
  o Anaconda or Miniconda environments
  o Additional Python packages: jupyter, matplotlib, pandas

• Tableau
• MapReduce, Spark, and workflow systems
• Many problems run 1000X faster on a GPU

Some Sources of Big Data

• Interactions with dynamic databases
• Internet data
• City or regional transportation flow control
• Environment and disaster management
• Oil/gas fields or pipelines, seismic imaging
• Government or industry regulation/statistics
• Closed circuit camera identification

Computational Sciences Morphs into Data Intensive Discovery

• Big Data has become a superset of computational sciences with applications in all walks of life with the overriding question, “What if you had all of the data?”

• Technology limits to some extent how big the Big Data can be, but over time the maximum size has increased by 1,000 fold every few years since the 1950’s.


What Is Big Data?... It Depends

Unit | Approximately | Exponent n (10^n bytes) | Related to
Kilobyte (KB) | 1,000 bytes | 3 | Circa 1952 computer memory
Megabyte (MB) | 1,000 KB | 6 | Circa 1976 supercomputer memory
Gigabyte (GB) | 1,000 MB | 9 | Mid 1980’s disk controller memory attached to a mainframe (with 128 MB memory)
10 Gigabytes (GB) | 10,000 MB | 10 | 2013 typical memory stick (16 GB)
Terabyte (TB) | 1,000 GB | 12 | 2012 largest SSD in a laptop
Petabyte (PB) | 1,000 TB | 15 | 250,000 DVD’s or the entire digital library of all known books written in all known languages
Exabyte (EB) | 1,000 PB | 18 | 175 EB copied to disk in 2010 (est.)
Zettabyte (ZB) | 1,000 EB | 21 | 2 ZB copied to disk in 2011 (est.)
10 Zettabytes (ZB) | 10,000 EB | 22 | Single NSA Big Dataset in 2013 (est.)

Big Data Paradigm

• BD adds
  o Structured query language (SQL) and Not only SQL (NoSQL) capabilities for storing and retrieving data from dynamically growing databases or data streams.
  o Flexible methods for handling large quantities of data in highly parallel computing environments, e.g., MapReduce, a parallel merge-sort algorithm.
  o Fast, parallel read-write capabilities, e.g., NetCDF or HDF5.
  o Extremely large, robust, distributed file systems.
  o Time dependent data retrieval based on a dynamically chosen time window so that automatic weighting of data based on how recently it was acquired can be easily adopted.
  o Data visualization.

Data Warehousing

• Data entering goes through up to 6 steps:
  1. Retrieval: Get the data from sensors or changing databases. This may mean receiving data directly from a sensor or database or indirectly through another computer or storage device.
  2. Extraction: The data may be quite messy in raw form, thus the relevant data may have to be extracted from the transmitted information.
  3. Conversion: The units of the data may not be appropriate for our application.
  4. Quality control: Bad data should be removed or repaired if possible. Missing or incomplete data must be repaired.
  5. Store: If the data will be archived, it must go to the right medium, which might be permanent or semi-permanent storage. If the data will not be archived, then it must exist only briefly and then be discarded if not used immediately.
  6. Notification: Any simulation using the data must be informed as new data enters the data center, which could necessitate (either a cold or warm) restart or start up.

Integrated sensor and processing (ISP)

• Senses
• Computes
• Provides error bounds
• Reprogrammable dynamically
• Eliminates steps 2 and 4 for data warehousing.
• Example: Look for specific molecules underwater. Once found, look for seabed leaks of other molecules. Find oil/gas/CO2 in shallow water.

Oil/Gas Pipelines

Picture courtesy of Merriam-Webster Dictionary

Pipeline Network Properties

• Pipe diameters range from 2 inches to 5 feet.
• Rarely straight and level.
• Contain
  o Possibly different grades of oil or gas simultaneously.
  o Pigs as separators.
  o Sensors (inside and outside)
• Not restricted to oil/gas pipelines (water, etc.).

1970’s Modeling

• Problem modeled mathematically based on time dependent, nonlinear coupled partial differential equations (two models).
  o Sensors on all pipeline components (recall the cartoon).
  o Distributed GRID computing with scattered phone booths:
    • 2 minicomputers, 4 array processors, a heat pump on top, and a U.S. nickel soldered in place to allow “free” calls for telemetry.
    • Sensors provided data (temperature, pressure, and velocity) dynamically based on need and anomalies and controlled by the environment and running model.
    • No central computing, just central and distributed control sites.
    • 2,000 pieces of telemetry/minute in complete KSA network (1978).

Current Modeling

• 3D math models of pipelines with topography.
• Central computing and fiber optic TCP/IP with Gigabit Ethernet backup near pipelines.
• Many more sensors plus ones to measure pipe (shape) changes, internal pollutants and external gas leakages.
• When the 1978 system was replaced in KSA in 1998, the telemetry grew to 100,000 times as much per minute. In 2014, a tsunami of uncountable data.

Monitoring Site Evolution

• In 1970’s, primitive center where “what if” scenarios were run to keep pipelines from breaking in parallel with regular monitoring.

• Now, large scale visualization is used to monitor pipelines in a multiscale framework. Individual high resolution monitors (1080p and 4K+) used for “what if” scenarios.

• Always trying to find anomalies in the data streams to avoid pipeline problems.


Aerospace: Smart Airfoils


Shape Memory Alloys (SMAs)

• 2011: first commercial flight of a comprehensive “smart” SMA airfoil + engines (ANA, Boeing 787).
  o Wings change shape based on an automatic sensor-simulation-predictor-corrector artificial intelligence system that predicts air flow during turbulence. Data is collected at Boeing from in flight airplanes.
    • If you are by a window, you can watch the wings change shape.
    • Two seconds to change shape:
      – Number of shapes is a trade secret.
    • Also on the 747-8 wings and will be on all future Boeing jets.
  o Engine intakes and exhaust are also SMAs, but this is older technology and also automatically controlled.

Mathematical Techniques


Sets

• Standard numeric sets
  o $\mathbb{R}$ real numbers
  o $\mathbb{C}$ complex numbers
  o $\mathbb{Z}$ integers
  o $\mathbb{N} = \mathbb{Z}^{+}$ natural numbers $(1, 2, \cdots)$

• Complicated sets
  o $S = \{s_1, s_2, \cdots\}$. $S$ may be ordered or unordered. It may have a finite or infinite number of $s_i$.

Ordered Subsets with Ranges

• Ranges for sets with $<, \le, =, >, \ge$ defined
  o $[a, b] = \{x \mid a \le x \le b\}$, closed set
  o $(a, b) = \{x \mid a < x < b\}$, open set
  o $(a, b] = \{x \mid a < x \le b\}$, open/closed set
  o $[a, b) = \{x \mid a \le x < b\}$, closed/open set

Scalars

• Individual members of a set, e.g., $\pi \in \mathbb{R}$.
• Arithmetic for $x, y \in S$, where $S$ is a set of numbers.
  o $x \pm y$, $xy$, $x/y$. Note $x/0$ is undefined. $x/y$, $x - y$?

• If $T = \{abc, defgh, ijkk\}$, then $abc$ is a scalar value of $T$.

• If $D = \{doc_1, doc_2, \cdots\}$ is a set of documents, then any $doc_i \in D$ can be considered a scalar.

Vectors

• A vector is an ordered set of scalars and can be written in either row or column format.
  o $v = [1, -2.3, e]$ is a row 3-vector.
  o $w = \begin{bmatrix} 1 \\ -2.3 \\ e \end{bmatrix}$ is a column 3-vector.
  o $w = v^T$, read as $w$ equals $v$ transpose.

Sums and Products

• Let $v = [v_1, v_2, \cdots, v_n]$. Then

  $\sum_{i=1}^{n} v_i = v_1 + \cdots + v_n$

  and

  $\prod_{i=1}^{n} v_i = v_1 \cdot \cdots \cdot v_n$.

Vector Arithmetic

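• Let $v = [v_1, v_2, \cdots, v_n]$ and $w = [w_1, w_2, \cdots, w_n]$. Note that both vectors must be of the same length.
  o $v \pm w = [v_1 \pm w_1, v_2 \pm w_2, \cdots, v_n \pm w_n]$.
  o $\alpha v = [\alpha v_1, \alpha v_2, \cdots, \alpha v_n]$.
  o Inner product $\langle v, w \rangle = v^T w = \sum_{i=1}^{n} v_i w_i$.
  o No $vw$ nor $v/w$ definitions.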
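The vector operations on this slide can be tried out with a few lines of plain Python. This is only a minimal sketch: the helper names `add`, `scale`, and `inner` are made up for illustration, and vectors are represented as ordinary lists.

```python
import math

def add(v, w):
    """Elementwise sum v + w; both vectors must have the same length."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return [vi + wi for vi, wi in zip(v, w)]

def scale(alpha, v):
    """Scalar multiple alpha * v."""
    return [alpha * vi for vi in v]

def inner(v, w):
    """Inner product <v, w> = sum_i v_i * w_i."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return sum(vi * wi for vi, wi in zip(v, w))

v = [1.0, -2.3, math.e]      # the row 3-vector used earlier in the slides
w = [2.0, 0.5, 1.0]          # an arbitrary second vector

v_plus_w = add(v, w)
two_v = scale(2.0, v)
v_dot_w = inner(v, w)
total = sum(v)               # sum of the entries of v
product = math.prod(v)       # product of the entries of v
```

Raising an error on mismatched lengths mirrors the note that both vectors must be of the same length.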


Matrices

• A matrix is a set of row and column vectors.

• $M = \begin{bmatrix} 1 & 2 & 3 \\ 11 & 12 & 13 \end{bmatrix}$, a $2 \times 3$ matrix.

• Let $r_1 = [1\ 2\ 3]$ and $r_2 = [11\ 12\ 13]$. Then $M = \begin{bmatrix} r_1 \\ r_2 \end{bmatrix}$.

• Let $c_1 = \begin{bmatrix} 1 \\ 11 \end{bmatrix}$, $c_2 = \begin{bmatrix} 2 \\ 12 \end{bmatrix}$, $c_3 = \begin{bmatrix} 3 \\ 13 \end{bmatrix}$. Then $M = [c_1\ c_2\ c_3]$.

Matrix Arithmetic

• Let $A = [a_{ij}]$, $B = [b_{ij}]$, $i = 1, \cdots, N$, and $j = 1, \cdots, M$.
  o $A^T = [a_{ji}]$. $A$ is symmetric if and only if $A = A^T$.
  o $A \pm B = [a_{ij} \pm b_{ij}]$
  o $\alpha A = [\alpha a_{ij}]$
  o No division $A/B$

Matrix-Vector Multiplication

• Let $A$ be $N \times L$ and $x$ be $L \times 1$.
• $y = [y_i]$ is $N \times 1$ with

  $y_i = \sum_{k=1}^{L} a_{ik} x_k = a_{i\cdot}^T x$,

  where $a_{i\cdot}$ is row $i$ of $A$.

Matrix-Matrix Multiplication

• Let $A$ be $N \times L$ and $B$ be $L \times M$.
• $C = [c_{ij}]$ is $N \times M$ with

  $c_{ij} = \sum_{k=1}^{L} a_{ik} b_{kj} = a_{i\cdot}^T b_{\cdot j}$

• There are faster methods (Strassen methods).
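The matrix-vector and matrix-matrix products can be written directly from their summation formulas. The sketch below is for illustration only: `matvec` and `matmul` are hypothetical helper names, matrices are lists of rows, and the straightforward triple loop is used rather than any of the faster methods.

```python
def matvec(A, x):
    """y_i = sum_k a_ik * x_k for an N x L matrix A and an L-vector x."""
    L = len(x)
    if any(len(row) != L for row in A):
        raise ValueError("A must have as many columns as x has entries")
    return [sum(row[k] * x[k] for k in range(L)) for row in A]

def matmul(A, B):
    """c_ij = sum_k a_ik * b_kj for A (N x L) and B (L x M)."""
    L = len(B)
    if any(len(row) != L for row in A):
        raise ValueError("A must have as many columns as B has rows")
    M = len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(L)) for j in range(M)]
            for row in A]

M_example = [[1, 2, 3], [11, 12, 13]]              # the 2 x 3 matrix from the Matrices slide
y = matvec(M_example, [1, 0, -1])                  # result is a 2-vector
C = matmul(M_example, [[1, 0], [0, 1], [1, 1]])    # result is a 2 x 2 matrix
```

Note that the result of `matvec` has one entry per row of the matrix, so a 2 x 3 matrix times a 3-vector gives a 2-vector.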


Functions

• Let 𝑓: 𝑆 → 𝑇, where 𝑓 represents a mapping from set 𝑆 (the domain of 𝑓) into the set 𝑇 (the range of 𝑓).

• Suppose 𝑆 and 𝑇 are either ℝ or subsets of ℝ.
  o 𝑓(𝑥) is continuous if it is defined for all 𝑥 ∈ 𝑆 and there are no “jumps” or “holes” in the range (this is not the formal definition of continuous).
  o Continuous functions can have global and local minima and maxima on all or a subset of 𝑆.

Minima and Maxima of Functions

[Burkov, 2019]

Differentiation of Functions

• Let $f'$ or $\frac{df}{dx}$ be the derivative of a continuous function $f$. Second derivative is $f''$ or $\frac{d^2 f}{dx^2}$.
  o $f'(x) > 0$ means $f$ is increasing at $x$.
  o $f'(x) < 0$ means $f$ is decreasing at $x$.
  o $f'(x) = 0$ means $f$ is not changing value at $x$.

• $f(x)$ is a local minimum if $f'(x) = 0$ and $f''(x) > 0$.
• $f(x)$ is a local maximum if $f'(x) = 0$ and $f''(x) < 0$.

Differentiation

• Functions $u, v: S \to T$
  o $(u \pm v)' = u' \pm v'$
  o $(uv)' = u'v + uv'$
  o $\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}$
  o Chain rule: $(u(v))' = u'(v)\, v'$

Partial Differentiation

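• Suppose $f: \mathbb{R}^2 \to \mathbb{R}$, or $f(x, y) = s$ with $x, y, s \in \mathbb{R}$.

• Partial derivative is denoted $\frac{\partial f}{\partial x}$ or $\frac{\partial f}{\partial y}$ and means the partial derivative with respect to the variable in the denominator.

• Example: $f(x, y) = a x^3 + b y^2 - c x y$.
  o $\frac{\partial f}{\partial x} = 3 a x^2 + 0 - c y$ since $y$ is not a function of $x$
  o $\frac{\partial f}{\partial y} = 0 + 2 b y - c x$ since $x$ is not a function of $y$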
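One way to sanity check the two partial derivatives of the example is to compare them with central finite differences. The sketch below assumes arbitrary illustrative values for a, b, and c and an arbitrary evaluation point; none of these numbers come from the slides.

```python
def f(x, y, a=2.0, b=3.0, c=5.0):
    """Example function f(x, y) = a x^3 + b y^2 - c x y."""
    return a * x**3 + b * y**2 - c * x * y

def df_dx(x, y, a=2.0, b=3.0, c=5.0):
    """Analytic partial derivative with respect to x: 3 a x^2 - c y."""
    return 3 * a * x**2 - c * y

def df_dy(x, y, a=2.0, b=3.0, c=5.0):
    """Analytic partial derivative with respect to y: 2 b y - c x."""
    return 2 * b * y - c * x

def central_difference(g, x, y, wrt, h=1e-6):
    """Approximate a partial derivative of g by varying only one variable."""
    if wrt == "x":
        return (g(x + h, y) - g(x - h, y)) / (2 * h)
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

x0, y0 = 1.5, -0.5   # arbitrary test point
err_x = abs(central_difference(f, x0, y0, "x") - df_dx(x0, y0))
err_y = abs(central_difference(f, x0, y0, "y") - df_dy(x0, y0))
```

Both errors should be tiny, which reflects that the other variable is held fixed while differentiating.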


Gradient Optimization

• The unconstrained optimization problem for a function $f: \mathbb{R}^N \to \mathbb{R}$ is to find $x^*$ such that $f(x^*) \le f(x)$ for all $x \in \mathbb{R}^N$.

• The constrained optimization problem adds $m$ constraint equations $c_i(x) \ge 0$, $i = 1, \cdots, m$.

• A model penalty function $g(x)$ can be added to either optimization problem, giving the objective $\varepsilon = f(x) + \lambda g(x)$, $\lambda > 0$.

Gradient Optimization

• The term 𝜆 determines how much the penalty function is reduced versus a more expensive evaluation of the data objective function.

• Example: In seismic imaging: given an inverted velocity model $x$ and a sonic log velocity $x_{\text{well}}$ from a well log, then $g(x) = \| x - x_{\text{well}} \|^2$. Large values of $\lambda$ yield an inverted model that mostly agrees with the well log data even when there are disagreements.

Definitions

• The gradient of $f(x)$ is given by a column $N$-vector $[\nabla f(x)]_i = \frac{\partial f}{\partial x_i}$.

• The $N \times N$ Hessian matrix $H = \nabla \nabla^T f(x)$ is symmetric with $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.

• Euclidean norms: $\| x \|^2 = \sum_{i=1}^{N} x_i^2$ and $\| L \|^2 = \max_{x \ne 0} \frac{\| L x \|^2}{\| x \|^2}$.

Taylor Series

• Assume that $f(x)$ is at least twice differentiable in a neighborhood of points. Then, with $g = \nabla f(x_0)$,
  o $f(x_0 + \Delta x) = f(x_0) + g^T \Delta x + \frac{1}{2} \Delta x^T \nabla \nabla^T f(x_0) \Delta x + o(\| \Delta x \|^2)$.
  o Truncating after the third term gives us the quadratic model $f(x_0 + \Delta x) \approx f(x_0) + g^T \Delta x + \frac{1}{2} \Delta x^T H \Delta x$.

Minima Existence

• To have either a local or a global minimum, we need two conditions to be met:
  o $\nabla f(x^*) = 0$ and
  o $\Delta x^T \nabla \nabla^T f(x^*) \Delta x > 0$ for all $\Delta x \ne 0$.

Seismic Migration Example

• Approximate the Hessian by $H = L^T L$ for a real valued migration matrix $L^T$. The eigencomponents $(\lambda_i, x_i)$ satisfy $L^T L x_i = \lambda_i x_i$. Then $x_i^T L^T L x_i = \lambda_i x_i^T x_i \ge 0$ since $\| L x_i \|^2 \ge 0$. Further, since $x_i^T x_i > 0$ whenever $x_i \ne 0$, we have $\lambda_i \ge 0$. In the space-time domain, the Hessian is at worst positive semidefinite.
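The key step in this argument, that x^T (L^T L) x equals ||L x||^2 and is therefore never negative, can be checked numerically. The sketch below uses a small random matrix as a stand-in for L; it is not a migration operator, and the sizes and seed are arbitrary assumptions.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
L = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # arbitrary 4 x 3 matrix
H = matmul(transpose(L), L)                                        # 3 x 3 matrix H = L^T L

samples = []
for _ in range(100):
    x = [random.uniform(-1, 1) for _ in range(3)]
    xHx = dot(x, matvec(H, x))            # x^T H x
    Lx = matvec(L, x)
    samples.append((xHx, dot(Lx, Lx)))    # second entry is ||L x||^2
```

Every pair in `samples` should agree up to rounding, and neither entry should be negative beyond rounding error.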


What to Compute

• Let $\Delta x = x - x_0$.

• The $k^{th}$ component in the gradient is $\frac{\partial f(x)}{\partial x_k} = \frac{\partial f(x_0 + \Delta x)}{\partial x_k} \approx \frac{\partial f(x_0)}{\partial x_k} + \sum_{j=1}^{N} \frac{\partial^2 f(x_0)}{\partial x_k \partial x_j} \Delta x_j$.

• If $x$ is evaluated at the minimum of $f(x)$, then $\frac{\partial f(x)}{\partial x_k} = 0$.

What to Compute

• Hence, we must solve the system of linear equations $g = -H \Delta x$ with $g_k = \left. \frac{\partial f(x)}{\partial x_k} \right|_{x = x_0}$, where $\Delta x$ is the unknown. So, $\Delta x = -H^{-1} g$.
• A gradient optimization method $x^* = x_0 + \Delta x$ minimizes the objective function $f(x)$.
• Using a direct solver to calculate $\Delta x$ gives us what is known as a Newton method.
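As a small illustration of the Newton step, the sketch below solves H Δx = -g directly for a two-variable quadratic objective. The matrix, right-hand side, and starting point are arbitrary assumptions chosen so that the answer can be checked by hand; `solve_2x2` is a hypothetical helper using Cramer's rule as the direct solver.

```python
H = [[4.0, 1.0], [1.0, 3.0]]   # Hessian of f(x) = 0.5 x^T H x - b^T x (assumed example)
b = [1.0, 2.0]

def gradient(x):
    """g = H x - b for the quadratic objective above."""
    return [H[0][0] * x[0] + H[0][1] * x[1] - b[0],
            H[1][0] * x[0] + H[1][1] * x[1] - b[1]]

def solve_2x2(A, rhs):
    """Direct solve of a 2 x 2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("singular matrix")
    return [(rhs[0] * A[1][1] - A[0][1] * rhs[1]) / det,
            (A[0][0] * rhs[1] - rhs[0] * A[1][0]) / det]

x0 = [0.0, 0.0]
g = gradient(x0)
dx = solve_2x2(H, [-g[0], -g[1]])          # solve H dx = -g
x_star = [x0[0] + dx[0], x0[1] + dx[1]]    # x* = x0 + dx
```

Because the objective here is exactly quadratic, one Newton step lands on the minimizer; for a general function the step would be repeated from the new point.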


What to Compute

• What to use instead of a direct solver:
  o Steepest descent or conjugate gradients
  o Any other appropriate iterative method.

• Define an iterative equation of the form $\Delta x_i^{(k+1)} = \Delta x_i^{(k)} - \sum_j \beta_{ij}^{(k)} g_j^{(k)}$, where $\beta_{ij}$ is a weight. For steepest descent, $\beta_{ij}^{(k)} = \frac{\partial \Delta x_i}{\partial x_j} \alpha^{(k)}$ and $\alpha^{(k)}$ is the step length at the $k^{th}$ iterate.

Step Length Determination

• The exact step length is given by $\alpha^* = \frac{g^T \Delta x}{\Delta x^T H \Delta x}$. In many seismic applications, this almost never works well for a variety of problem-specific reasons.

• Numerical line search methods estimate $\alpha^{(k)}$ by evaluating the objective function at several points along a downhill direction.
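A numerical line search can be sketched with simple backtracking: start from a trial step length and shrink it until the objective decreases enough along the downhill direction. The code below is one common variant (an Armijo-style sufficient decrease test), not necessarily the one used in seismic codes; the quadratic objective, tolerances, and parameter values are assumptions for illustration.

```python
def f(x):
    """Assumed test objective: 0.5 x^T H x - b^T x with H = [[4, 1], [1, 3]], b = [1, 2]."""
    return 0.5 * (4 * x[0] ** 2 + 2 * x[0] * x[1] + 3 * x[1] ** 2) - (x[0] + 2 * x[1])

def grad(x):
    return [4 * x[0] + x[1] - 1, x[0] + 3 * x[1] - 2]

def line_search(x, g, alpha=1.0, shrink=0.5, c=1e-4):
    """Shrink alpha until f decreases sufficiently along the direction -g."""
    fx = f(x)
    gg = g[0] ** 2 + g[1] ** 2
    while f([x[0] - alpha * g[0], x[1] - alpha * g[1]]) > fx - c * alpha * gg:
        alpha *= shrink
    return alpha

def steepest_descent(x, tol=1e-6, max_iter=1000):
    """Repeat x <- x - alpha * g until the gradient norm drops below tol."""
    iterations = 0
    for iterations in range(max_iter):
        g = grad(x)
        if (g[0] ** 2 + g[1] ** 2) ** 0.5 < tol:
            break
        alpha = line_search(x, g)
        x = [x[0] - alpha * g[0], x[1] - alpha * g[1]]
    return x, iterations

x_min, n_iter = steepest_descent([0.0, 0.0])
```

Each call to `line_search` costs several objective evaluations, which is the price paid for not computing the exact step length.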


Where Gradient Optimization IS Used

• Solving neural network problems in machine learning.

• Many machine learning algorithms have an optimization problem hidden inside.

• Systems like TensorFlow have gradient optimization methods built in as functions.


Computer Science Techniques


Hash Tables

• A hash table is a data structure with N buckets.
  o N is usually a prime number and may be quite large.
  o Each bucket contains data.
  o Accessed using a hash function Key = h(x).
    • h(x) must be inexpensive to evaluate.
    • Key is an index 0, 1, …, N-1 into the hash table.
    • Data x can be found only in bucket h(x).

Storing a Hash Table

• If the data is very simple (numbers or short strings), then a spreadsheet may be optimal.

• If the data is arbitrary, then dynamically allocated memory techniques are common.
  o Common to use linked lists inside of each bucket.
  o Can be error prone.
  o Must remember to deallocate all of the hash table when done, which can also be error prone.
  o Must decide if duplicates are allowed in a bucket.

Common Data Structure

[Figure: a column of buckets indexed 0, 1, 2, …, N-2, N-1; each bucket points to the data for that bucket, stored as a linked list that ends with 0.]

Variations:
• doubly linked lists
• nested tables
• spreadsheet

Hash Table Functionalities

• Search
• Add
  o Uses Search
• Delete
  o Uses Search
• Modify (optional)
  o Uses Search
• Change order of data in a bucket (optional)
  o Uses Search and possibly Delete and Add

Functionality

• Search(x)
  o Compute Key = h(x)
  o For each data item stored in bucket Key, compare x to the data.
    • If a match, then return something that allows the data to be accessed.
    • If there is no match, return a Failure notice.

Functionality

• Add(x)
  o F = Search(x)
  o If F ≠ Failure, then
    • If no duplicates are allowed, return something that allows the data to be accessed (and that it is already in the hash table).
  o Otherwise,
    • Probably make a copy of x and add it to bucket h(x).
      – Usually added as the first or last element in bucket h(x).
      – Usually have to modify the linked list for bucket h(x).

Functionality

• Delete(x)
  o F = Search(x)
  o If F ≠ Failure, then
    • Remove the data from bucket h(x). This usually means deleting the copy of x and relinking inside the linked list. There may be other bookkeeping, too.
    • Return Success.
  o Otherwise,
    • Return Failure.
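The Search, Add, and Delete operations can be sketched as a small Python class. This is a simplified illustration under stated assumptions: each bucket is a Python list rather than a hand-built linked list, "something that allows the data to be accessed" is taken to be a (key, position) pair, Failure is represented by `None` or `False`, and the class and method names are made up for this example.

```python
class HashTable:
    def __init__(self, n_buckets, h, allow_duplicates=False):
        self.buckets = [[] for _ in range(n_buckets)]   # one list per bucket
        self.h = h                                      # hash function, Key = h(x)
        self.allow_duplicates = allow_duplicates

    def search(self, x):
        """Return (key, position) if x is stored, else None (the Failure notice)."""
        key = self.h(x)
        for pos, item in enumerate(self.buckets[key]):
            if item == x:
                return key, pos
        return None

    def add(self, x):
        """Add x to bucket h(x) unless it is already there and duplicates are not allowed."""
        found = self.search(x)
        if found is not None and not self.allow_duplicates:
            return found                     # already in the table
        key = self.h(x)
        self.buckets[key].append(x)          # added as the last element of the bucket
        return key, len(self.buckets[key]) - 1

    def delete(self, x):
        """Remove one copy of x; return True on success and False on failure."""
        found = self.search(x)
        if found is None:
            return False
        key, pos = found
        del self.buckets[key][pos]
        return True

table = HashTable(7, lambda x: x % 7)
for value in (102, 30405, 203):
    table.add(value)
```

Using Python lists sidesteps the manual allocation, relinking, and deallocation that make the linked-list version error prone, at the cost of hiding how those steps work.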


Simple Examples

• Dataset D consists of combinations of a, b, c, …, x, y, z of exactly string length 3.

• We encode each letter by 00, 01, 02, ..., 23, 24, 25. So, abz is 000125 = 125.

• Consider two hash functions:
  o h1(x) = x mod 7
  o h2(x) = leading encoded letter in x

• We get two very different hash tables.

Example Dataset D

• D = { abc, def, acd, zaa, bbb, bzq, zxw, faq, cap, eld, ssa, bab }, or encoded

• D = { 102, 30405, 203, 250000, 10101, 12516, 252322, 50016, 20015, 41103, 181800, 10001 }
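The letter encoding and the two hash functions can be reproduced in a few lines, which is a convenient way to check the encoded values by machine. The function names below are illustrative; `h2` relies on the assumption that every word has exactly three letters, so the leading letter code is the encoded value divided by 10000.

```python
def encode(word):
    """Encode letters a..z as 00..25 and read the digit string as one integer."""
    digits = "".join(f"{ord(ch) - ord('a'):02d}" for ch in word)
    return int(digits)

def h1(x):
    """h1(x) = x mod 7."""
    return x % 7

def h2(x):
    """Leading encoded letter of a three-letter word."""
    return x // 10000

words = ["abc", "def", "acd", "zaa", "bbb", "bzq",
         "zxw", "faq", "cap", "eld", "ssa", "bab"]
D = [encode(w) for w in words]

def build_table(h, n_buckets):
    """Place every member of D into bucket h(x)."""
    buckets = [[] for _ in range(n_buckets)]
    for x in D:
        buckets[h(x)].append(x)
    return buckets

table1 = build_table(h1, 7)    # 7 buckets
table2 = build_table(h2, 26)   # 26 buckets, one per leading letter
```

Printing `table1` and `table2` shows how differently the same twelve values are spread over the buckets by the two hash functions.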


h1(x) for D

• The number of buckets is 7 (a prime).
• This is not necessarily a well balanced hash table since too many members of D go into bucket 0.
• We can store the hash table using linked lists.

x | h1(x) | x | h1(x) | x | h1(x)
102 | 4 | 30405 | 4 | 203 | 0
250000 | 2 | 10101 | 0 | 12516 | 0
252322 | 0 | 50016 | 1 | 20015 | 2
41103 | 6 | 181800 | 3 | 10001 | 5

Hash Table for h1(x)

Bucket | Data for each bucket (linked list ending with 0)
0 | 203 → 10101 → 12516 → 252322 → 0
1 | 50016 → 0
2 | 250000 → 20015 → 0
3 | 181800 → 0
4 | 102 → 30405 → 0
5 | 10001 → 0
6 | 41103 → 0

h2(x) for D

• The number of buckets is 26 (not a prime).
• This is a very different distribution of data than for h1(x) and more balanced for our particular D.
• We can store it as a table or spreadsheet.

x | h2(x) | x | h2(x) | x | h2(x)
102 | 0 | 30405 | 3 | 203 | 0
250000 | 25 | 10101 | 1 | 12516 | 1
252322 | 25 | 50016 | 5 | 20015 | 2
41103 | 4 | 181800 | 18 | 10001 | 1

Hash Table for h2(x)

key | values
0 | 102, 203
1 | 10101, 12516, 10001
2 | 20015
3 | 30405
4 | 41103
5 | 50016
6-17 | (empty)
18 | 181800
19-24 | (empty)
25 | 250000, 252322

A Complex Example

• Numeric data is frequently stored in spreadsheets.
  o GPS location (may be relative to a grid element)
  o Production figures for wells or fields
  o Projections
  o Features

• Many examples can be demonstrated using a similarity example involving words.

Sentence File Format

• One line per sentence with no punctuation
• Each word is separated by one blank
• All lower case
• Multiple languages and gibberish
• Watch for an extra blank at end of some lines
• Even a 2.2 GB file is too large to work with quickly using conventional comparison methods (e.g., Matlab or R)
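A file in this format can be read one line at a time so that the whole file never has to sit in memory at once. The sketch below is only illustrative: `read_sentences` is a made-up helper, and an in-memory `io.StringIO` stands in for a real file (the file name in the comment is hypothetical).

```python
import io

def read_sentences(stream):
    """Yield one cleaned sentence per line without loading the whole file."""
    for line in stream:
        words = line.split()        # drops the newline and any extra trailing blank
        if words:                   # skip empty lines
            yield " ".join(words)

# Stand-in for open("sentences.txt", encoding="utf-8"); the file name is hypothetical.
stream = io.StringIO("my cat naps \nhe is a boy\n\nis he a boy\n")
sentences = list(read_sentences(stream))
```

Splitting on whitespace and rejoining with one blank removes the occasional extra blank at the end of a line, so that otherwise identical sentences compare equal.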


Sample Sentence Files

• Small files
  o $10^2$, $10^3$, $10^4$, $10^5$, $10^6$ sentences

• Moderate file
  o $5 \times 10^6$ sentences

• Big file
  o $25 \times 10^6$ sentences

• Either an $O(N^2)$ or $O(N \log N)$ algorithm is too expensive. Need approximately $O(N)$.

Useful definitions

• Sentences x and y are distance n apart if x can be transformed into y using only n neighboring word swaps, additions, or deletions
• A k-shingle is a (sub)string of length k
• The Jaccard distance for sets x and y is $1 - |x \cap y| / |x \cup y|$
• Minhashing is a stochastic method to find common (sub)strings
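The k-shingle and Jaccard distance definitions translate almost directly into Python sets. The sketch below uses character shingles and treats two empty sets as distance 0; both choices are assumptions made for illustration, and the helper names are made up.

```python
def shingles(text, k):
    """Set of all length-k substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard_distance(x, y):
    """1 - |x ∩ y| / |x ∪ y| for two sets; taken to be 0 when both are empty."""
    union = x | y
    if not union:
        return 0.0
    return 1.0 - len(x & y) / len(union)

a = shingles("my cat naps", 3)
b = shingles("my cat likes naps", 3)
d = jaccard_distance(a, b)    # between 0 (identical sets) and 1 (disjoint sets)
```

Comparing shingle sets this way still costs one comparison per pair of sentences; minhashing is what makes it practical for large files.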


Distance 1

• Examples:
  o my cat naps versus my cat likes naps
  o he is a boy versus is a boy he versus is he a boy
  o Suppose you find a distance 1 sentence and eliminate it. What happens to comparisons to the eliminated sentence later in the search?
    • he is a boy eliminates is he a boy followed by is dan a boy
    • Need clarity about how to handle eliminations

Goals

• In a big file of sentences:
  o Eliminate similar sentences.
  o Find similar sentences of some distance or less.
• Either goal is expensive if the file has enough sentences.
• Both goals are of about the same hardness.
• The methods in Chapter 3 of Mining of Massive Datasets (Ullman et al.) are useful.

Goal 1

• Eliminate all duplicate lines (distance 0)
• Eliminate all sentences of distance n, n > 0
  o Two sentences $S_1$ and $S_2$ are distance n if $S_1$ can be transformed into $S_2$ by adding, removing, or substituting at most n words.
  o What happens if you eliminate sentence $S_i$ because of sentence $S_{i-j}$, but you later find a sentence $S_k$ that has distance 0 or 1 from $S_i$?
    • Need to define how you handle this case.
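A small sketch of the definition and of one possible elimination policy. It assumes the add/remove/substitute definition on this slide (a word-level Levenshtein distance, without neighboring swaps) and a hypothetical "keep the first sentence seen" rule. The pairwise loop is quadratic, so it is only meant for tiny examples, not for the large files.

```python
from typing import List


def word_distance(a: List[str], b: List[str]) -> int:
    """Word-level Levenshtein distance: each added, removed, or substituted word costs 1."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # remove wa
                           cur[j - 1] + 1,               # add wb
                           prev[j - 1] + (wa != wb)))    # substitute (free if equal)
        prev = cur
    return prev[-1]


def eliminate(sentences: List[str], n: int) -> List[str]:
    """Keep a sentence only if it is farther than n from every sentence kept so far."""
    kept: List[List[str]] = []
    for sentence in sentences:
        words = sentence.split()
        # A length difference above n already implies a distance above n,
        # so the cheap length test runs before the full comparison.
        if all(abs(len(words) - len(k)) > n or word_distance(words, k) > n
               for k in kept):
            kept.append(words)
    return [" ".join(k) for k in kept]


survivors = eliminate(["is he a boy", "is dan a boy", "is dan a man"], 1)
```

Under this policy "is dan a man" survives even though it is distance 1 from the eliminated "is dan a boy", because eliminated sentences are never compared against later ones. A different policy would give a different answer, which is why the case needs to be defined.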

Goal 2

• List all sentences that are
  o duplicates
  o distance n sentences
• List first one followed by all distance 0 or n sentences related to it
  o Can do as separate lists or just one
  o Should be sorted
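For the distance 0 part of this goal, a sketch of the listing might look as follows. It assumes 0-based line indices as the output and uses a Python dictionary in place of a hand-built hash table; the name `duplicate_groups` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_groups(sentences: List[str]) -> List[List[int]]:
    """Return sorted groups of line indices that hold identical sentences.

    Each group starts with the first occurrence, followed by its duplicates.
    """
    positions: Dict[str, List[int]] = defaultdict(list)
    for index, sentence in enumerate(sentences):
        # rstrip so an extra trailing blank does not hide a duplicate
        positions[sentence.rstrip()].append(index)
    groups = [indices for indices in positions.values() if len(indices) > 1]
    return sorted(groups)  # ordered by the index of each group's first sentence


groups = duplicate_groups(["a b", "c", "a b ", "c", "d", "a b"])
```

Each pass over the file does one dictionary lookup per sentence, so the expected cost is linear in the number of sentences.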

The Datasets

• Description of problem and links to data:
  o http://www.mgnet.org/~douglas/Classes/common-problems
• Major headaches:
  o Must read the data files quickly into memory or some sort of user space that can be used with your solution.
  o The best way to solve this problem is with a combination of methods you do not know yet.

Preprocessing

• Read all of the file and build a dictionary with each word given a natural number as an index:
  o Given sentence one here as the first one
    • 1 2 3 4 5 6 7 3
  o Next sentence after sentence one
    • 8 2 9 2 3
  o And so on here and after
    • 10 11 12 4 13 9
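The encoding step can be sketched as below. The function name `encode` is hypothetical. The sketch does not fold case, because the third example on the slide gives "And" and "and" different indices; in the actual sentence files the words are already lower case, so this makes no difference there.

```python
from typing import Dict, List, Tuple


def encode(sentences: List[str]) -> Tuple[List[List[int]], Dict[str, int]]:
    """Replace each word by a natural number, assigned in order of first appearance."""
    index: Dict[str, int] = {}
    encoded: List[List[int]] = []
    for sentence in sentences:
        row = []
        for word in sentence.split():
            if word not in index:
                index[word] = len(index) + 1  # indices start at 1
            row.append(index[word])
        encoded.append(row)
    return encoded, index


rows, dictionary = encode([
    "Given sentence one here as the first one",
    "Next sentence after sentence one",
    "And so on here and after",
])
```

Comparing lists of small integers is cheaper than comparing strings, and the dictionary only has to be built once per file.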

Implementation Suggestions

• Build and debug your code with small files
  o Start with < 10 sentences
  o Next try 100, 1000, and 10,000 sentences
  o Then work up to 25,000,000 sentences
  o Plot wall clock time versus data set size
• Consider using hash tables of considerable size
  o Hash table size should be a prime number
  o Research hash functions for good candidates for the sentence data
• Consider using a MapReduce or Workflow system
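As a starting point for the hash table suggestion, here is a sketch that sizes the table to a prime and finds exact duplicates. Everything named here is an assumption for illustration: `zlib.crc32` is only a placeholder hash (researching a better one is part of the exercise), and the load factor of 2 buckets per sentence is arbitrary.

```python
import zlib
from typing import List


def is_prime(n: int) -> bool:
    """Trial division; fast enough for choosing a table size once."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def bucket_index(sentence: str, table_size: int) -> int:
    """Placeholder hash: CRC-32 of the UTF-8 bytes, reduced modulo the table size."""
    return zlib.crc32(sentence.encode("utf-8")) % table_size


def find_exact_duplicates(sentences: List[str], buckets_per_sentence: int = 2) -> List[str]:
    """Return every sentence that repeats an earlier one (distance 0)."""
    size = next_prime(buckets_per_sentence * len(sentences) + 1)
    table: List[List[str]] = [[] for _ in range(size)]
    duplicates = []
    for sentence in sentences:
        bucket = table[bucket_index(sentence, size)]
        if sentence in bucket:      # only this one short bucket is searched
            duplicates.append(sentence)
        else:
            bucket.append(sentence)
    return duplicates


found = find_exact_duplicates(["a b", "c d", "a b"])
```

Wrapping the call in `time.perf_counter()` measurements for each file size gives the data for the wall clock plot.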

My Implementation

• I did some research and found a hash function specifically designed for big text files. There was a lot of research in this area in the late 1990s in particular.
  o It disperses text widely in the buckets, leaving very few sentences in each bucket.
  o I also used a lot of buckets, with the number depending on how many sentences were in a file.

My Implementation

• Searching through any one bucket is quick since there are at most a constant number of elements in each bucket for distance 0:

Number of buckets holding each element count, by file size (sentences):

Elements/Bucket    100     1K    10K    100K       1M       5M      25M
1                   35    312   2918   28138   264103  1127069  3670674
2                    7     56    656    5724    49719   170149   312270
3                    1     10     96     714     6154    17314    17911
4                    -      1      9      80      592     1349      743
5                    -      -      -      10       48       86       26
6                    -      -      -       -        1        1        7
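A tally like the one above can be produced with a short sketch such as this. It is not the code behind the slide's numbers: `zlib.crc32` is a stand-in hash, the function name is hypothetical, and the sketch assumes the counts are over distinct sentences, which the slide does not state.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable


def bucket_occupancy(sentences: Iterable[str], table_size: int) -> Dict[int, int]:
    """Map 'elements in a bucket' to 'number of buckets holding that many elements'."""
    per_bucket = Counter(
        zlib.crc32(sentence.encode("utf-8")) % table_size
        for sentence in set(sentences)  # assumption: count distinct sentences only
    )
    return dict(sorted(Counter(per_bucket.values()).items()))


# With a single bucket, all three distinct sentences land together.
histogram = bucket_occupancy(["a", "b", "c", "a"], 1)
```

If the rows for large element counts stay near zero as the file grows, the hash is spreading the sentences well.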

Sample Run Times Using Hashing

Sample Read Times Using Hashing

Generalizing

• Substitute n for 1 in distance
  o Not much extra work to do so
  o Instead of looking at sentences of word length difference 1, look at ones of difference n
  o Makes a much more useful program
• Take arbitrary sentences
  o Convert to one per line, each word separated by one blank
  o Take lower and upper case into account and convert to all lower case as preprocessing
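The preprocessing for arbitrary text might be sketched as follows. The splitting rule is an assumption (sentences end at `.`, `!`, or `?`), as is the function name; real text with abbreviations or decimal numbers needs a more careful splitter.

```python
import re
from typing import List


def to_sentence_lines(text: str) -> List[str]:
    """Convert free text to the sentence file format: one lower-case sentence
    per line, no punctuation, words separated by one blank."""
    lines = []
    for raw in re.split(r"[.!?]+", text):
        # [^\W_]+ matches runs of letters and digits, so punctuation is dropped.
        # Note that an apostrophe splits a word: "don't" becomes "don" and "t".
        words = re.findall(r"[^\W_]+", raw.lower())
        if words:
            lines.append(" ".join(words))
    return lines


converted = to_sentence_lines("He is a boy. Is Dan a boy?  My cat naps!")
```

Joining with a single blank also guarantees that no line ends with an extra blank.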

Generalizing to Spreadsheets

• For energy data in a spreadsheet, many of the columns involve real numbers.

• Build a dictionary using number ranges to define words.
  o Comparisons between cells use fuzzy arithmetic.
  o Determine how close two values must be to count as similar.

• Apply the sentence techniques to your transformed data.
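One way to turn numeric cells into words is fixed-width binning, sketched below. The column names `oil` and `gas`, the bin widths, and the function name are hypothetical; choosing the widths is exactly the "how close is similar" decision from the list above.

```python
import math
from typing import Dict


def row_to_sentence(row: Dict[str, float], bin_width: Dict[str, float]) -> str:
    """Turn one spreadsheet row into a 'sentence' whose words are column:bin labels."""
    words = []
    for column, value in row.items():
        bin_number = math.floor(value / bin_width[column])  # which range the value falls in
        words.append(f"{column}:{bin_number}")
    return " ".join(words)


widths = {"oil": 100.0, "gas": 10.0}
first = row_to_sentence({"oil": 1234.0, "gas": 88.5}, widths)
second = row_to_sentence({"oil": 1290.0, "gas": 81.0}, widths)
```

Both rows map to the same sentence, so they would be treated as duplicates. One caveat of fixed bins is that two nearly equal values on opposite sides of a bin edge get different words, so a fuzzier comparison may still be needed near the edges.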

Fracking Data Example

• Open database maintained by the Pennsylvania State government based on the fractured oil and gas wells in the Marcellus Basin.

• There are about 8,000 wells that have been drilled and information is maintained about each in this database.

• Each state in the United States has at least one public database about fracking wells.

• 15.3 million Americans live within 1 mile (1.6 km) of a well drilled since 2000.

• Spreadsheets in comma-separated values (.csv) format or PDF are common.

Fracking Data File Information

• Each file contains information for a period of time during 2000-2014
  o Locations of wells
  o Owner of property
  o Approximate latitude and longitude of each well
  o Drilling company
  o Production information
    § Potential production
    § Actual production (units: barrels for oil, 1000 cubic feet for gas)
    § Active/Inactive
  o Much more information with some cells blank
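Since some cells are blank, a first look at a file could simply count the blanks in each column. The sketch below uses Python's standard `csv` module; the column names in the sample (`well_id`, `operator`, `oil_bbl`) are made up for illustration and are not the real headers of the database.

```python
import csv
import io
from collections import Counter


def blank_cells_per_column(csv_file) -> Counter:
    """Count empty or missing cells for every column named in the header row."""
    blanks = Counter()
    for record in csv.DictReader(csv_file):
        for column, cell in record.items():
            if column is None:
                continue  # extra cells beyond the header are ignored here
            if cell is None or cell.strip() == "":
                blanks[column] += 1
    return blanks


# In-memory sample; for a real file use open(path, newline="") instead.
sample = io.StringIO("well_id,operator,oil_bbl\n1,Acme,120\n2,,\n3,Beta,\n")
blank_counts = blank_cells_per_column(sample)
```

Columns that are mostly blank are poor candidates for the questions you want to ask, so this check helps decide which fields to keep.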

Interesting Questions

• What are the production curves?
  o Are they uniform in regions or do they vary a lot?
• How long is there a good payout? (0, 12, 39-40, …, 120 months?)
• Are there some drillers whose wells are more likely to not be in production after some period of time?
• Where are clusters of wells?
• How do you visualize the data?
• How do you put the data into the right format in order to ask the right questions and get answers quickly?

Data Files

• Approximately 574 MB of files.
• First things to do:
  o Determine how to use the data (Excel, MongoDB, Hadoop, Matlab, R, etc.).
  o Use the data to answer some simple, but interesting questions.
  o Visualize the results (Excel, Matlab, R, Tableau, etc.).
• Thereafter,
  o Determine how to answer general, complex questions.
  o Use a general database approach that uses all of your computer’s cores and GPUs.
  o Use the sentence approach to filter data.
