Introduction to Research
![Page 1: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/1.jpg)
Introduction to Research
Data Management and Database
http://www.cs.fsu.edu/~lifeifei
![Page 2: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/2.jpg)
Outline
• Background
• My Research Focus and Experience
• Some Problems I have worked on
• Current Interest and Activity
• My Experience as a PhD student
• Q&A
![Page 4: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/4.jpg)
A Short History Class
• Undergraduate study: Tsinghua University, China (1997) and Nanyang Technological University, Singapore (1998-2002)
– B. Applied Science
• PhD study: Boston University (2002-2007)
– Computer Science; research area: databases
• Now…
![Page 5: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/5.jpg)
Outline
• Background
• My Research Focus and Experience
• Some Problems I have worked on
• Current Interest and Activity
• My Experience as a PhD student
• Q&A
![Page 6: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/6.jpg)
Research Focus
• Data management in general, roughly in the order I worked on each topic:
– Efficient indexing, querying, and management of large-scale or high-dimensional databases
– Spatio-temporal databases and applications
– Sensor and stream databases
– Privacy preservation issues in data management
– Query security for various types of data models
– Uncertain databases and data cleaning
![Page 7: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/7.jpg)
Experience
• SDE intern at the Microsoft SQL Server group, summer 2005 (Redmond, WA)
• Research intern at IBM T. J. Watson Research Center (Hawthorne, NY), database research group, summer 2006
• Visiting research student at AT&T Labs Research (Florham Park, NJ), database research group, winter 2006 and spring 2007
• Research intern at Microsoft Research (MSR), database research group, summer 2007
![Page 8: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/8.jpg)
Outline
• Background
• My Research Focus and Experience
• Some Problems I have worked on
• Current Interest and Activity
• My Experience as a PhD student
• Q&A
![Page 9: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/9.jpg)
Outline
• Background
• My Research Focus and Experience
• Some Problems I have worked on
– Retrieving structured data from Web
– Spatio-temporal databases
– Sensor databases
• Current Interest and Activity
• My Experience as a PhD student
• Q&A
![Page 10: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/10.jpg)
The First Step
• My FYP (Final Year Project), around 2000-2001
– Analyze and build structures of different websites
• How can this process be automated?
• View a website as a tree structure and?
• Given a group of similar websites, summarize a suitable schema…
– Retrieve information from the part(s) of a website specified by the user
• Uses the structure information obtained in the first step
• Why bother? Information integration: if the BBC favors Bush and CNN ‘hates’ him, what is each one's response to event A?
• Another issue: converting semi-structured data (HTML) to structured data (XML)
![Page 11: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/11.jpg)
So What Happened
![Page 12: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/12.jpg)
Possible Research Problems
• Automatic Schema Identification
– Given a collection of data sources, find a common schema that maximally describes the dataset.
• Information retrieval & search from the Web
– IR techniques (IR is a separate field in itself, dealing with unstructured data) + database techniques (structured data): how to combine the two?
– Google
• Information Integration
– Given data source A and data source B, both referring to the same schema but with (slightly) different instances, how to link/combine the two?
![Page 13: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/13.jpg)
Then Boston
• Quite a pleasant transition in the summer: Singapore (90+ °F year-round) to Boston (80 °F in the summer)
• Winter:
– Anyway…
![Page 14: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/14.jpg)
Outline
• Background
• My Research Focus and Experience
• Some Problems I have worked on
– Retrieving structured data from Web
– Spatio-temporal databases
– Sensor databases
• Current Interest and Activity
• My Experience as a PhD student
• Q&A
![Page 15: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/15.jpg)
What to Do Now?
• Spatio-temporal databases and applications
– Why? My advisor was in this area and…
• Examples:
– Indexing higher-dimensional data:
• 1-d: B+ tree; 2-d, 3-d, 4-d, …? kd-tree, R-tree
• Space partitioning vs. data partitioning
– Queries
• e.g., continuous nearest neighbor query: continuously find the closest gas station while I am driving from Boston to NY.
– Moving objects
• In Euclidean space
• On a road network
![Page 16: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/16.jpg)
Indexing High Dimensional Data: R-tree
• e.g., with fanout 4: group nearby rectangles into parent MBRs (minimum bounding rectangles); each group -> disk page
[Figure: data rectangles A through J in the plane]
![Page 17: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/17.jpg)
Example
• F=4
[Figure: rectangles A through J grouped into four parent MBRs P1 to P4; the leaf pages are {A, B, C}, {D, E}, {F, G}, and {H, I, J}]
![Page 18: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/18.jpg)
Example
• F=4
[Figure: the same rectangles with a root node holding P1, P2, P3, P4 above the leaf pages {A, B, C}, {D, E}, {F, G}, and {H, I, J}]
![Page 19: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/19.jpg)
R-trees:Search
[Figure: the R-tree from the previous example (root P1 P2 P3 P4; leaf pages {A, B, C}, {D, E}, {F, G}, {H, I, J}) used to illustrate search]
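The search idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates of A through J, the assignment of leaf pages to P1 through P4, and the names `Node`, `intersects`, and `search` are all hypothetical and not taken from the slides. The point it shows is that a range query descends only into MBRs that intersect the query rectangle.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def mbr(rects: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a group of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


@dataclass
class Node:
    box: Rect
    children: List["Node"] = field(default_factory=list)
    label: str = ""  # only set for data rectangles (leaf entries)


def leaf(label: str, box: Rect) -> Node:
    return Node(box=box, label=label)


def parent(children: List[Node]) -> Node:
    return Node(box=mbr([c.box for c in children]), children=children)


def search(node: Node, query: Rect) -> List[str]:
    """Return labels of data rectangles intersecting the query."""
    if not intersects(node.box, query):
        return []              # prune this whole subtree (one disk page saved)
    if not node.children:
        return [node.label]    # a data rectangle that intersects the query
    hits: List[str] = []
    for child in node.children:
        hits.extend(search(child, query))
    return hits


# Hypothetical layout: coordinates and page assignment are made up.
p1 = parent([leaf("A", (0, 0, 1, 1)), leaf("B", (1, 2, 2, 3)), leaf("C", (2, 0, 3, 1))])
p2 = parent([leaf("D", (5, 0, 6, 1)), leaf("E", (6, 2, 7, 3))])
p3 = parent([leaf("F", (0, 5, 1, 6)), leaf("G", (2, 6, 3, 7))])
p4 = parent([leaf("H", (5, 5, 6, 6)), leaf("I", (6, 6, 7, 7)), leaf("J", (7, 5, 8, 6))])
root = parent([p1, p2, p3, p4])

result = search(root, (5.5, 0.5, 6.5, 2.5))
```

With this made-up layout the query rectangle overlaps only the second group, so the other three subtrees are never opened.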
![Page 21: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/21.jpg)
Query in Spatio-Temporal Databases
• Trip Planning Queries (TPQ):
– Given a starting location, a destination, and arbitrary points of interest, find the best possible trip.
• Example:
– Minimize the total traveling time from Boston to Providence while visiting a post office, a hardware store, and a gas station.
![Page 22: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/22.jpg)
Visual Example
[Figure: a trip from Home to Work that passes a gas station]
• We can minimize the total distance, time, etc.
• We can have different categories of points of interest (gas stations, hotels, etc.).
![Page 23: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/23.jpg)
The Nearest Neighbor Algorithm
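[Figure: start S, destination D, and candidate points A1, A2 (category A), B1, B2, B3 (category B), C1, C2 (category C)]
Yields a $(2^{m+1} - 1)$-approximation, where m is the total number of categories.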
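A minimal sketch of the greedy nearest-neighbor idea, assuming Euclidean distance and points given as (x, y) tuples. The function name `nearest_neighbor_trip` and the sample coordinates are hypothetical; the slide itself only names the algorithm and its approximation bound.

```python
import math


def dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def nearest_neighbor_trip(start, dest, categories):
    """Greedy trip: from the current point, always move to the nearest
    point that belongs to a category not visited yet, then go to dest.

    categories maps a category name to a list of candidate points.
    Returns (route, total_length).
    """
    route = [start]
    remaining = dict(categories)
    current, total = start, 0.0
    while remaining:
        name, point = min(
            ((n, p) for n, pts in remaining.items() for p in pts),
            key=lambda pair: dist(current, pair[1]),
        )
        total += dist(current, point)
        route.append(point)
        current = point
        del remaining[name]  # this category is now covered
    total += dist(current, dest)
    route.append(dest)
    return route, total


# Made-up example with three categories.
route, length = nearest_neighbor_trip(
    (0, 0), (10, 0),
    {"A": [(1, 0), (5, 5)], "B": [(2, 0), (9, 9)], "C": [(3, 0)]},
)
```

The greedy choice never looks at the destination, which is one way to see why its worst-case guarantee is weak.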
![Page 24: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/24.jpg)
The Minimum Distance Algorithm
[Figure: start S, destination D, and candidate points A1, A2 (category A), B1, B2, B3 (category B), C1, C2 (category C)]
Yields an m-approximation where m is the total number of categories.
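One plausible reading of the minimum distance idea, sketched under assumptions: for each category pick the point with the smallest detour dist(S, p) + dist(p, D), then visit the chosen points in nearest-neighbor order starting from S. The name `minimum_distance_trip`, the visiting order, and the coordinates are assumptions for illustration, not details given on the slide.

```python
import math


def dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def minimum_distance_trip(start, dest, categories):
    """Pick one point per category by smallest detour, then chain them.

    Returns (route, total_length).
    """
    # Step 1: per category, the point minimizing dist(S, p) + dist(p, D).
    chosen = [
        min(points, key=lambda p: dist(start, p) + dist(p, dest))
        for points in categories.values()
    ]
    # Step 2: visit the chosen points in nearest-neighbor order from S.
    route, current, total = [start], start, 0.0
    while chosen:
        nxt = min(chosen, key=lambda p: dist(current, p))
        total += dist(current, nxt)
        route.append(nxt)
        chosen.remove(nxt)
        current = nxt
    total += dist(current, dest)
    route.append(dest)
    return route, total


# Made-up example: (0, 1) is closest to S but a poor choice overall.
route, length = minimum_distance_trip(
    (0, 0), (10, 0),
    {"A": [(0, 1), (5, 0)], "B": [(8, 0), (1, 3)]},
)
```

Unlike the greedy nearest-neighbor trip, the selection step takes the destination into account, so points lying off the S to D corridor are avoided.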
![Page 25: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/25.jpg)
Search over R-Tree and Road network
• R-tree:
– Euclidean space: how can the R-tree be used to speed up the search?
• Road network:
[Figure: road network with points labeled S, D, A, M, and p]
![Page 26: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/26.jpg)
Outline
• Background
• My Research Focus and Experience
• Some Problems I have worked on
– Retrieving structured data from Web
– Spatio-temporal databases
– Sensor databases
• Current Interest and Activity
• My Experience as a PhD student
• Q&A
![Page 27: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/27.jpg)
Sensor Network Model
• Large set of sensors distributed in a sensor field.
• Communication via a wireless ad-hoc network.
• Nodes and links are failure-prone.
• Sensors are resource-constrained
– Limited memory, battery-powered, messaging is costly.
![Page 28: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/28.jpg)
Sensor Databases
• Useful abstraction:
– Treat the sensor field as a distributed database
• But: data is gathered, not stored or saved.
– Express queries in an SQL-like language
• COUNT, SUM, AVG, MIN, GROUP-BY
– The query processor distributes the query and aggregates responses
– Exemplified by systems like TAG (Berkeley/MIT) and Cougar (Cornell)
![Page 29: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/29.jpg)
A Motivating Example
• Each sensor has a single sensed value.
• Sink initiates one-shot queries such as: What is the…
– maximum value?
– mean value?
• Continuous queries are a natural extension.
[Figure: sensor network with nodes labeled A through J, each holding one sensed value]
![Page 30: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/30.jpg)
AVG Aggregation (no losses)
• Build spanning tree
• Aggregate in-network
– Each node sends one summary packet
– Summary has SUM and COUNT of sub-tree
• Reliability problem when there are losses (common in sensor networks)
[Figure: spanning tree over the sensor nodes; each edge carries a (SUM, COUNT) summary such as 2,1 or 6,1 or 26,4]
AVG=70/10=7
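The in-network scheme can be sketched as a recursive (SUM, COUNT) summary over a spanning tree. The tree shape, node names, and sensed values below are hypothetical (the slide's figure cannot be recovered exactly from the transcript), and `summarize` is an illustrative name.

```python
def summarize(node, children, values):
    """Return the (SUM, COUNT) summary a node sends to its parent.

    children maps a node to the list of its children in the spanning tree;
    values maps a node to its single sensed value.
    """
    total, count = values[node], 1
    for child in children.get(node, []):
        child_sum, child_count = summarize(child, children, values)
        total += child_sum      # merge the child's summary packet
        count += child_count
    return total, count


# Hypothetical six-node network rooted at the sink.
children = {"sink": ["A", "B"], "A": ["C", "D"], "B": ["E"]}
values = {"sink": 12, "A": 2, "B": 6, "C": 3, "D": 7, "E": 6}

total, count = summarize("sink", children, values)
average = total / count
```

Each node sends exactly one packet regardless of subtree size, which is the saving; the cost is that losing one packet silently drops a whole subtree from the average.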
![Page 31: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/31.jpg)
AVG Aggregation (naive)
• What if redundant copies of data are sent?
• AVG is duplicate-sensitive
– Duplicating data changes the aggregate
– Increases the weight of duplicated data
[Figure: the same network, with the summary 12,1 sent along two paths, so it is counted twice]
AVG=82/11≠7
![Page 32: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/32.jpg)
AVG Aggregation (TAG++)
• Can compensate for increased weight
– Send halved SUM and COUNT instead
• Does not change expectation!
• Only reduces variance
[Figure: the same network, with the summary 12,1 split into two halves 6,0.5, one per path]
AVG=70/10=7
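The arithmetic on the naive and halved-summary slides can be checked with a short sketch. The numbers come from the slides (a sensor with value 12 sent along two paths, totals 82/11 versus 70/10); the helper names `split_summary` and `merge` and the parent names are hypothetical.

```python
def split_summary(total, count, parents):
    """Divide one (SUM, COUNT) summary evenly among several parents."""
    share = 1.0 / len(parents)
    return {p: (total * share, count * share) for p in parents}


def merge(summaries):
    """Combine (SUM, COUNT) summaries by adding component-wise."""
    return (sum(s for s, _ in summaries), sum(c for _, c in summaries))


# Everything except the value-12 sensor adds up to SUM=58, COUNT=9
# (derived from the slide's total of 70 over 10 sensors).
rest = (58, 9)

# Naive: the full summary (12, 1) travels along both paths.
naive = merge([rest, (12, 1), (12, 1)])

# Halved: each path carries (6, 0.5), so both copies together weigh 1.
halves = split_summary(12, 1, ["parent_1", "parent_2"])
fixed = merge([rest] + list(halves.values()))
```

If one of the two halves is lost the result is no longer exact, which matches the slide's framing of this as a variance reduction rather than a guarantee.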
![Page 33: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/33.jpg)
AVG Aggregation (LIST)
• Can handle duplicates exactly with a list of <id, value> pairs
• Transmitting this list is expensive!
• Lower bound: linear space is necessary if we demand exact results.
[Figure: the same network, with each edge carrying a list of <id, value> pairs such as A,2 or B,6;D,3 or C,9;E,1]
AVG=70/10=7
![Page 34: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/34.jpg)
COUNT Sketches
• Problem: Estimate the number of distinct item IDs in a data set with only one pass.
• Constraints:
– Small space relative to stream size.
– Small per-item processing overhead.
– Union operator on sketch results.
• Exact COUNT is impossible without linear space.
• First approximate COUNT sketch in [FM’85].
– O(log N) space, O(1) processing time per item.
![Page 35: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/35.jpg)
Counting Paintballs
• Imagine the following scenario:
– A bag of n paintballs is emptied at the top of a long staircase.
– At each step, each paintball either bursts and marks the step, or bounces to the next step, with a 50/50 chance either way.
• Looking only at the pattern of marked steps, what was n?
![Page 36: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/36.jpg)
Counting Paintballs (cont)
• What does the distribution of paintball bursts look like?
– The number of bursts at each step follows a binomial distribution.
– The expected number of bursts drops geometrically.
– Few bursts after $\log_2 n$ steps
[Figure: staircase; bursts at the 1st step ~ $B(n, 1/2)$, at the 2nd step ~ $B(n, 1/4)$, …, at the S-th step ~ $B(n, 1/2^S)$]
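The paintball process is easy to simulate, which makes the geometric drop-off visible. This is a rough sketch; the function name `drop_paintballs`, the seed, and n = 1000 are arbitrary choices for illustration.

```python
import random


def drop_paintballs(n, rng):
    """Simulate n paintballs; return {step: number of bursts on that step}.

    Steps are numbered from 1. On every step a ball bursts with
    probability 1/2, otherwise it bounces on to the next step.
    """
    bursts = {}
    for _ in range(n):
        step = 1
        while rng.random() < 0.5:  # bounce to the next step
            step += 1
        bursts[step] = bursts.get(step, 0) + 1
    return bursts


n = 1000
bursts = drop_paintballs(n, random.Random(0))

# Compare observed bursts with the expectation n / 2**step.
for step in sorted(bursts):
    print(step, bursts[step], round(n / 2 ** step, 1))
```

The printed columns should track each other roughly, with marks becoming sparse once the step number passes about log2(n).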
![Page 37: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/37.jpg)
Counting Paintballs (cont)
• Many different estimator ideas [FM'85,AMS'96,GGR'03,DF'03,...]
• Example: Let pos denote the position of the highest unmarked stair,
$E(pos) \approx \log_2(0.775351\, n)$
$\sigma^2(pos) \approx 1.12127$
• Standard variance reduction methods apply
• Either O(log n) or O(log log n) space
![Page 38: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/38.jpg)
Application to Sensornets
• Each sensor computes k independent sketches of itself using its unique sensor ID.
– Coming next: sensor computes sketches of its value.
• Use a robust routing algorithm to route sketches up to the sink.
• Aggregate the k sketches via in-network bitwise OR.
– Union via OR is duplicate-insensitive (XOR is not: a sketch that arrives twice would cancel itself).
• The sink then estimates the count.
• Similar to gossip and epidemic protocols.
How about SUM and other aggregates?
![Page 39: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/39.jpg)
COUNT vs Link Loss (grid)
![Page 40: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/40.jpg)
Outline
• Background
• My Research Focus and Experience
• Some Problems I have worked on
• Current Interest and Activity
– Privacy Preservation
– Query Security
• My Experience as a PhD student
• Q&A
![Page 41: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/41.jpg)
Privacy Preservation
It is not legal to query an individual person’s salary. However, we are interested in retrieving the average (which is often legal). What do you do?

| | Feifei | Piyush | PhD A | Dave | Univ. President | Master | Sum |
|---|---|---|---|---|---|---|---|
| Original salary | $300 | $400 | $200 | $1000 | $5000 | $100 | $7,000 |
| Random noise | $200 | -$100 | $600 | -$500 | -$500 | $300 | $0 |
| Perturbed salary | $500 | $300 | $800 | $500 | $4500 | $400 | $7,000 |

Perturb the data… How? Add random noise… in a particular way.
Basic intuition: add independent and identically distributed (i.i.d.) random noise with zero mean.
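The salary example can be replayed in a few lines. The salaries and the noise vector are the slide's own numbers; the helper names `perturb` and `zero_mean_noise`, the Gaussian distribution, and the sigma value are assumptions for illustration only.

```python
import random


def perturb(values, noise):
    """Add one noise term to each value."""
    return [v + e for v, e in zip(values, noise)]


def zero_mean_noise(n, sigma, rng):
    """Draw n i.i.d. Gaussian noise terms with mean 0 (assumed distribution)."""
    return [rng.gauss(0.0, sigma) for _ in range(n)]


salaries = [300, 400, 200, 1000, 5000, 100]

# The slide's hand-picked noise sums to exactly zero.
slide_noise = [200, -100, 600, -500, -500, 300]
published = perturb(salaries, slide_noise)

# With random zero-mean noise the sum is preserved only in expectation.
random_noise = zero_mean_noise(len(salaries), 300.0, random.Random(1))
randomly_published = perturb(salaries, random_noise)
```

Individual published values differ from the true ones, while the sum (and so the average) stays exactly the same for the slide's noise and approximately the same for random zero-mean noise over many records.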
![Page 42: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/42.jpg)
How about Multiple Attributes (multi-dimensional data)?
• Does i.i.d. noise really preserve privacy?
![Page 43: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/43.jpg)
Principal Component Analysis: PCA
[Figure: data perturbed with i.i.d. noise]
![Page 44: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/44.jpg)
Principal Component Analysis: PCA
[Figure: data perturbed with correlated noise]
![Page 45: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/45.jpg)
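PCA Based Data Reconstruction
[Figure: the perturbed data A* is projected onto the principal direction to obtain the reconstructed data A~; labels: Added Noise (variance σ²), Removed Noise, Remaining Noise, Projection Error, Privacy, Utility]
A: Original Data
A*: Perturbed Data
A~: Reconstructed Data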
![Page 46: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/46.jpg)
PCA Based Data Reconstruction
[Figure: the same reconstruction with correlated noise; labels: Principal Direction, Added Noise (variance σ²), Remaining Noise, Projection Error, Privacy, Utility]
A: Original Data
A*: Perturbed Data
A~: Reconstructed Data
Correlated Noise!
![Page 47: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/47.jpg)
Data Perturbation: main idea
• Observations
– The amount of random noise controls the privacy/utility tradeoff
– i.i.d. (independent and identically distributed) noise does not preserve privacy well enough
• Lesson learned
– Noise should be correlated with the original data
![Page 48: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/48.jpg)
How about Streaming Data?
• Streaming data: data arrives continuously and no global view of the data is available, so the global trends cannot be computed.
[Figure: three panels: i.i.d. Noise, Correlated Noise, Online Correlated Noise]
![Page 49: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/49.jpg)
Outline
• Background
• My Research Focus and Experience
• Some Problems I have worked on
• Current Interest and Activity
– Privacy Preservation
– Query Security
• My Experience as a PhD student
• Q&A
![Page 50: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/50.jpg)
Example of Data Publishing
www.informatik.uni-trier.de/~ley/db/indices/a-tree/h/Hadjieleftheriou:Marios.html
www.sigmod.org/dblp/db/indices/a-tree/h/Hadjieleftheriou:Marios.html
![Page 51: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/51.jpg)
Outsourced Database for Better Query Services
Servers that are close to local clients and maintained by local business partners
Company with headquarters in US
![Page 52: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/52.jpg)
Publishing Data and Outsourcing Query Service
[Figure: an IP traffic stream (0 1 1 0 0 1 … 1 1 0 …) coming from the network is fed to Gigascope, an analysis tool, which returns statistics as results]
![Page 53: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/53.jpg)
Data Publishing Model [HIM02]
• Owner: publishes data
• Servers: host (or monitor) the data and provide query services
• Clients: query the owner’s data through servers (a client is possibly the owner)
[Figure: owner -> servers -> clients]
H. Hacigumus, B. R. Iyer, and S. Mehrotra, ICDE02
![Page 54: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/54.jpg)
Information Security Issues
• The third party (server) cannot be trusted
– Malicious intent
– Compromised equipment
– Unintentional errors (e.g. bugs)
![Page 55: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/55.jpg)
Problem 1: Injection
Select * from T where 5<A<11
Table T, held by both the owner and the server:
A B
r1 …
… …
ri-1 4
ri 7
ri+1 9
ri+2 11
[Figure: owner -> server -> client]
Server returns 7, 8, 9 (the value 8 does not exist in the owner's table)
![Page 56: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/56.jpg)
Problem 2: Drop
Select * from T where 5<A<11
Table T, held by both the owner and the server:
A B
r1 …
… …
ri-1 4
ri 7
ri+1 9
ri+2 11
[Figure: owner -> server -> client]
Server returns 7 (the matching record ri+1 with value 9 is dropped)
![Page 57: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/57.jpg)
Query Authentication: Goals
• Query Correctness: results do exist in the owner's database
• Query Completeness: no records have been omitted from the result
• Query Freshness: latest available answer (in case of updates)
![Page 58: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/58.jpg)
General Approach
[Figure: owner -> servers -> clients; the owner's table (rows r1 … ri) is accompanied by authenticated structures, and the clients receive the query results together with a Verification Object (VO)]
![Page 59: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/59.jpg)
Merkle Hash Tree[M89]
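[Figure: binary hash tree over messages m1 … m8: leaf hashes h1 … h8, internal hashes h12, h34, h56, h78, then h1..4 and h5..8, and the root h1..8]
• Each internal node hashes its children, e.g. h12=H(h1|h2)
• The owner signs the root: σ=Sign(h1..8, SK)
• To verify a message such as m5, the client recomputes the path to the root (h5, h56, h5..8, h1..8) from the sibling hashes h6, h78, h1..4 and checks Ver(h1..8, σ, PK)=valid?
• Collision-resistant hash function: any change in the tree will lead to a different hash value for the root
• Digital signature of the root: no one except the owner could produce the signature
• Hash function is publicly known
• Single signature to sign many messages
R. C. Merkle. CRYPTO, 1989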
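A minimal sketch of a Merkle hash tree over eight messages, assuming SHA-256 as the public hash function and a power-of-two number of leaves. The digital signature step is left out: the sketch simply treats the root hash as already authenticated, where the slide has the owner sign it. The function names `build_levels`, `prove`, and `verify` are hypothetical.

```python
import hashlib


def H(data: bytes) -> bytes:
    """Public collision-resistant hash (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).digest()


def build_levels(messages):
    """Return all tree levels: levels[0] are leaf hashes, levels[-1] is [root].

    Assumes len(messages) is a power of two.
    """
    level = [H(m) for m in messages]
    levels = [level]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels, index):
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1                   # neighbor within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(message, proof, root):
    """Recompute the root from a message and its sibling hashes."""
    h = H(message)
    for sibling, sibling_is_left in proof:
        h = H(sibling + h) if sibling_is_left else H(h + sibling)
    return h == root


messages = [f"m{i}".encode() for i in range(1, 9)]   # m1 … m8
levels = build_levels(messages)
root = levels[-1][0]                                 # h1..8 (would be signed)

proof_m5 = prove(levels, 4)                          # siblings: h6, h78, h1..4
ok = verify(b"m5", proof_m5, root)
tampered = verify(b"m5-modified", proof_m5, root)
```

The proof for one message holds only log2(8) = 3 hashes, which is why a single signature on the root is enough to cover many messages.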
![Page 60: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/60.jpg)
Merkle B(MB) Tree: Natural Extension for Range Query
• Use a B+-tree instead of a binary search tree:
[Figure: B+-tree with internal keys 410, 720 and leaf keys 250, 320, 410, 600, 720 pointing to tuples t1 … t5]
• Extend it with hash information:
– Leaf node: … Ki, hi=H(ti) … Kj, hj=H(tj) …
![Page 61: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/61.jpg)
Merkle B(MB) Tree: Natural Extension for Range Query
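• Internal node entries: (p0, h0), (k1, p1, h1), …, (kf, pf, hf)
• Each hash covers the child node its pointer leads to, e.g. for the child with hashes h10, h11, …: h1=H(h10|…|h1f)
• For the root node, σ=Sign(h0|…|hf)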
![Page 62: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/62.jpg)
How about Streaming Data?
Outsourced stream model: stock trading monitoring
[Figure: provider (a stock broker) -> servers (Bloomberg) -> clients; clients register queries Q with the servers: sliding window queries and/or one-shot queries]
![Page 63: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/63.jpg)
And Other Models?
Revisiting the CISCO – AT&T Example
[Figure: network -> IP traffic stream (0 1 1 0 0 1 … 1 1 0 …) -> Gigascope -> statistics]
• CISCO owns the network traffic data: it is both the data owner and the client!
• Lawyers: sign the trust agreement
• Could we (computer scientists) help?
![Page 64: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/64.jpg)
Outline
• Background
• My Research Focus and Experience
• Some Problems I have worked on
• Current Interest and Activity
– Uncertain Databases and Data Cleaning: talk to me if you’d like to learn more.
• My Experience as a PhD student
• Q&A
![Page 65: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/65.jpg)
My Lessons
• Have a strong motivation
– You are doing a PhD for yourself, not for FSU, not for your advisor
• Find a topic that attracts you the most
– A PhD can be frustrating and boring, and you are almost broke as a student. So why not do something that interests you the most?
![Page 66: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/66.jpg)
My Lessons
• Meet your advisor as often as possible
– Your advisor is almost the only one who really cares about your PhD and knows what you are doing…
• Make connections
– Whenever possible: conferences, industry labs, etc. Also work on your communication skills (including writing papers)
![Page 67: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/67.jpg)
My Lessons
• Read Papers
– As much as you can! It’s never enough
• Finally
– Work hard, enjoy your life, and good luck!!
![Page 68: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/68.jpg)
Questions?
• Thank you!
![Page 69: Introduction to Research](https://reader036.fdocuments.us/reader036/viewer/2022062315/56814fa8550346895dbd67a6/html5/thumbnails/69.jpg)
Back to COUNT Sketches
• The COUNT sketches of [FM'85] are equivalent to the paintball process.
– Start with a bit-vector of all zeros.
– For each item,
• Use its ID and a hash function for coin flips.
• Pick a bit-vector entry.
• Set that bit to one.
• These sketches are duplicate-insensitive:
{x}: 1 0 0 0 0
{y}: 0 0 1 0 0
{x,y}: 1 0 1 0 0
$\forall A, B:\ \mathrm{Sketch}(A)\ \mathrm{OR}\ \mathrm{Sketch}(B) = \mathrm{Sketch}(A \cup B)$
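A small sketch of an FM-style COUNT sketch, written to show why the union must be a bitwise OR: inserting the same item twice sets the same bit, so OR is duplicate-insensitive, whereas XOR would clear it again. The hash choice (SHA-256), the 32-bit width, and the names `sketch`, `union`, and `estimate` are assumptions; the constant 0.775351 is copied from the estimator slide.

```python
import hashlib


def sketch(items, width=32):
    """Build a bit-vector sketch, stored as an int (bit i = step i marked)."""
    bits = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        # Coin flips from the hash: each trailing zero bit is one "bounce".
        pos = 0
        while pos < width - 1 and (h >> pos) & 1 == 0:
            pos += 1
        bits |= 1 << pos          # mark the step where the ball bursts
    return bits


def union(a, b):
    """Sketch of the union of two sets: bitwise OR of their sketches."""
    return a | b


def estimate(bits, phi=0.775351):
    """Rough distinct-count estimate from the first unmarked position."""
    pos = 0
    while (bits >> pos) & 1:
        pos += 1
    return 2 ** pos / phi


sx, sy = sketch(["x"]), sketch(["y"])
sxy = sketch(["x", "y"])
with_duplicates = sketch(["x", "y", "x", "x"])
```

A single sketch gives only a coarse estimate; averaging over k independent sketches, as the sensornet slide suggests, is the usual way to tighten it.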