Modeling Web Content Dynamics
Brian Brewington ([email protected]), George Cybenko ([email protected])
IMA February 2001
Observing changing information sources
An index of changing information sources must re-index items periodically to keep the index from becoming out-of-date.
What does it mean for an observer or index to be “up-to-date” or “current”?
Our work on the web has two parts:
– Estimation of change rates for a large sample of web pages
– Re-indexing speed requirements with respect to a formal definition of “up-to-date”
Your brain is good at this
Where is your visual attention directed when driving a car? Why?
Form state estimates; re-observe when uncertainty becomes too large.
Ingredients
1. A formal definition of “up-to-dateness”
2. Data
3. Scheduling to optimize “up-to-dateness”
A meaning for “up to date”
An index entry is (α, β)-current if it is correct to within a grace period of time β, with probability at least α.
To be “β-current”:
No alteration is allowed in the gray region for the index entry to be “β-current”.
[Figure: timeline (horizontal axis: time) with labels t0 (last observed), tn−β, tn (now), and t0+T (next observed); β is the grace period.]
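The definition above can be sketched as a small check, written here in Python. This is a minimal illustration, not the authors' code: the function name `is_beta_current`, its arguments, and the choice to treat the checked interval as (last observed, now − β] are all assumptions, with β standing for the grace period.

```python
from bisect import bisect_right

def is_beta_current(change_times, last_observed, now, beta):
    """Return True if no change falls in (last_observed, now - beta].

    Hypothetical helper: change_times are the true modification times of the
    source, last_observed is when the index last fetched it, and beta is the
    grace period. Boundary handling is an assumption of this sketch.
    """
    cutoff = now - beta
    if cutoff <= last_observed:
        # Everything since the last observation lies inside the grace period.
        return True
    times = sorted(change_times)
    first_after_obs = bisect_right(times, last_observed)
    first_after_cutoff = bisect_right(times, cutoff)
    # The entry is beta-current when no change lies between the two positions.
    return first_after_cutoff == first_after_obs

# A change at 9.5 is forgiven with a grace period of 1 but not of 0.25.
print(is_beta_current([3.0, 9.5], last_observed=5.0, now=10.0, beta=1.0))
print(is_beta_current([3.0, 9.5], last_observed=5.0, now=10.0, beta=0.25))
```

An index would then be judged by the fraction of entries for which this check holds, compared against the required probability α.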
(α, β)-currency has meaning in many contexts
Any source has a spectrum of possibilities; here are some possible values (guesses):
– Newspaper: (0.9, 1 day)
– Television news: (0.95, 1 hour)
– Broker watching stocks: (0.95, 30 min)
– Air traffic controller: (0.95, 20 sec)
– Web search engine: (0.6, 1 day)
– An old web page’s links: (0.4, 70 days)
Collecting web page data
Our web page data comes from a web monitoring service.
The Informant runs periodic standing user queries against four search engines and monitors user-selected URLs. When new or updated results appear, users are notified via email.
We download ~100,000 pages per day for ~30,000 users.
See http://informant.dartmouth.edu
Sampling issues
Biased towards search engine results in the top 10 for users’ queries
No more than one observation of a page per day; pages are usually observed once every three days.
Queries and page checks are run only at night, so sample times are correlated.
Filesystem timestamps are available for about 65% of our observations.
Data in our collection
As of March 2000, we had observations of about 3 million web pages. The data in the paper spans 7 months.
Each page is observed an average of 12 times, and the average time span of observation is 38 days.
Each observation includes:
– “Last-Modified” timestamps, when available
– Observation time (using the remote server’s if possible)
– Document summary information
» Number of bytes (“Content-Length”)
» Number of images, tables, forms, lists, banner ads
» 16-bit hash of text, hyperlinks, and image references
“Lifetimes” vs. “ages”
We can model objects as having independent, identically distributed time periods between modifications. We call these “lifetimes.”
The “age” is the time since the present lifetime began.
By analogy, think of replacement parts, each with an independent lifetime length.
[Figure: age plotted against time (axis from 0 to 4); each marker is a change. Lifetimes L1, L2, ... are labeled with lengths 1.53, 1.14, 0.62, and 0.84.]
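The two quantities can be made concrete with a short Python sketch. The helper names `lifetimes` and `age_at` are hypothetical, and the change times below are assumed values chosen so that the gaps between them match the lifetime labels in the figure (1.53, 1.14, 0.62, 0.84), starting from time 0.

```python
def lifetimes(change_times):
    """Time between successive modifications."""
    times = sorted(change_times)
    return [later - earlier for earlier, later in zip(times, times[1:])]

def age_at(change_times, t):
    """Time elapsed at t since the most recent modification at or before t."""
    past = [c for c in change_times if c <= t]
    if not past:
        raise ValueError("no modification at or before t")
    return t - max(past)

# Assumed change times; only the gaps between them come from the figure labels.
changes = [0.0, 1.53, 2.67, 3.29, 4.13]
print([round(x, 2) for x in lifetimes(changes)])
print(round(age_at(changes, 3.0), 2))
```

The age grows linearly and drops back to zero at every change, which is why a single observation yields an age while a lifetime needs two observed changes.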
Determining dynamics from the time data
Two ways to find the distribution of change rates:
1. Observe the time between successive modifications. (Lifetimes)
Good: direct measurement of time between changes
Bad: aliasing possible; needs repeat observations
2. Observe the time since the most recent modification. (Ages)
Good: doesn’t have aliasing problems; works without having to make repeat observations
Bad: requires that we accurately account for growth
Sampling the lifetime distribution
There are two problems with trying to sample the difference of successive change times (x = modification, o = observation):
1. The second observation (o) will miss two changes (x).
2. The observation window is not big enough to see any changes (x).
[Figures: timelines of modifications (x) and observations (o), with labels for the observation timespan, the actual lifetime, and the observed lifetime.]
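Both problems can be reproduced with a small Python sketch. It assumes each observation sees only the timestamp of the most recent modification (as a “Last-Modified” header would report), and the function names and the sample times are hypothetical.

```python
from bisect import bisect_right

def last_modified_seen(change_times, observation_times):
    """Timestamp of the latest change visible at each observation (None if none yet)."""
    changes = sorted(change_times)
    seen = []
    for t in sorted(observation_times):
        i = bisect_right(changes, t)
        seen.append(changes[i - 1] if i > 0 else None)
    return seen

def observed_lifetimes(change_times, observation_times):
    """Lifetimes inferred from the distinct timestamps that were actually seen."""
    stamps = [s for s in last_modified_seen(change_times, observation_times) if s is not None]
    distinct = sorted(set(stamps))
    return [b - a for a, b in zip(distinct, distinct[1:])]

changes = [1.0, 4.0, 5.0, 6.0, 12.0]   # actual lifetimes: 3, 1, 1, 6
observations = [2.0, 7.0, 8.0, 9.0]    # window ends before the change at 12.0

print(last_modified_seen(changes, observations))
print(observed_lifetimes(changes, observations))
```

In this example the observation at 7.0 misses the changes at 4.0 and 5.0, so a single lifetime of 5.0 is recorded in place of three shorter ones (aliasing), and the lifetime from 6.0 to 12.0 is never recorded because the window closes first.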
Web page age CDF
• Median age 120 days
• Upper 25% > 1 year
• Lowest 25% < 1 month
[Figure: cumulative probability (0 to 1) versus age in days (log scale, ticks at 1 day, 10 days, and 100 days).]
Empirical lifetime distribution
[Figure, two panels. Lifetime PDF: probability density (log scale, 10^-4 to 10^-2) versus lifetime in days (0 to 600). Lifetime CDF: cumulative probability (0.2 to 1) versus lifetime in days (log scale, 10^0 to 10^2).]
When do changes happen?
Change times, mod 24×7 hours, show that more changes happen during the span of US working hours (8 AM to 8 PM, EST).
[Figure: relative frequency (0 to 4 × 10^-3) versus time since Thursday 12:00 GMT in hours (ticks at 0, 50, 100, 150); day labels along the axis: Weds afternoon, Thursday, Friday, Saturday, Sunday, Monday, Tuesday, Weds morning.]
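Folding change times onto a single week can be sketched in Python as follows. The sketch assumes the change times are Unix timestamps in seconds and treats GMT as UTC; it relies on the Unix epoch (1970-01-01 00:00 UTC) falling on a Thursday, so subtracting 12 hours puts the origin at Thursday 12:00. The function names and the one-hour bin width are assumptions.

```python
WEEK_HOURS = 24 * 7  # 168 hours

def hour_of_week(unix_ts):
    """Hours elapsed since the most recent Thursday 12:00 UTC."""
    return ((unix_ts - 12 * 3600) % (WEEK_HOURS * 3600)) / 3600.0

def weekly_histogram(timestamps, bin_hours=1):
    """Relative frequency of change times per bin; bin_hours should divide 168."""
    n_bins = WEEK_HOURS // bin_hours
    counts = [0] * n_bins
    for ts in timestamps:
        counts[int(hour_of_week(ts) // bin_hours)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins

# Three made-up change times: two in the first hour after Thursday noon, one in the next.
sample = [12 * 3600, 12 * 3600 + 1800, 13 * 3600]
hist = weekly_histogram(sample)
print(hist[0], hist[1])
```

A peak that recurs at the same offsets on weekdays and flattens on the weekend bins is the pattern the slide describes.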
Distribution of mean change times
The Weibull distribution, a generalized exponential, models mean lifetimes fairly well:
$$ f(t) = \frac{\sigma}{\delta}\left(\frac{t}{\delta}\right)^{\sigma-1} e^{-(t/\delta)^{\sigma}} $$
Here t is the mean lifetime. This can be used to find an age or lifetime CDF for any shape parameter σ and scale parameter δ. But for the age CDF, a growth model is needed, so age-based estimates can be inaccurate.
[Figure: lifetime CDF F (σ = 1.4, δ = 152.2); cumulative probability (0 to 1) versus lifetime in days (log scale, 10^0 to 10^3); curves: Trial and Reference.]
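A Weibull density and CDF with the fitted values quoted in the figure (shape 1.4, scale 152.2 days) can be evaluated with the Python standard library. This is a sketch of the standard two-parameter Weibull form; the function names are hypothetical, and it is an assumption that the fit on the slide uses exactly this parameterization.

```python
import math

SHAPE = 1.4    # shape parameter quoted in the figure
SCALE = 152.2  # scale parameter in days, quoted in the figure

def weibull_pdf(t, shape=SHAPE, scale=SCALE):
    """Density of the two-parameter Weibull distribution at t >= 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0) * math.exp(-((t / scale) ** shape))

def weibull_cdf(t, shape=SHAPE, scale=SCALE):
    """Probability that a mean lifetime is at most t."""
    return -math.expm1(-((t / scale) ** shape))

def weibull_median(shape=SHAPE, scale=SCALE):
    """Value of t at which the CDF equals one half."""
    return scale * math.log(2.0) ** (1.0 / shape)

print(round(weibull_median(), 1))    # median mean-lifetime in days, roughly 117
print(round(weibull_cdf(30.0), 3))   # share of pages with mean lifetime under 30 days
```

With a shape parameter of 1 the same functions reduce to the ordinary exponential distribution, which is the sense in which the Weibull is a generalized exponential.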
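(α, β)-currency for Poisson source
A single source has Poisson changes at rate λ. If re-indexed every T time units, the expected probability of the index entry being β-current (for β ≤ T) is:
$$ \Pr(\beta\text{-current}) = \frac{\beta}{T} + \frac{1 - e^{-\lambda (T-\beta)}}{\lambda T} = \frac{\beta}{T} + \frac{1 - e^{-z\,(1-\beta/T)}}{z}, \qquad z = \lambda T $$
[Figure: probability α (ticks from 0.2 to 0.8) versus expected changes per check period, λT (log scale, 10^-2 to 10^2); curves for β/T = 0.9, 0.6, 0.25, and 0.0.]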
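The single-source probability can be evaluated directly. The Python sketch below assumes the query time is uniformly distributed over the check period T, which gives β/T + (1 − e^(−λ(T−β)))/(λT) for a grace period β ≤ T and change rate λ; the function name and argument names are hypothetical.

```python
import math

def prob_beta_current(rate, period, beta):
    """Expected probability that an entry is beta-current.

    rate   : Poisson change rate of the source (changes per unit time)
    period : re-indexing period T
    beta   : grace period
    Assumes the moment of use is uniform over the check period.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if beta >= period or rate == 0:
        return 1.0
    z = rate * period
    return beta / period + (-math.expm1(-rate * (period - beta))) / z

# With no grace period and one expected change per check, currency is 1 - 1/e.
print(round(prob_beta_current(rate=1.0, period=1.0, beta=0.0), 4))
# A fast-changing source is current only while the grace period still covers it.
print(round(prob_beta_current(rate=1e6, period=10.0, beta=2.5), 4))
```

The second call illustrates the right-hand edge of the plot: as the expected number of changes per check grows, the probability falls toward β/T.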
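Probability of currency over a collection
Expected probability of a random index entry being β-current (given distribution f(t) of mean change times t):
$$ \alpha = \int_0^\infty f(t)\left[\frac{\beta}{T} + \frac{1 - e^{-(T-\beta)/t}}{T/t}\right] dt, \qquad f(t) = \frac{\sigma}{\delta}\left(\frac{t}{\delta}\right)^{\sigma-1} e^{-(t/\delta)^{\sigma}} $$
Here f(t) is the distribution of average lifetimes, and the bracketed term is the probability of being β-current given average lifetime t.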
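The collection-wide probability can be approximated numerically. The Python sketch below integrates the per-page probability against a Weibull density of mean lifetimes using a plain midpoint rule; the parameter values 1.4 and 152.2 are the fitted shape and scale quoted in the lifetime CDF figure, while the function names, the integration limit, and the step count are assumptions of the sketch.

```python
import math

def weibull_pdf(t, shape, scale):
    """Two-parameter Weibull density of mean lifetimes."""
    return (shape / scale) * (t / scale) ** (shape - 1.0) * math.exp(-((t / scale) ** shape))

def prob_given_mean_lifetime(mean_lifetime, period, beta):
    """Probability of being beta-current for a page with the given mean lifetime."""
    if beta >= period:
        return 1.0
    x = (period - beta) / mean_lifetime
    return beta / period + (-math.expm1(-x)) * mean_lifetime / period

def collection_currency(period, beta, shape=1.4, scale=152.2,
                        t_max=5000.0, steps=50_000):
    """Midpoint-rule estimate of the expected currency over the collection."""
    dt = t_max / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt  # midpoint of the i-th slice, never exactly zero
        total += weibull_pdf(t, shape, scale) * prob_given_mean_lifetime(t, period, beta) * dt
    return total

# Re-indexing every 18 days with a 1-week grace period (times in days).
print(round(collection_currency(period=18.0, beta=7.0), 3))
```

Under these assumptions the 18-day, 1-week case should land close to the 0.95 target discussed in the talk, and lengthening the period lowers the result.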
Index performance surface: α as a function of T, β/T
Surface formed by integrating out the rate dependence
For large period T, α approaches β/T
Plane shown for α = 0.95 intersects the surface at a level set of (β, T) pairs
95% level set: (T, β) pairs
[Figure: grace period β [days] (log scale, 10^-1 to 10^2) versus re-indexing period T [days] (log scale, 10^1 to 10^2); curves: age-based and lifetime-based. Marked grace periods: β = 1 day, 1 week, 1 month, 1 year. Marked periods: T = 8.5, 11.5, 18, 23, 50, and 59 days.]
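A point on such a level set can be located by bisection on the re-indexing period. The Python sketch below is an illustration under stated assumptions, not the authors' procedure: it uses the lifetime-based Weibull fit quoted earlier (shape 1.4, scale 152.2 days), a uniform query time within each check period, a midpoint-rule integral, and the assumption that currency decreases as the period grows. All function names are hypothetical.

```python
import math

def weibull_pdf(t, shape, scale):
    return (shape / scale) * (t / scale) ** (shape - 1.0) * math.exp(-((t / scale) ** shape))

def collection_currency(period, beta, shape=1.4, scale=152.2,
                        t_max=4000.0, steps=20_000):
    """Expected probability that a random entry is beta-current (midpoint rule)."""
    if beta >= period:
        return 1.0
    dt = t_max / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        p = beta / period + (-math.expm1(-(period - beta) / t)) * t / period
        total += weibull_pdf(t, shape, scale) * p * dt
    return total

def period_for_currency(alpha, beta, t_hi=3650.0, tol=1e-3):
    """Longest period T (same units as beta) that still achieves currency alpha."""
    lo, hi = beta, t_hi
    if collection_currency(hi, beta) >= alpha:
        raise ValueError("target is met even at t_hi; raise t_hi")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if collection_currency(mid, beta) >= alpha:
            lo = mid  # still current enough, so the period can be longer
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Period in days for 95% currency with a 1-week grace period.
print(round(period_for_currency(alpha=0.95, beta=7.0), 1))
```

For a 1-week grace period the result should fall in the neighborhood of the 18-day figure quoted for the lifetime-based estimate; repeating the call for other grace periods traces out the rest of the curve.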
Bandwidth needed for (0.95, 1-week) currency
For (0.95, 1 week) currency of this collection:
– Must re-index with period around 18 days.
– A (0.95, 1-week) index of the whole web (~800 million pages) processes about 50 megabits/sec.
– A more “modest” (0.95, 1-week) index of 150 million pages will process 9 megabits/sec.
For fixed-period checks, we can estimate processing speed requirements.
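The arithmetic behind such estimates can be sketched in Python. The average page size is not stated on the slide; the 12,000-byte value below is an assumption chosen because it roughly reproduces the quoted bandwidths, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400
ASSUMED_AVG_PAGE_BYTES = 12_000  # assumption, not a figure from the talk

def required_bandwidth_mbps(n_pages, period_days, avg_page_bytes=ASSUMED_AVG_PAGE_BYTES):
    """Sustained rate in megabits/sec needed to fetch every page once per period."""
    total_bits = n_pages * avg_page_bytes * 8
    return total_bits / (period_days * SECONDS_PER_DAY) / 1e6

# Whole web (~800 million pages) and a 150-million-page index, both on an 18-day cycle.
print(round(required_bandwidth_mbps(800e6, 18), 1))
print(round(required_bandwidth_mbps(150e6, 18), 1))
```

Because the rate scales linearly with the number of pages and inversely with the period, halving the re-indexing period doubles the required bandwidth.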
Empirical search engine currency
[Figure: vertical axis from 0.4 to 1; horizontal axis in days (log scale, 10^0 to 10^3); curves for Google, Infoseek, AltaVista, and Northern Light.]
A calculus for currency
If x is current and y is current, then
(x, y) is max current.
Extend this to other atomic operations on information, e.g. composition.
Summary
About one in five pages has been modified within the last 12 days.
(0.95, 1-week) on our collection: must observe every 18 days.
Ideas: More specialty search engines? Distributed monitoring/remote update?
Other work: algorithms for scheduling observation based on source change rate and importance.
Mathematics of “Semantic Hacking”
Problem
Denial of Service Attacks: Infrastructure
System attacks: Systems
Semantic attacks: Information
easy to detect
hard to detect
Distribution of information
“Gaussian” is expected.
Outliers
Collusion?
What makes a good mystery/thriller?
“Correct” conclusion
“Wrong” conclusion
A wrong conclusion can be reached by one large, detectable bad decision or a sequence of small, undetectably perturbed decisions.
Understand the whole sequence of decisions, not just one in isolation.
Ongoing research
Develop a model of such “semantic attacks”.
Develop a way to quantify such things.
Develop some tools for detecting/managing complex decision sequences.
Make information/decision systems more robust.
Acknowledgements
DARPA contract F30602-98-2-0107
DoD MURI (AFOSR contract F49620-97-1-03821)
NSF KDI Grant 9873138