How do I develop and found a search engine? Files …gros/TALKS/WebScience11/...How do I develop and...
How do I develop and found a search engine?
Files and file search in the Internet
from the perspective of FindFiles.net
Claudius Gros
Institute for Theoretical Physics
Goethe University Frankfurt, Germany
http://www.findfiles.net
1
overview
data in the Internet
– Mime types and statistics
– the file search engine FindFiles.net
science with data files
– neuropsychological constraints to human data production
2
Internet – Statistics
3
Internet hosts
Hosts – Domains – Sites
[source: Netcraft.com]
• 2011: ∼ 100 Mio active domains
4
Internet users – email
users in 2010
• 2 · 10^9 worldwide
• 825 · 10^6 Asia
• 475 · 10^6 Europe
• 266 · 10^6 North America
emails in 2010
• 107 · 10^12 – number of emails sent
• 89% – share of spam emails (success rate: 1:12 Mio ?)
• 1.9 · 10^9 – number of email users
[source: Pingdom.com]
5
Internet – social media
2010
• 152 · 10^6 – number of blogs
• 25 · 10^9 – number of tweets on Twitter
• 600 · 10^6 – Facebook accounts
⊲ 30 · 10^9 – pieces of content: links, notes, images, ...
⊲ 20 · 10^6 – number of activated apps (per day)
[source: Pingdom.com]
6
social media – images and videos
streaming videos – tubes
• 2 · 10^9 – watched per day on Youtube (one per Internet user)
• 35 – hours of video uploaded (every minute)
images, pictures
• 5 · 10^9 – photos hosted by Flickr
• 3000 – photos uploaded (per minute)
[source: Pingdom.com]
7
social media – blogs & bookmarking
blogs are everywhere
• 2010: 50–100 · 10^6 blogs
social is everything
• Digg, Mister Wong, Delicious, ...
• social shopping, ...
http://www.delicious.com
8
Internet – rules of thumb
2010
• 1 movie per day per user (Youtube)
• 1 search query per day for every 2 users (Google)
• every Internet user uses email
• 5 ‘true’ emails per day per Internet user
• 25% of Internet users use novel social media
• most domains are blogs
9
Internet Startups
10
slow beginnings for startups in the Internet
• linear scale – exponential or linear growth?
⊲ Apr 2011 – 155 daily tweets
11
growth is generically not exponential
[Figure: Google yearly revenues; revenues per year (billions, US-$) vs. year, 2000–2010]
• linear scale
12
Internet: the winner takes all
flow of attention in complex networks
www.big.com
www.medium.com
www.small.com
www.small.org
www.small.net
www.small.de
• in-degree distribution $p_k$
⊲ heavy tails
• preferential attachment
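Preferential attachment can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the model used for the talk: each new node sends exactly one link, and an existing node is chosen with probability proportional to its in-degree plus one. The function name `preferential_attachment` and all parameters are hypothetical.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a directed network in which every new node links to one existing
    node, chosen with probability proportional to (in-degree + 1).
    Illustrative assumption: exactly one outgoing link per new node."""
    rng = random.Random(seed)
    in_degree = [0]   # node 0 starts alone
    # 'targets' holds one entry per node plus one entry per received link,
    # so a uniform draw from it is proportional to in-degree + 1.
    targets = [0]
    for new_node in range(1, n_nodes):
        chosen = rng.choice(targets)
        in_degree[chosen] += 1
        targets.append(chosen)    # extra weight for the link just received
        in_degree.append(0)
        targets.append(new_node)  # base weight of the newcomer
    return in_degree

degrees = preferential_attachment(20000)
histogram = Counter(degrees)
for k in (0, 1, 2, 4, 8, 16, 32):
    print(k, histogram.get(k, 0))
print("largest in-degree:", max(degrees))
```

Most nodes end up with no or very few incoming links, while a handful of early nodes collect a large share of them, which is the heavy tail referred to above.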
13
in-degree distribution
power law – scale invariant
[Figure: log(number of hosts) vs. number of incoming links, 10^0 to 10^6; linear fit, slope -2.2: 7.51 - 2.2*x]
[source: Findfiles.net]
• scaling constant for 20 years – starting at one!
14
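A power law shows up as a straight line in a log-log plot, so its exponent can be read off from a linear fit. The sketch below is only an illustration: the data points are synthetic, generated from the fit quoted on the slide (7.51 - 2.2*x), and it is an assumption that x denotes the base-10 logarithm of the number of incoming links. The helper `linear_fit` is hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic points on the fitted line (not the FindFiles.net measurements).
log_links = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]          # log10(incoming links)
log_hosts = [7.51 - 2.2 * x for x in log_links]     # log10(number of hosts)

slope, intercept = linear_fit(log_links, log_hosts)
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
```

The negative of the fitted slope is the exponent of the in-degree distribution.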
limiting diverging in-degree distribution
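$$p_k \propto \frac{1}{k^{\alpha}}, \qquad \langle k \rangle \propto \int \frac{k}{k^{\alpha}}\, dk \sim \left. \frac{k^{2-\alpha}}{2-\alpha} \right|_{K_c}^{\infty}$$
• diverging mean in-degree: $\lim_{\alpha\to 2} \langle k \rangle \to \infty$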
⊲ Internet: α ≈ 1.9−2.2
⊲ limiting dominating tail
» limiting winners take all «
• makes life difficult for small startups
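The divergence of the mean in-degree can be checked numerically. This is a minimal sketch, assuming a discrete power law with a lower cutoff `k_min` and an artificial upper cutoff `k_max` (both names are hypothetical and only serve the illustration):

```python
def mean_in_degree(alpha, k_min=1, k_max=10**5):
    """Mean of the discrete power law p_k ~ k**(-alpha) on k_min..k_max."""
    norm = 0.0
    first_moment = 0.0
    for k in range(k_min, k_max + 1):
        weight = k ** (-alpha)
        norm += weight
        first_moment += k * weight
    return first_moment / norm

for alpha in (3.0, 2.5, 2.2, 2.0, 1.9):
    print(alpha, round(mean_in_degree(alpha), 2))
```

For exponents well above 2 the mean barely depends on `k_max`; close to 2 and below it keeps growing as the upper cutoff is raised, which is the dominance of the tail described above.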
15
the big two uphill fights
a new Internet startup needs to ...
• fight for attention
• fight for novelty
traffic and quality
heavy-tailed in-degree distribution makes it difficult to attract traffic
extremely high service standards act as effective entry barriers
16
FindFiles.net – a new file search engine
17
public data on the Internet
280 Million domains in 2011
⊲ 10-30 data files per domain
Internet Media type – Mime type
• categorization of all file types
⊲ email attachments
⊲ browser add-ons
⊲ ∼ 600 Mime types in use
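As an illustration of Mime-type categorization, the sketch below guesses the type from a file name with Python's standard `mimetypes` module and keeps only the top-level category. This is an assumption made for the example; how the FindFiles.net crawler determines Mime types is not stated on the slides. The function `mime_category` and the file names are hypothetical.

```python
import mimetypes
from collections import Counter

def mime_category(filename):
    """Return the top-level Mime category (e.g. 'image') guessed from the
    file extension, or 'unknown' if the extension is not recognised."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return "unknown"
    return mime_type.split("/")[0]

files = ["talk.pdf", "song.mp3", "photo.jpg", "index.html", "noextension"]
print(Counter(mime_category(name) for name in files))
```

Counting categories over a crawl in this way yields statistics of the kind shown on the next slide.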
18
Mime types
major Mime categories
• together: 99%
33.2% application/
2.9% audio/
58.0% image/
5.1% text/
0.7% video/
Mime types – examples
application/pdf                            audio/mpeg
application/msword                         audio/midi
application/vnd.android.package-archive   chemical/x-pdb
application/vnd.ms-powerpoint              image/jpeg
application/jar                            image/vnd.djvu
application/x-deb                          text/xml
application/x-gzip                         model/vrml
19
FindFiles.net
search engine for data files
G. Kaczor & C. Gros 2011
• supports all Mime types
http://www.findfiles.net
20
FindFiles.net – some stats
daily queries
[source: FindFiles.net]
• 400 Mio data files, 20 Mio hosts crawled
⊲ 10 Million mp3 files
⊲ 10 000 apps for Symbian/Android smartphones
⊲ ...
21
blogs, legal issues & financing
blog & press coverage
http://www.findfiles.net/publicrelations
copyright & illegal files
• files protected by copyright/licence are not indexed (nofollow)
• links to pirate files removed from index
financing
• network – Unibator
• banks are cautious – most startups fail
22
Science with Data Files
23
the Wikipedia/DMOZ corpus
all outgoing links of
• Wikipedia (all languages)
• DMOZ – open directory project (all languages)
⊲ 7.7 Mio hosts (domains)
⊲ 252 Mio data files (FindFiles.net crawler)
analysis of file size distribution
• tails & scaling behaviour
24
number of files per domain
files per host vs. in-degree
• most files hosted on small domains
25
file size distribution
number of files of given size
[Figure: log(number of files) vs. file size [Bytes], 10 B to 10 G; curves: all Mime categories, application/, audio/, image/, text/, video/]
• 252 Mio files in total – 9 orders of magnitude
26
power-law scaling of image-size distribution
• compression gif: lossless; jpeg: lossy
[Figure: log(number of files) vs. file size [Bytes], 1 K to 1 G; legend: all Mime categories; Mime type image/jpeg; linear fit, slope -2; linear fit, slope -4; Mime type image/gif; linear fit, slope -2.45]
• kink at 4 Mbytes: amateur – professional
27
lognormal multimedia size distribution
all audio and video Mime types
[Figure: log(number of files) vs. file size [Bytes], 1 K to 10 G; legend: all Mime categories; Mime category video/ with quadratic fit (lognormal distribution); Mime category audio/ with quadratic fit (lognormal distribution)]
• quadratic fit – lognormal distribution
28
lognormal distribution vs. powerlaw scaling
file-size distribution p(s)
$e^{-[\log(s)-\mu]^2/\sigma^2}$ (lognormal) vs. $s^{-\alpha}$ (power law)
not a Taylor-series correction
$\log(p(s)) \propto \alpha \log(s) - \beta \log^2(s)$
⊲ images: α < 0, β = 0.
⊲ audio/video: α > 0, β > 0
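The link between the lognormal form and the quadratic expression in log(s) follows from expanding the square in the exponent. The short derivation below is a sketch; it assumes a pure lognormal exponent of the form $-[\log(s)-\mu]^2/\sigma^2$ and identifies the coefficients by comparison.

```latex
\log p(s) = -\frac{[\log(s)-\mu]^2}{\sigma^2} + \text{const}
          = \frac{2\mu}{\sigma^2}\,\log(s) - \frac{1}{\sigma^2}\,\log^2(s) + \text{const}'
\quad\Longrightarrow\quad
\alpha = \frac{2\mu}{\sigma^2}, \qquad \beta = \frac{1}{\sigma^2}
```

Under this assumption a lognormal with $\mu > 0$ gives $\alpha > 0$ and $\beta > 0$, matching the audio/video case, while a pure power law has $\beta = 0$ and a negative coefficient of $\log(s)$, matching the image case.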
[Figure: image file-size distributions (image/jpeg, image/gif) with linear fits of slope -2, -4 and -2.45; log(number of files) vs. file size [Bytes]]
[Figure: video/ and audio/ file-size distributions with quadratic fits (lognormal distribution); log(number of files) vs. file size [Bytes]]
29
one vs. two-dimensional cost functions
economical cost functions for data production
• size
⊲ storage costs
⊲ production costs
psychophysical cost functions for data production
• size (images)
⊲ time needed to take an image is independent of resolution
• size and time (audio & video)
⊲ time and resolution are psychophysically distinct variables
30
Weber-Fechner law
• neuropsychological cost functions are logarithmic in
⊲ sensory stimulus intensity
⊲ number of objects
⊲ time perception
music: tone pitch ∝ log(frequency) (octave)
photometry: brightness ∝ log(intensity) (lumen)
acoustics: sound level ∝ log(intensity) [decibel]
⊲ information production: number of objects / time
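The logarithmic scales named above can be made concrete with two small helpers. This is a minimal sketch; the function names are hypothetical, and the reference intensity of 1e-12 W/m^2 is the conventional hearing threshold, which is not given on the slide.

```python
import math

def sound_level_db(intensity, reference=1e-12):
    """Sound level in decibel for an intensity in W/m^2, relative to the
    conventional hearing-threshold reference of 1e-12 W/m^2 (assumption)."""
    return 10.0 * math.log10(intensity / reference)

def octaves_between(freq_low, freq_high):
    """Number of octaves between two frequencies (one octave = factor 2)."""
    return math.log2(freq_high / freq_low)

print(sound_level_db(1e-10))      # intensity 100 times the reference
print(octaves_between(220, 880))  # frequency ratio of 4
```

A hundredfold increase in intensity adds only 20 decibel, and a fourfold increase in frequency spans two octaves: equal ratios of the stimulus map to equal steps of the perceived quantity.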
31
information entropy
Shannon information entropy
$$-\int p(s)\log(p(s))\,ds, \qquad \int p(s)\,ds = 1$$
for a distribution function p(s)
• a measure for the information content
Shannon coding theorem
The minimal number of bytes needed to encode a transmission is given by the information entropy of the signal statistics
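For a discrete distribution the integral becomes a sum, which makes the entropy easy to compute. The sketch below uses the base-2 logarithm, so the result is in bits per symbol (divide by 8 for bytes); the function name `shannon_entropy_bits` is hypothetical.

```python
import math

def shannon_entropy_bits(probabilities):
    """Shannon entropy -sum(p * log2(p)) of a discrete distribution, in bits.
    Outcomes with zero probability contribute nothing to the sum."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy_bits([0.25, 0.25, 0.25, 0.25]))  # four equally likely symbols
print(shannon_entropy_bits([0.5, 0.25, 0.25]))         # a skewed distribution
```

Four equally likely symbols need 2 bits each on average, whereas the skewed distribution needs only 1.5 bits, since frequent symbols can be given shorter codes.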
32
neuropsychological cost functions
conditional entropy maximization
$$\delta\left[ -\int p(s)\log(p(s))\,ds - \lambda \int p(s)\,c(s)\,ds \right] = 0$$
Shannon information entropy: $-\int p(s)\log(p(s))\,ds$
cost function: $c(s)$
file size distribution: $p(s)$
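entropy-maximizing file size distributions
$$p(s) \propto e^{-\lambda c(s)} \sim \begin{cases} \text{exponential} & c(s) \propto s & \text{physical} \\ \text{power law} & c(s) \propto \log(s) & \text{1-dim neuro} \\ \text{lognormal} & c(s) \propto \log^2(s) & \text{2-dim neuro} \end{cases}$$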
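The step from the variational condition to the exponential form can be spelled out as follows. This is a sketch of the standard calculation; the second Lagrange multiplier $\nu$, which enforces the normalization $\int p(s)\,ds = 1$, is introduced here and does not appear on the slide.

```latex
\frac{\delta}{\delta p(s)}\left[ -\int p\log(p)\,ds - \lambda\int p\,c\,ds - \nu\int p\,ds \right]
  = -\log(p(s)) - 1 - \lambda\,c(s) - \nu = 0
\quad\Longrightarrow\quad
p(s) = e^{-1-\nu}\, e^{-\lambda c(s)} \propto e^{-\lambda c(s)}

c(s) \propto s:          \quad p(s) \propto e^{-\lambda s}
c(s) \propto \log(s):    \quad p(s) \propto e^{-\lambda \log(s)} = s^{-\lambda}
c(s) \propto \log^2(s):  \quad p(s) \propto e^{-\lambda \log^2(s)}
```

A cost linear in the file size thus gives an exponential distribution, a logarithmic cost gives a power law with exponent $\lambda$, and a cost quadratic in $\log(s)$ gives a lognormal distribution.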
33
physical vs. neuropsychological cost functions
[Figure: image file-size distributions (image/jpeg, image/gif) with linear fits of slope -2, -4 and -2.45; log(number of files) vs. file size [Bytes]]
[Figure: video/ and audio/ file-size distributions with quadratic fits (lognormal distribution); log(number of files) vs. file size [Bytes]]
images
physical: exponential [not seen]
1-dim neuro: power law [linear]
audio/video
physical: exponential [not seen]
2-dim neuro: lognormal [quadratic]
34
global human data production
basic assumptions
• information production as underlying driving force
⊲ information entropy as a suitable measure
• law of large numbers
⊲ average over production processes / producing agents
⊲ compression/technology correspond to rescaling
data production on a global level is characterized by neuropsychological cost functions and not by economic constraints
35
the Internet & complex system theory
complex system theory – still an emergent field
⊲ many models and paradigms yet to be formulated
⊲ network theory / game theory / allocation problems
⊲ macroecology / systems biology / cognitive systems theory
⊲ ...
• information entropy maximization
⊲ human data production on a global level
⊲ neuropsychological cost functions
⊲ ...
36
graduate level textbook
• Information theory and complexity
• Phase transitions and self-organized criticality
• Life at the edge of chaos and punctuated equilibrium
• Cognitive system theory and diffusive emotional control
second edition 2010
37