How do I develop and found a search engine? Files and file search in the Internet from the perspective of FindFiles.net Claudius Gros Institute for Theoretical Physics Goethe University Frankfurt, Germany http://www.findfiles.net 1


Post on 17-Jun-2020


Page 1: How do I develop and found a search engine? Files …gros/TALKS/WebScience11/...How do I develop and found a search engine? Files and file search in the Internet from the perspective

How do I develop and found a search engine?

Files and file search in the Internet

from the perspective of FindFiles.net

Claudius Gros

Institute for Theoretical Physics

Goethe University Frankfurt, Germany

http://www.findfiles.net


overview

data in the Internet

– Mime types and statistics

– the file search engine FindFiles.net

science with data files

– neuropsychological constraints to human data production


Internet – Statistics


Internet hosts

Hosts – Domains – Sites

[source: Netcraft.com]

• 2011: ∼ 100 million active domains


Internet users – email

users in 2010

• 2 · 10^9 worldwide

• 825 · 10^6 Asia

• 475 · 10^6 Europe

• 266 · 10^6 North America

emails in 2010

• 107 · 10^12 – number of emails sent

• 89% – share of spam emails (success rate: 1 in 12 million?)

• 1.9 · 10^9 – number of email users

[source: Pingdom.com]


Internet – social media

2010

• 152 · 10^6 – number of blogs

• 25 · 10^9 – number of tweets on Twitter

• 600 · 10^6 – Facebook accounts

⊲ 30 · 10^9 – pieces of content: links, notes, images, ...

⊲ 20 · 10^6 – number of activated apps (per day)

[source: Pingdom.com]


social media – images and videos

streaming videos – tubes

• 2 · 10^9 – videos watched per day on YouTube (one per Internet user)

• 35 – hours of video uploaded (every minute)

images, pictures

• 5 · 10^9 – photos hosted by Flickr

• 3000 – photos uploaded (per minute)

[source: Pingdom.com]


social media – blogs & bookmarking

blogs are everywhere

• 2010: 50–100 · 10^6 blogs

social is everything

• Digg, Mister Wong, Delicious, ...

• social shopping, ...

http://www.delicious.com


Internet – rules of thumb

2010

• 1 movie per day per user (Youtube)

• 1 search query per day for every 2 users (Google)

• every Internet user uses email

• 5 ‘true’ emails per day per Internet user

• 25% of Internet users use novel social media

• most domains are blogs


Internet Startups


slow beginnings for startups in the Internet

Twitter

• linear scale – exponential or linear growth?

⊲ Apr 2011 – 155 million daily tweets


growth is generically not exponential

Google

[Figure: Google yearly revenues (billions, US-$) vs. year, 2000–2010]

• linear scale


Internet: the winner takes all

flow of attention in complex networks

[Figure: link network with nodes www.big.com, www.medium.com, www.small.com, www.small.org, www.small.net, www.small.de]

• in-degree distribution $p_k$

⊲ heavy tails

• preferential attachment
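Preferential attachment can be illustrated with a toy growth model. The sketch below is an assumption-laden illustration, not the model used in the talk: every new node adds exactly one link, and the target is picked with probability proportional to (in-degree + 1). The function name `grow_network` and all parameters are hypothetical.

```python
import random
from collections import Counter

def grow_network(n_nodes, seed=0):
    """Grow a directed network by preferential attachment.

    Each new node adds one outgoing link. The target is chosen with
    probability proportional to (in-degree + 1), so that nodes without
    any incoming links can still be picked.
    """
    rng = random.Random(seed)
    in_degree = [0]          # start with a single node
    # 'urn' holds one entry per node plus one entry per received link,
    # so a uniform draw from it is proportional to (in-degree + 1).
    urn = [0]
    for new_node in range(1, n_nodes):
        target = rng.choice(urn)
        in_degree[target] += 1
        urn.append(target)       # the target gained one link
        in_degree.append(0)
        urn.append(new_node)     # the new node enters with weight 1
    return in_degree

in_degree = grow_network(20000)
histogram = Counter(in_degree)   # k -> number of nodes with in-degree k
mean_k = sum(in_degree) / len(in_degree)
print("mean in-degree:", round(mean_k, 3))
print("largest in-degree:", max(in_degree))
print("nodes with no incoming link:", histogram[0])
```

Most nodes end up with no incoming link while a few early nodes collect many, which is the heavy-tail behaviour the slide refers to. The toy model is not tuned to reproduce the exponent measured for the Internet.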


in-degree distribution

power law – scale invariant

[Figure: log(number of hosts) vs. number of incoming links, 10^0 to 10^6; data and linear fit with slope -2.2: 7.51 - 2.2*x]

[source: FindFiles.net]

• scaling constant for 20 years – starting at one!
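A power law appears as a straight line in a log-log plot, so its exponent can be read off from a linear fit. The following minimal sketch assumes base-10 logarithms on both axes and uses synthetic points generated from the fit quoted on the slide (7.51 - 2.2*x), not the FindFiles.net data; `fit_line` is a hypothetical helper.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic counts following the fit quoted on the slide:
# log10(number of hosts) = 7.51 - 2.2 * log10(number of incoming links)
links = [10 ** e for e in range(0, 7)]
hosts = [10 ** (7.51 - 2.2 * math.log10(k)) for k in links]

log_links = [math.log10(k) for k in links]
log_hosts = [math.log10(h) for h in hosts]
intercept, slope = fit_line(log_links, log_hosts)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

With real crawl data the counts would first be binned; the fit itself stays the same.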


limiting diverging in-degree distribution

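$p_k \propto \frac{1}{k^{\alpha}}, \qquad \langle k\rangle \propto \int \frac{k}{k^{\alpha}}\,dk \sim \left.\frac{k^{2-\alpha}}{2-\alpha}\right|^{K_c}$

• diverging mean in-degree: $\lim_{\alpha\to 2} \langle k\rangle \to \infty$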

⊲ Internet: α ≈ 1.9−2.2

⊲ limiting dominating tail

» limiting winners take all «

• makes life difficult for small startups
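The role of the cutoff K_c can be checked numerically. The sketch below is an illustration under stated assumptions: a discrete distribution p_k proportional to k^(-α) on k = 1..K_c, with `mean_in_degree` as a hypothetical helper. For α at or below 2 the mean keeps growing as the cutoff is raised; for clearly larger α it settles to a finite value.

```python
def mean_in_degree(alpha, cutoff):
    """Mean of k under p_k proportional to k**(-alpha), k = 1..cutoff."""
    norm = sum(k ** (-alpha) for k in range(1, cutoff + 1))
    first_moment = sum(k ** (1.0 - alpha) for k in range(1, cutoff + 1))
    return first_moment / norm

# The value 3.5 is included only as a contrast case with a convergent mean.
for alpha in (1.9, 2.0, 2.2, 3.5):
    row = [round(mean_in_degree(alpha, 10 ** e), 2) for e in (2, 3, 5)]
    print(alpha, row)
```

Each printed row lists the mean for cutoffs 10^2, 10^3 and 10^5.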


the big two uphill fights

a new Internet startup needs to ...

• fight for attention

• fight for novelty

traffic and quality

heavy-tailed in-degree distribution makes it difficult to attract traffic

extremely high service standards act as effective entry barriers


FindFiles.net – a new file search engine


public data on the Internet

280 Million domains in 2011

⊲ 10-30 data files per domain

Internet Media type – Mime type

• categorization of all file types

⊲ email attachments

⊲ browser add-ons

⊲ ∼ 600 Mime types in use
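As a small illustration of how a file name maps to a Mime type, the sketch below uses Python's standard `mimetypes` module, which guesses the type from the file extension only. The file names are hypothetical, and nothing here is claimed about how the FindFiles.net crawler determines types (a crawler could equally rely on the HTTP Content-Type header).

```python
import mimetypes

# Hypothetical file names, chosen only for illustration.
for name in ("paper.pdf", "song.mp3", "photo.jpg", "notes.txt"):
    mime_type, encoding = mimetypes.guess_type(name)
    print(f"{name:10s} -> {mime_type}")
```

The part before the slash (application, audio, image, text, ...) is the Mime category.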


Mime types

major Mime categories

• together: 99%

33.2% application/

2.9% audio/

58.0% image/

5.1% text/

0.7% video/

Mime types – examples

application/pdf                            audio/mpeg
application/msword                         audio/midi
application/vnd.android.package-archive    chemical/x-pdb
application/vnd.ms-powerpoint              image/jpeg
application/jar                            image/vnd.djvu
application/x-deb                          text/xml
application/x-gzip                         model/vrml
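Category shares like the percentages above follow from counting the part of each Mime type before the slash. The sketch below is a minimal illustration: `category_shares` is a hypothetical helper, and the sample is just the list of example types from the slide with one entry each, so the resulting shares are not crawl statistics.

```python
from collections import Counter

def category_shares(mime_types):
    """Return {top-level category: percentage share} for a list of Mime types."""
    categories = Counter(t.split("/", 1)[0] for t in mime_types)
    total = sum(categories.values())
    return {cat: 100.0 * n / total for cat, n in categories.items()}

# The example types listed on the slide, one entry each (not real crawl data).
sample = [
    "application/pdf", "application/msword",
    "application/vnd.android.package-archive",
    "application/vnd.ms-powerpoint", "application/jar",
    "application/x-deb", "application/x-gzip",
    "audio/mpeg", "audio/midi", "chemical/x-pdb",
    "image/jpeg", "image/vnd.djvu", "text/xml", "model/vrml",
]
shares = category_shares(sample)
for category, share in sorted(shares.items()):
    print(f"{share:5.1f}% {category}/")
```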


FindFiles.net

search engine for data files

G. Kaczor & C. Gros 2011

• supports all Mime types

http://www.findfiles.net


FindFiles.net – some stats

daily queries

[source: FindFiles.net]

• 400 million data files, 20 million hosts crawled

⊲ 10 million mp3 files

⊲ 10 000 apps for Symbian/Android smartphones

⊲ ...


blogs, legal issues & financing

blog & press coverage

http://www.findfiles.net/publicrelations

copyright & non-legal files

• files protected by copyright/licence are not indexed (nofollow)

• links to pirate files removed from index
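One way a crawler can honour a "nofollow" marking is to skip links carrying `rel="nofollow"`. The sketch below assumes that this link attribute is what the slide's "(nofollow)" refers to; it is not the FindFiles.net crawler, and `LinkCollector` and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags, skipping rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing.
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if href and "nofollow" not in rel_values:
            self.links.append(href)

# Hypothetical page content, for illustration only.
page = """
<a href="http://example.org/data.csv">open data</a>
<a rel="nofollow" href="http://example.org/album.mp3">protected</a>
"""
collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Only the first link is collected; the second is left out of the index because of its `rel` attribute.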

financing

• network – Unibator

• banks are cautious – most startups fail


Science with Data Files


the Wikipedia/DMOZ corpus

all outgoing links of

• Wikipedia (all languages)

• DMOZ – open directory project (all languages)

⊲ 7.7 million hosts (domains)

⊲ 252 million data files (FindFiles.net crawler)

analysis of file size distribution

• tails & scaling behaviour


number of files per domain

files per host vs. in-degree

• most files hosted on small domains


file size distribution

number of files of given size

[Figure: log(number of files) vs. file size [Bytes], 10 B to 10 G; curves: all Mime categories and the Mime categories application/, audio/, image/, text/, video/]

• 252 million files in total – 9 orders of magnitude
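A distribution spanning many orders of magnitude is usually counted in logarithmic bins. The sketch below is one possible way to do this, assuming one bin per decade of file size; `log_binned_counts` is a hypothetical helper and the file sizes are synthetic, not crawl data.

```python
import math
import random
from collections import Counter

def log_binned_counts(sizes):
    """Count files per decade of file size: bin d covers [10**d, 10**(d+1)) bytes."""
    return Counter(int(math.floor(math.log10(s))) for s in sizes if s >= 1)

# Synthetic file sizes (not crawl data): log10(size) drawn uniformly
# between 1 and 9, i.e. sizes between 10 B and 1 GB.
rng = random.Random(1)
sizes = [10 ** rng.uniform(1, 9) for _ in range(10000)]

counts = log_binned_counts(sizes)
for decade in sorted(counts):
    print(f"10^{decade} B: log10(count) = {math.log10(counts[decade]):.2f}")
```

Plotting log(count) against the decade gives the kind of curve shown on the slide.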


power-law scaling of image-size distribution

• compression: gif is lossless, jpeg is lossy

[Figure: log(number of files) vs. file size [Bytes], 1 K to 1 G; legend: all Mime categories; Mime type image/jpeg; linear fit, slope -2; linear fit, slope -4; Mime type image/gif; linear fit, slope -2.45]

• kink at 4 Mbytes: amateur – professional


lognormal multimedia size distribution

all audio and video Mime types

[Figure: log(number of files) vs. file size [Bytes], 1 K to 10 G; legend: all Mime categories; Mime category video/ with quadratic fit (lognormal distribution); Mime category audio/ with quadratic fit (lognormal distribution)]

• quadratic fit – lognormal distribution


lognormal distribution vs. powerlaw scaling

file-size distribution $p(s)$

lognormal: $e^{-[\log(s)-\mu]^2/\sigma^2}$, power law: $s^{-\alpha}$

not a Taylor-series correction

$\log(p(s)) \propto \alpha\log(s) - \beta\log^2(s)$

⊲ images: $\alpha < 0$, $\beta = 0$

⊲ audio/video: $\alpha > 0$, $\beta > 0$

[Figures repeated: image file-size distribution (image/jpeg, image/gif) with linear fits; audio/video file-size distribution with quadratic fits (lognormal distribution)]


one vs. two-dimensional cost functions

economical cost functions for data production

• size

⊲ storage costs

⊲ production costs

psychophysical cost functions for data production

• size (images)

⊲ time needed to take an image is independent of resolution

• size and time (audio & video)

⊲ time and resolution are psychophysically distinct variables


Weber-Fechner law

• neuropsychological cost functions are logarithmic in

⊲ sensory stimulus intensity

⊲ number of objects

⊲ time perception

music: tone pitch ∝ log(frequency) (octave)

photometry: brightness ∝ log(intensity) (lumen)

acoustics: sound level ∝ log(intensity) [decibel]

⊲ information production: number of objects / time
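Two of the logarithmic scales above can be written down directly. The sketch below uses the standard definitions of the decibel (10 · log10 of an intensity ratio) and the octave (log2 of a frequency ratio); the function names and the sample values are chosen for illustration only.

```python
import math

def sound_level_db(intensity, reference_intensity):
    """Sound level in decibel: 10 * log10(I / I0)."""
    return 10.0 * math.log10(intensity / reference_intensity)

def pitch_interval_octaves(frequency, reference_frequency):
    """Pitch interval in octaves: log2(f / f0)."""
    return math.log2(frequency / reference_frequency)

# Each tenfold increase in intensity adds the same 10 dB,
# each doubling of frequency adds the same one octave.
print(sound_level_db(10.0, 1.0), sound_level_db(100.0, 1.0))
print(pitch_interval_octaves(880.0, 440.0), pitch_interval_octaves(1760.0, 440.0))
```

Equal ratios of the stimulus map to equal steps of the perceived quantity, which is the content of the Weber-Fechner law.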


information entropy

Shannon information entropy

$-\int p(s)\,\log(p(s))\,ds, \qquad \int p(s)\,ds = 1$

for a distribution function p(s)

• a measure for the information content

Shannon coding theorem

The minimal number of bytes needed to encode a transmission is given by the information entropy of the signal statistics
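For a discrete set of symbols the entropy integral becomes a sum. The sketch below is the discrete analogue, given in bits per symbol (8 bits per byte); `shannon_entropy_bits` is a hypothetical helper and the probability lists are illustrative.

```python
import math

def shannon_entropy_bits(probabilities):
    """Shannon entropy -sum(p * log2(p)) of a discrete distribution, in bits."""
    assert abs(sum(probabilities) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy_bits([0.5, 0.5]))        # fair coin
print(shannon_entropy_bits([0.9, 0.1]))        # biased coin, lower entropy
print(shannon_entropy_bits([0.25] * 4))        # four equally likely symbols
```

The more predictable the signal, the lower its entropy and the fewer bits an optimal code needs per symbol.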


neuropsychological cost functions

conditional entropy maximization

$\delta\left[-\int p(s)\,\log(p(s))\,ds - \lambda\int p(s)\,c(s)\,ds\right] = 0$

Shannon information entropy: $-\int p(s)\,\log(p(s))\,ds$

cost function: $c(s)$

file size distribution: $p(s)$

maximal file size distributions

$p(s) \propto e^{-\lambda c(s)} \sim$

• exponential: $c(s) \propto s$ (physical)

• power law: $c(s) \propto \log(s)$ (1-dim neuro)

• lognormal: $c(s) \propto \log^2(s)$ (2-dim neuro)
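The three cases can be told apart by how log p(s) depends on log(s). The sketch below evaluates the unnormalized form p(s) ∝ exp(-λ c(s)) for each cost function; the value of λ, the proportionality constants (all set to 1) and the helper names are assumptions made for illustration.

```python
import math

LAMBDA = 2.0  # illustrative value of the Lagrange multiplier (assumption)

COSTS = {
    "exponential": lambda s: s,                    # c(s) ∝ s
    "power law":   lambda s: math.log(s),          # c(s) ∝ log(s)
    "lognormal":   lambda s: math.log(s) ** 2,     # c(s) ∝ log^2(s)
}

def log_p(cost, s):
    """Unnormalized log p(s) = -lambda * c(s)."""
    return -LAMBDA * cost(s)

def loglog_slopes(cost, sizes):
    """Slopes of log p(s) against log(s) between consecutive sizes."""
    return [
        (log_p(cost, b) - log_p(cost, a)) / (math.log(b) - math.log(a))
        for a, b in zip(sizes, sizes[1:])
    ]

sizes = [10.0 ** e for e in range(1, 6)]
for name, cost in COSTS.items():
    print(name, [round(x, 2) for x in loglog_slopes(cost, sizes)])
```

A constant slope corresponds to a straight line in the log-log plot (power law), a slope that changes by the same amount per decade corresponds to a parabola (lognormal), and the exponential case falls off far faster than either.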


physical vs. neuropsychological cost functions

[Figures repeated: image file-size distribution (image/jpeg, image/gif) with linear fits; audio/video file-size distribution with quadratic fits (lognormal distribution)]

images

• physical: exponential [not seen]

• 1-dim neuro: power law [linear]

audio/video

• physical: exponential [not seen]

• 2-dim neuro: lognormal [quadratic]


global human data production

basic assumptions

• information production as underlying driving force

⊲ information entropy as a suitable measure

• law of large numbers

⊲ average over production processes / producing agents

⊲ compression/technology correspond to rescaling

data production on a global level is characterized by neuropsychological cost functions and not by economic constraints


the Internet & complex system theory

complex system theory – still an emergent field

⊲ many models and paradigms yet to be formulated

⊲ network theory / game theory / allocation problems

⊲ macroecology / systems biology / cognitive systems theory

⊲ ...

• information entropy maximization

⊲ human data production on a global level

⊲ neuropsychological cost functions

⊲ ...


graduate level textbook

• Information theory and complexity

• Phase transitions and self-organized criticality

• Life at the edge of chaos and punctuated equilibrium

• Cognitive system theory and diffusive emotional control

second edition 2010
