jcdl2015: how well are arabic websites archived?

How Well Are Arabic Websites Archived?

Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle Old Dominion University

Department of Computer Science Norfolk, Virginia 23529 USA

JCDL 2015 Knoxville, TN

June 21-25, 2015

Archived events on English sites vs. Arabic sites

Search: Baltimore (one week old)
http://www.foxnews.com/us/2015/05/26/2-shot-dead-in-bloody-memorial-day-weekend-in-baltimore-capping-off-deadliest/

Search: Yemen Houthis (one week old)
http://www.yemenakhbar.com/yemen-news/178683.html

English sports websites are more archived than Arabic ones

www.espn.go.com vs. www.kooora.com

English e-marketing websites are more archived than Arabic ones

www.amazon.com vs. www.haraj.com.sa

English encyclopedia websites are more archived than Arabic ones

en.wikipedia.org vs. ar.wikipedia.org

Top ten languages in the Internet

World Language Map Source: Quick Maps of the World immigration - http://www.allcountries.org/maps/world_language_maps.html

Source: Internet World Stats - http://www.internetworldstats.com/stats7.htm

Arabic-speaking Internet users

                     2009                                          2013
#   Country          Population      Internet Users  Penetration   Population      Internet Users  Penetration
1   Algeria          34,178,188      4,100,000       12.00%        38,813,722      6,404,264       16.50%
2   Bahrain          728,709         402,900         55.30%        1,314,089       1,182,680       90.00%
3   Comoros          752,438         23,000          3.10%         766,865         49,846          6.50%
4   Djibouti         724,622         19,200          2.60%         810,179         76,967          9.50%
5   Egypt            78,866,635      16,636,000      21.10%        86,895,099      43,065,211      49.60%
6   Iraq             28,945,569      300,000         1.00%         32,585,692      2,997,884       9.20%
7   Jordan           6,269,285       1,595,200       25.40%        6,528,061       2,885,403       44.20%
8   Kuwait           2,692,526       1,000,000       37.10%        2,742,711       2,069,650       75.50%
9   Lebanon          4,017,095       945,000         23.50%        4,136,895       2,916,511       70.50%
10  Libya            6,324,357       323,000         5.10%         6,244,174       1,030,289       16.50%
11  Mauritania       3,129,486       60,000          1.90%         3,516,806       218,042         6.20%
12  Morocco          31,285,174      10,442,500      33.40%        32,987,206      18,472,835      56.00%
13  Oman             3,418,085       557,000         16.30%        3,219,775       2,139,540       66.40%
14  Qatar            833,285         436,000         52.30%        2,123,160       1,811,055       85.30%
15  Saudi Arabia     28,686,633      7,761,800       27.10%        27,345,986      16,544,322      60.50%
16  Somalia          9,832,017       102,000         1.00%         10,428,043      156,420         1.50%
17  South Sudan      -               -               -             11,562,695      100             0.00%
18  Sudan            41,087,825      4,200,000       10.20%        35,482,233      8,054,467       22.70%
19  Syria            21,762,978      3,565,000       16.40%        22,597,531      5,920,553       26.20%
20  Tunisia          10,486,339      3,500,000       33.40%        10,937,521      4,790,634       43.80%
21  UAE              4,798,491       3,558,000       74.10%        9,206,000       8,101,280       88.00%
22  Palestine        2,461,267       355,500         14.40%        2,731,052       1,512,273       55.40%
23  Yemen            22,858,238      370,000         1.60%         26,052,966      5,210,593       20.00%

    Arabic Total     344,139,242     60,252,100      17.50%        379,028,461     135,610,819     35.8%
    World Total      6,767,805,208   1,802,330,457   26.6%         7,181,858,619   2,802,478,934   39.0%

Source: http://www.internetworldstats.com/stats19.htm





2009: Arabic total = 17.5%, world total = 26.6%


Why are we doing this?

•  The number of Arabic-speaking Internet users has grown rapidly
•  There has been previous work on the coverage of web archives
•  Little has been done in terms of Arabic-language content

How Much of the Web Is Archived?

•  Sampled URIs from four different sources (DMOZ, Delicious, Bitly, search engine indexes)
•  The archival percentages ranged from 16% to 79%
•  A follow-on study in 2013 found that archival percentages had increased, ranging from 33% to 95%
•  These studies did not focus on content from specific countries or in specific languages

A fair history of the Web? Examining country balance in the Internet Archive

•  Examined country balance in the Internet Archive:

Country     Domain     Archived
US          .com       92%
Taiwan      .com.tw    73%
China       .com.cn    58%
Singapore   .com.sg    73%

•  This work focused on TLD rather than content language or location

Characterization of National Web Domains

•  Used 10 national web domains
   - 120 million pages
   - 24 countries
   - They studied page sizes, degrees, link-based scores, etc.
   - They found that depth and response code were similar
•  In this work, additional methods are required to determine if a site belongs to a particular country

Characterizing a National Community Web

•  Used a Portuguese dataset:
   - (.pt) ccTLD
   - (.com, .net, .org, .tv) pages in the Portuguese language that have at least one incoming link from the (.pt) ccTLD
•  They identify, collect, and characterize the Portuguese Web

How do we classify Arabic websites?

Four categories: GeoIP only, ccTLD only, Both, Neither

•  GeoIP only. News: al-watan.com; ccTLD: not Arabic (.com); GeoIP: Arabic country (Qatar)
•  ccTLD only. E-Marketing: haraj.com.sa; ccTLD: Arabic (.sa); GeoIP: not an Arabic country (Ireland)
•  Both. Educational: uoh.edu.sa; ccTLD: Arabic (.sa); GeoIP: Arabic country (SA)
•  Neither. News: alarabiya.net; ccTLD: not Arabic (.net); GeoIP: not an Arabic country (US)
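The four categories above can be written as a small decision function. This is only an illustrative sketch: the name `classify` and the set `ARABIC_CODES` are assumptions introduced here (the set is built from the 23 countries in the Internet-user table, whose ccTLDs coincide with their two-letter country codes), and the GeoIP country is passed in as a country code instead of being looked up.

```python
# Assumed list: two-letter codes of the 23 countries in the Internet-user table.
ARABIC_CODES = {
    "dz", "bh", "km", "dj", "eg", "iq", "jo", "kw", "lb", "ly", "mr", "ma",
    "om", "qa", "sa", "so", "ss", "sd", "sy", "tn", "ae", "ps", "ye",
}


def classify(hostname, geoip_country):
    """Return 'Both', 'ccTLD only', 'GeoIP only', or 'Neither'.

    hostname      -- e.g. "haraj.com.sa"
    geoip_country -- two-letter country code of the hosting IP, e.g. "IE"
    """
    # The ccTLD is the last dot-separated label of the hostname.
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    has_cctld = tld in ARABIC_CODES
    has_geoip = geoip_country.lower() in ARABIC_CODES
    if has_cctld and has_geoip:
        return "Both"
    if has_cctld:
        return "ccTLD only"
    if has_geoip:
        return "GeoIP only"
    return "Neither"


for host, country in [("al-watan.com", "QA"), ("haraj.com.sa", "IE"),
                      ("uoh.edu.sa", "SA"), ("alarabiya.net", "US")]:
    print(host, "->", classify(host, country))
```

Run on the four slide examples, the function reproduces the category each one illustrates.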


Selecting seed URIs

Name      Registered     Year   URI                     Count
DMOZ      US             1999   Dmoz.org/world/arabic   4,086
Raddadi   Saudi Arabia   2000   Raddadi.com             3,271
Star28    Lebanon        2004   Star28.com              8,386
Total                                                   15,743

•  15,092 unique seed URIs
•  11,014 URIs that existed on the live web

Determining a webpage language

•  HTTP header Content-Language
•  HTML title tag language
•  Trigram method
•  Language detection API client

HTTP header Content-Language example#1

> curl -I www.alquds.com
HTTP/1.1 200 OK
Server: nginx/1.6.2
Date: Wed, 03 Jun 2015 19:11:31 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Powered-By: PHP/5.3.3
X-Drupal-Cache: HIT
Etag: "1433361507-0"
Content-Language: ar
…

> curl -I www.raddadi.com
HTTP/1.1 200 OK
Server: nginx/1.8.0
Date: Sat, 06 Jun 2015 22:47:09 GMT
Content-Type: text/html
Connection: keep-alive
…

> curl www.raddadi.com
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar" >
<head>
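The header check from the curl examples can be sketched in Python. This is a minimal illustration, not the authors' script: the function name `content_language` is hypothetical, and the input is the raw header text as printed by `curl -I`, so no network access is needed.

```python
from email.parser import Parser


def content_language(raw_headers):
    """Return the language tags from a raw HTTP response header block.

    An empty list means the server sent no Content-Language header,
    as in the www.raddadi.com example.
    """
    # Drop the status line ("HTTP/1.1 200 OK"); the rest parses as
    # RFC 822 style "Name: value" header lines.
    _, _, header_block = raw_headers.partition("\n")
    headers = Parser().parsestr(header_block, headersonly=True)
    value = headers.get("Content-Language", "")  # lookup is case-insensitive
    return [tag.strip().lower() for tag in value.split(",") if tag.strip()]


alquds = (
    "HTTP/1.1 200 OK\n"
    "Server: nginx/1.6.2\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Language: ar\n"
)
raddadi = (
    "HTTP/1.1 200 OK\n"
    "Server: nginx/1.8.0\n"
    "Content-Type: text/html\n"
)
print(content_language(alquds))
print(content_language(raddadi))
```

When the header is missing, a fallback such as the `lang` attribute or the title-based test is needed.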


HTML title tag language

> curl www.star28.com
…
<META name="Copyright" content="© 2011 www.star28.com">
<META name="DISTRIBUTION" content="GLOBAL">
<META name="REVISIT-AFTER" content="1 DAYS">
<TITLE> دليل العرب الشامل </TITLE>
<META name="description" content=" دليل للمواقع العربية و أفضل المواقع العالمية, يحدث باستمرار ">
<META name="keywords" content=" دليل مواقع, تجارة, اخبار, اسلام, كومبيوتر, علوم, منتديات, رياضة, جافا سكربت, العاب, توظيف, زواج, تعليم, سياحة, تلفزيون, صحف ">
…

Then we use the guess-language Python library to determine the language
https://code.google.com/p/guess-language/


HTML title tag language example#1
https://code.google.com/p/guess-language/

> curl -s www.gulfup.com | grep -io "<title>[^<]*" | tail -c+8 > gulfup_title.txt
> python
>>> myfile = open("gulfup_title.txt", "r")
>>> data = myfile.read()
>>> from guess_language import guess_language
>>> guess_language(data)
'ar'

> curl -s www.cnn.com | grep -io "<title>[^<]*" | tail -c+8 > cnn_title.txt
> python
>>> myfile = open("cnn_title.txt", "r")
>>> data = myfile.read()
>>> from guess_language import guess_language
>>> guess_language(data)
'en'
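The grep and tail pipeline above pulls out the title text before it is passed to the language guesser. A standard-library sketch of the same step follows. The class `TitleParser` and the helpers `extract_title` and `arabic_ratio` are hypothetical names, and the Arabic-script ratio is only a crude stand-in for guess-language, not the test used in the study.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> also matches.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def arabic_ratio(text):
    """Fraction of alphabetic characters in the basic Arabic Unicode block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    arabic = [c for c in letters if "\u0600" <= c <= "\u06FF"]
    return len(arabic) / len(letters)


page = "<html><head><TITLE> دليل العرب الشامل </TITLE></head><body></body></html>"
title = extract_title(page)
print(title, arabic_ratio(title))
```

A ratio near 1.0 suggests an Arabic-script title, but it cannot separate Arabic from other languages written in the same script, which is one reason to combine several tests.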


Trigram method
https://github.com/decultured/Python-Language-Detector

•  Built in C++ and wrapped as a Python module
•  Identification is performed through basic trigram lookups paired with Unicode character set recognition
•  Accuracy is high even for short sample texts
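To illustrate the idea behind trigram lookups, here is a toy sketch that compares character-trigram counts by cosine similarity. It is not the Python-Language-Detector implementation: the function names, the tiny training sentences in `TRAINING`, and the similarity measure are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams, with padding spaces at the ends."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Hypothetical, very small training texts; real profiles use large corpora.
TRAINING = {
    "en": "the news of the day and the weather for the week in the world",
    "ar": "أخبار اليوم من العالم العربي و أحوال الطقس في هذا الأسبوع",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

print(identify("أخبار العالم", PROFILES))
print(identify("the world news", PROFILES))
```

With profiles this small the result is only reliable when the scripts differ, as they do here.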

Trigram method example#1

> curl www.raddadi.com > raddadi.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("raddadi.txt"))
>>> for script in soup(["script", "style"]):
...     script.extract()
...
>>> text = soup.get_text()
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import sys
>>> sys.path.append('languageDetector')
>>> import languageIdentifier
>>> languageIdentifier.load("languageDetector/trigrams/")
>>> print languageIdentifier.identify(text, 300, 300)
ar

> curl www.cnn.com > cnn.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("cnn.txt"))
>>> for script in soup(["script", "style"]):
...     script.extract()
...
>>> text = soup.get_text()
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import sys
>>> sys.path.append('languageDetector')
>>> import languageIdentifier
>>> languageIdentifier.load("languageDetector/trigrams/")
>>> print languageIdentifier.identify(text, 300, 300)
en


Language detection API client
https://detectlanguage.com

•  Returns detected language codes and scores
•  You have to set up your personal API key (http://detectlanguage.com)
•  Example of output:

{"data":{"detections":[{"language":"ar","isReliable":true,"confidence":9.54}]}}

•  language: language code
•  isReliable: false means that the confidence is low
•  confidence: depends on how much text you pass and how well it is identified
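A response in the format shown above can be reduced to a single language code. The following is a sketch under assumptions: `best_language` is a hypothetical helper, and the rule of keeping only reliable detections and taking the highest confidence is one plausible policy, not necessarily the one used in the study.

```python
import json


def best_language(response_text):
    """Return the most confident reliable language code, or None."""
    detections = json.loads(response_text)["data"]["detections"]
    # Discard detections the service itself flags as unreliable.
    reliable = [d for d in detections if d["isReliable"]]
    if not reliable:
        return None
    return max(reliable, key=lambda d: d["confidence"])["language"]


sample = ('{"data":{"detections":['
          '{"language":"ar","isReliable":true,"confidence":8.32},'
          '{"language":"tk","isReliable":false,"confidence":0.01}]}}')
print(best_language(sample))
```

Returning None when nothing is reliable lets the caller fall back to one of the other language tests.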

Language detection API client example#1

> curl www.raddadi.com > raddadi.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("raddadi.txt"))
>>> for script in soup(["script", "style"]):
...     script.extract()
...
>>> text = soup.get_text()
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import detectlanguage
>>> detectlanguage.configuration.api_key = "YOUR API KEY"
>>> detectlanguage.detect(text)
{"data":{"detections":[{"language":"ar","isReliable":true,"confidence":8.32},{"language":"tk","isReliable":false,"confidence":0.01}]}}

> curl www.cnn.com > cnn.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("cnn.txt"))
>>> for script in soup(["script", "style"]):
...     script.extract()
...
>>> text = soup.get_text()
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import detectlanguage
>>> detectlanguage.configuration.api_key = "YOUR API KEY"
>>> detectlanguage.detect(text)
{"data":{"detections":[{"language":"en","isReliable":true,"confidence":6.14}]}}


Language test intersection testing for Arabic language

[Venn diagram of the four language tests; per-test labels: ~41%, ~38%, ~36%, ~39%; one region labeled 872 (~8%)]

Total Arabic = 7,976

Crawling Arabic seed URIs

Unique: 663,443

Total Arabic URIs dataset = 7,976 + 292,670 = 300,646


17,536 unique domains

Rank   Domain              URIs   GeoIP    Category
1      Alarab.net          284    US       News
2      Aljarida.com        248    US       News
3      Arabic.cnn.com      245    US       News
4      Alarabiya.net       231    US       News
5      Ar.wikipedia.org    230    US       Encyclopedia
6      Aljazeera.net       213    US       News
7      Moheet.com          142    US       News
8      Facebook.com        133    US       Social
9      Al-sharq.com        132    US       Middle East Portal
10     Lakii.com           123    US       General Portal
17     Kuwaitclub.com.kw   71     Kuwait   Sport

•  The first Arabic GeoIP location is at rank 17
•  6 of the top 10 unique domains are news websites
•  Popular Western pages are in the top unique domains


TLD      Percent
com      57.97%
net      15.07%
org      6.40%
gov.sa   1.94%
info     1.68%
edu.sa   1.27%
ws       1.16%
org.sa   0.97%
com.sa   0.80%
gov.eg   0.80%
Other    11.94%

Almost 58% are .com

Small percentage of Arabic TLDs

TLD   Country                Percent
.sa   Saudi Arabia           5.33%
.eg   Egypt                  2.00%
.jo   Jordan                 2.00%
.ae   United Arab Emirates   1.06%
.kw   Kuwait                 0.82%

Path Depth   Example               Percent
0            Example.com           17.30%
1            Example.com/a         40.42%
2            Example.com/a/b       24.45%
3            Example.com/a/b/c     10.81%
4+           Example.com/a/b/c/d   7.02%

More than 57% are of depth 0 and 1
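Path depth as used in the table can be computed by counting the non-empty path segments of a URI. A minimal sketch follows; `path_depth` is a hypothetical helper name, and a scheme is prepended when missing so that scheme-less inputs such as Example.com/a parse as expected.

```python
from urllib.parse import urlsplit


def path_depth(uri):
    """Number of non-empty path segments: Example.com -> 0, Example.com/a/b -> 2."""
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to locate the host
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    return len(segments)


for uri in ["Example.com", "Example.com/a", "Example.com/a/b",
            "Example.com/a/b/c", "Example.com/a/b/c/d"]:
    depth = path_depth(uri)
    # The table groups everything at depth 4 or more into one "4+" row.
    print(uri, "4+" if depth >= 4 else depth)
```

Query strings and fragments are ignored by this sketch, since `urlsplit` keeps them out of the path.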

53.77% of Arabic URIs are archived

•  January-March 2015
•  ODU CS Memento Aggregator
•  Median=16

URI-R               Mementos   Category
gulfup.com          10,987     File Sharing
masrawy.com         9,144      Egyptian portal
arabic.cnn.com      9,022      News
aljazeera.net       8,906      News
maktoob.yahoo.com   8,478      Search Engine
shorooknews.com     7,548      News
arabnews.com        6,274      News
bbc.co.uk/arabic    6,268      News
ahram.org.eg        5,347      News
google.com.sa       4,968      Search Engine

Most of the top archived URI-Rs are news websites

Archiving has accelerated since 2011

March 2015

Two methods to determine the presence in each archive

1.  Percent of URI-Rs present in each archive
    e.g. http://aljazeera.net

2.  Percent of URI-Ms present in each archive
    e.g. http://wayback.archive-it.org/all/20070727215420/http://www.aljazeera.net/
    e.g. http://web.archive.org/web/20150618104846/http://aljazeera.net/
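The two example URI-Ms embed a 14-digit capture timestamp followed by the original URI-R. A sketch of pulling those two parts out is shown below. The helper `parse_urim` is hypothetical, and the pattern assumes the Wayback-style layout seen in these two examples; other archives may format their URI-Ms differently.

```python
import re
from datetime import datetime

# Assumed layout: .../<YYYYMMDDhhmmss>/<original URI>
URIM_PATTERN = re.compile(r"/(\d{14})/(https?://.+)$")


def parse_urim(urim):
    """Return (capture datetime, URI-R) for a Wayback-style URI-M, else None."""
    match = URIM_PATTERN.search(urim)
    if match is None:
        return None
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured, match.group(2)


print(parse_urim("http://web.archive.org/web/20150618104846/http://aljazeera.net/"))
print(parse_urim("http://wayback.archive-it.org/all/20070727215420/http://www.aljazeera.net/"))
```

Grouping parsed URI-Ms by their URI-R gives the per-resource memento lists that the later statistics rely on.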

Presence in each archive example

         Internet Archive   Archive.today   Webcitation   Total
URI-R1   2                  0               0             2
URI-R2   2                  0               0             2
URI-R3   1                  1               0             2
URI-R4   1                  1               0             2
URI-R5   0                  1               1             2
Total    6                  3               1             10

1- Percent of URI-Rs present in each archive

Archive            URI-Rs present   Percentage
Internet Archive   4/5=0.8          80%
Archive.today      3/5=0.6          60%
Webcitation        1/5=0.2          20%
Total                               160%

2- Percent of URI-Ms present in each archive
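Both methods can be checked against the five-URI-R example. The sketch below recomputes them from the memento counts in the example table; the function names are hypothetical, and the URI-M percentages it prints (6/10, 3/10, 1/10) are derived here from the column totals of that table.

```python
# Memento counts per URI-R and archive, copied from the example table.
COUNTS = {
    "URI-R1": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R2": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R3": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R4": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R5": {"Internet Archive": 0, "Archive.today": 1, "Webcitation": 1},
}
ARCHIVES = ["Internet Archive", "Archive.today", "Webcitation"]


def percent_urirs(counts):
    """Method 1: share of URI-Rs with at least one memento in each archive."""
    total = len(counts)
    return {a: 100 * sum(1 for row in counts.values() if row[a] > 0) / total
            for a in ARCHIVES}


def percent_urims(counts):
    """Method 2: share of all mementos (URI-Ms) held by each archive."""
    total = sum(sum(row.values()) for row in counts.values())
    return {a: 100 * sum(row[a] for row in counts.values()) / total
            for a in ARCHIVES}


print(percent_urirs(COUNTS))  # can sum to more than 100%
print(percent_urims(COUNTS))  # always sums to 100%
```

Method 1 can exceed 100% in total because one URI-R may be held by several archives, while method 2 partitions the mementos and therefore sums to 100%.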



Presence in each archive

1- Percent of URI-Rs present in each archive

Archive                     Percent
Internet Archive            97.04%
Archive.today               6.58%
Webcitation                 6.00%
Archive-It                  5.49%
British Library Archive     1.06%
UK Parliament Web Archive   0.88%
Icelandic Web Archive       0.87%
UK National Archives        0.62%
Proni                       0.21%
Stanford                    0.11%
Total                       118.86%

2- Percent of URI-Ms present in each archive

Archive                     Percent
Internet Archive            72.87%
Archive-It                  21.26%
Archive.today               2.14%
Webcitation                 2.08%
Icelandic Web Archive       1.17%
British Library Archive     0.29%
UK Parliament Web Archive   0.10%
Proni                       0.05%
UK National Archives        0.04%
Stanford                    <0.01%
Total                       100%

Average archiving period (days)

Average archiving period = (LM - FM) / number of mementos, where FM and LM are the datetimes of the first and last mementos

•  16,732 URIs have only one memento
•  Median=48 days
•  Values less than 1 indicate that the URI is archived multiple times per day
•  The larger the period, the more irregularly the URI was captured by the archives
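The formula can be worked through with a short sketch. The helper name `average_archiving_period` and the sample datetimes are made up for illustration; FM and LM are taken as the earliest and latest memento datetimes.

```python
from datetime import datetime, timedelta


def average_archiving_period(memento_datetimes):
    """(LM - FM) / number of mementos, expressed in days."""
    if not memento_datetimes:
        raise ValueError("at least one memento is required")
    first = min(memento_datetimes)  # FM
    last = max(memento_datetimes)   # LM
    span_days = (last - first) / timedelta(days=1)
    return span_days / len(memento_datetimes)


# Hypothetical URI with three mementos spread over 30 days.
mementos = [datetime(2015, 1, 1), datetime(2015, 1, 11), datetime(2015, 1, 31)]
print(average_archiving_period(mementos))

# A URI with a single memento has LM == FM, so the period is 0.
print(average_archiving_period([datetime(2015, 1, 1)]))
```

For the three-memento example the span is 30 days, giving 30 / 3 = 10 days per memento.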

Creation date for archived Arabic URIs

•  We used CarbonDate for the creation date estimate
•  18 years
•  2013 is the most frequent year

Source: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html


Top GeoIP locations

Location           Percent
United States      57.97%
Arabic Countries   10.53%
Germany            9.75%
Netherlands        5.29%
France             4.37%
Canada             3.31%
United Kingdom     3.07%
Other              5.71%

Top Arabic countries:

Location               Percent
Saudi Arabia           4.75%
Egypt                  1.97%
Jordan                 1.42%
Kuwait                 0.71%
United Arab Emirates   0.67%

Status of Arabic seed URIs

Seed Data Set
(Live, Indexed, Archived)   Percent
(1, 1, 1)                   43.34%    (Good) discovered and saved
(1, 1, 0)                   25.59%
(1, 0, 1)                   15.27%
(1, 0, 0)                   15.76%    (Bad) undiscovered and not saved

31% were not indexed by Google

Difference between creation date and first memento

•  18% have creation dates over 1 year before the first memento was archived
•  19.48% of the URIs have an estimated creation date that is the same as the first memento date

Seed Data Set

          Arabic   Archived   Indexed
DMOZ      34.43%   95.52%     82.13%
Raddadi   19.88%   45.44%     65.83%
Star28    45.69%   41.54%     65.23%

DMOZ URIs are more likely to be found and archived


Full Data Set

Group     Total    Archived   Category   Total    Archived
Arabic    33.18%   33.56%     AR ccTLD   14.84%   28.09%
                              AR GeoIP   10.53%   13.11%
                              AR both    7.81%    59.50%
Neither   66.82%   65.22%     Neither    66.82%   65.22%

URIs hosted in Western countries are more likely to be archived

Seed Data Set

Group     Total    Indexed    Category   Total    Indexed
Arabic    15.01%   78.29%     AR ccTLD   6.61%    76.09%
                              AR GeoIP   2.37%    73.54%
                              AR both    6.03%    85.24%
Neither   84.99%   65.22%     Neither    84.99%   67.09%

URIs that had some Arabic location had a higher indexing rate

The spread of mementos was not affected by location or ccTLD

•  Kolmogorov-Smirnov test

Category   Mean
Ar GeoIP   0.5010
Ar ccTLD   0.5013
Both       0.5016
Neither    0.5005

Comparison             D-Value   P-Value
Ar ccTLD vs. neither   0.017     <0.002
Ar GeoIP vs. neither   0.014     <0.002
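The D-value reported by a two-sample Kolmogorov-Smirnov test is the largest gap between the empirical distribution functions of the two samples. The pure-Python sketch below shows that computation only; `ks_statistic` is a hypothetical helper, the sample values are invented, and the P-value calculation is left out.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample KS D statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The gap between two step functions can only change at observed values.
    for x in a + b:
        ecdf_a = bisect.bisect_right(a, x) / len(a)
        ecdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


# Invented spread values for two categories, for illustration only.
ar_cctld = [0.48, 0.50, 0.51, 0.52]
neither = [0.47, 0.50, 0.50, 0.53]
print(ks_statistic(ar_cctld, neither))
```

A D-value close to 0, like the 0.017 and 0.014 in the table, means the two distributions are nearly indistinguishable in shape.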

An older webpage is not necessarily archived more, because of low historical archiving rates

We look at the last three years

In the last three years, the older the resource is, the more mementos it has

             Full Data Set        Seed Data Set
Path Depth   Total    Archived    Total    Indexed
0            17.30%   86.29%      86.05%   74.60%
1            40.42%   53.49%      9.77%    38.91%
2            24.45%   45.57%      3.72%    17.85%
3+           17.83%   34.24%      0.50%    57.50%

Top-level URIs are more likely to be archived and indexed


Summary of collection methods

•  Collected URIs from three Arabic directories (7,976):
   - DMOZ
   - Raddadi.com
   - Star28.com
•  Crawl seed dataset (1,299,671)
•  Check if they are unique (663,443)
•  Check if they are live (482,905)
•  Check for Arabic language (300,646)

Findings

•  Our Arabic-language dataset was not largely located in Arabic countries
   - Only 14.84% had an Arabic ccTLD
   - Only 10.53% had a GeoIP in an Arabic country
   - Popular Western domains (e.g., cnn.com, wikipedia.org) appeared in the top 10
•  Arabic webpages are not particularly well archived or indexed
   - 46% were not archived
   - 31% were not indexed by Google
•  An Arabic webpage is more likely to be...
   - indexed if it is present in a directory
   - archived if it is present in DMOZ
   - archived if it has neither Arabic GeoIP nor Arabic ccTLD

For right now, if you want your Arabic-language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ

Backup Slides

GeoIP Location

•  We obtained the IP addresses of the hostnames using nslookup, which uses DNS to convert a hostname to its IP address
•  We used the MaxMind GeoLite2 database to determine location from the IP address; it tests at 99.8% accuracy at the country level

http://dev.maxmind.com/geoip/geoip2/geolite2/
http://dev.maxmind.com/faq/how-accurate-are-the-geoip-databases/
