Language and the Internet
Assessing Linguistic Bias
Measuring the Information Society
WSIS, Tunis, November 15, 2005
John C. Paolillo, Indiana University
Overview
• Sources of Linguistic Bias
• Linguistic Bias: examples
  – Text Communication
  – Internet Host Names
  – Web Programming
• Global Linguistic Diversity
  – Who bears the costs?
• Conclusions
Sources of Linguistic Bias
(Friedman and Nissenbaum 1997)
• Pre-existing
  – originate from outside the technical system
    • National, trans-national and institutional policies
    • Technology companies
• Technical
  – are built into the technical system itself
    • Developers’ language backgrounds, national origins
    • Legacy standards, “backward” compatibility
• Emergent
  – arise in specific contexts of use of a technical system
    • Economics of technology industry (marketing, monopoly power, unstable markets, etc.)
    • Rapid technologization
Text Communication
• Requires an encoding and its support
  – Assign code numbers to script characters
    • ASCII (American English)
    • ISO-8859-1 (European Languages)
    • Unicode (most languages, but support is uneven)
  – Support means many things
    • Fonts, rendering, sorting, spell-checking etc.
• Computer-Mediated Communication
  – Web pages, Email, chat, etc.
  – Language use is not uniform in these modes
    • Multilinguals tend to favor different languages for specific purposes
• Represents both technical and emergent biases
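The idea that an encoding assigns code numbers to script characters, and that each encoding covers only some scripts, can be illustrated with a short Python sketch. The sample words and the helper name supported_by are illustrative assumptions, not part of the talk; the sketch only checks whether text can be encoded, which is far less than full "support" (fonts, rendering, sorting, spell-checking).

```python
# Sample words are illustrative assumptions, not data from the talk.
samples = {
    "English": "language",
    "French": "français",
    "Russian": "язык",
    "Hindi": "भाषा",
}

def supported_by(text, encoding):
    """Return True if every character of text has a code in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for name, word in samples.items():
    coverage = {enc: supported_by(word, enc)
                for enc in ("ascii", "iso-8859-1", "utf-8")}
    print(name, coverage)

# One character, one code number (231), different byte sequences per encoding.
print(ord("ç"), "ç".encode("iso-8859-1"), "ç".encode("utf-8"))
```

Under these assumptions, only the English sample fits in ASCII, the French sample additionally fits in ISO-8859-1, and the Russian and Hindi samples need a Unicode encoding such as UTF-8.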
Unicode Status: Examples
Language        Unicode  Browser      Script    Pop.
Chinese         yes      good         Chinese   1,240M
English         yes      good         Roman     400M
French          yes      good         Roman     81M
German          yes      good         Roman     82M
Spanish         yes      good         Roman     358M
Finnish         yes      good         Roman     5M
Russian         yes      good (late)  Cyrillic  132M
Arabic          yes      good (late)  Arabic    247M
Hindi           yes      poor         Indic     213M
Sinhala         yes      none         Indic     15M
S. Azerbaijani  no       none         Arabic    26M
Legend: Good support, Poor support, No support
Internet Host Names
• The Domain Name System
  – Uses a 30-year-old 7-bit ASCII standard
    • Now supports Punycode (an ASCII-compatible encoding of Unicode)
    • Imposes a maximum name length
  – Run by ICANN under US Dept of Commerce contract
    • More concerned with trademark protection
    • Host/domain naming is widely abused (e.g. tv domain)
    • Names provided by the DNS are not that useful
• An example of emergent bias
  – Technical origin
  – Economic and political forces amplify and sustain it
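How Punycode interacts with the name-length limit can be sketched in Python, whose standard library ships a "punycode" codec. The helper to_ascii_label and the example names are illustrative assumptions; the sketch skips the normalization steps that real internationalized domain name processing applies, and it takes the 63-octet label limit from the DNS specification (RFC 1035).

```python
DNS_LABEL_MAX = 63  # per-label limit in octets, from RFC 1035

def to_ascii_label(label):
    """Simplified sketch: convert one host-name label to its ASCII form.

    Real IDNA processing also normalizes and case-folds the label;
    that step is deliberately omitted here.
    """
    if label.isascii():
        encoded = label
    else:
        # Non-ASCII labels are Punycode-encoded and marked with "xn--".
        encoded = "xn--" + label.encode("punycode").decode("ascii")
    if len(encoded) > DNS_LABEL_MAX:
        raise ValueError(f"label needs {len(encoded)} octets, limit is {DNS_LABEL_MAX}")
    return encoded

for name in ("munchen", "münchen", "bücher"):
    ascii_form = to_ascii_label(name)
    print(name, "->", ascii_form, f"({len(ascii_form)} of {DNS_LABEL_MAX} octets)")
```

In this sketch the seven-letter name "münchen" becomes the 14-octet label "xn--mnchen-3ya", so names outside ASCII consume the fixed length budget faster than ASCII names of the same visible length.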
Web Programming and Unicode
• Markup & web scripting languages
  – Unicode is standard
  – Browser support, fonts, etc. lag behind
  – Databases and development environments tend to lack proper Unicode support
  – End-user oriented, not programmer oriented
• All of the most important technologies are Open-Source software (FLOSS)
  – User extensible/modifiable
  – Language localization of these is possible but rare
Linguistic Bias in Web Programming
• English is the source language for most programming & markup languages
  – Keywords
  – Operator-argument order
  – Programming constructs, etc.
• Programming as a linguistic act
  – Complex concepts are rendered into text
  – Different languages have different ways of doing this
• Emergent language biases
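The point about English keywords can be illustrated with Python itself, used here only as a convenient example of a language that accepts Unicode identifiers. The function name fakultät (German for "factorial") is an illustrative assumption: a programmer can name things in another language, but the keywords that structure the program stay English.

```python
import keyword

# The identifier is German; "def", "if" and "return" cannot be localized.
def fakultät(n):
    if n <= 1:
        return 1
    return n * fakultät(n - 1)

# Every reserved word in the language is plain ASCII English.
english_only = all(word.isascii() for word in keyword.kwlist)

print(fakultät(5), english_only)  # prints: 120 True
```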
Linguistic Properties of Programming
• LISP
  – Predicates precede their arguments
    • Like Arabic, Celtic, Hebrew, etc.
(defun fact (x) (if (<= x 0) 1 (* x (fact (- x 1)))))
• Postscript
  – Predicates follow their arguments
    • Like Farsi, Hindi, Japanese, Tamil, Turkish, etc.
/factorial { dup 1 gt { dup 1 sub factorial mul } if } def
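The contrast between predicate-first and predicate-last order can be made concrete with a minimal Python sketch that evaluates the same arithmetic in both orders. The function names, the token format and the three-operator table are illustrative assumptions; this is neither a LISP nor a PostScript interpreter.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_postfix(tokens):
    """Operators follow their arguments (PostScript-like order)."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    return stack.pop()

def eval_prefix(tokens):
    """Operators precede their arguments (LISP-like order, without parentheses)."""
    stack = []
    # Scanning right to left lets the same stack discipline handle prefix order.
    for tok in reversed(tokens):
        if tok in OPS:
            left = stack.pop()
            right = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    return stack.pop()

# Both spell (7 - 3) * 2; only the position of the operators differs.
print(eval_prefix("* - 7 3 2".split()))   # prints: 8
print(eval_postfix("7 3 - 2 *".split()))  # prints: 8
```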
The Linguistic Digital Divide
• Language issues go beyond content
  – WSIS repeatedly re-affirms principles of
    • Transparency
    • Self-determination
    • Open access to participation for all parties
  These principles cannot be guaranteed unless speakers of different languages can manipulate all aspects of IT use in a way that is native-like
• The linguistic divide has broader consequences
  – Costs are borne in
    • Education: great for non-English speaking people
    • Technical development: small, in comparison (there is a trade-off)
Language Diversity
Who bears the costs?
[Figure: Distribution of language groups by size. x-axis: Population (in thousands); y-axis: Number of groups. Source data: www.ethnologue.com]
A typical language group has around 10-50 thousand people
80% of language groups have fewer than 100 thousand members
[Figure: Cumulative proportion of world's population. x-axis: Language group population (millions); y-axis: Cumulative proportion. Source data: www.ethnologue.com]
90% of the world’s population belongs to a language group with at least 1 million people (416 groups)
Many languages with hundreds of millions of speakers lack adequate support
[Figure: Worldwide Linguistic Diversity by Region. Regions shown: W Asia, SC Asia, S America, Europe, SE Asia, Oceania, Africa, USA, N America, E Asia. Source data: www.ethnologue.com]
[Figure: Per-Country Linguistic Diversity by Region. Regions shown: USA, E Asia, W Asia, SC Asia, S America, SE Asia, Oceania, Africa, N America, Europe]
Conclusions
• Linguistic Bias is manifest in many ways
  – Technical biases are sometimes overt
  – Emergent biases can be subtle
• All potential sources of bias need to be examined and questioned if we are to uphold principles affirmed by WSIS
• Without this effort, the linguistic digital divide will simply amplify existing disparities in wealth and power
Language Diversity On The Internet
[Figure: Estimated populations of Internet users (millions), 1996-2005. Series: English, Chinese, Japanese, Spanish, German, Korean, French, Italian, Portuguese, Scandinavian, Dutch, Other. Source: Global Reach]
Linguistic Diversity
Based on Entropy:
$\mathrm{Diversity} = -2 \sum_i p_i \ln p_i$
Diversity is the long-run per-individual average variance in language category (similar to log-likelihood)
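The entropy-based index can be computed directly from category shares. The sketch below assumes that p_i is the proportion of individuals in language category i; the function name diversity and the rescaling of its input to proportions are illustrative choices. The sample shares are the web-page language percentages reported elsewhere in this deck (O'Neill, Lavoie and Bennett, 2003), and treating "other" as a single category is a simplifying assumption.

```python
import math

def diversity(shares):
    """Entropy-based diversity: -2 * sum(p_i * ln p_i).

    shares may be proportions, percentages or counts; they are rescaled
    to sum to 1 (an assumption of this sketch). Empty categories are
    skipped because p * ln p tends to 0 as p approaches 0.
    """
    shares = list(shares)
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must have a positive total")
    return -2.0 * sum((s / total) * math.log(s / total) for s in shares if s > 0)

# Web-page language shares in percent; "other" is lumped into one category.
web_shares = {
    "English": 72, "German": 7, "Japanese": 3, "French": 3, "Spanish": 3,
    "Portuguese": 2, "Chinese": 2, "Italian": 2, "other": 2,
    "Swedish": 1, "Russian": 1, "Finnish": 1, "Dutch": 1,
}

print("web pages:", round(diversity(web_shares.values()), 3))
print("two equal groups:", round(diversity([1, 1]), 3))
print("upper bound for 13 categories:", round(2 * math.log(len(web_shares)), 3))
```

With a single category the index is 0, and with n equally sized categories it reaches its largest value, 2 ln n, so a distribution dominated by one language scores well below the bound for its number of categories.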
[Figure: Linguistic Diversity of Internet Users, 1996-2005. y-axis: Diversity Index; reference lines: minimum, maximum]
Languages on the Web (O’Neill, Lavoie and Bennett, 2003)
English     72%
German       7%
Japanese     3%
French       3%
Spanish      3%
Portuguese   2%
Chinese      2%
Italian      2%
other        2%
Swedish      1%
Russian      1%
Finnish      1%
Dutch        1%
[Figure: Internet Hosts, 1995-01 to 2003-01. y-axis: Millions. Source: www.isc.org/ds]
[Figure: Host growth by region (millions), 1995-2002. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa, Other. Source: www.isc.org/ds]
[Figure: Users per host by region, 1998-2001. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa. Source: www.isc.org/ds, ITU]
[Figure: Proportion of random .com hosts. Categories: United States, Canada, Netherlands, Australia, Unknown, United Kingdom, Hongkong, Israel]
[Figure: Proportion of random .net hosts. Categories: United States, Australia, Netherlands, Unknown, Canada, Germany, Japan]
[Figure: User growth by region (millions), 1998-2001. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa. Source: ITU]
[Figure: Proportion of Internet hosts by region, 1995-2002. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa, Other. Source: www.isc.org/ds]
[Figure: Host growth by region (millions), 1995-2002, second version with a smaller vertical range. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa, Other. Source: www.isc.org/ds]
[Figure: Internet hosts per thousand inhabitants by region, 1995-2002. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa. Source: www.isc.org/ds, UNPD]
[Figure: Users per thousand inhabitants by region, 1998-2001. Series: USA, N America, S America/Caribbean, Europe, E Asia, SE Asia, S Central Asia, W Asia, Oceania, Africa. Source: ITU, UNPD]
[Figure: 1067 Random Hosts (all domains). Source: www.isc.org/ds, ITU]