bon cop, bad cop a tale of two...
TRANSCRIPT
Bon cop, bad cop: A tale of two cities
Kelsey Ball, Barbara E. Bullock, Gualberto Guzmán,Rozen Neupane, Kristopher S. Novak,
& Jacqueline Larsen SerigosThe University of Texas
Transcultural urban spaces: where geography meets language 16/17 October 2015 Berne
Outline1. Language, geography, contact, and multilingualism
○ Mobility and mixing
2. The Bilingual Annotation Tasks (BATs) Force○ Investigating multilingual texts with automated tools
3. Significance○ Insight into how bilingual speakers use their linguistic resources as they move between different
urban spaces
2
Language, geography, contact, and multilingualism
3
Superdiversity
● In zones of high contact, such as urban centers, innovative language use is ubiquitous, reflective of superdiversity (Vertovec 2007).
● Contemporary urban speech characterized by mixing is referred to by a variety of terms:
○ urban youth languages
○ ethnolects
○ translanguaging/code-switching
○ hybrid language practices
4
Language mixing and multilingual corpora
● As speakers move between and within spaces, interacting with different interlocutors, they move between and within languages.
● What is the nature of contemporary mixed language forms?○ Linguistic studies : How do they operate structurally?
○ Cultural and urban studies: What kinds of meanings and identities do they convey?
5
Polylingual performances
6
● Not limited to borrowing words from Language X to insert into Language Y; can include:
○ Non-normative usages
○ Local forms
○ Polylingual language play: memes
○ Convergence in structures and meaning between languages
● In order to make sense of this, we need data—lots of it
● But these features make it difficult to annotate these types of text
Attempts to study multilingual data
● LIDES/LIPPS Group
● BilingBank
○ accessible via http://www.talkbank.org/
● Contacts de Langues: Analyses Plurifactorielles (CLAPOTY) ○ Isabelle Léglise (CNRS SEDYL/CELIA) - http://clapoty.vjf.cnrs.fr/index.html
7
Studying multilingual data: Traditionally● typically manually annotated● set base/matrix language● discrete categorization of contact features
○ Established borrowings (vs. single word code-switch)
■ “mon chum”
○ Syntactic calques (vs. a dialectal extensions of existing patterns in a language)
■ “take a coffee” < have coffee
○ The language of a proper name (Named Entity)
■ Robert
8
Studying multilingual data: More recently● Probabilistic approaches to contact features
○ Allows for gradient/blended classification
● Flexible word sequences (ngrams)
● Automatic annotation
9
Why automate?● Eliminate obstacles of cost and time● Eliminate errors/biases introduced by manual coding● Increase accountability through replicability● Process other people’s data to create open resources● Accelerate production of findings● Encourage collaboration
10
Bilingual Annotation Tasks (BATs) Force/|\ (°v°)/|\
11
Goals of the BATs● Present a method and tools for analyzing larger data sets than linguists usually
work with
● Demonstrate that it is possible to automatically identify the languages in a data set without the need for human intervention
● Apply a quick metric to calculate the degree to which a text is mixed so that it can be compared with other texts
12
13
A first step: written corpus
● Killer Crónicas: Bilingual Memories
○ ~49,000 words
○ colloquial language
○ dialectal forms
○ highly bilingual
○ extensive mixing
A second step: oral corpus
● Bon Cop Bad Cop
○ 13,502 tokens
○ colloquial language
○ dialectal forms
○ extensive mixing
○ highly bilingual
14
Movie Clip
15https://www.youtube.com/watch?v=lLISTwHX88k&feature=youtu.be
Bon cop, Bad Cop, 2006● Plot
○ Officers from Ontario and Quebec join forces to solve a murder on the border
○ Film revolves mixing of languages and cultures
■ Bouchard: Francophone
■ Ward: Anglophone
● Bilingualism○ Touted as Canada’s first bilingual film, very successful at the box office
○ Filmed with a French and an English script; later edited for final mixed version
16
Methods
17
Corpus Creation1. Monolingual subtitles: open subtitle.com (English), subsmax. com (French)
18
Corpus Creation2. Manually combined the two subtitles to create a transcript
19
Annotation Tags for Manual Gold Standard
20
Language Tag Example
English years, country
Non-standard English Hopportunity
French frère, ville
Non-standard French mononcle, tabarnak
Named Entities Montréal, Toronto, David
Punctuation . ! , ? –
Number 2001, VII
Example of the Gold Standard
21
Building a language identification model● Numbers and Punctuation
○ Rule-based
● Named Entity○ Stanford Named Entity Recognizer
● English and French ○ Character Ngram
○ HMM
22
Character N-gram Model● What’s the probability that a given word is English? French?● Model a language using training data (La Presse corpus, CALLHOME corpus)● Identify language of an unknown word based on how often it occurs in training
data● Example:
23
Hidden Markov Model● What’s the probability that this word is English, considering that the previous
word was English?● Code-switching is less frequent (therefore less probable) than not switching.
-Me occurs in both English and French training data → ambiguous -Because previous word was English, me is more likely to be English
24
Annotation of Bon Cop, Bad Cop by model
25
Results
26
Evaluation of our model
27
Tag Accuracy Precision Recall
English 98.3% 98.3% 96.6%
French 97.5% 97.4% 98.6%
Named Entity 69.4% 78.7% 85.5%
Over-all 97.0%
Over-all Language Distribution across Bon Cop Bad Cop
28
Language Tag
Named Ent.
Punct.
Eng
French
Token count
Characters’ Usage of French Across Cities
29Montréal Toronto
Logistic Regression
30
● Relative to Martin Ward...
○ Others: not a significant predictor
○ David Bouchard: 15 times more likely to use French
○ Tattoo Killer: 1.6 times more likely to use English
● Relative to Toronto...
○ In Montréal: 8 times more likely to use French
Can we predict language based on character and location?
Bilingual metric applied to Bon Cop Bad Cop
31
● The M-metric is developed by the LIPPS Group (2000):
where,
k=total number of languages/categories
pj=total number of words in the given language over total words in the corpus
● The metric allows us to determine how bilingual a text is: the closer the value is
to 1, the more ‘bilingual’ the text
M-metric for Bon Cop Bad Cop
32
Switch points
33
Types of Switches Example
Eng to French He has had an unfortunate contretemps
French to English On peut aller demander à ses partners
Punct. only Hey, where you going?Chez nous.
Named Ent. and Punct. Bye, Patrick. Merci d'avoir appelé.
Switch Pattern
34
Over ⅔ of Direct switches involved single word borrowings (eg. ses affaires cheap américaines)
Direct Switches
Indirect Switches
Indirect Switches
Discussion and Conclusions
35
What we’ve achieved● Presented a profile of language switching in a large bilingual text
○ locale
○ speaker
● Demonstrated that it is possible to automatically identify language switches without manually coding all the data
○ with a high degree of accuracy: 97%
● Calculated the degree to which segments of the corpus are mixed and how they are mixed
○ M score of .71 for overall mixing where 1 = perfectly mixed
36
What we’ve learned about who switches, where
● Context matters— a lot
■ In Toronto, only the Québécois cop (Bouchard) uses French; all others perform in English
■ In Montreal, the Torontonian cop (Ward) and Tattoo Killer perform bilingually
● Speaker matters— a lot
■ The most bilingual character appears to be Tattoo Killer
■ The least is Bouchard
37
What else we’ve learned about who switches● The anglophone speakers accommodate more to the francophone than vice
versa.○ Given the politics of language in Quebec with regard to French language maintenance, and the
fact that the director and producer of the film are Québécois, this might not be surprising.
38
What we’ve learned about how they switch
● Most switching coincides with punctuation, indicating
○ a change in speaker turn
○ or a switch at a phrasal boundary
● Within a phrase○ nearly balanced French↔English
39
ConclusionsThis work shows how NLP tools can be fruitfully recruited to shed light on contemporary mixed language use.
40