Bon cop, bad cop: A tale of two cities

Kelsey Ball, Barbara E. Bullock, Gualberto Guzmán, Rozen Neupane, Kristopher S. Novak, & Jacqueline Larsen Serigos

The University of Texas

Transcultural urban spaces: where geography meets language, 16/17 October 2015, Berne

Outline

1. Language, geography, contact, and multilingualism

○ Mobility and mixing

2. The Bilingual Annotation Tasks (BATs) Force

○ Investigating multilingual texts with automated tools

3. Significance

○ Insight into how bilingual speakers use their linguistic resources as they move between different urban spaces

Language, geography, contact, and multilingualism

Superdiversity

● In zones of high contact, such as urban centers, innovative language use is ubiquitous, reflective of superdiversity (Vertovec 2007).

● Contemporary urban speech characterized by mixing is referred to by a variety of terms:

○ urban youth languages

○ ethnolects

○ translanguaging/code-switching

○ hybrid language practices

Language mixing and multilingual corpora

● As speakers move between and within spaces, interacting with different interlocutors, they move between and within languages.

● What is the nature of contemporary mixed language forms?

○ Linguistic studies: How do they operate structurally?

○ Cultural and urban studies: What kinds of meanings and identities do they convey?

Polylingual performances

● Not limited to borrowing words from Language X to insert into Language Y; can include:

○ Non-normative usages

○ Local forms

○ Polylingual language play: memes

○ Convergence in structures and meaning between languages

● In order to make sense of this, we need data—lots of it

● But these features make it difficult to annotate these types of text

Attempts to study multilingual data

● LIDES/LIPPS Group

● BilingBank

○ accessible via http://www.talkbank.org/

● Contacts de Langues: Analyses Plurifactorielles (CLAPOTY)

○ Isabelle Léglise (CNRS SEDYL/CELIA) - http://clapoty.vjf.cnrs.fr/index.html

Studying multilingual data: Traditionally

● typically manually annotated

● set base/matrix language

● discrete categorization of contact features

○ Established borrowings (vs. single word code-switch)

■ “mon chum”

○ Syntactic calques (vs. dialectal extensions of existing patterns in a language)

■ “take a coffee” < have coffee

○ The language of a proper name (Named Entity)

■ Robert

Studying multilingual data: More recently

● Probabilistic approaches to contact features

○ Allows for gradient/blended classification

● Flexible word sequences (ngrams)

● Automatic annotation

Why automate?

● Eliminate obstacles of cost and time

● Eliminate errors/biases introduced by manual coding

● Increase accountability through replicability

● Process other people’s data to create open resources

● Accelerate production of findings

● Encourage collaboration

Bilingual Annotation Tasks (BATs) Force

/|\ (°v°) /|\

Goals of the BATs

● Present a method and tools for analyzing larger data sets than linguists usually work with

● Demonstrate that it is possible to automatically identify the languages in a data set without the need for human intervention

● Apply a quick metric to calculate the degree to which a text is mixed so that it can be compared with other texts

A first step: written corpus

● Killer Crónicas: Bilingual Memories

○ ~49,000 words

○ colloquial language

○ dialectal forms

○ highly bilingual

○ extensive mixing

A second step: oral corpus

● Bon Cop Bad Cop

○ 13,502 tokens

○ colloquial language

○ dialectal forms

○ extensive mixing

○ highly bilingual

Movie Clip

https://www.youtube.com/watch?v=lLISTwHX88k&feature=youtu.be

Bon Cop, Bad Cop, 2006

● Plot

○ Officers from Ontario and Quebec join forces to solve a murder on the border

○ Film revolves around the mixing of languages and cultures

■ Bouchard: Francophone

■ Ward: Anglophone

● Bilingualism

○ Touted as Canada’s first bilingual film, very successful at the box office

○ Filmed with a French and an English script; later edited for final mixed version

Methods

Corpus Creation

1. Monolingual subtitles: open subtitle.com (English), subsmax.com (French)

2. Manually combined the two subtitles to create a transcript

Annotation Tags for Manual Gold Standard

Language Tag           Example
English                years, country
Non-standard English   Hopportunity
French                 frère, ville
Non-standard French    mononcle, tabarnak
Named Entities         Montréal, Toronto, David
Punctuation            . ! , ? –
Number                 2001, VII

Example of the Gold Standard

Building a language identification model

● Numbers and Punctuation

○ Rule-based

● Named Entity

○ Stanford Named Entity Recognizer

● English and French

○ Character Ngram

○ HMM
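The rule-based step for numbers and punctuation could look roughly like the sketch below. It is not the authors' implementation: the tag names `"num"` and `"punct"`, the regular expressions, and the function name `rule_based_tag` are all assumptions made for illustration. Roman numerals are included because the gold-standard tag list gives "VII" as an example of a number.

```python
import re
import string

# Hypothetical patterns; the slides do not specify the actual rules used.
DIGITS = re.compile(r"^\d+([.,]\d+)*$")
ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
# ASCII punctuation plus a few typographic marks common in subtitles.
PUNCT_CHARS = set(string.punctuation) | {"\u2013", "\u2014", "\u00ab", "\u00bb", "\u2026"}


def rule_based_tag(token):
    """Return "punct", "num", or None if the token needs a statistical model."""
    if token and all(ch in PUNCT_CHARS for ch in token):
        return "punct"
    if DIGITS.match(token):
        return "num"
    # Single letters are skipped so the English pronoun "I" is not a numeral.
    # Other ambiguous words (e.g. "MIX") would still need extra handling.
    if len(token) >= 2 and ROMAN.match(token):
        return "num"
    return None


print([rule_based_tag(t) for t in ["2001", "VII", "!", "frère", "I"]])
```

Tokens that come back as `None` would then be passed on to the named entity recognizer and the language models.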

Character N-gram Model

● What’s the probability that a given word is English? French?

● Model a language using training data (La Presse corpus, CALLHOME corpus)

● Identify language of an unknown word based on how often it occurs in training data

● Example:
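To make the idea concrete, here is a minimal sketch of a character n-gram language identifier. It makes several assumptions that are not in the slides: trigrams, add-one smoothing, and two tiny hand-written word lists standing in for the La Presse and CALLHOME training data. The class and function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Split a word into character n-grams, padded with start/end markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    def __init__(self, words, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of the (n-1)-character prefixes
        self.alphabet = {"$"}
        for w in words:
            self.alphabet.update(w.lower())
            for g in char_ngrams(w, n):
                self.ngrams[g] += 1
                self.contexts[g[:-1]] += 1

    def log_prob(self, word, vocab_size):
        """Add-one smoothed log probability of the word's character sequence."""
        return sum(
            math.log((self.ngrams[g] + 1) / (self.contexts[g[:-1]] + vocab_size))
            for g in char_ngrams(word, self.n)
        )


def identify(word, models):
    """Return the language whose model gives the word the highest score."""
    vocab = set().union(*(m.alphabet for m in models.values()))
    scores = {lang: m.log_prob(word, len(vocab)) for lang, m in models.items()}
    return max(scores, key=scores.get)


# Toy training data (placeholders, not the corpora named in the slides).
ENGLISH = ["the", "years", "country", "where", "going", "because", "brother",
           "city", "thanks", "called", "what", "with", "have", "this", "that",
           "partner"]
FRENCH = ["le", "frère", "ville", "nous", "chez", "merci", "avoir", "appelé",
          "aller", "demander", "peut", "pour", "avec", "cette", "dans", "quoi"]

models = {"eng": CharNgramModel(ENGLISH), "fr": CharNgramModel(FRENCH)}
print(identify("thing", models), identify("merci", models))
```

With realistic amounts of training text the same scoring scheme extends to words never seen in training, because their character sequences are usually still familiar to one of the models.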

Hidden Markov Model

● What’s the probability that this word is English, considering that the previous word was English?

● Code-switching is less frequent (therefore less probable) than not switching.

○ "me" occurs in both English and French training data → ambiguous

○ Because the previous word was English, "me" is more likely to be English
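The "me" example can be worked through with a small Viterbi decoder over two hidden states. This is only a sketch: the transition, start, and emission probabilities below are invented for illustration, and in a real system the emission scores would presumably come from the character n-gram models rather than a lookup table.

```python
import math

STATES = ("eng", "fr")

# Hypothetical numbers: staying in a language is more likely than switching.
STAY, SWITCH = 0.9, 0.1
START = {"eng": 0.5, "fr": 0.5}

# Hypothetical P(word | language); "me" is equally likely in both languages.
EMIT = {
    "eng": {"give": 0.6, "me": 0.3, "donne": 0.1},
    "fr": {"give": 0.1, "me": 0.3, "donne": 0.6},
}


def viterbi(words):
    """Return the most probable language sequence for the given words.

    Words missing from EMIT raise KeyError; this toy table has no smoothing.
    """
    best = {
        s: (math.log(START[s]) + math.log(EMIT[s][words[0]]), [s])
        for s in STATES
    }
    for w in words[1:]:
        new_best = {}
        for s in STATES:
            candidates = []
            for prev in STATES:
                trans = STAY if prev == s else SWITCH
                score = best[prev][0] + math.log(trans) + math.log(EMIT[s][w])
                candidates.append((score, best[prev][1] + [s]))
            new_best[s] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


print(viterbi(["give", "me"]))   # "me" follows an English word
print(viterbi(["donne", "me"]))  # "me" follows a French word
```

Because the emission scores for "me" are tied, the transition probabilities alone decide its tag, which is the behaviour the slide describes.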

Annotation of Bon Cop, Bad Cop by model

Results

Evaluation of our model

Tag            Accuracy   Precision   Recall
English        98.3%      98.3%       96.6%
French         97.5%      97.4%       98.6%
Named Entity   69.4%      78.7%       85.5%
Over-all       97.0%
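For readers unfamiliar with these measures, the sketch below shows how per-tag precision and recall, and overall accuracy, are conventionally computed by comparing model output to a gold standard. The token tags are made-up toy data, not the film corpus, and the slides do not say how the per-tag "Accuracy" column was defined, so it is not reproduced here.

```python
def per_tag_scores(gold, predicted, tag):
    """Return (precision, recall) for one tag."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == tag and p == tag)
    fp = sum(1 for g, p in pairs if g != tag and p == tag)
    fn = sum(1 for g, p in pairs if g == tag and p != tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def overall_accuracy(gold, predicted):
    """Share of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy example: six tokens, two of them mis-tagged by the "model".
gold = ["eng", "eng", "fr", "fr", "ne", "fr"]
pred = ["eng", "fr", "fr", "fr", "eng", "fr"]

print(per_tag_scores(gold, pred, "fr"))
print(per_tag_scores(gold, pred, "eng"))
print(overall_accuracy(gold, pred))
```

Precision asks how many tokens given a tag really carry it; recall asks how many tokens that carry the tag were found.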

Over-all Language Distribution across Bon Cop Bad Cop

[Figure: token count by language tag (Named Ent., Punct., Eng, French)]

Characters’ Usage of French Across Cities

[Figure: characters' usage of French in Montréal and Toronto]

Logistic Regression

Can we predict language based on character and location?

● Relative to Martin Ward...

○ Others: not a significant predictor

○ David Bouchard: 15 times more likely to use French

○ Tattoo Killer: 1.6 times more likely to use English

● Relative to Toronto...

○ In Montréal: 8 times more likely to use French

Bilingual metric applied to Bon Cop Bad Cop

● The M-metric was developed by the LIPPS Group (2000):

$$M = \frac{1 - \sum_{j=1}^{k} p_j^2}{(k - 1)\,\sum_{j=1}^{k} p_j^2}$$

where

k = total number of languages/categories

p_j = total number of words in the given language divided by the total number of words in the corpus

● The metric allows us to determine how bilingual a text is: the closer the value is to 1, the more ‘bilingual’ the text
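As a rough illustration, the metric can be computed from per-language word counts in a few lines. The sketch assumes the usual LIPPS formulation, M = (1 − Σ p_j²) / ((k − 1) · Σ p_j²), and uses invented counts rather than the figures for the film; the function name `m_metric` is hypothetical.

```python
def m_metric(counts):
    """Compute the M-metric from a mapping of language -> word count.

    k is the number of languages; p_j is each language's share of the words.
    """
    k = len(counts)
    total = sum(counts.values())
    if k < 2 or total == 0:
        raise ValueError("need at least two languages and one word")
    sum_sq = sum((c / total) ** 2 for c in counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)


# Invented counts for illustration only.
print(m_metric({"eng": 50, "fr": 50}))   # perfectly balanced text
print(m_metric({"eng": 75, "fr": 25}))   # English-dominant text
print(m_metric({"eng": 100, "fr": 0}))   # monolingual text
```

Only tokens tagged as belonging to a language would go into the counts; punctuation, numbers, and named entities would be left out under this reading of the definition.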

M-metric for Bon Cop Bad Cop

Switch points

Types of Switches       Example
Eng to French           He has had an unfortunate contretemps
French to English       On peut aller demander à ses partners
Punct. only             Hey, where you going? Chez nous.
Named Ent. and Punct.   Bye, Patrick. Merci d'avoir appelé.
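Once every token carries a tag, switch points like those in the table can be located mechanically. The sketch below is one possible way to do it, not the authors' code: the tag names (`eng`, `fr`, `punct`, `ne`) and the function name are assumptions, and it treats a switch with nothing between the two language tokens as direct and any other switch as indirect.

```python
LANGS = {"eng", "fr"}  # hypothetical tag names for the two languages


def find_switches(tagged):
    """Find language switches in a list of (token, tag) pairs.

    Returns (from_lang, to_lang, tags_in_between) triples. An empty tuple of
    in-between tags means the two language tokens were adjacent.
    """
    switches = []
    last_lang = None
    between = []
    for _token, tag in tagged:
        if tag in LANGS:
            if last_lang is not None and tag != last_lang:
                switches.append((last_lang, tag, tuple(between)))
            last_lang = tag
            between = []
        else:
            between.append(tag)
    return switches


# Two of the examples from the table, tagged by hand.
direct = [("On", "fr"), ("peut", "fr"), ("aller", "fr"), ("demander", "fr"),
          ("à", "fr"), ("ses", "fr"), ("partners", "eng")]
indirect = [("Bye", "eng"), (",", "punct"), ("Patrick", "ne"), (".", "punct"),
            ("Merci", "fr"), ("d'avoir", "fr"), ("appelé", "fr"), (".", "punct")]

print(find_switches(direct))
print(find_switches(indirect))
```

Counting the triples by direction and by the tags in between gives the kind of switch-pattern summary reported on the following slides.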

Switch Pattern

Over ⅔ of direct switches involved single-word borrowings (e.g. ses affaires cheap américaines)

[Figure: direct switches vs. indirect switches]

Discussion and Conclusions

What we’ve achieved

● Presented a profile of language switching in a large bilingual text

○ locale

○ speaker

● Demonstrated that it is possible to automatically identify language switches without manually coding all the data

○ with a high degree of accuracy: 97%

● Calculated the degree to which segments of the corpus are mixed and how they are mixed

○ M score of .71 for overall mixing where 1 = perfectly mixed

What we’ve learned about who switches, where

● Context matters— a lot

■ In Toronto, only the Québécois cop (Bouchard) uses French; all others perform in English

■ In Montreal, the Torontonian cop (Ward) and Tattoo Killer perform bilingually

● Speaker matters— a lot

■ The most bilingual character appears to be Tattoo Killer

■ The least is Bouchard

What else we’ve learned about who switches

● The anglophone speakers accommodate more to the francophone than vice versa.

○ Given the politics of language in Quebec with regard to French language maintenance, and the fact that the director and producer of the film are Québécois, this might not be surprising.

What we’ve learned about how they switch

● Most switching coincides with punctuation, indicating

○ a change in speaker turn

○ or a switch at a phrasal boundary

● Within a phrase

○ nearly balanced French↔English

Conclusions

This work shows how NLP tools can be fruitfully recruited to shed light on contemporary mixed language use.
