Understanding Community Needs: Scalable SMS Processing for UNICEF Nigeria and Burundi

Jessica Long
Senior software engineer at Idibon

Overview

• Who’s involved in this project?
• What is Natural Language Processing (NLP)?
• What are the challenges of creating NLP for minority languages and multilingual societies?
• How has the digital age changed how we curate language data?
• UNICEF’s U-Report program, and Idibon’s collaboration
• Lessons learned from automatic labeling in English and minority languages
• Conclusions

Acknowledgements

• Robert Munro, CEO of Idibon
• Caroline Barebwoha, U-Report Nigeria project lead
• Aboubacar Kampo, U-Report Nigeria project lead
• Sarah Atkinson, U-Report Burundi project lead
• Kidus Fisaha Asfaw, Global head of U-Report
• Evan Wheeler, CTO of UNICEF Innovation / RapidPro
• Nicholas Gaylord, data scientist at Idibon

My background

Symbolic Systems BS, Computer Science MS

Health systems manager in rural Burundi

Internationalization engineer

Second language acquisition research

NLP engineer

What is Natural Language Processing (NLP)?

Natural language processing is a branch of artificial intelligence specifically concerned with making automatic judgments about free text

Flavors of NLP

• Automatic categorization
• Machine translation
• Named entity recognition
• Sentiment analysis
• Semantic role labeling
• Opinion mining
• Parsing
• Question answering
• Search
– 15% of Google’s daily search queries have never been issued before!
• Part-of-speech tagging
• Textual entailment
• Discourse analysis
• Natural language generation
• Speech recognition
• Word sense disambiguation
• Text summarization

Underlying algorithms

• Semi-supervised machine learning
– Start with labeled training data that’s similar to what you want to generate
– Use this to “teach” the computer what features to look for when making a decision about the text

[Figure: a training set of examples labeled “Cat” and “Dog”, and an unlabeled example (“???”) whose label is to be predicted]
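The train-then-predict loop can be sketched in a few lines of Python. This is only an illustration: it uses a plain supervised naive Bayes classifier with add-one smoothing, which the slides do not name, and the four training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest add-one-smoothed log probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for word in text.lower().split():
            seen = word_counts[label][word]
            score += math.log((seen + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, invented for illustration.
TRAINING_SET = [
    ("it purrs and meows", "Cat"),
    ("meows at the door", "Cat"),
    ("it barks and fetches", "Dog"),
    ("barks at the door", "Dog"),
]

model = train(TRAINING_SET)
print(predict("something that meows", *model))
```

The words the model has seen under each label are the "features" it has been taught to look for; words it has never seen contribute equally to every label and so do not change the decision.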

Semi-supervised machine learning example

• “Using Wikipedia for Automatic Word Sense Disambiguation,” by Rada Mihalcea (2007)

[Figure: mentions of “Paris” labeled as either “Paris, France” or “Paris, Texas”]

Tokenization and feature extraction (n-grams)

“tomb”, “of”, “the”, “unknown”, “soldier”, “beneath”, “arc”, “de”, “triomphe”

“tomb of”, “of the”, “the unknown”, “unknown soldier”, “soldier beneath”, “beneath the”, “the arc”, “arc de”, “de triomphe”

“tomb of the”, “of the unknown”, “the unknown soldier”, “unknown soldier beneath”, “soldier beneath the”, “beneath the arc”, “the arc de”, “arc de triomphe”
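A minimal sketch of how such n-grams can be produced, assuming simple whitespace tokenization of a lowercased phrase. The phrase itself is inferred from the n-gram lists above, and `ngrams` is a name chosen for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Phrase inferred from the n-gram lists; lowercased and split on whitespace.
tokens = "tomb of the unknown soldier beneath the arc de triomphe".split()

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

Each n-gram becomes one feature, so longer n-grams capture more context at the cost of appearing far less often in the training data.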

Other features

- Punctuation
- Stemming
- Parsing
- Capitalization
- Dictionary matching
- Stopwords
- …

[Figure: a source text, its source label (“Paris, France”), and the extracted features]

Who uses NLP?

Apple’s Siri does speech recognition on human voices, as well as question answering

IBM Watson answers Jeopardy questions



Language resources for UNICEF Uganda

30+ Languages Spoken in Uganda

Google Translate Supported Languages

Why is NLP difficult for minority languages?

• Lots of code-switching breaks the usual paradigm of language-specific textual analysis
• Lack of existing digital tools: spell check, autocomplete, access to internet
• Minority language speakers lack purchasing power
• Tokenization
– Consider:
• “ntibazoronka.”: “nta” “i” “ba” “zo” “ronka” “.” (Kirundi)
• “they will not obtain.”: “they” “will” “not” “obtain” “.” (English)
• Encoding issues
– “I can text you a pile of poo, but I can’t write my name” by Aditya Mukerjee in Model View Culture
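To make the tokenization point concrete, here is a small sketch of a naive tokenizer that splits on whitespace and punctuation. `naive_tokenize` is a name invented for this example and does not describe any tool used in the project.

```python
import re


def naive_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


english = naive_tokenize("they will not obtain.")
kirundi = naive_tokenize("ntibazoronka.")

print(english)  # four words plus the final period
print(kirundi)  # one long word plus the final period
```

The English sentence falls apart into the same units shown on the slide, while the Kirundi sentence stays a single word. Recovering the pieces “nta”, “i”, “ba”, “zo”, “ronka” would need language-specific morphological analysis, which this sketch does not attempt.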

But most of all...

• Minority languages lack appropriate training datasets.
– They tend to be primarily spoken, and lack the digital and even written content necessary for statistical machine learning
• Google Translate relies on parallel corpora from UN proceedings to help create machine translation products
– The UN does not dual broadcast in Wolof.
• Textual reviews matched to star ratings on Yelp help researchers calibrate sentiment analysis
– Yelp is literally non-functional in most of Africa.

“Raw data is an oxymoron.” - Lisa Gitelman

Curation of language data, old & new

Compiled by Webster

Collective wisdom, at scale

Compiled by experts, supplemented by OED Reading Programme

* Shout out! Go see Martin Benjamin’s talk on The Kamusi Project tomorrow at 13:45, for more information on dictionary curation

Creating new structured data with crowdsourcing

• “Are two heads better than one? Crowdsourced translation via a two-step collaboration of non-professional editors and translators”, Yan et al.
– Creating parallel corpora with crowd workers is much faster and cheaper than using professional translators
• Now, more than ever, we have the ability to rapidly create new labeled language data
– …as long as we can find proficient writers of minority languages with digital literacy, electricity, and internet access

Cell phone access

• Nearly 6 billion people in the world have access to a cell phone
• In 2013, the UN famously reported that more people have access to a cell phone than to a toilet



UNICEF’s U-Report

• Crowd wisdom, in real time, in developing countries
• In 2012, the UNICEF Innovation team started building a real-time SMS polling service for UNICEF Uganda. As of 2015, U-Report operates in over 15 countries
• Polls are sent out once a week on topics like:
– Has ur community addressed social inclusion issues affecting women, youth, and children?
– If you get water from a well, borehole, or community tap, is it working today?
– Go to your local health center and tell us: Do they give free HIV / AIDS tests? Report YES or NO and HEALTH CENTER NAME

UNICEF’s U-Report

• Eventually, UNICEF started receiving urgent, unsolicited messages
– FLOOD.villages of X, Y sub.county suffering.
• UNICEF Nigeria alone now receives 10,000+ unsolicited messages per day
• UNICEF needs a way to:
– Identify topically relevant messages to share with specific partners
– Prioritize which messages to respond to first
• Idibon labels messages with urgency, category label, and language, in real time
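Once each message carries the three labels, routing and prioritization become simple data operations. The sketch below is hypothetical throughout: the `LabeledMessage` fields, the `TriageQueue` class, the numeric urgency scale, and the second sample message are assumptions made for illustration, not Idibon’s or UNICEF’s actual schema or pipeline.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class LabeledMessage:
    text: str
    urgency: float   # hypothetical score in [0, 1], higher is more urgent
    category: str    # hypothetical label, e.g. "flood"
    language: str    # hypothetical language code, e.g. "en"


class TriageQueue:
    """Hands back the most urgent message first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: earlier arrivals first

    def push(self, message):
        # heapq is a min-heap, so urgency is negated to pop the largest first.
        heapq.heappush(self._heap, (-message.urgency, next(self._order), message))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def for_category(self, category):
        """Messages a partner interested in one category would receive."""
        return [m for _, _, m in self._heap if m.category == category]


queue = TriageQueue()
queue.push(LabeledMessage("thank you for the poll", 0.1, "other", "en"))
queue.push(LabeledMessage("FLOOD.villages of X, Y sub.county suffering.", 0.9, "flood", "en"))

flood_messages = queue.for_category("flood")
first = queue.pop()
print(first.text)
```

A heap keeps insertion cheap even at thousands of messages per day, and the arrival counter means two equally urgent messages are answered in the order they came in.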



Lesson #1: It’s difficult to predict how many new people will use your product / service when you start supporting a new language

Non-English Languages of Nigeria

[Figure: bar chart with bars for Igbo, Yoruba, Hausa, Pidgin English, Ijaw, Tiv, Fulfulde, Ibibio, Edo, and Kanuri, on an axis running from 0 to 30]

[Figure: # unsolicited Hausa messages per day, with the point where Hausa polls begin marked]

* But we don’t see the same effect for Yoruba

Lesson #2: Language mixing in an African context has different considerations for classification algorithms vs European language code-switching

• Downside: complex tokenization

• Upside: radically different word structure

Lesson #3: Geopolitical context affects how we interpret short messages, and it’s constantly changing

Lesson #4: Mutually exclusive categories are elusive. To automatically label messages is to discover the endless ambiguity in human discourse.

- Is a washed out road more related to infrastructure or personal safety?

- Is education scoped to a particular time in life? Does post-graduate education count? What about education outside of a scholastic context?

- If a town’s full name is “Mbale Village,” is “Mbale” a valid place name?

- How specific do messages need to be to constitute a security threat? Does “these days some of our young people are not safe” count?
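One common way to live with overlapping categories, offered here as an illustration rather than as what the project did, is to let a message carry every label whose score clears a threshold instead of forcing a single winner. The scores, category names, and the 0.5 threshold below are all hypothetical.

```python
def single_label(scores):
    """Force one category: keep only the top-scoring label."""
    return max(scores, key=scores.get)


def multi_label(scores, threshold=0.5):
    """Allow overlap: keep every label whose score clears the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical classifier scores for a message about a washed-out road.
scores = {"infrastructure": 0.62, "personal safety": 0.58, "education": 0.03}

print(single_label(scores))
print(multi_label(scores))
```

With a single label, the near-tie between infrastructure and personal safety is hidden; with multiple labels, both partners who care about the message can see it.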

Conclusions

• Crowdsourcing, machine learning, and the proliferation of cell phones make amazing new communication tools and digital language data possible
• Invest in translators and analysts

Thank you!
