[séntisis] sentiment analysis in usa spanish


Upload: sentisis-analytics-sl

Post on 07-Aug-2015

WHO WE ARE

Spanish language in the USA

Processing texts in Spanish from the USA is a challenge because of the special features that Spanish has in this country. Let’s see some of them.

Sentiment analysis and Spanish language in the USA

Sentisis’ NLP engine can process Spanish and its geographic variants to provide semantic analysis of large textual datasets. The engine is fed with a database currently composed of linguistic constructions (language units and combinations that go beyond single words) typical of Peninsular Spanish, of other variants such as Mexican and Colombian Spanish and, more recently, of Spanish in the United States.

Spanish speakers all over the world share more than 75% of the language. Lexicon, much more than grammar, is the feature that introduces the most variation between Spanish speakers in Spain and America, and between speakers from different parts of America. To make the language processing engine understand Spanish language variants, our language team began to work on this changing lexicon.

Spanish in the USA is spoken by groups of people who, in many cases, come from other parts of America. Each speaker brings a native language variant. Therefore, Spanish in the USA is composed of a myriad of words and expressions from different regions.

The United States has a very large set of differential lexicon. So far, our work has been intensive and is based on corpus compilation and annotation. Instead of annotating every word or expression, we focus on lexicon from sectors that use Sentisis as a monitoring tool, such as politics, the food industry, corporate reputation…

LEXICAL DISPERSION

For example:

¡qué chévere estar aquí con los chicos!

Spanish in the USA incorporates a large number of English words that coexist with Spanish words, leading to the Spanglish phenomenon.

If an English word or expression is important for the engine to correctly grasp the topic or sentiment of the message, we add it to the database as a typical construction of Spanish in the USA.

An added difficulty is that the English lexicon has different morphological features from Spanish (different noun or verb endings). In these cases, our engine can detect the different forms associated with a single lexical root, as it does with Spanish.
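As a rough illustration of grouping several inflected forms under one lexical root, here is a minimal Python sketch. It is not Sentisis' implementation: the suffix list, the minimum root length and the function name `lexical_root` are assumptions made for this example, and it only covers "-ear" verbs such as the borrowed forms quoted in this document.

```python
# Hypothetical endings for "-ear" verbs adapted from English.
# This list is an assumption for illustration, not a complete paradigm.
SUFFIXES = ("eando", "earon", "eamos", "eado", "ear", "ean", "eas", "ea", "eo")


def lexical_root(token: str) -> str:
    """Return a crude root by stripping the longest known ending."""
    word = token.lower()
    # Try longer endings first so "ear" wins over "ea" when both could apply.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Keep at least three characters so short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Forms taken from the examples quoted in this document.
forms = ["janguear", "janguea", "jangueo", "fowardear"]
roots = {form: lexical_root(form) for form in forms}
```

With this sketch, "janguear", "janguea" and "jangueo" all map to the same root, so a single database entry could cover them. A production system would more likely rely on a proper morphological analyser than on plain suffix stripping.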

LEXICON IN ENGLISH

For example:

“cool”: “es muy cool tu nueva colección de llaveros”

“mall”: Los más hermosos y exóticos bonsai serán expuestos durante este fin de semana en el mall ¡No te lo pierdas! #InspiradoEnMamá

North American Spanish has its own lexicon, like the other Spanish variants of Europe and America. This lexicon is influenced by English, so calques and transliterations are very common.

If these words express the topic or sentiment of the message, they are added to our database as Spanish entries tagged with the geographical label of the USA.
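One way to picture a database entry that carries a geographic label is the short Python sketch below. The `LexiconEntry` fields, the "US" label value and the `find_entries` helper are hypothetical names chosen for this example. The expressions and glosses come from the examples in this document, and the matching is a naive substring test rather than the engine's actual lookup.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LexiconEntry:
    expression: str  # surface form stored in the lexicon
    gloss: str       # meaning in general Spanish
    region: str      # hypothetical geographic label, e.g. "US"


# Entries built from the examples quoted in this document.
LEXICON = [
    LexiconEntry("llamar para atrás", "devolver la llamada", "US"),
    LexiconEntry("fowardear", "reenviar un correo electrónico", "US"),
    LexiconEntry("frizar", "congelar", "US"),
    LexiconEntry("janguear", "salir a tomar algo", "US"),
]


def find_entries(text: str, region: str) -> List[LexiconEntry]:
    """Return entries for the given region whose expression occurs in the text."""
    lowered = text.lower()
    # Naive substring match: inflected forms such as "llama para atrás"
    # are not found unless they are normalised to the stored form first.
    return [e for e in LEXICON if e.region == region and e.expression in lowered]


message = "me dice que ella solo filtra y me tiene que fowardear"
hits = find_entries(message, "US")
```

Keeping the region as a separate field means the same lookup can be restricted to one variant or widened to several without duplicating entries.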

Estadounidismos, or native words of Spanish in the USA

For example:

“jugársela frío”: from the English expression “to play it cool”, which means “tomárselo con calma”.

No te compliques la vida y juegala cool

“llamar para atrás”: from the English expression “to call back”, which means “devolver la llamada”.

Ella agarra mi iPhone y llama para atrás a todos los numeros q llame solo para ver quien es, muchos diran q es loca…para mi eso es amor <3

Ve mi llamada perdida y no llama para atrás.

“fowardear”: from the English verb “to forward”, which means “reenviar un correo electrónico”.

RT @Lauritten: Por favor podran fowardear el cuento? me gustaría recibir sus criticas en el blog http://laurit… (cont) http://deck.ly/~IdDxs

10min de espera en callcenter #chilectra, una simpática srta verifica que yo soy yo y me dice que ella solo filtra y me tiene que fowardear

The main difficulty of processing Spanish in the USA lies in its huge lexical variation, as it is composed of words and expressions from different sources. Although this could require large language resources, we address the issue by focusing the analysis on domains of knowledge.

When analysis is oriented towards a specific area or domain of application, lexical work becomes easier. In addition, the precision and recall of our monitoring tool improve.
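For readers unfamiliar with the two metrics, the sketch below shows how precision and recall are usually computed. The message ids are toy values invented for the example, not Sentisis results, and `precision_recall` is a hypothetical helper name.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Compute precision and recall from two sets of message ids."""
    true_positives = len(predicted & relevant)
    # Precision: share of flagged messages that are actually relevant.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of relevant messages that were flagged.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data (assumption): ids flagged by a tool vs. ids marked by an annotator.
predicted = {1, 2, 3, 4}
relevant = {2, 3, 4, 5, 6}
p, r = precision_recall(predicted, relevant)
# Three ids overlap, so p = 3/4 and r = 3/5.
```

A domain-specific lexicon can raise both numbers: fewer off-topic matches improve precision, and better coverage of the sector's vocabulary improves recall.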

STUDY OF LEXICON BY THEMATIC FIELDS

“frizar”: from the English verb “to freeze”, which means “congelar”.

Salí a buscar algo y se me frizo hasta el social security

“janguear”: from the English verb “to hang out”, which means “salir a tomar algo”.

La mayoria de la gente llegando de un jangueo y yo pues llegando a la Uní.

Hoy no se janguea tengo examen mañana hoy uno se amanece.