

A Test of Media Capture by Using Machine Learning Techniques

Evidence from Italian Television News 2010-2014

Andrea De Angelis∗

Alessandro Vecchiato†

September 9, 2016

VERY PRELIMINARY. PLEASE DO NOT CIRCULATE.

Abstract

A central question in political communication concerns the existence and extent of media bias and how it could affect political stability and electoral outcomes. We address this question by using novel methodologies based on machine learning techniques and an original dataset collecting the entire corpus of national TV news outlets (including Rai, Mediaset, and “LA7” TV networks) in Italy from 2010 to 2014. Textual models can perform linguistic and substantive analysis of large corpora of text by exploiting variation in language use across and within authors, documents and time. In this paper we first estimate ideology scores for each TV outlet and analyze their change in the period under study. Secondly, we exploit the discontinuity in public TV ownership in our data to determine the existence of media bias in the news outlets as a way to make public TV a more “favorable” environment for the incumbent right-wing government. Finally, we identify key news topics and track their salience over time and across networks. This methodology allows us to test whether political leaning in the news is due to strategic issue selection in news coverage or to differing frames in the communication of news stories.

∗ Andrea De Angelis, European University Institute, Dpt. of Political and Social Sciences, via dei Roccettini 9 I-50014, San Domenico di Fiesole (Italy); [email protected];

† New York University, Wilf Family Dept. of Politics, 19 W 4th Street, Room 302, 10012 New York, NY; alessan-[email protected];


1 Introduction

Several papers find a remarkable correlation between the setup of the media landscape and political outcomes in the US and around the world (see e.g. Djankov et al. (2003); McMillan and Zoido (2004); Reinikka and Svensson (2005); Gentzkow and Shapiro (2010); Durante and Knight (2012); DellaVigna et al. (2016)).

To understand the complex interaction between media outlets and political actors, both economists and political scientists have examined the specific factors that shape incentives to manipulate the quality and content of the news that reaches voters: media owners can exploit political connections to reach funding resources outside the advertisement market, while politicians benefit from a favorable media arena that reduces voters’ monitoring ability and therefore their accountability¹ (Besley and Prat, 2006; Prat and Stromberg, 2013).

One crucial problem is how the ownership structure affects news selection and framing to favor a specific political party (Strömberg, 2015). This paper follows the literature on media and explores the role of news and issue framing as an alternative measure of bias. Research in economics suggests that media outlets show a strong liberal leaning. Groseclose and Milyo (2005) report that in the universe of US news media, only Fox News and the Washington Times received scores to the right of the center. Bias can be driven by owners’ preferences or by audience demand². Media outlets can deliberately modify their language by including political slant in order to attract readers with similar political views (Gentzkow and Shapiro, 2010). Alternatively, they may follow the ideological sensibility of their ownership or of a political party they are trying to support. In this last case, bias emerges as a result of political capture of the media. We focus on this case and provide evidence of media capture by exploiting quasi-exogenous variation in media ownership.

To investigate this relationship we face several challenges. First, media ideological position is very difficult to estimate. A number of papers have resorted to various techniques that detect ideological correlates in newspaper language usage. Groseclose and Milyo (2005) consider a group of 200 prominent think tanks or policy groups and count the times a particular member of Congress cited one of them. They perform the same procedure for a number of newspapers and other media outlets and assign an ideological score to each of them on the basis of the frequency with which each think tank was mentioned. This procedure allows them to link the ideological bias of media outlets to that of other political actors and so derive an ideological measure of media. A more recent paper by Gentzkow and Shapiro (2010) focuses in a similar fashion on media slant. To assign a particular ideological leaning to specific language, they refer to the Congressional Record and identify those sets of phrases that are used much more frequently by one party than the other. They then calculate the number of times each outlet resorts to particular language that may sway voters to the left or the right of the political spectrum and assign to each outlet the corresponding ideological score. However, while these measures may be able to capture newspapers’ ideological leaning, they are focused on a specific way in which the media outlet may try to influence reporting.

¹ For a comprehensive review of the results on media and politics read Strömberg (2015).

² See below for more extensive results on theoretical models of bias.

We overcome this difficulty by adopting a new unsupervised machine learning technique first introduced by Slapin and Proksch (2008). WORDFISH is a scaling algorithm that estimates policy positions based on word frequencies in texts. Following the naïve Bayes assumption prevalent in the text analysis literature (Eyheramendy et al., 2003), this algorithm represents a text as a vector of word counts. Individual words are assumed to be distributed at random, and word frequencies to be generated by a Poisson process. The procedure treats each piece of text as the expression of a separate ideological position and, based on the word frequencies, estimates the relative weight of each word in discriminating among ideological positions, together with the ideology score of each document. All parameters are estimated simultaneously for the entire corpus of text. The model can be expressed as follows:

$$y_{ijt} \sim \mathrm{Poisson}(\lambda_{ijt}) \tag{1}$$

$$\lambda_{ijt} = \exp(\alpha_{it} + \psi_j + \beta_j \cdot \omega_{it}) \tag{2}$$

where $i$ indexes documents, $j$ indexes tokens (i.e. word stems when only unigrams are used, or n-grams), and $t$ indexes time. The single Poisson parameter $\lambda_{ijt}$ is modelled as a function of four latent components: $\alpha_{it}$ are document-specific fixed effects; $\psi_j$ are word-specific fixed effects (capturing the relative frequency of the words); $\beta_j$ are word-specific discrimination weights, capturing the ability of words to discriminate between ideological positions; and finally $\omega_{it}$ is the latent position of the document. This process allows us to estimate relative scores not only across political actors, based on relative word frequencies, but also over time, by assuming independence across texts. The omega scores thus represent measures of ideology within the spectrum of the available texts and are allowed to change over time.
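As a minimal sketch of equations (1) and (2), the following Python snippet evaluates the Poisson log-likelihood of a small count matrix for given parameter values. It is illustrative only and is not the estimation routine used in the paper: the function names are hypothetical, the time index t is folded into the document index i (each document is one text at one point in time), and the toy counts and parameter values are invented for the example.

```python
import math


def wordfish_rate(alpha_i, psi_j, beta_j, omega_i):
    """Poisson rate lambda = exp(alpha + psi + beta * omega), as in equation (2)."""
    return math.exp(alpha_i + psi_j + beta_j * omega_i)


def poisson_log_pmf(y, lam):
    """log P(Y = y) for Y ~ Poisson(lam), as in equation (1)."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)


def log_likelihood(counts, alpha, psi, beta, omega):
    """Sum the Poisson log-probabilities over documents i and tokens j.

    counts[i][j] is the number of times token j occurs in document i.
    """
    total = 0.0
    for i, row in enumerate(counts):
        for j, y in enumerate(row):
            lam = wordfish_rate(alpha[i], psi[j], beta[j], omega[i])
            total += poisson_log_pmf(y, lam)
    return total


# Toy data (invented): two documents, three tokens.
# Token 0 is typical of positive positions, token 1 of negative ones,
# and token 2 is equally frequent everywhere (beta = 0).
counts = [[8, 1, 5],
          [1, 8, 5]]
alpha = [0.0, 0.0]
psi = [1.0, 1.0, 1.6]
beta = [1.0, -1.0, 0.0]

omega_good = [1.0, -1.0]   # positions consistent with the word usage
omega_bad = [-1.0, 1.0]    # positions swapped between the two documents

ll_good = log_likelihood(counts, alpha, psi, beta, omega_good)
ll_bad = log_likelihood(counts, alpha, psi, beta, omega_bad)
print(ll_good > ll_bad)
```

In this toy example the positions that match the pattern of word use give the higher log-likelihood. A maximum likelihood routine would search for the parameter values that maximize this quantity, after fixing the location and scale of the omega scores so that the model is identified.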

A second challenge comes from the intrinsically unobservable nature of media capture practices. Politicians may influence the media through bribes, legal favoritism or by appointing sympathetic managers. McMillan and Zoido (2004), for instance, use a secret police account of government bribes to investigate corruption in Peru. They find that media, and especially TV channels, received the largest shares of bribes during the Fujimori regime. Tella and Franceschelli (2011) use government advertising practices as a proxy for favoritism. They find that newspapers with government advertising are less likely to talk about government corruption, with a one standard deviation increase in government advertising being associated with a decrease in coverage of corruption scandals of 0.23 of a front page per month. Nevertheless, these measures do not allow for a systematic study of capture, given their rarity or their underestimation of the extent of corruption (advertising may be only one of the strategies the government uses to favor specific actors).

Our setup provides a number of solutions to these challenges. The Italian media landscape has been widely criticized for its general lack of impartiality, to the point of blatant political bias. The public ownership of three media channels, Rai 1, Rai 2 and Rai 3, gives the government the prerogative of nominating their CEOs and newscast editors. Grasso (2004) describes in detail the historical process that led to the politicization of Italian state television. In 1975, the government reformed³ the national television system, shifting the prerogative of nominating the TV managers from the government to the parliament. This led to a practice called ‘lottizzazione’ (lotting) that resulted in the consistent nomination as directors and editors of figures who were politically affiliated with specific national parties. Thus, each channel had a well-defined (though not formally stated) political affiliation: Rai1 typically supported the incumbent government (Democrazia Cristiana), Rai2 supported the Socialist Party, and Rai3 the Communist Party (or the far left). This practice evolved after 1993 toward a stronger focus on television newscasts. After each election, with a new parliament and government in place, the editors of the national channels (particularly Rai1 and Rai2) were typically replaced with journalists or political figures closer to the new establishment.

³ Legge n. 103 del 14 aprile 1975 in matters of national television broadcasting.

Figure 1: Patrick Chiappori on Berlusconi Resignation. Source: New York Times

This problem was exacerbated by the candidacy of Silvio Berlusconi, a media and real estate tycoon, who entered the political arena in the aftermath of the Mani Pulite scandal in 1993. When in power, Silvio Berlusconi never completely divested his ownership of three TV channels, Canale 5, Italia 1 and Rete 4, officially controlling either directly or indirectly 86% of the media market in Italy. DellaVigna et al. (2016) study the Italian case from the advertising market perspective. Given that in Italy government officials are not required to divest business holdings, they are able to test whether during the years of Berlusconi’s governments there was strategic advertising toward his networks as a form of political support. They find that during Berlusconi’s political tenure the profits of Mediaset (his TV company) increased by one billion Euros.

Our setup exploits a similar empirical strategy to identify government influence on television. In the aftermath of the financial crisis of 2008, Berlusconi’s government was facing incredible pressure from the international markets to reduce the increasing national debt. His failure to handle the situation and reduce the national bond spread over the Euro zone led to his resignation on November 12, 2011. We exploit this quasi-exogenous variation in government power due to his rapid decline to detect changes in media ideological leaning as a consequence of capture. By comparing TV news ideological scores before and after his resignation, and each time a network director was replaced, we are able to provide significant evidence of capture in the Italian media.


Another advantage of our setup comes from the use of a novel dataset of TV news transcriptions collected by the Italian service of Teche Rai. The data cover the universe of TV news in Italy in the period 2010-2014. That is, we are able to derive an ideological score for each national TV network news service⁴ that aired each day in the specified time frame. The extent of this data, matched with our methodology, gives us numerous advantages with respect to previous research. First, we do not have to resort to methods that hand-code network ideology and are thus sensitive to researcher discretion. Secondly, we do not have to select specific significant words as representative of ideological leaning from which to estimate media ideology. Instead, we simultaneously estimate the words’ ideological scores and are able to select ex post which are the most ideologically charged expressions. Finally, we are able to track ideological movements over time and with the highest degree of granularity⁵.

This paper provides a number of new results. We find that Italian newscasts have significantly different political leanings that map broadly onto the Italian political spectrum, where Berlusconi’s TV networks map toward the right while historically leftist channels map to the left. This evidence is consistent with previous results by Gentzkow and Shapiro (2010) and Groseclose and Milyo (2005), who report significant political leaning for American newspapers. Differently from previous research, our work shows that this differentiation is not due to viewers’ demand but to political capture. We support this conclusion with two different empirical strategies. First, we provide evidence of strategic reporting by the TV networks. More pro-government networks tend to substitute economic issues that were particularly problematic during the financial crisis with crime and entertainment reporting. We thus do not find differences in issue framing across networks, but only in which issues receive more emphasis. Secondly, we identify specific structural breaks in news reporting and match them with the political agenda. We find that, consistently with the change in government in 2011, the average score of the TV newscasts shifted significantly to the left of the spectrum. We take these results as robust evidence of capture in the Italian television system.

This paper is structured as follows. Section 2 describes the data and the methodology in detail. Section 3 provides the baseline results on ideology for both media and MPs. Section 4 shows the ideological change around the temporal discontinuity in 2011 and runs a battery of tests to confirm the graphical result. Section 5 concludes.

2 Data

2.1 Teche Rai News Database

The main source of data for this study was obtained from the Teche Rai Service, the historical archive of Italian radio and television broadcasts implemented by the state broadcasting company RAI. These data were developed from the audio-video files with an Automatic Newscast Transcript System (ANTS) platform, especially targeted to news programs. The obtained transcription quality is about 90% correct recognition. Also, since the text is synchronized with the multimedia signal, given a word the researcher can have immediate access to the segment where it is pronounced. In addition, a validator performs a segmentation of the signal based on the speech footprint of the speaker. The obtained transcripts are good enough to be used for text-based search and information discovery by, for instance, full-text search engines and artificial intelligence techniques. The project has now completed the transcription of the corpus of newscasts from 2010 to 2014.

⁴ Italy has seven main national TV channels, three publicly owned (Rai1, Rai2, and Rai3) and four private (Canale5, Rete4, Italia1 and La7). The media landscape hasn’t changed since then.

⁵ As we later explain, we proceed by grouping TV transcriptions by week.

These data present a unique and verbatim source for text analysis, with the most detailed level of granularity. In its original form, a text document is represented by a segment on a particular topic during a newscast. Each of the seven Italian broadcasting networks typically runs three daily newscasts of about 30 minutes (morning, noon and evening service). The total amount of data collected by this database therefore amounts to almost 20,000 hours of broadcasting divided into 319,895 segments.

The dataset presents a number of convenient features. In addition to the transcription of the newscast services, the dataset contains a number of metadata fields that organize the transcripts and ease analysis, such as network, starting and ending time of each segment, and subject⁶.

In order to make sure that our ideological scores come exclusively from journalistic reporting, we delete from the dataset all transcriptions referring to speech performed by politicians. That is, all politicians’ interviews, remarks and otherwise recorded speech are excluded from our analysis. We keep instead any journalistic commentary on those parts. Overall, all the text we analyze consists of chronicles and opinions from journalists and reporters. In this sense, our study assumes a hypothetical perfect pluralism as for the presence of politicians on TV, which is the main criterion adopted by the Italian Communication Authority (AGCOM)⁷.

2.2 Legislative Speeches

We preliminarily test our methodology with an application to unambiguously ideological actors: the Italian MPs. For the analysis of the legislative speeches we exploit an original dataset directly developed by the authors. We scraped from the official website of the Italian Chamber of Deputies⁸ the entire corpus of the debates of the XVI legislature (corresponding to the years 2008-2013) recorded in the Italian lower chamber (Camera dei Deputati). Yet, we limit our focus to the time frame in which the Berlusconi cabinet was in charge (2008-2011). In fact, the end of the Berlusconi experience led to the formation (16 November 2011) of the Monti cabinet. The latter period was characterized by highly exceptional political circumstances. The technocratic government’s political backing changed twice during its term⁹. This implies that whenever the MPs addressed the government in their speeches, the statistical algorithm could not possibly distinguish between references to the Berlusconi government and references to the subsequent Monti one. Switching opposition and majority this frequently would have significantly increased the noise in our estimates and compromised the validation strategy we are using.

⁶ By subject we refer to the general journalistic categorization into politics, chronicles, economy, arts and sports.

⁷ More detailed information regarding the monitoring of political pluralism, the aims of the AGCOM authority, and the Law n. 249/1997 (the “par condicio” law) regulating the issue of pluralism in the media is available (in Italian) from the following AGCOM link: https://www.agcom.it/ldisciplina-della-par-condicio-

⁸ The website is at the following link: http://leg16.camera.it/207. The web scraping was performed relying on the Python package Beautiful Soup.

⁹ Monti was initially supported by both of the two main Italian parties (the Democratic Party and Berlusconi’s People’s Freedom Party), and this means that two previously opposing forces would start a radically new political phase in which they supported the same government.

Our corpus includes all the official sessions of the legislature, including all the secondary discussions that concerned the legislative activities (decree presentations, discussion of amendments, final debates and voting statements). We excluded the work of the various committees, Q&A sessions with the members of the government, as well as the parliamentary investigations. Overall, we count the transcripts of 739 parliamentary sessions that correspond to 18,356 single interventions. All the interventions of the President of the Chamber (On. Gianfranco Fini) were removed from the corpus. For each transcript, we harvested information regarding the presenters’ names, the date and the session of the intervention, and the party affiliation of the presenter. In this way we were able to group the legislative speeches by MP, leading to a final dataset of 518 documents (not all the MPs made at least one intervention; see Appendix A for a summary list of MPs by number of interventions).

3 Analysis and preliminary results

This section introduces the main findings of our analytical effort. It is organized in three subsections. In the first place, we carry out a preliminary test of ideological scaling of non-policy text. This is motivated by the fact that the WORDFISH algorithm is typically applied to manifesto documents containing explicit and (questionably) exhaustive, or at least systematic, text regarding the main policy positions of political parties and movements. Our application deviates from these applications, as our text corpus does not contain policy positions.

Subsection 3.1 shows that it is indeed possible to detect latent political positions even if the political text does not directly involve policy positions. Next, in subsection 3.2 we apply the text scaling algorithm to the corpus of TV news transcripts and report details on the estimation process as well as on the main results. We anticipate that the estimates reveal systematic differences between the main Italian TV channels that are compatible with our prior knowledge of the ideological leanings in the Italian media system. Finally, subsection 3.3 investigates the two potential mechanisms of transmission of the political signal: strategic issue selection works through systematic differences in the amount of time devoted to distinct issues between the TV outlets; issue framing operates instead through the usage of systematically different words to present the same issue. Our findings seem compatible with the issue selection mechanism, while no supportive evidence is found for the framing mechanism.

3.1 A preliminary test of text scaling using parliamentary debates, Italy 2008-2011

We present a preliminary analysis of the Italian Chamber of Deputies’ parliamentary debates. This scaling exercise has two purposes. First, it lets the reader become familiar with WORDFISH and the process of estimating latent scores. Second, it shows that it is possible to extract face-valid latent positions from non-policy text.

The “parliamentary-text test” is a particularly harsh one for text scaling. In fact, parliamentary debates typically involve lengthy procedural discussions; debates refer to legislative measures rather than explicitly to political issues and positions; and the language that is spoken in parliament does not privilege clarity and simplicity, but inevitably involves articulated considerations, references to previous debates or events, ambiguous statements, and a generally allegorical and sophisticated style of communication¹⁰. Thus, our expectation is that if we observe systematic differences among parliamentary groups based on the language that the single MPs have been using during the debates, then it will be possible to extract political positions from other sources of non-policy text as well.

Figure 2: Box plot of political groups based on scaling of MP interventions, Italy 2008-2011

We pretreat the corpus of legislative speeches with a procedure that will be detailed in subsection 3.2. This pretreatment involves the removal of all punctuation characters and numbers, the reduction of the words to stemmed tokens¹¹, the computation of a list of unigrams (e.g. ‘govern’, ‘berluscon’) and bigrams (e.g. ‘govern_berluscon’), and the removal of Italian stopwords¹². When all these text preparation steps are completed, we convert the documents of MPs’ speeches into a document-feature matrix presenting all the tokens that are identified in rows and the documents ordered in columns. The entries of the document-feature matrix are thus word frequencies (i.e. absolute word counts for each document).
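The conversion into a document-feature matrix can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors’ code: the function names and the two toy documents are hypothetical, the input is assumed to be already cleaned and stemmed, and the matrix is laid out as described above, with tokens in rows and documents in columns.

```python
from collections import Counter


def ngram_tokens(stems):
    """Return the unigrams plus the underscore-joined bigrams of a stem list."""
    bigrams = [f"{a}_{b}" for a, b in zip(stems, stems[1:])]
    return list(stems) + bigrams


def document_feature_matrix(docs):
    """Build a count matrix from already stemmed documents.

    docs maps a document name to its list of stemmed tokens.
    Returns (features, names, matrix), where matrix[r][c] is the absolute
    count of features[r] in document names[c].
    """
    counts = {name: Counter(ngram_tokens(stems)) for name, stems in docs.items()}
    features = sorted(set().union(*counts.values()))
    names = list(docs)
    matrix = [[counts[name][feat] for name in names] for feat in features]
    return features, names, matrix


# Toy input (invented): two MPs with a handful of stemmed tokens each.
docs = {
    "mp_a": ["govern", "berluscon", "govern"],
    "mp_b": ["opposizion", "govern"],
}
features, names, matrix = document_feature_matrix(docs)
for feat, row in zip(features, matrix):
    print(feat, row)
```

Each row of the result is one unigram or bigram and each column one MP, so a row such as `govern` holds the number of times that stem occurs in each MP’s merged speeches.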

The results of our scaling exercise are presented by political group in the lower chamber. The box plot in Figure 2 represents the document-specific (i.e. MP-specific) omega scores that were computed with WORDFISH on the corpus of the interventions at the Chamber of Deputies.

Results range from an average estimated position of $\omega_{IdV} = -0.782$ for the group of the Italy of Values (IdV), to the $\omega_{LN} = 0.324$ estimated for the group of the Northern League (LN). The value of the People’s Freedom Party is very close to that of LN, with a value of $\omega_{PdL} = 0.318$. This makes sense considering that the two parties were government partners. All the opposition political groups display negative average omega values, as could reasonably be expected. Also, the ranking of the opposition groups follows our prior expectation. The most left-wing political group appears to be the Italy of Values (with the value of $-0.782$ recalled above). The IdV used to be a centrist, anti-corruption movement. However, after the 2009 European Parliament elections, the party undertook a populist turn and strengthened its relationships with the parties of the radical left¹³. The party also led to the election of a left-wing mayor in Naples (Luigi De Magistris) in alliance with the Federation of the Left, and against the Democratic Party. Considering the historic absence of a Communist or Socialist political group in the XVI legislature, we can reasonably think of the IdV as the most left-wing political group, or at least the one most clearly opposed to Berlusconi and his government.

¹⁰ The language employed by Italian politicians is known in the press jargon as “politichese”.

¹¹ Stemming is the process through which every word in the text (e.g. conjugated verbs, plural substantives, derived forms) is reduced to its root form. A related process is lemmatization, which reduces the words to their morphological root. For instance, stemming the vector of Italian words {‘governo’, ‘governare’, ‘governativo’} we obtain the stemmed vector {‘govern’, ‘govern’, ‘govern’}.

¹² Stopwords are those words that are functional to the creation of syntactic structures of the text. They are omitted because they are unrelated to the content.

Between the two poles of the IdV and the groups supporting the government, the algorithm ranks respectively Future and Freedom ($\omega_{FLI} = -0.259$), the Democratic Party ($\omega_{PD} = -0.233$), the mixed group of non-inscrits ($\omega_{Misto} = -0.230$), and the centrist Union of the Center ($\omega_{UdC} = -0.085$). Future and Freedom was a liberal-conservative group and a negative score may seem unreasonable. Yet, we believe that a negative score is indeed a meaningful one. First, the high dispersion of MPs’ scores seems to suggest substantial uncertainty around the position of FLI. Second, this is the party of President Gianfranco Fini’s followers. FLI was created after the split from the People’s Freedom Party due to its recurring critiques of Berlusconi’s government. Famously, Fini defended more progressive stances on social issues such as immigration (defending the right to vote for resident immigrants at the local elections¹⁴) and demanded a stronger role for his faction within the PdL. It is thus likely that the latent scores computed from parliamentary debates better capture the degree of opposition towards the government rather than an overall ideology score.

We read the results of this preliminary test as evidence that WORDFISH is able to estimate meaningful latent scores from non-programmatic political texts such as the transcripts of the parliamentary debates. In the next subsection 3.2 we apply the scaling algorithm to our corpus of TV news transcriptions.

3.2 Scaling the fourth estate: extracting latent positions from TV news, Italy 2008-2014

The Italian TV news corpus consists of a total number of 319,895 recorded news stories¹⁵ broadcast by the main Italian TV news programs: TG1, TG2, TG3, TG4, TG5, Studio Aperto and TG7. In this subsection we offer a detailed account of the data preparation and estimation process and discuss the main findings. In the next subsection 3.3 we investigate the mechanism behind the TV programs’ political leaning.

¹³ The more centrist faction of IdV split from the party in November 2009.

¹⁴ “Fini: ‘Sì al voto agli immigrati’ ”, Il Sole 24 Ore, 04 September 2008.

¹⁵ The initial total number of news stories is 412,039, but we remove a number of news stories for which the date is not available.


Data preparation

The analytical task requires a number of preliminary steps. First, we subset the news stories to include only politically relevant topics. To this end we exclude from the corpus the following news topics: sport (we thus also exclude football news¹⁶), music, shows, culture and science news, weather and natural disasters, and reported cases of death or illness of important people. This leaves us with 276,266 valid news transcripts to consider in the analysis. Secondly, we undertake a number of operations that are required for the estimation. In fact, text analysis with WORDFISH, as with other “bag of words” approaches, involves converting the corpus of raw documents into a document-feature matrix of discrete occurrences of all the tokens in all the documents.

We place extreme care in the creation of the document-feature matrix, because it represents the core of the estimation process. For all these pre-estimation operations we rely on the tm R package[17]. The sequence of steps that led to the creation of the document-feature matrix is the following:

1. We merge all the news stories’ text by week for each TV channel. This effectively shifts the unit of analysis from the single news story to the [TV channel × week] level. An alternative aggregation scheme could have been the single TV news edition (three per day) or the single day (merging the three editions). Yet, the weekly solution is in our opinion more efficient, in that it does not imply a dramatic loss of information while it substantially shrinks the number of parameters to be estimated. This, in turn, considerably reduces the estimation time.

2. Italian makes frequent use of apostrophes, so conventional punctuation removal tools[18] would strip the apostrophe and join the characters that precede it to the following word[19]. We thus first apply a regular expression that replaces every punctuation character with a single space rather than with an empty string.

3. We strip all extra spaces from the text documents.

4. We transform all text into lower case characters.

5. We remove stopwords from the text[20].

[16] Although one may argue that the coverage of Berlusconi’s football team, A.C. Milan, may not be considered politically irrelevant.

[17] The quanteda package would have been a valid alternative. We opt for tm because it allows tokenization to follow stopword removal, which leads to faster preprocessing, while quanteda requires tokenization before the text can be preprocessed.

[18] Such as the removePunctuation function from the tm R package that we use.

[19] This works on English text corpora because apostrophes indicate possession and thus always follow the word, as in “the professor’s hat”. Punctuation removal would result in “professors hat”, and once the words are stemmed the additional “s” character would be removed as well. In Italian, apostrophes are very often associated with the elision of the article when the following word starts with a vowel. Thus, punctuation removal applied to “l’amica” [the female friend] and “un’amica” [a female friend] would wrongly result in two different tokens, “lamica” and “unamica”, and this would be unaffected by the subsequent stemming process.

[20] We have created a custom Italian stopwords list composed as the union of: 1) the snowball stopwords list that is typically included in the most common R packages (as in tm and quanteda); 2) all the words included in the Ranks NL stopwords list; 3) the following list of 34 additional stopwords chosen after visual inspection: {l, poi, far, quest, qual, tant, quel, dic, so, quell, avev, piu, fa, vorrebb, gia, puo, s, sar, d, nun, ce, n, foss, x, b, va, ogni, vuol, andar, propr, fatt, vann, www, fonte}.


6. We remove numbers from the text.

7. We stem the documents using Porter’s stemming algorithm.

8. We create the document-feature matrix including all the unigrams and the bigrams in the text corpus.

9. We remove the top 5% and the bottom 5% of tokens by frequency[21]. This shrinks the number of tokens from 29,040 to 26,172.

The final document-feature matrix thus has dimensions [26,172 × 881], with 26,172 distinct tokens (unigrams and bigrams) and 881 documents (one for each TV channel per week); its entries are the absolute frequencies of the tokens’ occurrences. The TV news stories range from the week starting on 26 July 2010 to the week starting on 30 September 2014.
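The preparation steps listed above can be sketched in a few lines. The sketch below is written in Python with only the standard library, not with the tm R package that the analysis relies on. The record layout `(channel, date, text)`, the tiny stopword set and all function names are assumptions made for illustration. Stemming (step 7) is left out because it needs an external stemmer, numbers are removed before stopwords for brevity, and the trimming function is only one possible reading of step 9.

```python
import re
from collections import Counter, defaultdict
from datetime import date, timedelta

# Illustrative stopwords only; the custom list described in the text is far larger.
STOPWORDS = {"l", "un", "la", "il", "di", "a", "e", "che"}

def week_start(d):
    """Return the Monday of the week containing date d."""
    return d - timedelta(days=d.weekday())

def clean(text):
    """Steps 2-6: punctuation to spaces, drop digits, lower-case, drop stopwords."""
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation -> space, so l'amica -> l amica
    text = re.sub(r"\d+", " ", text)       # remove numbers
    tokens = text.lower().split()          # split() also squeezes extra whitespace
    return [t for t in tokens if t not in STOPWORDS]

def features(tokens):
    """Step 8: unigrams plus bigrams, the latter joined by an underscore."""
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def build_dfm(stories):
    """Steps 1 and 8: merge stories by (channel, week) and count features."""
    merged = defaultdict(list)
    for channel, day, text in stories:
        merged[(channel, week_start(day))].extend(clean(text))
    return {doc: Counter(features(toks)) for doc, toks in merged.items()}

def trim(dfm, share=0.05):
    """Step 9 (one reading): drop the most and least frequent `share` of features."""
    totals = Counter()
    for counts in dfm.values():
        totals.update(counts)
    ranked = [f for f, _ in totals.most_common()]   # most frequent first
    k = int(len(ranked) * share)
    keep = set(ranked[k:len(ranked) - k])
    return {doc: Counter({f: n for f, n in c.items() if f in keep})
            for doc, c in dfm.items()}

# Two made-up stories from the same channel and week.
stories = [
    ("TG1", date(2010, 7, 27), "L'amica di Roma, 3 anni dopo."),
    ("TG1", date(2010, 7, 29), "Un'amica arriva a Roma!"),
]
dfm = build_dfm(stories)
```

Both stories end up in the single document keyed by `("TG1", date(2010, 7, 26))`, and the token "amica" is counted twice rather than being split into "lamica" and "unamica".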

Results

We run the WORDFISH iterative EM algorithm until it reaches a tolerance threshold of 1e−7[22]. The results of the estimation are reported in Figure 3. Every dot represents a weekly TV program’s omega score. The plot shows on the horizontal axis the date of the specific transcript, while the vertical axis reports the estimated omega score for each [TV program × week] document. We apply a LOESS smoothing filter together with a 99% confidence band to emphasize the trend of the estimates for each TV program.
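As a reminder of what the omega scores measure, the Wordfish model of Slapin and Proksch (2008) can be written as below. This is a sketch of the standard formulation, not necessarily the exact specification estimated here. The index conventions (i for [TV program × week] documents, j for tokens) and the symbol \alpha_i for the document fixed effect are assumptions; \omega, \beta and \psi follow the notation used in the text.

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \, \omega_i
```

Here y_{ij} is the count of token j in document i, \psi_j is a token fixed effect that absorbs overall frequency, \beta_j is the token's discrimination weight and \omega_i is the latent position of the document. The iterative algorithm alternates between updating the document parameters and the token parameters until the convergence tolerance is met.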

The plot points to three important results. First, we can identify differing central tendencies and trends in the series of omega scores, given that the confidence intervals mostly do not overlap. This indicates the presence of systematic differences in word usage. Secondly, we observe a consistent and face-valid ranking of the omega scores, ranging from Studio Aperto and Rete 4 on one pole to TG3 and TG7 on the opposite side of the identified latent space. This links the previous point to the political world. Even if the latent scores based on the reported news are not direct estimates of the ideological slant of TV channels (because our corpus does not include programmatic policy statements), the fact that the three Mediaset-owned (i.e. Berlusconi-owned) news programs appear on the positive-omega pole, while the independent TG7 and the historically left-leaning TG3 appear on the negative-omega side, provides a strong signal that our TV omega scores are highly correlated with the underlying political slant. Indeed, the Italian Authority for Communications (AGCOM), using very different criteria[23], in October 2010 warned TG4, Studio Aperto, and TG1 for excessive political imbalance and disproportionate visibility of right-wing political leaders[24]. While we will deal with TG4 and Studio Aperto thoroughly in the following sections, the case of TG1 will be analyzed in Subsection 3.4. Finally, we notice a trend of growing differentiation over the years. We argue that this growing differentiation can be traced to the fact that “hard news” programs devoted more coverage to the economic crisis, as will be shown in Subsection 3.3.

Figure 3: Wordfish scores of Italian TV news programs, 2010-2014

[21] This is a standard procedure in quantitative text analysis that aims at cutting non-informative long tails in the distribution of words.

[22] To scale up WORDFISH to estimate latent positions in a large data setting we run the model on the EUI HPC cluster. We run the model using the wordfish function in R. The quanteda’s textmodel function could have been a viable alternative. The total estimation time is 3.93 hours.

[23] Their judgement was based only on the direct presence of politicians on the various news programs. The reader should thus notice that, since all politicians’ interviews have been removed, our corpus would represent a case of ‘perfect balance’ in the information under the AGCOM standards.

[24] ‘Telegiornali e pluralismo, ecco i dati’, Corriere della Sera, 21 October 2010 [link here]; ‘Dall’AGCOM arriva diffida al TG1: “Forte squilibrio a favore del governo”. Richiamo al TG4 e a Studio Aperto’, Corriere della Sera, 21 October 2010 [link here]

The representation of the TV-specific trends in the omega scores seems to indicate the presence of a differentiated media content supply in Italy. TV news outlets in Italy appear to be using systematically different words, which leads us to think that TV channels in fact discuss different topics, or different aspects of the same topics. We will provide more detail on these two potential explanations of the divergence in WORDFISH scores in the next Subsection (3.3). To better understand the content of the longitudinal shifts that we observe for all the news programs, it is useful to inspect the word-specific parameters (i.e. the βs) that are associated with the two poles of the latent space identified in the estimation process. Table 1 lists the 25 words most tightly linked to the pole of positive values of the latent scores, and the 25 words most tightly linked to the opposite pole of negative omega values. The full distribution of words can be visually inspected in Appendix B.

We notice that words associated with positive and negative beta scores refer to very different kinds of news content. Positive scores are associated with crime stories (francesc_mort, mort_yar) or accident reports (foll_veloc), while negative scores are associated with hard news regarding political (e.g. ricandidatur, impegn_europe) and economic (e.g. stagnazion, miliard_men) issues. This result may lead to the conclusion that more right-wing networks (that is, networks of the same leaning as the government) supply softer and less politically charged topics. We can conjecture that this could be functional to downplaying the saliency of more thorny or difficult issues. The next Subsection 3.3 will also address this point.


Table 1: List of top 25 words associated respectively with positive and negative beta scores

Negative beta scores

Tokens              b      psi
esperient_govern    -1.50  -2.11
ex_vertic           -1.50  -2.46
econom_grill        -1.50  -2.05
crosett             -1.50  -2.27
merc_unic           -1.50  -2.73
interess_deb        -1.51  -2.65
vot_test            -1.51  -2.63
risors_destin       -1.51  -2.80
resping_mittent     -1.51  -2.42
intes_riform        -1.51  -2.42
ricandidatur        -1.51  -2.42
posit_arriv         -1.52  -2.77
impegn_europe       -1.52  -2.68
segretar_carrocc    -1.52  -2.49
tropp_alti          -1.52  -2.42
confin_turc         -1.52  -2.44
commission_barros   -1.52  -2.42
intes_part          -1.52  -2.34
acquist_ben         -1.52  -2.13
miliard_men         -1.52  -2.32
azion_maggior       -1.52  -2.57
stagnazion          -1.52  -2.41
ex_capogrupp        -1.53  -2.39
govern_grec         -1.53  -2.33
men_previst         -1.53  -2.74

Positive beta scores

Tokens              b      psi
benven_stud          1.55  -3.52
angel_machiavell     1.18  -3.11
gabriell_simon       1.02  -2.12
ser_grad             1.02  -3.16
machiavell           1.01  -2.60
rem_croc             0.98  -3.01
ser_retequattr       0.94  -2.85
francesc_mort        0.92  -3.21
yar_stat             0.83  -2.32
apert_sent           0.83  -3.04
luc_pesant           0.79  -2.53
carmin_martin        0.79  -2.94
andiam_rom           0.79  -3.03
lecces               0.76  -2.40
prim_scompars        0.74  -2.79
ser_stud             0.74  -2.70
massim_canin         0.72  -2.57
massimil_dio         0.67  -2.56
mort_yar             0.62  -2.10
giorn_stud           0.61  -2.83
luis_ross            0.58  -2.20
feder_gatt           0.58  -2.28
bepp_gandolf         0.58  -2.34
foll_veloc           0.55  -2.87
marc_graz            0.53  -2.56
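A listing such as Table 1 amounts to sorting the tokens by their estimated β and keeping the two tails. The Python sketch below illustrates this under the assumption that the estimates are available as a mapping from token to a (beta, psi) pair; that layout and the function name `extreme_tokens` are hypothetical, and the ordering within each tail (most extreme first) is a presentation choice. The five rows are copied from Table 1.

```python
def extreme_tokens(params, k=25):
    """Return the k tokens with the lowest beta and the k with the highest.

    params maps token -> (beta, psi); this layout is an assumption.
    Both lists are ordered from most to least extreme.
    """
    ranked = sorted(params.items(), key=lambda item: item[1][0])
    return ranked[:k], ranked[-k:][::-1]

# A few rows copied from Table 1.
params = {
    "stagnazion": (-1.52, -2.41),
    "govern_grec": (-1.53, -2.33),
    "benven_stud": (1.55, -3.52),
    "mort_yar": (0.62, -2.10),
    "foll_veloc": (0.55, -2.87),
}
negative, positive = extreme_tokens(params, k=2)
```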

3.3 Strategic issue selection or framing: assessing the mechanism of transmission of the media’s political signal

The previous section showed the presence of consistent and systematic differences in the language used by the Italian news programs and connected these differences to their political coordinates. Yet, it remains unclear whether the political leaning of the news is associated with differences in the content (i.e. what is being told to the audience) or whether the scores are driven by differences in the news framing (i.e. how the news is presented).

To address this point, we run WORDFISH on two different subsets of the original TV corpus. In both cases, we focus our attention on the time period in which the Berlusconi IV cabinet was in charge (thus from the summer of 2010 until October 2011). In the first subset we only consider economic news, while in the second we only consider the reports of news related to law and order (including a set of crime news regarding immigration, murders, terrorism, and other generic news stories unrelated to politics and the economy). The idea at the center of this design is that if we are still able to identify systematic differences among news programs when the same issue is covered, then this would imply that the political signal is being channelled through the news frames rather than through the issue saliency.
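The two subsets described above can be thought of as a filter on topic and broadcast date. The following Python sketch shows the idea; the topic labels, the dictionary layout of a story and the exact period bounds are assumptions made for illustration, since the coding scheme of the corpus is not reproduced here.

```python
from datetime import date

# Hypothetical topic labels; the actual coding scheme may differ.
ECONOMY = {"economy"}
LAW_AND_ORDER = {"crime", "immigration", "murder", "terrorism", "current_affairs"}

# Assumed bounds for "summer 2010 until October 2011".
START, END = date(2010, 7, 26), date(2011, 10, 31)

def subset(stories, topics, start=START, end=END):
    """Keep stories on the given topics broadcast between start and end (inclusive)."""
    return [s for s in stories
            if s["topic"] in topics and start <= s["date"] <= end]

# Made-up records for illustration.
stories = [
    {"channel": "TG5", "date": date(2011, 3, 1), "topic": "economy"},
    {"channel": "TG3", "date": date(2011, 3, 1), "topic": "crime"},
    {"channel": "TG3", "date": date(2012, 3, 1), "topic": "economy"},
]
economic = subset(stories, ECONOMY)
law_and_order = subset(stories, LAW_AND_ORDER)
```

Each subset would then go through the same preparation and scaling steps as the full corpus.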


3.3.1 The economy in the news

In order to investigate issue framing we thus focus on two central subjects and exploit the text analysis features to detect the use of different language and especially of qualifiers. The results of this second analysis are reported in Figure 4.

We notice a remarkably different pattern in the scaling of economic news with respect to the case of the entire corpus. The variation in the omega scores does not occur between programs, but rather over time. In fact, we observe a radical movement toward more negative omega values for all the news programs considered, and this corresponds to the worsening of the financial crisis in the second half of 2011.
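One way to make the distinction between "between-programs" and "over time" variation concrete is to ask what share of the total variation in the omega scores is explained by program means. The Python sketch below computes that share; it is an illustrative diagnostic under assumed inputs (a mapping from program to its weekly omega scores), not a statistic reported in the paper, and the numbers are made up.

```python
from statistics import mean

def between_share(scores):
    """Share of total variation in omega explained by differences in program means.

    scores maps program -> list of weekly omega scores (hypothetical layout).
    Returns between-program sum of squares divided by total sum of squares.
    """
    all_values = [v for vals in scores.values() for v in vals]
    grand = mean(all_values)
    total = sum((v - grand) ** 2 for v in all_values)
    between = sum(len(vals) * (mean(vals) - grand) ** 2
                  for vals in scores.values())
    return between / total if total else 0.0

# Two programs following the same downward trend: all variation is over time.
common_trend = {"A": [1.0, 0.0, -1.0], "B": [1.0, 0.0, -1.0]}
# Two programs at stable but different levels: all variation is between programs.
stable_gap = {"A": [1, 1, 1], "B": [-1, -1, -1]}
```

A value near 0, as for `common_trend`, corresponds to the pattern described for economic news; a value near 1, as for `stable_gap`, would correspond to stable differences between programs.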

Figure 4: Wordfish scores of Italian news programs (economic news only), 2010-2011

Similarly to what we did before, to understand the substantive content of the latent “economic news” scores, Appendix D reports the list of words associated with the highest and lowest β discrimination values. Upon inspection, we notice that the “left” pole of negative values deals with problems and issues in the financial crisis, while the “right” pole of positive scores deals with more standard and traditional industrial policy issues.

Having found no systematic differences in the economic news’ frames of reference, we turn our attention to the coverage of economic news. The idea is that Mediaset news programs may strategically downplay the importance of economic issues, given the bad news and the fact that Berlusconi is the incumbent in our time frame. If this is observed, then we may conclude that it is not how the news is reported, but rather what kind of news is covered, that explains the political signals.

This view is indeed corroborated by the evidence provided in Figure 5. This graph represents the cumulative airtime devoted to economic issues by network. Economic issues are more intensely covered by the news programs previously shown to be associated with “left-leaning” omega scores in the full model. By contrast, Mediaset channels appear to have silenced economic issues during the Berlusconi government, as they consistently underreport the financial crisis compared to other TV channels.

Figure 5: Cumulative airtime (in days) devoted to economic news

This provides suggestive evidence of strategic selection of news topics by political affiliation, where networks ideologically closer to the right-wing government were strategically attempting to decrease the saliency of ‘uncomfortable’ issues related to the financial crisis.
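A cumulative airtime series of the kind plotted in Figure 5 can be obtained by summing the duration of the relevant stories per network and week and then taking a running total. The Python sketch below assumes that each story comes with a duration in seconds and an integer week index; both the input layout and the example numbers are hypothetical.

```python
from collections import defaultdict
from itertools import accumulate

SECONDS_PER_DAY = 24 * 60 * 60

def cumulative_airtime_days(segments):
    """Cumulative airtime per network, expressed in days.

    segments: iterable of (network, week, seconds) for one topic (assumed layout).
    Returns network -> list of (week, cumulative airtime in days), ordered by week.
    """
    per_week = defaultdict(lambda: defaultdict(float))
    for network, week, seconds in segments:
        per_week[network][week] += seconds
    result = {}
    for network, weeks in per_week.items():
        ordered = sorted(weeks)
        running = accumulate(weeks[w] for w in ordered)
        result[network] = [(w, total / SECONDS_PER_DAY)
                           for w, total in zip(ordered, running)]
    return result

# Made-up durations for illustration.
segments = [
    ("TG3", 1, 43200), ("TG3", 2, 43200),
    ("TG5", 1, 21600), ("TG3", 1, 43200),
]
airtime = cumulative_airtime_days(segments)
```

Comparing the final values across networks gives the total coverage gap, while the slope of each series shows when coverage was most intense.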

3.3.2 Crime stories in the news

If the conjecture corroborated by the findings in the previous subsection is valid, then we should also observe the same “strategic issue selection” mechanism at work for political issues that are owned by the Italian right (i.e. favourable to it). Thus, in this section we provide similar evidence for crime news. If our argument is correct, we expect Mediaset channels to privilege news stories associated with the traditionally ‘right-wing’ issue of law and order: crimes, murders, immigration, and terrorism news reports.

In the first place, we again investigate whether the TV frames of reference for such crime stories are presented in a similar or diversified fashion. Figure 6 shows that there are no detectable framing effects for crime news either.

Then, we again turn our attention to the cumulative airtime, this time referring to crime news. We find it striking that the patterns of coverage presented in Figure 7 are the mirror image of those previously presented for the coverage of the economy. We find this evidence consistent with our previous argument. While there are no significant differences in the way (how) different newscasts report either economic news (see Appendix C and D for evidence on omega scores and representative words) or crime news and current affairs (refer to Appendix E and F), we find a complementary pattern with respect to coverage.


Figure 6: Wordfish scores of Italian news programs (crime news and current affairs only), 2010-2011

Figure 7: Cumulative airtime (in days) devoted to crime news and current affairs

To recap, in this section we have investigated two potential mechanisms of transmission of the political signal through the news. Strategic issue selection involves downplaying the issues that are more politically uncomfortable for the referent political actors while overemphasizing those that could potentially strengthen the electoral support of the political referent group. We show this empirically, and our results point to the validity of Agenda Setting theories and the relevance of issue saliency for media and political actors. By contrast, we find no empirical trace of a systematic framing power of the media: for a given topic, the language used is essentially undifferentiated. In the final empirical subsection 3.4 we leverage the longitudinal variation of our omega scores to strengthen the internal validity of our estimates.

3.4 Structural breaks

In this subsection we investigate the WORDFISH scores from a time-series perspective, aiming at identifying meaningful structural changes in the omega scores that are related to real-world events. We have already related variation in the language that news programs use to report crime and economic news to such events, but this largely relied on anecdotal evidence. In this subsection, we more systematically link structural changes in the omega time-series to documented real-world events.

Again, we consider the cases of economic and current affairs news. Finally, we shift our focus from all the TV channels to the main (and first) Italian public TV channel: RAI1. Our structural break analysis shows that a break in the language adopted by the TG1 news program occurred when Director Augusto Minzolini, notoriously a Berlusconi supporter[25], was made to leave the direction of TG1. We use the changepoint R package to automatically identify the number and the optimal position of the break points in the series.
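The logic of break detection can be illustrated with a deliberately simplified search for at most one change in the mean of a series. The Python sketch below is not the algorithm implemented in the changepoint R package; it tries every admissible split point, scores each split by the within-segment sum of squares plus a penalty, and reports a break only if that beats the cost of no split. The penalty value, the minimum segment length and the example series are all assumptions.

```python
def sse(xs):
    """Sum of squared deviations from the mean of xs."""
    if not xs:
        return 0.0
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def single_break(series, penalty, min_len=2):
    """Find at most one change in mean.

    Returns the index at which the second segment starts, or None when no
    split lowers the cost by more than `penalty`.
    """
    best_idx, best_cost = None, sse(series)
    for i in range(min_len, len(series) - min_len + 1):
        cost = sse(series[:i]) + sse(series[i:]) + penalty
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx

# Made-up weekly scores with a level shift after the fourth observation.
omega = [0.1, 0.0, -0.1, 0.0, 2.0, 2.1, 1.9, 2.0]
break_at = single_break(omega, penalty=1.0)
```

The penalty guards against reporting a break in a series that merely fluctuates: a flat series returns `None` for any positive penalty.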

3.4.1 Structural breaks in the economic news

Once we run the structural change detection algorithm on the series of omega scores identified for the subset of economic news, we identify one break, occurring in the week starting on 4 July 2011. Figure 8 graphically represents this break.

Indeed, the financial crisis hit Italy, generating a sudden increase in the yield ratio between the German 10y BUND and the Italian 10y BTP on 9 July 2011[26]. This corroborates our previous understanding of the economic news latent scores as ranging between ‘physiological’ economic times, when news reports center on industrial policy, and the economic crisis, when financial issues become prevalent in the media. Figure 9 further reinforces this finding by showing the front-page headlines of the main Italian newspaper, the Corriere della Sera, before and after the identified structural break.

3.4.2 Structural breaks in the crime and current affairs news

The same exercise is now repeated for the subset of omega scores produced for all the TV news programs with respect to crime news and current affairs. Running the break point analysis, we identify two breakpoints, in the weeks starting respectively on 7 February 2011 and on 4 April of the same year. Figure 10 presents these two structural breaks (all the TV news programs are considered for the period in which the Berlusconi government is incumbent).

Figure 8: Structural breaks in the omega series (all TVs, economic news, 2010-2011)

Figure 9: Cover page of Corriere della Sera across the structural break in economic news’ scores

(a) CdS before the structural break (08-07-2011) (b) CdS after the structural break (11-07-2011)

[25] The nomination of Augusto Minzolini was not supported by the center-left members of the RAI board, who left the board room at the moment of the vote and later called a press conference describing the appointment as “unacceptable” [supporting information]. The Italian Communication authority in April 2010 also warned Augusto Minzolini for excessive imbalance and lack of pluralism [supporting information]. Finally, the AUDITEL data documented a decrease in the audience share of TG1. See also footnote 24.

[26] ‘US Hedge Funds bet against Italian bonds’, Financial Times, 10 July 2011 [link here].

Figure 10: Structural breaks in the omega series (all TVs, crime news, 2010-2011)

Indeed, the two breaks delimit the peak of the Egyptian Revolution of 2011, as Hosni Mubarak resigned on 11 February 2011. Figure 11 provides further evidence for this claim, presenting the headlines of Corriere della Sera on the corresponding dates.

Figure 11: Cover page of Corriere della Sera across the structural break in current affairs news’ scores

(a) CdS before the structural break (10-02-2011) (b) CdS after the structural break (11-02-2011)


3.4.3 The case of TG1: structural breaks and director’s turnover

Having become more confident about the ability of WORDFISH to identify real changes in the underlying political and economic conditions, we run a final test to check whether we are also able to identify a “structural change” in the direction of the news programs. In particular, we have recalled the role played by Augusto Minzolini as a government supporter at the time he was directing the main Italian news program, TG1.

In this subsection we thus run the structural break detection algorithm on the time-series of omega scores computed for TG1 in the main model, which considered all the news covered. As shown in Figure 12, we identify one structural break in the series, in the week starting on 17 January 2011.

Figure 12: Structural breaks in the omega series (TG1 only, 2010-2011)

Once we superimpose on the plot the dates on which the three TG1 directors considered in the timespan of the analysis took charge, the association between the structural break in the overall omega scores of the news program and the resignation of Augusto Minzolini becomes evident. We show this in Figure 13. The timing of the events lends support to the ability of the algorithm to detect changes in the political leaning of a news outlet. As already pointed out in Section 3.2, in October 2010 TG1 received a warning from the Communication Authority for a strong imbalance in the airtime presence of politicians, favouring the right-wing government. In December 2010, Minzolini is dismissed from the direction of TG1 and shortly after a new director (Alberto Maccari) is appointed. The red vertical line in Figure 13 signals the official start of the Maccari direction, but the moment Minzolini was dismissed is actually shortly antecedent to the structural change we identify.


Figure 13: Structural breaks in the omega series (TG1 only, 2010-2011)

4 Conclusions

This paper provides a number of new results. The content of Italian television networks is consistently differentiated across networks. Historically more left-leaning networks tend to focus on serious news content, like the state of the economy and political affairs. On the other hand, networks historically associated with the conservative parties give more emphasis to chronicles, gossip and crime news. We therefore find robust evidence of strategic issue selection throughout our dataset, but scant evidence of differential issue framing.

Our results are consistent with previous work on newspaper ideology. Our study differs in that, given our setting, we are able to determine media bias as a consequence of media capture. These results raise the welfare concerns that are typically associated with media bias. While a degree of differentiation across networks in news content is desirable, as it may be a mechanism to ensure pluralism, in this case it may be problematic, as deviations in the ideological score are not driven by viewers’ demand but by political capture. As a result, we might anticipate a number of negative political outcomes, like growing ideological segmentation and increasing polarization in the audience.




A Appendix - List of top 25 and bottom 5 MPs included in the scaling of parliamentary speeches by number of interventions

No.  MP name                     Political group                     Number of interventions
1    Roberto Giachetti           Partito Democratico                 340
2    Antonio Borghesi            Italia Dei Valori                   277
3    Fabio Evangelisti           Italia Dei Valori                   261
4    Erminio Angelo Quartiani    Partito Democratico                 225
5    Mario Tassone               Unione Di Centro Per Il Terzo Polo  215
6    Simone Baldelli             Popolo Della Liberta’               207
7    Angelo Compagnon            Unione Di Centro Per Il Terzo Polo  170
8    Federico Palomba            Italia Dei Valori                   152
9    Francesco Barbato           Italia Dei Valori                   150
10   Renato Cambursano           Misto                               148
11   Arturo Iannaccone           Misto                               141
12   Furio Colombo               Partito Democratico                 132
13   Pierluigi Mantini           Unione Di Centro Per Il Terzo Polo  128
14   Marco Giovanni Reguzzoni    Lega Nord Padania                   126
15   Massimo Polledri            Lega Nord Padania                   123
16   Pier Ferdinando Casini      Unione Di Centro Per Il Terzo Polo  122
17   Sergio Michele Piffari      Misto                               121
18   Massimo Donadi              Misto                               120
19   Carlo Monai                 Italia Dei Valori                   119
20   Ivano Strizzolo             Partito Democratico                 119
21   Donatella Ferranti          Partito Democratico                 113
22   David Favia                 Misto                               111
23   Amedeo Ciccanti             Unione Di Centro Per Il Terzo Polo  110
24   Rita Bernardini             Partito Democratico                 108
25   Teresio Delfino             Unione Di Centro Per Il Terzo Polo  108
. . .
514  Michela Vittoria Brambilla  Popolo Della Liberta’               1
515  Piero Testoni               Popolo Della Liberta’               1
516  Pietro Lunardi              Popolo Della Liberta’               1
517  Sandro Oliveri              Misto                               1
518  Vincenzo Barba              Popolo Della Liberta’               1


B Appendix - Eiffel plot of the distribution of tokens for the Italian news programs (2010-2014)

The following figure represents the words’ discrimination parameters computed by the Wordfish estimation performed on the entire corpus of the Italian TV news programs. The horizontal axis reports the β parameters, which estimate the weight of each word in discriminating between the news programs’ positions, while the vertical axis reports the ψ scores, i.e. the words’ fixed effects, which capture the frequency of the words (with less frequent words typically having greater discrimination weight).

Figure 14: Word discrimination parameters for the corpus of Italian news programs


C Appendix - Eiffel plot of the distribution of tokens for the subset of economic news (2010-2011)

The following figure represents the words’ discrimination parameters computed by the Wordfish estimation performed on the subset of economic news reported while the Berlusconi government was in office. Its interpretation is the same as for the Eiffel plot in Appendix B.

Figure 15: Word discrimination parameters for economic news on the Italian news programs


D Appendix - List of top 25 words associated with economic news’ poles of the latent space

Table 3: List of top 25 words associated respectively with positive and negative beta scores

Tokens               b       psi
bors_stat            -1.50   -4.83
tagl_bilanc          -1.50   -4.65
capovolg             -1.50   -4.73
ital_aggiung         -1.50   -4.83
rating_unit          -1.50   -5.19
mil_tutt             -1.50   -4.49
tempest_perfett      -1.50   -5.34
arriv_guadagn        -1.50   -4.19
cattedr              -1.51   -5.35
chiud_perd           -1.51   -4.94
arriv_miliard        -1.51   -5.06
aggiorn_colleg       -1.51   -4.43
de_bortol            -1.51   -5.06
voragin              -1.51   -4.00
tedesc_super         -1.51   -5.06
pers_valor           -1.51   -5.06
tremont_paregg       -1.51   -5.35
percentual_iva       -1.51   -4.84
moment_econom        -1.51   -4.84
rest_ben             -1.51   -4.94
europ_scont          -1.51   -5.35
fin_prim             -1.51   -4.58
part_union           -1.51   -4.10
temporal             -1.51   -4.58
andat_bors           -1.51   -5.07

Tokens               b       psi
quattord_gennai      5.45    -8.52
mirafior_referendum  5.39    -9.05
referendum_accord    5.19    -8.72
regol_rappresent     4.92    -9.14
fatt_accord          4.81    -8.72
accord_mirafior      4.80    -7.61
referendum_futur     4.79    -8.47
fiom_cobas           4.75    -8.64
accord_separ         4.74    -7.88
cgil_fiom            4.71    -8.19
conten_accord        4.60    -8.74
nuov_fiat            4.53    -7.45
esclusion_fiom       4.51    -8.63
gennai_referendum    4.39    -8.33
accord_import        4.28    -8.05
fia                  4.20    -7.14
fim_cisl             4.08    -7.94
debutt_bors          4.05    -6.96
alluvion_venet       4.02    -8.02
fiat_mirafior        3.94    -5.76
ivec                 3.83    -7.39
alto_dicembr         3.78    -7.22
ventott_gennai       3.77    -6.87
videogam             3.73    -6.76
fiom_accord          3.70    -7.64
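
Lists of this kind can be produced by ranking tokens on their estimated beta and keeping the two ends of the ranking. The sketch below is one possible way to do so. The function name, the input layout (a dict mapping each token to a `(beta, psi)` pair) and the descending sort are assumptions for illustration; the source does not describe how the tables were actually generated. The sample values are a handful of rows copied from Table 3.

```python
from typing import Dict, List, Tuple

WordParams = Dict[str, Tuple[float, float]]  # token -> (beta, psi); assumed layout


def poles_by_beta(params: WordParams, k: int = 25) -> Tuple[List[str], List[str]]:
    """Return the k tokens with the largest beta and the k with the smallest.

    Tokens are sorted by beta in descending order, so the positive pole is
    the head of the ranking and the negative pole is its tail.
    """
    ranked = sorted(params, key=lambda token: params[token][0], reverse=True)
    positive_pole = ranked[:k]
    # Slice from an explicit index so that k = 0 yields an empty list.
    negative_pole = ranked[max(len(ranked) - k, 0):]
    return positive_pole, negative_pole


# A few rows taken from Table 3, used only to exercise the function.
sample: WordParams = {
    "quattord_gennai": (5.45, -8.52),
    "mirafior_referendum": (5.39, -9.05),
    "fiat_mirafior": (3.94, -5.76),
    "bors_stat": (-1.50, -4.83),
    "andat_bors": (-1.51, -5.07),
}

positive, negative = poles_by_beta(sample, k=2)
print("positive pole:", positive)
print("negative pole:", negative)
```

With `k=25` applied to the full set of estimated word parameters, the same call would return two lists of the size shown in the table, provided the vocabulary holds at least 50 tokens so that the two poles do not overlap.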


E Appendix - Eiffel plot of the distribution of tokens for the subset of crime news (2010-2011)

The following figure represents the words’ discrimination parameters computed by the Wordfish estimation performed on the subset of crime news reported while the Berlusconi government was in office. Its interpretation is the same as for the Eiffel plot in Appendix B.

Figure 16: Word discrimination parameters for crime news on the Italian news programs


F Appendix - List of top 25 words associated with the crime news’ poles of the latent space

Table 4: List of top 25 words associated respectively with positive and negative beta scores

Tokens               b       psi
trascin_div          -2.00   -5.67
famos_german         -2.00   -5.96
fium_giorn           -2.01   -5.97
lasc_entrar          -2.01   -5.68
anni_esatt           -2.01   -5.56
purific              -2.01   -5.46
mai_screz            -2.01   -5.82
iniz_vit             -2.01   -6.16
immens_dolor         -2.01   -5.37
propr_cont           -2.02   -5.98
vib_valenz           -2.02   -5.07
cos_vuot             -2.03   -5.83
dov_spegn            -2.03   -5.48
disag_viaggiator     -2.03   -5.48
pegn_rivendic        -2.03   -5.70
rivendic_mulin       -2.03   -5.70
doman_sempr          -2.03   -5.70
salv_gett            -2.03   -5.84
motovedett_riusc     -2.03   -5.48
temperatur_picc      -2.03   -5.48
gest_protest         -2.03   -5.30
parol_parol          -2.03   -5.02
scors_infatt         -2.03   -5.71
impiant_riscald      -2.04   -5.59
propr_stess          -2.04   -5.85

Tokens               b       psi
grad_accogl          6.58    -14.69
profug_dov           6.22    -13.33
emergent_accogl      6.19    -13.56
accogl_cinquant      5.88    -12.46
situazion_igien      5.86    -12.88
dunqu_orma           5.80    -12.83
trasfer_emigr        5.43    -11.88
marin_sammarc        5.42    -11.94
destin_migrant       5.34    -12.02
arriv_region         5.33    -11.45
marcator             5.24    -11.43
equipagg_asso        5.06    -10.84
termin_consigl       5.05    -11.50
govern_sicil         5.02    -9.55
tripol_equipagg      4.96    -10.90
attracc_nav          4.95    -10.67
maron_region         4.94    -10.57
vers_piattaform      4.92    -10.17
mezz_sbarc           4.90    -11.01
immigr_dic           4.87    -10.95
quarant_immigr       4.82    -10.83
isol_sempr           4.77    -10.46
don_dio              4.74    -10.29
maron_frattin        4.70    -9.47
central_sci          4.69    -10.00
