

Master Thesis in Statistics and Data Mining

Dynamic Call Drop Analysis

Martin Arvidsson

Division of Statistics
Department of Computer and Information Science

Linköping University


Supervisor: Patrik Waldmann

Examiner: Mattias Villani


“It is of the highest importance in the art of detection to be able to recognize, out of a number of facts, which are incidental and which vital. Otherwise your energy and attention must be dissipated instead of being concentrated.” (Sherlock Holmes - Arthur Conan Doyle)


Contents

Abstract
Acknowledgments
1. Introduction
   1.1. Background
   1.2. Objective
   1.3. Definitions
2. Data
   2.1. Data sources
   2.2. Raw data
        2.2.1. Data variables
3. Methods
   3.1. Text Mining and Variable creation
   3.2. Sampling strategies
   3.3. Evaluation techniques
   3.4. Online drop analysis through classification
        3.4.1. Dynamic Logistic Regression
        3.4.2. Dynamic Model Averaging
        3.4.3. Dynamic Trees
   3.5. Drop description
        3.5.1. Association Rule Mining
   3.6. Technical aspects
4. Results
   4.1. Exploratory analysis
   4.2. Online classification
        4.2.1. Sampling strategies
        4.2.2. Dynamic Trees
        4.2.3. Dynamic Logistic Regression
        4.2.4. Summary of results
        4.2.5. Static Logistic Regression vs. Dynamic Logistic Regression
   4.3. Online drop analysis
        4.3.1. DMA posterior inclusion probabilities
        4.3.2. Evolution of odds-ratios and reduction in entropy
        4.3.3. Static Logistic Regression vs. Dynamic Logistic Regression
5. Discussion
6. Conclusions
A. Figures
   A.1. Results: Online classification
   A.2. Results: Online drop analysis
        A.2.1. Single dynamic logistic regression vs. Univariate DMA
        A.2.2. Significant covariates in interesting period
        A.2.3. Static vs. Dynamic Logistic Regression: covariate effects
B. Tables
   B.1. Results: Online classification
        B.1.1. Dynamic Trees
        B.1.2. Dynamic Logistic Regression
Bibliography


Abstract

This thesis sets out to analyze the complex and dynamic relationship between mobile phone call connections that terminate unexpectedly (dropped calls) and those that terminate naturally (normal calls). The main objective is to identify temporally discriminative features, so as to assist domain experts in their quest of troubleshooting mobile networks. For this purpose, dynamic extensions of logistic regression and partition trees are considered.

The data consists of information recorded in real-time from mobile phone call connections, and each call is labeled by its category of termination. Characterizing features of the data that pose considerable challenges are that it is: (i) class-imbalanced, (ii) high-dimensional, (iii) non-stationary, and (iv) sequentially arriving in a stream.

To address the issue of class imbalance, two sampling techniques are considered. Specifically, an online adaptation of the random undersampling technique is implemented, as well as an extension (proposed in this thesis) that accounts for the possibility of a changing degree of imbalance. The results suggest that the former is preferable for this data, but that both improve the degree of identification of the minority class (dropped calls).

Another characterizing feature of this dataset is that several of the covariates are temporally sparse. This is shown to cause problems in the recursive estimation step of the dynamic logistic regression model. Consequently, this thesis presents an extension that accounts for temporal sparsity, and it is shown that this extension allows for the inclusion of temporally sparse attributes, as well as improving the predictive capability.

A thorough evaluation of the considered models is performed, and it is found that the best model is the single dynamic logistic regression, achieving an area under the curve (AUC) of 99.96%. Based on odds ratios, posterior inclusion probabilities, and posterior model probabilities from the dynamic logistic regression, and reduction in entropy from the dynamic trees, an analysis of temporally discriminative features is performed. Specifically, two sub-periods of abnormally high call drop rate are analyzed in closer detail, and several interesting findings are made, demonstrating the potential of the proposed approach.


Acknowledgments

Several people deserve and have my deepest appreciation for their aid and support in making this thesis possible.

First, I would like to thank Ericsson for giving me the opportunity to work with them, as well as for providing the data for this thesis. Special thanks to my co-supervisors Paolo Elena and Henrik Schüller for, on the one hand, defining a really interesting problem, and on the other, providing good support. Thanks also to Leif Jonsson, who oversaw the thesis projects and provided valuable input. Another person that cannot be left out is domain expert Håkan Bäcks, who provided very useful insights about the data and the functionality of the network.

I would also like to thank my supervisor at Linköping University, Patrik Waldmann, who provided good advice and participated in many fruitful discussions.

Finally, I would also like to thank my opponent, Andreea Bocancea, for her improvement suggestions. These undoubtedly strengthened the subsequent versions of the thesis.


1. Introduction

1.1. Background

Besides selling hardware and software, network equipment providers (NEPs) also provide support to mobile network operators (MNOs). One imperative support-related task is that of troubleshooting, which consists of detecting problems in the network and understanding their causes. This task poses considerable challenges, not just because of the complexity of the systems, but also because of the enormous quantities of information that are collected from the networks every day.

In this thesis, troubleshooting will be considered from a statistics and data analysis point of view. More specifically, this thesis sets out to analyze the complex and dynamic relationship between dropped calls and normal calls, where a dropped call may be defined as a call that ends without the intention of either participant. While a certain number of dropped calls are expected, inevitable, and not interesting, there are also dropped calls that are unexpected and may be caused by system malfunctions. Hence, from the perspective of the NEPs, it is of great interest to quickly identify and understand the causes of dropped calls, such that eventual problems can be correctly addressed. In periods of abnormally high call drop rates (the percentage of calls that are dropped), the identification of drop causes is especially important. System degradation can have a wide range of different causes and explanations, such that the problem becomes quite complex. Two examples of high-level causes are: (i) system updates in the network, and (ii) new phones or software updates in already existing phones.
In this thesis, statistical and machine learning methods are applied to identify low-level indicators of dropped calls, which can later be interpreted by domain experts to put eventual problems into context.

The issue of detecting problems in mobile networks has been considered with a range of approaches in the literature, in particular within the subdisciplines of anomaly detection, fault detection, and fault diagnosis. A substantial amount of research has been done in these areas, and there are quite a few papers that consider these problems within the context of mobile networks, for example Brauckhoff et al. (2012); Watanabe et al. (2008); Cheung et al. (2005); Rao (2006). The bulk of these papers are concerned with identifying problems at the level of defined geographical regions, and the data is such that it describes the characteristics of particular regions (cells/radio base stations or radio network controllers), and not individual calls, as is the case in this thesis. A common approach is to work within the unsupervised


framework, where the detection of a fault or anomaly is often the result of a setup whereby one tracks and/or models a selected number of features, so as to gain an idea of the normal behavior; then, when large deviations from this normal behavior are observed (for instance through threshold violations, as in Cheung et al. (2005) and Rao (2006)), an anomaly or fault has been identified. Various techniques have been explored to extract and describe anomalies and faults: one approach is to apply association rule mining, as in Brauckhoff et al. (2012).

There are relatively few papers that, within the context of mobile networks, consider the problem of fault detection or fault diagnosis in a supervised setting. In one of the exceptions, Khanafer et al. (2006), a Naive Bayes classifier is considered for predicting a set of labeled faults. Zhou et al. (2013) and Theera-Ampornpunt et al. (2013) also work within the supervised framework, with similar data (equal response and similar input) to that of this thesis, but with a slightly different objective: to perform early classification, such that proactive management can be implemented to deter certain types of calls from dropping. The classification methods considered in these two papers are AdaBoost and Support Vector Machines.

A limitation of the aforementioned approaches is that they, to a varying degree, implicitly assume a stationary and static environment, and mobile networks are in general not static systems: as previously mentioned, internal and external modifications and updates occur irregularly. This motivates a dynamic approach, rather than a static one. An additional limitation of the aforementioned approaches is that they also, to some extent, assume that the data can be stored. In the context of processing data from mobile networks, however, this assumption is problematic, since the volume of the data that is processed every day is astronomical: in 2014, Ericsson, the company at which this work was carried out, had 6 billion mobile subscribers, with a global monthly traffic of ∼2400 Petabytes (Ericsson, 2014). While it may not be feasible to thoroughly analyze the whole data, it does appear intuitively appealing to be able to analyze more data for the same cost, and thus, approaches with such characteristics ought to be preferable. A research discipline that has gained a lot of attention recently, and which deals with limitations of the sort described above, is online learning. In this thesis, a framework centered around online learning is proposed for the problem of predicting dropped calls and explaining their causes.

In addition to being non-stationary, the data is also greatly imbalanced with respect to the response variable. To address the challenges that come with imbalanced data, sampling techniques are explored. In particular, an adaptive undersampling scheme is developed, where less data is sampled during periods of few dropped calls, and more data is sampled during periods of an increased number of dropped calls.

Another challenge is that several attributes in the data are temporally sparse. This presents a limitation for one of the selected methods. Subsequently, in this thesis, an extension of the forgetting factor framework originally proposed by McCormick et al. (2012) is developed and evaluated.


1.2. Objective

The aim of this master thesis is to develop a framework that can identify temporally discriminative features for explaining dropped calls. A key challenge is that the underlying distribution of the data is non-stationary, and changes are expected to occur irregularly and unpredictably over time. Subsequently, this thesis sets out to tackle this problem by using an online learning approach, wherein dynamic extensions of logistic regression and partition trees are explored. Another (not completely orthogonal) aim of this thesis is to predict dropped calls with high precision. This latter objective is motivated by the fact that only information recorded up to a certain time before call termination is used, and as such, may be thought of as a first step in exploring the possibilities of early classification for this type of data. Finally, to evaluate the decision of using the dynamic approach, a set of scenarios are simulated in which the best dynamic classifier is compared to its static equivalent, both in terms of predictability and exploratory insights.

1.3. Definitions

The following definitions are needed to fully understand the context of the problem.

Troubleshooting

Troubleshooting is an approach to problem solving. Specifically, it is the systematic search for the source of a problem, such that it can be solved.

User Equipment (UE)

User equipment (UE) consists of phones, computers, tablets, and other devices that connect to the network.

Network Equipment Provider (NEP)

Companies that sell products and services to communication service providers, such as mobile network operators, are referred to as network equipment providers (NEPs).

Mobile Network Operator (MNO)

Companies that provide services of wireless communications, and that either own or control the necessary elements to sell and deliver services to end users, are referred to as mobile network operators (MNOs). Examples of such companies are Telia, Tele2, and Telenor.


Normal calls

Normal calls refer to connections between user equipment (UE) and the network where the connection terminates as expected.

Dropped calls

Dropped calls refer to connections between user equipment (UE) and the network,with the outcome of unexpected termination.

UMTS Network

A Universal Mobile Telecommunications System (UMTS), also referred to as 3G, is a third-generation mobile cellular system for telecommunication networks. The system supports standard voice calls, mobile internet access, as well as simultaneous use of both voice call and internet access. Although 4G has been introduced, 3G remains the most widely used standard for mobile networks.

Radio Network Controller (RNC)

The Radio Network Controller (RNC) is the governing element in the UMTS network and is responsible for controlling the radio base stations that are connected to it.

Radio Base Station (RBS)

Radio base stations (RBS) constitute the elements of a network that provide the connection between UE and the RNC.


2. Data

2.1. Data sources

The data were supplied by Ericsson AB and consist of machine-produced trace logs. These so-called trace logs were originally collected from a lab environment at the Ericsson offices in Kista. As such, the information contained in the data does not reflect the behavior of any real people, but rather programmed systems. However, these systems are programmed such that they should reflect human behavior: a simulated call may for instance consist of texting, browsing the internet, physical movements, and others. Moreover, even though it is a lab environment, the implemented system technology is equivalent to that which is used in most live networks: the so-called Universal Mobile Telecommunications System (UMTS), also known as 3G.

Introduced in 2001, 3G is the third generation of mobile systems for telecommunication networks, and supports standard voice calls, mobile internet access, as well as simultaneous use of both voice call and internet access. Although 4G has been introduced, 3G still remains the most widely used system for mobile networks. The 3G network is structured hierarchically and by geographical region. More specifically, the network consists of three primary, interacting, elements: the user equipment (UE), radio base stations (RBS), and radio network controllers (RNC). At the bottom of the hierarchy are the cells, which define the smallest geographical regions in the network. RBSs are deployed such that they may be responsible for multiple cells, and as described in sec. 1.3, an RBS acts similarly to a router: it provides the connection between the UE and the RNC. The RNC is the ruling element of the 3G network and is responsible for managing the division of resources at lower levels; for example, which RBS a particular UE should use.

2.2. Raw data

For every call that is initiated, a trace log is produced. The contents of these logs are recorded in real-time and contain information that corresponds to signals sent between the user equipment (UE), radio base stations (RBS), and the radio network controller (RNC). These signals may contain connection details, configuration information, measurement reports, failure indications, and others. This information, originally formatted as text, was first transformed into a suitable format (as


described in sec. 3.1), and later used as the input to the statistical models evaluated in this thesis. Finally, for each call, there is a recorded outcome, {normal, dropped}, which defines the response variable. More details about specific variables follow in sec. 2.2.1.

The period for which the data were collected is January 26, 2015 - April 10, 2015, corresponding to approximately two and a half months' worth of data. During this period, a total of 7,200 dropped calls were recorded. The total number of normal calls in the same period was much greater: 670,000. That is, approximately 99% of the calls terminated as expected (normal), and only 1% terminated unexpectedly (dropped). Datasets with this characteristic are often referred to as imbalanced in the machine learning and statistics literature. For classifiers that seek to separate two or more classes, imbalance can be problematic. In sec. 3.2, techniques for addressing the challenges accompanying imbalanced datasets are described.

In Figure 2.1, a time-series plot is presented, displaying the number of dropped calls over the period of interest. Note that the time-scale of the plot is not in minutes, hours, or days: instead, the data were divided into 100 equally large subsets, and then the sum was calculated within each subset. The rationale for presenting the data like this, rather than in relation to actual time, is twofold: (i) this lab data does not have any periodic dependencies, and (ii) an unequal number of calls were traced during different periods, and during some days, no calls were recorded.
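The subset-and-sum presentation behind Figure 2.1 can be sketched as follows. The drop-indicator series here is synthetic, not the actual Ericsson data:

```python
# Sketch of the binning behind Figure 2.1: split an ordered 0/1 drop
# indicator series into 100 equally large subsets and sum within each.
def bin_sums(indicators, n_bins=100):
    size = len(indicators) // n_bins          # equally large subsets
    return [sum(indicators[i * size:(i + 1) * size]) for i in range(n_bins)]

drops = [1, 0, 0, 0] * 2500                   # 10,000 synthetic calls
print(bin_sums(drops)[:3])                    # [25, 25, 25]
```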

[Figure: time-series plot; x-axis "Time" (0-100), y-axis "Number of dropped calls" (0-600).]

Figure 2.1.: Number of dropped calls as divided into 100 (ordered) equally large subsets.

As one may observe in Figure 2.1, the number of drops is approximately constant for most of the period, with no apparent trend. There are, however, multiple time periods in which the number of drops increases quite drastically. Intuitively, these periods represent some form of degradation in the system with systematic errors. One of the goals of this thesis is to identify which factors were important during such periods. In an online implementation, such a framework could potentially be used to detect causes of problems early on.


2.2.1. Data variables

From the original trace logs, a total of 188 attributes were initially extracted. Exploratory analysis revealed that quite a large proportion of them were redundant, which resulted in a final input-space of 122 attributes. To reduce the degree of distortion from events occurring a long time prior to the termination of the calls, it was (together with domain experts at Ericsson) decided that only the last 20 seconds of each call should be kept for analysis. Note that the main contributing factor for including a particular variable was not the known significance of the variable, but rather its intrinsic and potential relevance in terms of future events (such that observing changes in its degree of relevance is useful for troubleshooting the network). In this section, a brief summary and explanation of each category of variables is presented.

2.2.1.1. Cell Ids

As described in the previous section, cells are defined geographical areas of the network, and hence, in a model context, these variables contain information about the location of the call events. From the considered period, 17 cell ids were recorded, resulting in 17 binary dummy variables. In a real setting with live networks, the number of cells would increase. In such a situation, clustering methods could potentially be used to merge cells that are (i) close to each other geographically, and (ii) similar by some relevant metric, this to reduce the number of variables to include and evaluate in the model.

2.2.1.2. tGCP

GCP, short for Generic Connection Properties, describes the range of possible connection properties that a call may possess. tGCP, or target-GCP, are the connection properties that are targeted or requested by a particular device at a particular time-point. A maximum of 31 connection properties can be possessed by a particular call. The presence or absence of a particular connection property is registered as 1 and 0, respectively. In this work, the last set of registered connection properties for each call is used as input to the model, this to capture the connection properties that were requested at the time of the drop. The 31 connection properties are treated as binary dummy-variables.

2.2.1.3. Trace 4 Procedures

Trace, in the context of this dataset, refers to the process of monitoring the execution of RNC functions that are relevant to a particular call. Traces are grouped such that similar events (execution of RNC functions) are traced by the same trace group.


For the considered STP, over the relevant time-period, three trace groups were observed: trace1, trace3, and trace4, the latter being (by far) the most frequent one. Trace4 describes events such as Importation and Deportation of processes and program procedures. More specifically, trace4 can be divided into 37 different events, referred to as procedures. For example, procedure 10 describes the Importation or Deportation of a "soft handover" event. In this thesis, these procedures are treated as binary dummy-variables.

2.2.1.4. UeRcid

UeRcid, short for UE Radio Connection Id, defines, as the name suggests, the type of radio connection that a particular UE has activated. For the considered data, approximately 150 different such id's exist. In this work, we group these by their inherent properties. Specifically, we differentiate between PS (Packet Switched), CS (Circuit Switched), SRB (Signaling Radio Bearer), and Mixed (a combination of the aforementioned). PS is the connection type concerned with data traffic, whilst CS is that concerned with conversation/speech. SRB is the result of the initial connection establishment, as well as the release of the connection. The presence or absence of a particular radio connection is registered as 1 and 0, respectively.

2.2.1.5. evID

EvID, short for Event Id, is found in the measurement reports of the trace logs, and constitutes reports related to radio quality, signal strength, and others. As such, a specific evID defines a specific type of such report or event. Consider, for instance, "evID=e2d", which defines "Quality of the currently used frequency is below a certain threshold". In this work, these evID's are treated as binary dummy-variables.


3. Methods

In this chapter, the framework and subsequent methods used in this thesis are explained. The framework is divided into four parts. The first step, text mining and variable creation, is the step in which the data is transformed from machine-generated text to structured matrices apt for statistical methods. The second step, sampling, addresses the challenges of imbalanced data. The third step of the framework is the main part and consists of dynamic classification of streaming data. The fourth and final part of the framework seeks to derive intuitive descriptions of the results obtained from step 3, through the application of association rules.

3.1. Text Mining and Variable creation

As previously mentioned, the original format of the data was text, such that direct input to statistical methods was not possible. To address this, techniques commonly associated with the area of text mining were applied. More specifically, text variables were created and defined as binary dummy-variables. For instance, if "configuration request" appears in a particular call, then the value of that variable is "1". Initially, the count of specific words was also considered, but it was found that it did not add any discriminative value, and it was consequently dismissed.

Some numerical measurements were also found in the logs; these do however (i) not occur in all of the logs and (ii) are not missing at random: some measurements are only triggered under certain circumstances. To cope with this type of missing data, discretization techniques were applied such that categorical variables could be derived from the original numerical ones (including a category 'missing'). Specifically, the CAIM discretization algorithm, proposed by Kurgan and Cios (2004), was used. For a continuous-valued attribute, the CAIM algorithm seeks to divide its range of values into a minimal number of discrete intervals, whilst at the same time minimizing the loss of class-attribute interdependency. It is out of the scope of this thesis to cover the details of this algorithm, and hence we refer to Kurgan and Cios (2004) for more details.
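The dummy-variable construction described above can be sketched as follows. The phrase list and the log line are hypothetical illustrations, not the actual trace-log vocabulary:

```python
# Sketch of the text-to-feature step: a binary dummy variable per phrase,
# set to 1 if the phrase occurs anywhere in the call's trace log.
def extract_features(log_text, phrases):
    """Map a raw trace log to binary dummy variables."""
    return {p: int(p in log_text) for p in phrases}

# Hypothetical phrases and log line, for illustration only.
phrases = ["configuration request", "measurement report", "failure indication"]
log = "10:32:01 RNC received configuration request from UE ..."
print(extract_features(log, phrases))
# {'configuration request': 1, 'measurement report': 0, 'failure indication': 0}
```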

3.2. Sampling strategies

Sampling, in the most general sense of the word, is concerned with selecting a subset of observations from a particular population. In the context of classification,


sampling techniques are popular for dealing with the issue of class imbalance. An imbalanced dataset is defined as one in which the distribution of the response variable is skewed towards one of the classes (He and Garcia, 2009). The motivation for considering sampling techniques in this thesis is three-fold: (i) due to limitations in memory and computational power, and the overwhelming size of the unformatted source (txt) files, only a limited number of logs could feasibly be extracted from these source files; (ii) for imbalanced datasets, classifiers tend to learn the response classes unequally well, where the minority class often is ignored, such that the separation capability becomes poor (Wang et al., 2013); and (iii) sampling techniques have been shown to be effective for addressing class imbalance in other works (He and Garcia, 2009).

Sampling is a well-researched subject, and a wide range of techniques have been proposed over the years. The great bulk of these techniques are, however, limited to environments where the data is assumed to be fixed and static. One example is the random undersampling technique, which has a simple and intuitive appeal: observations from the majority class are selected at random and removed until the ratio between the response classes has reached a satisfactory level. Japkowicz et al. (2000) evaluated this simpler technique, compared it to more sophisticated ones, and concluded that random undersampling held up well. The issue of online class imbalance learning has so far attracted relatively little attention (Wang et al., 2013). Most of the proposed methods for addressing non-static environments assume that the data arrives in batches (Nguyen et al., 2011), and are thus not directly applicable to online learning. One of the first papers to address the issue of imbalanced data in an online learning context was Nguyen et al. (2011). In it, a technique, here referred to as ORUS, was proposed that allows the analyst to choose a fixed rate at which undersampling should occur: observations from the minority class are always accepted for inclusion, whilst observations from the majority class are included only with a fixed probability. In other words, it is random undersampling in an online context. This simple implementation is described more formally in equation (3.1), where q is the parameter determining the fixed sampling rate. Nguyen et al. (2011) show that this approach is able to provide good results for an online implementation of the naive Bayes classifier:

$$\text{ORUS:}\quad p(\text{inclusion}_{x_t}) = \begin{cases} 1 & y_t = 1 \\ q & y_t = 0 \end{cases} \tag{3.1}$$
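A minimal sketch of the ORUS inclusion rule in equation (3.1), for a stream of labels y_t and a fixed rate q. The function name and the seeded random generator are illustrative assumptions, not details from Nguyen et al. (2011):

```python
import random

# Shared, seeded RNG so the sketch is reproducible.
_rng = random.Random(0)

def orus_include(y, q, rng=_rng):
    """ORUS (eq. 3.1): always keep a minority observation (y = 1);
    keep a majority observation (y = 0) with fixed probability q."""
    return y == 1 or rng.random() < q

# Streaming 10,000 majority-class labels through q = 0.01 keeps roughly 1%.
kept = sum(orus_include(0, 0.01) for _ in range(10_000))
```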

This technique does, however, not account for the possibility of changing levels of imbalance over time; it assumes a fixed rate, to be known a priori. In Wang et al. (2013), an extension was proposed in which the degree of imbalance is continuously estimated, using a decay factor, such that the inclusion probability is allowed to change over time.

In this thesis, a simple adaptive sampling scheme, sharing traits with both Wang et al. (2013) and Nguyen et al. (2011), is developed. Specifically, a sliding window is


used to estimate the local imbalance at different time points, such that the undersampling rate (inclusion probability) of the majority class is allowed to change over time. If the proportion of dropped calls during a particular period is relatively high, the inclusion probability for normal calls is increased. If, on the other hand, the proportion of dropped calls is relatively low, the inclusion probability for normal calls is decreased. More formally, as in Nguyen et al. (2011), we let the analyst select a constant, q: the baseline expectation of the class imbalance prior to observing any data. In the case of mobile networks, the call-drop rate is well understood, such that this "pseudo prior" can be set with confidence. The idea is then to use this baseline expectation to construct the sliding window: w = 1/q.

This sliding window moves incrementally, one observation at a time, and estimates the local imbalance rate at every time point from the number of minority observations found in that particular time window. This is described mathematically in equation (3.2), where q is the constant describing the baseline expectation:

    O-ARUS_t:  p(inclusion_{x_t}) = { 1,                           if y_t = 1
                                      (Σ_{i=t−w}^{t−1} y_i) / w,   if y_t = 0 and Σ_{i=t−w}^{t−1} y_i > 1
                                      q,                           if y_t = 0 and Σ_{i=t−w}^{t−1} y_i ≤ 1     (3.2)

For instance, let us consider a scenario in which the analyst has set a baseline expectation of 1%: the sliding window would then become w = 1/0.01 = 100. Consider further that we stand at time point t, and that in the past 100 observations, 3 observations from the minority class have been encountered. The inclusion probability for a majority-class observation would then, at time point t, be equal to 3/100 = 3%.
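The scheme in equation (3.2) can be sketched as a closure that maintains the sliding window internally. The factory function `make_oarus` below is an illustrative interface, not code from the thesis:

```python
import random
from collections import deque

def make_oarus(q, seed=0):
    """O-ARUS sketch, equation (3.2): the inclusion probability for
    majority-class observations tracks the minority rate estimated over a
    sliding window of length w = 1/q."""
    w = int(round(1.0 / q))
    window = deque(maxlen=w)            # the last w class labels
    rng = random.Random(seed)

    def include(y_t):
        minority = sum(window)          # minority count in the window
        if y_t == 1:
            keep = True                 # minority observations always kept
        elif minority > 1:
            keep = rng.random() < minority / w   # local minority rate
        else:
            keep = rng.random() < q     # fall back to the baseline q
        window.append(y_t)
        return keep

    return include
```

With q = 0.01 the window length becomes 100, and after 3 minority observations have entered the window, a majority observation is kept with probability 3/100, reproducing the worked example above.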

3.3. Evaluation techniques

The question of how one should evaluate a classifier depends on the data and the objective of the classification. What is of particular interest in this thesis is to identify and discriminate positive occurrences from negative ones, i.e. identifying and separating 'dropped calls' from 'normal calls', largely because the main objective of this thesis is to explore what factors contribute towards the classification of 'dropped calls'. The most commonly used metric for evaluating classifiers is the accuracy measure:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3.3)

where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives. It simply describes the total number of correct predictions as a ratio of the total number of predictions.


In cases where the number of positive and negative instances differ greatly (imbalanced data), the accuracy measure can be misleading. For instance, with an imbalance ratio of 99:1, it would be possible to achieve an accuracy of 99% simply by classifying all observations as negative instances. To avoid such pitfalls, a variety of evaluation metrics have been proposed, one being AUC, which represents the area under the ROC curve (Hanley and McNeil, 1982). The ROC curve displays the relationship between the true positive rate, TPR (sensitivity), and the false positive rate, FPR (1 − specificity). More specifically, the ROC curve is constructed by considering a range of operating points, or decision thresholds, and for each such threshold the true positive rate and false positive rate are calculated. The intersection of these two scores, at each threshold, produces a dot in a two-dimensional display. Between the plotted dots a line is drawn: this constitutes the ROC curve (Obuchowski, 2003).

    Sensitivity = TPR = TP / (TP + FN)    (3.4)

    1 − Specificity = FPR = FP / (FP + TN) = 1 − TN / (TN + FP)    (3.5)

AUC can be interpreted as the probability that a randomly selected observation from the positive class is ranked higher than a randomly selected observation from the negative class, in terms of belonging to the positive class. It should be emphasized that, in the context of online learning, and hence for the methods considered in this thesis, there is no training or test dataset: as the models are constructed and updated sequentially, we instead evaluate the one-step-ahead predictions of the models. The first papers to address the issue of imbalanced online learning, Nguyen et al. (2011) and Wang et al. (2013), proposed the use of the G-mean as an evaluation metric. G-mean is short for geometric mean and is constructed as follows (Powers, 2011):

    G-mean = √(precision × recall)    (3.6)

where

    precision = TP / (TP + FP),    recall = TP / (TP + FN)    (3.7)

AUC and G-mean will constitute the main measurements upon which comparisonsand evaluations are founded in this thesis.
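Equations (3.3) to (3.7) are straightforward to compute from the confusion counts. The minimal sketch below shows the G-mean as defined above; the function names are illustrative:

```python
import math

def confusion_counts(y_true, y_pred):
    """TP, TN, FP, FN counts underlying equations (3.3)-(3.7)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def g_mean(y_true, y_pred):
    """G-mean, equations (3.6)-(3.7): geometric mean of precision and recall."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return math.sqrt(precision * recall)
```

Unlike accuracy, the G-mean stays low whenever either precision or recall is poor, which is exactly the property needed on imbalanced streams.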


3.4. Online drop analysis through classification

Classification, as a statistical framework, defines the process of modeling the relationship between a set of input variables, X, and a discrete outcome variable, y. As the main objective of this thesis is to study the relationship between 'dropped calls' and 'normal calls', it is naturally framed as a classification problem, with the response {Dropped call, Normal call}.

An extensive number of classification techniques have been proposed over the years, and what amounts to the “best one” is often data and task specific. In the case of this thesis, there are four fundamental criteria that a classifier must meet: (i) it must be transparent, in the sense that insight into which variables contribute to a certain outcome is required; (ii) it must be able to cope with high-dimensional input, as there is a great deal of interesting information recorded for each call; (iii) it must be able to handle the sequential nature of the data, i.e. that data arrives continuously in a stream; and finally (iv) it should be adaptive and able to capture local behaviors, since, as explained before, the causes of drops are expected to change over time. These criteria drastically reduce the space of apt classifiers: popular techniques such as Support Vector Machines and Artificial Neural Networks are good alternatives for dealing with complex high-dimensional input (and may be extended to deal with streaming data), but they fail on the important issue of transparency with regards to variable importance.

The sequential and adaptive aspects described above are naturally addressed in the field of online learning, which assumes that data arrives continuously and may not be stationary. Hence, an ideal intersection would be an online-learning classifier that is transparent and can handle higher dimensions. Two such techniques were identified: dynamic logistic regression and dynamic trees. Their static counterparts, logistic regression and partition trees, are known for their transparency with regards to variable contribution, and hence the dynamic extensions are appealing for this work.

3.4.1. Dynamic Logistic Regression

This technique, originally proposed by Penny and Roberts (1999), extends the standard logistic regression by considering an additional dimension: time. Through a Bayesian sequential framework, the parameter estimates are recursively estimated, and hence allowed to change over time. The particular version of the dynamic logistic regression that is applied in this work follows McCormick et al. (2012), and it is described below. But first, let us consider what in this thesis is referred to as the static logistic regression.


3.4.1.1. Static logistic regression

The static logistic regression, or just logistic regression, is a technique for predicting discrete outcomes. It was originally developed by Cox (1958), and still remains one of the most popular classification techniques. Logistic regression has several attractive characteristics, in particular its relative transparency and the way in which one is able to evaluate the contribution of the covariates to the predictions. Logistic regression is a special case of generalized linear models, and may be seen as an extension of the linear regression model. Since the dependent variable is discrete, or more specifically Bernoulli distributed, it is not possible to model the linear relationship between the response and the predictors directly, and hence a transformation is needed.

y ∼ Bernoulli(p)

In the case of the logistic regression, a logit link is used for this transformation. Consider the logistic function in equation (3.8):

    F(x) = 1 / (1 + e^{−(β_0 + β_1 x_1 + ...)})    (3.8)

where the exponent is a function of a linear combination of the independent variables. The logit link is derived through the inverse of the logistic function, as in equation (3.9):

    logit(p) = g(F(x)) = ln[ F(x) / (1 − F(x)) ] = β_0 + β_1 x_1 + ... = x^T θ    (3.9)
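The pair of transformations in equations (3.8) and (3.9) are mutual inverses, which the following minimal sketch verifies numerically (function names are illustrative):

```python
import math

def logistic(eta):
    """The logistic function F of equation (3.8), applied to the linear
    predictor eta = beta_0 + beta_1 * x_1 + ..."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """The logit link of equation (3.9): the inverse of the logistic
    function, mapping a probability back to the linear-predictor scale."""
    return math.log(p / (1.0 - p))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, and applying `logit` after `logistic` recovers the original linear predictor.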

3.4.1.2. State-space representation

Given the objective of exploring the temporal significance of independent variables, a natural extension of the static logistic regression model is to add a time dimension. As in McCormick et al. (2012), we do so by defining the logistic regression within the Bayesian paradigm, and by applying the concept of recursive estimation: this allows sequential modeling of the data, or what in the literature commonly is referred to as online learning. Equation (3.9) is hence updated to:

    logit(p_t) = x_t^T θ_t    (3.10)

Notice the added subscript t. The recursive estimation is computed in two steps: the prediction step and the updating step.


Prediction step:

At a given point in time, t, the posterior mode of the previous time step, t − 1, is used to form the prior for time t. The parameter estimates at time t are hence based on the observed data up to and including time t − 1. Using these estimates, a prediction of the outcome at time t is made. More formally, we let the regression parameters θ_t evolve according to the state equation θ_t = θ_{t−1} + δ_t, where δ_t ∼ N(0, W_t) is a state innovation. That is, the parameter estimates at time t are based on the parameter estimates at time t − 1 plus a delta term. Inference is then performed recursively using Kalman-filter updating. Suppose that, for the set of past outcomes Y^{t−1} = {y_1, ..., y_{t−1}}:

    θ_{t−1} | Y^{t−1} ∼ N(θ̂_{t−1}, Σ̂_{t−1})

The prediction equation is then formed as:

    θ_t | Y^{t−1} ∼ N(θ̂_{t−1}, R_t)    (3.11)

where

    R_t = Σ̂_{t−1} / λ_t    (3.12)

λ_t is a forgetting factor, typically set slightly below 1. The forgetting factor acts as a scaling factor on the covariance matrix from the previous time point, in order to calibrate the influence of past observations. The concept of using forgetting factors for this purpose is quite common in the area of dynamic modeling, and a range of forgetting strategies have been proposed; for a review, see Smith (1992). In this work, we apply the adaptive forgetting scheme proposed by McCormick et al. (2012), which allows the amount of change in the model parameters to vary over time, an attractive feature considering the complex dynamics of mobile network systems. More about the specifics of the forgetting factor later in this section.

Updating step:

The prediction equation in (3.11) is, together with the observation arriving at time t, used to construct the updated estimates. More specifically, having observed y_t, the posterior distribution of the updated estimate θ_t is:

    p(θ_t | Y^t) ∝ p(y_t | θ_t) p(θ_t | Y^{t−1})    (3.13)


where p(y_t | θ_t) is the likelihood at time t, and the second term is the prediction equation (which now acts as a prior). Since the Gaussian distribution is not the conjugate prior of the likelihood function in logistic regression, the posterior is non-standard, and there is no closed-form solution to equation (3.13). Consequently, McCormick et al. (2012) approximate the right-hand side of equation (3.13) with a normal distribution, as is common practice. More formally, θ̂_{t−1} is used as a starting value, and the mean of the approximating normal distribution at time point t is then:

    θ̂_t = θ̂_{t−1} − [D² l(θ̂_{t−1})]^{−1} D l(θ̂_{t−1})    (3.14)

where D² l(θ̂_{t−1}) and D l(θ̂_{t−1}) are the second and first derivatives of l(θ) = log p(y_t | θ) p(θ | Y^{t−1}) respectively, i.e. of the logarithm of the likelihood times the prior. The variance of the approximating normal distribution, which is used to update the state variance, is estimated using:

    Σ̂_t = {−D² l(θ̂_{t−1})}^{−1}    (3.15)
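One predict/update cycle of equations (3.11) to (3.15) can be sketched with a single Newton step around the previous posterior mode, since the prior gradient vanishes at its own mean. This is a simplified sketch of the McCormick et al. (2012) recursion, not their exact implementation:

```python
import numpy as np

def dlr_step(theta_prev, Sigma_prev, x_t, y_t, lam=0.99):
    """One predict/update cycle of dynamic logistic regression, sketching
    equations (3.11)-(3.15) with a single Newton step."""
    # Prediction step: forgetting inflates the prior covariance, eq. (3.12)
    R_t = Sigma_prev / lam
    # Gradient and Hessian of l(theta) = log p(y_t|theta) + log p(theta|Y^{t-1}),
    # evaluated at the prior mean theta_prev (where the prior gradient is zero)
    p = 1.0 / (1.0 + np.exp(-x_t @ theta_prev))
    grad = (y_t - p) * x_t                                  # D l
    hess = -(p * (1 - p)) * np.outer(x_t, x_t) - np.linalg.inv(R_t)  # D^2 l
    # Updating step: eqs. (3.14) and (3.15)
    theta_t = theta_prev - np.linalg.solve(hess, grad)
    Sigma_t = np.linalg.inv(-hess)
    return theta_t, Sigma_t

# Starting from a flat state, observe y = 1 with positive covariates
theta, Sigma = np.zeros(2), np.eye(2)
theta, Sigma = dlr_step(theta, Sigma, np.array([1.0, 1.0]), 1)
```

After observing a positive outcome, both coefficients move upward and the one-step-ahead probability for the same covariate vector rises above 0.5, as the update step intends.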

In McCormick et al. (2012), a static (frequentist) logistic regression is used in a training period to obtain some reasonable starting points for the coefficient estimates. Now, since the data used in this thesis is sparse with regards to several of the input variables, this approach cannot straightforwardly be implemented. This is because, for some of the covariates, none or very few occurrences are recorded during the first part of the data. Consequently, we here apply a pseudo-Bayesian framework, introducing two pseudo priors (mean and variance), θ̂_0 and σ²_0, for every coefficient. If no occurrences are observed during the training period, these priors are simply not updated.

The forgetting factor, λ

In Raftery et al. (2010), the predecessor to McCormick et al. (2012), a forgetting scheme where λ is a fixed constant was introduced; more specifically, they set λ = 0.99. It is noted that this constant ought to be determined based on the belief about the stability of the system. If the process is believed to be more volatile and non-stationary, a smaller λ is preferable, since the posterior update at each time point then weighs the likelihood higher relative to the prior, and hence the parameter estimates are more locally fitted and updated more rapidly. More formally, this forgetting specification implies that an observation encountered j time points in the past is weighted by λ^j (Koop and Korobilis, 2012). For instance, with λ = 0.99, an observation encountered 100 time points in the past receives approximately 37% as much weight as the current observation.

McCormick et al. (2012), in addition to extending Raftery et al. (2010) from dynamic linear regression to dynamic binary classification, also proposed a new adaptive forgetting scheme. The forgetting factor, λ_t (now defined with a subscript t), is


extended such that it is allowed to assume different values at different time points. This has the effect of allowing the rate of change in the parameters to vary over time. The predictive likelihood is used to determine the λ to be used at each time point. More specifically, the λ_t that maximizes the following expression is selected:

    λ̂_t = argmax_{λ_t} ∫_{θ_t} p(y_t | θ_t, Y^{t−1}) p(θ_t | Y^{t−1}) dθ_t    (3.16)

However, since this integral is not available in closed form, McCormick et al. (2012) use a Laplace approximation:

    f(y_t | Y^{t−1}) ≈ (2π)^{d/2} |{−D² l(θ̂_t)}^{−1}|^{1/2} p(y_t | Y^{t−1}, θ̂_t) p(θ̂_t | Y^{t−1})    (3.17)

According to Lewis and Raftery (1997), this approximation should be quite accurate. Instead of evaluating a whole range of different λ_t's to maximize the expression in equation (3.16), McCormick et al. (2012) use a simpler approach that only considers two possible states: some forgetting (λ_t = c < 1) and no forgetting (λ_t = 1). Different parameters are allowed to have different forgetting factors, and hence it would be computationally difficult to evaluate multiple λ's for models consisting of more than just a few variables, because the number of combinations grows exponentially. In their experiments, they conclude that the results were not sensitive to the chosen constant. In this thesis, both single and multiple λ's will be evaluated. In the case of multiple λ's, the model will share a common forgetting factor.

Quite early on, it was empirically found that the forgetting schemes described above encountered problems with temporally sparse covariates, and that the smaller the λ, the bigger the trouble. In an attempt to remedy this issue, we propose a simple, yet intuitively reasonable, modification. The basic idea is that c, the constant selected by the analyst, is, for each observation and each attribute, scaled based on an estimate of the local sparsity, such that, during periods of mostly zeros for a particular covariate, λ is scaled towards 1:

    λ_t^(2) = λ_t^(1) + (1 − λ_t^(1)) / ( 1 + (Σ_{i=t−w}^{t} x_i)³ / w )    (3.18)

where w is a constant to be selected by the user: it is the window over which the local sparsity is estimated. The summation in the denominator reflects the number of non-zero occurrences in the past w observations. The more occurrences that are observed, the larger the number that (1 − λ_t^(1)) is divided by, and consequently the less λ_t^(1) is scaled.


For instance, consider a fictive scenario in which an analyst has selected c = 0.95 and w = 10, and for a particular covariate, at a particular time point, 9 out of the last 10 observations are zero for this attribute, i.e. sparse. Equation (3.18) would have the effect of modifying λ_t^(1) = 0.95 to λ_t^(2) = 0.995. If, at another time point, say 8 out of the 10 occurrences in w are non-zero values, λ_t^(1) = 0.95 is only changed to λ_t^(2) = 0.9501. The effect of this modification is further analyzed in sec. 4.2.
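The scaling can be sketched in a few lines. Note that this is one plausible reading of equation (3.18), chosen so that it reproduces the first worked example above (0.95 scaled to roughly 0.995 when only 1 of the last 10 values is non-zero); treat it as a sketch rather than the thesis's exact formula:

```python
def scaled_lambda(lam, window):
    """Sparsity-adjusted forgetting factor, a plausible reading of
    eq. (3.18): the more non-zero values in the recent window, the less
    (1 - lam) is damped, i.e. the less lam is pushed towards 1."""
    w = len(window)
    s = sum(1 for x in window if x != 0)   # non-zero occurrences in the window
    return lam + (1.0 - lam) / (1.0 + s ** 3 / w)
```

During a sparse stretch the effective forgetting factor sits close to 1 (so near-zero evidence does not inflate the state variance), while during dense stretches it stays close to the analyst's chosen constant c.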

Evolution of the odds-ratios

In McCormick et al. (2012) and Koop and Korobilis (2012), two approaches were considered for studying the temporal significance of covariates and how the conditional relationships change over time, one being through the evolution of odds-ratios for specific covariates.

Just as in the static logistic regression, odds-ratios are obtained by exponentiating the logit coefficients. An odds-ratio may be interpreted as the effect of a one-unit change in X on the predicted odds, with all other independent variables held constant (Breaugh, 2003). An odds-ratio > 1.0 implies that a particular covariate potentially has a positive effect, while an odds-ratio < 1.0 implies a potential negative effect. The farther the odds-ratio is from 1.0, the stronger the association. In Haddock et al. (1998), guidelines for interpreting the magnitude of an odds-ratio are provided, in particular a rule of thumb which states that odds-ratios close to 1.0 represent a 'weak relationship', whereas odds-ratios over 3.0 indicate 'strong (positive) relationships'. In McCormick et al. (2012), ±2 standard errors are computed, and if the confidence interval does not overlap 1.0, a covariate is concluded to have a significant effect.
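The ±2 standard-error rule translates directly into code: exponentiate the interval endpoints on the logit scale and check whether 1.0 lies inside. The function name and return structure are illustrative:

```python
import math

def odds_ratio_summary(theta, se):
    """Odds ratio for a logit coefficient with a +/- 2 standard-error
    interval, as in McCormick et al. (2012): the effect is deemed
    significant when the interval does not overlap 1.0."""
    lo, hi = math.exp(theta - 2 * se), math.exp(theta + 2 * se)
    return {"odds_ratio": math.exp(theta),
            "interval": (lo, hi),
            "significant": not (lo <= 1.0 <= hi)}
```

Applied at each time point to the current coefficient estimate and its posterior standard error, this yields the evolving significance trace discussed above.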

In this thesis, both of the aforementioned approaches are considered in the process of assessing the temporal significance of covariates.

3.4.2. Dynamic Model Averaging

Dynamic model averaging (DMA), originally proposed by Raftery et al. (2010), is an extension of Bayesian Model Averaging (BMA) that introduces the extra dimension of time through state-space modeling. In this thesis, DMA is used together with the dynamic logistic regression, as in McCormick et al. (2012). This combination is attractive, considering the objectives of this work, in that the dynamic logistic regression allows the marginal effects of the predictors to change over time, whilst the dynamic model averaging allows the set of predictors itself to change over time.

BMA, first introduced by Hoeting et al. (1999), addresses the issue of model uncertainty by considering multiple models (M_1, ..., M_K) simultaneously, and computes the posterior distribution of a quantity of interest, say θ, by averaging the posterior distribution of θ over every considered model, weighting their respective contributions


by their posterior model probability (Hoeting et al., 1999), as in equation (3.19):

    p(θ | X) = Σ_{k=1}^{K} p(θ | M_k, X) p(M_k | X)    (3.19)

The posterior model probability for model M_k can be written as follows:

    p(M_k | X) = p(X | M_k) p(M_k) / Σ_{l=1}^{K} p(X | M_l) p(M_l)    (3.20)

where p(X | M_k) = ∫ p(X | θ_k, M_k) p(θ_k | M_k) dθ_k is the integrated likelihood of model M_k, and θ_k is the vector of parameters of model M_k.
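Given the integrated likelihoods, equations (3.19) and (3.20) amount to a weighted average, as the minimal sketch below shows (function names are illustrative, and the integrated likelihoods are assumed to be already computed):

```python
def bma_posterior_probs(integrated_likelihoods, priors):
    """Posterior model probabilities, equation (3.20): each model's
    integrated likelihood times its prior, normalised over all models."""
    unnorm = [l * p for l, p in zip(integrated_likelihoods, priors)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def bma_average(per_model_quantity, model_probs):
    """Model-averaged quantity, equation (3.19): each model's estimate
    weighted by its posterior model probability."""
    return sum(q * w for q, w in zip(per_model_quantity, model_probs))
```

With uniform priors, a model whose integrated likelihood is twice that of its competitors receives twice their posterior weight in the average.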

3.4.2.1. State-space representation

By introducing a state-space representation of the BMA, leading to DMA, the posterior model probabilities become dynamic, and are hence allowed to change over time. Just as in regular BMA, one considers K candidate models {M_1, ..., M_K}. Considering the specific combination of DMA and dynamic logistic regression, we re-define equation (3.10) as follows:

    logit(p_t^(k)) = x_t^(k)T θ_t^(k)    (3.21)

Notice the superscript (k) that is present for both x_t^(k)T and θ_t^(k), implying that candidate models may have different setups of covariates, and their parameter estimates may also differ. Estimation with DMA, following McCormick et al. (2012), is computed using the same framework as in the (single-model) dynamic logistic regression, i.e. the two steps of prediction and updating. Different from the single-model case, however, is the definition of the state space, which here consists of the pair (L_t, Θ_t), where L_t is a model indicator, such that if L_t = k, the process is governed by model M_k at time t, and Θ_t = {θ_t^(1), ..., θ_t^(K)}. Recursive estimation is performed on the pair (L_t, Θ_t):

    Σ_{l=1}^{K} p(θ_t^(l) | L_t = l, Y^{t−1}) p(L_t = l | Y^{t−1})    (3.22)

Equation (3.22) may be compared to (3.19), which is the corresponding equation for BMA. An important aspect of (3.22) is that θ_t^(l) is only present conditionally, when L_t = l.


Before we consider the prediction and updating steps, it is worth noting that, as in McCormick et al. (2012), a uniform prior is specified over the candidate models: p(L_t = l) = 1/K.

Prediction step

We here consider the second term of equation (3.22), which is the prediction equation for the model indicator L_t: in other words, the probability that the considered model is the governing model at time t, given data up to and including t − 1. The prediction equation is defined as follows:

    P(L_t = k | Y^{t−1}) = Σ_{l=1}^{K} p(L_{t−1} = l | Y^{t−1}) p(L_t = k | L_{t−1} = l)    (3.23)

The term p(L_t = k | L_{t−1} = l) implies that a K × K transition matrix needs to be specified. To avoid this, Raftery et al. (2010) redefine equation (3.23) and introduce another forgetting factor, α_t:

    P(L_t = k | Y^{t−1}) = P(L_{t−1} = k | Y^{t−1})^{α_t} / Σ_{l=1}^{K} P(L_{t−1} = l | Y^{t−1})^{α_t}    (3.24)

where α_t has the effect of flattening the distribution of L_t, and hence increasing the uncertainty. Just as with λ_t, α_t is adjusted over time using the predictive likelihood (but here across candidate models).

Updating step

The (model-) updating step is defined through equation (3.25):

    P(L_t = k | Y^t) = ω_t^(k) / Σ_{l=1}^{K} ω_t^(l)    (3.25)

where

    ω_t^(l) = P(L_t = l | Y^{t−1}) f^(l)(y_t | Y^{t−1})    (3.26)

Notice that the first term on the right-hand side of equation (3.26) is the prediction equation and the second term is the predictive likelihood for model l. An important feature here is that this latter term (the predictive likelihood) has already been calculated (recall that it was used to determine the model-specific forgetting factor λ_t).
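The whole model-probability recursion of equations (3.24) to (3.26) fits in one small function, sketched below (the interface is illustrative, and the per-model predictive likelihoods are assumed to be supplied):

```python
def dma_model_probs(post_prev, pred_lik, alpha=0.99):
    """DMA model-probability recursion: forgetting flattens yesterday's
    posteriors (eq. 3.24), which are then reweighted by each model's
    predictive likelihood and renormalised (eqs. 3.25-3.26)."""
    pred = [p ** alpha for p in post_prev]
    z = sum(pred)
    pred = [p / z for p in pred]                    # P(L_t = k | Y^{t-1})
    w = [p * f for p, f in zip(pred, pred_lik)]     # omega, eq. (3.26)
    z = sum(w)
    return [wi / z for wi in w]                     # eq. (3.25)
```

With α close to 1 the probabilities evolve slowly; a smaller α flattens the distribution, letting model weights adapt more quickly when the governing model changes.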


Just as λ_t is allowed to take different values at different time points, so is the forgetting factor for the model indicator, α_t. To determine which α_t to use at time t, McCormick et al. (2012) suggest maximizing:

    α̂_t = argmax_{α_t} Σ_{k=1}^{K} f^(k)(y_t | Y^{t−1}) P(L_t = k | Y^{t−1})    (3.27)

That is, maximizing the predictive likelihood across the candidate models. The first term in equation (3.27) is the model-specific predictive likelihood (which has already been computed), and the second term is (3.24). As such, this adds minimal additional computation. In practice, McCormick et al. (2012) take the approach of evaluating two α values at each time point: {some forgetting, no forgetting}. Finally, upon predicting y_t at time t, equation (3.28) is applied:

    ŷ_t^DMA = Σ_{l=1}^{K} P(L_t = l | Y^{t−1}) ŷ_t^(l)    (3.28)

where ŷ_t^(l) is the predicted response for model l at time t. That is, to form the DMA prediction, each candidate model's individual prediction is weighted by its posterior model probability.

Evolution of the inclusion probabilities

The second approach considered by McCormick et al. (2012) and Koop and Korobilis (2012) for the purpose of studying the temporal significance of covariates is that which is centered around posterior inclusion probabilities. They are derived by summing the posterior model probabilities of those models that include a particular variable at a particular time. To do so, all 2^p combinations of the input variables first need to be computed so as to construct 2^p candidate models, where p is the number of predictors. More formally, the posterior inclusion probability for variable i at time t is (Barbieri and Berger, 2004):

    p_{i,t} ≡ Σ_{l: variable i ∈ M_l} P(M_l | Y^t)    (3.29)

This approach is feasible in both McCormick et al. (2012) and Koop and Korobilis (2012) since the number of covariates there is, in comparison to this thesis, relatively small, and the time series are also relatively short. For this thesis, the ideal would have been to set up candidate models representing all possible combinations of variables, but since the number of covariates is quite large (> 100), and the time series is long, that is not computationally feasible. Consequently, we do not consider all possible combinations of all covariates, but rather use


the “interesting variable groups” (as defined in sec. 4.2), and consider all possible combinations of these. Although limiting, this approach is reasonable since many of the covariates have quite a clear group structure, and the motivation for exploring it is that it may give some (high-level) insights into which variable groups are important at different time points.

The univariate scanner

An additional approach considered in this thesis for exploring the temporal significance of covariates is one in which the candidate models are constructed to be univariate. This approach is explored because (i) it allows for covariate-specific updating of the forgetting factor in a computationally feasible way, and (ii) it avoids potential issues of multicollinearity that the first approach of McCormick et al. (2012) may suffer from. To determine the significance of a particular variable at a particular time, the odds-ratios may be interpreted as described in the last section, or through the posterior model probabilities (> 0.5), as recommended by Barbieri and Berger (2004).

3.4.3. Dynamic Trees

Dynamic trees, first proposed by Taddy et al. (2011), are an extension of the popular non-parametric technique of partition trees. This thesis follows the particular version developed by Anagnostopoulos and Gramacy (2012), which extends the former by introducing a retiring scheme that allows the model complexity of the tree not to increase monotonically over time, but rather to change in accordance with local structures of the data. We first outline some basic concepts of partition trees and relevant notation, and then the dynamic extension is introduced.

3.4.3.1. Static partition trees

The basic idea of (static) partition trees is to hierarchically partition a given input space X into hyper-rectangles (leaves) by applying nested logical rules. The standard approach is to use binary recursive partitioning. A tree, here denoted by T, consists of a set of hierarchically ordered nodes η ∈ T, each of which is associated with a subset of the input covariates x^t = {x_s}^t. These subsets are the result of a series of splitting rules. Considering the tree structure in a bit more detail, one may differentiate between different types of nodes: (i) at the top of every tree, one finds the root node, R_T, which includes all of x^t; (ii) using binary splitting rules, a node η may be split into two new nodes that are placed lower in the hierarchy; these are referred to as η's child nodes, or more specifically η's left and right children, C_l(η) and C_r(η) respectively, and are disjoint subsets of η such that C_l(η) ∪ C_r(η) = η; (iii) the parent node, P(η), on the other hand, is placed above η in the hierarchy, and contains both η and its


sibling node S(η), such that P(η) = η ∪ S(η). A node that has children is defined as an internal node, whilst nodes that do not are referred to as leaf nodes. The sets of internal nodes and leaf nodes in T are denoted by I_T and L_T respectively. At every leaf node, a decision rule is deployed, parametrized by θ_η. Independence across tree partitions leads to the likelihood p(y^t | x^t, T, θ) = Π_{η ∈ L_T} p(y^η | x^η, θ_η), where [x^η, y^η] is the subset of data allocated to η. This way of considering the leaf nodes is often referred to as Bayesian treed models in the literature. Whilst flexible, this approach poses challenges in terms of selecting a suitable tree structure. To address this problem, Chipman et al. (1998) designed a prior distribution, π(T) (often referred to as the CGM tree prior), over the range of possible partition structures, which allows for a Bayesian approach with inference via the posterior: p(T | [x, y]^t) ∝ p(y^t | T, x^t) π(T), where [x, y]^t is the complete data set. The CGM prior specifies a tree probability by placing a prior on each partition rule:

    π(T) ∝ Π_{η ∈ I_T} p_split(T, η) · Π_{η ∈ L_T} [1 − p_split(T, η)]    (3.30)

where p_split(T, η) = α(1 + D_η)^{−β} is the depth-dependent split probability (α, β > 0 and D_η = the depth of η in the tree). Equation (3.30) states that the tree prior is the probability that the internal nodes have split and the leaves have not. In Chipman et al. (1998), a Metropolis-Hastings MCMC approach is developed for sampling from the posterior distribution of partition trees. Specifically, stochastic modifications referred to as “moves” (grow, prune, change, and swap) of T are proposed incrementally, and accepted according to the Metropolis-Hastings ratio. It is upon this framework that Taddy et al. (2011) base their dynamic extension.
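The CGM prior of equation (3.30) is easy to evaluate for a given tree shape. In the sketch below, the values of alpha and beta are illustrative defaults, not taken from the thesis:

```python
import math

def p_split(depth, alpha=0.5, beta=2.0):
    """Depth-dependent split probability of the CGM tree prior:
    p_split = alpha * (1 + depth)^(-beta). Deeper nodes are less likely
    to split, which keeps trees shallow a priori."""
    return alpha * (1.0 + depth) ** (-beta)

def log_tree_prior(internal_depths, leaf_depths, alpha=0.5, beta=2.0):
    """Log of equation (3.30): internal nodes have split, leaves have not."""
    return (sum(math.log(p_split(d, alpha, beta)) for d in internal_depths)
            + sum(math.log(1 - p_split(d, alpha, beta)) for d in leaf_depths))
```

For example, a stump (one split at the root with two leaves at depth 1) gets log prior log p_split(0) + 2 log(1 − p_split(1)), and because p_split decays with depth, ever-deeper trees are penalised increasingly hard.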

3.4.3.2. Dynamic Trees

The extension from static partition trees (or, more specifically, Bayesian static treed models) to dynamic trees is the result of defining the tree as a latent state which is allowed to evolve according to a state transition probability, P(T_t | T_{t−1}, x_t), referred to as the evolution equation, where T_{t−1} represents the set of recursive partitioning rules observed up to time t − 1. A key insight here is that the transition probability depends on x_t, which implies that only such moves (grow, prune, etc.) that are local to the current observation (i.e. the leaf η(x_t)) are considered. This makes the approach computationally feasible. Following Anagnostopoulos and Gramacy (2012), we let:

    P(T_t | T_{t−1}, x_t) = { 0,            if T_t is not reachable from T_{t−1} via moves local to x_t
                              p_m π(T_t),   otherwise                                                       (3.31)

where p_m is the probability of a particular move, and π(T_t) is the tree prior. The moves considered in this sequential approach are {grow, prune, stay}. Taddy et al. (2011) argue that the exclusion of the change and swap moves allows considerably more efficient processing. The three considered moves are equally probable, and are defined as follows:

• Stay: The tree remains the same: T_t = T_{t−1}.

• Prune: The tree is pruned such that η(x_t) and all nodes below it in the hierarchy are removed, including η(x_t)'s sibling node S(η(x_t)). This implies that η(x_t)'s parent node, P(η(x_t)), becomes a leaf node after the prune.

• Grow: A new partition is created within the hyper-rectangle defined for η(x_t). More specifically, this move first uniformly chooses a split dimension (covariate dimension) j and a split point x_j^grow. The observations of η(x_t) are then divided according to the defined split rule.

3.4.3.2.1 Prediction and the Leaf Classification Model

For posterior inference with dynamic trees, two quantities are imperative: (i) the marginal likelihood for a given tree, and (ii) the posterior predictive distribution for new data. The marginal likelihood is obtained by marginalizing over the regression model parameters, which in this case are the leaves η ∈ L_T, each parametrized by θ_η ∼ π(θ):

    p(y^t | T_t, x^t) = Π_{η ∈ L_{T_t}} p(y^η | x^η) = Π_{η ∈ L_{T_t}} ∫ p(y^η | x^η, θ_η) dπ(θ_η)    (3.32)

That is, conditioning on a given tree, the marginal likelihood is simply the product of independent leaf likelihoods. Combining (3.32) with the prior described earlier, we obtain the posterior p(T_t | [x, y]^t, T_{t−1}). Consider next the predictive distribution for y_{t+1}, given x_{t+1}, T_t, and data [x, y]^t:

    p(y_{t+1} | x_{t+1}, T_t, [x, y]^t)
      = p(y_{t+1} | x_{t+1}, T_t, [x, y]^{η(x_{t+1})})
      = ∫ p(y_{t+1} | x_{t+1}, θ) dP(θ | [x, y]^{η(x_{t+1})})    (3.33)

Notice in the second step of the derivation that [x, y]^t is re-written as [x, y]^{η(x_{t+1})}; this is because we only consider the leaf partition which contains x_{t+1}. The second term in (3.33), dP(θ | [x, y]^{η(x_{t+1})}), is the posterior distribution over the leaf parameters (classification rules), given the data in η(x_{t+1}). As such, the predictive distribution is simply the classification function at the leaf containing x_{t+1}, integrated over the conditional posterior for the leaf's model parameters. The model defined at each of the leaves may be linear, constant or multinomial. Since the response variable in this work is binary, the approach of binomial leaves is applied. As such, each leaf response y_s^η is equal to one of 2 alternative factors. The set of outcomes for a particular leaf is summarized by a count vector, z^η = [z_1^η, z_2^η]′, such that the total count for each class is z_c^η = Σ_{s=1}^{|η|} 1(y_s^η = c). Following Taddy et al. (2011), we then model the summary counts for each leaf as follows:

z_η ∼ Bin(p_η, |η|)    (3.34)

where Bin(p, n) is a binomial distribution in which the expected count for category c is n·p_c. A Dirichlet Dir(1_C/C) prior is assumed for each leaf probability vector, and as such, the posterior information about p_η is given by:

p_η = (z_η + 1/C) / (|η| + 1) = (z_η + 1/2) / (|η| + 1).

The marginal likelihood for leaf node η is then defined by equation (3.35):

p(y^η | x^η) = p(z_η) = ∏_{c=1}^{C} Γ(z_{ηc} + 1/C) / (z_{ηc}! · Γ(1/C)) = ∏_{c=1}^{2} Γ(z_{ηc} + 1/2) / (z_{ηc}! · Γ(1/2))    (3.35)

Finally, the predictive response probabilities for leaf node η containing covariates x are:

p(y = c | x, η, [x, y]^η) = p(y = c | z_η) = p_{ηc},  for c = 1, 2    (3.36)
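The leaf computations in (3.34)-(3.36) reduce to simple count arithmetic. Below is a minimal Python sketch of the posterior class probabilities and the leaf marginal likelihood for the binary case - an illustration of the formulas only, not the internals of the dynaTree package used later:

```python
from math import gamma, factorial

def leaf_posterior_probs(z):
    """Posterior class probabilities p_eta = (z_c + 1/C) / (|eta| + 1), as in (3.36)."""
    C = len(z)
    n = sum(z)
    return [(zc + 1.0 / C) / (n + 1.0) for zc in z]

def leaf_marginal_likelihood(z):
    """Leaf marginal p(z_eta) = prod_c Gamma(z_c + 1/C) / (z_c! * Gamma(1/C)), as in (3.35)."""
    C = len(z)
    out = 1.0
    for zc in z:
        out *= gamma(zc + 1.0 / C) / (factorial(zc) * gamma(1.0 / C))
    return out

# A hypothetical leaf holding 3 dropped and 1 normal call (C = 2):
p = leaf_posterior_probs([3, 1])       # [0.7, 0.3]
ml = leaf_marginal_likelihood([3, 1])  # 0.15625
```

Note how the 1/C pseudo-count keeps the posterior probabilities away from 0 and 1 even for pure leaves.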

3.4.3.2.2 Particle Learning for Posterior Simulation


As in the static version of Chipman et al. (1998), a sampling scheme is applied to approximate the posterior distribution of the tree. More specifically, Taddy et al. (2011) use a Sequential Monte Carlo (SMC) approach: at time t−1, the posterior distribution over the trees is characterized by N equally weighted particles, each of which includes a tree T^{(i)}_{t−1} as well as sufficient statistics S^{(i)}_{t−1} for each of its leaf classification models. This tree posterior, {T^{(i)}_{t−1}}_{i=1}^{N}, is updated to {T^{(i)}_{t}}_{i=1}^{N} through a two-step procedure of (i) resampling and (ii) propagating. In the first step, particles are resampled, with replacement, according to their predictive probability for the next (x, y) pair: w_i = p(y_t | T^{(i)}_{t−1}, x_t). In the second step, each tree particle is updated by first proposing local changes, T^{(i)}_{t−1} → T^{(i)}_{t}, via the moves {stay, prune, grow}, resulting in three candidate trees: {T^{stay}, T^{prune}, T^{grow}}. As the candidate trees are equivalent above the parent node of x_t, P(η(x_t)), one only needs to calculate the posterior probabilities for the subtrees rooted at this particular node. Denoting these subtrees by T^{move}_{t}, the new T_t is sampled with probabilities proportional to π(T^{move}_{t}) p(y^t | x^t, T^{move}_{t}), where the first term, the prior, is equal to (3.31) and the second term, the likelihood, is (3.32) with leaf marginal (3.35). As noted in Taddy et al. (2011) and Anagnostopoulos and Gramacy (2012), this sequential filtering approach gives the model a natural division of labor that mimics the behavior of an ensemble method - without explicitly maintaining one.

3.4.3.2.3 Data retirement

What has been considered so far for this method is the original approach developed by Taddy et al. (2011). Whilst being sequential, this approach is not strictly online, because the tree moves may require access to the full data history. Furthermore, the complexity of the original dynamic trees model grows with log t, and in terms of classification in non-stationary environments this is not ideal, as we suspect that the data generating mechanism may change over time. In Anagnostopoulos and Gramacy (2012), an extension is proposed where data is sequentially discarded and down-weighted. Specifically, an approach referred to as data point retirement is developed, where only a constant number, w, of observations are active in the trees (referred to as the 'active data pool'). Whilst data points are sequentially discarded, they are still 'remembered' in this approach.
This is achieved by retaining the discarded information in the form of informative leaf priors. More specifically, suppose we have a single leaf η ∈ T_t for which we have already discarded some data, (x_s, y_s)_{s}, that was in η at some time t′′ ≤ t in the past. Anagnostopoulos and Gramacy (2012) suggest that this information can be "remembered" by taking the leaf-specific prior, π(θ_η), to be the posterior of θ_η given only the retired data. If we generalize this to trees of more than one leaf, we may take:

π(θ) =_df P(θ | (x_s, y_s)_{s}) ∝ L(θ; (x_s, y_s)_{s}) π_0(θ)    (3.37)

where π_0(θ) is a baseline non-informative prior common to all of the leaves. Following Anagnostopoulos and Gramacy (2012), we update the retired information through the recursive updating equation:

π^{(new)}(θ) =_df P(θ | (x_s, y_s)_{s}, r) ∝ L(θ; x_r, y_r) P(θ | (x_s, y_s)_{s})    (3.38)

where (x_r, y_r) is the new data point that is retired. Anagnostopoulos and Gramacy (2012) show that equation (3.38) is tractable whenever conjugate priors are employed. In our case, with the binomial model, the discarded response values y_s are represented as indicator vectors z_s, where z_{sj} = 1(y_s = j). The natural conjugate is the Dirichlet D(a), where a is a hyperparameter vector that may be interpreted as counts. It is updated through a^{(new)} = a + z_r, where z_{rj} = 1(y_r = j). Anagnostopoulos and Gramacy (2012) show that this form of retirement preserves the posterior distribution, and as such, the posterior predictive distributions and marginal likelihoods required for the SMC updates are also unchanged.

A dynamic tree with retirement manages two types of information: (i) a non-parametric memory of an active data pool of (constant) size w < t, as well as (ii) a parametric memory of possibly informative priors. The algorithm proposed by Anagnostopoulos and Gramacy (2012) may be summarized by the following steps:

1. At time t, add the t-th data point to the active data pool.
2. Update the model through the Sequential Monte Carlo scheme described in 3.4.3.2.2.
3. If t exceeds w, select some data point (x_r, y_r) and remove it from the active data pool. But before doing so, update the associated leaf prior for η(x_r)^{(i)} for each particle i = 1, ..., N, so as to 'remember' the information present in (x_r, y_r).
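To make the bookkeeping concrete, the following toy sketch maintains a fixed-size active pool for a single leaf and retires the oldest point into a Dirichlet count prior via a^{(new)} = a + z_r. It is an illustration only (the SMC tree update of step 2 is omitted, and names and values are hypothetical), not the dynaTree implementation:

```python
from collections import deque

C = 2                 # number of classes (binary response)
W = 3                 # active data pool size w (toy value)
pool = deque()        # active data pool of (x, y) pairs
a = [1.0 / C] * C     # Dirichlet hyperparameter vector ("retired" counts)

def process_point(x, y):
    # Step 1: add the t-th data point to the active data pool.
    pool.append((x, y))
    # Step 2 (the SMC update of the tree particles) is omitted in this toy.
    # Step 3: if the pool exceeds w, retire the oldest point into the prior.
    if len(pool) > W:
        xr, yr = pool.popleft()
        a[yr] += 1.0  # conjugate update a_new = a + z_r, with z_rj = 1(y_r = j)

for x, y in [(0.1, 0), (0.4, 1), (0.7, 1), (0.2, 0), (0.9, 1)]:
    process_point(x, y)
# pool now holds the 3 newest points; the two retired responses (0 and 1)
# live on in the counts a == [1.5, 1.5]
```

The key property is that nothing is lost: the pool forgets the raw (x, y) pairs, but the prior counts retain their information for the leaf classification model.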

More details are found in Anagnostopoulos and Gramacy (2012).

3.4.3.2.4 Temporal adaptivity using forgetting factors

To address the possibility of a changing data generating mechanism in a streaming context, Anagnostopoulos and Gramacy (2012) further introduced a modification of the retiring scheme described in the previous section. Specifically, the retired data history, s, is exponentially down-weighted when a new point y_m arrives:

π_λ^{(new)}(θ) ∝ L(θ | y_m) L^λ(θ; (y_s, x_s)_{s}) π_0(θ)    (3.39)

where λ is a forgetting factor. At the two extremes, when λ = 1 the standard conjugate Bayesian updating is applied, as in the previous section, and when λ = 0 the retired history is disregarded completely. A λ in between these two extremes has the effect of placing more weight on recently retired data points. More specifically, in the context of the binomial model, the conjugate update is modified from a^{(new)} = a + z_r to a^{(new)} = λa + z_r.

In the algorithm described in 3.4.3.2.3, one of the steps noted "select some data point (x_r, y_r) and remove it". We may here specify that, in the context of this thesis, and following Anagnostopoulos and Gramacy (2012), this data point is the oldest data point in the active data pool.

3.4.3.2.5 Variable Importance

To measure the importance of predictors for dynamic trees, where the response variable is discrete, Gramacy et al. (2013) proposed the use of predictive entropy based on the posterior predictive probability (p) of each class c in node η. This leads to the entropy reduction:

Δ(η) = n_η H_η − n_l H_l − n_r H_r    (3.40)

where H_η = −∑_c p_c log p_c and n_η is the number of data points in η. The second and third terms on the right-hand side of equation (3.40) describe the entropy for node η's left and right children, respectively. In Gramacy et al. (2013), however, variable importance is not considered in an online setting: each covariate's predictive entropy is calculated based on results from the full dataset. In this thesis, we are interested in the temporal variable importance, and as such, we instead consider the mean entropy reduction for a particular covariate at each time point, obtained by averaging over the N particles. This allows us to display the variable importance as a time-series; a simple and intuitive way to study its relative importance over time.
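The entropy-reduction statistic in (3.40) is straightforward to compute from class proportions. A minimal sketch follows (the logarithm base is not specified in the text, so the natural logarithm is assumed here):

```python
from math import log

def entropy(p):
    """H = -sum_c p_c log p_c; terms with p_c = 0 contribute nothing."""
    return -sum(pc * log(pc) for pc in p if pc > 0)

def entropy_reduction(n, p, n_l, p_l, n_r, p_r):
    """Delta(eta) = n_eta * H_eta - n_l * H_l - n_r * H_r, as in (3.40)."""
    return n * entropy(p) - n_l * entropy(p_l) - n_r * entropy(p_r)

# A split of 4 evenly mixed points into two pure children removes all entropy:
delta = entropy_reduction(4, [0.5, 0.5], 2, [1.0, 0.0], 2, [0.0, 1.0])
# delta == 4 * log(2)
```

Averaging such per-split reductions for a covariate over the N particles at each time point yields the variable-importance time-series described above.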

3.5. Drop description

3.5.1. Association Rule Mining

From the analysis of sec. 3.4, one may gain insights about which variables are relevant at different time-points. As an additional layer to this analysis, we further consider the application of association rule mining, originally proposed by Agrawal et al. (1993), with the objective of obtaining intuitive descriptions that are easy to interpret for domain experts. This approach is convenient since the data has been formatted such that it consists of binary variables. Knowing which variables are interesting at different time-points (inherited from sec. 3.4), and hence which variables to consider when deriving association rules at these time-points, has the positive effect of reducing the search-space that needs to be explored to obtain association rules.

Specifically, the Apriori algorithm is used to generate the association rules. The Apriori algorithm is designed to operate on transaction databases, and hence the first step consists of transforming the original data into a transaction database format. Following the transformation, the data consist of a set of transactions, where each transaction (T) is a set of items (I) and is identified by its unique TID (transaction identifier). An association rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The Apriori algorithm works in a bottom-up manner: it first identifies frequent individual items in the database and then extends them into larger item sets, as long as those item sets appear sufficiently often in the database (Agrawal et al., 1993).

Given a set of transactions, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf), respectively. Support is simply the number of transactions in the dataset that contain the association rule, divided by the total number of transactions, whilst confidence measures how many of the transactions containing a particular itemset (say X) also contain another itemset (say Y). More formally, the support for association rule X → Y is defined by equation (3.41):

Support(X → Y) = count(X ∪ Y) / N    (3.41)

where N is the number of transactions (observations). The confidence for association rule X → Y is obtained through equation (3.42):

Confidence(X → Y) = count(X ∪ Y) / count(X)    (3.42)

In this thesis we are interested in those association rules which have {Drop = 1} on the right-hand side of the rule, and hence we introduce such a constraint into the process - in addition to minsup and minconf.
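The thesis uses the arules implementation of Apriori; purely to illustrate definitions (3.41) and (3.42), here is a plain-Python sketch with a handful of hypothetical binarised call transactions, evaluated for a rule with Drop on the right-hand side:

```python
def support(transactions, X, Y):
    """Support(X -> Y) = count(X u Y) / N, as in (3.41)."""
    items = X | Y
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    """Confidence(X -> Y) = count(X u Y) / count(X), as in (3.42)."""
    items = X | Y
    return sum(items <= t for t in transactions) / sum(X <= t for t in transactions)

# Hypothetical transactions built from the binary variables of call records:
db = [{"PS", "Drop"}, {"PS", "Drop"}, {"PS"}, {"GCP_A"}]
s = support(db, {"PS"}, {"Drop"})      # 2/4 = 0.5
c = confidence(db, {"PS"}, {"Drop"})   # 2/3
```

A rule {PS} → {Drop} would be reported if both values exceed the chosen minsup and minconf thresholds.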

3.6. Technical aspects

For the purpose of data cleaning, pre-processing, and sampling, the Python pro-gramming language was used. The analysis part of this thesis was carried out usingthe R programming language. For the dynamic logistic regression and dynamicmodel averaging, code was first extracted from the dma library - and on the basis ofthis code, various extensions and modifications were implemented. For the dynamictrees model, the dynaTree package was used. Finally, association rule mining reliedon the arules package.


4. Results

This section presents the results of this thesis and is divided into three parts: the first contains a brief exploratory analysis of the data; in the second part, the task of deriving an online classifier with high prediction capability is tackled; in the third and final part, the temporal significance of covariates is explored, and interesting periods are analyzed in more detail - with the objective of identifying potential causes for drops in those particular periods.

4.1. Exploratory analysis

In Figure 4.1 the number of dropped calls over the relevant period is displayed.

[Figure 4.1: time-series plot; x-axis: time (0-100), y-axis: number of dropped calls (0-600)]

Figure 4.1.: Number of dropped calls over the period January 26 - April 11 for STP9596, as divided by 100 equally large time-ordered subsets.

As one may observe, there are at least ~5 time-periods in which the call drop rate increases considerably. Upon exploring the temporal significance of covariates in sec. 4.3, one of these periods will receive special attention. Worth noting is that no effort has been made to account for periodicity; this is because the data originates from programmed systems that do not have any periodicity-dependencies. Even if there had been any, the sequential Bayesian framework would naturally have incorporated that aspect by updating the parameters accordingly.

Initially, 188 covariates were extracted. Having considered the aspects of redundancy and multicollinearity, several covariates could be removed. For instance, many signal types have both a 'request' and a 'response' signal, and hence almost always occur together; in such cases, one of them was removed. The resulting dataset of 122 covariates is one in which the degree of multicollinearity is low, as one may observe from the heatmap plot of the correlation matrix in Figure 4.2:


Figure 4.2.: Heatmap of correlation matrix

To demonstrate the concept of temporal significance, let us consider two of the 122 covariates, starting with PS. As mentioned in sec. 2.2.1, this covariate describes the "type of radio connection that a particular UE has" - or, more specifically, that it has a data connection activated. In Figure 4.3, the percentage of such calls that terminate unexpectedly is displayed, as divided by four equally sized (ordered) time periods. Considered as a univariate classifier, the red bars in this plot represent the true positive rate at four different time periods.

It can be observed that the proportion of calls with PS that drop is not constant over the considered time period. In the first time period, the percentage of normal outcomes outweighs the dropped ones. This changes quite drastically in the second period, where more than 75% of the calls with PS terminate unexpectedly. In period three, the percentage of dropped calls still outweighs the normal ones considerably. In the fourth and final period, the proportions are almost equal.

[Figure 4.3: stacked bar chart of termination outcome (Drop/Normal, in %) over time periods T1-T4]

Figure 4.3.: Percentage of PS that terminates unexpectedly, as divided by 4 equally sized time periods

Next, we consider one of the GCP covariates, more specifically the GCP combination "000011000000011000011011". In Figure 4.4, the proportion of calls that have this specific combination of generic connection properties and terminate unexpectedly is displayed - again, as divided by four equally sized time periods.

[Figure 4.4: stacked bar chart of termination outcome (Drop/Normal, in %) over time periods T1-T4]

Figure 4.4.: Percentage of calls with GCP=000011000000011000011011 that terminates unexpectedly, as divided by 4 equally sized time periods

One may observe that the proportion of calls having this particular GCP that drop changes over time. Specifically, during the first quarter of the time-series, close to 70% of the calls that attain this GCP terminate unexpectedly. For the following three periods, however, this relationship shifts, such that calls attaining this property instead tend to correlate with normal calls.


4.2. Online classification

In this section, the sampling and classification techniques described in chapter 3 are evaluated so as to derive a model that can discriminate between dropped calls and normal calls with high precision. To conclude the section, the best online classifier is compared to its static equivalent in a few fictitious scenarios.

4.2.1. Sampling strategies

As previously mentioned, the number of normal calls far outweighs the number of dropped calls. This part of the results is concerned with studying the effects of this imbalance on the capability of the classifiers. The first question one may reasonably pose is whether sampling is needed at all. If yes, what sampling technique and what sampling rate are suitable? To answer these questions, the online random undersampling (ORUS) technique, as well as the proposed extension, adaptive online random undersampling (A-ORUS), described in sec. 3.2, are evaluated - using the same evaluation metric as in Nguyen et al. (2011) and Wang et al. (2013): the geometric mean.

Datasets with different rates of imbalance were created via these sampling techniques, and then the dynamic logistic regression model and the dynamic trees model (with fixed parameter settings) were applied to these datasets - i.e. holding everything except the sampling size constant. For ORUS, the considered imbalance rates are: (i) 10%/90%, (ii) 30%/70%, and (iii) 50%/50%. For A-ORUS, the considered imbalance rate is 50%/50%. The original imbalance rate of 1.2%/98.8% is also considered. Let us first evaluate the results of the dynamic logistic regression. In Table 4.1, the results for this evaluation are presented.

Sampling Strategy   G-mean
ORIG 1/99           0.487
ORUS 10/90          0.810
ORUS 30/70          0.875
ORUS 50/50          0.913
A-ORUS              0.890

Table 4.1.: Evaluation of Sampling strategies using Dynamic Logistic Regression

It can be seen that the original imbalance rate (~1% dropped calls) has resulted in a G-mean score that is considerably worse than the other four; an indication that sampling may be justified. A general tendency one may observe is that, as the undersampling rate increases for ORUS - and the distribution over the classes becomes more uniform - the G-mean score increases, reflecting the increased capability of the model to predict positive instances (dropped calls) correctly. Considering the proposed adaptive technique, it can be observed that it does not affect the G-mean as positively as the 50/50 sampling rate of ORUS. Let us next consider the corresponding results for the dynamic trees.

Sampling Strategy   TPR     TNR     G-mean
ORUS 1/99           0.370   0.994   0.545
ORUS 10/90          0.610   0.984   0.719
ORUS 30/70          0.718   0.946   0.789
ORUS 50/50          0.779   0.876   0.827
A-ORUS              0.743   0.911   0.798

Table 4.2.: Evaluation of Sampling strategies using Dynamic Trees

The results in Table 4.2 align well with those of Table 4.1, in that the G-mean score steadily increases as the distribution between the classes becomes more even. To confirm the conclusions based on the G-mean, one may further consider the TPR and TNR values, and in particular the general trend that the TPR increases as the undersampling rate increases. This, however, comes at the cost of reductions in the TNR. Even so, the overall performance of the classifier is improved. Since what is of particular interest in this thesis is to discriminate dropped calls from normal calls, i.e. the positive cases, the G-mean and TPR are of particular importance. Considering the proposed adaptive technique, it can again be observed that it does not affect the G-mean or TPR as positively as the 50/50 sampling rate for ORUS. Based on the results presented in Tables 4.1 and 4.2, the decision was made to use the data resulting from 50/50 ORUS (i.e. the one with a 50%/50% distribution between the classes) for the remainder of the analysis.
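For reference, the G-mean used throughout is the geometric mean of the true positive and true negative rates - the metric of Nguyen et al. (2011) and Wang et al. (2013) referred to above. A one-line sketch:

```python
from math import sqrt

def g_mean(tpr, tnr):
    """Geometric mean of the true positive and true negative rates."""
    return sqrt(tpr * tnr)

# The ORUS 50/50 row of Table 4.2 (TPR = 0.779, TNR = 0.876) gives ~0.826,
# consistent with the reported 0.827 up to rounding of the TPR/TNR values.
gm = g_mean(0.779, 0.876)
```

Because it is a product, the G-mean punishes a classifier that sacrifices one class entirely for the other, which is exactly the failure mode of the unsampled 1/99 data.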

4.2.2. Dynamic Trees

The first online classification technique to be considered is the dynamic trees. As previously described, the tree prior (affecting the split probability) is specified by two parameters: α and β. A sensitivity analysis of these was performed, and it was found that the results were only marginally affected by their specification. Based on this analysis, the parameters were set to α = 0.99 and β = 2. These settings align well with what is usually applied in the literature. Tables displaying the sensitivity analysis are found in Table A.1 and A.2 in the Appendix.

In addition to the tree prior, there is also the forgetting factor (λ), the active data pool size (w), and the number of particles (N). The latter was set in accordance with the literature: N = 1000. For the former two, an empirical evaluation is performed to derive the best DT. Let us first consider the forgetting factor λ, holding w constant. In Table 4.3, the result of this evaluation is presented:

Lambda   w      TPR     TNR     AUC     G-mean
1.00     1000   0.747   0.858   0.869   0.799
0.99     1000   0.839   0.841   0.921   0.850
0.95     1000   0.831   0.871   0.926   0.856
0.90     1000   0.850   0.875   0.933   0.864
0.85     1000   0.823   0.877   0.927   0.857
0.80     1000   0.839   0.889   0.934   0.864
0.70     1000   0.802   0.900   0.924   0.849
0.60     1000   0.819   0.882   0.927   0.855
0.50     1000   0.813   0.894   0.927   0.852

Table 4.3.: Evaluation of forgetting factors for the Dynamic Trees

It can be observed that, between λ = 1 and λ = 0.90, the prediction capability - as measured by AUC and G-mean - monotonically improves. At λ = 0.80, the best score is obtained. Lowering λ further does not yield any improvement. One important takeaway is that a λ < 1 is rewarding, which implies that weighting recently observed (retired) observations higher improves the result. This reflects the time-dependence of the system. Let us next consider the active data pool size, w, holding λ constant.

w      TPR     TNR     AUC     G-mean
50     0.727   0.886   0.874   0.799
100    0.777   0.875   0.894   0.823
250    0.808   0.874   0.913   0.843
500    0.797   0.884   0.917   0.839
750    0.815   0.885   0.926   0.851
1000   0.823   0.889   0.928   0.856
1500   0.817   0.894   0.931   0.859
2000   0.837   0.883   0.930   0.866
3000   0.827   0.886   0.931   0.871
4000   0.800   0.903   0.929   0.874

Table 4.4.: Evaluation of active data pool size (w) for the Dynamic Trees

From Table 4.4, one can observe that, between w = 50 and w = 1000, the performance of the classifier monotonically increases. After this point, the AUC and G-mean do still increase (up to w = 3000), but only marginally relative to the increase in w. Considering the notable increase in computational cost (see Table A.3), the marginal gains in performance are not enough to displace w = 1000 as the best alternative.

The observation that performance improves as w increases is not that surprising: as previously described, the size of the active data pool determines the total number of observations stored in the tree (at any given time point), and hence a lower w has the effect of forcing the tree to be smaller, whilst a larger w allows the tree to grow larger. A larger tree has the advantage of being able to capture more complex structures in the data, but at the cost of potentially not being as flexible as a smaller tree.

Based on the results and analysis of Tables 4.3 and 4.4, it is concluded that the best DT is the one with parameter settings λ = 0.80 and w = 1000: it achieved an AUC of 93.4% and a G-mean of 86.4%. To gain insight into how well this model performed over time, we consider, in Figure 4.5, a rolling window displaying the accuracy at different time-points.

[Figure 4.5: rolling-window accuracy (0-1) over time-points 0-14000]

Figure 4.5.: Rolling window measuring the Accuracy for the best Dynamic Treesmodel over the considered period

One can observe that the performance of the classifier degrades to an accuracy of less than 60% at approximately seven points in time. The time-points of the degradations may be compared to the call drop time-series for the undersampled data in Figure 4.7. In doing so, one finds that four of the degradations happen during the second period of abnormally high drop-rate (corresponding to subsets 13-22 in Figure 4.7). The sixth and seventh degradations are related to the third and fifth periods of abnormally high drop-rate, respectively. As described in sec. 3.4.3, the latent state of the dynamic tree consists of the tree-structure; the degradations in this plot reflect the inability of the model to update its structure fast enough.

A general observation one might make is that there is no clear trend, which can imply one of two things: either (i) no structural changes occurred over this time-period, or (ii) the classifier is able to adapt to changing circumstances, and thus avoids any longer period of degradation. Seeing as there clearly are reductions in performance, but the classifier recovers, the second alternative appears more likely. In sec. 4.3, we will consider the period 4200-6400 in more detail, to explore what might have caused these degradations.

4.2.3. Dynamic Logistic Regression

In this subsection, the dynamic logistic regression, as well as the extension withdynamic model averaging are evaluated.


4.2.3.1. Allowing for the inclusion of sparse attributes

As previously mentioned, in addition to being imbalanced, the data is also sparse with regard to several of the input variables, and this proved to pose another challenge. This part of the results is concerned with shedding light on this problem, as well as with evaluating the proposed forgetting-factor modification and comparing it to the original proposed in McCormick et al. (2012).

To explore how many, and which variables the original forgetting framework hastrouble with, an experiment was set up such that 122 univariate dynamic regressionmodels were fitted, one for each covariate. If the model-fitting failed to executecorrectly during the recursive-updating step, a ”1” was recorded for that attribute. Ifno problem occurred, a ”0” was registered. The same experiment was then performedfor the modified forgetting framework. The forgetting factor λ was set to 0.95 forboth versions. For the modified version, the additional parameter w was set to 10.In Table 4.5, the outcome of these runs is presented:

Forgetting scheme   Success   Failed
Original            92        30
Modified            122       0

Table 4.5.: Evaluation forgetting frameworks

One may observe that approximately 1/4 of the covariates could not be used with the original forgetting framework. By applying the modified version, however, all covariates could be included. The full list of variables for which the original forgetting framework failed is found in Table A.4. One characteristic that they all share is that they are temporally sparse. Let us consider one of these covariates, cell_526, in a bit more detail. The updating step fails to converge at time-point 3441, and in Figure 4.6, the log-odds, the values of the covariate, and the values of the response variable are presented between time-points 3350-3450.

From Figure 4.6 it can be observed that during this sub-period, 6 calls were made from cell_526, of which 3 were dropped. At time-point 22 (in this plot), a match occurs (cell_526 = 1 and y = 1), and the model reacts by updating the parameter estimate to ~500+. At time-point 36, the next call from cell_526 is made; however, this time it is not a match (cell_526 = 1, y = 0), and consequently the model updates the parameter estimate to ~−200. Finally, at time-point 91, we observe two consecutive matches, and this is what causes the model updating to crash: the model updates the parameter to such an extent that, when the logit prediction is made, exponentiation of the log-odds produces an infinite value. As previously mentioned, by lowering the λ value we assign larger weights to more recent observations (an observation j time-points back receives weight λ^j), and what we observe here is that, during periods of sparsity, this has the potential effect of causing extreme inflation of parameter estimates that crashes the algorithm. The proposed modified forgetting framework addresses this problem by scaling λ closer to 1 during periods of sparsity, and hence bases its parameter updates on longer spans of observations.
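One way to see why a small λ is dangerous for sparse covariates: the weights λ^j sum to 1/(1−λ), which can be read as the effective number of observations behind each estimate. A brief numeric illustration (not part of the thesis code):

```python
def effective_n(lam):
    """Total weight sum_{j>=0} lam**j = 1 / (1 - lam) under exponential forgetting."""
    return 1.0 / (1.0 - lam)

def weight(lam, j):
    """Weight assigned to an observation j time-points in the past."""
    return lam ** j

n95 = effective_n(0.95)    # 20 effective observations
w100 = weight(0.95, 100)   # ~0.006: a point 100 steps back is essentially forgotten
```

With λ = 0.95 the estimate rests on roughly 20 effective observations; a covariate active only once per hundred time-points therefore contributes almost no remembered evidence, so one or two recent matches can dominate the update - precisely the inflation observed for cell_526.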

[Figure 4.6: three panels over time-points 0-100: the log-odds (top), the cell_526 indicator values (middle), and the response y (bottom)]

Figure 4.6.: Example of breakdown for original forgetting framework: cell_526


4.2.3.2. Evaluation of forgetting factors

Given the central importance of the concept of forgetting, this section is dedicated to evaluating the different forgetting strategies described in sec. 3.4.1, as well as different forgetting constants (c = λ < 1), so as to obtain the best fit and prediction capability. Note that, when using the modified forgetting framework, the additional parameter w (defining the window upon which local sparsity is estimated) is set to 10 throughout this work, as it was empirically found to be a suitable value. First we consider the simplest strategy, that of a fixed λ, using the original forgetting framework.

Lambda   AUC       G-mean
1.000    0.94746   0.88126
0.999    0.97700   0.92302

Table 4.6.: Evaluation of Simple Forgetting Strategy

Lambda   AUC       G-mean
0.99     0.97009   0.91377
0.95     0.99656   0.98119
0.90     0.99930   0.99508
0.85     0.99954   0.99649
0.80     0.99963   0.99688
0.75     0.99956   0.99657

Table 4.7.: Evaluation of Adaptive Forgetting Strategy

Lambda                                      AUC       G-mean
Multiple: 1, 0.99, 0.95, 0.90, 0.85, 0.80   0.99972   0.99867

Table 4.8.: Evaluation of Multiple Adaptive Forgetting Strategy

From Table 4.6, it can be observed that, out of the two considered λ values, the best result is obtained with λ = 0.999. As this strategy was implemented via the original forgetting framework (because we want a fixed λ), the process collapses for λ values lower than 0.999, and hence lower values were not explored for this approach. The results in Table 4.6 nonetheless give a hint that a more local fit is probably preferable - giving relatively higher weights to more recently observed data points.

From Table 4.7, one can observe that, using the extended forgetting strategy proposed by McCormick et al. (2012) coupled with the modification proposed in this thesis, we are able to improve on the results of the fixed λ considerably. The forgetting factor which resulted in the best prediction capability is λ = 0.80, obtaining an AUC of 99.963% and a G-mean of 99.688%. This again implies that a local fit is preferable to a global one. A λ value of 0.80 implies that an observation occurring 10 time-points back is assigned approximately 1/10th of the weight of the most recent observation.

In Table 4.8, results for the third strategy of multiple λ's are displayed. It is found that extending the number of λ's evaluated at each iteration leads to a marginal improvement in this case (AUC = 99.972% and G-mean = 99.867%). This improvement comes at the cost of slower computation, such that a trade-off has to be made. Since the improvement is only marginal in this case, the extension may not be worth the computational cost. However, if the degree of change itself changes over time, multiple λ's may be worthwhile.

A final comment is that, using the original (adaptive) forgetting framework proposed by McCormick et al. (2012), the lowest λ value that could be used was λ = 0.98, which resulted in AUC = 99.688% and G-mean = 98.234%. Hence, the modified forgetting framework, in addition to being able to include more covariates, is able to outperform the original approach on this data.

4.2.3.3. Extension with Dynamic Model Averaging

Two approaches for the construction of candidate models are considered: (i) one candidate model per "variable group", and (ii) one candidate model for each possible combination of "the most interesting variable groups". It should be emphasized that the sets of variables and variable groups are different in (i) and (ii): the former consists of all 122 variables (22 variable groups), whilst the latter only contains 92 variables and 6 variable groups. Let us begin by considering the former.

Strategy 1

First, the model forgetting factor, α, is considered (holding the within-model forgetting factor, λ, constant):

Lambda   Alpha   AUC     G-mean
0.99     0.99    0.881   0.798
0.99     0.95    0.917   0.839
0.99     0.90    0.931   0.855
0.99     0.85    0.935   0.860
0.99     0.80    0.937   0.862
0.99     0.75    0.937   0.863
0.99     0.70    0.937   0.863

Table 4.9.: Evaluation of alpha for DMA

From Table 4.9, one can observe that as the α value is lowered, the predictive capability of the model steadily increases. This reflects, on the one hand, that we have many small models that by themselves may not be very predictive, and on the other, that these models discriminate the data with varying quality over the span of the time-series: e.g. one variable group (candidate model) may explain the data relatively well at one point in time, but not at another. The gains in predictive capability from a lowered α take a decaying form, stopping around 0.75 (AUC is marginally lower at α = 0.70). As such, we move on to the second parameter, the within-model forgetting factor, λ (holding the model forgetting factor, α, constant):

Lambda   Alpha   AUC     G-mean
1.00     0.75    0.928   0.847
0.99     0.75    0.936   0.863
0.98     0.75    0.945   0.875
0.97     0.75    0.950   0.883
0.96     0.75    0.955   0.890
0.95     0.75    0.958   0.895
0.94     0.75    0.960   0.897
0.93     0.75    0.962   0.899
0.92     0.75    0.963   0.902
0.91     0.75    0.965   0.905
0.90     0.75    0.966   0.907
0.89     0.75    0.967   0.909
0.88     0.75    0.968   0.911
0.87     0.75    0.969   0.913
0.86     0.75    0.970   0.913
0.85     0.75    0.970   0.913

Table 4.10.: Evaluation of lambda for DMA

In Table 4.10 it can be seen that the comments made for α also apply to λ: as the forgetting factor is lowered, the overall performance increases. As previously described, what this translates into, in practice, is that the candidate models adapt to local behaviors through rapid and local updating of the coefficient estimates.
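The two forgetting mechanisms evaluated in Tables 4.9 and 4.10 interact as in the following sketch of one DMA step; this is a simplified version of the model-probability recursion in McCormick et al. (2012) with hypothetical probabilities and likelihoods, not the thesis implementation:

```python
def dma_model_step(pi_prev, pred_lik, alpha):
    """Flatten the previous model probabilities with the model forgetting
    factor alpha (prediction step), then reweight each candidate model by
    its one-step-ahead predictive likelihood (update step)."""
    pred = [p ** alpha for p in pi_prev]
    total = sum(pred)
    pred = [p / total for p in pred]          # forgetting flattens the weights
    post = [p * l for p, l in zip(pred, pred_lik)]
    total = sum(post)
    return [p / total for p in post]          # renormalized posterior

# Three candidate models; hypothetical predictive likelihoods at one step.
# A low alpha lets a recently well-predicting model take over quickly.
post = dma_model_step([0.7, 0.2, 0.1], [0.1, 0.6, 0.3], alpha=0.75)
print([round(p, 3) for p in post])
```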

It should be noted that some trouble was encountered here when lowering the λ value (even with the modified forgetting factor). These problems were limited to specific candidate models, and as such, a specific (higher) λ value was set for those.

Strategy 2

In the second strategy, we construct candidate models by considering all possible combinations of "the most interesting variable groups". Selecting six variable groups translates into 64 candidate models. These variable groups are displayed in Table 4.11. Given the encountered limitations of the previous strategy, we here set a rather conservative λ of 0.95 to ensure stability.
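The count of 64 follows from enumerating every subset of the six groups; a short sketch (group names abbreviated from Table 4.11, for illustration only):

```python
from itertools import combinations

groups = ["tProc4", "CellID", "GCP", "Radiolink", "UeRcid", "EvID"]

# Every subset of the six variable groups defines one candidate model
# (including the empty and the full model): 2**6 = 64 in total.
candidate_models = [frozenset(c)
                    for r in range(len(groups) + 1)
                    for c in combinations(groups, r)]
print(len(candidate_models))  # 64
```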


Variable Groups   Variables
tProc4            t_proc1,...,t_proc37
Cell ID           cell_1,...,cell_17
GCP               X1,...,X23
Radiolink         Radiolinkrequest,...,Radiolinkfailure
UeRcid            PS, CS, SRB, Mixed
EvID              e1a, e1d,...e1f

Table 4.11.: Variable groups

Model        Alpha   AUC     G-mean
DMA          0.99    0.983   0.939
DMA          0.9     0.990   0.954
DMA          0.8     0.993   0.961
DMA          0.7     0.994   0.965
DMA          0.6     0.995   0.968
DMA          0.5     0.995   0.970
DMA          0.4     0.996   0.971
DMA          0.3     0.996   0.972
DMA          0.2     0.996   0.973
DMA          0.1     0.997   0.974
DMA          0.01    0.997   0.974
Full Model   –       0.990   0.954

Table 4.12.: Strategy 2: Evaluation of alpha for DMA

From Table 4.12 it can be observed that as the model-forgetting factor α is lowered, the classification capability of DMA monotonically increases, outperforming the single (full) model at α = 0.90. The gains in AUC from lowering α gradually decay, and at α = 0.10 only the fifth decimal changes, so we stop there. These results imply, on the one hand, that the variable groups have a non-constant and varying degree of importance over the time-period, and on the other, that being able to shift more weight to models excluding less relevant variables is rewarding. Decreasing α as low as 0.10 flattens the distribution of the model indicator quite extensively; all candidate models are then assigned relatively low weights, and the prediction of DMA becomes an averaged prediction of many candidate models rather than a few. Whilst presenting promising results, this approach has the downside of having to consider 64 candidate models rather than 1, implying a hefty decrease in computational speed. However, as the candidate models are updated independently of one another, it is possible to parallelize this process so as to reduce the computational constraints.
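Since each candidate model is updated independently, the parallelization mentioned above can be sketched with a thread pool; the toy function below merely stands in for the real recursive coefficient update:

```python
from concurrent.futures import ThreadPoolExecutor

def update_model(state):
    """Stand-in for one candidate model's independent recursive update."""
    model_id, coef = state
    return model_id, coef + 0.1

states = [(k, 0.0) for k in range(64)]          # 64 candidate models
with ThreadPoolExecutor(max_workers=8) as pool:
    states = list(pool.map(update_model, states))
print(len(states))  # 64
```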


4.2.4. Summary of results

In Table 4.13, the best results for each of the considered approaches are presented.

Model           AUC      G-mean
Single DLR      0.9996   0.9969
Group DMA       0.9965   0.9737
Dynamic Trees   0.9341   0.8635

Table 4.13.: Summary of results
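For reference, the G-mean scores in this table combine sensitivity and specificity; a minimal sketch with hypothetical confusion-matrix counts (not taken from the thesis data):

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (dropped calls caught) and
    specificity (normal calls kept)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# A classifier that is strong on both classes scores a high G-mean.
print(round(g_mean(tp=990, fn=10, tn=985, fp=15), 3))  # 0.987
```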

It can be observed that the single dynamic logistic regression is the clear winner: it obtained the highest AUC and G-mean scores. Recall that, in contrast to the single dynamic logistic regression and the dynamic trees model, the Group DMA model only consists of 92 variables, divided into 6 variable groups. This is worth underscoring since we concluded in sec. 4.2.3.3 that the Group DMA outperformed the single model.

From Table 4.13, one can also observe the rather large difference in predictive capability between the models centered around the dynamic logistic regression and the dynamic trees. One possible explanation is that the process of updating the tree-structure may not be as rapid as the process of updating the parameters in the dynamic logistic regression. To demonstrate this point, consider Figure A.1. It can be seen that the covariate cell_id220524 was included in 100% of the N particles (trees) between time-points 4200-9800. If we next consider Figure A.2, where the reduction in entropy for the same covariate is displayed, one may observe that the period for which the reduction in entropy is large (4200-6000) is considerably shorter than the period for which the covariate is included in the trees (4200-9800).

4.2.5. Static Logistic Regression vs. Dynamic Logistic Regression

The two classifiers selected for this thesis have in common that they are dynamic and updated online. In this subsection, the question of whether a dynamic model is preferable to a static one is evaluated in terms of predictive capability over the considered period. Seven scenarios are considered, differing in the size of the training batch relative to the test set: from {10% training, 90% test} to {90% training, 10% test}. In each scenario, a static logistic regression is fitted on the training period and then used to predict the incoming calls of the next period. The results are presented in Table 4.14.

Training set proportion   AUC
0.100                     0.846
0.200                     0.855
0.333                     0.878
0.500                     0.881
0.667                     0.900
0.800                     0.911
0.900                     0.880

Table 4.14.: AUC Static Logistic Regression

It can be observed that the static classifier performs gradually better as it is fed more data: the highest AUC score is obtained with 80% of the data in the training set (AUC = 91.11%). In comparison to the dynamic logistic regression, the performance is considerably worse: recall that the best dynamic logistic regression obtained an AUC of 99.96%.

In addition to the scenarios described above, in which data are accumulated sequentially, let us consider a scenario in which all the data are available. First, we randomly assign 70% of the observations to a training set (without accounting for the sequential order of the data), and use this data to fit a logistic regression model. Second, we use this fitted model to predict the outcomes of the remaining 30% of the data (the test set). Following steps 1 and 2, one obtains predictions resulting in an AUC of 93.8%: an improvement on the sequential modeling scheme of the static classifier, but still far worse than the dynamic logistic regression.
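The two evaluation schemes for the static baseline, sequential and fully randomized splitting, can be sketched as follows (toy records standing in for the parsed call logs; assumed layout: (features, label) pairs in arrival order):

```python
import random

def sequential_split(records, train_frac):
    """Order-preserving split: train on the first chunk, test on the rest."""
    n = int(len(records) * train_frac)
    return records[:n], records[n:]

def random_split(records, train_frac, seed=0):
    """Shuffled split: ignores the arrival order of the calls."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n = int(len(shuffled) * train_frac)
    return shuffled[:n], shuffled[n:]

records = [({"x": i}, i % 2) for i in range(1000)]   # toy call records
train, test = random_split(records, 0.7)
print(len(train), len(test))  # 700 300
```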


4.3. Online drop analysis

Besides producing actual predictions, the chosen classifiers, as previously mentioned, have the advantage of leaving varying degrees of "traces" (not to be confused with the covariate) as to how these predictions were made. The idea of this part of the results, termed online drop analysis, is to explore these "traces", and in particular what they can tell us about periods with an abnormal number of dropped calls.

Having randomly undersampled the data as if it arrived online (according to the best sampling rate concluded earlier: 50/50), the call-drop time-series from Figure 4.1 is transformed so as to appear like Figure 4.7:

Figure 4.7.: Number of dropped calls, as divided into 50 equally large time-ordered subsets - based on the undersampled data (ORUS 50/50)

In the previous section, it was shown that the dynamic logistic regression clearly outperformed the dynamic trees model, and hence this section is centered around the dynamic logistic regression, although the results of the DT are supplied for comparative and confirmatory purposes. As previously mentioned, in McCormick et al. (2012) and Koop and Korobilis (2012), two approaches were considered for studying the temporal significance of covariates and how the conditional relationships change over time: the first considers posterior inclusion probabilities, and the second considers odds-ratios. Both approaches are used in this section.

4.3.1. DMA posterior inclusion probabilities

As mentioned in sec. 3.4.2, candidate models are, in this work, not constructed on the basis of individual variables but of variable groups, such that we do not consider all possible combinations of variables, but rather all possible combinations of variable groups: 2^6 = 64. A DMA model was set up with the parameter settings that were found to produce the best results in sec. 4.2.3.3. In Figure 4.8 and Figure 4.9, posterior inclusion probabilities for these variable groups over the considered period are displayed.

A first broad observation is that the inclusion probabilities are quite volatile for all of the variable groups. This is a byproduct of setting a low α value, since this parameter controls how rapid the dynamic updating of the model probabilities should be. It may further be noted that no individual candidate model assumes a posterior model probability higher than 0.5 for any notable length of time, and no variable group assumes a posterior inclusion probability of 1 or 0 for the whole period. The two variable groups with the overall lowest inclusion probability are Variable Group 1 (Trace 4 Procedures) and Variable Group 4 (UeRcid). The two variable groups with the highest overall inclusion probability are Variable Group 5 (Radiolink) and Variable Group 2 (GCP).

A more specific and possibly more interesting observation concerns Variable Group 6 (Cell IDs). In Figure 4.9 one can observe that for three periods (one around 4500, one around 9000, and one around 11500) the inclusion probabilities remain close to 1, in a way that is not observed for the rest of the time-series.
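The inclusion probabilities plotted in these figures are obtained by summing posterior model probabilities over all candidate models containing a given group; a minimal sketch with hypothetical model probabilities and group names:

```python
def inclusion_probability(group, model_probs):
    """model_probs maps a frozenset of group names (one candidate model)
    to its posterior model probability."""
    return sum(p for model, p in model_probs.items() if group in model)

# Hypothetical posterior probabilities over three candidate models:
model_probs = {
    frozenset({"CellID"}): 0.5,
    frozenset({"CellID", "GCP"}): 0.3,
    frozenset({"GCP"}): 0.2,
}
print(inclusion_probability("CellID", model_probs))  # 0.8
```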


Figure 4.8.: Posterior inclusion probabilities: Variable Group 1: Trace 4 Procedures || Variable Group 2: GCP || Variable Group 3: evID


Figure 4.9.: Posterior inclusion probabilities: Variable Group 4: UeRcid || Variable Group 5: Radiolink || Variable Group 6: Cell ID


4.3.2. Evolution of odds-ratios and reduction in entropy

In this subsection, the temporal significance of covariates is analyzed by considering the evolution of the odds-ratios as well as the posterior model probabilities from the dynamic logistic regression and the univariate scanner. The reduction in entropy from the DT model is also considered as a way to confirm the main results.

Exploring all of the covariates individually is beyond the scope of this thesis. We therefore limit the analysis to one interesting period, or more specifically, a period of abnormal call-drop ratio. It is worth emphasizing that whilst we here first select a period and then evaluate which variables were important, this order of events is not required: one could just as well assume that no knowledge of the actual call-drop ratio is possessed, and instead monitor the evolution of the output from the considered models.

Before considering the "interesting period", we first compare the degree of insight from the (full) single dynamic logistic regression to that of the univariate DMA (also referred to as the "univariate scanner"), so as to determine which to use for analyzing the "interesting period".

4.3.2.1. Single Dynamic Logistic Regression vs. Univariate DMA

As previously described, in the single dynamic logistic regression all of the covariates are included in the same model, whilst in the univariate DMA (the "univariate scanner") there are as many candidate models as there are covariates, with one covariate in each. An important difference between the two approaches concerns the forgetting factor: in the full model, we set a common forgetting factor for all of the covariates, whilst in the univariate case, we allow each covariate to have its own forgetting factor. A priori, this suggests that the latter allows a more precise recursive estimation for each of the covariates. To evaluate whether this is the case, we compare the recursive estimates from the single dynamic logistic regression that obtained the best results in sec. 4.2 (λt = 0.80) to a univariate DMA (with λt = 0.95). One such comparison is displayed in Figure 4.10 and Figure 4.11.

Figure 4.10.: Log-odds from the single dynamic logistic regression: "GCP 000011000000001100001000"

Figure 4.11.: Log-odds from the "univariate scanner": "GCP 000011000000001100001000"

Figure 4.10 and Figure 4.11 demonstrate a general finding: the univariate DMA is indeed able to update the coefficient estimates more precisely. For instance, note that the confidence band in Figure 4.11 is narrower than in Figure 4.10, reflecting the ability of the former not to update unnecessarily. The degree of similarity between the recursive estimates of the two approaches is linked to the posterior model probabilities in the univariate DMA: if the posterior model probability for a particular covariate is large at a particular time-point, the update (at that time-point) of the single dynamic logistic regression is likely to be similar to that of the univariate DMA (for an example, consider Figure A.3 and Figure A.4 in the Appendix). This is reasonable, since what determines whether forgetting should be applied at a particular time-point is the predictive likelihood, and hence a covariate that is predominant at this time-point will also have a great impact on the predictive likelihood in the full model. The increased precision of the univariate DMA comes at the cost of computational speed.

Given this analysis, the univariate DMA is used as the tool for obtaining log-odds and odds-ratios for the remainder of this section.
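The likelihood-driven forgetting decision described above can be sketched as a choice among candidate λ values at each time-point; the likelihood numbers below are hypothetical, and this is a simplification of the adaptive scheme, not the thesis code:

```python
def choose_lambda(pred_lik_by_lambda):
    """Pick the forgetting factor whose one-step-ahead predictive
    likelihood is highest at the current time-point."""
    return max(pred_lik_by_lambda, key=pred_lik_by_lambda.get)

# At a time-point where recent data predict better, forgetting is applied:
print(choose_lambda({0.95: 0.62, 1.00: 0.55}))  # 0.95
```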

4.3.2.2. Interesting period

The period of consideration is that which occurs between time-points 4200-6400, corresponding to 14-23 in Figure 4.7 (i.e. the second period of high drop rate). In Table 4.15, the covariates that were found to attain a significant positive effect during this period are listed. Time-series plots of the recursive coefficient estimates for these covariates are found in sec. A.2.2.

Coefficient                      Time-period
cell_id220524                    4200-6400
X14                              4200-5200
X15                              4200-6400
X16                              4200-5200
X21                              4200-6400
X22                              4200-5200
X23                              5200-6400
GCP 000011000000001100001000     4200-5200
GCP 000011000000011100011100     4200-5200
GCP 000000000000011000011011     5200-6400
GCP 000000000000001000011011     5200-6400
PS                               4200-6400
radiolinksetupfailurefdd         5200-6400

Table 4.15.: Significant coefficients during period of consideration: 4200-6400

As one may observe from Table 4.15 (or from sec. A.2.2), some of the covariates are only relevant for a certain part of this period. More specifically, time-point ~5200 appears to be a division point, so this period can be thought of as consisting of two sub-periods. A look at Figure 4.7 shows this to be reasonable: at time-point 18 there is a noticeable increase (corresponding to time-point ~5200 in the figures displayed in sec. A.2.2).

Sub-period 1

Considering the evolution of the odds-ratios for the first sub-period, one covariate presents a particularly interesting behavior: Cell_220524. Let us therefore consider it in a bit more detail. Recall that cells define the geographical area in which a call is made. For most of the full period (1-14200) this covariate is insignificant, with an odds-ratio hovering around 1, but during the particular period under concern the odds-ratio shoots up significantly, easily passing the previously mentioned rule-of-thumb of > 3. See Figure 4.12.

Figure 4.12.: Odds-ratios: Cell 220524

Figure 4.13.: Posterior Model Probabilities: Cell 220524

Exploring this covariate during this sub-period more closely, one finds that 52.7% of the calls were made from Cell_220524, and that of these, 98.5% were dropped. This can be compared to the prior period (1-4200), in which 5% of the calls were recorded for this cell, of which only 15% were dropped. Considering the posterior model probabilities for the (univariate) candidate model consisting of this covariate, one may observe from Figure 4.13 that DMA has assigned 100% posterior model probability to this candidate model during this period. Finally, considering the measure of reduction in entropy obtained from the dynamic trees, displayed in Figure 4.14, one can observe that it replicates Figure 4.12 and Figure 4.13 quite well.

Deriving association rules for the first sub-period (4200-5200), one obtains rules that confirm the significance of the covariates. One finding, for instance, is that ∼20% of the calls during this period were made from phones with the particular GCP combination "000011000000011100011100", and of these, 99.58% were terminated unnaturally. Another finding regards the covariate PS: 52.3% of the calls originated from phones transmitting data, and of these, 86.7% dropped.

Sub-period 2

Considering the evolution of the odds-ratios for the second sub-period, there is again one covariate that presents a particularly interesting behavior, in this case radiolinksetupfailurefdd. For the greater part of the full period, this covariate mostly registers 0-values, but during the particular sub-period of concern it registers a lot of 1's, indicating its presence in the calls. In Figure 4.15, the evolution of the odds-ratio for this covariate is presented. Furthermore, in Figure 4.16 the posterior model probabilities for the candidate model representing this covariate are displayed.


Figure 4.14.: Reduction in Entropy: Cell 220524

Figure 4.15.: Odds-ratios: Radiolinksetupfailurefdd

of 100% for the greater part of this sub-period. Furthermore, in Figure A.12 thereduction in entropy for this covariate is displayed, and it can be observed that itreplicates these two aforementioned plots quite well. Exploring this covariate moreclosely, one finds that 74.6% of calls during this sub-period had a radiolink-setup-failure (radiolinksetupfailurefdd=1), of which 98.8% dropped.Deriving association rules for the second sub-period (5200-6400), one again findsrules that confirm the significance of the covariates. During this sub-period, wefind that two particular GCP combinations, “000000000000011000011011” and“000000000000001000011011”, are relevant and correlate with dropped calls to agreat degree. They represent 18.1% and 27.8% respectively of the calls during thissub-period, and out of these 99.5% and 98.5% were dropped respectively.

4.3.3. Static Logistic Regression vs. Dynamic Logistic Regression

Figure 4.16.: Posterior Model Probabilities: Radiolinksetupfailurefdd

In sec. 4.2.5, the question of whether a dynamic approach is preferable to a static one was evaluated in terms of predictive capability. In this section, the same question is considered in terms of the degree of insight into variable effects. Table B.5 in the Appendix presents the coefficient estimates obtained from the static logistic regression model fitted on 70% of the dataset, without accounting for the order-dependence. Below, a few examples where the results of the two approaches do not align are presented.

Cell_id220517

As one may observe from Table B.5, using the static logistic regression, the coefficient for this covariate has been estimated at −0.528371 (with a p-value of 0.022355), implying a significant negative effect. If we instead consider the recursive estimates obtained from the dynamic logistic regression, displayed in Figure A.17, one can observe that whilst this covariate presents a significant negative effect for some sub-periods, it also presents periods of significant positive effect.

Cell_id220518

From Table B.5, it can be seen that the coefficient for this covariate has been estimated at −0.285032 (with the static logistic regression), but with a p-value of 0.17, and hence, by most standards, it would be considered insignificant. If we take a look at the recursive estimates displayed in Figure A.18, one can see that the coefficient is indeed insignificant for large sections of the period, but that for a few sub-periods it attains (both positive and negative) estimates that are at least ±2 standard errors away from 0.

X23

In Table B.5, one can observe a coefficient estimate of −0.506021 (with a p-value of 0.000606), implying a significant negative effect. Considering the recursive estimates from the dynamic logistic regression, displayed in Figure A.10, one may indeed observe that for large parts of the period the coefficient is estimated at a negative value, but that for a sub-period of approximately 1000 time-points it is estimated at a significant positive value (where it reaches an odds-ratio of ∼15).
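The local-significance criterion used in these three examples, an estimate at least two standard errors from zero, can be written directly (a sketch with illustrative numbers):

```python
import math

def locally_significant(beta, var, z=2.0):
    """True when a recursive coefficient estimate lies at least z
    standard errors away from zero at a given time-point."""
    return abs(beta) > z * math.sqrt(var)

print(locally_significant(0.9, 0.04))  # True:  |0.9| > 2 * 0.2
print(locally_significant(0.3, 0.04))  # False: |0.3| < 0.4
```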


4.3.3.1. Summary

For the three examples described above, the common theme is that the static framework does not capture temporal behaviors, and as a consequence, the resulting estimated effects could be misleading to interpret. For instance, interpreting the effect of "X23" as "significantly negative over the considered period", although methodologically correct, may be misleading in practical applications. There are dozens of additional cases like those described above. These results may be seen as a partial explanation of why the dynamic logistic regression also proved stronger in terms of predictive capability.


5. Discussion

To our knowledge, this thesis is the first to approach the problem of analyzing drop causes in mobile networks using online learning classification techniques. This approach was motivated, on the one hand, by the availability of class labels, and on the other, by the assumption that data are non-stationary, so that what correlates with a particular class may change over time. A natural alternative would otherwise have been to treat this as an anomaly detection problem, in which abnormal periods are identified by large increases in the number of dropped calls. By instead framing the problem as an online classification problem, one arguably addresses the core of the problem more directly, in that one does not necessarily have to first detect a suspicious period in order to detect changes in drop causes.

In its original format, the data consisted of very large .txt files, such that any direct application of statistical or machine learning methods was not possible. Consequently, quite a lot of time was initially spent on processing the data using various text mining techniques. The collective size of the .txt files was enormous, such that pre-processing with a regular computer faced memory issues. The decision was made to limit the analysis to one STP, and further to apply sampling techniques in the parsing step, such that only a limited number of observations (calls) had to be pre-processed. Making the parsing scripts more efficient, and possibly running them on a more powerful computer, would increase both the possible scope of the analysis and the practical applicability of the proposed approach; this is left for future work.

A lot of effort was invested in trying to understand the data, as well as in exploring what methods had previously been used to analyze similar data. 
One characteristic of the data that was of particular focus initially is that every observation has a sequential structure: every call has a beginning and an end, and in between these two events, data are successively recorded, denoted by time-stamps. As such, an initial idea was to use sequential pattern-mining methods to detect new behavior in the data. However, after exploring simpler (static) classification techniques on unordered data from the logs, it became clear that the sequential structure may not be as important as initially expected (for discriminating between normal and dropped calls). The problem was therefore instead defined as a classification problem in which another sequential aspect was emphasized: that between logs, rather than within logs.

On the basis of the characteristics of the data (high-dimensional, sequentially arriving, and non-stationary) and the objective of the thesis (to explore discriminative features), four criteria were set up to determine the specific classification techniques to be used. From the literature review that was performed, it was found that these criteria quite drastically narrowed the space of apt classifiers. In the end, (dynamic) extensions of logistic regression and partition trees were selected, both satisfying the four criteria relatively well.

More specifically, considering the former first, two dynamic extensions of the logistic regression were considered: the (single) dynamic logistic regression, and a further extension of it that also accounts for model uncertainty through an extension of BMA (DMA). The initial expectation was that the latter would pose a stronger alternative in terms of performance. It was however found that the standard approach of considering all possible variable combinations was not computationally feasible for this data, due to the high dimensionality as well as the long time-series. An interesting extension of DMA was proposed in Onorante and Raftery (2014) to address the issue of large model spaces, using the concept of Occam's window to reduce the number of models considered at every time point. This approach was ultimately disregarded as it (i) is suited to shorter time series, and (ii) only occasionally tests whether to include candidate models from the larger space of models, and as such does not align so well with the concept of detecting change in drop causes. Instead, two alternative strategies for constructing candidate models were considered. Whilst showing strong performance, neither could outperform the single dynamic logistic regression, which was shown to perform excellently on this data: the best model resulted in an AUC of 99.96% and a G-mean of 99.7%. The other (online learning) classification technique considered in this thesis was the Dynamic Trees.
A careful evaluation of the model parameters was performed to derive the best DT. Just as with the dynamic logistic regression, it was found that λ < 1 improved the results, implying that a local fit is preferable to a global one. In terms of predictive capability, this technique did not perform as well as the dynamic logistic regression: the best DT obtained an AUC of 93.41% and a G-mean of 86.35%. Figure 4.5 displays the performance of the DT over the considered period; as previously noted, several degradations in performance are present, implying that the tree-structure was not able to adapt as quickly as needed. Such degradations were not found for the dynamic logistic regression.

The performance of the best dynamic logistic regression classifier was further compared to that of the standard (static) logistic regression in two experimental setups in which (i) data were gradually observed, and (ii) all data were available. In both cases, the dynamic logistic regression was shown to outperform the static logistic regression by comfortable margins, supporting the hypothesis of non-stationarity as well as motivating the dynamic extension. In addition, it may be worth noting that an ANN and an SVM, known for their ability to classify complex and high-dimensional data, were also applied to this data (the full dataset); neither could beat the performance of the dynamic logistic regression. These are promising results, and as the covariates extracted from each call are of such a type that extending the proposed approach to early prediction seems feasible, this could be an interesting direction to explore in the future.

In addition to the three main characteristics of the data listed above, one may add (iv) temporal sparsity. This turned out to cause problems for the dynamic logistic regression, where the original method failed to converge for about a quarter of the covariates. To address this problem, this work presented a modification of the forgetting framework originally proposed by McCormick et al. (2012), which in addition to the predictive likelihood also considers local sparsity. This modification was shown to allow the inclusion of multiple variables that could not be used with the original forgetting framework. An evaluation in terms of predictive capacity was also performed, showing that the modified framework achieved slightly stronger performance. The basic idea of this modification, that during periods of sparsity a lower degree of forgetting is applied, so that the update of the parameters is based on a longer span of data, appears intuitive. However, whilst allowing more attributes to be included and a lower λ to be set, this modification did not completely solve the issue: it was found that under some circumstances the framework still had convergence problems, namely when λ was set very low. As such, when stability rather than maximum predictability is the objective, a more conservative selection of λ is likely preferable. By scaling the forgetting factor closer to 1 during periods of sparsity for a particular covariate, one also runs the risk of not capturing some interesting local behavior during such periods. The modification further introduces an additional parameter, defining the width of the window in which sparsity is considered, and everything else equal, more parameters are not to be preferred. Further refinement and evaluation of this modification is needed, and is left for future work.

Regarding the part of the analysis concerned with the temporal significance of covariates, which we termed online drop analysis, it was demonstrated that the selected online learning classification techniques were able to provide good insights into which variables are important at different time points. Two levels of granularity were considered: one centered around 'variable groups', and the other focusing on (as in the standard case) the 'actual variables'. In the former, posterior inclusion probabilities obtained from DMA were analyzed, and in the latter, the evolution of the log-odds or odds ratios. The scope of insights from the former was quite limited. Concerning the latter, an evaluation was performed to determine whether the 'univariate scanner' is more suitable than the single dynamic logistic regression. It was found that the 'univariate scanner', through its variable-specific forgetting, could more precisely describe the evolution of the coefficient estimates. In the case of the 'univariate scanner', in addition to evaluating the log-odds or odds ratios, we also analyzed the posterior (univariate) model probabilities. Finally, for the dynamic trees, the reduction in entropy was studied. Using these approaches, two sub-periods of abnormal call-drop rates were analyzed, and several interesting findings were made. One, for instance, is that the geographical area from which calls were made (the cells) played a key role in the first sub-period. Furthermore, it was shown that the aforementioned approaches to analyzing temporal significance resulted in similar conclusions. Reflecting briefly on the effect of the forgetting factor in the context of online drop analysis, it may again be underscored that a lower λ has the effect of giving greater weight to more recently observed data points, and hence producing more volatile updates of the odds ratios. This increases the degree to which one is able to detect local temporal significance of covariates. However, to ensure stability of the system and convergence of the algorithm, λ values below 0.9 were not considered for the univariate scanner. Further refining the modified forgetting factor, and hence potentially allowing for greater granularity in identifying local behavior, is left for future work. Since the objective of this thesis was to develop and demonstrate a framework rather than to extract specific covariates, collaboration with domain experts could improve the selection of variables to extract, so as to better serve the specific objectives of the troubleshooting team. Another extension of this approach would be to automate the "detection step", such that alerts are triggered if, for instance, a covariate reaches a certain odds ratio.

In addition to reducing the computational burden of pre-processing, sampling techniques were also used to address the issue of class imbalance. Sampling techniques generally have the positive effect of helping classifiers learn the minority class better, and this was shown to be the case in this thesis as well. Specifically, two sampling techniques were evaluated: (i) online random undersampling, and (ii) adaptive-online random undersampling, the latter developed in this thesis. Both were shown to increase the capability of the classifiers to correctly identify minority instances.
This improvement does, however, come at the cost of potential information loss. By undersampling to as great an extent as done in this thesis, one runs the risk of missing out on potentially useful information. To evaluate to what degree this was a problem, sensitivity analyses were performed to ensure that different sampling rates resulted in approximately the same conclusions. This analysis did not reveal any noteworthy issues, although additional evaluation of this aspect would nonetheless be rewarding and is left for future work.

In terms of implementation, online learning has the benefit of not requiring the storage of any data, which is valuable given that the size of the data that Ericsson accumulates every day is enormous. The DMA approach can furthermore be parallelized, since each candidate model is updated independently of the others, speeding up computation considerably.
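The first of the two sampling techniques can be sketched as a single pass over the call stream. The names and the fixed keep-probability below are hypothetical; the adaptive variant developed in the thesis would instead tune this probability from the running class ratio rather than fixing it in advance.

```python
import random

def online_undersample(stream, p_keep_majority, seed=0):
    """One-pass online random undersampling of a binary-labeled stream.

    Every minority example (dropped call, label 1) is kept; each
    majority example (normal call, label 0) is kept only with
    probability p_keep_majority. Illustrative sketch only.
    """
    rng = random.Random(seed)
    for x, y in stream:
        if y == 1 or rng.random() < p_keep_majority:
            yield x, y

# A stream with a 1% drop rate, thinned toward rough class balance:
stream = [(i, 1 if i % 100 == 0 else 0) for i in range(10_000)]
kept = list(online_undersample(stream, p_keep_majority=0.02))
```

Because the decision is made per call as it arrives, nothing needs to be stored, which is consistent with the online-learning setting described above; the information-loss risk discussed in the text corresponds to the ~98% of normal calls that are discarded.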


6. Conclusions

This work presents what is, to our knowledge, a new approach for analyzing dropped calls in mobile networks. Compared to the static state-of-the-art approaches, the developed framework enables the detection of changes in drop causes without first determining a suspicious period and, secondly, does not require any storage of data.

To address the issue of class imbalance, this thesis applied an online adaptation of the random undersampling technique, as well as an extension developed in this thesis. Whilst the developed extension did not succeed in improving on the results of the online random undersampler, both techniques were shown to significantly improve the discrimination of dropped calls compared to using the original data.

Two online learning classification techniques, dynamic logistic regression and dynamic trees, were explored in this thesis. The former was shown to have considerable problems with temporally sparse covariates. To remedy this problem, this work proposed a modification to the forgetting framework originally developed by McCormick et al. (2012). The modification was shown both to allow for the inclusion of sparse covariates and to improve the overall classification capability. Having carefully evaluated the parameters for both of the models, the best dynamic logistic regression model was shown to achieve excellent results, with an AUC of 99.96% and a G-mean of 99.7%, whilst the best dynamic trees model achieved an AUC of 93.4% and a G-mean of 86.4%. That is, the dynamic logistic regression was shown to achieve considerably stronger performance than the dynamic trees. To evaluate the choice of online learning, a comparison was also made to static logistic regression, which was found to achieve an AUC of between 80% and 92%, depending on the amount of data fed to the training set; this provides strong support for the online learning approach.
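The G-mean used here, as conventionally defined for imbalanced classification, is the geometric mean of the true-positive and true-negative rates; unlike raw accuracy, it collapses toward zero whenever either class is classified poorly. A minimal sketch (the rates below are illustrative values, not results from the thesis):

```python
import math

def g_mean(tpr, tnr):
    """Geometric mean of the true-positive and true-negative rates.

    Close to zero whenever either class is classified poorly, so a
    classifier cannot score well by neglecting the minority class.
    """
    return math.sqrt(tpr * tnr)

# A classifier strong on both classes scores high ...
balanced = g_mean(0.997, 0.997)
# ... while one that never predicts a drop is punished hard, even
# though its plain accuracy on a 1%-drop stream would be about 0.99.
degenerate = g_mean(0.0, 1.0)
```

This is why G-mean is reported alongside AUC throughout the evaluation: on a stream where dropped calls are rare, accuracy alone would hide a classifier that ignores them entirely.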

In addition to showing that the online learning approach was able to predict mobile phone call data with great precision, this thesis also shows that the selected online learning techniques are able to provide useful insights regarding temporally important variables for discriminating dropped calls from normal calls. A comparison to static models was also made in terms of variable-importance insights. It was found that considering the dataset as "a whole" rather than sequentially led to misleading effect interpretations, due to the temporal nature of the system, further supporting the online learning approach.

Whilst showing a lot of potential, the proposed approach needs to undergo further refinement and testing to ensure stability and confirm its practical use. There are several dimensions along which this work could be extended. A natural next step would be to consult domain experts and more carefully select which variables ought to be monitored. Another possibility is to evaluate whether it is possible to extend the framework to address the task of early classification of dropped calls, as in Zhou et al. (2013).


A. Figures

A.1. Results: Online classification

[Figure: line plot over Time (0–14000); y-axis "varprop", range 0.0–1.0]
Figure A.1.: Proportion of the N particles that include the covariate “Cell_id220524”

[Figure: line plot over Time (0–14000); y-axis "Variable Importance", range 0.000–0.030]
Figure A.2.: Reduction in entropy for covariate “Cell_id220524” from Dynamic Trees


A.2. Results: Online drop analysis

A.2.1. Single dynamic logistic regression vs. Univariate DMA

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.3.: Log-odds from the single dynamic logistic regression: “Cell_id220524”

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.4.: Log-odds from the “univariate scanner”: “Cell_id220524”


A.2.2. Significant covariates in interesting period

A.2.2.1. Log odds from the Dynamic Logistic Regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.5.: Recursively estimated coefficient for “X14” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.6.: Recursively estimated coefficient for “X15” from the dynamic logistic regression


[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.7.: Recursively estimated coefficient for “X16” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.8.: Recursively estimated coefficient for “X21” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.9.: Recursively estimated coefficient for “X22” from the dynamic logistic regression


[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.10.: Recursively estimated coefficient for “X23” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.11.: Recursively estimated coefficient for “GCP000011000000001100001000” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.12.: Recursively estimated coefficient for “GCP000011000000011100011100” from the dynamic logistic regression


[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.13.: Recursively estimated coefficient for “Cell_id220524” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.14.: Recursively estimated coefficient for “radiolinkfailurefdd” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.15.: Recursively estimated coefficient for “PS” from the dynamic logistic regression


A.2.2.2. Reduction in entropy from the Dynamic Trees

[Figure: line plot over Time (0–14000); y-axis "Variable Importance", range 0.000–0.030]
Figure A.16.: Reduction in entropy for “radiolinkfailurefdd” from the dynamic trees model


A.2.3. Static vs. Dynamic Logistic Regression: covariate effects

A.2.3.1. Dynamic Logistic Regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.17.: Recursively estimated coefficient for “Cell_id220517” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.18.: Recursively estimated coefficient for “Cell_id220518” from the dynamic logistic regression


[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.19.: Recursively estimated coefficient for “Cell_id220519” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.20.: Recursively estimated coefficient for “Cell_id220521” from the dynamic logistic regression

[Figure: line plot over Time (0–14000); y-axis "Log odds"]
Figure A.21.: Recursively estimated coefficient for “Cell_id220523” from the dynamic logistic regression


B. Tables

B.1. Results: Online classification

B.1.1. Dynamic Trees

Alpha  Beta  TPR    TNR    AUC    G-mean
0.99   2     0.820  0.855  0.908  0.845
0.90   2     0.814  0.852  0.906  0.841
0.80   2     0.810  0.850  0.904  0.838
0.70   2     0.815  0.842  0.904  0.839

Table B.1.: Evaluation of tree prior alpha for Dynamic Trees

Beta   Alpha  TPR    TNR    AUC    G-mean
1.75   0.99   0.811  0.854  0.908  0.840
2.00   0.99   0.820  0.855  0.908  0.845
2.25   0.99   0.815  0.855  0.905  0.843
2.50   0.99   0.811  0.848  0.901  0.836

Table B.2.: Evaluation of tree prior beta for Dynamic Trees

w     user.self  sys.self  elapsed
50     342.300    0.150     342.420
100    388.100    0.300     388.330
250    520.190    0.190     520.370
500    768.580    0.110     768.660
1000  1057.950    0.170    1058.130
2000  1796.790    5.360    1802.110
4000  2489.360   27.440    2516.830

Table B.3.: Evaluation of the effect of active pool size (w) on computational time (in seconds) for Dynamic Trees


B.1.2. Dynamic Logistic Regression

Covariate
X10
X12
imsi_factor1
imsi_factor2
imsi_factor3
imsi_factor4
imsi_factor5
imsi_factor6
imsi_factor7
X_uraupdate.orig
last_tGCP000011000000001100001000
last_tGCP000011100000000000000011
last_tGCP000000100000000000000011
last_tGCP000000100000000000000000
last_tGCP000011000000011100011100
last_tGCP000000000000001000011011
last_tGCP000000000000001000011000
last_tGCP000011000001011100011100
last_tGCP000011000000001100011100
cell_id220412
cell_id220511
cell_id220521
cell_id220524
cell_id220526
RSCP_avg.1
X_interrathandoverinfo.orig
t_proc22
radiobearerrelease
locationreport
securitymodereject

Table B.4.: Covariates for which the original forgetting framework failed to converge

Coefficient                    Estimate  Std. Error  z value  p value
Intercept                      -2.59     0.31        -8.40    < 2e-16
X6                              0.16     0.23         0.72    0.473546
X10                             0.47     0.35         1.33    0.183486
X12                             1.13     0.23         4.88    1.08e-06
X13                            -0.82     0.19        -4.37    1.23e-05
X14                            -0.03     0.14        -0.22    0.822596
X15                             1.41     0.18         7.95    1.86e-15
X16                            -0.96     0.17        -5.76    8.42e-09
X19                            -1.37     0.48        -2.86    0.004237
X22                             1.09     0.14         7.55    4.39e-14
X23                            -0.51     0.15        -3.43    0.000606
imsi_factor1                   -0.19     0.20        -0.95    0.343919
imsi_factor2                   -0.74     0.21        -3.46    0.000545
imsi_factor3                   -1.19     0.22        -5.36    8.20e-08
imsi_factor4                   -0.73     0.22        -3.30    0.000969
imsi_factor5                   -0.47     0.23        -2.01    0.044828
imsi_factor6                   -0.76     0.22        -3.42    0.000618
imsi_factor7                   -1.09     0.23        -4.81    1.54e-06
imsi_factor8                    0.01     0.23         0.05    0.962124
imsi_factor9                   -0.06     0.23        -0.25    0.801791
imsi_factor10                  -0.84     0.26        -3.26    0.001127
imsi_factor11                   1.52     0.22         6.80    1.04e-11
imsi_factor12                   0.55     0.21         2.60    0.009346
imsi_factor13                   0.52     0.22         2.32    0.020193
imsi_factor14                   0.69     0.22         3.11    0.001844
imsi_factor15                   0.64     0.24         2.67    0.007568
X_cellupdate.orig              -0.69     0.09        -7.35    1.99e-13
X_uraupdate.orig               -2.39     0.34        -6.95    3.61e-12
activate                        0.22     0.15         1.43    0.152776
activesetupdatecomplete        -0.37     0.15        -2.44    0.014902
X_physicalchannel.orig          0.85     0.11         7.42    1.13e-13
X_compressed.orig              -0.02     0.13        -0.19    0.852600
X_dlpower.orig                  0.15     0.10         1.39    0.165634
GCP_000011000000000000000000    1.21     0.23         5.24    1.60e-07
GCP_000011100000000000000000   -1.02     0.28        -3.60    0.000314
GCP_000011000000001100001000    1.57     0.36         4.39    1.11e-05
GCP_000011000000001000011000    0.52     0.26         1.97    0.049023
GCP_000011100000000000000011   -0.86     0.31        -2.80    0.005092
GCP_000000100000000000000011   -2.19     0.37        -5.89    3.86e-09
GCP_000011000000011000011011    0.88     0.30         2.95    0.003212
GCP_000011000000011000011100   -0.64     0.33        -1.94    0.052570
GCP_000000100000000000000000   -1.88     0.37        -5.13    2.92e-07
GCP_000011000000011100011100    2.37     0.29         8.04    9.15e-16
GCP_000000000000001000011011    0.73     0.27         2.77    0.005695
GCP_000000000000001000011000   -0.57     0.30        -1.90    0.058078
GCP_000011000000011000011000    0.60     0.32         1.84    0.065270
GCP_000011000001011100011100    2.18     0.33         6.58    4.64e-11
GCP_000011000000001000011011    0.82     0.33         2.46    0.013758
GCP_000011000000001100011100    1.12     0.37         3.00    0.002713
GCP_000000000000011000011011    0.61     0.60         1.02    0.308164
GCP_000000000000001000000011   -0.08     0.52        -0.16    0.874456
cell_id220412                   0.09     0.22         0.43    0.665636
cell_id220413                   0.21     0.18         1.13    0.257968
cell_id220414                   1.06     0.17         6.23    4.64e-10
cell_id220415                   1.85     0.18        10.55    < 2e-16
cell_id220416                   0.49     0.32         1.55    0.121757
cell_id220511                  -0.03     0.21        -0.12    0.901457
cell_id220512                   0.74     0.16         4.52    6.10e-06
cell_id220513                   0.20     0.24         0.85    0.396193
cell_id220514                   0.25     0.32         0.80    0.425370
cell_id220517                  -0.53     0.23        -2.28    0.022355
cell_id220518                  -0.29     0.21        -1.37    0.171236
cell_id220519                   0.26     0.24         1.07    0.282202
cell_id220521                   1.36     0.16         8.55    < 2e-16
cell_id220523                   1.01     0.18         5.66    1.52e-08
cell_id220524                   1.81     0.16        11.03    < 2e-16
cell_id220526                   1.11     0.19         5.79    6.89e-09
cell_id250511                   2.14     0.37         5.78    7.41e-09
RSCP_avg.1                      1.13     0.58         1.95    0.051235
RSCP_avg.2                     -0.76     0.16        -4.71    2.49e-06
X_interrathandoverinfo.orig    -0.32     0.25        -1.24    0.214177
X_tx.orig                       0.02     0.10         0.21    0.833289
cpichAvg.1                     -0.11     0.13        -0.82    0.414159
cpichAvg.2                      0.03     0.15         0.20    0.845180
t_proc1                        -0.27     0.18        -1.52    0.128285
t_proc14                       -0.22     0.13        -1.75    0.079899
t_proc15                       -0.72     0.33        -2.21    0.026835
t_proc16                       -0.17     0.24        -0.69    0.488020
t_proc18                       -0.29     0.28        -1.04    0.298706
t_proc2                         0.42     0.36         1.16    0.246736
t_proc21                       -0.46     0.25        -1.85    0.064493
t_proc22                        0.42     0.52         0.80    0.424858
t_proc23                       -0.35     0.32        -1.09    0.277684
t_proc29                       -0.08     0.22        -0.34    0.730663
t_proc3                         0.13     0.16         0.77    0.442721
t_proc32                        0.05     0.19         0.28    0.778440
t_proc33                        0.20     0.23         0.84    0.398736
t_proc34                       -0.02     0.47        -0.04    0.967561
t_proc37                       -0.24     0.21        -1.16    0.246237
t_proc10                        0.02     0.09         0.17    0.868475
t_proc11                       -0.19     0.09        -2.08    0.037290
t_proc12                        0.45     0.14         3.16    0.001571
t_proc13                       -0.03     0.09        -0.32    0.752861
t_proc25                        0.34     0.10         3.23    0.001231
t_proc28                        0.35     0.12         2.86    0.004256
t_proc31                       -0.05     0.10        -0.45    0.653183
t_proc4                        -0.29     0.14        -2.04    0.041486
t_proc6                         0.08     0.18         0.45    0.653796
t_proc9                         0.10     0.09         1.11    0.265443
sirAvg.1                       -0.00     0.41        -0.01    0.994379
sirAvg.2                        0.21     0.10         2.18    0.029493
radiolinksetupfailurefdd        6.92     0.44        15.55    < 2e-16
radiolinkadditionrequestfdd    -0.57     0.10        -5.60    2.15e-08
radiolinkfailureindication      2.14     0.09        22.90    < 2e-16
radiobearerrelease             -1.22     0.23        -5.34    9.52e-08
radiobearerreconfiguration     -0.72     0.11        -6.46    1.05e-10
Interact                        0.43     0.16         2.70    0.006918
SRB                            -0.42     0.19        -2.21    0.026755
Other                           2.93     0.27        10.80    < 2e-16
e4a                            -0.24     0.24        -0.98    0.328102
evid_not_measured               1.12     0.20         5.73    9.83e-09
e1d                             0.92     0.15         6.21    5.32e-10
e2d                             0.00     0.12         0.00    0.998481
e2f                            -1.14     0.14        -7.91    2.56e-15
locationreport                  0.29     0.19         1.56    0.118088
location                       -0.69     0.14        -5.05    4.41e-07
locationreportingcontrol        0.57     0.17         3.38    0.000720
X_rab.orig                      1.59     0.22         7.35    2.05e-13
securitymodereject              5.74     0.48        11.94    < 2e-16
securitymodecommand            -1.47     0.21        -7.05    1.76e-12

Table B.5.: Coefficient estimates for a Static Logistic Regression model trained on the full data set


Bibliography

Agrawal, R., Imieliński, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM.

Anagnostopoulos, C. and Gramacy, R. B. (2012). Dynamic trees for streaming and massive data contexts. arXiv preprint arXiv:1201.5568.

Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. Annals of Statistics, pages 870–897.

Brauckhoff, D., Dimitropoulos, X., Wagner, A., and Salamatian, K. (2012). Anomaly extraction in backbone networks using association rules. IEEE/ACM Transactions on Networking (TON), 20(6):1788–1799.

Breaugh, J. A. (2003). Effect size estimation: Factors to consider and mistakes to avoid. Journal of Management, 29(1):79–97.

Cheung, B., Kumar, G., and Rao, S. A. (2005). Statistical algorithms in fault detection and prediction: Toward a healthier network. Bell Labs Technical Journal, 9(4):171–185.

Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948.

Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B (Methodological), pages 215–242.

Ericsson (2014). Ericsson Mobility Report, June 2014. http://www.ericsson.com/res/docs/2014/ericsson-mobility-report-june-2014.pdf. Accessed: 2015-05-27.

Gramacy, R. B., Taddy, M., Wild, S. M., et al. (2013). Variable selection and sensitivity analysis using dynamic trees, with an application to computer code performance tuning. The Annals of Applied Statistics, 7(1):51–80.

Haddock, C. K., Rindskopf, D., and Shadish, W. R. (1998). Using odds ratios as effect sizes for meta-analysis of dichotomous data: a primer on methods and issues. Psychological Methods, 3(3):339.

Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36.

He, H. and Garcia, E. A. (2009). Learning from imbalanced data. Knowledge and Data Engineering, IEEE Transactions on, 21(9):1263–1284.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Statistical Science, pages 382–401.

Japkowicz, N. et al. (2000). Learning from imbalanced data sets: a comparison of various strategies. In AAAI Workshop on Learning from Imbalanced Data Sets, volume 68, pages 10–15. Menlo Park, CA.

Khanafer, R., Moltsen, L., Dubreil, H., Altman, Z., and Barco, R. (2006). A Bayesian approach for automated troubleshooting for UMTS networks. In Personal, Indoor and Mobile Radio Communications, 2006 IEEE 17th International Symposium on, pages 1–5. IEEE.

Koop, G. and Korobilis, D. (2012). Forecasting inflation using dynamic model averaging. International Economic Review, 53(3):867–886.

Kurgan, L. A. and Cios, K. J. (2004). CAIM discretization algorithm. Knowledge and Data Engineering, IEEE Transactions on, 16(2):145–153.

Lewis, S. M. and Raftery, A. E. (1997). Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. Journal of the American Statistical Association, 92(438):648–655.

McCormick, T. H., Raftery, A. E., Madigan, D., and Burd, R. S. (2012). Dynamic logistic regression and dynamic model averaging for binary classification. Biometrics, 68(1):23–30.

Nguyen, H. M., Cooper, E. W., and Kamei, K. (2011). Online learning from imbalanced data streams. In Soft Computing and Pattern Recognition (SoCPaR), 2011 International Conference of, pages 347–352. IEEE.

Obuchowski, N. A. (2003). Receiver operating characteristic curves and their use in radiology. Radiology, 229(1):3–8.

Onorante, L. and Raftery, A. E. (2014). Dynamic model averaging in large model spaces.

Penny, W. D. and Roberts, S. J. (1999). Dynamic logistic regression. In Neural Networks, 1999. IJCNN'99. International Joint Conference on, volume 3, pages 1562–1567. IEEE.

Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.

Raftery, A. E., Kárny, M., and Ettler, P. (2010). Online prediction under model uncertainty via dynamic model averaging: Application to a cold rolling mill. Technometrics, 52(1):52–66.

Rao, S. (2006). Operational fault detection in cellular wireless base-stations. Network and Service Management, IEEE Transactions on, 3(2):1–11.

Smith, J. (1992). A comparison of the characteristics of some Bayesian forecasting models. International Statistical Review / Revue Internationale de Statistique, pages 75–87.

Taddy, M. A., Gramacy, R. B., and Polson, N. G. (2011). Dynamic trees for learning and design. Journal of the American Statistical Association, 106(493).

Theera-Ampornpunt, N., Bagchi, S., Joshi, K. R., and Panta, R. K. (2013). Using big data for more dependability: a cellular network tale. In Proceedings of the 9th Workshop on Hot Topics in Dependable Systems, page 2. ACM.

Wang, S., Minku, L. L., and Yao, X. (2013). A learning framework for online class imbalance learning. In Computational Intelligence and Ensemble Learning (CIEL), 2013 IEEE Symposium on, pages 36–45. IEEE.

Watanabe, Y., Matsunaga, Y., Kobayashi, K., Tonouchi, T., Igakura, T., Nakadai, S., and Kamachi, K. (2008). UTRAN O&M support system with statistical fault identification and customizable rule sets. In Network Operations and Management Symposium, 2008. NOMS 2008. IEEE, pages 560–573. IEEE.

Zhou, S., Yang, J., Xu, D., Li, G., Jin, Y., Ge, Z., Kosseifi, M. B., Doverspike, R., Chen, Y., and Ying, L. (2013). Proactive call drop avoidance in UMTS networks. In INFOCOM, 2013 Proceedings IEEE, pages 425–429. IEEE.


LIU-IDA/STAT-A–15/007–SE