
Consumer Segmentation and Knowledge Extraction from Smart Meter and Survey Data ∗

Tri Kurniawan Wijaya †‡ Tanuja Ganu § Dipanjan Chakraborty§ Karl Aberer†

Deva P. Seetharam§

Abstract

Many electricity suppliers around the world are deploying smart meters to gather fine-grained spatiotemporal consumption data and to effectively manage the collective demand of their consumer base. In this paper, we introduce a structured framework and a discriminative index that can be used to segment the consumption data along multiple contextual dimensions such as locations, communities, seasons, weather patterns, holidays, etc. The generated segments can enable various higher-level applications such as usage-specific tariff structures, theft detection, consumer-specific demand response programs, etc. Our framework is also able to track consumers’ behavioral changes, evaluate different temporal aggregations, and identify the main characteristics which define a cluster.

1 Introduction

Many electricity suppliers around the world are deploying smart meters to gather fine-grained spatiotemporal consumption data [23]. These companies are interested in mining the collected data to extract deep insights such as the set of consumers to be selected for winter peak load reduction, the set of consumers to be monitored for potential theft/anomaly, the set of consumers who can be targeted for energy efficiency programs, etc. [4, 16, 17]. These insights are necessary for multiple application sub-domains in the energy sector such as billing, energy audit, etc. For all such advanced applications, consumer segmentation has been viewed as one of the key requirements [5, 10, 13, 15, 19].

However, segmenting consumers based on smart meter data is challenging for three reasons. First, the scale of smart meter data is very large: high volume (data from millions of consumers) and high velocity (meters can report data at rates from once every minute to once every 30 minutes). Second, since electricity consumption is influenced by internal (family size, work hours, economic status, etc.) and external contextual factors (weather, holiday, day of week, etc.), the meter data must be correlated with heterogeneous data sources (weather sites, survey results, etc.) that provide data at different time granularities and with varying data quality. Third, for meaningful grouping of consumers, the segmentation may need to be performed along disparate contextual dimensions.

∗ Supported by European Union’s Seventh Framework Programme (FP7/2007-2013) 288322, WATTALYST.
† School of Computer and Communication Sciences, EPFL, Switzerland. {tri-kurniawan.wijaya,karl.aberer}@epfl.ch
‡ The work is done during the author’s internship at IBM Research, India.
§ IBM Research, India. {tanuja.ganu,cdipanjan,dseetharam}@in.ibm.com

To address these challenges, we propose a novel framework for consumer segmentation. The key contributions of this paper are:

• Design and implementation of a versatile framework for consumer segmentation that tries to jointly derive ‘meaning’ from consumption data, context data, and user surveys. Previous works have primarily targeted a specific problem (e.g., setting tariffs [11, 21], predicting consumer characteristics [3]) and do not consider this task in a holistic manner.

• Design of a temporal aggregation method that varies the level of aggregation based on application requirements and data quality.

• Design of a novel clustering consistency index to track the evolution of consumption behaviors (which helps spot fraudulent activities such as theft and tampering).

• Design of a novel discriminative index and survey mining approach to identify the main consumer characteristics that can be used to classify consumers into well-demarcated clusters.

The rest of the paper is organized as follows. In Section 2, we give a brief overview of the related work. In Section 3, we explain our general-purpose consumer segmentation framework. In Section 4, we discuss our clustering consistency index. In Section 5, we outline our method to extract knowledge from survey data to obtain the main characteristics of a cluster. In Section 6, we describe the dataset and experimental results, and we conclude in Section 7.

2 Related Work

Since the rollout of the first batch of smart meters, smart meter data analytics has attracted immense interest. In particular, consumer segmentation has been considered crucial for enabling various smart grid applications. Moss et al. provided an overview of consumer segmentation in the electricity sector, especially for demand-side management [15]. Application-specific segmentation frameworks have been developed in [5, 6, 7, 10, 11, 19, 21], mainly for setting tariffs or consumer classification, and for predicting consumer characteristics [3].

In their work, Albert and Rajagopal also correlated consumers’ consumption profiles with demographics and appliance usage [3]. However, they did not attempt to identify relevant characteristics for clustering consumers. Instead, they started with a predefined set of characteristics and determined whether those can be predicted from a consumer’s consumption profile. This approach is rather similar to [12], where the authors used the same dataset as ours to predict household demographics from consumption profiles. Regarding consumer behavior analysis, [18] presented a psychographic consumer segmentation based on how consumers feel, think, and act. However, their segmentation is based solely on survey data about consumers’ behavior and attitude toward electricity and energy conservation, and did not involve processing of any consumption data.

Moreover, our clustering consistency index is related to the Rand index [20]. However, we generalize it further to determine whether an individual is likely to change cluster over time, i.e., we define an individual to cluster consistency measure.

Although earlier works have taken initial steps in deriving consumer segmentation based on smart meter data, they primarily target specific challenges/applications and do not present a ‘general-purpose’ framework. Moreover, none of the prior works has incorporated additional context-specific data for consumer segmentation. More importantly, as we will explain in the next section, the different features and algorithms employed in the prior work can be expressed as building blocks of our framework.

3 Context-Specific Consumer Segmentation

In this work, we developed a context-specific ‘general purpose’ consumer segmentation framework which exhibits 5 design principles addressing the bottom-up data federation challenges and top-down unifying solution requirements discussed earlier. This is a user-centric design which gives the framework user the freedom to specify the data, context, and feature spaces she is interested in, based on the specific consumer segmentation task at hand. The framework enables the user to accomplish a number of consumer segmentation tasks, such as segmenting the consumers based on consumption magnitude, variability, or trend.

3.1 Design Principles We define the requirement specification based on 5 design principles which can be used individually or in combination by the framework user:

R1 (Customized Data Selection): to declare a) a period she is interested in, such as from June to August 2013, b) a subset of consumers that satisfy certain criteria (like average daily consumption greater than 5 kWh), c) a specific time of day she is interested in, such as afternoon peak hours 12:00 to 16:00.

R2 (Customized Temporal Aggregation): to declare the time granularity of consumers’ consumption profiles used for segmentation, such as hourly, every 3 hours, daily, weekly, or monthly.

R3 (Customized Context): to declare a specific context such as summer, winter, weekend, January, Tuesday, temperature above a certain threshold, etc.

R4 (Customized Features): to declare a specific feature computation such as mean, standard deviation, coefficient of variation, or median.

[Figure 1 components: context sensors (weather, temperature, …; holidays, seasons, special events, …); target sensors (smart meters, …); data selection; temporal aggregation (hourly, daily, weekly, monthly, …); context filtering; features generator (statistical functions: mean, median, standard deviation, IQR, …); unsupervised learning; configuration selection.]

Figure 1: Architecture diagram of the framework.

R5 (Customized Algorithm): to declare a) a specific clustering algorithm to be used (from a predefined set), if the user has knowledge about one, b) if supported by the algorithm, the number of consumer segments (or clusters) that she is interested in, or to let the framework determine the best number of clusters (according to some cluster evaluation metrics).

3.2 General-purpose Framework Now, we define our framework, as shown in Figure 1. This framework is based on the design principles discussed above and supports the operations for data selection, temporal aggregation, context filtering, feature vector generation, and clustering algorithm selection. Let $D = \{d_1, \ldots, d_n, d_{n+1}, \ldots, d_{n+m}\}$ be the set of sensor devices available, where:

• $\{d_1, \ldots, d_n\}$ is the set of sensors that is the main subject of our analytics task, or target sensors, and

• $\{d_{n+1}, \ldots, d_{n+m}\}$ is the set of additional context sensors.

In our case, a smart meter is an example of a target sensor, whereas temperature, motion, or sound sensors are examples of context sensors.

For a vector $V$, we write $V(i)$ to address the $i$-th element of $V$. Let $\gamma_{?}(\cdot)$ represent the application of a specific design principle $?$ (from R1 to R5) to the input set.

DEFINITION 3.1. (MEASUREMENTS) We define a measurement-tuple as $s = (t_s, V_s)$, where:

• $t_s$ is a timestamp,

• $V_s \in \mathbb{R}^{n+m}$ is a vector of sensor values, i.e., $V_s(i)$ is the value of $d_i$ at time $t_s$.

A time series of measurements is defined as $S = \{s_1, \ldots, s_{|S|}\}$, where each $s \in S$ is a measurement-tuple and, whenever $i < j$, we have $t_{s_i} < t_{s_j}$, $\forall\, 1 \le i, j \le |S|$.

3.2.1 Data Selection Let $\mathcal{X}$ be the set of consumers, $ts_{start}$ and $ts_{end}$ be the starting and ending timestamps, and $td_{start}$ and $td_{end}$ be the starting and ending time of day that we are interested in, as the selection criteria of R1. In addition, let $timeOfDay(t)$ denote the time of day of timestamp $t$, i.e., the hour, minute, second, and millisecond of $t$.

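Let $S_x$ denote the time series of measurements from consumer $x$’s premises. For a set of consumers $\mathcal{X}$, we define $S_{\mathcal{X}} = \{S_x \mid x \in \mathcal{X}\}$. Let $\mathcal{X}^{+} \supseteq \mathcal{X}$. Then, data selection over $S_{\mathcal{X}^{+}}$ is

$$\gamma_{R1}(S_{\mathcal{X}^{+}}, \mathcal{X}, ts_{start}, ts_{end}, td_{start}, td_{end}) = \{S'_x \mid x \in \mathcal{X}\},$$

where $S'_x = \{s_i \mid s_i \in S_x,\ ts_{start} \le t_{s_i} \le ts_{end},\ td_{start} \le timeOfDay(t_{s_i}) \le td_{end}\}$.

3.2.2 Temporal Aggregation Let $T = [\underline{T}, \overline{T}]$ be a time interval, where $\underline{T}$ and $\overline{T}$ are timestamps serving as the lower and upper bounds of the interval, respectively. Let $\mathcal{T} = \{T_1, \ldots, T_{|\mathcal{T}|}\}$ be a set of time intervals denoting the temporal aggregation that we are interested in (R2). For a time series of measurements $S$, temporal aggregation by $\mathcal{T}$ over $S$ is defined as $\gamma_{R2}(S, \mathcal{T}) = \{s_1, \ldots, s_{|\mathcal{T}|}\}$, where $s_i = (T_{s_i}, V_{s_i})$ is an aggregated measurement, $T_{s_i} = T_i$, and

$$V_{s_i} = \sum_{t_s \in T_i} V_s, \quad \forall\, 1 \le i \le |\mathcal{T}|. \tag{3.1}$$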

Note that in Eq. (3.1) above we aggregate by summing up the sensor values. Depending on the application scenario, other aggregation functions, such as the average, maximum, or minimum, can also be used.

Example. For monthly aggregation over one year of data (from January to December), we have $|\mathcal{T}| = 12$, where each $T \in \mathcal{T}$ is a one-month time interval. Thus, aggregation by $\mathcal{T}$ over any time series $S$ results in $|\gamma_{R2}(S, \mathcal{T})| = 12$, where each element, i.e., an aggregated measurement, contains the aggregation of sensor values over one month.
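The aggregation in Eq. (3.1) can be sketched in a few lines of Python. This is a minimal illustration for a single target sensor, assuming readings are stored as (timestamp, value) pairs; the names `aggregate` and `hourly` and the sample values are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def aggregate(series, bucket_of, agg=sum):
    """Group (timestamp, value) pairs by bucket_of(timestamp) and aggregate each group."""
    buckets = defaultdict(list)
    for ts, value in series:
        buckets[bucket_of(ts)].append(value)
    # Return the aggregated values keyed by bucket, in chronological order.
    return {key: agg(values) for key, values in sorted(buckets.items())}

def hourly(ts):
    # Hypothetical bucket key: the start of the hour that contains ts.
    return ts.replace(minute=0, second=0, microsecond=0)

start = datetime(2010, 1, 1)
# Four half-hourly readings in kWh (made-up values).
readings = [(start + timedelta(minutes=30 * i), v)
            for i, v in enumerate([0.2, 0.3, 0.1, 0.4])]

hourly_sum = aggregate(readings, hourly)           # summation, as in Eq. (3.1)
hourly_max = aggregate(readings, hourly, agg=max)  # alternative aggregation function
```

Passing a different `agg` (for example `max` or `min`) mirrors the remark that other aggregation functions can replace the sum.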

3.2.3 Context Filtering We define two context types that can be specified by the user with respect to R3, namely calendar context and measured context. Calendar context is defined on timestamps, such as summer, January, weekday, or weekend. Measured context is defined on sensor values, such as temperature between 30 and 35 degrees, or humidity between 50% and 60%.

DEFINITION 3.2. (CALENDAR CONTEXT) We define a calendar context $u$ as a function $f_u : t \to \{0, 1\}$, where $t$ is a timestamp. We have $f_u(t) = 1$ if $t$ belongs to context $u$, and 0 otherwise. Let $U$ be a set of calendar contexts. A time interval $T = [\underline{T}, \overline{T}]$ satisfies $U$ iff $f_u(\underline{T}) = 1$ and $f_u(\overline{T}) = 1$, $\forall u \in U$.

Example. Let $U = \{summer, weekend\}$ be the set of calendar contexts that we are interested in. We have:

• $f_{summer} : t \to \{0, 1\}$, which returns 1 if $t$ is in summer, and 0 otherwise,

• $f_{weekend} : t \to \{0, 1\}$, which returns 1 if $t$ is on a weekend day, and 0 otherwise.

That is, the time intervals which satisfy $U$ are the intervals whose lower and upper bounds are both in summer and on weekend days.

DEFINITION 3.3. (MEASURED CONTEXT) We define a measured context as a tuple $w = (\delta_w, X_w)$, where $\delta_w \in \{n + 1, \ldots, n + m\}$ is a sensor index, $d_{\delta_w} \in D$ is the context sensor, and $X_w$ is the accepted interval of the values of context sensor $d_{\delta_w}$. A vector of sensor values $V \in \mathbb{R}^{n+m}$ satisfies a set of measured contexts $W$ iff $V(\delta_w) \in X_w$, $\forall w \in W$.

DEFINITION 3.4. (CONTEXT FILTERING) Let $S$ be a time series of aggregated measurements, $U$ be a set of calendar contexts, and $W$ be a set of measured contexts. Context filtering over $S$ by $U$ and $W$ is defined as

$$\gamma_{R3}(S, U, W) = \{s \mid s \in S,\ T_s \text{ satisfies } U, \text{ and } V_s \text{ satisfies } W\}.$$
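Definitions 3.2 to 3.4 can be illustrated with a short Python sketch. It assumes that an aggregated measurement is a pair of an interval (lower, upper) and a list of sensor values, that accepted intervals are closed, and that sensors are indexed from 0 (the paper indexes from 1). All function names and sample values are hypothetical.

```python
from datetime import datetime

def is_weekend(t):
    # Calendar context: Saturday (5) or Sunday (6).
    return t.weekday() >= 5

def is_january(t):
    # Calendar context: timestamps in January.
    return t.month == 1

def satisfies_calendar(interval, contexts):
    """An interval satisfies U iff both of its bounds belong to every context in U."""
    lower, upper = interval
    return all(f(lower) and f(upper) for f in contexts)

def satisfies_measured(values, measured):
    """measured is a list of (sensor_index, (low, high)) pairs with closed intervals."""
    return all(low <= values[idx] <= high for idx, (low, high) in measured)

def context_filter(series, calendar, measured):
    """Keep the aggregated measurements that satisfy both kinds of context."""
    return [(interval, values) for interval, values in series
            if satisfies_calendar(interval, calendar)
            and satisfies_measured(values, measured)]

# Values are [consumption in kWh, temperature in degrees]; all numbers are made up.
series = [
    ((datetime(2010, 1, 2, 10), datetime(2010, 1, 2, 11)), [0.8, 3.0]),   # Saturday
    ((datetime(2010, 1, 4, 10), datetime(2010, 1, 4, 11)), [0.5, 2.0]),   # Monday
    ((datetime(2010, 1, 3, 10), datetime(2010, 1, 3, 11)), [0.9, 12.0]),  # Sunday
]
# January weekends with temperature between 0 and 5 degrees.
kept = context_filter(series, [is_january, is_weekend], [(1, (0.0, 5.0))])
```

Only the Saturday measurement passes both filters: the Monday interval fails the weekend context and the Sunday interval fails the temperature range.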

3.2.4 Feature Vector Generation In the context of energy consumption sensing, or environmental sensing in general, measurements from different times of day can be very different, and hence the time of day is considered an important feature for various data mining tasks. We take that knowledge into account by allowing the features to be built around different time intervals.

Let $\Gamma$ be a set of time interval sets, where each $\mathcal{T}_i \in \Gamma$ is a time interval set. Let $\phi : 2^{\mathbb{R}} \to \mathbb{R}$ be a feature builder function (or feature function) with respect to R4, which typically is a statistical function such as mean, median, standard deviation, or inter-quartile range. Let $S$ be a time series of aggregated measurements, where $s = (T_s, V_s)$, $\forall s \in S$. A feature vector computed from $S$ using $\phi$ over $\Gamma$ is $\gamma_{R4}(S, \phi, \Gamma) = F \in \mathbb{R}^{|\Gamma| \cdot n}$, where:

$$F(i \cdot n + j) = \phi\big(\{V_s(j) \mid T_s \in \mathcal{T}_i,\ s \in S\}\big), \quad i = 0, \ldots, |\Gamma| - 1,\ j = 1, \ldots, n. \tag{3.2}$$

The $(i \cdot n + j)$-th element of feature vector $F$ is computed using function $\phi$ over the set of aggregated sensor values $V_s$ that belong to the same time interval set $\mathcal{T}_i$ and target sensor $d_j \in \{d_1, \ldots, d_n\}$.

Example. We give an example of hourly feature generation. Let us assume that for the data selection, the user is interested in the data of year 2010. For the temporal aggregation, she is interested in hourly temporal aggregation. To simplify our example, let us assume that she is not interested in any context filtering, i.e., $U = \{\}$ and $W = \{\}$. Furthermore, the smart meter is our only target sensor, i.e., $n = 1$. Assume that we have a time series of aggregated measurements, $S$, for the whole year of 2010, where each $s \in S$ is an (hourly) aggregated measurement for one hour in 2010. Let $\mathcal{T} = \{T_s \mid s \in S\}$ be the set of all time intervals in $S$; thus we have $|\mathcal{T}| = 365 \cdot 24 = 8760$. Furthermore, let $\Gamma = \{\mathcal{T}_1, \ldots, \mathcal{T}_{24}\}$ be a set of time interval sets, where each time interval set $\mathcal{T}_i \subset \mathcal{T}$ contains the time intervals which account only for hour $i$. Let $\phi$ be a function that calculates the mean. Hence, the result of $\gamma_{R4}(S, \phi, \Gamma)$ is $F \in \mathbb{R}^{24}$, where each $F(i)$ is the mean of the hourly consumption of hour $i$ throughout the year 2010.
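The hourly example can be sketched as follows. This is a minimal illustration for one target sensor ($n = 1$), assuming each aggregated measurement is a pair of the hour's start timestamp and its value; the function name `hourly_features` and the two days of sample data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

def hourly_features(series, feature_fn):
    """Build a 24-element feature vector: feature_fn over all values of each hour of day.

    series is a list of (interval_start, value) hourly aggregated measurements.
    Every hour of day must occur at least once, otherwise feature_fn may fail.
    """
    by_hour = [[] for _ in range(24)]
    for start, value in series:
        by_hour[start.hour].append(value)
    return [feature_fn(values) for values in by_hour]

# Two days of made-up data: hour h has value h on day one and h + 2 on day two.
t0 = datetime(2010, 1, 1)
series = [(t0 + timedelta(hours=h), float(h % 24 + 2 * (h // 24))) for h in range(48)]

features = hourly_features(series, mean)
```

Here each of the 24 groups plays the role of one time interval set in $\Gamma$, and `feature_fn` plays the role of $\phi$; passing `statistics.median` instead of `mean` would give the median-based features.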

Generation from a set of context sets and feature functions. There could be a case where we would like to have a feature vector which is the combined result of applying R3 using a set of context sets and R4 using a set of feature functions. For example, instead of using the mean as the only feature function, we might want to use both mean and median to obtain a more robust segmentation. Let $S$ be the aggregated measurements satisfying R1 and R2, $\Theta$ be the set of context sets, and $\Phi$ be the set of feature functions. In addition, let $\Gamma$ be the set of time interval sets used to build the features. For each context set $(U_i, W_i) \in \Theta$ and feature function $\phi_j \in \Phi$, we compute $F_{ij} = \gamma_{R4}(\gamma_{R3}(S, U_i, W_i), \phi_j, \Gamma)$. Finally, we append the $F_{ij}$, one after the other, to form the combined feature vector, where $1 \le i \le |\Theta|$ and $1 \le j \le |\Phi|$.

3.2.5 Clustering Algorithm Application Given the expert knowledge from the user to apply a specific algorithm $A \in \mathcal{A}$ and its parameter setting $\psi$, our framework should be able to apply it to the consumers’ feature vectors (R5 principle).

Let $\mathcal{X}$ be a set of consumers, and $F_{\mathcal{X}} = \{F_x \mid x \in \mathcal{X}\}$ be the set of feature vectors of all consumers in $\mathcal{X}$. Then, the application of clustering algorithm $A$ with parameter setting $\psi$ over $F_{\mathcal{X}}$ results in a set of clusters (or cluster configuration, or configuration), i.e., $\gamma_{R5}(A, \psi, F_{\mathcal{X}}) = \{c_1, \ldots, c_k\}$, where each cluster $c_j \subseteq \mathcal{X}$, $\forall\, 1 \le j \le k$, is a set of consumers.

Automatic Cluster Configuration Selection. Given different parameter settings, we are often uncertain which parameter setting delivers the best cluster configuration (according to some cluster evaluation metrics). This holds even if the setting is simple and easy to understand, such as the number of clusters $k$ to be created when using the k-means algorithm. This motivates us to include an automatic selection of cluster configuration in our framework. Our selection mechanism is similar to the mechanism in [14, 24], i.e., we attempt to select compact and well-separated clusters.

Given a clustering algorithm $A$, a set of parameter settings $\Psi = \{\psi_1, \ldots, \psi_{|\Psi|}\}$, and consumers’ feature vectors $F_{\mathcal{X}}$, we can obtain a set of cluster configurations $\mathcal{C} = \{C_i \mid \gamma_{R5}(A, \psi_i, F_{\mathcal{X}}) = C_i\}$. Thus, our task is to select the best cluster configuration $C^* \in \mathcal{C}$. In order to determine $C^*$, we use three cluster evaluation metrics: Silhouette index [22], Dunn index [9], and Davies-Bouldin index [8].¹ We perform majority voting over the best configurations identified by each metric. Algorithm 3.1 describes the selection mechanism in more detail.

Functions sortSilhouette($\mathcal{C}$), sortDunn($\mathcal{C}$), and sortDaviesBouldin($\mathcal{C}$) compute the Silhouette, Dunn, and Davies-Bouldin index, respectively, for each configuration in $\mathcal{C}$, and return an ordered list of the cluster configurations, sorted by configuration quality in descending order (that is, sorted in decreasing order for the Silhouette and Dunn indices, and in increasing order for the Davies-Bouldin index). Let $count(L, e)$ be the count of element $e$ in the list $L$. Function mostFrequent($L$) returns an element $e^*$ in $L$, where $count(L, e^*) > count(L, e)$ for all $e \ne e^*$ in $L$. In other words, mostFrequent($L$) returns the element with the largest count, $e^*$, provided that no other element in $L$ has the same count as $e^*$. If there is no such element, this function returns null.

¹ See a more detailed description in [1].

Algorithm 3.1: Automatic Cluster Configuration Selection

Input: a set of cluster configurations $\mathcal{C} = \{C_1, \ldots, C_{|\mathcal{C}|}\}$
Output: the best configuration $C^* \in \mathcal{C}$

silhList ← sortSilhouette($\mathcal{C}$)
dunnList ← sortDunn($\mathcal{C}$)
davbList ← sortDaviesBouldin($\mathcal{C}$)
countList ← [ ]
$C^*$ ← null
i ← 1
repeat
    countList.add(silhList[i])
    countList.add(dunnList[i])
    countList.add(davbList[i])
    $C^*$ ← mostFrequent(countList)
    i ← i + 1
until (i > $|\mathcal{C}|$) ∨ ($C^*$ ≠ null)
return $C^*$
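The voting loop of Algorithm 3.1 can be sketched in Python as below. The sketch takes the three ranked lists as input instead of computing the Silhouette, Dunn, and Davies-Bouldin indices, and it assumes that the loop stops as soon as a unique most frequent configuration emerges. The names `most_frequent` and `select_configuration` are illustrative, and configurations are represented here by their value of k.

```python
from collections import Counter

def most_frequent(items):
    """Return the element with the strictly largest count, or None on a tie or empty list."""
    top = Counter(items).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # no unique winner
    return top[0][0]

def select_configuration(silh_list, dunn_list, davb_list):
    """Each argument is a list of configurations ranked from best to worst by one index."""
    count_list = []
    best = None
    for candidates in zip(silh_list, dunn_list, davb_list):
        # Add the next-best configuration according to each of the three indices.
        count_list.extend(candidates)
        best = most_frequent(count_list)
        if best is not None:
            break
    return best

# Rankings of three configurations, identified by their number of clusters k.
winner = select_configuration([2, 3, 4], [3, 2, 4], [2, 4, 3])
# Here no index pair agrees on the top choice, so the second round decides.
second_round = select_configuration([2, 3, 4], [3, 2, 4], [4, 3, 2])
# A fully symmetric ranking never produces a unique winner.
no_winner = select_configuration([2, 3, 4], [3, 4, 2], [4, 2, 3])
```

In the first call two of the three indices rank k = 2 first, so it wins immediately; in the third call every configuration keeps the same count after each round, and the function returns `None`, matching the null result of mostFrequent.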

4 Clustering Consistency

This section answers two important questions. First, is there any consumer who changes their behavior over time? For example, we would like to know whether there is a consumer who is in the low consumption cluster in January, but changes to the medium/high consumption cluster in February. This insight is important for devising personalized feedback to the consumer. Second, how different is one cluster configuration from another? For example, how different is the cluster configuration using January data compared to using February data, or March, April, etc.? Answering this question gives insight to the utility company on the key contexts to consider when developing policies, such as differential pricing or demand response signals. In the sequel, we use the terms individual and consumer interchangeably.

4.1 Individual to Cluster Consistency Given two cluster configurations, we develop a measure to indicate whether an individual has the same cluster membership in both of them. Because a cluster configuration is invariant to the cluster labels, we require the measure to be invariant to the label sets as well. Thus, the idea is to define an individual to cluster consistency index (i2c) which computes how consistent the fellow cluster members of an individual are across the two configurations.

Let $C$ be a cluster configuration. We write $C(x)$ to denote the cluster of $x$, i.e., the cluster $c \in C$ where $x \in c$. We define the individual to cluster consistency of consumer $x \in \mathcal{X}$ on two cluster configurations $C_1$ and $C_2$ as:

$$\mathrm{i2c}(x, C_1, C_2) = \frac{|C_1(x) \cap C_2(x)| + |(\mathcal{X} \setminus C_1(x)) \cap (\mathcal{X} \setminus C_2(x))| - 1}{|\mathcal{X}| - 1}. \tag{4.3}$$

Intuitively, if we denote the set of consumers in $C(x)$ as the friends of $x$, and the others as non-friends, then i2c measures the number of friends of $x$ in $C_1$ who are also friends of $x$ in $C_2$, plus the number of non-friends in $C_1$ who are also non-friends in $C_2$, normalized by the number of all consumers excluding $x$. The value of i2c ranges between 0 and 1. The closer it is to 1, the more consistent $x$’s cluster membership is in $C_1$ and $C_2$.
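Eq. (4.3) translates directly into set operations. The following Python sketch assumes that a cluster configuration is a list of disjoint sets of consumer identifiers; the function name `i2c` follows the paper, while the five-consumer example is made up.

```python
def i2c(x, config1, config2, consumers):
    """Individual to cluster consistency of x across two cluster configurations."""
    c1 = next(c for c in config1 if x in c)  # cluster of x in the first configuration
    c2 = next(c for c in config2 if x in c)  # cluster of x in the second configuration
    friends = len(c1 & c2)                                  # friends in both (includes x)
    non_friends = len((consumers - c1) & (consumers - c2))  # non-friends in both
    return (friends + non_friends - 1) / (len(consumers) - 1)

consumers = {"a", "b", "c", "d", "e"}
january = [{"a", "b", "c"}, {"d", "e"}]
july = [{"a", "b"}, {"c", "d", "e"}]

score_a = i2c("a", january, july, consumers)  # (2 + 2 - 1) / 4
score_c = i2c("c", january, july, consumers)  # (1 + 0 - 1) / 4
```

Consumer "c" leaves all of her January friends and joins all of her January non-friends, so her score is 0, while "a" keeps most of her relations and scores 0.75.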

4.2 Distance Rank Among the individuals who change their clusters, one might be more interested in those located closer to the centroid of their new cluster. Thus, we define an additional measure, distance rank, to denote the ranking of an individual’s distance to its cluster representative (such as the centroid) compared to the other cluster members. We use distance rank instead of the actual distance because it is invariant to cluster size. Thus, it can be used for comparison across different clusters.

Let $C(x)$ be the cluster of $x$ in configuration $C$, and $\zeta_{C(x)}$ be the cluster representative of $C(x)$. In addition, let $\mathrm{dist}(x, \zeta_{C(x)})$ be the distance of $x$ to its cluster representative. For an individual $x$, we define its distance rank as:

$$\mathrm{dr}(x, C(x)) = \frac{|\{x' \mid \mathrm{dist}(x, \zeta_{C(x)}) < \mathrm{dist}(x', \zeta_{C(x)}),\ x' \in C(x)\}|}{|C(x)|}. \tag{4.4}$$

Distance rank ranges between 0 and 1. The higher the value, the closer the individual is to its cluster representative.
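A minimal Python sketch of Eq. (4.4) follows. It assumes a single cluster whose members are described by one-dimensional made-up profiles and whose representative is the arithmetic mean; the names `distance_rank`, `profile`, and `dist` are illustrative only.

```python
from statistics import mean

def distance_rank(x, cluster, dist_to_rep):
    """Fraction of cluster members that are strictly farther from the representative than x."""
    farther = sum(1 for other in cluster if dist_to_rep(x) < dist_to_rep(other))
    return farther / len(cluster)

# One cluster with three members and made-up one-dimensional profiles.
profile = {"a": 1.0, "b": 2.0, "c": 6.0}
centroid = mean(profile.values())  # 3.0, used as the cluster representative

def dist(member):
    # Absolute distance of a member's profile to the centroid.
    return abs(profile[member] - centroid)

ranks = {member: distance_rank(member, profile, dist) for member in profile}
```

Member "b" is closest to the centroid and therefore has the highest rank (2/3), whereas the farthest member "c" has rank 0. Because the comparison is strict, the rank of any member is at most $(|C(x)| - 1) / |C(x)|$.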

4.3 Cluster Configuration Consistency In order to investigate community behavioral changes over different contexts, we can measure the difference between the cluster configurations resulting from those contexts. Let $\mathcal{X}$ be a set of consumers, and $C_1$ and $C_2$ be two cluster configurations over $\mathcal{X}$. We compare $C_1$ and $C_2$ using the cluster configuration consistency index, defined as the average i2c of their individuals:

$$\mathrm{ccc}(C_1, C_2, \mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \mathrm{i2c}(x, C_1, C_2). \tag{4.5}$$

The ccc index ranges between 0 and 1. The higher the ccc between $C_1$ and $C_2$, the more similar they are.
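Eq. (4.5) can also be computed from cluster labels, which makes the invariance to label names explicit. The sketch below assumes each configuration is a dictionary from consumer identifier to an arbitrary cluster label; this representation, the function name `ccc`, and the sample data are assumptions made for illustration.

```python
def ccc(labels1, labels2):
    """Average individual to cluster consistency over all consumers."""
    consumers = list(labels1)
    n = len(consumers)
    total = 0.0
    for x in consumers:
        # Count the consumers y whose relation to x (friend or non-friend) is the
        # same in both configurations; x itself is always counted as a friend.
        agree = sum(
            1 for y in consumers
            if (labels1[y] == labels1[x]) == (labels2[y] == labels2[x])
        )
        total += (agree - 1) / (n - 1)
    return total / n

january = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
july = {"a": "low", "b": "low", "c": "high", "d": "high", "e": "high"}
relabeled = {"a": 1, "b": 1, "c": 1, "d": 0, "e": 0}

changed = ccc(january, july)         # one consumer switched clusters
unchanged = ccc(january, relabeled)  # same partition under different labels
```

Swapping the label names leaves the index at 1, while moving consumer "c" to the other cluster lowers it to 0.6.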

5 Knowledge Extraction from Survey Data

Consumer segments can be useful for implementing different policies, such as targeted demand response, more personalized energy feedback, or differential pricing. However, having a consumer segmentation alone is not enough. We also need to understand which characteristics constitute a consumer cluster. Only by developing this understanding can we develop effective and efficient policies that are better tailored to our consumers.

Consumer characteristics can take the form of demographic profiles, house types, appliance usage, or living styles. We can obtain these through a survey/questionnaire, using questions such as: What best describes the people you live with (single/adults/adults with children)? Do you have a dishwasher (yes/no)? What is the approximate floor area of your home?² In this section, we focus on mining the consumer characteristics which are discriminative, i.e., the characteristics which make a cluster different from the others. We model a consumer characteristic as a pair of question and answer. Next, we describe how to compute the discriminative index.

5.1 Discriminative Index We define a measure to express how discriminative a question-answer pair is in distinguishing a cluster from the others. Let $\mathcal{X}$ be a set of consumers, and $C$ be a cluster configuration over $\mathcal{X}$. For a cluster $c \in C$, we denote by $\neg c$ all individuals who are not in $c$, i.e., $\neg c = \{x' \mid x' \in \mathcal{X} \setminus c\}$. In addition, let $q$ be a question, $ans(q)$ be the set of possible answers to $q$, $ans_x(q) \in ans(q)$ be the answer of consumer $x$ to question $q$, and $N_{c,q}$ be the set of consumers in cluster $c$ who respond to question $q$.

We define $Z_c(q, a)$ as the fraction of individuals in cluster $c$ who answer $a$ to question $q$, i.e.,

$$Z_c(q, a) = \frac{|\{x \mid ans_x(q) = a,\ x \in c\}|}{|N_{c,q}|}, \tag{5.6}$$

where $a \in ans(q)$. Then, the discriminative index of question $q$ and answer $a$ for cluster $c$ is defined as:

$$DI_c(q, a) = \frac{Z_c(q, a) - Z_{\neg c}(q, a)}{\max\big(Z_c(q, a), Z_{\neg c}(q, a)\big)}. \tag{5.7}$$

The discriminative index ranges between −1 and +1. It is discriminative positive (or negative) if it is positive (or negative). Both discriminative positive and negative values explain how a cluster differs from the others. A discriminative index close to +1 means that most of the individuals in cluster $c$ answer $a$ to question $q$, whereas individuals in other clusters do not. In contrast, a discriminative index close to −1 means that most of the individuals in other clusters answer $a$ to question $q$, whereas individuals in cluster $c$ do not. A discriminative index close to 0 means that answer $a$ to question $q$ does not differentiate cluster $c$ from the others, i.e., it has little or no discriminative power for cluster $c$.
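Eqs. (5.6) and (5.7) can be sketched as follows. The sketch assumes survey answers are stored in a dictionary that only contains the consumers who responded, and it returns 0 when neither group gives the answer, a case the definition leaves open. The function names and the dishwasher data are hypothetical.

```python
def answer_fraction(answers, members, answer):
    """Z(q, a): fraction of responding members who gave the specified answer."""
    respondents = [x for x in members if x in answers]
    if not respondents:
        return 0.0  # assumption: no respondents counts as fraction 0
    return sum(1 for x in respondents if answers[x] == answer) / len(respondents)

def discriminative_index(answers, cluster, consumers, answer):
    """DI_c(q, a) for one cluster against all remaining consumers."""
    z_in = answer_fraction(answers, cluster, answer)
    z_out = answer_fraction(answers, consumers - cluster, answer)
    denom = max(z_in, z_out)
    # Assumption: return 0 when nobody gives this answer (0/0 in Eq. 5.7).
    return 0.0 if denom == 0 else (z_in - z_out) / denom

consumers = {1, 2, 3, 4, 5, 6}
cluster = {1, 2, 3}
# Made-up answers to "Do you have a dishwasher?"
dishwasher = {1: "yes", 2: "yes", 3: "no", 4: "no", 5: "yes", 6: "no"}

di_yes = discriminative_index(dishwasher, cluster, consumers, "yes")
di_no = discriminative_index(dishwasher, cluster, consumers, "no")
```

Two thirds of the cluster answer "yes" against one third of the others, which gives an index of 0.5 for "yes" and −0.5 for "no".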

5.2 Dealing with Ordinal and Quantitative Data In a survey, some answers are ordinal or quantitative. For example, in our survey data, answers to the question whether a consumer would like to do more to reduce electricity usage are ordinal, i.e., five levels from strongly agree to strongly disagree. As another example, answers to a question about the approximate floor area of the house are quantitative, i.e., a number which represents the floor area.

In the previous section, we determined whether a specific pair of question and answer is a key characteristic of a cluster. However, for ordinal and quantitative answers, we are interested in broader insights. For example, instead of “most of the consumers in cluster $c$ have X sq ft. floor area”, we are more interested in a more general insight, if any, such as “most of the consumers in cluster $c$ have less than X sq ft. floor area”, or “between X and Y”, or “greater than X”. One way to do this is by introducing splitting points which divide the answers into groups, or ranges. But this leads to a combinatorial problem: how many points do we need for the best splitting, and where should we put them?

² These example questions are taken from our dataset (explained in Section 6).

Let $q$ be a question, and $ans(q)$ be the set of consumers’ answers to $q$. To solve this problem, we sort the answers in ascending order and put them into a list. We add $-\infty$ and $+\infty$ to the beginning and the end of the list. Let $l = |ans(q)| + 2$ be the length of the list. We take all possible $n$-grams from the list, where $n$ varies from $l - 1$ to 1. Next, we create a set of ranges $R_{ans(q)}$ by taking the first and the last element of each $n$-gram as the lower and upper bounds (inclusive). Finally, we remove the ranges $[-\infty, -\infty]$ and $[\infty, \infty]$ from $R_{ans(q)}$. This takes polynomial time in the size of $ans(q)$.

Example. Let $ans(q) = \{1, 2, 5\}$. The set of ranges created from all possible $n$-grams of length 4 down to 1, without $[-\infty, -\infty]$ and $[\infty, \infty]$, is $R_{ans(q)} = \{[-\infty, 5], [1, \infty], [-\infty, 2], [1, 5], [2, \infty], [-\infty, 1], [1, 2], [2, 5], [5, \infty], [1, 1], [2, 2], [5, 5]\}$.
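The range construction can be sketched in Python as below. The function name `candidate_ranges` is illustrative, ranges are represented as (lower, upper) tuples with inclusive bounds, and infinite bounds are represented with `math.inf`.

```python
import math

def candidate_ranges(answers):
    """Enumerate the ranges spanned by every n-gram of the padded, sorted answer list."""
    values = [-math.inf] + sorted(set(answers)) + [math.inf]
    ranges = []
    # n-gram lengths from l - 1 down to 1, where l = len(values).
    for n in range(len(values) - 1, 0, -1):
        for start in range(len(values) - n + 1):
            low, high = values[start], values[start + n - 1]
            if low == high and math.isinf(low):
                continue  # drop [-inf, -inf] and [inf, inf]
            ranges.append((low, high))
    return ranges

ranges = candidate_ranges({1, 2, 5})
```

For the answers {1, 2, 5} this produces the 12 ranges of the example, in the same order. With $l$ list elements there are $l(l-1)/2 + l - 2$ ranges, i.e., a quadratic number, which is consistent with the polynomial-time remark.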

Then, for ordinal and quantitative questions/answers, instead of computing discriminative indices based on $a \in ans(q)$, we now compute them based on $a_r \in R_{ans(q)}$:

$$Z_c(q, a_r) = \frac{|\{x \mid ans_x(q) \in a_r,\ x \in c\}|}{|N_{c,q}|}, \tag{5.8}$$

$$DI_c(q, a_r) = \frac{Z_c(q, a_r) - Z_{\neg c}(q, a_r)}{\max\big(Z_c(q, a_r), Z_{\neg c}(q, a_r)\big)}. \tag{5.9}$$

6 Experimental Evaluations

In this section, we describe our experiment details. We use the Irish CER dataset, which contains energy consumption measurements of around 5,000 consumers over 1.5 years [2]. The measurements started in July 2009 and ended in December 2010, and recorded energy consumption in kWh every 30 minutes. We choose residential consumers that belong to the control group and have no missing values. This results in the selection of 782 consumers. The smart meter is the only target sensor, $n = 1$, and there is no context sensor, $m = 0$. In addition, the dataset also contains survey results, which include consumers’ demographics (occupation, family type, etc.), house information (ownership, age, floor area, etc.), and appliance usage (dishwasher, TV, water pump, etc.).³

6.1 Consumer Segmentation First, we demonstrate the result of our selection mechanism (described in Section 3.2.5). Second, we show the results of using our framework to accomplish different consumer segmentation tasks, i.e., consumer segmentation by consumption trends, absolute consumption, and consumption variability. We also show that applying the same task in different contexts yields different results. We visualize each cluster by its centroid.

³ http://www.ucd.ie/issda/static/documentation/cer/smartmeter/cer-residential-pre-trial-survey.pdf

6.1.1 Automatic Cluster Configuration Selection Our automatic selection mechanism aims to select compact clusters which are far from each other (well separated). We perform a consumer segmentation task based on consumption trends. We select all data (R1). We use hourly temporal aggregation (R2). Our feature vector is a combined feature vector, built from a set of context sets (R3) and feature functions (R4). We use calendar context only (without measured context), i.e., the set of context sets $\{(U, W)\}$ is $\{(\{\text{January}, \text{weekday}\}, \{\}), (\{\text{January}, \text{weekend}\}, \{\})\}$. To obtain the consumption trends, we use {normalized mean, normalized median} as the set of feature functions. Normalized here means that we apply standard normalization on the measurement tuple, such that its mean is 0 and its standard deviation is 1. We use the k-means algorithm, for $k = 2, \ldots, 10$ (R5). As a consequence, we obtain 9 different cluster configurations.

The cluster configuration resulting from $k = 2$ is determined as the best configuration by our automatic selection mechanism. Figure 2 illustrates the results using $k = 2$, $k = 3$, and $k = 4$. We can see that the centroids of the clusters using $k = 2$ are better separated than the others. In addition, the configuration separates the consumers who have high and low peak consumption in the evening.

6.1.2 Various Segmentation Tasks In this section, we show the generality of our framework in accomplishing different consumer segmentation tasks. Furthermore, we show that applying consumer segmentation in different contexts produces different results.

We perform consumer segmentation by consumption trends, absolute consumption, and consumption variability. The settings are similar to those in Section 6.1.1, except for the set of context sets and feature functions. We use the set of feature functions {normalized mean, normalized median} for consumption trends, {mean, median} for absolute consumption, and {standard deviation, IQR} for consumption variability. We perform the task in three different contexts: January, July, and all months, and separate weekend consumption from weekday consumption. Thus, we use the set of context sets $\{(\{\text{January}, \text{weekday}\}, \{\}), (\{\text{January}, \text{weekend}\}, \{\})\}$ for January, $\{(\{\text{July}, \text{weekday}\}, \{\}), (\{\text{July}, \text{weekend}\}, \{\})\}$ for July, and $\{(\{\text{weekday}\}, \{\}), (\{\text{weekend}\}, \{\})\}$ for all months.

Figure 3 shows that for segmentation by consumption trends, we successfully divide the consumers into high peak and low peak consumers. For the next tasks, segmentation by absolute consumption and by consumption variability, we are also able to produce clusters with high and low absolute consumption and consumption variability, respectively. Furthermore, comparing the results for January, July, and all months shows that applying the segmentation tasks in different contexts yields different results. This validates our approach of performing context-based consumer segmentation.

Figure 2: Centroids of clusters using January data and hourly temporal aggregation. Panels: (a) k = 2, (b) k = 3, (c) k = 4; x-axis: Feature IDs; y-axis: Feature Values. For the features, we use the normalized mean of weekday (ID 1-24), normalized mean of weekend (ID 25-48), normalized median of weekday (ID 49-72), and normalized median of weekend (ID 73-96) consumption. We use the k-means algorithm with k = 2, k = 3, and k = 4.

6.2 Clustering Consistency Using our clustering consistency index (described in Section 4), we can quantify the difference between cluster configurations. Figure 4a shows the difference between cluster configurations resulting from consumer segmentation by absolute consumption for a specific month compared to 1, 3, and 6 months previously. We use the k-means algorithm with k = 2. This results in two consumer segments: high and low consumption clusters.

The result shows that the consistency between the clusters resulting from the segmentation of the current month and 1 month ago is higher than the consistency between the clusters from the current month and 3 or 6 months ago. In particular, the lowest consistency is between July 2010 and 6 months previously, January 2010 (cluster configuration consistency = 0.67). One reason is the seasonal change in energy consumption behavior between summer and winter: July and January are in the middle of summer and winter in Ireland, respectively. However, the implication of our result could be broader than that. It indicates that a number of consumers behave differently from their fellow cluster members, which in turn changes their cluster memberships.
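To make the idea of comparing two cluster configurations concrete, the sketch below uses the Rand index [20] as a stand-in. It is not the clustering consistency index of Section 4, whose definition is not reproduced in this section; the function name `rand_index` and the label lists are illustrative assumptions.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of consumer pairs on which two configurations agree.

    A pair agrees if both configurations put the two consumers in the same
    cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both configurations must cover the same consumers")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two consumers: nothing to disagree on
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Hypothetical memberships of four consumers in two different months.
january = ["low", "low", "high", "high"]
july = ["low", "low", "high", "low"]
score = rand_index(january, july)
```

A score of 1 means that the two configurations group every pair of consumers identically, regardless of how the clusters are named; lower values indicate that more pairs were split or merged between the two months.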

To elaborate on this, in Figures 4b and 4c we show the consumption profiles (mean and median of weekday and weekend consumption) of the centroid of the low consumption cluster and of two individuals, ID 1301 and ID 7381, in January and July 2010. These two consumers are in the low consumption cluster in January 2010. They have the highest distance rank among individuals with a low consistency index (ID 1301) and a high consistency index (ID 7381), respectively, between January and July 2010. Consumer ID 1301 changes her cluster membership from the low consumption cluster in January 2010 to the high consumption cluster in July 2010 (we use the centroid as the cluster representative). The typical consumption of the low consumption cluster in July 2010 is lower than in January 2010, which shows the seasonal change in electricity consumption between winter and summer. Consumer ID 7381, who stays in the low consumption cluster, lowers her consumption level, in line with the behavior of her cluster. However, the consumption of consumer ID 1301 in July 2010 is approximately the same as her consumption in January 2010. This causes her to change her cluster membership to the high consumption cluster in July 2010. Given these results, we could devise personalized energy feedback for consumer ID 1301 to lower her energy consumption (for example, using self-comparison).
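The membership change described above can be illustrated with a nearest-centroid assignment. All numbers below are hypothetical two-feature profiles chosen for illustration, not values from the dataset, and `nearest_cluster` is an assumed helper name.

```python
from math import dist

def nearest_cluster(profile, centroids):
    """Name of the cluster whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda name: dist(profile, centroids[name]))

# Hypothetical centroids: both clusters consume less in July than in January.
centroids_jan = {"low": [0.5, 0.6], "high": [1.5, 1.8]}
centroids_jul = {"low": [0.3, 0.4], "high": [0.8, 1.0]}

# Consumer A keeps the same consumption; consumer B lowers it with her cluster.
consumer_a = {"jan": [0.7, 0.8], "jul": [0.7, 0.8]}
consumer_b = {"jan": [0.6, 0.6], "jul": [0.3, 0.5]}

a_moves = (nearest_cluster(consumer_a["jan"], centroids_jan),
           nearest_cluster(consumer_a["jul"], centroids_jul))
b_moves = (nearest_cluster(consumer_b["jan"], centroids_jan),
           nearest_cluster(consumer_b["jul"], centroids_jul))
```

In this toy setting, consumer A moves from the low to the high cluster without changing her own consumption, because the centroids shift around her, whereas consumer B stays in the low cluster.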

Figure 3: Consumer segmentation on consumption trends (a)-(c), absolute consumption (d)-(f), and consumption variability (g)-(i), in different contexts: January, July, and all months. Panels: (a) Trends-Jan, (b) Trends-Jul, (c) Trends-all, (d) Absolute-Jan, (e) Absolute-Jul, (f) Absolute-all, (g) Variability-Jan, (h) Variability-Jul, (i) Variability-all; x-axis: Feature IDs; y-axis: Feature Values (in kWh for (d)-(i)); series: Centroid1, Centroid2. For trends, we used the same features as in Figure 2. Feature IDs 1-24 and 49-72 are weekday consumption, whereas 25-48 and 73-96 are weekend consumption. We use mean (ID 1-48) and median (ID 49-96) for absolute, and standard deviation and IQR for variability.

In addition, our cluster configuration consistency index is useful for quantifying the difference between results obtained with various temporal aggregations. Since smart meter data has high velocity and high volume, determining the right aggregation is imperative. We compare the segmentation results by absolute consumption, consumption variability, and trends, performed using different temporal aggregations, against hourly temporal aggregation.

Figure 4: (a) cluster configuration consistency over time (monthly, Sep '09 to Dec '10; series: compared to 1 month, 3 months, and 6 months ago), consumption profile using (b) January 2010 data and (c) July 2010 data (series: centroid-low, cust id: 1301, cust id: 7381), and (d) cluster configuration consistency over different temporal aggregations (series: abs, var, and tren, each with k=2 and k=3).

Table 1: Consumer characteristics based on their absolute consumption. A minus (-) sign denotes discriminative negative.

Cluster   Question                   Answer       DI
low       family type                single        0.86
          floor area (sq ft.)        805-1073      0.86
          #bedrooms                  ≤ 2           0.85
medium    electric shower (-)        ≥ 20 mins    -0.76
          family type (-)            single       -0.61
          floor area (sq ft.)        2300-2750     0.56
high      #children                  ≥ 4           0.93
          family type (-)            single       -0.90
          floor area (sq ft.) (-)    ≤ 1200       -0.87

Figure 4d shows that, for consumer segmentation by absolute consumption, there is only a small difference between the cluster configurations resulting from various temporal aggregations and from hourly aggregation. This is good news because storing and processing monthly aggregated data, for example, is far more desirable than hourly data. Unfortunately, for consumer segmentation by consumption variability and trends, this is not the case. The consistency index decreases as the temporal aggregation becomes coarser, i.e., the coarser the temporal aggregation, the larger the difference from the hourly aggregation. This is understandable: as we move to coarser temporal aggregation, we lose the variation in the consumption profiles, which is needed to distinguish different consumption variability or trends.
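The loss of variation under coarser aggregation can be seen in a small, self-contained example. The profile values and the helper `aggregate` are hypothetical; the sketch only shows that averaging hourly readings into one daily value preserves the mean (absolute consumption) while removing the within-day spread (variability).

```python
from statistics import mean, pstdev

def aggregate(readings, block):
    """Average consecutive readings in blocks of `block` values."""
    if len(readings) % block != 0:
        raise ValueError("number of readings must be a multiple of block")
    return [mean(readings[i:i + block]) for i in range(0, len(readings), block)]

# Hypothetical hourly profile (kWh): low base load with a six-hour evening peak.
hourly = [0.5] * 18 + [2.5] * 6
daily = aggregate(hourly, 24)  # one value per day

mean_hourly, mean_daily = mean(hourly), mean(daily)
spread_hourly, spread_daily = pstdev(hourly), pstdev(daily)
```

Both means are equal, so a segmentation by absolute consumption sees the same consumer at either granularity, but the daily spread collapses to zero, which removes the information that variability and trend features rely on.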

6.3 Knowledge Extraction from Survey Data Our dataset contains not only smart meter measurements, but also consumer survey data (questions/answers). From this survey data, using our discriminative index (described in Section 5), we are able to extract knowledge about the main characteristics of a cluster. This step is imperative for applying the right business decision or policy to a specific consumer segment.

In Tables 1 and 2, we show the top 3 characteristics of consumer segments formed by absolute consumption and consumption variability, respectively. We use the same features and settings as in Figures 3f and 3i, with k = 3 (number of clusters). The characteristics (expressed by questions and answers) are ordered by the absolute value of their discriminative index (DI). Recall that DI < 0 denotes discriminative negative, i.e., the answer is more likely to be associated with other clusters.
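The ordering by absolute DI value can be sketched as below. The computation of DI itself (Section 5) is not shown; the first three rows reuse the values reported for the medium cluster in Table 1, the fourth row is a hypothetical addition to show the cut-off, and `top_characteristics` is an assumed helper name.

```python
def top_characteristics(rows, k=3):
    """Return the k (question, answer, DI) rows with the largest |DI|."""
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)[:k]

medium_cluster = [
    ("floor area (sq ft.)", "2300-2750", 0.56),
    ("#bedrooms", "3", 0.10),  # hypothetical row, not from the paper
    ("family type", "single", -0.61),
    ("electric shower", ">= 20 mins", -0.76),
]

top3 = top_characteristics(medium_cluster)
# Rows with DI < 0 are discriminative negative for this cluster.
negatives = [question for question, _answer, di in top3 if di < 0]
```

Sorting on the absolute value keeps strongly negative answers, such as a single family type for the medium cluster, alongside strongly positive ones.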

Consumer segments by absolute consumption are determined more by consumers’ demographics (such as family type and the number of children) and housing conditions (floor area, the number of bedrooms). The low consumption cluster is dominated by singles, whereas the medium and high consumption clusters are dominated by non-single families, either adults only or adults with children. Floor area is also relevant, with low consumption clusters having the smaller area.⁴

Table 2: Consumer characteristics based on their consumption variability. A minus (-) sign denotes discriminative negative.

Cluster   Question                   Answer            DI
low       water pump (-)             1-2 hrs          -0.88
          family type                single            0.80
          washing machine (-)        ≥ 2-3 loads      -0.76
medium    electric shower            10-20 mins        0.59
          family type (-)            single           -0.55
          #children (-)              ≥ 3              -0.54
high      tumble dryer               ≥ 2 to 3 loads    0.90
          #children                  ≥ 4               0.88
          floor area (sq ft.)        ≥ 2800            0.79

There are more insights based on appliance usage in Table 2. This is understandable, since consumption variability comes from intermittent appliance usage. Consumers with low consumption variability use big appliances, such as water pumps and washing machines, for shorter durations.⁵ Furthermore, consumers with high consumption variability use tumble dryers for the longest duration.

Opinions about a house’s energy consumption are often built around its floor area or the year it was built.⁶ While Tables 1 and 2 show insights about floor area, there are none about the year of construction. To investigate this further, we plot the cumulative distribution of consumers’ floor area (Figure 5a) and the year their houses were built (Figure 5b). Figure 5a shows that floor area is indeed a relevant characteristic, i.e., we can clearly distinguish the cumulative distributions of the floor area among different clusters, where consumers in the lower consumption clusters are typically associated with smaller floor areas. Figure 5b, however, shows that this is not the case for the year the houses were built, where the cumulative distributions for the three clusters are similar to (nearly coincide with) each other.
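A per-cluster cumulative distribution of the kind plotted in Figure 5 can be computed as an empirical CDF. The floor areas below are hypothetical and only illustrate the computation; `empirical_cdf` is an assumed helper name.

```python
from bisect import bisect_right

def empirical_cdf(samples):
    """Return a function x -> fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf

# Hypothetical floor areas (sq ft.) for two consumption clusters.
floor_area = {
    "low": [700, 900, 1000, 1200],
    "high": [1500, 2000, 2500, 3000],
}
cdf_by_cluster = {name: empirical_cdf(areas) for name, areas in floor_area.items()}

share_low = cdf_by_cluster["low"](1200)    # share of low cluster at or below 1200
share_high = cdf_by_cluster["high"](1200)  # share of high cluster at or below 1200
```

When a characteristic is relevant, the curves of different clusters separate, as they do for these toy values; when it is not, the curves lie close together.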

⁴ The maximum floor area among consumers is around 5000 sq ft.
⁵ Water pump usage ranges from <30 minutes to >2 hours. Washing machine usage ranges from <1 load to >3 loads a day.
⁶ In [1], we also investigate whether ownership of a certain appliance is discriminative for the consumer clusters.

Figure 5: Cumulative distribution of (a) floor area (sq ft.), and (b) the year consumers’ houses were built (y-axis: Cumulative Distribution). Consumers are clustered by their absolute consumption: low, medium, and high.

7 Conclusion

In this paper, we presented a generic consumer segmentation framework that can be used to classify smart meter data into clusters using multiple distinguishing characteristics such as time of consumption, levels of consumption, associated contexts, etc. We also presented a clustering consistency index, which can be used to track evolving consumption behaviors and to compare consumer segments resulting from different temporal aggregations.

We evaluated the framework and index using real-world smart meter data and survey results. Our experiments showed that consumer segmentation results differ from one context to another. Moreover, different temporal aggregations have only a small effect on segmentation by absolute consumption, but this does not hold for segmentation by consumption variability or trends. Furthermore, a consumer’s floor area is relevant to her consumption. In addition, the usage patterns of big appliances also play a role in a consumer’s consumption variability.

In the future, we plan to define consumer segments based on their responses to DR signals. We aim to quantify and understand the consumer characteristics that are relevant to these responses. This would enable us to develop a DR design recommendation framework, which would allow us to draw relationships between consumers’ consumption profiles, demographics, and appliance usage patterns in order to estimate the impact of future (planned) DR signals. Such a framework is crucial for delivering effective DR signals. In addition, we also plan to explore applications of our framework in anomaly, theft, and fraud detection.

References

[1] Supplementary material. https://github.com/tritritri/consumer-segmentation.

[2] Electricity customer behaviour trial. The Commission for Energy Regulation (CER), 2012.

[3] A. Albert and R. Rajagopal. Smart meter driven segmentation: What your consumption says about you. IEEE Transactions on Power Systems, PP(99):1–12, 2013.

[4] Beyond Zero Emissions. Smart Meters are a Smart Choice. http://engage.haveyoursay.nsw.gov.au/document/show/748, 2011.

[5] C.-S. Chen, J. C. Hwang, and C. Huang. Application of load survey systems to proper tariff design. IEEE Transactions on Power Systems, 12(4):1746–1751, 1997.

[6] G. Chicco, R. Napoli, and F. Piglione. Comparisons among clustering techniques for electricity customer classification. IEEE Transactions on Power Systems, 21(2):933–940, 2006.

[7] G. Chicco, R. Napoli, P. Postolache, M. Scutariu, and C. Toader. Customer characterization options for improving the tariff offer. IEEE Transactions on Power Systems, 18(1):381–387, 2003.

[8] D. L. Davies and D. W. Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227, 1979.

[9] J. C. Dunn. A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57, 1973.

[10] V. Figueiredo, F. Rodrigues, Z. Vale, and J. Gouveia. An electric energy consumer characterization framework based on data mining techniques. IEEE Transactions on Power Systems, 20(2):596–602, 2005.

[11] C. Flath, D. Nicolay, T. Conte, C. Dinther, and L. Filipova-Neumann. Cluster analysis of smart metering data. Business & Information Systems Engineering, 4(1):31–39, 2012.

[12] F. Fusco, M. Wurst, and J. W. Yoon. Mining residential household information from low-resolution smart meter data. In 21st International Conference on Pattern Recognition (ICPR), pages 3545–3548, 2012.

[13] M. Kitayama, R. Matsubara, and Y. Izui. Application of data mining to customer profile analysis in the power electric industry. In IEEE Power Engineering Society Winter Meeting, 2002, volume 1, pages 632–634 vol.1, 2002.

[14] F. Martinez Alvarez, A. Troncoso, J. Riquelme, and J. Aguilar Ruiz. Energy time series forecasting based on pattern sequence similarity. IEEE Transactions on Knowledge and Data Engineering, 23(8):1230–1243, 2011.

[15] S. J. Moss, M. Cubed, and K. Fleisher. Market segmentation and energy efficiency program design. California Institute for Energy and Environment Berkeley, 2008.

[16] ORACLE. Smart Metering for Electric and Gas Utilities. http://www.oracle.com/us/industries/utilities/046593.pdf, 2011.

[17] P. Durand. Market Segmentation: Defining A New Attitude For Utilities. http://bit.ly/vwbEiF, 2011.

[18] M. Pedersen. Segmenting residential customers: Energy and conservation behaviors. ACEEE Summer Study on Energy Efficiency in Buildings, 2008.

[19] S. Ramos, Z. Vale, J. Santana, and J. Duarte. Data mining contributions to characterize MV consumers and to improve the suppliers-consumers settlements. In IEEE Power Engineering Society General Meeting, pages 1–8, 2007.

[20] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

[21] T. Rasanen and M. Kolehmainen. Feature-based clustering for electricity use time series data. In Proceedings of the 9th International Conference on Adaptive and Natural Computing Algorithms, ICANNGA’09, pages 401–412, Berlin, Heidelberg, 2009. Springer-Verlag.

[22] P. J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(0):53–65, 1987.

[23] S. Harrison. Smart Metering Projects Map. http://bit.ly/GSs1o5, 2013.

[24] W. Shen, V. Babushkin, Z. Aung, and W. L. Woon. An ensemble model for day-ahead electricity demand time series forecasting. In Proceedings of the Fourth International Conference on Future Energy Systems, e-Energy ’13, pages 51–62, New York, NY, USA, 2013. ACM.
