Innovation in market transparency: A quality methodology for crowdsourced data
Giuseppe Arbia1, Gloria Solano Hermosilla2, Fabio Micale2, Giampiero Genovese2
1 Università Cattolica del Sacro Cuore, Milano; 2 EU Joint Research Centre – Seville – D4 Agricultural Economics
Market transparency, Bruxelles, 30-31 May 2018
Crucial issue: relationship between transparency, efficiency and competition
Transparency is intuitively related to information, on which we have different opinions:
"WE HAVE LIMITED STATISTICAL INFORMATION" (Tassos Haniotis)
"WE HAVE TOO MUCH INFORMATION" (Michael Sheats)
We need to distinguish DATA from INFORMATION!
and INFORMATION from QUALITY INFORMATION
Crucial issue: relationship between transparency, efficiency and competition
FEW POINTS:
1. THIS RAISES DATA QUALITY ISSUE
2. TRANSPARENCY HAS AN IMPORTANT SPATIAL ASPECT: NATIONAL INFORMATION DOES NOT ALWAYS TRANSLATE INTO RELEVANT REGIONAL INFORMATION (STEVE MCCORRISTON). THIS RAISES THE ISSUE OF SPATIAL ANALYSIS OF TRANSPARENCY
IN THIS PAPER WE CONTRIBUTE TO THESE TWO ISSUES
Crucial issue: relationship between transparency, efficiency and competition
FURTHERMORE
3. CAUSALITY ISSUES AND OTHER CONFOUNDERS (OMITTED VARIABLES, ASYMMETRIES, ETC.) (CHAMBELLE + HANIOTIS VS GELLYNCKX)
WE WILL NOT DISCUSS THEM
Outline
1. Introduction and aims of the study
2. The concept of crowdsourcing
3. "Post-sampling" methodology and an example of application in Nigeria
4. Quality indicators
5. Conclusions and further research
Aims of the study
Consider an experiment of data shared by citizens (crowdsourcing) about food prices in Nigeria and
• Develop a robust statistical methodology to produce quality and timely food price indices in almost real-time (extract valuable information out of the "wisdom of the crowd")
• Provide a composite measure of quality (indicators scoreboard) for crowdsourced datasets (or data shared by citizens) (↑credibility and usability of data)
What is crowdsourcing (citizen science)?
• Crowd (people) + outsourcing (externalising), as a means to complement data
• An initiator with an information need launches an open call for an online task (e.g. reporting agri/food prices)
• Volunteers (e.g. consumers, farmers, food traders) contribute work/experience in exchange for a reward
• A 2-way process: citizens share data using ICTs and can obtain data back
Why crowdsourcing?
Mobile technologies and data shared by citizens offer an enormous potential for collecting a large amount of data (in our case study, food and agricultural prices in Nigeria) that can complement and enhance:
• Monitoring food price developments in close to real-time at many different geographic points.
• Increasing transparency across markets (↑market integration) and along the food chain (↑efficiency), connecting market participants to information as data producers and users.
• Policymaking: design, implementation and evaluation of policies.
• But…
But…crowdsourcing is not like running a sample survey!
"A lot of data is not necessarily a lot of information"
In a sample survey the reliability of estimates depends essentially on the sample size and on the sample design.
In a crowdsourcing exercise the data do not follow any sample design. Hence the reliability of estimates depends on many different factors.
Sampling and non-sampling errors
Sampling errors
In crowdsourcing:
i) data collectors operate on a voluntary basis (self-selection related to the attractiveness of the incentives), raising issues of representativeness, and
ii) a large number of missing data or missing items is likely.
Non-sampling errors
i) Measurement errors
ii) Fraudulent or careless activities
iii) Locational errors
iv) Non-independence ("cartel")
v) Coverage errors
vi) Other errors: nonresponse, inconsistent data, etc.
A methodology in 2 phases
1. A pre-processing phase to reduce non-sampling errors
STEP 1: Outlier detection and removal
STEP 2: Spatial outlier detection and replacement
2. A post-sampling phase to let the data collected resemble a spatial sample design and enable sound statistical inference reducing sampling errors
STEP 3: Count and average geo-referenced observations
STEP 4: Compare the observed map of points to an optimal spatial sampling plan to obtain the post-sampling ratios
STEP 5: Use post-sampling ratios to reweight and aggregate data
Result: a series of data validation and resampling algorithms programmed in R to clean and aggregate raw crowdsourced data.
Terminology
• A data collection point is where the food price is observed (e.g. store, market stand, farm, etc.).
• A market neighbourhood may contain several data collection points (within a given maximum threshold distance).
• A location may contain different points and neighbourhoods (generally coincides with the notion of a city, town or village).
• Market typologies, e.g. supermarket, street market, farm, etc.
• Price typologies, i.e. farm gate, main market, local market.
STEP 1: Outliers detection and removal
• Outliers may hide measurement errors or careless collector activity. In this step we are interested in the identification of standard outliers, defined as those values that deviate from the mean of the area of interest by more than h times the standard deviation:
P_l > m(P) + h·sd(P) or P_l < m(P) − h·sd(P)
where P_l is the price observed by the l-th collector, m(P) the mean, sd(P) the standard deviation, and h is usually equal to 2 or 3.
• Alternative definitions of outliers (e.g. the IQR method) have been considered; their use is left to the user.
• Standard outliers are removed from the dataset, as they are considered unreliable.
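The slides describe the authors' implementation in R; the STEP 1 rule can be sketched in Python as follows (an illustrative sketch, not the authors' code; the population standard deviation is used here because it reproduces the numbers in the worked example later in the deck):

```python
import statistics

def standard_outliers(prices, h=3):
    """Flag prices that lie more than h standard deviations
    from the mean of the area of interest (h = 2 or 3)."""
    m = statistics.mean(prices)
    sd = statistics.pstdev(prices)  # population sd, matching the slide example
    return [p for p in prices if p > m + h * sd or p < m - h * sd]

def remove_standard_outliers(prices, h=3):
    """STEP 1: drop standard outliers, considered unreliable."""
    out = set(standard_outliers(prices, h))
    return [p for p in prices if p not in out]
```

With the example values shown later in the deck (mean 6.36, st. dev. 2.96), no value is flagged; an extreme value such as 30 would be removed.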
STEP 2: Spatial outliers identification
• The notion of a spatial outlier is different from that of a standard outlier. It represents a value that departs dramatically from the values observed in its neighbourhood.
• The notion of a spatial outlier requires the formal definition of a neighborhood.
• In particular, we considered as neighbours of a data collection point its first 5 nearest neighbours observed at a distance of less than 10 km.
P_l > lag(P_l) + r·sd(P) or P_l < lag(P_l) − r·sd(P)
where P_l is the price observed by the l-th collector, lag(P_l) the average price in the neighbourhood, sd(P) the standard deviation of the error in the spatial regression, and r is usually equal to 3.
An example of spatial outliers
[Figure: commodity prices observed in a geographic area, plotted by longitude and latitude.]
The value 10 is not a standard outlier: it does not exceed 3 standard deviations from the mean (mean = 6.36, st. dev. = 2.96, limits for outliers at ±3 standard deviations = 15.25 and −2.52).
However, it is a spatial outlier:
The average of the immediate neighbors is 2.5
This might be interpreted as a clue of a non-sampling error.
Spatial outliers thus identified are then replaced by the average of the neighbouring observations.
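A sketch of the STEP 2 rule and replacement (again illustrative Python, not the authors' R code). Two simplifying assumptions are made here: distances are plain Euclidean distances on coordinates expressed in km, and the standard deviation of the residuals P_l − lag(P_l) is used as a proxy for the spatial-regression error sd mentioned in the slides:

```python
import math

def spatial_outlier_replace(points, k=5, max_dist=10.0, r=3):
    """points = [(x_km, y_km, price), ...].
    Neighbourhood: up to k nearest points within max_dist km.
    A price is a spatial outlier if it deviates from the neighbourhood
    average (spatial lag) by more than r times the residual sd
    (proxy assumption for the spatial-regression error sd)."""
    lags = []
    for i, (xi, yi, pi) in enumerate(points):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), pj)
            for j, (xj, yj, pj) in enumerate(points) if j != i
        )
        neigh = [pj for d, pj in dists[:k] if d <= max_dist]
        lags.append(sum(neigh) / len(neigh) if neigh else pi)
    resid = [p - lag for (_, _, p), lag in zip(points, lags)]
    mu = sum(resid) / len(resid)
    sd = math.sqrt(sum((e - mu) ** 2 for e in resid) / len(resid))
    cleaned = []
    for (x, y, p), lag in zip(points, lags):
        if sd > 0 and abs(p - lag) > r * sd:
            p = lag  # replace the spatial outlier by the neighbour average
        cleaned.append((x, y, p))
    return cleaned
```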
Which distances are captured?
• Euclidean distance between 2 geo-referenced points
• Travel distances (in km) and travel times (in hours) can be automatically obtained for a matrix of origins and destinations through the Google Maps Distance Matrix API, based on recommended routes from a start to an end point.
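For geo-referenced points given as latitude/longitude, the straight-line distance used for the neighbourhood threshold can be computed with the standard haversine (great-circle) formula; this is a generic sketch, not tied to the authors' implementation (travel distances would instead come from a routing service such as the Distance Matrix API):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two geo-referenced points."""
    R = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))
```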
The post-sampling phase
The idea of a formal sample design is in sharp contrast with the notion of pure crowdsourcing:
• in a formal sample design the choice of observations is driven by a precise mechanism (e.g. random, cluster, stratified, or spatial design) which allows the calculation of the inclusion probabilities and thus guarantees sound probabilistic inference;
• in a crowdsourcing exercise observations are self-selected. This can give rise to over- or under-representativeness and requires a post-sampling phase where some of the data are re-weighted.
"Make sense of data"
The "post-sampling" in 3 steps
• STEP 3: First, we consider the map of, say, n geo-referenced observations on food prices obtained through crowdsourcing. We count the number of observations per location and average prices at the level of the location (or market, region, or other desired aggregation):
n_l = Σ_m n_{m,l}   (1)
P_l^t = (Σ_m P_{m,l}^t) / n_l   (2)
• STEP 4: Second, we consider the map of points and locations as selected by an optimal spatial sampling procedure (e.g. the local pivotal method 2, LPM2) with a sample size (number of locations) equal to the one achieved with crowdsourcing. We call m_l the number of observations in each location l, and define the post-sampling ratio as:
PS_l = m_l / n_l   (3)
"If we selected a formal spatial sampling design, which locations would we have observed?"
"How closely do our collected data resemble a formal spatial sampling design?"
• STEP 5: The n available observations are reweighted so that they resemble the formal spatial sampling scheme and allow sound inference.
Example of illustration of the post-sampling
[Figure: two maps plotted by longitude and latitude, comparing the observations/markets in each location selected by the optimal spatial sampling (m_l) with the crowdsourced observations/markets in each location (n_l); PS_l = m_l / n_l.]
The reweighted aggregate price is then:
P^t = (Σ_{l=1}^L PS_l · P_l^t) / (Σ_{l=1}^L PS_l)
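STEPs 3-5 can be sketched end-to-end as follows (illustrative Python, not the authors' R code; the function name and the input format, a list of (location, price) pairs plus a dict of per-location counts m_l from the optimal design, are assumptions for the example):

```python
from collections import defaultdict

def post_sampling_mean(obs, optimal_counts):
    """obs = [(location, price), ...] crowdsourced observations;
    optimal_counts[l] = m_l, the observations an optimal spatial
    design (e.g. LPM2) would place in location l."""
    # STEP 3: count and average observations per location
    n, total = defaultdict(int), defaultdict(float)
    for loc, price in obs:
        n[loc] += 1
        total[loc] += price
    avg = {loc: total[loc] / n[loc] for loc in n}
    # STEP 4: post-sampling ratios PS_l = m_l / n_l
    ps = {loc: optimal_counts.get(loc, 0) / n[loc] for loc in n}
    # STEP 5: reweight and aggregate: P_t = sum(PS_l * P_l) / sum(PS_l)
    return sum(ps[l] * avg[l] for l in n) / sum(ps.values())
```

An over-represented location (large n_l relative to m_l) gets a small PS_l and is down-weighted in the aggregate, which is the intended effect of the post-sampling.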
Reliability of post-sampling
• If the crowdsourcing coincides perfectly with the desired formal sampling we achieve maximum reliability:
m_l − n_l = 0, ∀l
• The larger the discrepancy (positive or negative), the lower the reliability.
CSR = 1 − 2D / (1 + D),   with D = (Σ_{l=1}^L |m_l − n_l|) / (2N)
where N is the total number of points.
CSR ranges between 0 and 1.
When m_l − n_l = 0, ∀l, then D = 0 and CSR = 1: we achieve the highest possible reliability.
When D = 1, all N points are misplaced and we have the lowest possible reliability (CSR = 0).
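A minimal sketch of the index in Python (assuming the reading CSR = 1 − 2D/(1 + D) with D = Σ|m_l − n_l| / (2N), which reproduces both limiting cases stated above; function name and dict-based inputs are illustrative):

```python
def crowdsourcing_reliability(m, n):
    """m[l]: observations per location under the optimal spatial design;
    n[l]: crowdsourced observations per location;
    N: total number of points."""
    N = sum(n.values())
    locs = set(m) | set(n)
    d = sum(abs(m.get(l, 0) - n.get(l, 0)) for l in locs) / (2 * N)
    return 1 - 2 * d / (1 + d)
```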
Test of the validity of the procedure
1. Simulation
2. Real data analysis integrated with simulation
3. Experimental data
1. Simulation
We simulate 1,000 individual points divided into 4 geographical strata characterized by unequal densities, represented in the 4 quadrants of a unitary square (−0.5, 0.5). The population size in the four strata is respectively 800, 60, 60 and 80.
1. Simulation
• Individuals' locations are randomly generated according to the complete spatial randomness scheme (CSR, Diggle, 1983) in the 4 quadrants.
• The variable of interest Y is generated using a pure spatial autoregressive model (Arbia, 2014) with spatial parameter 0.7.
• We consider a sample of 80 individuals selected from the 1,000 population points.
• We mimic crowdsourcing behaviour by considering a simple random sample of 20 units in each quadrant.
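The population-generation step can be sketched as below (illustrative Python under complete spatial randomness within each quadrant; the function name and quadrant ordering are assumptions, and the spatial autoregressive generation of Y and the LPM2 draw are omitted here as they require a dedicated spatial-statistics library):

```python
import random

def simulate_population(sizes=(800, 60, 60, 80), seed=1):
    """Points uniformly distributed (complete spatial randomness)
    in the 4 quadrants of the unit square (-0.5, 0.5),
    with unequal stratum sizes as in the simulation exercise."""
    random.seed(seed)
    quads = [(-1, 1), (1, 1), (-1, -1), (1, -1)]  # sign of (x, y) per quadrant
    pts = []
    for (sx, sy), size in zip(quads, sizes):
        for _ in range(size):
            pts.append((sx * random.uniform(0, 0.5),
                        sy * random.uniform(0, 0.5)))
    return pts
```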
1. Simulation
We compare the performance of three distinct estimation strategies to estimate the mean of the variable Y:
• a simple random sample with the Horvitz-Thompson (HT) estimator;
• data reweighted by comparison with a random stratified design with pps, with the HT estimator;
• data reweighted by comparison with an LPM2 optimal spatial sampling design, with the HT estimator.
1. Simulation
After 1,000 replications all strategies display very small absolute relative biases (with an average of 0.005).
However, in terms of efficiency, the two post-sampling procedures largely outperform the simple random case (mimicking the crowdsourced data), producing variances of the estimator that are on average 124 times smaller (standard errors 11.8 times smaller) than those associated with the simple random case.
1. Simulation
The benefit increases linearly with the sample proportion.
[Figure: relative efficiency of post-sampling (ratio of the two standard errors, unweighted crowdsourced / post-sampled data) plotted against the sample proportion; fitted line y = 41.375x + 1.2717, R² = 0.8455.]
CROWDSOURCED SAMPLES ARE USUALLY VERY LARGE!
2. Data analysis Simulated crowdsourced data based on the exercise run in Kaduna, Nigeria (AMIS-FAO)
Here we considered the average price calculated in each location and applied the post-sampling strategy in order to obtain the average price at the state level.
CSR = 0.60
2. Data analysis
Quality indicators: Maize, Retail price, Nov 2016 – Mar 2017
Indicator name              | Value  | Theoretical range
Outlier                     | 1      |
Outlier_perc                | 0.41%  | 0-100
Sp_outlier                  | 2      |
Sp_outlier_perc             | 0.82%  | 0-100
Time_Reliability            | 0.42   | 0-1
Crowdsourcing_reliability   | 0.60   | 0-1
In each step of the methodology we can compute indicators that provide a measure of different quality aspects.
3. Experimental data (to be implemented)
In one or more European countries one could plan to run simultaneously:
• a traditional price survey, and
• a crowdsourcing data collection
and compare the results to test the validity of the post-sampling strategy
Composite measure of quality of crowdsourced data
• Data quality is usually defined in a multidimensional way around e.g. Relevance, Accuracy and Reliability, Timeliness and punctuality, Accessibility and clarity, Comparability and Coherence (ESS QAF) (European Statistical System, 2015)
• We aim at measuring the quality of crowdsourced datasets along the different quality aspects. The idea is then to develop:
• crowdsourcing-specific indicators, and
• a crowdsourcing scoreboard, which can serve to monitor the implementation of crowdsourcing approaches by tracking quality and performance.
Methodology applied by JRC
Composite indicators
• Dimensions: the quality dimensions to be covered
• Indicators selection: quantitative or qualitative measures, chosen via 1) desk work and 2) an experts panel
• Data selection: measuring indicators based on available data
• Normalization: allows comparing indicators with different scales and units
• Weighting: to aggregate indicators based on weights
Example: Quality dimensions and indicators (work in progress!)
Quality dimensions (ESS, 2015)
Statistical processes:
• Sound methodology: appropriate statistical procedures from data collection to validation
• Non-excessive burden on respondents
• Cost-effectiveness
Statistical output:
• Relevance
• Accuracy and reliability
• Timeliness
• Coherence and comparability
• Accessibility and clarity
Example indicators:
• Relevance: number of visualizations; % new data series
• Accuracy: CSR [0,1]; correlation coefficient [0,1]
• Timeliness: number of days between collection and dissemination
• …
Thresholds map indicator values to quality labels A, B, C, D.
Conclusions
• We have introduced a quality process to produce timely and reliable price data allowing for statistical inference.
• A series of quality indicators is provided and an overall quality assessment of crowdsourced data collections is proposed.
• An automated and scalable system of geo-coordinates retrieval for the observed locations and distances has been introduced.
• The post-sampling procedure is flexible and expandable to other data fields.
• Data application:
• the importance of collecting a large number of data points;
• the importance of exact geo-coding of each data collection point.
Thanks! Questions? You can find us at: [email protected]