Innovation in market transparency: A quality methodology for crowdsourced data
Giuseppe Arbia1, Gloria Solano Hermosilla2, Fabio Micale2, Giampiero Genovese2
1 Università Cattolica del Sacro Cuore, Milano; 2 EU Joint Research Centre – Seville – D4 Agricultural Economics
Market transparency, Bruxelles, 30-31 May 2018
Crucial issue: relationship between transparency, efficiency and competition
Transparency is intuitively related to information, on which we have different opinions:
"WE HAVE LIMITED STATISTICAL INFORMATION" (Tassos Haniotis)
"WE HAVE TOO MUCH INFORMATION" (Michael Sheats)
We need to distinguish DATA from INFORMATION!
and INFORMATION from QUALITY INFORMATION
Crucial issue: relationship between transparency, efficiency and competition
FEW POINTS:
1. THIS RAISES DATA QUALITY ISSUE
2. TRANSPARENCY HAS AN IMPORTANT SPATIAL ASPECT: NATIONAL INFORMATION DOES NOT ALWAYS TRANSLATE INTO RELEVANT REGIONAL INFORMATION (STEVE MCCORRISTON). THIS RAISES THE ISSUE OF SPATIAL ANALYSIS OF TRANSPARENCY
IN THIS PAPER WE CONTRIBUTE TO THESE TWO ISSUES
Crucial issue: relationship between transparency, efficiency and competition
FURTHERMORE
3. CAUSALITY ISSUES AND OTHER CONFOUNDERS (OMITTED VARIABLES, ASYMMETRIES, ETC.) (CHAMBELLE + HANIOTIS VS GELLYNCKX)
WE WILL NOT DISCUSS THEM
Outline
1. Introduction and aims of the study
2. The concept of crowdsourcing
3. "Post-sampling" methodology and an example of application in Nigeria
4. Quality indicators
5. Conclusions and further research
Aims of the study
Consider an experiment of data shared by citizens (crowdsourcing) about food prices in Nigeria and
• Develop a robust statistical methodology to produce quality and timely food price indices in almost real-time (extract valuable information out of the "wisdom of the crowd")
• Provide a composite measure of quality (indicators scoreboard) for crowdsourced datasets (or data shared by citizens) (↑credibility and usability of data)
What is crowdsourcing (citizen science)?
• Crowd (people) + outsourcing (externalising), as a means to complement data
• An initiator with an information need launches an open call for an online task (e.g. reporting agri/food prices)
• Volunteers (e.g. consumers, farmers, food traders) contribute work/experience in exchange for a reward
• A 2-way process: citizens share data using ICTs and can obtain data back
Why crowdsourcing?
Mobile technologies and data shared by citizens offer an enormous potential for collecting a large amount of data (in our case study, food and agricultural prices in Nigeria) that can complement and enhance:
• Monitoring food price developments in close to real-time at many different geographic points.
• Increasing transparency across markets (↑market integration) and along the food chain (↑efficiency), connecting market participants to information as data producers and users.
• Policymaking: design, implementation and evaluation of policies.
• But…
But…crowdsourcing is not like running a sample survey!
"A lot of data is not necessarily a lot of information"
In a sample survey the reliability of estimates depends essentially on the sample size and on the sample design.
In a crowdsourcing exercise the data do not follow any sample design. Hence the reliability of estimates depends on many different factors.
Sampling and non-sampling errors
Sampling errors
In crowdsourcing:
i) data collectors operate on a voluntary basis (self-selection related to the attractiveness of the incentives), raising issues of representativeness, and
ii) a large number of missing data or missing items is likely.
Non-sampling errors
i) Measurement errors
ii) Fraudulent or careless activities
iii) Locational errors
iv) Non-independence ("cartel")
v) Coverage errors
vi) Other errors: nonresponse, inconsistent data, etc.
A methodology in 2 phases
1. A pre-processing phase to reduce non-sampling errors
STEP 1: Outlier detection and removal
STEP 2: Spatial outlier detection and replacement
2. A post-sampling phase to let the data collected resemble a spatial sample design and enable sound statistical inference reducing sampling errors
STEP 3: Count and average geo-referenced observations
STEP 4: Compare the observed map of points to an optimal spatial sampling plan to obtain the post-sampling ratios
STEP 5: Use post-sampling ratios to reweight and aggregate data
Result: a series of data validation and resampling algorithms programmed in R to clean and aggregate raw crowdsourced data.
Terminology
• A data collection point is where the food price is observed (e.g. store, market stand, farm, etc.).
• A market neighbourhood may contain several data collection points (within a given maximum threshold distance).
• A location may contain different points and neighbourhoods (generally coincides with the notion of a city, town or village).
• Market typologies, e.g. supermarket, street market, farm, etc.
• Price typologies, i.e. farm gate, main market, local market.
STEP 1: Outliers detection and removal
• Outliers may hide measurement errors or careless collector activity. In this step we are interested in the identification of standard outliers, defined as those values that deviate from the mean of the area of interest by more than h times the standard deviation:
P_l > m(P) + h·sd(P) or P_l < m(P) − h·sd(P)
where P_l is the price observed by the l-th collector, m(P) the mean, sd(P) the standard deviation, and h is usually equal to 2 or 3.
• Alternative definitions of outliers (e.g. the IQR method) have been considered; their use is left to the user.
• Standard outliers are removed from the dataset, as they are considered unreliable.
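The slides describe the authors' implementation in R; the STEP 1 rule can be sketched in Python as follows (an illustrative sketch, not the authors' code; the population standard deviation is used here because it reproduces the numbers in the worked example later in the deck):

```python
import statistics

def standard_outliers(prices, h=3):
    """Flag prices that lie more than h standard deviations
    from the mean of the area of interest (h = 2 or 3)."""
    m = statistics.mean(prices)
    sd = statistics.pstdev(prices)  # population sd, matching the slide example
    return [p for p in prices if p > m + h * sd or p < m - h * sd]

def remove_standard_outliers(prices, h=3):
    """STEP 1: drop standard outliers, considered unreliable."""
    out = set(standard_outliers(prices, h))
    return [p for p in prices if p not in out]
```

With the example values shown later in the deck (mean 6.36, st. dev. 2.96), no value is flagged; an extreme value such as 30 would be removed.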
STEP 2: Spatial outliers identification
• The notion of a spatial outlier is different from that of a standard outlier. It represents a value that departs dramatically from the values observed in its neighbourhood.
• The notion of a spatial outlier requires the formal definition of a neighborhood.
• In particular, we considered as neighbours of a data collection point its first 5 nearest neighbours observed at a distance of less than 10 km.
P_l > lag(P_l) + r·sd(P) or P_l < lag(P_l) − r·sd(P)
where P_l is the price observed by the l-th collector, lag(P_l) the average price in the neighbourhood, sd(P) the standard deviation of the error in the spatial regression, and r is usually equal to 3.
An example of spatial outliers
[Figure: commodity prices observed in a geographic area, plotted by longitude and latitude.]
The value 10 is not a standard outlier: it does not exceed 3 standard deviations from the mean (mean = 6.36, st. dev. = 2.96, limits for outliers at ±3 standard deviations = 15.25 and −2.52).
However, it is a spatial outlier:
The average of the immediate neighbors is 2.5
This might be interpreted as a clue of a non-sampling error.
Spatial outliers thus identified are then replaced by the average of the neighbouring observations.
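A sketch of the STEP 2 rule and replacement (again illustrative Python, not the authors' R code). Two simplifying assumptions are made here: distances are plain Euclidean distances on coordinates expressed in km, and the standard deviation of the residuals P_l − lag(P_l) is used as a proxy for the spatial-regression error sd mentioned in the slides:

```python
import math

def spatial_outlier_replace(points, k=5, max_dist=10.0, r=3):
    """points = [(x_km, y_km, price), ...].
    Neighbourhood: up to k nearest points within max_dist km.
    A price is a spatial outlier if it deviates from the neighbourhood
    average (spatial lag) by more than r times the residual sd
    (proxy assumption for the spatial-regression error sd)."""
    lags = []
    for i, (xi, yi, pi) in enumerate(points):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), pj)
            for j, (xj, yj, pj) in enumerate(points) if j != i
        )
        neigh = [pj for d, pj in dists[:k] if d <= max_dist]
        lags.append(sum(neigh) / len(neigh) if neigh else pi)
    resid = [p - lag for (_, _, p), lag in zip(points, lags)]
    mu = sum(resid) / len(resid)
    sd = math.sqrt(sum((e - mu) ** 2 for e in resid) / len(resid))
    cleaned = []
    for (x, y, p), lag in zip(points, lags):
        if sd > 0 and abs(p - lag) > r * sd:
            p = lag  # replace the spatial outlier by the neighbour average
        cleaned.append((x, y, p))
    return cleaned
```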
Which distances are captured?
• Euclidean distance between 2 geo-referenced points
• Travel distances (in km) and travel times (in hours) can be automatically obtained for a matrix of origins and destinations through the Google Maps Distance Matrix API, based on recommended routes from a start to an end point.
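For geo-referenced points given as latitude/longitude, the straight-line distance used for the neighbourhood threshold can be computed with the standard haversine (great-circle) formula; this is a generic sketch, not tied to the authors' implementation (travel distances would instead come from a routing service such as the Distance Matrix API):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two geo-referenced points."""
    R = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))
```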
The post-sampling phase
The idea of a formal sample design is in sharp contrast with the notion of pure crowdsourcing:
• in a formal sample design the choice of observations is driven by a precise mechanism (e.g. random, cluster, stratified, or spatial design) which allows the calculation of the inclusion probabilities and thus guarantees sound probabilistic inference;
• in a crowdsourcing exercise observations are self-selected. This can give rise to over- or under-representativeness and requires a post-sampling phase where some of the data are re-weighted.
"Make sense of data"
The "post-sampling" in 3 steps
• STEP 3: First, we consider the map of, say, n geo-referenced observations on food prices obtained through crowdsourcing. We count the number of observations per location and average prices at the level of the location (or market, region, or other desired aggregation):
n_l = Σ_m n_{m,l}   (1)
P_l^t = (Σ_m P_{m,l}^t) / n_l   (2)
• STEP 4: Second, we consider the map of points and locations as selected by an optimal spatial sampling procedure (e.g. the local pivotal method 2, LPM2) with a sample size (number of locations) equal to the one achieved with crowdsourcing. We call m_l the number of observations in each location l, and define the post-sampling ratio as:
PS_l = m_l / n_l   (3)
"If we selected a formal spatial sampling design, which locations would we have observed?"
"How closely do our collected data resemble a formal spatial sampling design?"
• STEP 5: The n available observations are reweighted so that they resemble the formal spatial sampling scheme and allow sound inference.
Example of illustration of the post-sampling
[Figure: two maps plotted by longitude and latitude, comparing the observations/markets in each location selected by the optimal spatial sampling (m_l) with the crowdsourced observations/markets in each location (n_l); PS_l = m_l / n_l.]
The reweighted aggregate price is then:
P^t = (Σ_{l=1}^L PS_l · P_l^t) / (Σ_{l=1}^L PS_l)
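STEPs 3-5 can be sketched end-to-end as follows (illustrative Python, not the authors' R code; the function name and the input format, a list of (location, price) pairs plus a dict of per-location counts m_l from the optimal design, are assumptions for the example):

```python
from collections import defaultdict

def post_sampling_mean(obs, optimal_counts):
    """obs = [(location, price), ...] crowdsourced observations;
    optimal_counts[l] = m_l, the observations an optimal spatial
    design (e.g. LPM2) would place in location l."""
    # STEP 3: count and average observations per location
    n, total = defaultdict(int), defaultdict(float)
    for loc, price in obs:
        n[loc] += 1
        total[loc] += price
    avg = {loc: total[loc] / n[loc] for loc in n}
    # STEP 4: post-sampling ratios PS_l = m_l / n_l
    ps = {loc: optimal_counts.get(loc, 0) / n[loc] for loc in n}
    # STEP 5: reweight and aggregate: P_t = sum(PS_l * P_l) / sum(PS_l)
    return sum(ps[l] * avg[l] for l in n) / sum(ps.values())
```

An over-represented location (large n_l relative to m_l) gets a small PS_l and is down-weighted in the aggregate, which is the intended effect of the post-sampling.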
Reliability of post-sampling
• If the crowdsourcing coincides perfectly with the desired formal sampling we achieve maximum reliability:
m_l − n_l = 0, ∀l
• The larger the discrepancy (positive or negative), the lower the reliability.
CSR = 1 − 2D / (1 + D),   with D = (Σ_{l=1}^L |m_l − n_l|) / (2N)
where N is the total number of points.
CSR ranges between 0 and 1.
When m_l − n_l = 0, ∀l, then D = 0 and CSR = 1: we achieve the highest possible reliability.
When D = 1, all N points are misplaced and we have the lowest possible reliability (CSR = 0).
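A minimal sketch of the index in Python (assuming the reading CSR = 1 − 2D/(1 + D) with D = Σ|m_l − n_l| / (2N), which reproduces both limiting cases stated above; function name and dict-based inputs are illustrative):

```python
def crowdsourcing_reliability(m, n):
    """m[l]: observations per location under the optimal spatial design;
    n[l]: crowdsourced observations per location;
    N: total number of points."""
    N = sum(n.values())
    locs = set(m) | set(n)
    d = sum(abs(m.get(l, 0) - n.get(l, 0)) for l in locs) / (2 * N)
    return 1 - 2 * d / (1 + d)
```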
Test of the validity of the procedure
1. Simulation
2. Real data analysis integrated with simulation
3. Experimental data
1. Simulation
We simulate 1,000 individual points divided into 4 geographical strata characterized by unequal densities, represented in the 4 quadrants of a unitary square (−0.5, 0.5). The population size in the four strata is respectively 800, 60, 60 and 80.
1. Simulation
• Individuals' locations are randomly generated according to the complete spatial randomness scheme (CSR, Diggle, 1983) in the 4 quadrants.
• The variable of interest Y is generated using a pure spatial autoregressive model (Arbia, 2014) with spatial parameter 0.7.
• We consider a sample of 80 individuals selected from the 1,000 population points.
• We mimic crowdsourcing behaviour by considering a simple random sample of 20 units in each quadrant.
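The population-generation step can be sketched as below (illustrative Python under complete spatial randomness within each quadrant; the function name and quadrant ordering are assumptions, and the spatial autoregressive generation of Y and the LPM2 draw are omitted here as they require a dedicated spatial-statistics library):

```python
import random

def simulate_population(sizes=(800, 60, 60, 80), seed=1):
    """Points uniformly distributed (complete spatial randomness)
    in the 4 quadrants of the unit square (-0.5, 0.5),
    with unequal stratum sizes as in the simulation exercise."""
    random.seed(seed)
    quads = [(-1, 1), (1, 1), (-1, -1), (1, -1)]  # sign of (x, y) per quadrant
    pts = []
    for (sx, sy), size in zip(quads, sizes):
        for _ in range(size):
            pts.append((sx * random.uniform(0, 0.5),
                        sy * random.uniform(0, 0.5)))
    return pts
```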
1. Simulation
We compare the performance of three distinct estimation strategies to estimate the mean of the variable Y:
• a simple random sample with the Horvitz-Thompson (HT) estimator;
• data reweighted by comparison with a random stratified design with pps, with the HT estimator;
• data reweighted by comparison with an LPM2 optimal spatial sampling design, with the HT estimator.
1. Simulation
After 1,000 replications all strategies display very small absolute relative biases (with an average of 0.005).
However, in terms of efficiency, the two post-sampling procedures largely outperform the simple random case (mimicking the crowdsourced data), producing variances of the estimator that are on average 124 times smaller (standard errors 11.8 times smaller) than those associated with the simple random case.
1. Simulation
The benefit increases linearly with the sample proportion.
[Figure: relative efficiency of post-sampling (ratio of the two standard errors, unweighted crowdsourced / post-sampled data) plotted against the sample proportion; fitted line y = 41.375x + 1.2717, R² = 0.8455.]
CROWDSOURCED SAMPLES ARE USUALLY VERY LARGE!
2. Data analysis Simulated crowdsourced data based on the exercise run in Kaduna, Nigeria (AMIS-FAO)
Here we considered the average price calculated in each location and applied the post-sampling strategy in order to obtain the average price at the state level.
CSR = 0.60
2. Data analysis
Quality indicators: Maize, Retail price, Nov 2016 – Mar 2017
Indicator name              | Value  | Theoretical range
Outlier                     | 1      |
Outlier_perc                | 0.41%  | 0-100
Sp_outlier                  | 2      |
Sp_outlier_perc             | 0.82%  | 0-100
Time_Reliability            | 0.42   | 0-1
Crowdsourcing_reliability   | 0.60   | 0-1
In each step of the methodology we can compute indicators that provide a measure of different quality aspects.
3. Experimental data (to be implemented)
In one or more European countries one could plan to run simultaneously:
• a traditional price survey, and
• a crowdsourcing data collection
and compare the results to test the validity of the post-sampling strategy
Composite measure of quality of crowdsourced data
• Data quality is usually defined in a multidimensional way around e.g. Relevance, Accuracy and Reliability, Timeliness and punctuality, Accessibility and clarity, Comparability and Coherence (ESS QAF) (European Statistical System, 2015)
• We aim at measuring the quality of crowdsourced datasets along the different quality aspects. The idea is then to develop:
• crowdsourcing-specific indicators, and
• a crowdsourcing scoreboard, which can serve to monitor the implementation of crowdsourcing approaches by tracking quality and performance.
Methodology applied by JRC
Composite indicators
• Dimensions: the quality dimensions to be covered
• Indicators selection: quantitative or qualitative measures, chosen via 1) desk work and 2) an experts panel
• Data selection: measuring indicators based on available data
• Normalization: allows comparing indicators with different scales and units
• Weighting: to aggregate indicators based on weights
Example: Quality dimensions and indicators (work in progress!)
Quality dimensions (ESS, 2015)
Statistical processes:
• Sound methodology: appropriate statistical procedures from data collection to validation
• Non-excessive burden on respondents
• Cost-effectiveness
Statistical output:
• Relevance
• Accuracy and reliability
• Timeliness
• Coherence and comparability
• Accessibility and clarity
Example indicators:
• Relevance: number of visualizations; % new data series
• Accuracy: CSR [0,1]; correlation coefficient [0,1]
• Timeliness: number of days between collection and dissemination
• …
Thresholds map indicator values to quality labels A, B, C, D.
Conclusions
• We have introduced a quality process to produce timely and reliable price data allowing for statistical inference.
• A series of quality indicators is provided and an overall quality assessment of crowdsourced data collections is proposed.
• An automated and scalable system of geo-coordinates retrieval for the observed locations and distances has been introduced.
• The post-sampling procedure is flexible and expandable to other data fields.
• Data application:
• the importance of collecting a large number of data points;
• the importance of exact geo-coding of each data collection point.
Thanks! Questions? You can find us at: [email protected]