Validating big data at scale
![Page 1: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/1.jpg)
Validating Data at Scale
Spenser Skates, CEO at Amplitude
![Page 2: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/2.jpg)
Doing things at scale is noisy
- Code is supposed to run the same way every time, but if you run the same loop a million times on a million different machines, how confident are you that it will always behave the same?
![Page 3: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/3.jpg)
Data from phones is noisier
- The code runs on tens of thousands of different platforms, with hundreds of thousands of different software configurations, on hundreds of millions of phones
- Platforms have the craziest settings
![Page 4: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/4.jpg)
How data can get messed up
- HTTP requests get mangled in transit
- Phone might not get the acknowledgement from the server
- People’s clocks are off
- People are running weird versions of Android
- Memory/disk corruption
- Gamma ray events
![Page 5: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/5.jpg)
You can’t trust data from the client
![Page 6: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/6.jpg)
Problem: Data gets mangled in transit
- Parameters from POST requests get dropped
- Within a parameter, a chunk of the data may never reach the server
![Page 7: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/7.jpg)
Solution: Checksumming
- Send a checksum that’s a function of all the fields
- If the checksum is wrong or missing, you know you haven’t received all the data, so tell the phone the upload wasn’t successful
- The phone will then attempt to re-upload the data
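
A minimal sketch of the server-side check, assuming the upload arrives as a flat dictionary of string parameters and that the client sends an MD5 digest in a parameter named `checksum`. The parameter name, the field ordering, and the length-prefix encoding are illustrative assumptions, not the scheme the talk describes in detail.

```python
import hashlib


def compute_checksum(fields):
    """Hash every field name and value in a fixed (sorted) order.

    Each part is length-prefixed so that a truncated value cannot
    collide with a shorter, legitimately different value.
    """
    digest = hashlib.md5()
    for key in sorted(fields):
        for part in (key, fields[key]):
            data = part.encode("utf-8")
            digest.update(len(data).to_bytes(4, "big"))
            digest.update(data)
    return digest.hexdigest()


def verify_upload(params):
    """Return True only if a checksum is present and matches the other fields.

    A False result means the server should report failure so that the
    client retries the upload.
    """
    received = params.get("checksum")  # hypothetical parameter name
    if received is None:
        return False
    fields = {k: v for k, v in params.items() if k != "checksum"}
    return compute_checksum(fields) == received
```

The client would compute the same digest over the same fields before sending; any dropped parameter or truncated value then makes `verify_upload` return `False`.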
![Page 8: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/8.jpg)
Problem: Client sends the same data twice
- How does the phone know that the server has received the data, so that it doesn’t upload the same piece of data twice? It gets an acknowledgement back
- How does the server know that the phone has received the acknowledgement? It doesn’t!
- This is equivalent to the two generals problem
- 5% of the time, a request that the server receives successfully fails to get its acknowledgement back to the phone
- That means all counts are inflated by about 5%!
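
The "about 5%" figure can be checked with a little arithmetic. The sketch below assumes a simplified model that the slides do not spell out: every retry reaches the server, and each acknowledgement is lost independently at the same rate. Under that model the number of copies the server receives follows a geometric distribution.

```python
def expected_copies(ack_loss_rate):
    """Expected number of times the server receives one event when the
    client keeps retrying until it sees an acknowledgement."""
    return 1.0 / (1.0 - ack_loss_rate)


inflation = expected_copies(0.05) - 1.0
print(f"{inflation:.1%}")  # prints 5.3%
```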
![Page 9: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/9.jpg)
Solution: Deduplication
- Your system must be idempotent at the event level: it must be able to receive an event it has received before without changing its state
- Create a unique key for every event that has been sent
- When you see an event, check your list of keys; if the key is already present, discard the event
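
The steps above can be sketched as follows. The key format (a device ID paired with a per-device event ID), the field names, and the in-memory set are all assumptions made for illustration; a system at scale would presumably keep its keys in a persistent, shared store.

```python
class EventDeduplicator:
    """Counts events by type, ignoring any event whose key was seen before."""

    def __init__(self):
        self._seen_keys = set()
        self.counts = {}

    def process(self, event):
        """Apply the event once. Returns False if it was a duplicate."""
        # Hypothetical unique key: device ID plus a per-device event ID.
        key = (event["device_id"], event["event_id"])
        if key in self._seen_keys:
            return False  # already applied, so state must not change
        self._seen_keys.add(key)
        event_type = event["event_type"]
        self.counts[event_type] = self.counts.get(event_type, 0) + 1
        return True
```

Because a repeated event leaves `counts` untouched, the client can safely re-upload whenever it is unsure whether an acknowledgement was lost.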
![Page 10: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/10.jpg)
Problem: Clocks are off
- Phones are often offline, so an analytics SDK needs to cache data locally before uploading, including the time the event occurred
- But people’s clocks are often off, occasionally by years!
- We can’t timestamp events with the upload time: 5% of data is uploaded more than 24 hours after the event happened
![Page 11: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/11.jpg)
Solution: Get an estimate of the actual time an event was logged
- Timestamp the upload from the phone
- For each event, let’s compare:
  - The difference between the phone event timestamp and the server upload time
  - The difference between the phone upload timestamp and the server upload time
![Page 12: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/12.jpg)
![Page 13: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/13.jpg)
![Page 14: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/14.jpg)
Solution: Get an estimate of the actual time an event was logged
- For each event timestamp, subtract the difference between the phone’s upload time and the server’s upload time
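
A minimal sketch of that correction, with timestamps in milliseconds. The function and parameter names are hypothetical, and the sketch assumes that network latency is negligible and that the phone’s clock was not changed between logging the event and uploading it.

```python
def estimate_event_time(event_ts_ms, client_upload_ts_ms, server_upload_ts_ms):
    """Estimate when an event really happened, in server time.

    The gap between the phone's upload timestamp and the server's
    upload timestamp approximates how far off the phone's clock is.
    """
    clock_offset_ms = client_upload_ts_ms - server_upload_ts_ms
    return event_ts_ms - clock_offset_ms


# A phone whose clock runs one hour fast logs an event one minute before upload.
server_upload = 1_000_000_000
client_upload = server_upload + 3_600_000
event_ts = client_upload - 60_000
print(estimate_event_time(event_ts, client_upload, server_upload))  # prints 999940000
```

The corrected time lands one minute before the server’s upload time, which matches when the event actually happened.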
![Page 15: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/15.jpg)
Other Problems
- People are running weird versions of Android
  - MD5 library
- Memory/disk corruption
- Gamma ray events
![Page 16: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/16.jpg)
Clean Data
![Page 17: Validating big data at scale](https://reader031.vdocuments.net/reader031/viewer/2022020207/55874c1ed8b42ada168b4651/html5/thumbnails/17.jpg)
Questions?
Always happy to talk about analytics problems!
blog.amplitude.com
twitter: @amplitudemobile
MOBILE ANALYTICS FOR DECISION MAKERS