php continuous data processing
TRANSCRIPT
1 PHP & Continuous Data Processing, PHPNW 2011
PHP & CONTINUOUS DATA PROCESSING
Michael Peacock, October 2011
NO. NOT MILK FLOATS (ANYMORE)
ALL ELECTRIC, COMMERCIAL VEHICLES.
Photo courtesy of kenjonbro: http://www.flickr.com/photos/kenjonbro/4037649210/in/set-72157623026469013
ABOUT MICHAEL PEACOCK
• Senior/Lead Web Developer
• Web Systems Developer
  • Telemetry Team – Smith Electric Vehicles US Corp
• Author
  • PHP 5 Social Networking, PHP 5 E-Commerce Development, Drupal Social Networking (6 & 7), Selling online with Drupal e-Commerce, Building Websites with TYPO3
• PHPNE Volunteer
• Occasional technical speaker
  • PHP North-East, PHPNW 2010, SuperMondays, PHPNW 2011 Unconference, ConFoo 2012
SMITH ELECTRIC VEHICLES & TELEMETRY
• World's largest manufacturer of commercial, all-electric vehicles
• Smith Link – on-board vehicle telematics system, capturing over 2,500 data points each second on the vehicle and broadcasting them over the mobile network
• ~400 telemetry-enabled vehicles on the road
• World's largest telemetry project outside of F1
SYSTEM ARCHITECTURE
PROBLEM #1: WE CAN’T LOSE ANY DATA
Data is required as part of a $32 million grant from the US Department of Energy
• Thousands of pieces of information collected every second from a range of remote collection devices
• Unpredictable amounts of data at any one time
• More vehicles rolling off the production line with telemetry enabled
• What about system downtime, upgrades, roll-outs and connectivity problems?
MESSAGE QUEUING
Solution: We use a fast, reliable, scalable, secure, hosted message queue
• If our systems are offline, data builds up in the external message queue
• If we are processing at full capacity, the surplus builds up in the message queue
• If a vehicle loses GPRS signal, or the message queue is inaccessible, the vehicle holds up to 7 days of data in an internal buffer
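The buffering behaviour described above can be sketched as a store-and-forward sender. This is only an illustration of the idea, not the code running on the vehicles: the `BufferedSender` class and its publish callback (standing in for a real AMQP publish) are assumptions.

```php
<?php
/**
 * Store-and-forward sender: every message goes into a local FIFO buffer and
 * is removed only once the publish callback reports success.
 * The callback is a hypothetical stand-in for a real message queue publish.
 */
final class BufferedSender
{
    /** @var SplQueue messages waiting to be published, oldest first */
    private $buffer;

    /** @var callable takes a message string, returns true on success */
    private $publish;

    public function __construct(callable $publish)
    {
        $this->publish = $publish;
        $this->buffer = new SplQueue();
    }

    /** Queue a message and try to deliver everything that is pending. */
    public function send(string $message): void
    {
        $this->buffer->enqueue($message);
        $this->flush();
    }

    /** Drain the buffer oldest first; stop at the first failure to keep order. */
    public function flush(): void
    {
        while (!$this->buffer->isEmpty()) {
            $next = $this->buffer->bottom(); // peek at the oldest message
            if (!($this->publish)($next)) {
                return; // queue unreachable: keep the message for later
            }
            $this->buffer->dequeue();
        }
    }

    /** Number of messages still waiting to be published. */
    public function pending(): int
    {
        return count($this->buffer);
    }
}
```

Nothing is dropped while the queue is unreachable, and delivery order is preserved once it comes back. A real on-vehicle buffer would also need to persist to disk and cap its size.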
SECRET WEAPON #1: STORMMQ
• Based on AMQP, an open standard
• Secure: All data is encrypted and sent over SSL
• Reliable: Huge investment in server infrastructure
• Hosted: Backed up with an SLA
• Scalable: Capable of processing huge numbers of incoming messages, with capacity to store the messages when we perform maintenance on our systems
PROBLEM #2: PROCESSING DATA QUICKLY
We use a dedicated server and a number of dedicated applications to pull these messages and process them
• This needs to happen quickly enough for live data to be seen through the web interface
• Data is rapidly converted into batch SQL files, which are imported to MySQL via “LOAD DATA INFILE”
• Results in a high number of inserts per second (20,000 – 80,000)
• LOAD DATA INFILE isn’t enough on its own...
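As a rough sketch of the batch import step: write a batch of readings to a tab-separated file, then build the LOAD DATA statement that would import it. The function names, table name and column names (`point_id`, `recorded_at`, `value`) are hypothetical, and the statement is only built here, not executed.

```php
<?php
/**
 * Write a batch of readings to a tab-separated file.
 * Each row is [pointId, timestamp, value]; this layout is an assumption.
 * Returns the number of rows written.
 */
function writeBatchFile(array $rows, string $path): int
{
    $fh = fopen($path, 'wb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    foreach ($rows as $row) {
        // Values are numeric or timestamps here, so no escaping is applied.
        fwrite($fh, implode("\t", $row) . "\n");
    }
    fclose($fh);
    return count($rows);
}

/**
 * Build the LOAD DATA statement for a batch file.
 * Column names are hypothetical placeholders.
 */
function buildLoadStatement(string $path, string $table): string
{
    return sprintf(
        "LOAD DATA LOCAL INFILE '%s' INTO TABLE `%s` "
        . "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
        . "(point_id, recorded_at, value)",
        addslashes($path),
        $table
    );
}
```

One file per batch means MySQL parses and inserts many rows in a single statement, which is where the high insert rate comes from.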
SECRET WEAPON #2: DBA
• Constantly tweaking the servers and configuration to get more and more performance
• Pushing the capabilities of our SAN, tweaking configs where no DBA has gone before
• www.samlambert.com
• http://www.samlambert.com/2011/07/how-to-push-your-san-with-open-iscsi_13.html
• http://www.samlambert.com/2011/07/diagnosing-and-fixing-mysql-io.html
• [email protected]
Sam Lambert – DBA Extraordinaire
SHARDING
• Huge volumes of data being stored
• We shard the data by the truck it came from: each truck has its own database
• Each database is held on one of many database servers in our cluster, each with ~100GB RAM
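A per-truck shard lookup might look something like the sketch below. The `ShardLocator` class, the `truck_<id>` database naming scheme and the truck-to-server map are all assumptions made for illustration; the talk does not describe how the mapping is stored.

```php
<?php
/**
 * Resolve which server and database hold a given truck's data.
 * The naming scheme and the truck-to-server map are hypothetical.
 */
final class ShardLocator
{
    /** @var array truck ID => database server host */
    private $serverByTruck;

    public function __construct(array $serverByTruck)
    {
        $this->serverByTruck = $serverByTruck;
    }

    /** Returns ['host' => ..., 'database' => ...] for the truck's shard. */
    public function locate(int $truckId): array
    {
        if (!isset($this->serverByTruck[$truckId])) {
            throw new OutOfBoundsException("No shard registered for truck $truckId");
        }
        return [
            'host' => $this->serverByTruck[$truckId],
            'database' => 'truck_' . $truckId,
        ];
    }
}
```

Keeping the lookup in one place means application code never hard-codes which server a truck lives on.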
LIVE, REAL TIME INFORMATION
[live screen photo]
REAL TIME STATUS AND TRACKING
LIVE, REAL TIME INFORMATION: PROBLEM
Original database design dictated:
• All data points were stored in the same table
• Each type of data point required a separate query, sub-query or join to obtain
The workings of the remote device collecting the data, and of the processing server, dictated:
• GPS co-ordinates can be up to 6 separate data points, including: longitude; latitude; altitude; speed; number of satellites used to get the location; direction
REAL TIME INFORMATION: CONCURRENT
Initial solution from the original developers:
• Pull as many pieces of real-time information through asynchronously
• Involved the use of Flash-based “widgets” which called separate PHP scripts to query the data
• Pages loaded relatively quickly
• Data points took a little time to load
• Not good enough
REAL TIME INFORMATION: CACHING
• High volumes of data and varying levels of concurrent processing mean query times are often inconsistent
• Memcache was used when processing the data from the message queue, keeping a copy of the most recent of each data point for each truck
• Live, Real-Time information accessed directly from memcache, bypassing the database
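A minimal sketch of the "latest value per data point per truck" idea follows. `ArrayCache` is an in-memory stand-in for a Memcache client so the example runs anywhere, and the `truck:<id>:<point>` key scheme is an assumption rather than the scheme used in the real system.

```php
<?php
/**
 * In-memory stand-in for a memcache client (set/get only).
 * Like memcache, get() returns false when the key is missing.
 */
final class ArrayCache
{
    private $items = [];

    public function set(string $key, $value): bool
    {
        $this->items[$key] = $value;
        return true;
    }

    public function get(string $key)
    {
        return array_key_exists($key, $this->items) ? $this->items[$key] : false;
    }
}

/** Hypothetical key scheme: one key per truck per data point. */
function liveKey(int $truckId, string $point): string
{
    return "truck:$truckId:$point";
}

/** Called by the queue processor for every reading it handles. */
function recordLatest(ArrayCache $cache, int $truckId, string $point, $value, int $timestamp): void
{
    $cache->set(liveKey($truckId, $point), ['value' => $value, 'at' => $timestamp]);
}

/** Called by the web application; it never touches the database. */
function readLatest(ArrayCache $cache, int $truckId, string $point)
{
    return $cache->get(liveKey($truckId, $point));
}
```

Because each write overwrites the previous value for the same key, the cache stays small no matter how much history the database holds.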
CACHING: REGISTRY/DI IS IDEAL
• Sporadic use of memcache within the web application – ideal use case for a lazy loading registry or DI container
• Give the registry or container details of memcache
• The object is instantiated, and the connection made, only when data is first requested from memcache
LAZY LOADING
public function getObject( $key )
{
    // Already instantiated: return the stored object
    if( array_key_exists( $key, $this->objects ) )
    {
        return $this->objects[ $key ];
    }
    // Registered but not yet loaded: include the class files, instantiate and store
    elseif( array_key_exists( $key, $this->objectSetup ) )
    {
        $setup = $this->objectSetup[ $key ];
        if( ! is_null( $setup['abstract'] ) )
        {
            require_once( FRAMEWORK_PATH . 'registry/aspects/' . $setup['folder'] . '/' . $setup['abstract'] . '.abstract.php' );
        }
        require_once( FRAMEWORK_PATH . 'registry/aspects/' . $setup['folder'] . '/' . $setup['file'] . '.class.php' );
        $class = $setup['class'];
        $o = new $class( $this );
        $this->storeObject( $o, $key );
        return $o;
    }
    elseif( $key == 'memcache' )
    {
        // Requesting memcache for the first time: instantiate, connect, store and return
        $mc = new Memcache();
        $mc->connect( MEMCACHE_SERVER, MEMCACHE_PORT );
        $this->storeObject( $mc, 'memcache' );
        return $mc;
    }
}
This is where the registry pattern reaches its limit; a DI container is more suitable
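For comparison, a minimal lazy-loading container could look like the sketch below. It is not the registry from the slides: the `Container` class and its method names are assumptions, and a production DI container would offer far more.

```php
<?php
/**
 * Minimal lazy-loading container: services are registered as factories and
 * only built the first time they are requested.
 */
final class Container
{
    /** @var array service name => factory callable */
    private $factories = [];

    /** @var array service name => built instance */
    private $instances = [];

    public function register(string $name, callable $factory): void
    {
        $this->factories[$name] = $factory;
    }

    public function get(string $name)
    {
        if (!array_key_exists($name, $this->instances)) {
            if (!isset($this->factories[$name])) {
                throw new InvalidArgumentException("Unknown service: $name");
            }
            // First request: run the factory once and keep the result
            $this->instances[$name] = ($this->factories[$name])($this);
        }
        return $this->instances[$name];
    }
}
```

With this shape, memcache needs no special case: it would be registered with a factory that creates the `Memcache` object and calls `connect()`, and no connection is opened on pages that never ask for it.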
REAL TIME INFORMATION: EXTRAPOLATE AND ASSUME
• Our telemetry unit broadcasts each data point once per second
• Data doesn’t change every second, e.g.
  • Battery state of charge may take several minutes to lose a percentage point
  • Fault flags only change to 1 when there is a fault
• Make an assumption
  • We compare the data to the last known value: if it’s the same we don’t insert it, and instead assume the value was unchanged
• Unfortunately, this requires us to put additional checks and balances in place
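The "only insert when the value changes" rule can be sketched as a simple filter. The function name and the `[point, timestamp, value]` reading layout are assumptions, and the additional checks and balances mentioned above are not shown.

```php
<?php
/**
 * Drop readings whose value matches the last known value for that data point.
 * Readings are [point, timestamp, value]; this layout is an assumption.
 * $lastKnown maps point name => last stored value and is updated in place.
 */
function filterUnchanged(array $readings, array &$lastKnown): array
{
    $toInsert = [];
    foreach ($readings as $reading) {
        list($point, $timestamp, $value) = $reading;
        if (array_key_exists($point, $lastKnown) && $lastKnown[$point] === $value) {
            continue; // unchanged: assume the previous value still holds
        }
        $lastKnown[$point] = $value;
        $toInsert[] = [$point, $timestamp, $value];
    }
    return $toInsert;
}
```

A slowly changing value such as state of charge then produces one row per change rather than one row per second.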
EXTRAPOLATE AND ASSUME: “INTERLATION”
Built a special library which:
• Accepted a number of arrays, each representing a collection of data points for one variable on the truck
• Used key indicators and time differences to work out if/when the truck was off, and extrapolation should stop
• For each time data was recorded, pulled down the data for the other variables, for consistency
INTERLACE
/** Add an array to the interlation */
public function addArray( $name, $array )
/** Get the time that we first received data in one of our arrays */
public function getFirst( $field )
/** Get the time that we last received data in any of our arrays */
public function getLast( $field )
/** Generate the interlaced array */
public function generate( $keyField, $valueField )
/** Break the interlaced array down into separate days */
public function dayBreak( $interlationArray )
/** Generate an interlaced array and fill for all timestamps within the range of _first_ to _last_ */
public function generateAndFill( $keyField, $valueField )
/** Populate the new combined array with key fields using the common field */
public function populateKeysFromField( $field, $valueField=null )
http://www.michaelpeacock.co.uk/interlation-library
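The core idea behind interlacing sparse series can be sketched as a forward fill. This is not the interlation library itself: the `interlace()` function and its input layout are assumptions, and it leaves out the detection of when the truck was off.

```php
<?php
/**
 * Combine several sparse time series into one row per timestamp, carrying
 * each variable's last known value forward.
 *
 * $series maps variable name => [timestamp => value].
 * Returns timestamp => [variable name => value, or null if not yet seen].
 */
function interlace(array $series): array
{
    // Collect every timestamp that appears in any series
    $timestamps = [];
    foreach ($series as $points) {
        foreach (array_keys($points) as $t) {
            $timestamps[$t] = true;
        }
    }
    ksort($timestamps);

    // Walk forward in time, updating whichever variables have a new value
    $current = array_fill_keys(array_keys($series), null);
    $rows = [];
    foreach (array_keys($timestamps) as $t) {
        foreach ($series as $name => $points) {
            if (array_key_exists($t, $points)) {
                $current[$name] = $points[$t];
            }
        }
        $rows[$t] = $current;
    }
    return $rows;
}
```

Every output row has a value for every variable, so calculations that need two variables at the same instant (current and voltage, for example) can work row by row.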
REAL TIME INFORMATION: SINGLE REQUEST
• Currently, each piece of “live data” is loaded into a Flash graph or widget, which updates every 30 seconds using an AJAX request
• The move from MySQL to Memcache reduces database load, but the large number of requests still adds strain to the web server
• We are moving to image and JavaScript widgets, which are updated from a single AJAX request
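A single-request endpoint could gather every live value a page needs into one JSON payload, roughly as sketched here. The function name, the payload shape and the `truck:<id>:<point>` key scheme are assumptions; `$fetch` is any callable that returns the cached value for a key, or false on a miss.

```php
<?php
/**
 * Build one JSON payload holding all requested live values for a truck.
 * $fetch is a hypothetical cache reader: key in, cached value or false out.
 */
function buildLivePayload(callable $fetch, int $truckId, array $points): string
{
    $payload = ['truck' => $truckId, 'points' => []];
    foreach ($points as $point) {
        $cached = $fetch("truck:$truckId:$point");
        // A cache miss is reported as null so the widget can show "no data"
        $payload['points'][$point] = ($cached === false) ? null : $cached;
    }
    return json_encode($payload);
}
```

The page then makes one request per refresh and each widget picks its own value out of the response.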
LOTS OF DATA: RACE CONDITIONS
Sessions in PHP close at the end of the execution cycle
• Unpredictable query times
• Large number of concurrent requests per screen
Session Locking
Completely locks out a user’s session, as PHP hasn’t closed the session
RACE CONDITIONS: PHP & SESSIONS
session_write_close()
Added after each write to the $_SESSION array. Writes the session data and closes the current session, releasing its lock.
(requires a call to session_start() immediately before any further reads or writes)
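One way to apply this consistently is to wrap every session write in a small helper, sketched below. The helper name is an assumption, and it relies on no output having been sent yet, since the session cannot be restarted after that point.

```php
<?php
/**
 * Write one value to the session and release the session lock straight away,
 * so concurrent requests from the same user are not blocked.
 */
function writeSessionValue(string $key, $value): void
{
    if (session_status() !== PHP_SESSION_ACTIVE) {
        session_start(); // re-opens the session and re-acquires the lock
    }
    $_SESSION[$key] = $value;
    session_write_close(); // saves the data and unlocks the session
}
```

The lock is then held only for the duration of the write rather than for the whole request, however long the queries behind it take.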
RACE CONDITIONS: USE A ******* TEMPLATE ENGINE
• V1 of the system mixed PHP and HTML
• You can’t re-initialise your session once output has been sent
• All new code uses a template engine, so session interaction has no bearing on output. By the time the template is processed and output, all database and session work has already been completed.
RACE CONDITIONS: USE A SINGLE ENTRY POINT
• Race conditions are further exacerbated by the PHP timeout values
• Certain exports, actions and processes take longer than 30 seconds, so the default execution time limit had been raised
• Initially the project lacked a single entry point, and execution flow was muddled
• A single entry point makes it easier to enforce a lower timeout, which intensive controllers or models can override
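A single entry point can apply a low default limit and raise it only for known intensive controllers, as in this sketch. The controller names, the limit values and the `dispatch()` function are all hypothetical.

```php
<?php
/** Default limit in seconds applied to every request; the value is an assumption. */
const DEFAULT_TIME_LIMIT = 10;

/** Limits for controllers known to be intensive; names and values are hypothetical. */
const INTENSIVE_TIME_LIMITS = [
    'export' => 300,
    'report' => 120,
];

/** Look up the execution time limit for a controller. */
function timeLimitFor(string $controller): int
{
    return INTENSIVE_TIME_LIMITS[$controller] ?? DEFAULT_TIME_LIMIT;
}

/** Entry point: apply the limit, then hand over to the controller action. */
function dispatch(string $controller, callable $action)
{
    set_time_limit(timeLimitFor($controller));
    return $action();
}
```

Ordinary pages then fail fast if a query stalls, and only the exports and reports keep the longer allowance.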
INTENSIVE QUERIES & CALCULATIONS
• How far did this vehicle travel?
  • Motor RPM x various vehicle-specific constants
  • Calculated for every RPM value held during the drive process
• How much energy did the vehicle use?
  • Battery current x battery voltage x time
  • For every current and voltage value combination held during the driving process
• How well was the vehicle driven?
  • Analysis of idle time
  • Harshness of accelerator and brake pedal usage
  • Inappropriate duration of AC / heater on time
• What about for a customer’s fleet, or all of our vehicles sold?
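The energy calculation (current x voltage x time) can be sketched as a sum over time-ordered samples. The sample layout and the choice to hold each sample's values until the next one arrives are assumptions, not the production method.

```php
<?php
/**
 * Approximate the energy used from time-ordered samples, holding each
 * sample's current and voltage until the next sample arrives.
 * Sample layout ['t' => seconds, 'amps' => ..., 'volts' => ...] is assumed.
 * Returns the energy in watt-hours.
 */
function energyUsedWh(array $samples, int $endTime): float
{
    $joules = 0.0;
    $count = count($samples);
    for ($i = 0; $i < $count; $i++) {
        // Each sample applies until the next sample, or until the end time
        $until = ($i + 1 < $count) ? $samples[$i + 1]['t'] : $endTime;
        $seconds = $until - $samples[$i]['t'];
        $joules += $samples[$i]['amps'] * $samples[$i]['volts'] * $seconds;
    }
    return $joules / 3600; // 1 Wh = 3600 J
}
```

For example, 10 A at 300 V for 10 seconds followed by 20 A at 300 V for 10 seconds gives 30,000 J + 60,000 J = 90,000 J, which is 25 Wh.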
INTENSIVE QUERIES & CALCULATIONS
• Involves a fair number of queries per vehicle
• Calculations involve holding this data in memory
• Processing required for every single record for that piece of data during that day
Takes a while!
Solution:
• Calculate information overnight
• Save it as a compiled report
• Lookups and comparisons only need to look at the compiled / saved reports in the database
REPORTS
In addition to our calculated reports, we also need to export key bits of information to grant authorities
• Initially our PHP-based export scripts held one database connection per database (~400 databases)
• Rewrote them to maintain only one connection per server, and switch the database in use
• Toggles instruct the export to run against only 1 of the servers at a time
• Modulus magic to run multiple export scripts per server
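The "modulus magic" is not spelled out in the slides; one plausible reading is that each export script takes the databases whose ID leaves a given remainder when divided by the number of scripts. The sketch below assumes that reading, and assumes the numeric truck ID is the value being divided.

```php
<?php
/**
 * Select the databases one export worker should handle: truck N goes to
 * worker (N mod workerCount). Using the truck ID for N is an assumption.
 */
function databasesForWorker(array $truckIds, int $workerIndex, int $workerCount): array
{
    if ($workerCount < 1 || $workerIndex < 0 || $workerIndex >= $workerCount) {
        throw new InvalidArgumentException('Worker index must be in the range 0 to workerCount - 1');
    }
    return array_values(array_filter(
        $truckIds,
        function (int $id) use ($workerIndex, $workerCount): bool {
            return $id % $workerCount === $workerIndex;
        }
    ));
}
```

Each database is handled by exactly one script, so the scripts can run in parallel without coordinating with each other.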
TRIGGERS AND EVENTS
Currently a work-in-progress R&D project, evaluating two options:
• Golden hammer: Use PHP
  • Run PHP as a daemon
  • http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php/
  • Continually monitor for specific changes to memcache variables
• Node.js
  • Lightweight and fast
  • Give PHP another friend
  • Link into PHP-based API to run triggers
THE FUTURE
• More sharding
  • Based on time – keep the individual tables smaller
• NoSQL?
  • Currently investigating NoSQL solutions as alternatives
• Rationalisation
  • Do we need as much data as we collect?
• Abstraction
  • We need to continually abstract concepts and ideas to make on-going maintenance and expansion easier, especially in terms of mapping code to database shards
• More hardware
  • Expand our DB cluster, more RAM, R&D
• Design
  • A much needed design refresh
CONCLUSIONS
• Make the solution scalable from the start
• Where data collection is critical, use a message queue, ideally hosted or “cloud based”
• Hire a genius DBA to push your database engine
• Make use of data caching systems to reduce strain on the database
• Calculations and post-processing should be done during dead time and automated
• Add more tools to your toolbox – PHP needs lots of friends in these situations
• Watch out for session race conditions: where they can’t be avoided, use session_write_close(), a template engine and a single entry point
• Reduce the number of continuous AJAX calls
Q & A
Michael Peacock
Web Systems Developer – Telemetry Team – Smith Electric Vehicles US Corp
[email protected]
Senior / Lead Developer, Author & [email protected]
www.michaelpeacock.co.uk
@michaelpeacock
http://joind.in/3808
http://www.slideshare.net/michaelpeacock
Extra information!