Big Data: Architectures and Approaches

Upload: thoughtworks

Post on 08-Sep-2014

Category:

Technology


DESCRIPTION

ThoughtWorkers David Elliman and Ashok Subramanian present how the big data world is moving quickly, with predictions of amazing industry growth. For more information on how the 'Internet of Things' is playing an increasingly large role, read David's blog post or watch the video from the London-based event: http://www.thoughtworks.com/insights/blog/big-data-and-internet-things

TRANSCRIPT

  • Welcome. BIG DATA: Architectures and Approaches. David Elliman & Ashok Subramanian

  • Luke Barrett 1971-2014
  • BIG DATA
  • Timeline (axis 1940 to 2015, in five-year steps): 1944; 1961; 1971; 1996; [...]ge becomes more cost effective for storing da[...]; 1996; 1998; 1998; https://www.usenix.org/conference/1999-usenix-annual-technical-conference/big-data-and-next-wave-infrastress-problems; 2004; 2006; 2008; 2010; 2013; "alottabytes"; 2015
  • http://blogs.gartner.com/doug-laney/batman-on-big-data/
  • THE OPPORTUNITY
  • Key Takeaways: This isn't a new problem. The problem isn't going away. Remember to focus on the VALUE.
  • Where do we
  • Analytics - Goals (axes: Complexity, Value)
      Descriptive Analytics: What happened?
      Diagnostic Analytics: Why did it happen?
      Predictive Analytics: What will happen?
      Prescriptive Analytics: How can we make it happen?
  • REAL TIME, BATCH; Volume, Velocity
  • THINK BIG, ACT SMALL. Small is the New Big (Seth Godin)
  • "80% of the work in any data project is in cleaning the data" (D J Patil)
  • SQL
  • Key Takeaways: Start small. Start with the ? Iteratively follow the value. Using freely available tooling. Volume vs Velocity.
  • Scaling the Solution
  • Attributed to Gene Amdahl, 1967: Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved.
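The Amdahl's law statement above can be made concrete with a small calculation. This is a minimal sketch and not taken from the slides: the function names `amdahl_speedup` and `amdahl_limit` and the parameter `parallel_fraction` are illustrative assumptions. It uses the usual form of the law, speedup = 1 / ((1 - p) + p / n), where p is the share of the work that benefits from the improvement and n is the factor by which that share is sped up.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speedup when `parallel_fraction` of the work is spread over `workers`.

    Names are illustrative; only the formula itself is standard Amdahl's law.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_fraction == 1.0:
        return float("inf")  # no serial part, so no finite ceiling
    return 1.0 / (1.0 - parallel_fraction)
```

Under these assumptions, a job that is 90% parallelisable gains roughly 5.3x from 10 workers, and can never exceed about 10x however many workers are added, because the remaining 10% stays serial.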
  • https://twitter.com/PieCalculus/status/459485747842523136/photo/1
  • Lambda Architecture: Batch, Speed, Serving, Query. query = function(all data). All Data.
  • Lambda Architecture: Scaled Data Store, Event Processing Network, Query, All Data
  • Lambda Architecture: Batch View, Realtime View, Batch Write, Random Write
  • Batch - Hadoop (MR1)
      Client
      Master Node: JobTracker, Name Node
      Slave Nodes (four shown), each with: Task Tracker, Data Node, Map, Reduce
      Metadata Operations to Get Block Info
      Job assignment to cluster
      Data Replication on Multiple Nodes
      Data Write, Data Read
  • Batch - MapReduce: Map, Shuffle, Reduce
  • Batch - Cascading
  • Batch - Spark
  • Segment Servers: Query processing and data storage
      Network Interconnect
      Master Servers: Query planning & dispatch
      External Sources: Loading, streaming, etc.
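The Map, Shuffle, and Reduce stages named on the Batch - MapReduce slide above can be illustrated with a single-process word count. This is a minimal sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps on separate nodes with the shuffle moving data across the network.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key so each key lands in one place."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the list of values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three stages in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each map call only sees one line and each reduce call only sees one key, both stages can in principle be distributed; the shuffle is the step that requires moving data between workers.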
  • SQL or MapReduce
  • Batch - MPP database
  • Lambda Architecture: Batch, Speed, Serving, Query. query = function(all data). All Data.
  • Speed - Storm, CEP
  • Lambda Architecture - Serving
  • Conventional Architectures
      Pull-based Batch Loads
      Enterprise Data Models
      Complex ETL Logic
      Poorly Suited to Non-Relational Data
      Emergent design is difficult
  • Pivotal Business Data Lake Architecture: http://www.gopivotal.com/sites/default/files/Pivotal-Business-Data-Lake-Technical_Brochure_WEB.PDF
  • DATA CORE: RAW FACTUAL DATA, HISTORIZED EVENTS, RETAIN BUSINESS KEY, DATA LINEAGE
  • DATA INGESTION: EVENT DRIVEN, MESSAGE QUEUE, TRICKLE FEED, BATCH LOAD
  • INFORMATION PUBLISHING: TOPICAL QUEUES, POST PROCESSING
  • INFORMATION TIER: PURPOSE BUILT DATA SUBSETS, TRANSFORMATION, DATA GOVERNANCE, MDM CONCERNS, POST PROCESSING
  • PRESENTATION TIER: BUSINESS VALUE APPLICATIONS, DATA SERVICES, AD HOC QUERYING, WRITE BACK?
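The Lambda Architecture slides above name three layers (Batch, Speed, Serving) and the idea that query = function(all data). The following is a minimal in-memory sketch of how those pieces might fit together for a simple event count. The class `LambdaCounter` and its methods are hypothetical names for illustration; they stand in for, and do not use, Hadoop, Storm, or any serving database.

```python
from collections import Counter
from typing import List


class LambdaCounter:
    """Toy Lambda Architecture: counts events per key across three layers."""

    def __init__(self) -> None:
        self.master: List[str] = []   # append-only record of all data
        self.batch_view = Counter()   # precomputed by the batch layer
        self.realtime_view = Counter()  # maintained by the speed layer

    def ingest(self, event: str) -> None:
        """New data goes to the master dataset and to the speed layer."""
        self.master.append(event)
        self.realtime_view[event] += 1

    def run_batch(self) -> None:
        """Batch layer: recompute the view from all data, then reset the speed layer.

        Resetting is safe here only because this sketch is single-threaded,
        so no events can arrive while the batch is running.
        """
        self.batch_view = Counter(self.master)
        self.realtime_view = Counter()

    def query(self, key: str) -> int:
        """Serving layer: merge the batch view with the realtime view."""
        return self.batch_view[key] + self.realtime_view[key]
```

A short usage example: after ingesting "a", "b", "a", `query("a")` returns 2 from the realtime view alone; after `run_batch()` the same answer comes from the batch view, and events ingested afterwards are again covered by the realtime view until the next batch run.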
  • Transformation Logic, Data Post Processing, Near Real Time Feed
  • Emergent Design & Agile Delivery: Apache Kafka, Apache Storm, Micro-data-services
  • Drive Towards In Memory Processing: https://www.tele-task.de/archive/lecture/overview/5721/
  • Remember: Data Structures, Algorithms
  • Raw Data, Data Structure, Algorithm, Insight
  • Key Takeaways: Embrace the cloud. Fit the Architecture to the problem. Remember Knuth.
  • SUMMARY
  • Hadoop Ecosystem (bar chart): http://www.datameer.com/blog/uncategorized/the-hadoop-ecosystem-visualized-in-datameer.html
  • Commercial, Open Source: https://blog.cloudera.com/blog/2011/10/the-community-effect/
  • Open Questions
  • "No matter how much you speed up the computers or the way you put computers together, the real issues are at the DATA LEVEL"
  • Enterprise Master Data Management
      Localised Formats
      Single System of Record
      SoR is a process not a place
      Database Integration (by another name)
  • http://www.bain.com/infographics/big-data/
  • Organisational Models