distributed processing
![Page 1: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/1.jpg)
Increase computational power with distributed processing
Neil Stein 03 Nov 2012
![Page 2: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/2.jpg)
![Page 3: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/3.jpg)
A Discussion Example…
Getting the data and ordering it as needed
Familiar with grep and sort?
- “grep” extracts all the matching lines
- “sort” sorts all the lines
grep "some_record_parameters" hl7_transfer.data-file | sort
[2012/02/25/ 9:15] records sent to healthcare-1
[2012/02/28/ 6:15] records sent to healthcare-2
[2012/03/12/ 10:30] records sent to healthcare-3
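For reference, the single-machine pipeline above can be mimicked in a few lines of Python. This is only a sketch: it uses plain substring matching where grep would use a regular expression, and Python's default string ordering where sort would follow the locale. The `grep_sort` name and the "nightly backup" sample line are made up for illustration.

```python
def grep_sort(lines, pattern):
    """Keep the lines containing `pattern`, then sort them (like `grep pattern | sort`)."""
    matching = [line for line in lines if pattern in line]
    return sorted(matching)


# Sample records in the style of the slide; the "nightly backup" line is invented
# so that the filter has something to drop.
records = [
    "[2012/03/12/ 10:30] records sent to healthcare-3",
    "[2012/02/25/ 9:15] records sent to healthcare-1",
    "[2012/02/28/ 6:15] nightly backup completed",
    "[2012/02/28/ 6:15] records sent to healthcare-2",
]

for line in grep_sort(records, "records sent"):
    print(line)
```

Both steps hold every matching line in memory on one machine, which is the limitation the talk goes on to address.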
![Page 4: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/4.jpg)
A Discussion Example…
- As the amount of data increases, the process requires more and more resources
- What if hl7_transfer.data-file is 500 GB or bigger?
- What if there are hundreds or thousands of data files?
- What if there are multiple types of data files?
grep "provider 1" hl7_transfer.data-file | sort
- Ignoring the process for a moment, how do we write all the data to disk in the first place?
Need to rethink the process
![Page 5: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/5.jpg)
![Page 6: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/6.jpg)
Distributed File-System – “the cloud”
- Files can be stored across many machines
- Files can be replicated across many machines
- Files can be in a hybrid-cloud model
- Share the file-system transparently
- You simply see the usual file structure
- Opportunity to leverage private and public cloud environments
![Page 7: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/7.jpg)
![Page 8: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/8.jpg)
Map-Reduce – the cloud
- A way of processing large amounts of data across many machines
- Must be able to split up the data into chunks for processing (Map)
- Recombined after processing (Reduce)
- Requires a constant flow of data from one simple state to another
- Allows for a simple way of breaking down a large task into smaller, manageable tasks
- Increases the available computational power
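The split, process, recombine idea can be sketched on one machine. This is a minimal illustration and not Hadoop itself: the function names, the sample records and the chunk size are invented, and the chunks are processed one after another here where a real cluster would hand them to different machines.

```python
import heapq


def split_into_chunks(records, chunk_size):
    """Split the full data set into fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk, pattern):
    """Map step: filter one chunk and sort it locally."""
    return sorted(record for record in chunk if pattern in record)


def reduce_chunks(mapped_chunks):
    """Reduce step: merge the already-sorted chunks into one sorted list."""
    return list(heapq.merge(*mapped_chunks))


def map_reduce(records, pattern, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    # Each chunk is an independent unit of work; nothing is shared between them.
    mapped = [map_chunk(chunk, pattern) for chunk in chunks]
    return reduce_chunks(mapped)


# Invented sample records, for illustration only.
records = [
    "healthcare-3 sent", "noise", "healthcare-1 sent",
    "healthcare-2 sent", "more noise",
]
print(map_reduce(records, "sent"))
```

Because each chunk is filtered and sorted on its own, the final step only has to merge sorted lists, which is far cheaper than sorting everything in one place.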
![Page 9: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/9.jpg)
A look at Hadoop
![Page 10: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/10.jpg)
What is Hadoop
- A Map-Reduce framework
- Designed to run applications on clusters of local and remote systems
- HDFS
  - The file system of Hadoop (Hadoop Distributed File System)
  - Designed to access clusters of local and remote systems
![Page 11: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/11.jpg)
Putting the pieces together….
![Page 12: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/12.jpg)
First, we need some code……
Map Reduce
![Page 13: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/13.jpg)
Map
Hadoop streams information on STDIN
Separate each value with a newline (for Hadoop)
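The mapper on this slide survives only as a screenshot, so the following is a guess at its shape rather than the presenter's code. It assumes the job mirrors the earlier grep example: read records from STDIN, keep the ones that match, and write each one followed by a newline. `PATTERN`, `map_lines` and `main` are hypothetical names.

```python
import sys

PATTERN = "provider 1"  # hypothetical filter, borrowed from the grep example


def map_lines(lines, pattern=PATTERN):
    """Yield every input record that contains `pattern`, without its line ending."""
    for line in lines:
        record = line.rstrip("\n")
        if pattern in record:
            yield record


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop streaming feeds the input split on STDIN and reads results from STDOUT.
    for record in map_lines(stdin):
        stdout.write(record + "\n")  # one record per line, newline-terminated


# Saved as an executable script (say mapper.py), the file would end with: main()
```

Hadoop streaming treats the text up to the first tab of each output line as the key; a line without a tab is treated as all key, which is enough for a filter like this one.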
![Page 14: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/14.jpg)
Reduce
Hadoop streams back to us on STDIN
Output the aggregated records
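The reducer code is likewise not in the transcript. A minimal sketch, assuming Hadoop has already sorted the mapper output by key before streaming it in: it passes the sorted records straight through, which is the reduce-side equivalent of the `sort` in the earlier example. `reduce_lines` and `main` are hypothetical names.

```python
import sys


def reduce_lines(sorted_lines):
    """Yield each non-empty record in the order received (already sorted by key)."""
    for line in sorted_lines:
        record = line.rstrip("\n")
        if record:
            yield record


def main(stdin=sys.stdin, stdout=sys.stdout):
    # The framework sorts the mapper output by key and streams it to the reducer on STDIN.
    for record in reduce_lines(stdin):
        stdout.write(record + "\n")


# Saved as an executable script (say reducer.py), the file would end with: main()
```

A real aggregation, such as a count per key, would go inside `reduce_lines`; because equal keys arrive next to each other, it only needs to remember the current key.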
![Page 15: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/15.jpg)
Sanity Checking
This should work with small data-sets
Command
Results
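The command and results on this slide are screenshots. A common way to sanity-check streaming scripts is to chain them locally with a sort in between, in the spirit of `cat small-file | mapper | sort | reducer`. The sketch below does the same thing in-process; `run_local` and the two stand-in functions are hypothetical and only mirror the grep/sort example.

```python
def run_local(map_fn, reduce_fn, lines):
    """Imitate map -> sort -> reduce on a small in-memory data set."""
    mapped = list(map_fn(lines))
    shuffled = sorted(mapped)  # stands in for the sort Hadoop performs between the phases
    return list(reduce_fn(shuffled))


def keep_sent(lines):
    """Stand-in mapper: keep records mentioning 'records sent'."""
    return (line for line in lines if "records sent" in line)


def pass_through(lines):
    """Stand-in reducer: emit the sorted records unchanged."""
    return iter(lines)


sample = [
    "[2012/03/12/ 10:30] records sent to healthcare-3",
    "[2012/02/25/ 9:15] records sent to healthcare-1",
    "[2012/02/28/ 6:15] records sent to healthcare-2",
]

result = run_local(keep_sent, pass_through, sample)
# The map/sort/reduce pipeline should agree with a plain filter-and-sort.
assert result == sorted(line for line in sample if "records sent" in line)
print("\n".join(result))
```

If the two results disagree on a small file, the problem is in the scripts and not in the cluster, which is much cheaper to find out before submitting a job.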
![Page 16: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/16.jpg)
Push file to “the distributed file system”
Put file on the DFS
Check that the file is in the cloud
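The commands on this slide are not in the transcript. With the stock Hadoop command line, copying a file into HDFS and listing it is done with `hadoop fs -put` and `hadoop fs -ls`. The sketch below only builds and prints those two commands, so it can be read without a cluster; the local file name is the one from the earlier example and the target directory `/data/hl7` is a placeholder.

```python
import shlex


def dfs_put_command(local_path, dfs_dir):
    """Command that copies a local file into a DFS directory."""
    return ["hadoop", "fs", "-put", local_path, dfs_dir]


def dfs_ls_command(dfs_dir):
    """Command that lists a DFS directory, to confirm the upload."""
    return ["hadoop", "fs", "-ls", dfs_dir]


# Placeholder paths, for illustration only.
for command in (
    dfs_put_command("hl7_transfer.data-file", "/data/hl7"),
    dfs_ls_command("/data/hl7"),
):
    print(shlex.join(command))
```

On a machine with the Hadoop client installed, each list could be passed to `subprocess.run(command, check=True)` to execute it.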
![Page 17: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/17.jpg)
Running in “the distributed environment”
Call the Hadoop streaming command
Pass the appropriate parameters
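The exact invocation is only shown as a screenshot. A streaming job is normally started with `hadoop jar` and the streaming jar, followed by the input, output, mapper and reducer parameters. The sketch below assembles such a command without running it; the jar location varies between installations, so it is left as a parameter, and the script names and DFS paths are placeholders.

```python
import shlex


def streaming_command(streaming_jar, input_path, output_path, mapper, reducer):
    """Build the argument list for a Hadoop streaming job."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic option: ship the scripts to the worker nodes. It has to come
        # before the streaming-specific options.
        "-files", f"{mapper},{reducer}",
        "-input", input_path,    # file or directory already on the DFS
        "-output", output_path,  # DFS directory for the results; must not exist yet
        "-mapper", mapper,
        "-reducer", reducer,
    ]


# Placeholder values, for illustration only.
command = streaming_command(
    "path/to/hadoop-streaming.jar",
    "/data/hl7/hl7_transfer.data-file",
    "/data/hl7-output",
    "mapper.py",
    "reducer.py",
)
print(shlex.join(command))
```

Keeping the command as a list avoids shell-quoting problems if it is later handed to `subprocess.run`.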
![Page 18: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/18.jpg)
Running in “the distributed environment”
![Page 19: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/19.jpg)
Running in “the distributed environment”
![Page 20: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/20.jpg)
Running in “the distributed environment”
![Page 21: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/21.jpg)
Running in “the distributed environment”
![Page 22: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/22.jpg)
Checking Status
- Cluster Summary
- Running Jobs
- Completed Jobs
- Failed Jobs
- Job Statistics
- Detailed Job Logs
![Page 23: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/23.jpg)
Checking Distributed Cluster Health
- List Data-Nodes
- Dead Nodes
- Node Heart-beat information
- Failed Jobs
- Job Statistics
- Detailed Job Logs
![Page 24: Distributed processing](https://reader030.vdocuments.net/reader030/viewer/2022020306/5560cb07d8b42a19088b4b2d/html5/thumbnails/24.jpg)
Conclusion
- A different paradigm for solving large-scale problems
- Designed to solve specific problems that can be defined in a focused map-reduce manner