from incubation to continuous ingestion the story of ... · system of hadoop hdfs. data access : an...
TRANSCRIPT
![Page 1: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/1.jpg)
From Incubation to Continuous IngestionThe Story of Apache Gora
Renato Marroquin - Lewis John McGibbney
Apache Gora
November 07, 2012
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 2: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/2.jpg)
Agenda
What Apache Gora is
How Apache Gora works
Versioning
GoraCI
Gora-DynamoDB
Future work
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 3: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/3.jpg)
What Apache Gora is
Apache Gora’s Goals:
Data Persistence : Persisting objects to Column stores,key-value stores, SQL databases and to flat files in local filesystem of Hadoop HDFS.
Data Access : An easy to use Java-friendly common API foraccessing the data regardless of its location.
Indexing : Persisting objects to Lucene and Solr indexes,accessing/querying the data with Gora API.
Analysis : Accesing the data and making analysis throughadapters for Apache Pig, Apache Hive and Cascading
MapReduce support : Out-of-the-box and extensiveMapReduce (Apache Hadoop) support for data in the datastore.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 4: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/4.jpg)
What Apache Gora is
Apache Gora’s Goals:
Data Persistence : Persisting objects to Column stores,key-value stores, SQL databases and to flat files in local filesystem of Hadoop HDFS.
Data Access : An easy to use Java-friendly common API foraccessing the data regardless of its location.
Indexing : Persisting objects to Lucene and Solr indexes,accessing/querying the data with Gora API.
Analysis : Accesing the data and making analysis throughadapters for Apache Pig, Apache Hive and Cascading
MapReduce support : Out-of-the-box and extensiveMapReduce (Apache Hadoop) support for data in the datastore.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 5: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/5.jpg)
What Apache Gora is
Apache Gora’s Goals:
Data Persistence : Persisting objects to Column stores,key-value stores, SQL databases and to flat files in local filesystem of Hadoop HDFS.
Data Access : An easy to use Java-friendly common API foraccessing the data regardless of its location.
Indexing : Persisting objects to Lucene and Solr indexes,accessing/querying the data with Gora API.
Analysis : Accesing the data and making analysis throughadapters for Apache Pig, Apache Hive and Cascading
MapReduce support : Out-of-the-box and extensiveMapReduce (Apache Hadoop) support for data in the datastore.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 6: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/6.jpg)
What Apache Gora is
Apache Gora’s Goals:
Data Persistence : Persisting objects to Column stores,key-value stores, SQL databases and to flat files in local filesystem of Hadoop HDFS.
Data Access : An easy to use Java-friendly common API foraccessing the data regardless of its location.
Indexing : Persisting objects to Lucene and Solr indexes,accessing/querying the data with Gora API.
Analysis : Accesing the data and making analysis throughadapters for Apache Pig, Apache Hive and Cascading
MapReduce support : Out-of-the-box and extensiveMapReduce (Apache Hadoop) support for data in the datastore.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 7: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/7.jpg)
What Apache Gora is
Apache Gora’s Goals:
Data Persistence : Persisting objects to Column stores,key-value stores, SQL databases and to flat files in local filesystem of Hadoop HDFS.
Data Access : An easy to use Java-friendly common API foraccessing the data regardless of its location.
Indexing : Persisting objects to Lucene and Solr indexes,accessing/querying the data with Gora API.
Analysis : Accesing the data and making analysis throughadapters for Apache Pig, Apache Hive and Cascading
MapReduce support : Out-of-the-box and extensiveMapReduce (Apache Hadoop) support for data in the datastore.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 8: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/8.jpg)
What Apache Gora is
Apache Gora’s Goals:
Data Persistence : Persisting objects to Column stores,key-value stores, SQL databases and to flat files in local filesystem of Hadoop HDFS.
Data Access : An easy to use Java-friendly common API foraccessing the data regardless of its location.
Indexing : Persisting objects to Lucene and Solr indexes,accessing/querying the data with Gora API.
Analysis : Accesing the data and making analysis throughadapters for Apache Pig, Apache Hive and Cascading
MapReduce support : Out-of-the-box and extensiveMapReduce (Apache Hadoop) support for data in the datastore.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 9: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/9.jpg)
What Apache Gora is (cont’d)
Open source framework which provides an in-memory datamodel and persistence for big data.
Gora supports:Column stores:
Key Value stores:
RDBMSs:
Flat files in local file system of Hadoop HDFS
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 10: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/10.jpg)
What Apache Gora is (cont’d)
Open source framework which provides an in-memory datamodel and persistence for big data.
Gora supports:
Column stores:
Key Value stores:
RDBMSs:
Flat files in local file system of Hadoop HDFS
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 11: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/11.jpg)
What Apache Gora is (cont’d)
Open source framework which provides an in-memory datamodel and persistence for big data.
Gora supports:Column stores:
Key Value stores:
RDBMSs:
Flat files in local file system of Hadoop HDFS
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 12: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/12.jpg)
What Apache Gora is (cont’d)
Open source framework which provides an in-memory datamodel and persistence for big data.
Gora supports:Column stores:
Key Value stores:
RDBMSs:
Flat files in local file system of Hadoop HDFS
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 13: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/13.jpg)
What Apache Gora is (cont’d)
Open source framework which provides an in-memory datamodel and persistence for big data.
Gora supports:Column stores:
Key Value stores:
RDBMSs:
Flat files in local file system of Hadoop HDFS
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 14: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/14.jpg)
What Apache Gora is (cont’d)
Open source framework which provides an in-memory datamodel and persistence for big data.
Gora supports:Column stores:
Key Value stores:
RDBMSs:
Flat files in local file system of Hadoop HDFS
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 15: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/15.jpg)
How Gora works
The Core Gora API
Store (org.apache.gora.store)Persistency (org.apache.gora.persistency)Query (org.apache.gora.query)MapReduce (org.apache.gora.mapreduce)
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 16: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/16.jpg)
How Gora works
Store API
org.apache.gora.store.∗
Core class is the DataStore.java (or FileBacked...). DataStorehandles actual object persistence. Objects can be persisted,fetched, queried or deleted by the DataStore methods.DataStores can be constructed by an instance of {@linkDataStoreFactory}Base class for client interaction is o.a.g.s.impl.DataStoreBase(or FileBacked...) which is a base class for indirectlyinteracting with the DataStore. All DataStore implementationsextend this class.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 17: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/17.jpg)
How Gora works
Store API
org.apache.gora.store.∗Core class is the DataStore.java (or FileBacked...). DataStorehandles actual object persistence. Objects can be persisted,fetched, queried or deleted by the DataStore methods.DataStores can be constructed by an instance of {@linkDataStoreFactory}
Base class for client interaction is o.a.g.s.impl.DataStoreBase(or FileBacked...) which is a base class for indirectlyinteracting with the DataStore. All DataStore implementationsextend this class.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 18: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/18.jpg)
How Gora works
Store API
org.apache.gora.store.∗Core class is the DataStore.java (or FileBacked...). DataStorehandles actual object persistence. Objects can be persisted,fetched, queried or deleted by the DataStore methods.DataStores can be constructed by an instance of {@linkDataStoreFactory}Base class for client interaction is o.a.g.s.impl.DataStoreBase(or FileBacked...) which is a base class for indirectlyinteracting with the DataStore. All DataStore implementationsextend this class.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 19: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/19.jpg)
How Gora works
Persistency API
org.apache.gora.persistency.∗
Core classes are BeanFactory, Persistent and State. Theformer enables the construction of keys and persistent objects.All objects persisted by Gora implement for second and thelatter defines the actual state of an object or field. State ismanaged through the StateManager. Objects can be NEW,CLEAN (UNMODIFIED), DIRTY (MODIFIED) or DELETEDAs with the Store API, persistency has base classes for clientinteraction e.g. o.a.g.p.impl.BeanFactoryImpl,o.a.g.impl.PersistentBase and o.a.g.impl.StateManagerImplrespectively.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 20: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/20.jpg)
How Gora works
Persistency API
org.apache.gora.persistency.∗Core classes are BeanFactory, Persistent and State. Theformer enables the construction of keys and persistent objects.All objects persisted by Gora implement for second and thelatter defines the actual state of an object or field. State ismanaged through the StateManager. Objects can be NEW,CLEAN (UNMODIFIED), DIRTY (MODIFIED) or DELETED
As with the Store API, persistency has base classes for clientinteraction e.g. o.a.g.p.impl.BeanFactoryImpl,o.a.g.impl.PersistentBase and o.a.g.impl.StateManagerImplrespectively.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 21: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/21.jpg)
How Gora works
Persistency API
org.apache.gora.persistency.∗Core classes are BeanFactory, Persistent and State. Theformer enables the construction of keys and persistent objects.All objects persisted by Gora implement for second and thelatter defines the actual state of an object or field. State ismanaged through the StateManager. Objects can be NEW,CLEAN (UNMODIFIED), DIRTY (MODIFIED) or DELETEDAs with the Store API, persistency has base classes for clientinteraction e.g. o.a.g.p.impl.BeanFactoryImpl,o.a.g.impl.PersistentBase and o.a.g.impl.StateManagerImplrespectively.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 22: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/22.jpg)
How Gora works
Query API
org.apache.gora.query.∗
Core classes are Query, PartitionQuery and Result.PartitionQuery divides the results of the Query to multipartitions, so that queries can be run locally on the nodes thathold the data. PartitionQuery’s are used for generatingHadoop InputSplits.Queries are constructed by the DataStore implementation via{@link DataStore#newQuery()}In a common theme clients access core classes through the theclasses in o.a.g.q.impl
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 23: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/23.jpg)
How Gora works
Query API
org.apache.gora.query.∗Core classes are Query, PartitionQuery and Result.
PartitionQuery divides the results of the Query to multipartitions, so that queries can be run locally on the nodes thathold the data. PartitionQuery’s are used for generatingHadoop InputSplits.Queries are constructed by the DataStore implementation via{@link DataStore#newQuery()}In a common theme clients access core classes through the theclasses in o.a.g.q.impl
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 24: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/24.jpg)
How Gora works
Query API
org.apache.gora.query.∗Core classes are Query, PartitionQuery and Result.PartitionQuery divides the results of the Query to multipartitions, so that queries can be run locally on the nodes thathold the data. PartitionQuery’s are used for generatingHadoop InputSplits.
Queries are constructed by the DataStore implementation via{@link DataStore#newQuery()}In a common theme clients access core classes through the theclasses in o.a.g.q.impl
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 25: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/25.jpg)
How Gora works
Query API
org.apache.gora.query.∗Core classes are Query, PartitionQuery and Result.PartitionQuery divides the results of the Query to multipartitions, so that queries can be run locally on the nodes thathold the data. PartitionQuery’s are used for generatingHadoop InputSplits.Queries are constructed by the DataStore implementation via{@link DataStore#newQuery()}
In a common theme clients access core classes through the theclasses in o.a.g.q.impl
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 26: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/26.jpg)
How Gora works
Query API
org.apache.gora.query.∗Core classes are Query, PartitionQuery and Result.PartitionQuery divides the results of the Query to multipartitions, so that queries can be run locally on the nodes thathold the data. PartitionQuery’s are used for generatingHadoop InputSplits.Queries are constructed by the DataStore implementation via{@link DataStore#newQuery()}In a common theme clients access core classes through the theclasses in o.a.g.q.impl
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 27: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/27.jpg)
How Gora works
MapReduce API
org.apache.gora.mapreduce.∗
This package holds ALL of the Gora MR functionality andUtilities required to utilize Gora within such an environmentIncluding GoraMapper, GoraReducer, ALL Record Counter,Reader and Writer implementations.Persistent and String Se/Deserialization is done in the Hadoopserializer using o.a.g.avro.@link PersistentDatumWriter (forwriting Avro’s dirty and readable information )with Avro’s@link BinaryEncoder.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 28: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/28.jpg)
How Gora works
MapReduce API
org.apache.gora.mapreduce.∗This package holds ALL of the Gora MR functionality andUtilities required to utilize Gora within such an environment
Including GoraMapper, GoraReducer, ALL Record Counter,Reader and Writer implementations.Persistent and String Se/Deserialization is done in the Hadoopserializer using o.a.g.avro.@link PersistentDatumWriter (forwriting Avro’s dirty and readable information )with Avro’s@link BinaryEncoder.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 29: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/29.jpg)
How Gora works
MapReduce API
org.apache.gora.mapreduce.∗This package holds ALL of the Gora MR functionality andUtilities required to utilize Gora within such an environmentIncluding GoraMapper, GoraReducer, ALL Record Counter,Reader and Writer implementations.
Persistent and String Se/Deserialization is done in the Hadoopserializer using o.a.g.avro.@link PersistentDatumWriter (forwriting Avro’s dirty and readable information )with Avro’s@link BinaryEncoder.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 30: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/30.jpg)
How Gora works
MapReduce API
org.apache.gora.mapreduce.∗This package holds ALL of the Gora MR functionality andUtilities required to utilize Gora within such an environmentIncluding GoraMapper, GoraReducer, ALL Record Counter,Reader and Writer implementations.Persistent and String Se/Deserialization is done in the Hadoopserializer using o.a.g.avro.@link PersistentDatumWriter (forwriting Avro’s dirty and readable information )with Avro’s@link BinaryEncoder.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 31: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/31.jpg)
How Gora works
The LogManager Tutorial
Hadoop has Word Count, Gora has LogManager!Tutorial Aim.
Implements a system to store web server logs (10K lines) inApache Hbase, analyze the results using Apache Hadoop andstore the results either in HSQLDB or MySQL.
Tutorial Objectives.
Set the (Gora) EnvironmentModel the DataThe Core Gora APIMapReduce Support
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 32: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/32.jpg)
How Gora works
The LogManager Tutorial
Hadoop has Word Count, Gora has LogManager!
Tutorial Aim.
Implements a system to store web server logs (10K lines) inApache Hbase, analyze the results using Apache Hadoop andstore the results either in HSQLDB or MySQL.
Tutorial Objectives.
Set the (Gora) EnvironmentModel the DataThe Core Gora APIMapReduce Support
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 33: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/33.jpg)
How Gora works
The LogManager Tutorial
Hadoop has Word Count, Gora has LogManager!Tutorial Aim.
Implements a system to store web server logs (10K lines) inApache Hbase, analyze the results using Apache Hadoop andstore the results either in HSQLDB or MySQL.
Tutorial Objectives.
Set the (Gora) EnvironmentModel the DataThe Core Gora APIMapReduce Support
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 34: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/34.jpg)
How Gora works
The LogManager Tutorial
Hadoop has Word Count, Gora has LogManager!Tutorial Aim.
Implements a system to store web server logs (10K lines) inApache Hbase, analyze the results using Apache Hadoop andstore the results either in HSQLDB or MySQL.
Tutorial Objectives.
Set the (Gora) EnvironmentModel the DataThe Core Gora APIMapReduce Support
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 35: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/35.jpg)
Versioning
Apache Hadoop - 1.0.1
Apache Cassandra - 1.1.2
Apache HBase - 0.90.4
Apache Accumulo - 1.4.0
Apache Avro - 1.3.3
Amazon DynamoDB - Amazon SDK 1.3
MySQL - 5.1.18
HDBSQL - 2.2.8
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 36: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/36.jpg)
GoraCI
Based on Cloudbase tests
Scale test
Ingest synthetic dada for extended periods.Run continuous queries.
Verify
Keeps running.Ingest and queries fast.No data lost.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 37: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/37.jpg)
GoraCI
Generate linked list of random nodes.
Query processes.
Follow linked list.Record timing info.Note missing nodes.
Write 1 random node.
Write 1 random node that references previous node.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 38: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/38.jpg)
GoraCI
Generate linked list of random nodes.
Query processes.
Follow linked list.Record timing info.Note missing nodes.
Write 1 random node.
Write 1 random node that references previous node.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 39: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/39.jpg)
GoraCI
Generate linked list of random nodes.
Query processes.
Follow linked list.Record timing info.Note missing nodes.
Write 1 random node.
Write 1 random node that references previous node.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 40: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/40.jpg)
GoraCI
Generate linked list of random nodes.
Query processes.
Follow linked list.Record timing info.Note missing nodes.
Write 1 random node.
Write 1 random node that references previous node.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 41: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/41.jpg)
GoraCI
Client buffer problem.
When client dies some nodes written, others not.Can not easily distinguish data loss from client death.
Written Buffered Written
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 42: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/42.jpg)
GoraCI
Client buffer problem.
When client dies some nodes written, others not.Can not easily distinguish data loss from client death.
Written Buffered Written
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 43: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/43.jpg)
GoraCI
Client buffer problem.
When client dies some nodes written, others not.Can not easily distinguish data loss from client death.
Written Buffered Written
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 44: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/44.jpg)
GoraCI
Only reference flushed nodes.
Buffered
Written
Written and Lost
Written
Written
Written
Buffered
Written
Written
Flush at time 2
Flush at time 1
Ingest client starts at time 0
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 45: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/45.jpg)
GoraCI
Giant Linked List.
Written
Written
Written
Written
Written
Written
Written
Written
Written
Flush at time 3
Flush at time 2
Flush at time 4
Ingest client starts at time 0
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 46: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/46.jpg)
GoraCI
Verification MapReduce.
Map.
<row>:-1<row referenced>:<row>
Reduce.
Brings references to and definition of node together.Emits references to undefined nodes.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 47: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/47.jpg)
GoraCI
Verification MapReduce.
Map.
<row>:-1<row referenced>:<row>
Reduce.
Brings references to and definition of node together.Emits references to undefined nodes.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 48: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/48.jpg)
GoraCI
Verification MapReduce.
Map.
<row>:-1<row referenced>:<row>
Reduce.
Brings references to and definition of node together.Emits references to undefined nodes.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 49: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/49.jpg)
GoraCI
Verification MapReduce.
Map.
<row>:-1<row referenced>:<row>
Reduce.
Brings references to and definition of node together.Emits references to undefined nodes.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 50: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/50.jpg)
GoraCI
Related JIRA issues:
HADOOP-6945HBASE-5754GORA-XXX
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 51: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/51.jpg)
Gora-DynamoDB
Incubated during last ApacheCon in Vancouver
Product of the GSoC 2012!!!
Motivation
Engage with the open source community by using somepopular services such as Amazon WebServices.Help n00bs to get familiar more easily with BigDatatechnologies, specially for those who don’t have access to bigcompute clusters ... like me.Make the Gora Project more robust.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 52: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/52.jpg)
Gora-DynamoDB
Incubated during last ApacheCon in Vancouver
Product of the GSoC 2012!!!
Motivation
Engage with the open source community by using somepopular services such as Amazon WebServices.Help n00bs to get familiar more easily with BigDatatechnologies, specially for those who don’t have access to bigcompute clusters ... like me.Make the Gora Project more robust.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 53: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/53.jpg)
Gora-DynamoDB
Incubated during last ApacheCon in Vancouver
Product of the GSoC 2012!!!
Motivation
Engage with the open source community by using somepopular services such as Amazon WebServices.Help n00bs to get familiar more easily with BigDatatechnologies, specially for those who don’t have access to bigcompute clusters ... like me.Make the Gora Project more robust.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 54: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/54.jpg)
Gora-DynamoDB
Incubated during last ApacheCon in Vancouver
Product of the GSoC 2012!!!
Motivation
Engage with the open source community by using somepopular services such as Amazon WebServices.Help n00bs to get familiar more easily with BigDatatechnologies, specially for those who don’t have access to bigcompute clusters ... like me.Make the Gora Project more robust.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 55: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/55.jpg)
Gora-DynamoDB
Changes within Gora
Create new abstraction for Gora internals: One for file backeddata stores, and one for web service backed data stores.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 56: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/56.jpg)
Gora-DynamoDB
Changes within Gora
Create new abstraction for Gora internals: One for file backeddata stores, and one for web service backed data stores.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 57: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/57.jpg)
Gora-DynamoDB
Changes within Gora
Create new abstraction for Gora internals: One for file backeddata stores, and one for web service backed data stores.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 58: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/58.jpg)
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 59: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/59.jpg)
Gora-DynamoDB
Changes within GoraNew properties added:
Amazon endpoint (Any of the available e.g.us-east-1.amazonaws).Consistent reads (True or False).Client type (Sync or Async).
Gora-DynamoDB compiler uses Amazon SDK annotations.
@DynamoDBTable@DynamoDBHashKey@DynamoDBRangeKey@DynamoDBAttribute
Gora-DynamoDB mapping file.
Schema definition.Composite key definition: Hash key and/or range key.Provisioned throughput.Columns specifying their data types.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 60: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/60.jpg)
Gora-DynamoDB
Changes within Gora
New properties added:
Amazon endpoint (Any of the available e.g.us-east-1.amazonaws).Consistent reads (True or False).Client type (Sync or Async).
Gora-DynamoDB compiler uses Amazon SDK annotations.
@DynamoDBTable@DynamoDBHashKey@DynamoDBRangeKey@DynamoDBAttribute
Gora-DynamoDB mapping file.
Schema definition.Composite key definition: Hash key and/or range key.Provisioned throughput.Columns specifying their data types.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 61: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/61.jpg)
Gora-DynamoDB
Changes within GoraNew properties added:
Amazon endpoint (Any of the available e.g.us-east-1.amazonaws).Consistent reads (True or False).Client type (Sync or Async).
Gora-DynamoDB compiler uses Amazon SDK annotations.
@DynamoDBTable@DynamoDBHashKey@DynamoDBRangeKey@DynamoDBAttribute
Gora-DynamoDB mapping file.
Schema definition.Composite key definition: Hash key and/or range key.Provisioned throughput.Columns specifying their data types.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 62: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/62.jpg)
Gora-DynamoDB
Changes within GoraNew properties added:
Amazon endpoint (Any of the available e.g.us-east-1.amazonaws).Consistent reads (True or False).Client type (Sync or Async).
Gora-DynamoDB compiler uses Amazon SDK annotations.
@DynamoDBTable@DynamoDBHashKey@DynamoDBRangeKey@DynamoDBAttribute
Gora-DynamoDB mapping file.
Schema definition.Composite key definition: Hash key and/or range key.Provisioned throughput.Columns specifying their data types.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 63: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/63.jpg)
Gora-DynamoDB
Changes within GoraNew properties added:
Amazon endpoint (Any of the available e.g.us-east-1.amazonaws).Consistent reads (True or False).Client type (Sync or Async).
Gora-DynamoDB compiler uses Amazon SDK annotations.
@DynamoDBTable@DynamoDBHashKey@DynamoDBRangeKey@DynamoDBAttribute
Gora-DynamoDB mapping file.
Schema definition.Composite key definition: Hash key and/or range key.Provisioned throughput.Columns specifying their data types.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 64: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/64.jpg)
Future work
On our web service backed data stores:
Add new web based data stores: GAE, Microsoft DataServices.Add batch write functionality to our Gora-DynamoDB module.Add new data types for the Gora-DynamoDB module.Add cost based control for our web based data stores.Fully integrate them with Apache Nutch.
On our file backed data stores:
Add new file based data stores: HIVE, Riak.Update our AVRO backend which will us new features.
Add optimistic concurrency control for our data stores.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 65: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/65.jpg)
Future work
On our web service backed data stores:
Add new web based data stores: GAE, Microsoft DataServices.Add batch write functionality to our Gora-DynamoDB module.Add new data types for the Gora-DynamoDB module.Add cost based control for our web based data stores.Fully integrate them with Apache Nutch.
On our file backed data stores:
Add new file based data stores: HIVE, Riak.Update our AVRO backend which will us new features.
Add optimistic concurrency control for our data stores.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 66: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/66.jpg)
Future work
On our web service backed data stores:
Add new web based data stores: GAE, Microsoft DataServices.Add batch write functionality to our Gora-DynamoDB module.Add new data types for the Gora-DynamoDB module.Add cost based control for our web based data stores.Fully integrate them with Apache Nutch.
On our file backed data stores:
Add new file based data stores: HIVE, Riak.Update our AVRO backend which will us new features.
Add optimistic concurrency control for our data stores.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 67: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/67.jpg)
Future work
On our web service backed data stores:
Add new web based data stores: GAE, Microsoft DataServices.Add batch write functionality to our Gora-DynamoDB module.Add new data types for the Gora-DynamoDB module.Add cost based control for our web based data stores.Fully integrate them with Apache Nutch.
On our file backed data stores:
Add new file based data stores: HIVE, Riak.Update our AVRO backend which will us new features.
Add optimistic concurrency control for our data stores.
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 68: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/68.jpg)
Questions
Renato Marroquin - Lewis John McGibbney Apache Gora
![Page 69: From Incubation to Continuous Ingestion The Story of ... · system of Hadoop HDFS. Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location](https://reader034.vdocuments.net/reader034/viewer/2022042305/5ed0b1fca233ca78797903fa/html5/thumbnails/69.jpg)
Thanks
Thanks!
Renato Marroquin - Lewis John McGibbney Apache Gora