Cloud computing major project

DATA WAREHOUSING SOLUTION USING APACHE SPARK

TEAM 18: AYUSH KHANDELWAL, GAURAV PARIDA, ANIL REDDY, MEHAK AGARWAL

Upload: ayk115

Post on 14-Jan-2017


TRANSCRIPT

Page 1: Cloud computing major project

DATA WAREHOUSING SOLUTION USING APACHE SPARK

TEAM 18

AYUSH KHANDELWAL

GAURAV PARIDA

ANIL REDDY

MEHAK AGARWAL

Page 2: Cloud computing major project

INTRODUCTION TO DATA WAREHOUSE

A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making.

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization.

It is kept separate from the organization's operational database. There is no frequent updating done in a data warehouse.

It possesses consolidated historical data, which helps the organization to analyze its business.

Page 3: Cloud computing major project

Image taken from wikipedia.org/datawarehouse

Page 4: Cloud computing major project

KEY FEATURES

Subject Oriented - A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations.

Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

Time Variant - The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.

Non-volatile - Non-volatile means the previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

Page 5: Cloud computing major project

DATA WAREHOUSE VS OPERATIONAL DATABASE

An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.

Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

An operational database query allows both read and modify operations, while an OLAP query needs only read-only access to the stored data.

An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
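The contrast on this slide can be shown with a tiny runnable example. Plain SQLite is used here purely as a stand-in, and the miniature `ratings` table is made up for illustration; it is not the project's schema:

```python
import sqlite3

# Hypothetical miniature ratings table, only to illustrate the OLTP/OLAP contrast.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ratings (userid INT, movieid INT, rating REAL)")
con.executemany("INSERT INTO ratings VALUES (?, ?, ?)",
                [(1, 10, 4.0), (2, 10, 3.0), (1, 20, 5.0)])

# OLTP-style query: touch one well-known record (read and modify).
con.execute("UPDATE ratings SET rating = 4.5 WHERE userid = 1 AND movieid = 10")

# OLAP-style query: read-only aggregate over the whole table.
avg_by_movie = con.execute(
    "SELECT movieid, AVG(rating) FROM ratings GROUP BY movieid"
).fetchall()
print(avg_by_movie)  # one (movieid, average rating) row per movie
```

The first statement pinpoints and rewrites a single current record; the second only reads, but scans every row to present a summarized, general form of the data.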

Page 6: Cloud computing major project

APACHE SPARK

Open source

Alternative to MapReduce for certain applications

A low-latency cluster computing system for very large data sets

May be 100 times faster than MapReduce for iterative algorithms and interactive data mining

Used with Hadoop / HDFS

Released under the BSD License

Page 7: Cloud computing major project

SPARK FEATURES

Uses in-memory cluster computing for increased speed; memory access is faster than disk access

Has APIs written in Scala, Java and Python

Can be accessed from the Scala and Python shells

Currently an Apache incubator project

Scales to very large clusters

Low latency shell access

Page 8: Cloud computing major project

OUR DATA WAREHOUSE SOLUTION

Building a data warehouse is a task that requires a lot of data to start with, combined with immense computational resources.

This project deals with creating a data warehouse-like system which can perform basic queries and some analytics.

Use-cases that we are dealing with:

Ad-hoc queries such as “best movies of 2012”, “best comedy movies” etc.

Movie rating progression graph

Movie recommendation engine
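As a sketch of the first use-case, the aggregation behind a query like "best movies of 2012" can be expressed in plain Python. The sample rows below are invented; only the title-carries-release-year convention follows the MovieLens format:

```python
from collections import defaultdict

# Invented miniature of the MovieLens tables.
movies = {
    1: "The Avengers (2012)",
    2: "Skyfall (2012)",
    3: "Inception (2010)",
}
ratings = [  # (userid, movieid, rating)
    (1, 1, 4.0), (2, 1, 5.0),
    (1, 2, 3.0), (2, 2, 4.0),
    (1, 3, 5.0),
]

def best_movies_of_year(year, top_n=10):
    """Average each movie's ratings, keep movies whose title carries the year."""
    totals = defaultdict(lambda: [0.0, 0])
    for _, movieid, rating in ratings:
        totals[movieid][0] += rating
        totals[movieid][1] += 1
    averages = {m: s / n for m, (s, n) in totals.items()}
    in_year = [(movies[m], avg) for m, avg in averages.items()
               if f"({year})" in movies[m]]
    return sorted(in_year, key=lambda pair: pair[1], reverse=True)[:top_n]

print(best_movies_of_year(2012))
# [('The Avengers (2012)', 4.5), ('Skyfall (2012)', 3.5)]
```

In the project itself, the same shape of computation runs as Spark RDD transformations over the full 20M ratings rather than in-process Python.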

Page 9: Cloud computing major project

MOVIELENS 20M DATASET

movielens.org is a movie ratings aggregator run by GroupLens. GroupLens provides MovieLens datasets of several sizes for free; they can be found at http://grouplens.org/datasets/movielens/

For this project, we are using the Movielens 20M dataset which is the largest of all the datasets provided by movielens.

Statistics about the dataset:

20 million ratings

465,000 tag applications

27,000 movies

138,000 users

Page 10: Cloud computing major project

DESCRIBING THE DATA

The data contains 4 CSV files, of which only 2 are useful for this project:

movies.csv - movieid, title, genres

ratings.csv - userid, movieid, rating, timestamp
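Loading these two files can be sketched with Python's csv module. The sample rows below are invented; only the column layout follows the two files described above:

```python
import csv
from io import StringIO

# Invented sample rows; the column layout matches movies.csv and ratings.csv.
movies_csv = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
"""
ratings_csv = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,2,3.5,964981247
"""

movies = {}
for row in csv.DictReader(StringIO(movies_csv)):
    # genres is a pipe-separated multi-valued field
    movies[int(row["movieId"])] = (row["title"], row["genres"].split("|"))

ratings = [(int(r["userId"]), int(r["movieId"]), float(r["rating"]), int(r["timestamp"]))
           for r in csv.DictReader(StringIO(ratings_csv))]

print(movies[1][0])  # Toy Story (1995)
print(ratings[0])    # (1, 1, 4.0, 964982703)
```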

Page 11: Cloud computing major project

SOME IDEAS FROM HIVE

A data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis.

Supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as Amazon S3 filesystem.

Provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.

Page 12: Cloud computing major project

FOREGROUND

Taking ideas from Apache Hive, the following solution has been proposed by us in this project:

Dataset files are stored in HDFS.

An API interface has been developed using Flask instead of a graphical interface. API rules have been defined for each query.

On hitting the API URL with the appropriate parameters, the results are displayed in the browser window.
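A minimal sketch of such an API rule in Flask follows. The route name, parameter, and query function are assumptions for illustration; the real project defines one rule per query, backed by Spark:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def best_movies_of_year(year):
    """Stand-in for the real Spark-backed query function (hypothetical)."""
    return [{"title": f"Example Movie ({year})", "avg_rating": 4.5}]

# One API rule per query; parameters are passed as parts of the URL,
# and the JSON result is rendered in the browser window.
@app.route("/best/<int:year>")
def best(year):
    return jsonify(best_movies_of_year(year))

# app.run() would start the server; hitting /best/2012 then returns the JSON result.
```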

Page 13: Cloud computing major project

BACKGROUND

The dataset files are pushed to HDFS for faster access, without any modifications.

For each query, the files are read from HDFS and converted to Spark RDDs (Resilient Distributed Datasets).

RDDs are a logical collection of data partitioned across machines. They can be manipulated in parallel.

The API call is parsed for its parameters, and the corresponding query function is called.

The result of the query is handed over to Flask and displayed in the browser. GraphX has been used for plotting the graph.
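The per-query RDD pipeline described above can be mirrored in plain Python to show the shape of the computation. The lines below are invented ratings records as they would be read from HDFS; in the project these steps run as actual Spark transformations, roughly `sc.textFile(...).map(...).reduceByKey(...).mapValues(...)`:

```python
# Invented ratings lines in the userId,movieId,rating,timestamp layout.
lines = [
    "1,10,4.0,964982703",
    "2,10,3.0,964982931",
    "1,20,5.0,964983815",
]

# map: each line -> (movieid, (rating, 1)), the classic pair-RDD shape
pairs = [(int(m), (float(r), 1)) for u, m, r, t in (ln.split(",") for ln in lines)]

# reduceByKey: sum ratings and counts per movie key
reduced = {}
for movieid, (rating, one) in pairs:
    s, n = reduced.get(movieid, (0.0, 0))
    reduced[movieid] = (s + rating, n + one)

# mapValues: turn each (sum, count) into an average rating
averages = {movieid: s / n for movieid, (s, n) in reduced.items()}
print(averages)  # {10: 3.5, 20: 5.0}
```

The same three steps run in parallel under Spark because each (key, value) pair can be aggregated independently per partition before the per-key merge.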