flume-based independent news aggregator

Flume-based news aggregator service on Amazon EC2 Arinto Murdopo Mário Almeida Zafar Gilani SDS, EMDC 2012

Upload: mario-almeida

Post on 25-May-2015


TRANSCRIPT

Page 1: Flume-based Independent News Aggregator

Flume-based news aggregator service on Amazon EC2

Arinto Murdopo
Mário Almeida
Zafar Gilani

SDS, EMDC 2012

Page 2: Flume-based Independent News Aggregator

Outline

● Introduction
  ○ Cloudera Manager CDH3
  ○ Cloudera Flume
  ○ Hadoop Distributed File System
● Infrastructure setup
● Architecture
● News recommendation
● RSS News aggregator
● Proof of concept
● Issues faced
● Future work
● Conclusions
● References

Page 3: Flume-based Independent News Aggregator

Introduction

● A flume-based independent news aggregator service.

● Using:
  ○ Amazon EC2 IaaS
  ○ Cloudera Manager CDH3
  ○ Cloudera Flume
  ○ Hadoop Distributed File System

Page 4: Flume-based Independent News Aggregator

Cloudera Manager CDH3

● Automates the installation and configuration process of CDH3 on an entire cluster.

● We used the free edition (up to 50 nodes).

Page 5: Flume-based Independent News Aggregator

Cloudera Flume

● A distributed, reliable and available system.
● To efficiently collect, aggregate and move large amounts of log data.
● From many different sources to a centralized or distributed data store (such as Hadoop HDFS).

Page 6: Flume-based Independent News Aggregator

Hadoop HDFS (1/2)

● For our purpose Hadoop handles:
  ○ Log receipt and storage.
  ○ Search and log processing.

● Coordinates work among a cluster of machines.

Page 7: Flume-based Independent News Aggregator

Hadoop HDFS (2/2)

Page 8: Flume-based Independent News Aggregator

Infrastructure setup

● 2 Agent nodes collecting data:
  ○ Source: RSS feed
  ○ Sink: Collector

● 1 Agent node (Collector):
  ○ Source: Agents
  ○ Sink: HDFS

● HDFS NameNode:
  ○ Replicates data to DataNodes 1, 2 and 3.

● Cloudera Manager CDH3 node:
  ○ Managing all our nodes (Agents and HDFS nodes).

Page 9: Flume-based Independent News Aggregator

Architecture

Page 10: Flume-based Independent News Aggregator

News Recommendation

● We hosted a webpage on which people can recommend possible news sources.
  ○ http://web.ist.utl.pt/~ist156947/sds/

● We compiled a large list of news websites and blogs from a variety of countries.
  ○ E.g. Spain, Libya, Russia, Syria, Iran...

Page 11: Flume-based Independent News Aggregator

RSS News aggregator

● We wrote a Java application to read RSS feeds using:
  ○ java.net.URL to handle the resource pointed to by the URL.
  ○ javax.xml.parsers for XML parsing.
  ○ org.w3c.dom, which provides the DOM interfaces used to process XML.
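The slides do not show the reader itself. The following is a minimal sketch of how the three packages listed above can fit together, assuming a plain RSS 2.0 feed in which every `<item>` carries a `<title>`. The class name `RssTitleReader` and the method `readItemTitles` are hypothetical and are not taken from the authors' application.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not the authors' code.
class RssTitleReader {

    // Parses an RSS 2.0 document and returns the <title> text of every <item>.
    static List<String> readItemTitles(InputStream in) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not a parseable RSS document", e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // A feed URL was given: java.net.URL opens the remote resource.
            try (InputStream in = new URL(args[0]).openStream()) {
                readItemTitles(in).forEach(System.out::println);
            }
        } else {
            // No argument: parse a tiny inline sample instead of touching the network.
            String sample = "<rss version=\"2.0\"><channel><title>Sample feed</title>"
                    + "<item><title>First story</title></item>"
                    + "<item><title>Second story</title></item></channel></rss>";
            InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
            readItemTitles(in).forEach(System.out::println);
        }
    }
}
```

Real feeds differ in structure and content, so a production reader would likely need per-source handling on top of a sketch like this.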

Page 12: Flume-based Independent News Aggregator

Proof of concept (1/3)

● Each Agent collects RSS feeds and sends them to the Collector Agent.

Page 13: Flume-based Independent News Aggregator

Proof of concept (2/3)

● The Collector receives the events from both Agents and stores them in HDFS.

Page 14: Flume-based Independent News Aggregator

Proof of concept (3/3)

● Because the replication factor is 3 and we have 3 DataNodes, every DataNode ends up storing the same amount of data.
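The arithmetic behind this observation can be sketched as follows. HDFS keeps at most one replica of a block on any given DataNode, so when the replication factor equals the number of DataNodes, every node holds a full copy of the data. The class `ReplicationMath` and the byte figures are hypothetical, and the per-node value is only an average that assumes replicas are spread evenly.

```java
// Hypothetical arithmetic helper; not part of HDFS or of the authors' code.
class ReplicationMath {

    // Raw bytes stored across the whole cluster for a given amount of logical data.
    static long rawBytes(long logicalBytes, int replication) {
        return logicalBytes * replication;
    }

    // Average bytes stored per DataNode, assuming replicas are spread evenly.
    static long bytesPerDataNode(long logicalBytes, int replication, int dataNodes) {
        if (replication > dataNodes) {
            // Each block can have at most one replica per DataNode.
            throw new IllegalArgumentException("replication exceeds number of DataNodes");
        }
        return rawBytes(logicalBytes, replication) / dataNodes;
    }

    public static void main(String[] args) {
        long logical = 600L; // illustrative size in bytes
        System.out.println("raw bytes in cluster: " + rawBytes(logical, 3));
        System.out.println("bytes per DataNode:   " + bytesPerDataNode(logical, 3, 3));
    }
}
```

With a replication factor of 3 on 3 DataNodes, the per-node figure equals the logical data size, which matches the behaviour described on the slide.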

Page 15: Flume-based Independent News Aggregator

Issues faced (1/4)

● The DataNode setting dfs.datanode.du.reserved is set by default to 10 GB.
  ○ This means that if a DataNode has less than 10 GB of capacity, no space remains available for the file system. (Warning: Not able to place enough replicas)
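A simplified model of this effect is sketched below: the space HDFS may use on a volume is roughly the capacity minus the reserved amount minus what HDFS already stores, floored at zero. This is not the actual DataNode code. The class `ReservedSpace`, the 8 GB volume size, and the 1 GB alternative setting are all assumptions chosen for illustration.

```java
// Simplified, hypothetical model of how reserved space limits HDFS usage.
class ReservedSpace {

    static final long GB = 1024L * 1024L * 1024L;

    // Space left for HDFS blocks on one volume; never negative.
    static long remainingForHdfs(long volumeCapacity, long reserved, long dfsUsed) {
        return Math.max(0L, volumeCapacity - reserved - dfsUsed);
    }

    public static void main(String[] args) {
        long capacity = 8 * GB; // assumed volume size, not from the slides

        // With 10 GB reserved, an 8 GB volume leaves nothing for HDFS.
        System.out.println(remainingForHdfs(capacity, 10 * GB, 0) / GB + " GB available");

        // Lowering the reserved amount (here to an assumed 1 GB) frees space.
        System.out.println(remainingForHdfs(capacity, 1 * GB, 0) / GB + " GB available");
    }
}
```

Under this model, any volume smaller than the reserved amount reports zero usable space, which is consistent with the replica placement warning quoted above.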

Page 16: Flume-based Independent News Aggregator

Issues faced (2/4)

● For CDH Manager to work, all nodes must run either SUSE or Red Hat.

● The CDH Manager cannot run on an AWS EC2 micro instance.

● Upon instance restart, its IP changes.
  ○ So the CDH Manager loses track of the node.

● CDH Manager operates with private DNS names, so any references it makes point to these private names.
  ○ Web UIs are only accessible from our machines' web browsers through public DNS names.

Page 17: Flume-based Independent News Aggregator

Issues faced (3/4)

● Some installation guides forget to mention the ports that must be open to allow communication with the services.
  ○ Cloudera provides a page with all the required ports.

● Creating folders and changing user permissions are not mentioned in the user guide.
  ○ We needed to access Hadoop with the username hdfs, create the flume folder, and change its owner to flume using the chown command. (AccessControlException)

Page 18: Flume-based Independent News Aggregator

Issues faced (4/4)

● Although scaling through the addition of new Agents is easy, it requires fine-tuning each Agent's channel capacity (number of events) and transaction size.
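The slides do not say which channel type was used. As a sketch, assuming a Flume NG memory channel, the two knobs mentioned above correspond to the `capacity` and `transactionCapacity` properties, and the transaction size may not exceed the channel capacity. The Java helper below only assembles and checks such settings. The agent and channel names (`agent1`, `ch1`) and the numeric values are hypothetical.

```java
import java.util.Properties;

// Hypothetical helper that assembles memory-channel settings for one Agent.
class ChannelTuning {

    // Builds the channel properties, rejecting a transaction size larger than the capacity.
    static Properties memoryChannel(String agent, String channel,
                                    int capacity, int transactionCapacity) {
        if (transactionCapacity > capacity) {
            throw new IllegalArgumentException(
                    "transactionCapacity must not exceed capacity");
        }
        String prefix = agent + ".channels." + channel + ".";
        Properties props = new Properties();
        props.setProperty(prefix + "type", "memory");
        // Maximum number of events held in the channel.
        props.setProperty(prefix + "capacity", Integer.toString(capacity));
        // Maximum number of events moved per transaction.
        props.setProperty(prefix + "transactionCapacity", Integer.toString(transactionCapacity));
        return props;
    }

    public static void main(String[] args) {
        // Illustrative values only; suitable numbers depend on feed volume per Agent.
        Properties props = memoryChannel("agent1", "ch1", 10000, 100);
        props.stringPropertyNames().stream().sorted()
                .forEach(k -> System.out.println(k + " = " + props.getProperty(k)));
    }
}
```

Each Agent would need its own values, which is where the tuning effort described on the slide comes from.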

Page 19: Flume-based Independent News Aggregator

Future work

● Expand RSS sources.
● Implement a web UI.
● Provide search services on the HDFS.
● Improve the HDFS load balancing.

Page 20: Flume-based Independent News Aggregator

Conclusions (1/3)

● HDFS default configuration parameters are not suitable for deploying it in AWS EC2.

● Cloudera Manager makes the installation and configuration process much easier!
  ○ but it also introduces a few constraints that might result in higher operating costs.

● Adapting the RSS reader of the agents is not trivial!
  ○ different RSS sources have different contents (e.g. posts with ad banners).

Page 21: Flume-based Independent News Aggregator

Conclusions (2/3)

● Amazon EC2 service is easier to use and more reliable than other cloud providers!
  ○ E.g. PlanetLab.

● Flume's architecture based on streaming data flows makes it easier to add new sources and sinks.
  ○ the service can scale by adding new Agents.

● Flume is horizontally scalable!
  ○ because its performance is proportional to the number of machines on which it is deployed.

Page 22: Flume-based Independent News Aggregator

Conclusions (3/3)

● Fine-tuning Flume's configuration files is not trivial!

● HDFS NameNode is no longer a single point of failure!
  ○ since NameNode replication was introduced. Adding passive NameNodes affects the overall performance of the HDFS cluster though.

Page 24: Flume-based Independent News Aggregator

References (2/2)

● Find more detailed information on our setup and configuration on our personal blogs:
  ○ http://www.aknahs.pt/
  ○ http://www.otnira.com/
  ○ http://115.186.131.91/~zafar/

Page 25: Flume-based Independent News Aggregator

Easter Egg: Issues faced

● One Islamic team member declared love to a female Cloudera member and ended up having to marry her during the project.
  ○ Turns out it was a male.

● One member became angry because another team was using demos in their project and ended up cutting a poor Rastafarian's hair off.
  ○ Turns out that screenshots are better than demos.

● One member managed to get sunburned while doing the project. Before this it was thought that computer scientists would only work in caves.
  ○ Turns out that he just took a very hot shower.

Page 26: Flume-based Independent News Aggregator

Special Thanks

● Leandro Navarro - UPC
● Amazon
● jarcec - #flume on irc.freenode.net
● mids - #cloudera on irc.freenode.net

(@mids106) Hanging out in IRC is useful!

Page 27: Flume-based Independent News Aggregator

News aggregator service on Amazon EC2

Arinto Murdopo
Mário Almeida
Zafar Gilani

SDS, EMDC 2012