Elysium PRO Titles with Abstracts 2018-19
Big data is becoming a research focus in intelligent transportation systems (ITS), which can be seen in
many projects around the world. Intelligent transportation systems will produce a large amount of data.
The produced big data will have profound impacts on the design and application of intelligent
transportation systems, making ITS safer, more efficient, and more profitable. Studying big data
analytics in ITS is a flourishing field. This paper first reviews the history and characteristics of big data
and intelligent transportation systems. The framework of conducting big data analytics in ITS is
discussed next, where the data source and collection methods, data analytics methods and platforms,
and big data analytics application categories are summarized. Several case studies of big data analytics
applications in intelligent transportation systems, including road traffic accidents analysis, road traffic
flow prediction, public transportation service planning, personal travel route planning, rail
transportation management and control, and asset maintenance, are introduced. Finally, this paper discusses some
open challenges of using big data analytics in ITS.
EPRO BD - 001
Big Data Analytics in Intelligent Transportation Systems: A Survey
Virtualized infrastructure in cloud computing has become an attractive target for cyber attackers to
launch advanced attacks. This paper proposes a novel big data based security analytics approach to
detecting advanced attacks in virtualized infrastructures. Network logs as well as user application logs
collected periodically from the guest virtual machines (VMs) are stored in the Hadoop Distributed File
System (HDFS). Then, attack features are extracted through graph-based event correlation and
MapReduce-based parsing that identifies potential attack paths. Next, the presence of an attack is
determined through two-step machine learning: logistic regression is applied to calculate the attack's
conditional probabilities with respect to the attributes, and belief propagation is applied to calculate
the belief in the existence of an attack based on them. Experiments are conducted to
evaluate the proposed approach using well-known malware as well as in comparison with existing
security techniques for virtualized infrastructure. The results show that our proposed approach is
effective in detecting attacks with minimal performance overhead.
EPRO BD - 002
Big Data Based Security Analytics for Protecting Virtualized Infrastructures in Cloud
Computing
Microblogging, a popular social media service platform, has become a new information channel for
users to receive and exchange the most up-to-date information on current events. Consequently, it is a
crucial platform for detecting newly emerging events and for identifying influential spreaders who
have the potential to actively disseminate knowledge about events through microblogs. However,
traditional event detection models require human intervention to determine the number of topics to be
explored, which significantly reduces the efficiency and accuracy of event detection. In addition, most
existing methods focus only on event detection and are unable to identify either influential spreaders
or key event-related posts, thus making it challenging to track momentous events in a timely manner.
To address these problems, we propose a Hypertext-Induced Topic Search (HITS) based Topic-
Decision method (TD-HITS), and a Latent Dirichlet Allocation (LDA) based Three-Step model (TS-
LDA). TD-HITS can automatically detect the number of topics as well as identify associated key posts
in a large number of posts. TS-LDA can identify influential spreaders of hot event topics based on both
post and user information. The experimental results, using a Twitter dataset, demonstrate the
effectiveness of our proposed methods for both detecting events and identifying influential spreaders.
EPRO BD - 003
Event detection and identification of influential spreaders in social media data
streams
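Though not the authors' TD-HITS implementation, the core HITS iteration it builds on can be sketched in a few lines of Python (toy "user reposts post" graph; node names are illustrative):

```python
def hits(edges, iterations=50):
    """Iterative HITS on a directed graph given as (src, dst) edges.
    Returns (hub, authority) score dicts, L2-normalized each round."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of nodes that link in.
        auth = {n: sum(hub[s] for s, d in edges if d == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of nodes linked to.
        hub = {n: sum(auth[d] for s, d in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Illustrative data: users u1..u3 repost posts p1, p2.
edges = [("u1", "p1"), ("u2", "p1"), ("u3", "p1"), ("u3", "p2")]
hub, auth = hits(edges)
key_post = max(auth, key=auth.get)
```

High-authority nodes correspond to key posts, and high-hub nodes to the users who spread them, which is the intuition TD-HITS exploits at scale.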
The proliferation of ubiquitous Internet and mobile devices has brought about the exponential growth
of individual data in big data era. The network user data has been confronted with serious privacy
concerns for extracting valuable information during the process of data mining. Differential privacy
preservation is a new paradigm independent of the adversaries' prior knowledge, which protects
sensitive data while maintaining certain statistical properties by adding random noise. In this paper, we
put forward a differential-privacy-preserving multi-core DBSCAN clustering scheme for network user
data, built on differential privacy and the DBSCAN algorithm, to mitigate privacy leakage in the
process of data mining while preserving clustering quality by adding Laplace noise. We perform
extensive theoretical analysis and simulations to evaluate our scheme, and the results show better
efficiency, accuracy, and privacy preservation than previous schemes.
EPRO BD - 004
DP-MCDBSCAN: Differential Privacy Preserving Multi-Core DBSCAN Clustering
for Network User Data
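The Laplace mechanism the scheme relies on is standard in differential privacy; a minimal sketch on an illustrative count query (plain Python, not the paper's DP-MCDBSCAN code):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5            # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(values, predicate, epsilon, rng):
    """Differentially private count: a count query has sensitivity 1,
    so Laplace noise with scale 1/epsilon yields epsilon-DP."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
data = list(range(100))
# True count of values below 40 is 40; the released answer is perturbed.
noisy = dp_count(data, lambda v: v < 40, epsilon=1.0, rng=rng)
```

Smaller epsilon means larger noise and stronger privacy; the noisy answers remain unbiased, so averages over many queries stay near the truth.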
With the widespread popularity of Internet-enabled devices, there is an exponential increase in the
information sharing among different geographically located smart devices. These smart devices may
be heterogeneous in nature and may use different communication protocols for information sharing
among themselves. Moreover, the data shared may also change with respect to various Vs (volume,
velocity, variety, and value) to categorize it as big data. However, as these devices communicate with
each other using an open channel, the Internet, there is a higher chance of information leakage during
communication. Most of the existing solutions reported in the literature ignore these facts. Keeping
focus on these points, in this article, we propose secure storage, verification, and auditing (SecSVA)
of big data in cloud environment. SecSVA includes the following modules: an attribute-based secure
data deduplication framework for data storage on the cloud, Kerberos-based identity verification and
authentication, and Merkle hash-tree-based trusted third-party auditing on cloud. From the analysis, it
is clear that SecSVA can provide secure third party auditing with integrity preservation across multiple
domains in the cloud environment.
EPRO BD - 005
SecSVA: Secure Storage, Verification, and Auditing of Big Data in the Cloud
Environment
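The Merkle hash-tree idea behind the auditing module can be sketched as a generic root computation (illustrative block data; not SecSVA's actual code):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Compute the Merkle root of a non-empty list of data blocks (bytes).
    Leaves are hashed, then pairs of hashes are hashed level by level."""
    level = [_h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
root = merkle_root(blocks)
# Auditing intuition: tampering with any single block changes the root,
# so a third party holding only the root can detect modification.
tampered = merkle_root([b"block-0", b"block-X", b"block-2", b"block-3"])
```

A verifier needs only O(log n) sibling hashes to check one block against the root, which is what makes third-party auditing of large stores practical.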
With an exponential increase in smart device users, there is an increase in the bulk amount of data
generation from various smart devices, which varies with respect to all the essential V's used to
categorize it as big data. Generally, most service providers, including Google, Amazon, Microsoft and
so on, have deployed a large number of geographically distributed data centers to process this huge
amount of data generated from various smart devices so that users can get quick response time. For
this purpose, Hadoop and Spark are widely used by these service providers for processing large
datasets. However, less emphasis has been given on the underlying infrastructure (the network through
which data flows), which is one of the most important components for successful implementation of
any designed solution in this environment. In the worst case, due to heavy network traffic with respect
to data migrations across different data centers, the underlying network infrastructure may not be able
to transfer data packets from source to destination, resulting in performance degradation. Focusing on
all these issues, in this article, we propose a novel SDN-based big data management approach with
respect to the optimized network resource consumption such as network bandwidth and data storage
units. We analyze various components at both the data and control planes that can enhance the
optimized big data analytics across multiple cloud data centers. For example, we analyze the
performance of the proposed solution using Bloom-filter-based insertion and deletion of an element in
the flow table maintained at the OpenFlow controller.
EPRO BD - 006
Optimized Big Data Management across Multi-Cloud Data Centers: Software-
Defined-Network-Based Analysis
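Since a plain Bloom filter cannot delete entries, a counting variant is one common way to support both the insertion and deletion operations mentioned in the abstract; a sketch with illustrative parameters (not the paper's OpenFlow implementation):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: small counters instead of bits, so flow-table
    entries can be deleted as well as inserted."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def insert(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def delete(self, item):
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def might_contain(self, item):
        # May return a false positive, but never a false negative.
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter()
flow = "10.0.0.1->10.0.0.2:443"        # illustrative flow identifier
cbf.insert(flow)
```

The counters cost a few bits per cell over a plain Bloom filter but make membership updates reversible, which suits frequently churning flow tables.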
Rapid urbanization has posed significant burden on urban transportation infrastructures. In today's
cities, both private and public transits have clear limitations to fulfill passengers' needs for quality of
experience (QoE): Public transits operate along fixed routes with long wait time and total transit time;
Private transits, such as taxis, private shuttles and ride-hailing services, provide point-to-point transits
with high trip fare. In this paper, we propose CityLines, a transformative urban transit system,
employing a hybrid hub-and-spoke transit model with shared shuttles. Analogous to airline services,
the proposed CityLines system routes urban trips among spokes through a few hubs or direct paths,
with travel time as short as private transits and fare as low as public transits. CityLines allows both
point-to-point connection to improve the passenger QoE, and hub-and-spoke connection to reduce the
system operation cost. To evaluate the performance of CityLines, we conduct extensive data-driven
experiments using one-month real-world trip demand data (from taxis, buses and subway trains)
collected from Shenzhen, China. The results demonstrate that CityLines reduces 12.5%-44% average
travel time, and aggregates 8.5%-32.6% more trips with ride-sharing over other implementation
baselines.
EPRO BD - 007
CityLines: Designing Hybrid Hub-and-Spoke Transit System with Urban Big Data
Nowadays machine learning is becoming a new paradigm for mining hidden knowledge in big data.
The collection and manipulation of big data not only create considerable values, but also raise serious
privacy concerns. To protect the huge amount of potentially sensitive data, a straightforward approach
is to encrypt data with specialized cryptographic tools. However, it is challenging to utilize or operate
on encrypted data, especially to perform machine learning algorithms. In this paper, we investigate the
problem of training high quality word vectors over large-scale encrypted data (from distributed data
owners) with the privacy-preserving collaborative neural network learning algorithms. We leverage
and also design a suite of arithmetic primitives (e.g., multiplication, fixed-point representation and
sigmoid function computation) on encrypted data, which serve as components of our construction. We
theoretically analyze the security and efficiency of our proposed construction, and conduct extensive
experiments on representative real-world datasets to verify its practicality and effectiveness.
EPRO BD - 008
Privacy-Preserving Collaborative Model Learning: The Case of Word Vector
Training
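Fixed-point representation, one of the arithmetic primitives the abstract mentions, can be sketched in plain Python; note the paper applies such primitives to encrypted values, while this sketch operates on plaintext integers only:

```python
SCALE = 1 << 16   # 16 fractional bits (an illustrative choice)

def to_fixed(x: float) -> int:
    """Encode a real number as a scaled integer."""
    return int(round(x * SCALE))

def from_fixed(f: int) -> float:
    """Decode a scaled integer back to a float."""
    return f / SCALE

def fixed_mul(a: int, b: int) -> int:
    # The raw product carries 32 fractional bits; rescale back to 16.
    return (a * b) // SCALE

a, b = to_fixed(1.5), to_fixed(-2.25)
prod = from_fixed(fixed_mul(a, b))     # close to -3.375
```

Working over integers is what lets such computations be lifted onto homomorphic or secret-shared ciphertexts, where only integer operations are available.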
Cloud storage auditing schemes for shared data refer to checking the integrity of cloud data shared by
a group of users. User revocation is commonly supported in such schemes, as users may be subject to
group membership changes for various reasons. Previously, the computational overhead for user
revocation in such schemes was linear in the total number of file blocks possessed by a revoked user.
The overhead, however, may become a heavy burden because of the sheer amount of the shared cloud
data. Thus, how to reduce the computational overhead caused by user revocations becomes a key
research challenge for achieving practical cloud data auditing. In this paper, we propose a novel storage
auditing scheme that achieves highly-efficient user revocation independent of the total number of file
blocks possessed by the revoked user in the cloud. This is achieved by exploring a novel strategy for
key generation and a new private key update technique. Using this strategy and the technique, we
realize user revocation by just updating the nonrevoked group users' private keys rather than
authenticators of the revoked user. The integrity auditing of the revoked user's data can still be correctly
performed when the authenticators are not updated. Meanwhile, the proposed scheme is based on
identity-based cryptography, which eliminates the complicated certificate management in traditional
Public Key Infrastructure (PKI) systems. The security and efficiency of the proposed scheme are
validated via both analysis and experimental results.
EPRO BD - 009
Enabling Efficient User Revocation in Identity-based Cloud Storage Auditing for
Shared Big Data
To address the problem of efficiently operating highly virtualized cloud centers by extending the
life cycle of physical machines under grouped task flows or Big Data stream processing, we propose
a new algorithm for clustering and for selecting the main cluster node using fuzzy logic and Voronoi
diagrams. The criteria for selecting the main cluster node are the centrality of the Voronoi diagram,
residual energy, and the availability of free resources on the physical machines.
EPRO BD - 010
Clustering model of cloud centers for big data processing
Location prediction is the key technique in many location based services including route navigation,
dining location recommendations, and traffic planning and control, to mention a few. This survey
provides a comprehensive overview of location prediction, including basic definitions and concepts,
algorithms, and applications. First, we introduce the types of trajectory data and related basic concepts.
Then, we review existing location-prediction methods, ranging from temporal-pattern-based prediction
to spatiotemporal-pattern-based prediction. We also discuss and analyze the advantages and
disadvantages of these algorithms and briefly summarize current applications of location prediction in
diverse fields. Finally, we identify the potential challenges and future research directions in location
prediction.
EPRO BD - 011
Location prediction on trajectory data: A review
Pattern mining is one of the most important tasks to extract meaningful and useful information from
raw data. This task aims to extract item-sets that represent any type of homogeneity and regularity in
data. Although many efficient algorithms have been developed in this regard, the growing interest in
data has caused the performance of existing pattern mining techniques to drop. The goal of this
paper is to propose new efficient pattern mining algorithms to work in big data. To this aim, a series
of algorithms based on the MapReduce framework and the Hadoop open-source implementation have
been proposed. The proposed algorithms can be divided into three main groups. First, two algorithms
[Apriori MapReduce (AprioriMR) and iterative AprioriMR] with no pruning strategy are proposed,
which extract any existing item-set in data. Second, two algorithms (space pruning AprioriMR and top
AprioriMR) that prune the search space by means of the well-known anti-monotone property are
proposed. Finally, a last algorithm (maximal AprioriMR) is also proposed for mining condensed
representations of frequent patterns. To test the performance of the proposed algorithms, a varied
collection of big data datasets has been considered, comprising up to 3·10¹⁸ transactions and
more than 5 million distinct single items. The experimental stage includes comparisons against
highly efficient and well-known pattern mining algorithms. Results reveal the interest of applying
MapReduce versions when complex problems are considered, and also the unsuitability of this
paradigm when dealing with small data.
EPRO BD - 012
Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data
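The map/reduce split behind AprioriMR-style candidate counting can be sketched without a Hadoop cluster, using in-memory stand-ins for the two phases (toy transactions; not the paper's implementation):

```python
from itertools import combinations
from collections import defaultdict

def map_phase(transaction, k):
    """Map: emit (itemset, 1) for every k-itemset found in one transaction."""
    return [(frozenset(c), 1) for c in combinations(sorted(transaction), k)]

def reduce_phase(pairs, min_support):
    """Reduce: sum counts per itemset and keep those meeting min_support."""
    counts = defaultdict(int)
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= min_support}

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
# Shuffle step collapses to a flat list here; Hadoop would group by key.
pairs = [p for t in transactions for p in map_phase(t, 2)]
frequent_2 = reduce_phase(pairs, min_support=2)
```

In the pruned variants described in the abstract, the anti-monotone property would discard candidate k-itemsets whose (k-1)-subsets are already infrequent before the map phase emits them.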
Social Internet of Things (SIoT) supports many novel applications and networking services for the IoT
in a more powerful and productive way. In this paper, we have introduced a hierarchical framework
for feature extraction in SIoT big data using a MapReduce framework along with a supervised classifier
model. Moreover, a Gabor filter is used to reduce noise and unwanted data from the database, and
Hadoop MapReduce has been used for mapping and reducing big databases to improve the efficiency
of the proposed work. Furthermore, feature selection has been performed on the filtered data set by
using Elephant Herd Optimization. The proposed system architecture has been implemented using
Linear Kernel Support Vector Machine-based classifier to classify the data and for predicting the
efficiency of the proposed work. From the results, the maximum accuracy, specificity, and sensitivity
of our work are 98.2%, 85.88%, and 80%, respectively; we also analyzed time and memory usage, and
these results have been compared with the existing literature.
EPRO BD - 013
Effective Features to Classify Big Data Using Social Internet of Things
The advances in smart grids are enabling huge amount of data to be aggregated and analyzed for various
smart grid applications. However, the traditional smart grid data management systems cannot scale and
provide sufficient storage and processing capabilities. To address these challenges, this paper presents
a smart grid big data eco-system based on the state-of-the-art Lambda architecture that is capable of
performing parallel batch and real-time operations on distributed data. Further, the presented
eco-system utilizes a Hadoop Big Data Lake to store various types of smart grid data, including smart
meter, image, and video data. An implementation of the smart grid big data eco-system on a cloud computing
platform is presented. To test the capability of the presented eco-system, real-time visualization and
data mining applications were performed on real smart grid data. The results of those applications on
top of the eco-system suggest that it is capable of performing numerous smart grid big data analytics.
EPRO BD - 014
Data Lake Lambda Architecture for Smart Grids Big Data Analytics
In this big data era, knowledge becomes increasingly linked, along with the rapid growth in data
volume. Connected knowledge is naturally represented and stored as knowledge graphs, which are of
great importance for many frontier research areas. Effectively finding relations between entities in
large knowledge graphs plays a key role in many knowledge graph applications, as the most valuable
part of a knowledge graph is its rich connectedness, which captures rich information about the real-
world objects. However, due to the intrinsic complexity of real-world knowledge, finding semantically
close relations by navigation in a large knowledge graph is challenging. Canonical graph exploration
methods inevitably result in combinatorial explosion especially when the paths connecting two entities
are long: the search space is O(d^l), where d is the average graph node degree and l is the path length.
In this paper, we will systematically study the semantic navigation problem for large knowledge
graphs. Inspired by AlphaGo, which was overwhelmingly successful in the game of Go, we designed an
efficient semantic navigation method based on a well-tailored Monte Carlo Tree Search algorithm that
takes the unique characteristics of knowledge graphs into account. Extensive experiments show that our
method is not only effective but also very efficient.
EPRO BD - 015
Neurally-Guided Semantic Navigation in Knowledge Graph
Modern microprocessors offer a rich memory hierarchy including various levels of cache and registers. Some
of these memories (like main memory, L3 cache) are big but slow and shared among all cores. Others (registers,
L1 cache) are fast and exclusively assigned to a single core, but small. Only if data accesses have high
locality can we avoid excessive data transfers within the memory hierarchy. In this paper we consider
fundamental algorithms like matrix multiplication, K-Means, Cholesky decomposition as well as the algorithm
by Floyd and Warshall typically operating in two or three nested loops. We propose to traverse these loops
whenever possible not in the canonical order but in an order defined by a space-filling curve. This traversal order
dramatically improves data locality over a wide granularity allowing not only to efficiently support a cache of a
single, known size (cache conscious) but also a hierarchy of various caches where the effective size available to
our algorithms may even be unknown (cache oblivious). We propose a new space-filling curve called Fast
Unrestricted (FUR) Hilbert with the following advantages: (1) we overcome the usual limitation to square-like
grid sizes where the side-length is a power of 2 or 3. Instead, our approach allows arbitrary loop boundaries for
all variables. (2) FUR-Hilbert is non-recursive with a guaranteed constant worst case time complexity per loop
iteration (in contrast to O(log(gridsize)) for previous methods). (3) Our non-recursive approach makes the
application of our cache-oblivious loops in any host algorithm as easy as conventional loops and facilitates
automatic optimization by the compiler. (4) We demonstrate that crucial algorithms like Cholesky
decomposition as well as the algorithm by Floyd and Warshall can be efficiently supported. (5) Extensive
experiments on runtime efficiency, cache usage and energy consumption demonstrate the profit of our approach.
We believe that future compilers could translate nested loops into cache-oblivious loops, either fully
automatically or through a user-guided analysis of the data.
EPRO BD - 016
A Novel Hilbert Curve for Cache-locality Preserving Loops
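For contrast with FUR-Hilbert, the classic Hilbert-curve conversion, which is restricted to power-of-two square grids (the very limitation the paper removes), looks like this:

```python
def d2xy(n, d):
    """Convert distance d along the Hilbert curve into (x, y) on an n-by-n
    grid, n a power of two. Classic iterative algorithm: at each scale,
    pick a quadrant and rotate/reflect the sub-curve into place."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                    # rotate the quadrant if needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

n = 8
# Visiting an n x n grid in curve order instead of row-major order keeps
# consecutively accessed cells spatially close, improving cache locality.
order = [d2xy(n, d) for d in range(n * n)]
```

Every cell is visited exactly once and consecutive cells are grid neighbors, which is the locality property the loop transformation exploits; FUR-Hilbert additionally lifts the power-of-two restriction and avoids recursion.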
Hadoop has become a popular framework for processing data-intensive applications in cloud
environments. A core constituent of Hadoop is the scheduler, which is responsible for scheduling and
monitoring the jobs and tasks, and rescheduling them in case of failures. Although fault-tolerance
mechanisms have been proposed for Hadoop, the performance of Hadoop can be significantly impacted
by unforeseen events in the cloud environment. In this paper, we introduce a dynamic and failure-
aware framework that can be integrated within Hadoop scheduler and adjust the scheduling decisions
based on collected information about the cloud environment. Our framework relies on predictions made
by machine learning algorithms and scheduling policies generated by a Markovian Decision Process
(MDP), to adjust its scheduling decisions on the fly. Instead of the fixed heartbeat-based failure
detection commonly used in Hadoop to track active TaskTrackers (i.e., nodes that process the
scheduled tasks), our proposed framework implements an adaptive algorithm that can dynamically
detect the failures of the TaskTrackers. To deploy our proposed framework, we have built ATLAS+,
an AdapTive Failure-Aware Scheduler for Hadoop. To assess the performance of ATLAS+, we
conduct a large empirical study on a 100-node Hadoop cluster deployed on Amazon Elastic
MapReduce (EMR), comparing the performance of ATLAS+ with those of three Hadoop schedulers
(FIFO, Fair, and Capacity).
EPRO BD - 017
A Dynamic and Failure-aware Task Scheduling Framework for Hadoop
Fuzzy associative classification has not been widely analyzed in the literature, although associative
classifiers (ACs) have proved to be very effective in different real domain applications. The main
reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To
overcome this drawback, in this paper, we propose an efficient distributed fuzzy associative
classification approach based on the MapReduce paradigm. The approach exploits a novel distributed
discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes. Then, a
set of candidate fuzzy association rules is generated by employing a distributed fuzzy extension of the
well-known FP-Growth algorithm. Finally, this set is pruned by using three purposely adapted types
of pruning. We implemented our approach on the popular Hadoop framework. Hadoop allows
distributing storage and processing of very large data sets on computer clusters built from commodity
hardware. We have performed an extensive experimentation and a detailed analysis of the results using
six very large datasets with up to 11,000,000 instances. We have also experimented with different types of
reasoning methods. Focusing on accuracy, model complexity, computation time, and scalability, we
compare the results achieved by our approach with those obtained by two distributed nonfuzzy ACs
recently proposed in the literature. We highlight that, although the accuracies are comparable, the
complexity of the classifiers generated by the distributed fuzzy approach, evaluated in terms of the
number of rules, is lower than that of the nonfuzzy classifiers.
EPRO BD - 018
A Distributed Fuzzy Associative Classifier for Big Data
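Uniform triangular fuzzy partitions of a numeric attribute, the kind of partitioning such a discretizer produces, can be sketched as follows (parameters illustrative; the paper's discretizer places partitions by fuzzy entropy rather than uniformly):

```python
def triangular_partitions(lo, hi, n=3):
    """Build n uniformly spaced triangular fuzzy sets over [lo, hi].
    Returns the set centers and one membership function per set."""
    step = (hi - lo) / (n - 1)
    centers = [lo + i * step for i in range(n)]

    def make(c):
        def mu(x):
            # Triangle peaking at c, reaching 0 at the neighboring centers.
            return max(0.0, 1.0 - abs(x - c) / step)
        return mu

    return centers, [make(c) for c in centers]

centers, mus = triangular_partitions(0.0, 10.0, n=3)   # centers 0, 5, 10
# A crisp value now has a degree of membership in each fuzzy set:
degrees = [mu(3.0) for mu in mus]
```

With uniformly spaced triangles the memberships at any point sum to 1 (a strong fuzzy partition), so each attribute value contributes fractionally to the rules whose antecedents it partially matches.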
Whereas traditional scientific applications are computationally intensive, recent applications require
more data-intensive analysis and visualization to extract knowledge from the explosive growth of
scientific information and simulation data. As the computational power and size of compute clusters
continue to increase, the I/O read rates and associated network for these data-intensive applications
have been unable to keep pace. These applications suffer from long I/O latency due to the movement
of “big data” from the network/parallel file system, which results in a serious performance bottleneck.
To address this problem, we propose a novel approach called “ODDS” to optimize data-locality
access in scientific data analysis and visualization. ODDS leverages a distributed file system (DFS) to
provide scalable data access for scientific analysis. Through exploiting the information of underlying
data distribution in DFS, ODDS employs a novel data-locality scheduler to transform a compute-
centric mapping into a data-centric one and enables each computational process to access the needed
data from a local or nearby storage node. ODDS is suitable for parallel applications with dynamic
process-to-data scheduling and for applications with static process-to-data assignment.
EPRO BD - 020
ODDS: Optimizing Data-locality Access for Scientific Data Analysis
In this paper, we propose a system infrastructure to construct the big scholar data as a large knowledge
graph, discover the meta paths between the entities, and calculate the relevancy between entities in the
graph. The core infrastructure is established on the secure and private Amazon Elastic Compute
Cloud (Amazon EC2) platform. The infrastructure distributes the data evenly across the repositories,
processes the data in parallel by utilizing the open-source Spark framework, manages computing
resources optimally by utilizing YARN and Hadoop HDFS, and discovers relationships in a distributed manner between
different types of entities. We incorporate four relationship discovery tasks including citation
recommendation, potential collaborator discovery, similar venue measurement and paper to venue
recommendation on top of this infrastructure. For relationship mining tasks, we propose a mixed and
weighted meta path (MWMP) method to explore the potential relationship between different types of
entities. To verify the accuracy and measure parallelization speedup of our algorithm, we set up clusters
through Amazon EC2 platform.
EPRO BD - 019
Distributed Relationship Mining over Big Scholar Data
While Hadoop ecosystems become increasingly important for practitioners of large-scale data analysis,
they also incur tremendous energy cost. This trend is driving up the need for designing energy-efficient
Hadoop clusters in order to reduce the operational costs and the carbon emission associated with its
energy consumption. However, despite extensive studies of the problem, existing approaches for
energy efficiency have not fully considered the heterogeneity of both workload and machine hardware
found in production environments. In this paper, we find that heterogeneity-oblivious task assignment
approaches are detrimental to both performance and energy efficiency of Hadoop clusters. Our
observation shows that even heterogeneity-aware techniques that aim to reduce the job completion
time do not guarantee a reduction in energy consumption of heterogeneous machines. We propose a
heterogeneity-aware task assignment approach, E-Ant, that aims to improve the overall energy
consumption in a heterogeneous Hadoop cluster without sacrificing job performance. It adaptively
schedules heterogeneous workloads on energy-efficient machines, without a priori knowledge of the
workload properties. E-Ant employs an ant colony optimization approach that generates task
assignment solutions based on the feedback of each task's energy consumption reported by Hadoop
TaskTrackers in an agile way. Furthermore, we integrate the DVFS technique with E-Ant, which
further improves the energy efficiency of heterogeneous Hadoop clusters by 23 and 17 percent
compared to Fair Scheduler and Tarazu, respectively.
EPRO BD - 022
Energy Efficiency Aware Task Assignment with DVFS in Heterogeneous Hadoop
Clusters
We propose a sampling-based framework for privacy-preserving approximate data search in the
context of big data. The framework is designed to bridge multi-target query needs from users and the
data platform, including required query accuracy, timeliness, and query privacy constraints. A novel
privacy metric, (ε, δ)-approximation, is presented to uniformly measure accuracy, efficiency and
privacy breach risk. Based on this, we employ bootstrapping to efficiently produce approximate results
that meet the preset query requirements. Moreover, we propose a quick response mechanism to deal
with homogeneous queries, and discuss the reuse of results when new data are appended. Theoretical
analyses and experimental results demonstrate that the framework is capable of effectively fulfilling
multi-target query requirements with high efficiency and accuracy.
EPRO BD - 023
Hermes: A Privacy-Preserving Approximate Search Framework for Big Data
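The bootstrapping step such a framework employs can be sketched as a plain resampling routine (toy sample; none of the (ε, δ)-approximation machinery from the paper):

```python
import random

def bootstrap_mean_ci(sample, num_resamples=1000, alpha=0.05, seed=0):
    """Approximate a (1 - alpha) confidence interval for the mean by
    resampling the data with replacement and taking empirical quantiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(sample) for _ in sample) / len(sample)
        for _ in range(num_resamples)
    )
    lo = means[int(alpha / 2 * num_resamples)]
    hi = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi

sample = list(range(100))       # toy data; the true mean is 49.5
lo, hi = bootstrap_mean_ci(sample)
```

The width of the interval quantifies accuracy, so a query planner can stop sampling as soon as the interval is tight enough for the user's preset accuracy requirement.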
This paper proposes an alternative ranking system based on social network metrics and their evaluation
in a composite distributed framework. The most important aspect of the social network domain is its
voluminous data. Predictive analysis of huge sets of raw data can be effective for identifying key
trends, but such analytics are generally very compute-intensive. The most efficient way to handle such
compute-intensive analysis is to map the problem onto a distributed domain. The main contributions of
this paper are twofold. First, it analyzes the cricket community from the viewpoint of a social network
and ranks players and countries based on graph centrality measures. Second, it proposes a
comprehensive distributed framework to offer infrastructural support for large-scale data analysis as
well as graph processing. Hadoop is an open-source framework for storing and processing large data
sets in a cloud environment, and MapReduce is a popular programming model for processing that data
in a distributed manner. However, MapReduce alone is not efficient for graph processing; Giraph is an
alternative programming paradigm for graph processing on Hadoop. Using a practical case study of
social network analysis of the cricket community, this paper captures the significance of the alternative
ranking in this sport and shows the effectiveness of the proposed framework in analyzing voluminous
data.
EPRO BD -
024
Social Network Analysis of Cricket Community Using a Composite Distributed
Framework: From Implementation Viewpoint
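To make the centrality-based ranking concrete, here is a sketch using degree centrality, the simplest of the graph centrality measures such a study evaluates. The edge list (e.g., players who appeared in the same match) and all names are hypothetical; the paper itself runs such computations at scale on Giraph.

```python
def rank_by_degree(edges):
    """Rank the nodes of an undirected interaction graph by degree
    centrality (number of incident edges), highest first; ties are
    broken alphabetically for a deterministic ordering."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sorted(deg, key=lambda n: (-deg[n], n))
```

In a distributed setting, the same per-node counting maps naturally onto a vertex-centric Giraph program, which is why graph-native paradigms beat plain MapReduce for this workload.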
The capability of switching into the islanded operation mode has been advocated as a viable way for
microgrids to achieve high system reliability. This paper proposes a new model for the optimal
scheduling and load curtailment problem of microgrids. The proposed problem determines the optimal
schedule for the local generators of microgrids so as to minimize the generation cost of the associated
distribution system in normal operation. Moreover, when microgrids must switch into the islanded
operation mode for reliability reasons, the optimal generation solution still guarantees a minimal
amount of load curtailment. Due to the large number of constraints in both normal and islanded
operations, the formulated problem becomes a large-scale optimization problem that is very challenging
to solve with a centralized computational method. Therefore, we propose a decomposition algorithm
based on the alternating direction method of multipliers (ADMM) that provides a parallel
computational framework. The simulation results demonstrate the efficiency of the proposed model in
reducing generation cost and guaranteeing reliable operation of microgrids in the islanded mode.
Finally, we describe in detail a parallel implementation of the proposed algorithm on a computer
cluster using the Hadoop MapReduce software framework.
EPRO BD -
025
A Big Data Scale Algorithm for Optimal Scheduling of Integrated Microgrids
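The decomposition pattern behind ADMM can be sketched on a toy consensus problem: each agent i minimizes a local quadratic cost (x_i - a_i)^2 while a shared variable z couples them, so the local x-updates run in parallel (the step that MapReduce would distribute). This is a generic consensus-ADMM example, not the paper's microgrid model, and the cost function is chosen only because its updates have closed form.

```python
def consensus_admm(a, rho=1.0, iters=100):
    """Consensus ADMM for min sum_i (x_i - a[i])^2 s.t. x_i = z.
    Alternates parallel local x-updates, a coupling z-update, and
    scaled dual (u) updates; converges to z = mean(a)."""
    n = len(a)
    x = [0.0] * n
    u = [0.0] * n   # scaled dual variables, one per agent
    z = 0.0
    for _ in range(iters):
        # local x-update: argmin (x_i - a_i)^2 + (rho/2)(x_i - z + u_i)^2
        x = [(2 * a[i] + rho * (z - u[i])) / (2 + rho) for i in range(n)]
        # consensus z-update gathers the local results
        z = sum(x[i] + u[i] for i in range(n)) / n
        # dual update penalizes remaining disagreement
        u = [u[i] + x[i] - z for i in range(n)]
    return z
```

In the paper's setting each "agent" would be a microgrid subproblem with its own operational constraints, but the alternation of parallel local solves and a cheap coordination step is the same.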
To address the computing challenge of 'big data', a number of data-intensive computing frameworks
(e.g., MapReduce, Dryad, Storm, and Spark) have emerged and become popular. YARN is a de facto
resource management platform that enables these frameworks to run together in a shared system.
However, we observe that, in a cloud computing environment, the fair resource allocation policy
implemented in YARN is unsuitable: its memoryless resource allocation violates a number of good
properties of shared computing systems. This paper attempts to address these problems for YARN,
considering both single-level and hierarchical resource allocation. For single-level resource allocation,
we propose a novel fair resource allocation mechanism called Long-Term Resource Fairness (LTRF).
For hierarchical resource allocation, we propose Hierarchical Long-Term Resource Fairness (H-LTRF)
by extending LTRF. We show that both LTRF and H-LTRF address the fairness problems of the
current resource allocation policy and are thus suitable for cloud computing. Finally, we have
developed LTYARN by implementing LTRF and H-LTRF in YARN, and our experiments show that it
achieves better resource fairness than the existing fair schedulers of YARN.
EPRO BD -
026
Fair Resource Allocation for Data-Intensive Computing in the Cloud
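The contrast with memoryless fair sharing can be sketched as follows: instead of equalizing instantaneous shares, grant each resource unit to the user with the smallest cumulative allocation so far. This is an illustrative reading of the long-term fairness idea, not LTRF's actual mechanism; the function names and the unit-by-unit granting loop are assumptions.

```python
def ltrf_allocate(demands, history, capacity):
    """Grant `capacity` resource units one at a time, always to the
    demanding user with the smallest cumulative allocation (history),
    so users who yielded resources earlier are paid back later."""
    grants = {u: 0 for u in demands}
    hist = dict(history)
    for _ in range(capacity):
        # users whose demand is not yet satisfied, poorest-in-history first
        want = [u for u in demands if grants[u] < demands[u]]
        if not want:
            break
        u = min(want, key=lambda x: (hist[x], x))
        grants[u] += 1
        hist[u] += 1
    return grants
```

A memoryless scheduler would split the four units evenly in the usage example below; accounting for history instead compensates the user who previously received nothing, which is the property the paper argues instantaneous fairness cannot provide.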
In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the
Hadoop file system (HDFS). Our experiments show that the distribution of sub-datasets over HDFS
blocks, which HDFS hides, can often cause the corresponding analyses to suffer from seriously
imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets causes
some computational nodes to carry out much more work than others; furthermore, it leads to
inefficient sampling of sub-datasets, as analysis programs often read large amounts of irrelevant data.
We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling
occur. We then propose a storage-distribution-aware method, referred to as DataNet, to optimize
sub-dataset analysis over distributed storage systems. First, we propose an efficient algorithm to obtain
the meta-data of sub-dataset distributions. Second, we design an elastic storage structure called
ElasticMap, based on HashMap and BloomFilter techniques, to store the meta-data. Third, we employ
distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel
execution. The proposed method can benefit different sub-dataset analyses with various computational
requirements. Experiments conducted on PRObE's Marmot 128-node cluster testbed show the
performance benefits of DataNet.
EPRO BD -
027
Speed Up Big Data Analytics by Unveiling the Storage Distribution of Sub-Datasets
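The HashMap-plus-BloomFilter combination can be sketched as follows: sub-datasets that dominate a block are recorded exactly in a hash map, while the long tail of small sub-datasets is recorded compactly in a per-block Bloom filter (false positives possible, false negatives not). This is a hypothetical reading of the ElasticMap idea; the class names, threshold, and filter sizing are all illustrative, not the paper's implementation.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash probes into a bit array; membership
    tests may yield false positives but never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _probes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._probes(item))

class ElasticMap:
    """Per-block sub-dataset index: frequent sub-datasets go into an
    exact hash map, rare ones into a compact per-block Bloom filter."""
    def __init__(self, threshold=0.1):
        self.threshold, self.exact, self.approx = threshold, {}, {}

    def index_block(self, block_id, records):
        counts = {}
        for sub in records:
            counts[sub] = counts.get(sub, 0) + 1
        bf = BloomFilter()
        for sub, c in counts.items():
            if c / len(records) >= self.threshold:
                self.exact.setdefault(sub, []).append(block_id)
            else:
                bf.add(sub)
        self.approx[block_id] = bf

    def blocks_for(self, sub):
        """Return (blocks known to hold sub, blocks that may hold it)."""
        exact = self.exact.get(sub, [])
        maybe = [b for b, bf in self.approx.items()
                 if b not in exact and sub in bf]
        return exact, maybe
```

The elasticity is the point: exact entries keep lookups precise for the sub-datasets that drive load imbalance, while the Bloom filters keep the meta-data small enough to hold for every block.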
Cloud computing has arisen as the mainstream platform of the utility computing paradigm: it offers a
reliable and robust infrastructure for storing data remotely and provides on-demand applications and
services. Organizations that produce huge volumes of sensitive data now leverage data outsourcing to
reduce the burden of local data storage and maintenance. The outsourced data in the cloud, however,
are not always trustworthy, because data owners lack physical control over them. To address this
issue, researchers have focused on mitigating the security threats by designing remote data checking
(RDC) techniques. The majority of these techniques, however, are inapplicable to big data storage
because of the huge computation cost they incur on the user and cloud sides. Existing schemes suffer
from the data dynamicity problem in two ways. First, some are applicable only to static archive data
and cannot audit dynamically updated outsourced data. Second, although some existing methods do
support dynamic data updates, a growing number of update operations imposes high computation and
communication costs on the auditor due to the maintenance of a data structure, i.e., a Merkle hash tree.
This paper presents an efficient RDC method based on algebraic properties of the outsourced files in
cloud computing, which incurs minimal computation and communication cost. The main contribution
of this paper is a new data structure, called the Divide and Conquer Table (D&CT), which proficiently
supports dynamic data for normal file sizes. Moreover, this data structure makes our method applicable
to large-scale data storage with minimum computation cost.
EPRO BD -
028
Auditing Big Data Storage in Cloud Computing Using Divide and Conquer Tables
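The motivation for a table-based structure over a Merkle hash tree can be sketched as follows: keep per-block metadata (tag, version) in fixed-size sub-tables, so an update touches a single entry and an insert only opens a new sub-table, with no global tree rebuild. This is a hypothetical illustration of the divide-and-conquer idea; the class, its fields, and the sub-table policy are invented for the sketch and are not the paper's D&CT specification.

```python
class DCTable:
    """Per-block audit metadata split across fixed-size sub-tables.
    Updates are O(1) within a sub-table (bump a version), unlike a
    Merkle hash tree where updates ripple up to the root."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.tables = [[]]          # list of sub-tables of block entries

    def insert(self, block_tag):
        t = self.tables[-1]
        if len(t) == self.capacity:  # divide: start a fresh sub-table
            t = []
            self.tables.append(t)
        t.append({"tag": block_tag, "version": 1})

    def update(self, i, new_tag):
        # locate the sub-table holding global block index i
        t, j = divmod(i, self.capacity)
        entry = self.tables[t][j]
        entry["tag"], entry["version"] = new_tag, entry["version"] + 1

    def lookup(self, i):
        t, j = divmod(i, self.capacity)
        return self.tables[t][j]
```

The versions let an auditor detect replay of stale blocks, while the bounded sub-table size keeps the per-update maintenance cost constant as the file grows, which is the property the abstract claims for D&CT.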