Elysium PRO Titles with Abstracts 2018-19
Big data is becoming a research focus in intelligent transportation systems (ITS), which can be seen in
many projects around the world. Intelligent transportation systems will produce a large amount of data.
The produced big data will have profound impacts on the design and application of intelligent
transportation systems, making ITS safer, more efficient, and more profitable. Studying big data
analytics in ITS is a flourishing field. This paper first reviews the history and characteristics of big data
and intelligent transportation systems. The framework of conducting big data analytics in ITS is
discussed next, where the data source and collection methods, data analytics methods and platforms,
and big data analytics application categories are summarized. Several case studies of big data analytics
applications in intelligent transportation systems, including road traffic accidents analysis, road traffic
flow prediction, public transportation service planning, personal travel route planning, rail
transportation management and control, and asset maintenance, are introduced. Finally, this paper discusses some
open challenges of using big data analytics in ITS.
EPRO BD - 001
Big Data Analytics in Intelligent Transportation Systems: A Survey
Virtualized infrastructure in cloud computing has become an attractive target for cyber attackers to
launch advanced attacks. This paper proposes a novel big data based security analytics approach to
detecting advanced attacks in virtualized infrastructures. Network logs as well as user application logs
collected periodically from the guest virtual machines (VMs) are stored in the Hadoop Distributed File
System (HDFS). Then, attack features are extracted through graph-based event correlation and
MapReduce-based parsing that identifies potential attack paths. Next, the presence of an attack is
determined through two-step machine learning: logistic regression is applied to calculate the attack's
conditional probabilities with respect to the attributes, and belief propagation is applied to calculate
the belief in the existence of an attack based on them. Experiments are conducted to
evaluate the proposed approach using well-known malware as well as in comparison with existing
security techniques for virtualized infrastructure. The results show that our proposed approach is
effective in detecting attacks with minimal performance overhead.
EPRO BD - 002
Big Data Based Security Analytics for Protecting Virtualized Infrastructures in Cloud
Computing
Microblogging, a popular social media service platform, has become a new information channel for
users to receive and exchange the most up-to-date information on current events. Consequently, it is a
crucial platform for detecting newly emerging events and for identifying influential spreaders who
have the potential to actively disseminate knowledge about events through microblogs. However,
traditional event detection models require human intervention to determine the number of topics to be
explored, which significantly reduces the efficiency and accuracy of event detection. In addition, most
existing methods focus only on event detection and are unable to identify either influential spreaders
or key event-related posts, thus making it challenging to track momentous events in a timely manner.
To address these problems, we propose a Hypertext-Induced Topic Search (HITS) based Topic-
Decision method (TD-HITS), and a Latent Dirichlet Allocation (LDA) based Three-Step model (TS-
LDA). TD-HITS can automatically detect the number of topics as well as identify associated key posts
in a large number of posts. TS-LDA can identify influential spreaders of hot event topics based on both
post and user information. The experimental results, using a Twitter dataset, demonstrate the
effectiveness of our proposed methods for both detecting events and identifying influential spreaders.
EPRO BD - 003
Event detection and identification of influential spreaders in social media data
streams
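Though not the authors' TD-HITS implementation, the core HITS iteration it builds on can be sketched in a few lines of Python (toy "user reposts post" graph; node names are illustrative):

```python
def hits(edges, iterations=50):
    """Iterative HITS on a directed graph given as (src, dst) edges.
    Returns (hub, authority) score dicts, L2-normalized each round."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of nodes that link in.
        auth = {n: sum(hub[s] for s, d in edges if d == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of nodes linked to.
        hub = {n: sum(auth[d] for s, d in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Illustrative data: users u1..u3 repost posts p1, p2.
edges = [("u1", "p1"), ("u2", "p1"), ("u3", "p1"), ("u3", "p2")]
hub, auth = hits(edges)
key_post = max(auth, key=auth.get)
```

High-authority nodes correspond to key posts, and high-hub nodes to the users who spread them, which is the intuition TD-HITS exploits at scale.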
The proliferation of ubiquitous Internet and mobile devices has brought about the exponential growth
of individual data in big data era. The network user data has been confronted with serious privacy
concerns for extracting valuable information during the process of data mining. Differential privacy
preservation is a new paradigm independent of the adversaries' prior knowledge, which protects
sensitive data while maintaining certain statistical properties by adding random noise. In this paper, we
put forward a differential-privacy-preserving multi-core DBSCAN clustering scheme for network user
data, built on differential privacy and the DBSCAN algorithm, to mitigate privacy leakage in the
process of data mining while preserving clustering quality by adding Laplace noise. We perform
extensive theoretical analysis and simulations to evaluate our scheme, and the results show better
efficiency, accuracy, and privacy preservation than previous schemes.
EPRO BD - 004
DP-MCDBSCAN: Differential Privacy Preserving Multi-Core DBSCAN Clustering
for Network User Data
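The Laplace mechanism the scheme relies on is standard in differential privacy; a minimal sketch on an illustrative count query (plain Python, not the paper's DP-MCDBSCAN code):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5            # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(values, predicate, epsilon, rng):
    """Differentially private count: a count query has sensitivity 1,
    so Laplace noise with scale 1/epsilon yields epsilon-DP."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
data = list(range(100))
# True count of values below 40 is 40; the released answer is perturbed.
noisy = dp_count(data, lambda v: v < 40, epsilon=1.0, rng=rng)
```

Smaller epsilon means larger noise and stronger privacy; the noisy answers remain unbiased, so averages over many queries stay near the truth.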
With the widespread popularity of Internet-enabled devices, there is an exponential increase in the
information sharing among different geographically located smart devices. These smart devices may
be heterogeneous in nature and may use different communication protocols for information sharing
among themselves. Moreover, the data shared may also change with respect to various Vs (volume,
velocity, variety, and value) to categorize it as big data. However, as these devices communicate with
each other using an open channel, the Internet, there is a higher chance of information leakage during
communication. Most of the existing solutions reported in the literature ignore these facts. Keeping
focus on these points, in this article, we propose secure storage, verification, and auditing (SecSVA)
of big data in cloud environment. SecSVA includes the following modules: an attribute-based secure
data deduplication framework for data storage on the cloud, Kerberos-based identity verification and
authentication, and Merkle hash-tree-based trusted third-party auditing on cloud. From the analysis, it
is clear that SecSVA can provide secure third party auditing with integrity preservation across multiple
domains in the cloud environment.
EPRO BD - 005
SecSVA: Secure Storage, Verification, and Auditing of Big Data in the Cloud
Environment
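The Merkle hash-tree idea behind the auditing module can be sketched as a generic root computation (illustrative block data; not SecSVA's actual code):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Compute the Merkle root of a non-empty list of data blocks (bytes).
    Leaves are hashed, then pairs of hashes are hashed level by level."""
    level = [_h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
root = merkle_root(blocks)
# Auditing intuition: tampering with any single block changes the root,
# so a third party holding only the root can detect modification.
tampered = merkle_root([b"block-0", b"block-X", b"block-2", b"block-3"])
```

A verifier needs only O(log n) sibling hashes to check one block against the root, which is what makes third-party auditing of large stores practical.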
With an exponential increase in smart device users, there is an increase in the bulk amount of data
generation from various smart devices, which varies with respect to all the essential V's used to
categorize it as big data. Generally, most service providers, including Google, Amazon, Microsoft and
so on, have deployed a large number of geographically distributed data centers to process this huge
amount of data generated from various smart devices so that users can get quick response time. For
this purpose, Hadoop and Spark are widely used by these service providers for processing large
datasets. However, less emphasis has been given on the underlying infrastructure (the network through
which data flows), which is one of the most important components for successful implementation of
any designed solution in this environment. In the worst case, due to heavy network traffic with respect
to data migrations across different data centers, the underlying network infrastructure may not be able
to transfer data packets from source to destination, resulting in performance degradation. Focusing on
all these issues, in this article, we propose a novel SDN-based big data management approach with
respect to the optimized network resource consumption such as network bandwidth and data storage
units. We analyze various components at both the data and control planes that can enhance the
optimized big data analytics across multiple cloud data centers. For example, we analyze the
performance of the proposed solution using Bloom-filter-based insertion and deletion of an element in
the flow table maintained at the OpenFlow controller.
EPRO BD - 006
Optimized Big Data Management across Multi-Cloud Data Centers: Software-
Defined-Network-Based Analysis
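Since a plain Bloom filter cannot delete entries, a counting variant is one common way to support both the insertion and deletion operations mentioned in the abstract; a sketch with illustrative parameters (not the paper's OpenFlow implementation):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: small counters instead of bits, so flow-table
    entries can be deleted as well as inserted."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def insert(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def delete(self, item):
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def might_contain(self, item):
        # May return a false positive, but never a false negative.
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter()
flow = "10.0.0.1->10.0.0.2:443"        # illustrative flow identifier
cbf.insert(flow)
```

The counters cost a few bits per cell over a plain Bloom filter but make membership updates reversible, which suits frequently churning flow tables.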
Rapid urbanization has posed significant burden on urban transportation infrastructures. In today's
cities, both private and public transits have clear limitations to fulfill passengers' needs for quality of
experience (QoE): Public transits operate along fixed routes with long wait time and total transit time;
Private transits, such as taxis, private shuttles and ride-hailing services, provide point-to-point transits
with high trip fare. In this paper, we propose CityLines, a transformative urban transit system,
employing a hybrid hub-and-spoke transit model with shared shuttles. Analogous to airline services,
the proposed CityLines system routes urban trips among spokes through a few hubs or direct paths,
with travel time as short as private transits and fare as low as public transits. CityLines allows both
point-to-point connection to improve the passenger QoE, and hub-and-spoke connection to reduce the
system operation cost. To evaluate the performance of CityLines, we conduct extensive data-driven
experiments using one-month real-world trip demand data (from taxis, buses and subway trains)
collected from Shenzhen, China. The results demonstrate that CityLines reduces 12.5%-44% average
travel time, and aggregates 8.5%-32.6% more trips with ride-sharing over other implementation
baselines.
EPRO BD - 007
CityLines: Designing Hybrid Hub-and-Spoke Transit System with Urban Big Data
Nowadays machine learning is becoming a new paradigm for mining hidden knowledge in big data.
The collection and manipulation of big data not only create considerable values, but also raise serious
privacy concerns. To protect the huge amount of potentially sensitive data, a straightforward approach
is to encrypt data with specialized cryptographic tools. However, it is challenging to utilize or operate
on encrypted data, especially to perform machine learning algorithms. In this paper, we investigate the
problem of training high quality word vectors over large-scale encrypted data (from distributed data
owners) with the privacy-preserving collaborative neural network learning algorithms. We leverage
and also design a suite of arithmetic primitives (e.g., multiplication, fixed-point representation and
sigmoid function computation) on encrypted data, which serve as components of our construction. We
theoretically analyze the security and efficiency of our proposed construction, and conduct extensive
experiments on representative real-world datasets to verify its practicality and effectiveness.
EPRO BD - 008
Privacy-Preserving Collaborative Model Learning: The Case of Word Vector
Training
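Fixed-point representation, one of the arithmetic primitives the abstract mentions, can be sketched in plain Python; note the paper applies such primitives to encrypted values, while this sketch operates on plaintext integers only:

```python
SCALE = 1 << 16   # 16 fractional bits (an illustrative choice)

def to_fixed(x: float) -> int:
    """Encode a real number as a scaled integer."""
    return int(round(x * SCALE))

def from_fixed(f: int) -> float:
    """Decode a scaled integer back to a float."""
    return f / SCALE

def fixed_mul(a: int, b: int) -> int:
    # The raw product carries 32 fractional bits; rescale back to 16.
    return (a * b) // SCALE

a, b = to_fixed(1.5), to_fixed(-2.25)
prod = from_fixed(fixed_mul(a, b))     # close to -3.375
```

Working over integers is what lets such computations be lifted onto homomorphic or secret-shared ciphertexts, where only integer operations are available.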
Cloud storage auditing schemes for shared data refer to checking the integrity of cloud data shared by
a group of users. User revocation is commonly supported in such schemes, as users may be subject to
group membership changes for various reasons. Previously, the computational overhead for user
revocation in such schemes was linear in the total number of file blocks possessed by a revoked user.
The overhead, however, may become a heavy burden because of the sheer amount of the shared cloud
data. Thus, how to reduce the computational overhead caused by user revocations becomes a key
research challenge for achieving practical cloud data auditing. In this paper, we propose a novel storage
auditing scheme that achieves highly-efficient user revocation independent of the total number of file
blocks possessed by the revoked user in the cloud. This is achieved by exploring a novel strategy for
key generation and a new private key update technique. Using this strategy and the technique, we
realize user revocation by just updating the nonrevoked group users' private keys rather than
authenticators of the revoked user. The integrity auditing of the revoked user's data can still be correctly
performed when the authenticators are not updated. Meanwhile, the proposed scheme is based on
identity-based cryptography, which eliminates the complicated certificate management in traditional
Public Key Infrastructure (PKI) systems. The security and efficiency of the proposed scheme are
validated via both analysis and experimental results.
EPRO BD - 009
Enabling Efficient User Revocation in Identity-based Cloud Storage Auditing for
Shared Big Data
To address the problem of efficiently operating highly virtualized cloud centers by extending the
life cycle of physical machines under grouped task flows or Big Data stream processing, we propose
a new algorithm for clustering and for selecting the main cluster node using fuzzy logic and Voronoi
diagrams. The criteria for selecting the main cluster node are the centrality of the Voronoi diagram,
residual energy, and the availability of free resources on the physical machines.
EPRO BD - 010
Clustering model of cloud centers for big data processing
Location prediction is the key technique in many location based services including route navigation,
dining location recommendations, and traffic planning and control, to mention a few. This survey
provides a comprehensive overview of location prediction, including basic definitions and concepts,
algorithms, and applications. First, we introduce the types of trajectory data and related basic concepts.
Then, we review existing location-prediction methods, ranging from temporal-pattern-based prediction
to spatiotemporal-pattern-based prediction. We also discuss and analyze the advantages and
disadvantages of these algorithms and briefly summarize current applications of location prediction in
diverse fields. Finally, we identify the potential challenges and future research directions in location
prediction.
EPRO BD - 011
Location prediction on trajectory data: A review
Pattern mining is one of the most important tasks to extract meaningful and useful information from
raw data. This task aims to extract item-sets that represent any type of homogeneity and regularity in
data. Although many efficient algorithms have been developed in this regard, the growing interest in
data has caused the performance of existing pattern mining techniques to drop. The goal of this
paper is to propose new efficient pattern mining algorithms to work in big data. To this aim, a series
of algorithms based on the MapReduce framework and the Hadoop open-source implementation have
been proposed. The proposed algorithms can be divided into three main groups. First, two algorithms
[Apriori MapReduce (AprioriMR) and iterative AprioriMR] with no pruning strategy are proposed,
which extract any existing item-set in data. Second, two algorithms (space pruning AprioriMR and top
AprioriMR) that prune the search space by means of the well-known anti-monotone property are
proposed. Finally, a last algorithm (maximal AprioriMR) is also proposed for mining condensed
representations of frequent patterns. To test the performance of the proposed algorithms, a varied
collection of big data datasets has been considered, comprising up to 3·10¹⁸ transactions and
more than 5 million distinct single items. The experimental stage includes comparisons against
highly efficient and well-known pattern mining algorithms. Results reveal the interest of applying
MapReduce versions when complex problems are considered, and also the unsuitability of this
paradigm when dealing with small data.
EPRO BD - 012
Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data
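The map/reduce split behind AprioriMR-style candidate counting can be sketched without a Hadoop cluster, using in-memory stand-ins for the two phases (toy transactions; not the paper's implementation):

```python
from itertools import combinations
from collections import defaultdict

def map_phase(transaction, k):
    """Map: emit (itemset, 1) for every k-itemset found in one transaction."""
    return [(frozenset(c), 1) for c in combinations(sorted(transaction), k)]

def reduce_phase(pairs, min_support):
    """Reduce: sum counts per itemset and keep those meeting min_support."""
    counts = defaultdict(int)
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= min_support}

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
# Shuffle step collapses to a flat list here; Hadoop would group by key.
pairs = [p for t in transactions for p in map_phase(t, 2)]
frequent_2 = reduce_phase(pairs, min_support=2)
```

In the pruned variants described in the abstract, the anti-monotone property would discard candidate k-itemsets whose (k-1)-subsets are already infrequent before the map phase emits them.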
Social Internet of Things (SIoT) supports many novel applications and networking services for the IoT
in a more powerful and productive way. In this paper, we have introduced a hierarchical framework
for feature extraction in SIoT big data using a MapReduce framework along with a supervised classifier
model. Moreover, a Gabor filter is used to reduce noise and unwanted data from the database, and
Hadoop MapReduce has been used for mapping and reducing big databases to improve the efficiency
of the proposed work. Furthermore, feature selection has been performed on the filtered data set by
using Elephant Herd Optimization. The proposed system architecture has been implemented using
Linear Kernel Support Vector Machine-based classifier to classify the data and for predicting the
efficiency of the proposed work. From the results, the maximum accuracy, specificity, and sensitivity
of our work are 98.2%, 85.88%, and 80%, respectively; we also analyzed time and memory usage, and
these results have been compared with the existing literature.
EPRO BD - 013
Effective Features to Classify Big Data Using Social Internet of Things
The advances in smart grids are enabling huge amount of data to be aggregated and analyzed for various
smart grid applications. However, the traditional smart grid data management systems cannot scale and
provide sufficient storage and processing capabilities. To address these challenges, this paper presents
a smart grid big data eco-system based on the state-of-the-art Lambda architecture that is capable of
performing parallel batch and real-time operations on distributed data. Further, the presented
eco-system utilizes a Hadoop Big Data Lake to store various types of smart grid data, including smart
meter, image, and video data. An implementation of the smart grid big data eco-system on a cloud computing
platform is presented. To test the capability of the presented eco-system, real-time visualization and
data mining applications were performed on real smart grid data. The results of those applications on
top of the eco-system suggest that it is capable of performing numerous smart grid big data analytics.
EPRO BD - 014
Data Lake Lambda Architecture for Smart Grids Big Data Analytics
In this big data era, knowledge becomes increasingly linked, along with the rapid growth in data
volume. Connected knowledge is naturally represented and stored as knowledge graphs, which are of
great importance for many frontier research areas. Effectively finding relations between entities in
large knowledge graphs plays a key role in many knowledge graph applications, as the most valuable
part of a knowledge graph is its rich connectedness, which captures rich information about the real-
world objects. However, due to the intrinsic complexity of real-world knowledge, finding semantically
close relations by navigation in a large knowledge graph is challenging. Canonical graph exploration
methods inevitably result in combinatorial explosion especially when the paths connecting two entities
are long: the search space is O(d^l), where d is the average graph node degree and l is the path length.
In this paper, we will systematically study the semantic navigation problem for large knowledge
graphs. Inspired by AlphaGo, which was overwhelmingly successful in the game of Go, we designed an
efficient semantic navigation method based on a well-tailored Monte Carlo Tree Search algorithm that
takes the unique characteristics of knowledge graphs into account. Extensive experiments show that our
method is not only effective but also very efficient.
EPRO BD - 015
Neurally-Guided Semantic Navigation in Knowledge Graph
Modern microprocessors offer a rich memory hierarchy including various levels of cache and registers. Some
of these memories (like main memory, L3 cache) are big but slow and shared among all cores. Others (registers,
L1 cache) are fast and exclusively assigned to a single core, but small. Only if data accesses have high
locality can we avoid excessive data transfers within the memory hierarchy. In this paper we consider
fundamental algorithms like matrix multiplication, K-Means, Cholesky decomposition as well as the algorithm
by Floyd and Warshall typically operating in two or three nested loops. We propose to traverse these loops
whenever possible not in the canonical order but in an order defined by a space-filling curve. This traversal order
dramatically improves data locality over a wide granularity allowing not only to efficiently support a cache of a
single, known size (cache conscious) but also a hierarchy of various caches where the effective size available to
our algorithms may even be unknown (cache oblivious). We propose a new space-filling curve called Fast
Unrestricted (FUR) Hilbert with the following advantages: (1) we overcome the usual limitation to square-like
grid sizes where the side-length is a power of 2 or 3. Instead, our approach allows arbitrary loop boundaries for
all variables. (2) FUR-Hilbert is non-recursive with a guaranteed constant worst case time complexity per loop
iteration (in contrast to O(log(gridsize)) for previous methods). (3) Our non-recursive approach makes the
application of our cache-oblivious loops in any host algorithm as easy as conventional loops and facilitates
automatic optimization by the compiler. (4) We demonstrate that crucial algorithms like Cholesky
decomposition as well as the algorithm by Floyd and Warshall can be efficiently supported. (5) Extensive
experiments on runtime efficiency, cache usage and energy consumption demonstrate the profit of our approach.
We believe that future compilers could translate nested loops into cache-oblivious loops, either fully
automatically or through a user-guided analysis of the data.
EPRO BD - 016
A Novel Hilbert Curve for Cache-locality Preserving Loops
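For contrast with FUR-Hilbert, the classic Hilbert-curve conversion, which is restricted to power-of-two square grids (the very limitation the paper removes), looks like this:

```python
def d2xy(n, d):
    """Convert distance d along the Hilbert curve into (x, y) on an n-by-n
    grid, n a power of two. Classic iterative algorithm: at each scale,
    pick a quadrant and rotate/reflect the sub-curve into place."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                    # rotate the quadrant if needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

n = 8
# Visiting an n x n grid in curve order instead of row-major order keeps
# consecutively accessed cells spatially close, improving cache locality.
order = [d2xy(n, d) for d in range(n * n)]
```

Every cell is visited exactly once and consecutive cells are grid neighbors, which is the locality property the loop transformation exploits; FUR-Hilbert additionally lifts the power-of-two restriction and avoids recursion.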
Hadoop has become a popular framework for processing data-intensive applications in cloud
environments. A core constituent of Hadoop is the scheduler, which is responsible for scheduling and
monitoring the jobs and tasks, and rescheduling them in case of failures. Although fault-tolerance
mechanisms have been proposed for Hadoop, the performance of Hadoop can be significantly impacted
by unforeseen events in the cloud environment. In this paper, we introduce a dynamic and failure-
aware framework that can be integrated within Hadoop scheduler and adjust the scheduling decisions
based on collected information about the cloud environment. Our framework relies on predictions made
by machine learning algorithms and scheduling policies generated by a Markovian Decision Process
(MDP), to adjust its scheduling decisions on the fly. Instead of the fixed heartbeat-based failure
detection commonly used in Hadoop to track active TaskTrackers (i.e., nodes that process the
scheduled tasks), our proposed framework implements an adaptive algorithm that can dynamically
detect the failures of the TaskTrackers. To deploy our proposed framework, we have built ATLAS+,
an AdapTive Failure-Aware Scheduler for Hadoop. To assess the performance of ATLAS+, we
conduct a large empirical study on a 100-node Hadoop cluster deployed on Amazon Elastic
MapReduce (EMR), comparing the performance of ATLAS+ with those of three Hadoop schedulers
(FIFO, Fair, and Capacity).
EPRO BD - 017
A Dynamic and Failure-aware Task Scheduling Framework for Hadoop
Fuzzy associative classification has not been widely analyzed in the literature, although associative
classifiers (ACs) have proved to be very effective in different real domain applications. The main
reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To
overcome this drawback, in this paper, we propose an efficient distributed fuzzy associative
classification approach based on the MapReduce paradigm. The approach exploits a novel distributed
discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes. Then, a
set of candidate fuzzy association rules is generated by employing a distributed fuzzy extension of the
well-known FP-Growth algorithm. Finally, this set is pruned by using three purposely adapted types
of pruning. We implemented our approach on the popular Hadoop framework. Hadoop allows
distributing storage and processing of very large data sets on computer clusters built from commodity
hardware. We have performed an extensive experimentation and a detailed analysis of the results using
six very large datasets with up to 11,000,000 instances. We have also experimented with different types of
reasoning methods. Focusing on accuracy, model complexity, computation time, and scalability, we
compare the results achieved by our approach with those obtained by two distributed nonfuzzy ACs
recently proposed in the literature. We highlight that, although the accuracies are comparable, the
complexity of the classifiers generated by the distributed fuzzy approach, evaluated in terms of the
number of rules, is lower than that of the nonfuzzy classifiers.
EPRO BD - 018
A Distributed Fuzzy Associative Classifier for Big Data
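Uniform triangular fuzzy partitions of a numeric attribute, the kind of partitioning such a discretizer produces, can be sketched as follows (parameters illustrative; the paper's discretizer places partitions by fuzzy entropy rather than uniformly):

```python
def triangular_partitions(lo, hi, n=3):
    """Build n uniformly spaced triangular fuzzy sets over [lo, hi].
    Returns the set centers and one membership function per set."""
    step = (hi - lo) / (n - 1)
    centers = [lo + i * step for i in range(n)]

    def make(c):
        def mu(x):
            # Triangle peaking at c, reaching 0 at the neighboring centers.
            return max(0.0, 1.0 - abs(x - c) / step)
        return mu

    return centers, [make(c) for c in centers]

centers, mus = triangular_partitions(0.0, 10.0, n=3)   # centers 0, 5, 10
# A crisp value now has a degree of membership in each fuzzy set:
degrees = [mu(3.0) for mu in mus]
```

With uniformly spaced triangles the memberships at any point sum to 1 (a strong fuzzy partition), so each attribute value contributes fractionally to the rules whose antecedents it partially matches.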
Whereas traditional scientific applications are computationally intensive, recent applications require
more data-intensive analysis and visualization to extract knowledge from the explosive growth of
scientific information and simulation data. As the computational power and size of compute clusters
continue to increase, the I/O read rates and associated network for these data-intensive applications
have been unable to keep pace. These applications suffer from long I/O latency due to the movement
of “big data” from the network/parallel file system, which results in a serious performance bottleneck.
To address this problem, we propose a novel approach called “ODDS” to optimize data-locality
access in scientific data analysis and visualization. ODDS leverages a distributed file system (DFS) to
provide scalable data access for scientific analysis. Through exploiting the information of underlying
data distribution in DFS, ODDS employs a novel data-locality scheduler to transform a compute-
centric mapping into a data-centric one and enables each computational process to access the needed
data from a local or nearby storage node. ODDS is suitable for parallel applications with dynamic
process-to-data scheduling and for applications with static process-to-data assignment.
EPRO BD - 020
ODDS: Optimizing Data-locality Access for Scientific Data Analysis
In this paper, we propose a system infrastructure to construct the big scholar data as a large knowledge
graph, discover the meta paths between the entities, and calculate the relevancy between entities in the
graph. The core infrastructure is established on the secure and private Amazon Elastic Compute
Cloud (Amazon EC2) platform. The infrastructure distributes the data evenly across the repositories,
processes the data in parallel by utilizing the open-source Spark framework, manages computing
resources optimally by utilizing YARN and Hadoop HDFS, and discovers relationships in a distributed manner between
different types of entities. We incorporate four relationship discovery tasks including citation
recommendation, potential collaborator discovery, similar venue measurement and paper to venue
recommendation on top of this infrastructure. For relationship mining tasks, we propose a mixed and
weighted meta path (MWMP) method to explore the potential relationship between different types of
entities. To verify the accuracy and measure parallelization speedup of our algorithm, we set up clusters
through Amazon EC2 platform.
EPRO BD - 019
Distributed Relationship Mining over Big Scholar Data
While Hadoop ecosystems become increasingly important for practitioners of large-scale data analysis,
they also incur tremendous energy cost. This trend is driving up the need for designing energy-efficient
Hadoop clusters in order to reduce the operational costs and the carbon emission associated with its
energy consumption. However, despite extensive studies of the problem, existing approaches for
energy efficiency have not fully considered the heterogeneity of both workload and machine hardware
found in production environments. In this paper, we find that heterogeneity-oblivious task assignment
approaches are detrimental to both performance and energy efficiency of Hadoop clusters. Our
observation shows that even heterogeneity-aware techniques that aim to reduce the job completion
time do not guarantee a reduction in energy consumption of heterogeneous machines. We propose a
heterogeneity-aware task assignment approach, E-Ant, that aims to improve the overall energy
consumption in a heterogeneous Hadoop cluster without sacrificing job performance. It adaptively
schedules heterogeneous workloads on energy-efficient machines, without a priori knowledge of the
workload properties. E-Ant employs an ant colony optimization approach that generates task
assignment solutions based on the feedback of each task's energy consumption reported by Hadoop
TaskTrackers in an agile way. Furthermore, we integrate the DVFS technique with E-Ant, which
further improves the energy efficiency of heterogeneous Hadoop clusters by 23 and 17 percent
compared to Fair Scheduler and Tarazu, respectively.
EPRO BD - 022
Energy Efficiency Aware Task Assignment with DVFS in Heterogeneous Hadoop
Clusters
We propose a sampling-based framework for privacy-preserving approximate data search in the
context of big data. The framework is designed to bridge multi-target query needs from users and the
data platform, including required query accuracy, timeliness, and query privacy constraints. A novel
privacy metric, (ε, δ)-approximation, is presented to uniformly measure accuracy, efficiency and
privacy breach risk. Based on this, we employ bootstrapping to efficiently produce approximate results
that meet the preset query requirements. Moreover, we propose a quick response mechanism to deal
with homogeneous queries, and discuss the reuse of results when new data are appended. Theoretical
analyses and experimental results demonstrate that the framework is capable of effectively fulfilling
multi-target query requirements with high efficiency and accuracy.
EPRO BD - 023
Hermes: A Privacy-Preserving Approximate Search Framework for Big Data
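The bootstrapping step such a framework employs can be sketched as a plain resampling routine (toy sample; none of the (ε, δ)-approximation machinery from the paper):

```python
import random

def bootstrap_mean_ci(sample, num_resamples=1000, alpha=0.05, seed=0):
    """Approximate a (1 - alpha) confidence interval for the mean by
    resampling the data with replacement and taking empirical quantiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(sample) for _ in sample) / len(sample)
        for _ in range(num_resamples)
    )
    lo = means[int(alpha / 2 * num_resamples)]
    hi = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi

sample = list(range(100))       # toy data; the true mean is 49.5
lo, hi = bootstrap_mean_ci(sample)
```

The width of the interval quantifies accuracy, so a query planner can stop sampling as soon as the interval is tight enough for the user's preset accuracy requirement.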
This paper proposes an alternative ranking system based on social network metrics and their evaluation
in a composite distributed framework. The most important aspect of the social network domain is its
voluminous data. Predictive analysis of huge sets of raw data can be effective for identifying key
trends, but such analytics are generally very compute-intensive. The most efficient way to handle such
compute-intensive analysis is to map the problem onto a distributed domain. The main contributions of
this paper are twofold. First, it analyzes the cricket community from the viewpoint of a social network
and ranks players and countries based on graph centrality measures. Second, it proposes a
comprehensive distributed framework to offer infrastructural support for large-scale data analysis as
well as graph processing. Hadoop is an open-source framework for storing and processing large data
sets in a cloud environment, and MapReduce is a popular programming model for processing that data
in a distributed manner. However, MapReduce alone is not efficient for graph processing; Giraph is an
alternative programming paradigm for graph processing on Hadoop. Using a practical case study of
social network analysis of the cricket community, this paper captures the significance of the alternative
ranking in this sport and shows the effectiveness of the proposed framework in analyzing voluminous
data.
EPRO BD -
024
Social Network Analysis of Cricket Community Using a Composite Distributed
Framework: From Implementation Viewpoint
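To make the centrality-based ranking concrete, here is a sketch using degree centrality, the simplest of the graph centrality measures such a study evaluates. The edge list (e.g., players who appeared in the same match) and all names are hypothetical; the paper itself runs such computations at scale on Giraph.

```python
def rank_by_degree(edges):
    """Rank the nodes of an undirected interaction graph by degree
    centrality (number of incident edges), highest first; ties are
    broken alphabetically for a deterministic ordering."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sorted(deg, key=lambda n: (-deg[n], n))
```

In a distributed setting, the same per-node counting maps naturally onto a vertex-centric Giraph program, which is why graph-native paradigms beat plain MapReduce for this workload.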
The capability of switching into the islanded operation mode has been advocated as a viable way for
microgrids to achieve high system reliability. This paper proposes a new model for the optimal
scheduling and load curtailment problem of microgrids. The proposed problem determines the optimal
schedule for the local generators of microgrids so as to minimize the generation cost of the associated
distribution system in normal operation. Moreover, when microgrids must switch into the islanded
operation mode for reliability reasons, the optimal generation solution still guarantees a minimal
amount of load curtailment. Due to the large number of constraints in both normal and islanded
operations, the formulated problem becomes a large-scale optimization problem that is very challenging
to solve with a centralized computational method. Therefore, we propose a decomposition algorithm
based on the alternating direction method of multipliers (ADMM) that provides a parallel
computational framework. The simulation results demonstrate the efficiency of the proposed model in
reducing generation cost and guaranteeing reliable operation of microgrids in the islanded mode.
Finally, we describe in detail a parallel implementation of the proposed algorithm on a computer
cluster using the Hadoop MapReduce software framework.
EPRO BD -
025
A Big Data Scale Algorithm for Optimal Scheduling of Integrated Microgrids
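The decomposition pattern behind ADMM can be sketched on a toy consensus problem: each agent i minimizes a local quadratic cost (x_i - a_i)^2 while a shared variable z couples them, so the local x-updates run in parallel (the step that MapReduce would distribute). This is a generic consensus-ADMM example, not the paper's microgrid model, and the cost function is chosen only because its updates have closed form.

```python
def consensus_admm(a, rho=1.0, iters=100):
    """Consensus ADMM for min sum_i (x_i - a[i])^2 s.t. x_i = z.
    Alternates parallel local x-updates, a coupling z-update, and
    scaled dual (u) updates; converges to z = mean(a)."""
    n = len(a)
    x = [0.0] * n
    u = [0.0] * n   # scaled dual variables, one per agent
    z = 0.0
    for _ in range(iters):
        # local x-update: argmin (x_i - a_i)^2 + (rho/2)(x_i - z + u_i)^2
        x = [(2 * a[i] + rho * (z - u[i])) / (2 + rho) for i in range(n)]
        # consensus z-update gathers the local results
        z = sum(x[i] + u[i] for i in range(n)) / n
        # dual update penalizes remaining disagreement
        u = [u[i] + x[i] - z for i in range(n)]
    return z
```

In the paper's setting each "agent" would be a microgrid subproblem with its own operational constraints, but the alternation of parallel local solves and a cheap coordination step is the same.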
To address the computing challenge of 'big data', a number of data-intensive computing frameworks
(e.g., MapReduce, Dryad, Storm, and Spark) have emerged and become popular. YARN is a de facto
resource management platform that enables these frameworks to run together in a shared system.
However, we observe that, in a cloud computing environment, the fair resource allocation policy
implemented in YARN is unsuitable: its memoryless resource allocation violates a number of good
properties of shared computing systems. This paper attempts to address these problems for YARN,
considering both single-level and hierarchical resource allocation. For single-level resource allocation,
we propose a novel fair resource allocation mechanism called Long-Term Resource Fairness (LTRF).
For hierarchical resource allocation, we propose Hierarchical Long-Term Resource Fairness (H-LTRF)
by extending LTRF. We show that both LTRF and H-LTRF address the fairness problems of the
current resource allocation policy and are thus suitable for cloud computing. Finally, we have
developed LTYARN by implementing LTRF and H-LTRF in YARN, and our experiments show that it
achieves better resource fairness than the existing fair schedulers of YARN.
EPRO BD -
026
Fair Resource Allocation for Data-Intensive Computing in the Cloud
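The contrast with memoryless fair sharing can be sketched as follows: instead of equalizing instantaneous shares, grant each resource unit to the user with the smallest cumulative allocation so far. This is an illustrative reading of the long-term fairness idea, not LTRF's actual mechanism; the function names and the unit-by-unit granting loop are assumptions.

```python
def ltrf_allocate(demands, history, capacity):
    """Grant `capacity` resource units one at a time, always to the
    demanding user with the smallest cumulative allocation (history),
    so users who yielded resources earlier are paid back later."""
    grants = {u: 0 for u in demands}
    hist = dict(history)
    for _ in range(capacity):
        # users whose demand is not yet satisfied, poorest-in-history first
        want = [u for u in demands if grants[u] < demands[u]]
        if not want:
            break
        u = min(want, key=lambda x: (hist[x], x))
        grants[u] += 1
        hist[u] += 1
    return grants
```

A memoryless scheduler would split the four units evenly in the usage example below; accounting for history instead compensates the user who previously received nothing, which is the property the paper argues instantaneous fairness cannot provide.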
In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the
Hadoop file system (HDFS). Our experiments show that the distribution of sub-datasets over HDFS
blocks, which HDFS hides, can often cause the corresponding analyses to suffer from seriously
imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets causes
some computational nodes to carry out much more work than others; furthermore, it leads to
inefficient sampling of sub-datasets, as analysis programs often read large amounts of irrelevant data.
We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling
occur. We then propose a storage-distribution-aware method, referred to as DataNet, to optimize
sub-dataset analysis over distributed storage systems. First, we propose an efficient algorithm to obtain
the meta-data of sub-dataset distributions. Second, we design an elastic storage structure called
ElasticMap, based on HashMap and BloomFilter techniques, to store the meta-data. Third, we employ
distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel
execution. The proposed method can benefit different sub-dataset analyses with various computational
requirements. Experiments conducted on PRObE's Marmot 128-node cluster testbed show the
performance benefits of DataNet.
EPRO BD -
027
Speed Up Big Data Analytics by Unveiling the Storage Distribution of Sub-Datasets
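The HashMap-plus-BloomFilter combination can be sketched as follows: sub-datasets that dominate a block are recorded exactly in a hash map, while the long tail of small sub-datasets is recorded compactly in a per-block Bloom filter (false positives possible, false negatives not). This is a hypothetical reading of the ElasticMap idea; the class names, threshold, and filter sizing are all illustrative, not the paper's implementation.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash probes into a bit array; membership
    tests may yield false positives but never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _probes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._probes(item))

class ElasticMap:
    """Per-block sub-dataset index: frequent sub-datasets go into an
    exact hash map, rare ones into a compact per-block Bloom filter."""
    def __init__(self, threshold=0.1):
        self.threshold, self.exact, self.approx = threshold, {}, {}

    def index_block(self, block_id, records):
        counts = {}
        for sub in records:
            counts[sub] = counts.get(sub, 0) + 1
        bf = BloomFilter()
        for sub, c in counts.items():
            if c / len(records) >= self.threshold:
                self.exact.setdefault(sub, []).append(block_id)
            else:
                bf.add(sub)
        self.approx[block_id] = bf

    def blocks_for(self, sub):
        """Return (blocks known to hold sub, blocks that may hold it)."""
        exact = self.exact.get(sub, [])
        maybe = [b for b, bf in self.approx.items()
                 if b not in exact and sub in bf]
        return exact, maybe
```

The elasticity is the point: exact entries keep lookups precise for the sub-datasets that drive load imbalance, while the Bloom filters keep the meta-data small enough to hold for every block.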
Cloud computing has arisen as the mainstream platform of the utility computing paradigm: it offers a
reliable and robust infrastructure for storing data remotely and provides on-demand applications and
services. Organizations that produce huge volumes of sensitive data now leverage data outsourcing to
reduce the burden of local data storage and maintenance. The outsourced data in the cloud, however,
are not always trustworthy, because data owners lack physical control over them. To address this
issue, researchers have focused on mitigating the security threats by designing remote data checking
(RDC) techniques. The majority of these techniques, however, are inapplicable to big data storage
because of the huge computation cost they incur on the user and cloud sides. Existing schemes suffer
from the data dynamicity problem in two ways. First, some are applicable only to static archive data
and cannot audit dynamically updated outsourced data. Second, although some existing methods do
support dynamic data updates, a growing number of update operations imposes high computation and
communication costs on the auditor due to the maintenance of a data structure, i.e., a Merkle hash tree.
This paper presents an efficient RDC method based on algebraic properties of the outsourced files in
cloud computing, which incurs minimal computation and communication cost. The main contribution
of this paper is a new data structure, called the Divide and Conquer Table (D&CT), which proficiently
supports dynamic data for normal file sizes. Moreover, this data structure makes our method applicable
to large-scale data storage with minimum computation cost.
EPRO BD -
028
Auditing Big Data Storage in Cloud Computing Using Divide and Conquer Tables
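The motivation for a table-based structure over a Merkle hash tree can be sketched as follows: keep per-block metadata (tag, version) in fixed-size sub-tables, so an update touches a single entry and an insert only opens a new sub-table, with no global tree rebuild. This is a hypothetical illustration of the divide-and-conquer idea; the class, its fields, and the sub-table policy are invented for the sketch and are not the paper's D&CT specification.

```python
class DCTable:
    """Per-block audit metadata split across fixed-size sub-tables.
    Updates are O(1) within a sub-table (bump a version), unlike a
    Merkle hash tree where updates ripple up to the root."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.tables = [[]]          # list of sub-tables of block entries

    def insert(self, block_tag):
        t = self.tables[-1]
        if len(t) == self.capacity:  # divide: start a fresh sub-table
            t = []
            self.tables.append(t)
        t.append({"tag": block_tag, "version": 1})

    def update(self, i, new_tag):
        # locate the sub-table holding global block index i
        t, j = divmod(i, self.capacity)
        entry = self.tables[t][j]
        entry["tag"], entry["version"] = new_tag, entry["version"] + 1

    def lookup(self, i):
        t, j = divmod(i, self.capacity)
        return self.tables[t][j]
```

The versions let an auditor detect replay of stale blocks, while the bounded sub-table size keeps the per-update maintenance cost constant as the file grows, which is the property the abstract claims for D&CT.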