Secure outsourcing of computation has gained importance with the proliferation of cloud services. However, existing outsourcing protocol specification languages are mainly suitable for secure multi-party computation. They offer limited support for secure outsourcing of computation over large datasets in cloud computing environments. This paper presents a model-driven approach to define and then coordinate the execution of secure outsourcing protocols. First, we present the details of our Outsourcing Protocol Definition Language (OPDL), used to define machine-processable protocols in an abstract and declarative way while leaving the implementation details to the underlying runtime components. The proposed language aims to simplify the design of these protocols while allowing their verification and the generation of cloud service compositions to coordinate the protocol execution. We evaluated the expressiveness of OPDL by using it to define a set of representative secure outsourcing protocols from the literature.
The evolution of next-generation sequencing technology increases the demand for efficient solutions, in terms of space and time, for several bioinformatics problems. This paper presents a practical and easy-to-implement solution for one of these problems, namely, the all-pairs suffix-prefix problem, using a compact prefix tree. The paper demonstrates an efficient construction of this time-efficient and space-economical tree data structure. The paper presents techniques for parallel implementations of the proposed solution. Experimental evaluation indicates superior results in terms of space and time over existing solutions. Results also show that the proposed technique is highly scalable in a parallel execution environment.
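To make the problem concrete, here is a brute-force sketch of its definition (an illustration only, not the compact-prefix-tree algorithm of the paper): for every ordered pair of reads (a, b), report the length of the longest suffix of a that is also a prefix of b.

```python
# Brute-force illustration of the all-pairs suffix-prefix problem: for every
# ordered pair (a, b), find the longest suffix of a that is also a prefix of b.
# Practical solutions use (compressed) suffix trees or arrays to avoid this
# quadratic-per-pair scan.

def longest_suffix_prefix(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

reads = ["GATTACA", "ACATTAG", "TAGGATT"]
for a in reads:
    for b in reads:
        if a is not b:
            print(f"{a} -> {b}: overlap {longest_suffix_prefix(a, b)}")
```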
Compressed indices are important data structures in stringology. Compressed versions of many well-known data structures, such as the suffix tree and suffix array, which are used in string matching problems, have been studied and proposed. This paper takes advantage of a very recent compressed suffix array to build a space-economical solution for an important bioinformatics problem, namely the all-pairs suffix-prefix problem. The paper also presents a simple technique for parallelizing the solution. Our results show that the proposed solution consumes less than one fifth of the space required by other solutions based on standard data structures. In addition, our results demonstrate that good performance scalability can be achieved by employing the proposed parallel algorithm.
In the cloud computing paradigm, energy-efficient allocation of different virtualized ICT resources (servers, storage disks, networks, and the like) is a complex problem due to the presence of heterogeneous application workloads (e.g., content delivery networks, MapReduce, web applications, and the like) having contentious allocation requirements in terms of ICT resource capacities (e.g., network bandwidth, processing speed, response time, etc.). Several recent papers have tried to address the issue of improving energy efficiency in allocating cloud resources to applications, with varying degrees of success. However, to the best of our knowledge, there is no published literature on this subject that clearly articulates the research problem and provides a research taxonomy for succinct classification of existing techniques. Hence, the main aim of this paper is to identify the open challenges associated with energy-efficient resource allocation. In this regard, the study first outlines the problem and the existing hardware- and software-based techniques available for this purpose. Furthermore, the techniques already presented in the literature are summarized based on the proposed energy-efficiency research-dimension taxonomy. The advantages and disadvantages of the existing techniques are comprehensively analyzed against the proposed research dimensions, namely: resource adaptation policy, objective function, allocation method, allocation operation, and interoperability.
In this paper, we study the privacy breach caused by unsafe correlations in transactional data where individuals have multiple tuples in a dataset. We provide two safety constraints to guarantee safe correlation of the data: (1) the safe grouping constraint to ensure that quasi-identifier and sensitive partitions are bounded by l-diversity and (2) the schema decomposition constraint to eliminate non-arbitrary correlations between non-sensitive and sensitive values to protect privacy and at the same time increase the utility of aggregate analysis. In our technique, values are grouped together in unique partitions that enforce l-diversity at the level of individuals. We also propose an association-preserving technique to increase the ability to learn/analyze from the anonymized data. To evaluate our approach, we conduct a set of experiments to determine the privacy breach and investigate the anonymization cost of safe grouping and preserving associations.
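As a rough illustration of the kind of constraint involved, the sketch below checks a simplified form of l-diversity at the level of individuals for a single partition; the grouping and decomposition algorithms of the paper are not reproduced here, and the helper names are hypothetical.

```python
# Simplified illustration (not the paper's algorithm): a partition passes a
# basic l-diversity check at the level of individuals if it contains at least
# l distinct sensitive values contributed by at least l distinct individuals.

def is_l_diverse(partition, l):
    """partition: list of (individual_id, sensitive_value) tuples."""
    values = {value for _, value in partition}
    people = {person for person, _ in partition}
    return len(values) >= l and len(people) >= l

group = [("p1", "flu"), ("p1", "flu"), ("p2", "hiv"), ("p3", "cold")]
print(is_l_diverse(group, 3))   # True: three sensitive values from three individuals
print(is_l_diverse(group, 4))   # False
```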
Data curation activities in collaborative databases mandate that collaborators interact until they converge and agree on the content of their data. Typically, updates by a member of the collaboration are made visible to all collaborators for comments but at the same time are pending the approval or rejection of the data custodian, e.g., the principal scientist or investigator (PI). In current database technologies, approval and authorization of updates is based solely on the identity of the user, e.g., via the SQL GRANT and REVOKE commands. However, in collaborative environments, the updated data is open for collaborators for discussion and further editing and is finally approved or rejected by the PI based on the content of the data and not on the identity of the updater. In this paper, we introduce a cloud-based collaborative database system that promotes and enables collaboration and data curation scenarios. We realize content-based update approval and history tracking of updates inside HBase, a distributed and scalable open-source cluster-based database. The design and implementation as well as a detailed performance study of several approaches for update approval are presented and contrasted in the paper.
Recently, large bio-projects dealing with the release of different genomes have emerged. Most of these projects use next-generation sequencing platforms. As a consequence, many de novo assembly tools have evolved to assemble the reads generated by these platforms. Each tool has its own inherent advantages and disadvantages, which make the selection of an appropriate tool a challenging task.
Anonymization methods are an important tool to protect privacy. The goal is to release data while preventing individuals from being identified. Most approaches generalize data, reducing the level of detail so that many individuals appear the same. An alternate class of methods, including anatomy, fragmentation, and slicing, preserves detail by generalizing only the link between identifying and sensitive data. We investigate learning association rules on such a database. Association rule mining on a generalized database is challenging, as specific values are replaced with generalizations, eliminating interesting fine-grained correlations. We instead learn association rules from a fragmented database, preserving fine-grained values. Only rules involving both identifying and sensitive information are affected; we demonstrate the efficacy of learning in such an environment.
Efficient evaluation of distributed computation on large-scale data is prominent in modern scientific computing, especially in big data analysis, image processing, and data mining applications. This problem is particularly challenging in distributed environments such as campus clusters, grids, or clouds, in which the basic computation routines are offered as web/cloud services. In this paper, we propose a locality-aware workflow-based solution for the evaluation of large-scale matrix expressions in a distributed environment. Our solution is based on the automatic generation of BPEL workflows in order to coordinate long-running, asynchronous, and parallel invocation of services. We optimize the input expression in order to maximize parallel execution of independent operations while reducing the matrix transfer cost to a minimum. Our approach frees the end-user of the system from the burden of writing and debugging lengthy BPEL workflows. We evaluated our solution on realistic mathematical expressions executed on large-scale matrices distributed on multiple clouds.
In the emerging cloud computing model, security is of paramount concern. This paper discusses the need for practical techniques that enable private outsourcing on the cloud by allowing the service provider to work on clients’ data or computations without seeing the data being processed. The paper briefly discusses two examples of such techniques.
Traditional High-Performance Computing (HPC) based big-data applications are usually constrained by having to move large amounts of data to compute facilities for real-time processing. Modern HPC systems, represented by High-Throughput Computing (HTC) and Many-Task Computing (MTC) platforms, on the other hand, intend to achieve the long-held dream of moving compute to data instead. This kind of data-aware scheduling, typically represented by Hadoop MapReduce, has been successfully implemented in the Map phase, whereby each Map task is sent to the compute node where the corresponding input data chunk is located. However, Hadoop MapReduce limits itself to a one-map-to-one-reduce framework, leading to difficulties in handling complex logic such as pipelines or workflows. Meanwhile, it lacks built-in support and optimization when the input datasets are shared among multiple applications and/or jobs. Performance can be improved significantly when knowledge of the shared and frequently accessed data is taken into account in scheduling decisions.
To enhance the capability of managing workflows in modern HPC systems, this paper presents CloudFlow, a Hadoop MapReduce based programming model for cloud workflow applications. CloudFlow is built on top of MapReduce and is designed to be not only data-aware but also shared-data-aware. It identifies the most frequently shared data, at both the task level and the job level, and replicates it to each compute node for data-locality purposes. It also supports multiple user-defined Map and Reduce functions, allowing users to orchestrate the required data-flow logic. We prove the correctness of the whole scheduling framework through theoretical analysis. Furthermore, experimental evaluation shows that the execution runtime speedup exceeds 4X compared to a traditional MapReduce implementation, with a manageable time overhead.
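A toy sketch of the shared-data-aware scheduling idea is shown below (hypothetical data structures and thresholds, not CloudFlow's implementation): frequently shared input chunks are replicated to every node, and each task is then dispatched to a node that already holds its input.

```python
# Toy locality-aware dispatch (illustrative only): replicate "hot" chunks that
# many tasks share, then send each task to a node that stores its input chunk.
from collections import Counter

all_nodes = {"node1", "node2", "node3"}
chunk_locations = {                     # chunk id -> nodes holding a replica
    "c1": {"node1"}, "c2": {"node2"}, "c3": {"node1", "node3"},
}
tasks = [("t1", "c1"), ("t2", "c3"), ("t3", "c1"), ("t4", "c2"), ("t5", "c1")]

# 1) Identify frequently shared chunks and replicate them to every node.
usage = Counter(chunk for _, chunk in tasks)
for chunk, count in usage.items():
    if count >= 3:                      # arbitrary "hot" threshold
        chunk_locations[chunk] = set(all_nodes)

# 2) Data-aware dispatch: pick a node that already holds the task's input.
for task, chunk in tasks:
    target = sorted(chunk_locations[chunk])[0]   # simplest possible tie-break
    print(f"{task} (input {chunk}) -> {target}")
```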
We have developed a distributed parallel storage system that employs the aggregate bandwidth of multiple data servers connected by a high-speed wide-area network to achieve scalability and high data throughput. This paper studies different schemes to enhance the reliability and availability of such network-based distributed storage systems. The general approach of this paper employs “erasure” error-correcting codes that can be used to reconstruct missing information caused by hardware, software, or human faults. The paper describes the approach and develops optimized algorithms for the encoding and decoding operations. Moreover, the paper presents techniques for reducing the communication and computation overhead incurred while reconstructing missing data from the redundant information. These techniques include clustering, multidimensional coding, and the full two-dimensional parity schemes. The paper considers trade-offs between redundancy, fault tolerance, and complexity of error recovery.
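As background on the erasure-code idea, the minimal single-parity example below shows how one missing block is rebuilt by XOR-ing the surviving blocks with the parity block; the clustered and two-dimensional parity schemes studied in the paper generalize this pattern.

```python
# Single-parity erasure coding in miniature: the parity block is the XOR of all
# data blocks, so any ONE missing block equals the XOR of the surviving blocks.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"serverA!", b"serverB!", b"serverC!"]   # equally sized data blocks
parity = xor_blocks(data)

# Simulate losing the second server and reconstructing its block.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
assert recovered == data[1]
print("recovered block:", recovered)
```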
In distributed relational databases, the cost of a query consists of a local cost and a transmission cost. Query optimization is a combinatorial optimization problem. As the query size grows, optimization methods based on exhaustive search become too expensive. We propose the following strategy for solving large distributed query optimization problems in relational database systems: (1) represent each query-processing schedule by a labeled directed graph; (2) reduce the number of different schedules by pruning away invalid or high-cost solutions; and (3) find a suboptimal schedule by combinatorial optimization. We investigate several combinatorial optimization techniques: random search, single start, multistart, simulated annealing, and a combination of random search and local simulated annealing. The utility of combinatorial optimization is demonstrated on the problem of finding the (sub)optimal semijoin schedule that fully reduces all relations of a tree query. The combination of random search and local simulated annealing was superior to the other tested methods.
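For readers unfamiliar with the technique, a generic simulated-annealing skeleton of the kind applied above is sketched below; the schedule encoding and the cost function are placeholders, not the paper's semijoin cost model.

```python
# Generic simulated-annealing skeleton for schedule search (illustrative only;
# the encoding and cost model are placeholders, not the paper's semijoin model).
import math
import random

def anneal(initial, cost, neighbor, t0=100.0, alpha=0.95, steps=2000):
    current = best = initial
    t = t0
    for _ in range(steps):
        candidate = neighbor(current)
        delta = cost(candidate) - cost(current)
        # Accept improving moves always, worsening moves with probability e^(-delta/t).
        if delta < 0 or random.random() < math.exp(-delta / t):
            current = candidate
        if cost(current) < cost(best):
            best = current
        t *= alpha                                  # cooling schedule
    return best

# Placeholder problem: order eight "semijoins" to minimize a toy cost function.
cost = lambda s: sum(abs(a - b) for a, b in zip(s, s[1:]))
def neighbor(s):
    s = list(s)
    i, j = random.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s

print(anneal(list(range(8)), cost, neighbor))
```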
The rapid growth of large-data processing has brought in the MapReduce programming model as a widely accepted solution. However, MapReduce limits itself to a one-map-to-one-reduce framework. Meanwhile, it lacks built-in support and optimization when the input datasets are shared among concurrent applications and/or jobs. Performance might be improved when the shared and frequently accessed data is read from the local rather than the distributed file system. To enhance the performance of big data applications, this paper presents Concurrent MapReduce, a new programming model built on top of MapReduce that deals with large amounts of shared data items. Concurrent MapReduce provides support for processing heterogeneous sources of input datasets and offers optimization when the datasets are partially or fully shared. Experimental evaluation has shown an execution runtime speedup of 4X compared to a traditional non-concurrent MapReduce implementation, with a manageable time overhead.
This paper presents a methodology for running NGS read mapping tools in the cloud environment based on the MapReduce programming paradigm. As a demonstration, the recently developed and robust sequence alignment tool, BFAST, is used within our methodology to handle massive datasets. The results of our experiments show that the transformation of existing read mapping tools to run within the MapReduce framework dramatically reduces the total execution time and enables the user to utilize the resources provided by the cloud.
This paper presents a technique for mapping artificial neural networks (ANNs) onto hypercube massively parallel machines. The paper starts by synthesizing a parallel structure, the mesh-of-appendixed-trees (MAT), for fast ANN implementation. Then, it presents a recursive procedure to embed the MAT structure into the hypercube topology. This procedure is used as the basis for an efficient mapping of ANN computations onto hypercube systems. Both the multilayer feedforward with backpropagation (FFBP) and the Hopfield ANN models are considered. Algorithms to implement the recall and the training phases of the FFBP model as well as the recall phase of the Hopfield model are provided. The major advantage of our technique is high performance. Unlike other techniques presented in the literature, which require O(N) time, where N is the size of the largest layer, our implementation requires only O(log N) time. Moreover, it allows pipelining of more than one input pattern and thus further improves the performance.
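The O(log N) bound rests on tree-structured reduction: a neuron's N weighted inputs are summed pairwise in about log2(N) parallel rounds instead of N-1 sequential additions. The sketch below illustrates only this reduction pattern, not the MAT structure or its hypercube embedding.

```python
# Tree-structured reduction: N products are summed pairwise in ceil(log2 N)
# rounds, which is where the O(log N) per-layer time comes from.
import math

def tree_reduce_sum(values):
    rounds = 0
    while len(values) > 1:
        # One parallel round: adjacent pairs are combined simultaneously.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

weights = [0.5, -1.0, 2.0, 0.25, 1.0, 1.0, -0.5, 3.0]
inputs  = [1, 2, 3, 4, 5, 6, 7, 8]
products = [w * x for w, x in zip(weights, inputs)]
total, rounds = tree_reduce_sum(products)
print(total, "summed in", rounds, "rounds =", math.ceil(math.log2(len(products))))
```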
Similarity search is crucial to many applications. Of particular interest are two flavors of the Hamming distance range query, namely, the Hamming select and the Hamming join (Hamming-select and Hamming-join, respectively). Hamming distance is widely used in approximate near neighbor search for high-dimensional data, such as images and document collections. For example, using predefined similarity hash functions, high-dimensional data is mapped into one-dimensional binary codes that are then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and often involve excessive redundancies. This paper introduces a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries. An efficient search algorithm based on the HA-index is presented. A distributed version of the HA-index is introduced, and algorithms for realizing Hamming distance-select and Hamming distance-join operations on a MapReduce platform are prototyped. Extensive experiments using real datasets demonstrate that the HA-index and the corresponding search algorithms achieve up to two orders of magnitude speedup over existing state-of-the-art approaches, while saving more than ten times in memory space.
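For context, the linear-scan baseline that the HA-index is designed to accelerate can be written in a few lines (an illustrative baseline only, not the HA-index itself).

```python
# Linear-scan Hamming-select baseline: return every binary code within Hamming
# distance r of the query code by scanning and comparing popcounts of XORs.

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")        # popcount of the differing bits

def hamming_select(codes, query, r):
    return [c for c in codes if hamming(c, query) <= r]

codes = [0b10110010, 0b10110011, 0b01001100, 0b10010010]
print([bin(c) for c in hamming_select(codes, query=0b10110010, r=1)])
# -> the query itself plus the codes differing in at most one bit
```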
In this paper, we present a study to counter privacy violation due to unsafe data correlation. We propose a safe correlation requirement to keep correlated values bounded by l-diversity and evaluate the trade-off to be made for the sake of a strong privacy guarantee. Finally, we present a correlation sanitization algorithm that enforces our safety constraint and demonstrate its efficiency.
The rapid advances in genome sequencing lead to the generation of huge amounts of data in a single sequencing experiment. Several genome assemblers with different objectives have been developed to process these genomic data. Obviously, the output assemblies produced by these assemblers have different qualities due to their diverse nature. Recent research efforts concluded that combining the assemblies from different assemblers would enhance the quality of the output assembly. Based on this, our study combines the five best assemblies of three fungal genomes and evaluates the quality of the output assembly compared to that produced by the individual assemblers. The results indicate that the output assembly quality is influenced more by the increase in the number of gaps in the input assemblies than by the increase in N50 size. Based on this conclusion, we propose a set of guidelines to obtain better output assemblies.
With the wide adoption of the cloud computing paradigm, it is important to develop appropriate techniques to protect client data privacy in the cloud. Encryption is one of the major techniques that could be used to achieve this goal. However, data encryption at rest alone is insufficient for secure cloud computation environments. Efficient techniques for carrying out computation over encrypted data are also required. Fully homomorphic encryption (FHE) and garbled circuits are natural tools for processing encrypted data without leaking any information about the data. However, existing FHE schemes are inefficient for processing large amounts of data in the cloud, and garbled circuits are one-time programs that cannot be reused. Using modern technologies such as FHE, several authors have developed reusable garbled circuit techniques in recent years, but these are not efficient either and could not be deployed at a large scale. By relaxing the privacy definition from perfect forward secrecy to all-or-nothing privacy, we are able to design efficient reusable garbled circuits in this paper. These reusable garbled computation techniques could be used for processing encrypted cloud data efficiently.
This paper argues that the contextual properties of cloud-enabled software architecture should be identified and understood differently by cloud consumers. The existing architectures are not designed to exploit the contextual properties of cloud computing. The emergence of cloud computing virtually forces cloud consumers to re-evaluate their software architectures in light of the cloud computing context, which requires relinquishing control over most architectural components to cloud providers. In a cloud-enabled architecture, the shift of ownership of and control over architectural components from consumers to cloud providers has a profound impact on the ways cloud consumers design their software architecture. In this perspective, we go beyond the traditional definition of software architecture, and introduce the concepts of architectural scope, ownership, and control as the contextual properties of an architecture.
With the advent of cloud computing there is an increased interest in outsourcing an organization’s data to a remote provider in order to reduce the costs associated with self-hosting. If that database contains information about individuals (such as medical information), it is increasingly important to also protect the privacy of the individuals contained in the database. Existing work in this area has focused on preventing the hosting provider from ascertaining individually identifiable sensitive data from the database, through database encryption or manipulating the data to provide privacy guarantees based on privacy models such as k-anonymity. Little work has been done to ensure that information contained in queries on the data, in conjunction with the data, does not result in a privacy violation. In this work we present a hash based method which provably allows the privacy constraint of an unencrypted database to be extended to the queries performed on the database. In addition, we identify a privacy limitation of such an approach, describe how it could be exploited using a known-query attack, and propose a counter-measure based on oblivious storage.
Similarity group-by (SGB, for short) has been proposed as a relational database operator to match the needs of emerging database applications. Many SGB operators that extend SQL have been proposed in the literature, e.g., similarity operators in the one-dimensional space. These operators have various semantics. Depending on how these operators are implemented, some of the implementations may lead to different groupings of the data. Hence, if SQL code is ported from one database system to another, it is not guaranteed that the code will produce the same results. In this paper, we investigate the various semantics for the relational similarity group-by operators in the multi-dimensional space. We define the class of order-independent SGB operators that produce the same results regardless of the order in which the input data is presented to them. Using the notion of interval graphs borrowed from graph theory, we prove that, for certain SGB operators, there exist order-independent implementations. For each of these operators, we provide a sample algorithm that is order-independent. Also, we prove that for other SGB operators, there does not exist an order-independent implementation, and hence these SGB operators are ill-defined and should not be adopted in extensions to SQL to realize similarity group-by. In this paper, we introduce an SGB operator, namely SGB-All, for grouping multi-dimensional data using similarity. SGB-All forms groups such that a data item, say O, belongs to a group, say G, if and only if O is within a user-defined threshold from all other data items in G. In other words, each group in SGB-All forms a clique of nearby data items in the multi-dimensional space. We prove that SGB-All is order-independent, i.e., there is at least one algorithm for each option that is independent of the presentation order of the input data.
Distributed storage systems are increasingly being used by data-intensive applications for efficient and reliable data delivery. The Network Storage Manager (NSM) is a distributed storage framework with a unique architecture that maximizes applications' control over many of the storage and retrieval policies. Several applications are utilizing NSM for efficient, tunable, and controllable performance. Data layout is one policy that is considered to be application-dependent, and tailored algorithms are preferred for applications with complex or irregular access patterns. Experimental results have shown dramatic performance enhancement when optimized layout policies override the default NSM implementation. Layout algorithms are more effective when proper prefetching and cache replacement policies are implemented.
Paillier’s additive homomorphic encryption is increasingly used in recent research on secure cloud outsourcing and privacy-preserving computation, in addition to other cryptographic tools such as garbled circuits. In this paper, we review Paillier’s encryption and its application to privacy-preserving computation outsourcing and secure online voting. We present a new implementation of Paillier’s cryptosystem using Python as the interface language and fast GMP C routines for arithmetic operations.
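A textbook sketch of the scheme is shown below for reference; it uses toy parameters and plain Python arithmetic rather than the GMP routines of the paper's implementation, and it is not secure as written.

```python
# Textbook Paillier sketch (toy, insecure parameters; plain Python pow() in
# place of GMP). Enc(m) = g^m * r^n mod n^2 with g = n + 1.
import math
import random

p, q = 293, 433                     # toy primes; real keys use >= 1024-bit primes
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)        # lambda(n)
g = n + 1
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # inverse of L(g^lambda mod n^2) mod n

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n) * mu % n

c1, c2 = encrypt(40), encrypt(2)
# Additive homomorphism: multiplying ciphertexts adds the underlying plaintexts.
print(decrypt((c1 * c2) % n2))      # -> 42
```

The additive homomorphism exercised in the last lines is what makes the scheme attractive for privacy-preserving aggregation and online voting: encrypted values can be combined without ever being decrypted.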
As we delve deeper into the ‘Digital Age’, we witness an explosive growth in the volume, velocity, and variety of the data available on the Internet. For example, in 2012 about 2.5 quintillion bytes of data were created on a daily basis, originating from a myriad of sources and applications including mobile devices, sensors, individual archives, social networks, Internet of Things, enterprises, cameras, software logs, etc. Such a ‘data explosion’ has led to one of the most challenging research issues of the current Information and Communication Technology era: how to optimally manage (e.g., store, replicate, filter, and the like) such large amounts of data and identify new ways to analyze them for unlocking information. It is clear that such large data streams cannot be managed by setting up on-premises enterprise database systems, as that leads to a large up-front cost in buying and administering the hardware and software systems. Therefore, next-generation data management systems must be deployed on the cloud. The cloud computing paradigm provides scalable and elastic resources, such as data and services accessible over the Internet. Every Cloud Service Provider must assure that data is efficiently processed and distributed in a way that does not compromise end-users’ Quality of Service (QoS) in terms of data availability, data search delay, data analysis delay, and the like. In the aforementioned perspective, data replication is used in the cloud for improving the performance (e.g., read and write delay) of applications that access data. Through replication, a data-intensive application or system can achieve high availability, better fault tolerance, and data recovery. In this paper, we survey data management and replication approaches (from 2007 to 2011) that have been developed by both the industrial and research communities. The focus of the survey is to discuss and characterize the existing approaches to data replication and management that tackle resource usage and QoS provisioning with different levels of efficiency. Moreover, the breakdown of both influential expressions (data replication and management) with respect to the QoS attributes they provide is deliberated. Furthermore, the performance advantages and disadvantages of data replication and management approaches in cloud computing environments are analyzed. Open issues and future challenges related to data consistency, scalability, load balancing, processing, and placement are also reported.
Designing a Video-on-Demand (VoD) system imposes a great challenge due to the huge storage requirements and real-time constraints of continuous media. Continuous media are different from traditional textual data. First, the retrieval and display of continuous media are subject to real-time constraints. Second, objects of this media type are typically large in size. In this paper, we present the performance evaluation of a VoD system utilizing a distributed parallel data delivery mechanism for the storage and retrieval of continuous media that satisfies the VoD QoS parameters in terms of delay jitter over best-effort networks.
Cloud computing enables a cost-effective outsourcing of storage and resource-intensive computations. Secure outsourcing of data and computation is challenging in this emerging computing model. In particular, outsourcing of sensitive computations should be assured in terms of input privacy and cheating detection. Existing approaches use either expensive homomorphic encryption, which is not yet efficient enough for practical applications, or secure tamper-proof hardware, which is expensive and not scalable. Work on devising a real-world framework enabling such secure outsourcing is still needed. In this paper, we propose practical protocols for secure outsourcing of matrix algebra using randomization and without using cryptography. We address the issues of a real deployment in terms of distributed, reliable and secure storage, secure data transfer, and computation on multiple, non-colluding clouds. We present the architecture and the APIs of our framework as well as some experimental results demonstrating the effectiveness of the proposed approach.
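One classical randomization disguise in this spirit is sketched below as an illustration (a textbook transformation, not necessarily the exact protocol of the paper): the client blinds A and B with random invertible diagonal matrices before uploading, and unblinds the product returned by the cloud.

```python
# Randomization disguise for outsourced matrix multiplication (illustrative):
# the cloud sees only D1*A*D2 and inv(D2)*B*D3, whose product is D1*(A@B)*D3,
# which the client unblinds locally.
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.standard_normal((m, m))
B = rng.standard_normal((m, m))

# Client-side blinding with random invertible diagonal matrices D1, D2, D3.
d1, d2, d3 = (rng.uniform(1.0, 2.0, m) for _ in range(3))
A_blind = np.diag(d1) @ A @ np.diag(d2)
B_blind = np.diag(1.0 / d2) @ B @ np.diag(d3)

# The cloud multiplies the blinded matrices only.
C_blind = A_blind @ B_blind                      # equals D1 @ (A @ B) @ D3

# Client unblinds the result locally and checks it.
C = np.diag(1.0 / d1) @ C_blind @ np.diag(1.0 / d3)
assert np.allclose(C, A @ B)
print("recovered A @ B without sending A or B in the clear")
```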
This paper attempts to identify the role of contextual properties of enterprise systems architecture in relation to service migration to cloud computing. In a cloud-based service architecture, the shift of ownership, scope, and control over architectural elements from consumers to cloud providers has a profound impact on ways cloud consumers design and manage their systems architecture. In this perspective, we introduce the concepts of architectural scope, ownership, and control as the contextual properties of systems architecture. The paper explores ways in which these properties can be mapped into a quantifiable framework that could be used to measure the degree of changes of contextual properties due to service migration to cloud computing. We seek here to address the service migration problems from a different perspective, namely, focusing on the contextual properties of architectural elements.
We treat the problem of secure outsourcing of sequence comparisons by a client to remote servers, which, given two strings λ and μ of respective lengths n and m, consists of finding a minimum-cost sequence of insertions, deletions, and substitutions (also called an edit script) that transforms λ into μ. In our setting a client owns λ and μ and outsources the computation to two servers without revealing to them information about either the input strings or the output sequence. Our solution is non-interactive for the client (who only sends information about the inputs and receives the output) and the client’s work is linear in its input/output. The servers’ performance is O(σmn) computation (which is optimal) and communication, where σ is the alphabet size, and the solution is designed to work when the servers have only O(σ(m + n)) memory. By utilizing garbled circuit evaluation in a novel way, we completely avoid public-key cryptography, which makes our solution particularly efficient.
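For reference, the computation being outsourced is the standard O(mn) edit-distance dynamic program, sketched below in the clear; the contribution of the paper is evaluating it across two servers without revealing λ, μ, or the resulting edit script.

```python
# The classic O(m*n) edit-distance dynamic program (unit costs), shown in the
# clear; the paper evaluates this computation on two servers such that neither
# learns the input strings or the output edit script.

def edit_distance(lam: str, mu: str) -> int:
    n, m = len(lam), len(mu)
    prev = list(range(m + 1))                   # row 0: cost of inserting mu[:j]
    for i in range(1, n + 1):
        curr = [i] + [0] * m                    # cost of deleting lam[:i]
        for j in range(1, m + 1):
            substitute = prev[j - 1] + (lam[i - 1] != mu[j - 1])
            curr[j] = min(substitute, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return prev[m]

print(edit_distance("kitten", "sitting"))       # -> 3
```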
There has been much recent work on secure storage outsourcing, where an organization wants to store its data at untrusted remote cloud servers in an encrypted form, such that its own employees can query the encrypted data using weak devices (both computationally and storage-wise), or where a weak client wants to outsource an expensive computational task without revealing to the servers either the inputs or the computed outputs. The framework requires that the bulk of the computational burden of query processing be placed on the remote servers, without revealing to these servers anything about the data. Most of the existing work in this area deals with non-image data that is keyword based, whereas the present paper deals with raw image data (without any keyword annotations). We demonstrate that shape-based image feature extraction, a particularly computationally intensive task, can be carried out within this framework by presenting two schemes for doing so and demonstrating their viability through experimental evaluation. Our results can be used in a number of practical situations. In one scenario the client has images and wants to securely outsource shape-based feature extraction on them; in another, the server has encrypted images and the client wants a feature-extracted representation of those that are feature-rich.
This paper reports the design of a cloud-based service for coordinating secure outsourcing of storage and computation of scientific data, particularly matrices. While this service may support different secure outsourcing protocols and mechanisms (e.g. homomorphic encryption, secret sharing and randomization), we hide all the complexity from end-users and move it to a middleware broker. The broker manages the communication between the client and one or more clouds. The requests submitted by users are automatically translated into WS-BPEL workflows and then executed by the broker workflow engine to coordinate the control flow and the data flow for outsourcing of matrix operations. We detail the architecture of our framework and the design of its key components. Our work facilitates real-world and practical deployment of recently proposed secure outsourcing protocols.
In this paper, we identify a new and challenging application for the growing field of research on data anonymization and secure outsourcing of storage and computations to the cloud. Network flow data analysis is of high importance for network monitoring and management. Network monitoring applications reveal new challenges not yet addressed in the secure outsourcing literature. The secure and verifiable outsourcing of computation on anonymized network flow records provides a practical tool for network operators to harness the cloud benefits, which have remained untapped until now because of privacy concerns. We present representative use-cases and problems, and identify sample related work that can be utilized for developing an effective solution.
The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, allowing similarity-aware grouping provides a more realistic view of real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat the dimensions of multi-dimensional attributes independently. As a result, correlated attributes, such as in spatial data, are processed independently, and hence groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude enhancement in performance over baseline methods developed to solve the same problem.
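The two membership criteria can be illustrated with a greedy one-pass sketch (illustrative semantics only, not the operators' PostgreSQL implementation).

```python
# Greedy one-pass illustration of the two grouping criteria: a point joins a
# group if it is within eps of ALL current members (clique / distance-to-all)
# or of ANY current member (distance-to-any).
import math

def group(points, eps, mode="all"):
    test = all if mode == "all" else any
    groups = []
    for p in points:
        for g in groups:
            if test(math.dist(p, q) <= eps for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

pts = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(group(pts, eps=1.5, mode="all"))   # two tight groups of nearby points
print(group(pts, eps=1.5, mode="any"))   # one chained group covering all points
```

Note that such a greedy pass is order-dependent and silently places a tuple in the first qualifying group, whereas the operators above define explicit semantics (eliminate the tuple, place it in one group, or start a new group) for tuples that qualify for several groups.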
A cloud mashup is composed of multiple services with shared datasets and integrated functionalities. For example, the Elastic Compute Cloud (EC2) provided by Amazon Web Services (AWS), the authentication and authorization services provided by Facebook, and the Map service provided by Google can all be mashed up to deliver a real-time, personalized driving route recommendation service. To discover qualified services and compose them with guaranteed quality of service (QoS), we propose an integrated skyline query processing method for building cloud mashup applications. We use a similarity test to achieve an optimal localized skyline. This mashup method scales well with the growing number of cloud sites involved in the mashup applications. Faster skyline selection, reduced composition time, dataset sharing, and resource integration assure the QoS over multiple clouds. We experiment with the Quality of Web Service (QWS) benchmark of over 10,000 Web services along six QoS dimensions. By utilizing block elimination, data-space partitioning, and service similarity pruning, the skyline process is shortened threefold compared with two state-of-the-art methods.
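The core primitive here is skyline (Pareto-optimal) selection over QoS vectors; a brute-force sketch is shown below for illustration, whereas the paper prunes the search with block elimination, data-space partitioning, and similarity tests.

```python
# Brute-force skyline (Pareto-optimal) selection over QoS vectors,
# assuming lower is better in every dimension (e.g., latency, price).

def dominates(a, b):
    """a dominates b if it is no worse in every dimension and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(services):
    return [s for s in services
            if not any(dominates(other, s) for other in services if other is not s)]

services = [(120, 0.9), (100, 1.2), (150, 0.7), (110, 0.9)]   # (latency ms, price)
print(skyline(services))   # (120, 0.9) is excluded: dominated by (110, 0.9)
```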
The hypercube structure is a very widely used interconnection topology because of its appealing topological properties. For massively parallel systems with thousands of processors, the hypercube suffers from a high node fanout, which makes such systems impractical and infeasible. In this paper, we introduce an interconnection network called the Extended Cube Connected Cycles (ECCC), which is suitable for massively parallel systems. In this topology, the processor fanout is fixed at four. Other attractive properties of the ECCC include a diameter of logarithmic order and a small average interprocessor communication distance, which imply fast data transfer. The paper presents two algorithms for data communication in the ECCC. The first algorithm is for node-to-node communication and the second is for node-to-all broadcasting. Both algorithms take O(log N) time units, where N is the total number of processors in the system. In addition, the paper shows that a wide class of problems, the divide-and-conquer class, is easily and efficiently solvable on the ECCC topology. The solution of a divide-and-conquer problem of size N requires O(log N) time units.
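For reference, the standard hypercube one-to-all broadcast that such O(log N) algorithms build on forwards the message across one address dimension per step; the sketch below illustrates this base pattern only, not the ECCC-specific algorithms.

```python
# Standard hypercube one-to-all broadcast: in step d, every node that already
# holds the message forwards it across dimension d, so all N = 2^k nodes are
# covered after k = log2(N) steps.

def hypercube_broadcast(k: int, source: int = 0):
    holders = {source}
    for d in range(k):                            # one communication step per dimension
        holders |= {node ^ (1 << d) for node in holders}
        print(f"step {d + 1}: {sorted(holders)}")
    return holders

assert hypercube_broadcast(3) == set(range(8))    # 8 nodes reached in 3 steps
```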
Interconnection networks play a crucial role in the performance of parallel systems. This paper introduces a new interconnection topology called the hierarchical hypercube (HHC). This topology is suitable for massively parallel systems with thousands of processors. An appealing property of this network is the low number of connections per processor, which enhances the VLSI design and fabrication of the system. Other alluring features include symmetry and logarithmic diameter, which imply easy and fast algorithms for communication. Moreover, the HHC is scalable; that is, it can embed HHCs of lower dimensions. The paper presents two algorithms for data communication in the HHC. The first algorithm is for one-to-one transfer, and the second is for one-to-all broadcasting. Both algorithms take O(log2 k) time, where k is the total number of processors in the system. A wide class of problems, the divide-and-conquer class (D&Q), is shown to be easily and efficiently solvable on the HHC topology. Parallel algorithms are provided to describe how a D&Q problem can be solved efficiently on an HHC structure. The solution of a D&Q problem instance having up to k inputs requires a time complexity of O(log2 k).
Identifying similarities in large datasets is an essential operation in several applications such as bioinformatics, pattern recognition, and data integration. To make a relational database management system similarity-aware, the core relational operators have to be extended. While similarity-awareness has been introduced in database engines for relational operators such as joins and group-by, little has been achieved for relational set operators, namely Intersection, Difference, and Union. In this paper, we propose to extend the semantics of relational set operators to take into account the similarity of values. We develop efficient query processing algorithms for evaluating them, and implement these operators inside an open-source database system, namely PostgreSQL. By extending several queries from the TPC-H benchmark to include predicates that involve similarity-based set operators, we perform extensive experiments that demonstrate up to three orders of magnitude speedup in performance over equivalent queries that only employ regular operators.
Identifying similarities in large datasets is an essential operation in many applications such as bioinformatics, pattern recognition, and data integration. To make the underlying database system similarity-aware, the core relational operators have to be extended. Several similarity-aware relational operators have been proposed that introduce similarity processing at the database engine level, e.g., similarity joins and similarity group-by. This paper extends the semantics of the set intersection operator to operate over similar values. The paper describes the semantics of the similarity-based set intersection operator, and develops an efficient query processing algorithm for evaluating it. The proposed operator is implemented inside an open-source database system, namely PostgreSQL. Several queries from the TPC-H benchmark are extended to include similarity-based set intersection predicates. Performance results demonstrate up to three orders of magnitude speedup in performance over equivalent queries that only employ regular operators.
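One possible reading of the similarity-based set intersection semantics over numeric values is sketched below (an illustration only, not the PostgreSQL implementation described in the paper).

```python
# Possible similarity-based set intersection over numeric values: a value of R
# is returned if some value of S lies within eps of it.

def sim_intersection(R, S, eps):
    return [r for r in R if any(abs(r - s) <= eps for s in S)]

R = [1.0, 4.9, 10.0, 20.0]
S = [5.0, 10.2, 42.0]
print(sim_intersection(R, S, eps=0.3))   # -> [4.9, 10.0]
```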
A massively parallel architecture called the mesh-of-appendixed-trees (MAT) is shown to be suitable for processing artificial neural networks (ANNs). Both the recall and the learning phases of the multilayer feedforward with backpropagation ANN model are considered. The MAT structure is refined to produce two special-purpose array processors, FMAT1 and FMAT2, for efficient ANN computation. This refinement tends to reduce circuit area and increase hardware utilization. FMAT1 is a simple structure suitable for the recall phase. FMAT2 requires little extra hardware but supports learning as well. A major characteristic of the proposed neurocomputers is high performance. It takes O(log N) time to process a neural network with N neurons in its largest layer. Our proposed architecture is shown to provide the best number of connections per unit time when compared to several major techniques in the literature. Another important feature of our approach is its ability to pipeline more than one input pattern, which further improves the performance.
We introduce operations to safely update an anatomized database. The result is a database where the view of the server satisfies standards such as k-anonymity or l-diversity, but the client is able to query and modify the original data. By exposing data where possible, the server can perform value-added services such as data analysis not possible with fully encrypted data, while still being unable to violate privacy constraints. Update is a key challenge with this model; naïve application of insertion and deletion operations reveals the actual data to the server. This paper shows how data can be safely inserted, deleted, and updated. The key ideas are that data is inserted or updated into an encrypted temporary table until enough data is available to safely decrypt, and that sensitive information of deleted tuples is left behind to ensure privacy of both deleted and undeleted individuals. This approach is proven effective in maintaining the privacy constraint against an adversarial server. The paper also gives empirical results on how much data remains encrypted, and the resulting quality of the server’s (anatomized) view of the data, for various update and delete rates.
In this paper, we address privacy breaches in transactional data where individuals have multiple tuples in a dataset. We provide a safe grouping principle to ensure that correlated values are grouped together in unique partitions that enforce l-diversity at the level of individuals. We conduct a set of experiments to evaluate privacy breach and the anonymization cost of safe grouping.
The all-pairs suffix-prefix matching problem is a basic problem in string processing. It has an application in the de novo genome assembly task, which is one of the major bioinformatics problems. Due to the large size of the input data, it is crucial to use fast and space-efficient solutions. In this paper, we present a space-economical solution to this problem using the generalized Sadakane compressed suffix tree. Furthermore, we present a parallel algorithm to provide more speed for shared-memory computers. Our sequential and parallel algorithms are optimized by exploiting features of the Sadakane compressed index data structure. Experimental results show that our solution based on Sadakane's compressed index consumes significantly less space than the ones based on non-compressed data structures like the suffix tree and the enhanced suffix array. Our experimental results show that our parallel algorithm is efficient and scales well with an increasing number of processors.