Campus Access Dissertation
Doctor of Philosophy (PhD)
The explosion of big data across many domains, combined with large-scale parallel computing, has created new opportunities, from enabling scientific discovery to healthcare and economic measurement. A key strength of big-data platforms is the resource manager, which lets computers with different configurations and computing capabilities work together as a unified system. The resource manager keeps track of resource availability and effectively "moves compute to data." As a result, big data increasingly runs in computing environments beyond traditional cloud computing: an IoT cloud, a heterogeneous system, or even a cluster of unprotected hosts recruited on the Internet. Deploying big data in these new environments remains an intricate, time-consuming process that often requires integrating technologies from multiple fields, including network optimization, resource management, and security. Nevertheless, such integration is far from optimal; there is undoubtedly a gap between the true potential of these computing environments and how they are being used to empower big data-driven applications. This dissertation uncovers new strategies for bridging this gap and improving performance, such as reducing execution time and optimizing the network, by developing dynamic networking, efficient resource management, and extra protection for big data in different environments. The dissertation focuses on coping with the challenges of three different big-data environments.
In the first part of this dissertation, we focus on improving the performance of big data in a homogeneous cluster, considering a new type of bundled job in which the input data and the associated application jobs arrive together as a bundle. Our objective is to break the barrier between resource management and the underlying storage layer, and thereby improve data locality, a critical performance factor for resource management, from the perspective of the storage system. We develop a sampling-based randomized algorithm for the storage system to arrange the placement of input data blocks. The principal idea is to query a selected set of candidate nodes and estimate their workloads at run time by coupling centralized and per-node information; the node with the least workload is selected to store the data block.

In the second part, we study the security of big-data processing in hostile environments from the aspects of resource abuse and data verification. We develop RoVEr, an efficient and verifiable ECS for big-data platforms. In RoVEr, every piece of data is associated with checksums that reside on a set of witnesses. Each witness keeps a record of the checksums in a Bloom filter, and data verification is decided by majority voting among the witnesses. RoVEr also supports fast reconstruction of a Bloom filter when a node recovers from a failure.
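The sampling-based placement idea from the first part can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation: the `Node` class, the `place_block` helper, and the use of a simple counter as the workload estimate are all assumptions made for the example.

```python
import random

class Node:
    """Hypothetical model of a storage node with an estimated workload."""
    def __init__(self, name):
        self.name = name
        self.load = 0  # stand-in for the run-time workload estimate

def place_block(nodes, sample_size=3):
    """Probe a small random sample of candidate nodes and store the
    block on the least-loaded one (a power-of-d-choices strategy)."""
    candidates = random.sample(nodes, min(sample_size, len(nodes)))
    target = min(candidates, key=lambda n: n.load)
    target.load += 1
    return target
```

Probing only a few candidates keeps the per-placement cost constant while still spreading blocks far more evenly than a single random choice would.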
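The witness-based verification in the second part can be illustrated with a toy sketch, assuming a basic Bloom filter and SHA-256 checksums; the class and function names here are illustrative and not RoVEr's real interfaces.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions derived from SHA-256."""
    def __init__(self, size=1024, k=3):
        self.size, self.k = size, k
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def contains(self, item):
        return all(self.bits[p] for p in self._positions(item))

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, witnesses) -> bool:
    """Accept the data only if a majority of witnesses recognize
    its checksum in their Bloom filters."""
    c = checksum(data)
    votes = sum(w.contains(c) for w in witnesses)
    return votes > len(witnesses) // 2
```

Storing only Bloom-filter bits rather than raw checksums keeps the per-witness memory footprint small, at the cost of a tunable false-positive rate that majority voting helps absorb.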
Lastly, we argue that a cluster of IoT devices can support big data by taking into account the information fed back by big-data applications, boosting network throughput for big-data computing in IoT environments. We propose owlBIT, a cross-layer framework that uses application-layer information to guide packet scheduling at the link layer. We implement our system as an extension module in a Hadoop YARN system, and the experimental evaluation shows notable performance improvements.
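The cross-layer idea can be sketched as a priority queue in which the application layer tags each packet with a job-level urgency hint and the link layer dequeues the most urgent packet first. The `CrossLayerScheduler` class and its method names are hypothetical, made up for this sketch rather than taken from owlBIT.

```python
import heapq
from itertools import count

class CrossLayerScheduler:
    """Link-layer queue ordered by application-supplied priority hints."""
    def __init__(self):
        self._heap = []
        self._seq = count()  # FIFO tie-break within equal priorities

    def enqueue(self, packet, app_priority):
        # Lower number = more urgent, as hinted by the application layer
        # (e.g., shuffle traffic for a straggling job).
        heapq.heappush(self._heap, (app_priority, next(self._seq), packet))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]
```

The sequence counter preserves arrival order among packets that share a priority, so the scheduler never starves or reorders traffic within one urgency class.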
Nguyen, Nam S., "Holistic Cross-Layer Improvements for Big Data Processing in New Computing Environments" (2020). Graduate Doctoral Dissertations. 602.