A Study on the Improvement of Data Collection in Data Centers and Its Analysis on Deep Learning-based Applications
Doctor of Philosophy (PhD)
Division of Computer Science and Engineering
Big data are usually stored in data center networks for processing and analysis through various cloud applications. Such applications are collections of data-intensive jobs that often involve many parallel flows and are network bound in the distributed environment. Coflow, a recent networking abstraction that lets the data-parallel programming paradigm express its communication requirements, has opened new opportunities for network scheduling for such applications. Therefore, I propose a coflow-based network scheduling algorithm, Coflourish, to reduce the job completion time of such data-parallel applications in the presence of increased background traffic that mimics the cloud environment infrastructure. It outperforms Varys, the state-of-the-art coflow scheduling technique, by 75.5% under various workload conditions. However, such techniques often require customized operating systems, customized computing frameworks, or external proprietary software-defined networking (SDN) switches. Consequently, to achieve the minimal application completion time through coflow scheduling, coflow routing, and a per-rate per-flow scheduling paradigm with minimum customization to the hosts and switches, I propose another scheduling technique, MinCOF, which exploits OpenFlow SDN. MinCOF offers faster deployability and has no proprietary system requirements. It also decreases the average coflow completion time by 12.94% compared to the latest OpenFlow-based coflow scheduling and routing framework. Although the challenges related to the analysis and processing of big data can be handled effectively by addressing network issues, analyzing data effectively is sometimes also difficult because of limited data size. To further analyze such collected data, I use various deep learning approaches.
Specifically, I design a framework to collect Twitter data during natural disaster events and then deploy a deep learning model to detect fake news spreading during such crisis situations. The widespread circulation of fake news during disaster events disrupts rescue missions and recovery activities, costing human lives and delaying response. My deep learning model classifies such fake events with 91.47% accuracy and an F1 score of 90.89 to help emergency managers during crises. Therefore, this study focuses on providing network solutions that decrease the application completion time in the cloud environment, and on analyzing the data collected with the deployed network framework so that it can be used to solve real-world problems through various deep learning approaches.
Singh, Dipak Kumar, "A Study on the Improvement of Data Collection in Data Centers and Its Analysis on Deep Learning-based Applications" (2020). LSU Doctoral Dissertations. 5285.