Improving Cloud Storage Performance with Adaptive Resource Management
Fetching and storing content on today's cloud storage services is associated with high latency variance and throughput degradation, caused by intrinsically complex issues: contention for shared resources; interference from background tasks such as data scrubbing, backfilling, and recovery; and differences in the processing capabilities of heterogeneous servers in a cloud datacenter, which are gradually upgraded and replaced. This significantly impacts a broad range of applications (web search, social networking, data analytics, etc.) that are characterized by massive working sets and real-time constraints. Existing cloud storage services have little or no ability to react quickly to performance hotspots created by hardware heterogeneity and shared-resource contention. Existing techniques for improving cloud storage performance often rely on client-centric, application-specific fine tuning that cannot be generalized to a broad range of applications. Server-centric approaches focus on auto-scaling and storage-node partitioning, but they do not scale well, mainly because of the overhead of moving data among storage nodes and their inability to adapt to previously unseen workloads.
In this research, we propose to make cloud storage services resilient to the above challenges by adapting to heterogeneous hardware configurations and changing workload conditions. Our initial study, DLR (Dynamic Load Redistribution), built on Ceph, an open-source distributed storage platform, examined the feasibility of improving cloud storage performance through statically tuned load-balancing rules, and identified the challenges in scaling such an approach to diverse workload mixes and resource bottlenecks. To address these challenges, we developed a machine-learning-based system adaptation technique that enables a cloud storage system to manage itself through load balancing and data migration, with the aim of delivering optimal performance under diverse workload patterns and resource bottlenecks. In particular, we applied a stochastic policy-gradient reinforcement learning technique that tracks performance hotspots in the storage cluster and takes corrective actions to maximize future performance under a variety of complex scenarios. For this purpose, we leveraged system-level performance monitoring and control knobs commonly available in object-based cloud storage systems. We implemented these techniques in an Adaptive Resource Management (ARM) system for an object-based storage cluster and evaluated its performance on NSF Cloud's Chameleon testbed. Experiments using the Cloud Object Storage Benchmark (COSBench) show that ARM improves cloud storage operations compared to both the default storage configuration and our previous approach, DLR, while imposing minimal overhead on cloud storage operations.
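To illustrate the flavor of the stochastic policy-gradient approach described above, the following is a minimal, self-contained sketch: a REINFORCE-style controller that observes a simulated latency hotspot and learns which storage node to down-weight (a placement-weight control knob). Everything here is an illustrative assumption for exposition — the cluster size, the simulated latency model, the reward shape, and the single control knob are not the paper's actual ARM implementation, which operates on a real Ceph cluster with richer monitoring and actions.

```python
import numpy as np

# Hypothetical toy setup (assumption, not the ARM system itself):
# 4 storage nodes; node 2 is a hotspot with 5x base latency.
NUM_NODES = 4
rng = np.random.default_rng(0)

theta = np.zeros(NUM_NODES)   # policy parameters: one logit per node
alpha = 0.5                   # learning rate
baseline = None               # running-average reward baseline

def node_latencies(weights):
    """Simulated per-node latency: load (hence latency) scales with a
    node's placement weight relative to the cluster average."""
    base = np.array([1.0, 1.0, 5.0, 1.0])   # node 2 is the hotspot
    return base * weights / weights.mean()

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(500):
    probs = softmax(theta)
    action = rng.choice(NUM_NODES, p=probs)   # pick a node to down-weight
    weights = np.ones(NUM_NODES)
    weights[action] *= 0.9                    # apply the control knob
    reward = -node_latencies(weights).max()   # lower tail latency = better

    # Running baseline reduces gradient variance (standard REINFORCE trick).
    baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward

    # REINFORCE update: grad of log pi(action) = one_hot(action) - probs.
    grad = -probs
    grad[action] += 1.0
    theta += alpha * (reward - baseline) * grad

# The learned policy should favor down-weighting the hotspot node (index 2).
probs = softmax(theta)
```

In the same spirit, ARM's agent consumes system-level performance metrics as state and issues load-balancing or migration actions; this sketch compresses that loop into a single-knob bandit purely to show the policy-gradient mechanics.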