Adaptive Cloud Resource Management with Reinforcement Learning

Date

2018

Authors

Mehra, Rohit

Abstract

Cloud computing is a dynamic environment whose performance depends on both internal resources and external workloads. As a result, cloud services can suffer from high latency variance and throughput degradation caused by load imbalance, interference from background tasks such as data scrubbing, backfilling, and recovery, and differences in the processing capabilities of heterogeneous servers in a datacenter. Resource management is key to achieving maximum performance when the system faces heterogeneity, background interference, or varying workload conditions. However, it is challenging for human operators to effectively monitor the health of cloud-based systems and hand-tune the many control knobs in a cloud-scale cluster to maintain optimal performance under diverse workload conditions.

This study presents an Adaptive Cloud Resource Management Framework (ACRMF) that automates the configuration of cloud-based systems by monitoring system health and predicting workload conditions. At its core, the framework combines system-level performance monitoring with a model-free reinforcement learning technique to track performance hotspots in the cluster and take corrective actions that maximize future performance under a variety of complex scenarios. The study applies the framework to the Ceph cloud storage system, enabling it to manage itself through load balancing and data migration so that it delivers optimal performance in the face of diverse workload patterns and resource bottlenecks. Experiments using the Cloud Object Storage Benchmark (COSBench) show that ACRMF improves the average read and write response times of a Ceph storage cluster by up to 50% and 33%, respectively, compared to the default configuration. It also outperforms a state-of-the-art dynamic load rebalancing technique in read and write performance by 43% and 36%, respectively.
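
The abstract does not say which model-free algorithm the framework uses. As an illustration only, the sketch below assumes tabular Q-learning, one common model-free method, applied to a toy two-state cluster model. The states ("hot", "balanced"), the actions ("noop", "migrate"), the reward values, and every function name are hypothetical and are not taken from the thesis or from Ceph.

```python
import random
from collections import defaultdict

# Hypothetical, highly simplified model: none of these names come from the thesis.
STATES = ("hot", "balanced")      # "hot" = a performance hotspot exists
ACTIONS = ("noop", "migrate")     # "migrate" = move data off the hotspot


def step(state, action):
    """Toy environment returning (next_state, reward); rewards are made-up costs."""
    if action == "migrate":
        return "balanced", -1.0   # migration has a cost but clears any hotspot
    if state == "hot":
        return "hot", -5.0        # ignoring a hotspot keeps latency high
    return "balanced", 0.0        # nothing to fix, nothing spent


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy selection: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(episodes=200, steps=10, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning; every episode starts with a hotspot present."""
    rng = random.Random(seed)
    q = defaultdict(float)        # Q-table, unseen entries default to 0.0
    for _ in range(episodes):
        state = "hot"
        for _ in range(steps):
            action = choose_action(q, state, epsilon, rng)
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q):
    """Best known action per state according to the learned Q-table."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
```

In this toy model, `greedy_policy(train())` learns to migrate when a hotspot exists and to do nothing when the cluster is balanced. A real deployment would need a far richer state (per-node latency, queue depth, background task activity) and a reward derived from measured response times, none of which is specified in the abstract.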

Description

This item is available only to currently enrolled UTSA students, faculty, or staff.

Keywords

Ceph, Cloud, Cloud Computing, Machine Learning, Reinforcement Learning, Storage

Department

Computer Science