Improving performance and predictability of storage arrays
A massive amount of data is generated every day by sensors, Internet transactions, social networks, video, and other digital sources. Many organizations store this data to enable breakthrough discoveries and innovation in science, engineering, medicine, and commerce. Data at this scale poses new research problems, commonly called big data challenges. As the amount of data grows, disk I/O performance needs further attention, since it can significantly limit the performance and scalability of applications. Storage arrays have emerged as a promising technology for addressing the challenges of scalable storage and efficient retrieval of growing data.
Storage arrays are multi-disk, multi-processor storage systems that require efficient schemes for high-performance parallel I/O. Declustering and replication are two common techniques used in storage arrays to reduce query response times through parallel I/O: declustering distributes data among parallel disks, and replication places multiple copies of the data on them. One class of declustering schemes, periodic allocations, is based on number theory and provides near-optimal performance. However, finding the best-performing periodic allocation is challenging, since the search space is large and the computation is costly. Furthermore, existing retrieval techniques are designed for centralized and homogeneous storage arrays composed of identical disks. Recently, heterogeneous storage arrays consisting of solid-state and rotating disks have appeared on the market. To address the retrieval problem in general, new retrieval strategies that extend the current techniques to heterogeneous and distributed storage architectures are necessary.
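As an illustration of why the search for a good periodic allocation is costly, the following minimal sketch assumes a commonly used form in which bucket (i, j) of an N x N grid is stored on disk (a*i + b*j + c) mod N, and scores an allocation by its worst-case additive error over range queries (the gap between the busiest disk and the ideal ceil(|Q|/N)). Neither the exact form nor the metric is stated in this abstract, and the function names are hypothetical.

```python
from itertools import product


def periodic_allocation(n_disks, a, b, c=0):
    """Assumed form: bucket (i, j) of an n_disks x n_disks grid goes to
    disk (a*i + b*j + c) mod n_disks."""
    return [[(a * i + b * j + c) % n_disks for j in range(n_disks)]
            for i in range(n_disks)]


def worst_additive_error(grid, n_disks):
    """Brute-force the worst gap between the busiest disk's load and the
    ideal ceil(|Q| / n_disks) over all rectangular range queries
    (no wraparound). Illustrative metric, not taken from the abstract."""
    size = len(grid)
    worst = 0
    for r0, c0 in product(range(size), repeat=2):
        for r1 in range(r0, size):
            for c1 in range(c0, size):
                counts = [0] * n_disks
                for i in range(r0, r1 + 1):
                    for j in range(c0, c1 + 1):
                        counts[grid[i][j]] += 1
                buckets = (r1 - r0 + 1) * (c1 - c0 + 1)
                ideal = -(-buckets // n_disks)  # ceiling division
                worst = max(worst, max(counts) - ideal)
    return worst


# Compare two parameter choices on a 5-disk array.
for a, b in [(1, 0), (1, 2)]:
    grid = periodic_allocation(5, a, b)
    print((a, b), worst_additive_error(grid, 5))
```

Even this small exhaustive scoring examines every range query for every (a, b) pair, which suggests why knowing that two allocations are equivalent, and therefore need not both be scored, can shrink the search.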
In this dissertation, various techniques to improve the performance and predictability of storage arrays are investigated. First, equivalent disk allocations are explored. Using the equivalence information, it is possible to reduce the complexity of searching for good disk allocations under various criteria. Next, the generalized retrieval problem, which covers heterogeneous and distributed storage arrays, is studied. A maximum-flow-based retrieval algorithm that guarantees the optimal response time is presented for the generalized retrieval problem. Furthermore, the execution time of this retrieval algorithm is improved by using sequential and parallel push-relabel-based maximum flow implementations. In addition to the optimal response time retrieval of a single request, various strategies for retrieving continuous disk requests are investigated. Finally, a Quality of Service (QoS) framework to improve the predictability of storage arrays is presented. All proposed techniques are evaluated through extensive experiments and compared with other state-of-the-art approaches.
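The flow formulation behind optimal response time retrieval can be sketched as follows. This is a simplified illustration under stated assumptions, not the dissertation's algorithm: each disk is assumed to take a fixed integer time per bucket, network delays and initial disk loads are ignored, and plain augmenting paths stand in for the push-relabel implementations mentioned above. For a candidate response time T, disk d can serve at most floor(T / cost[d]) buckets, and the smallest T for which every requested bucket can be routed to one of its replicas is the optimum under this model. All names are hypothetical.

```python
def assign(replicas, capacity):
    """Try to give every bucket one of its replica disks without exceeding
    any disk's capacity. Augmenting-path search on the bucket-disk graph,
    equivalent to a max-flow check with unit-capacity bucket edges."""
    load = {d: [] for d in capacity}  # disk -> buckets currently on it

    def place(bucket, visited):
        for d in replicas[bucket]:
            if d in visited:
                continue
            visited.add(d)
            if len(load[d]) < capacity[d]:
                load[d].append(bucket)
                return True
            # Disk is full: try to move one of its buckets elsewhere.
            for k, other in enumerate(load[d]):
                if place(other, visited):
                    load[d][k] = bucket
                    return True
        return False

    for b in replicas:
        if not place(b, set()):
            return None
    return {b: d for d, bs in load.items() for b in bs}


def optimal_response_time(replicas, cost):
    """Smallest T such that all buckets fit when disk d serves at most
    T // cost[d] buckets. Only multiples of a disk cost can be optimal."""
    n = len(replicas)
    candidates = sorted({k * cost[d] for d in cost for k in range(1, n + 1)})
    for t in candidates:
        plan = assign(replicas, {d: t // cost[d] for d in cost})
        if plan is not None:
            return t, plan
    return None


# Hypothetical heterogeneous array: one fast SSD, two slower HDDs.
cost = {"ssd": 1, "hdd1": 4, "hdd2": 4}  # time units per bucket
replicas = {
    "q0": ["ssd", "hdd1"], "q1": ["ssd", "hdd2"], "q2": ["ssd", "hdd1"],
    "q3": ["hdd1", "hdd2"], "q4": ["ssd", "hdd2"],
}
print(optimal_response_time(replicas, cost))
```

In this example the bucket without an SSD replica forces a response time of 4 time units, during which the SSD can serve the other four buckets. The linear scan over candidate times keeps the sketch short; a binary search over the same candidates would reduce the number of flow checks.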