A File Storage Service on a Cloud Computing Environment for Digital Libraries

Victor Jesús Sosa-Sosa and Emigdio M. Hernandez-Ramirez

INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2012 34

ABSTRACT

The growing need for digital libraries to manage large amounts of data requires storage infrastructure that libraries can deploy quickly and economically. Cloud computing is a new model that allows the provision of information technology (IT) resources on demand, lowering management complexity. This paper introduces a file-storage service that is implemented on a private/hybrid cloud-computing environment and is based on open-source software. The authors evaluated performance and resource consumption using several levels of data availability and fault tolerance. This service can be taken as a reference guide for IT staff wanting to build a modest cloud storage infrastructure.

INTRODUCTION

The information technology (IT) revolution has led to the digitization of every kind of information.1 Digital libraries are appearing as one more step toward easy access to information spread throughout a variety of media. The digital storage of data facilitates information retrieval, allowing a new wave of services and web applications that take advantage of the huge amount of data available.2 The challenges of preserving and sharing data stored on digital media are significant compared to the print world, in which data "stored" on paper can still be read centuries or millennia later. In contrast, only ten years ago, floppy disks were a major storage medium for digital data, but now the vast majority of computers no longer support this type of device. In today's environment, selecting a good data repository is important to ensure that data are preserved and accessible. Likewise, defining the storage requirements for digital libraries has become a major challenge.
Victor Jesús Sosa-Sosa (vjsosa@tamps.cinvestav.mx) is Professor and Researcher at the Information Technology Laboratory at CINVESTAV, Campus Tamaulipas, Mexico. Emigdio M. Hernandez-Ramirez (emhr1983@gmail.com) is Software Developer, SVAM International, Ciudad Victoria, Mexico.

In this context, IT staff, who are responsible for predicting what storage resources will be needed in the medium term, often face the following scenarios:

• Predicted storage requirements turn out to be below real needs, resulting in resource deficits.

• Predicted storage requirements turn out to be above real needs, resulting in expenditure and administration overhead for resources that end up not being used.

In these situations, an efficient strategy for storing documents is not enough by itself.3 The acquisition of storage services that implement an elastic concept (i.e., storage capacity that can be increased or reduced on demand, with relatively low acquisition and management costs) becomes attractive. Cloud computing is a current trend that treats the Internet as a platform providing on-demand computing and software as a service to anyone, anywhere, and at any time. Digital libraries naturally should be connected to cloud computing to obtain mutual benefits and enhance both perspectives.4 In this model, storage resources are provisioned on demand and are paid for according to consumption.

Service deployment in a cloud-computing environment can be implemented in three ways: private, public, or hybrid. In the private option, infrastructure is operated solely for a single organization; most of the time, it requires a strong initial investment because the organization must purchase a large amount of storage resources and pay for the administration costs. The public cloud is the most traditional version of cloud computing.
In this model, infrastructure belongs to an external organization, and costs are a function of the resources used; these costs include administration. Finally, the hybrid model is a mixture of private and public.

A cloud-computing environment is mainly supported by technologies such as virtualization and service-oriented architectures. A cloud environment provides ubiquitous access and facilitates deployment of file-storage services: users can access their files via the Internet from anywhere, without installing a special application. The user only needs a web browser. Data availability, scalability, elastic service, and pay-per-use are attractive characteristics of the cloud service model.

Virtualization plays an important role in cloud computing. With this technology, it is possible to have facilities such as multiple execution environments, sandboxing, server consolidation, use of multiple operating systems, and software migration, among others. Besides virtualization technologies, emerging tools for creating cloud-computing environments also support this type of computing model, providing dynamic instantiation and release of virtual machines and software migration.

Currently, it is possible to find several examples of public cloud storage, such as Amazon S3 (http://aws.amazon.com/en/s3), RackSpace (http://www.rackspace.com/cloud/public/files), and Google Storage (https://developers.google.com/storage), each of which provides high availability, fault tolerance, and services and administration at low cost. For organizations that do not want to use a third-party environment to store their data, private cloud services may offer a better option, although the cost is higher. In this case, a hybrid cloud model could be an affordable solution: organizations or individual users can store sensitive or frequently used information in the private infrastructure and less sensitive data in the public cloud.
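The hybrid placement rule described above can be sketched as a simple routing function. This is a minimal, hypothetical illustration, not code from the paper: the backend names, the `sensitive` flag, and the access-frequency threshold are all assumptions introduced here.

```python
def choose_backend(sensitive: bool, access_count: int, threshold: int = 10) -> str:
    """Route a file to a storage backend under a hybrid cloud policy.

    Sensitive or frequently accessed files stay in the private
    infrastructure; everything else goes to the public cloud.
    The threshold value is an illustrative assumption.
    """
    if sensitive or access_count >= threshold:
        return "private-cloud"
    return "public-cloud"
```

For example, a digitized rare-book scan flagged as sensitive would be routed to the private cloud regardless of how often it is read, while a rarely requested public-domain PDF would land in the cheaper public tier.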
The development of a prototype of a file-storage service implemented on a private and hybrid cloud environment, built mainly with free and open-source software (FOSS), helped us analyze the behavior of different replication techniques. We paid special attention to the cost of the system implementation, system efficiency, resource consumption, and the different levels of data privacy and availability that each type of system can achieve.

INFRASTRUCTURE DESCRIPTION

The aim of this prototyping project was to design and implement a scalable, elastic, distributed storage architecture in a cloud-computing environment using free, well-known, open-source tools. This architecture represents a feasible option that digital libraries can adopt to solve financial and technical challenges when building a cloud-computing environment. The architecture combines private and public clouds, creating a hybrid cloud environment. For this purpose, we evaluated tools such as KVM and XEN, which are useful for creating virtual machines (VMs).5 Open Nebula (http://opennebula.org), Eucalyptus (http://www.eucalyptus.com), and OpenStack (http://www.openstack.org) are good, free options for managing a cloud environment. We selected Open Nebula for this prototype.

Commodity hard drives have a relatively high failure rate, hence our main motivation to evaluate different replication mechanisms providing several levels of data availability and fault tolerance. Figure 1(a) shows the core components of our storage architecture (the private cloud), and figure 1(b) shows a distributed storage web application named Distributed Storage On the Cloud (DISOC), used as a proof of concept. The private cloud also has an interface to access a public cloud, thus creating a hybrid environment.

Figure 1.
Main Components of the Cloud Storage Architecture

The core components and modules of the architecture are the following:

• Virtual Machine (VM). We evaluated different open-source hypervisors, such as KVM and XEN, for the creation of virtual machines.6 Some performance tests were done, and KVM showed slightly higher performance than XEN, so we selected KVM as the main Virtual Machine Manager (VMM) for the proposed architecture. VMMs are also called hypervisors. Each VM runs a Linux operating system that is optimized to work in virtual environments and requires minimal disk space. Each VM also includes an Apache web server, a PHP module, and some basic tools that were used to build the DISOC web application. Every VM is able to transparently access a pool of disks through a special data access module, which we called DAM. More details about DAM follow.

• Virtual Machine Manager Module (VMMM). This module dynamically instantiates and de-instantiates virtual machines depending on the current load on the infrastructure.

• Data Access Module (DAM). All of the virtual disk space required by every VM is obtained through the Data Access Module Interface (DAM-I). DAM-I allows VMs to access disk space by calling DAM, which provides transparent access to the different disks that are part of the storage infrastructure. DAM allocates and retrieves files stored throughout multiple file servers.

• Load Balancer Module (LBM). This distributes the load among the different VMs instantiated on the physical servers that make up the private cloud.

• Load Manager (LM). This monitors the load that can occur in the private cloud.

• Distributed Storage on the Cloud (DISOC). This is a web-based file-storage system, implemented on the proposed architecture, that serves as a proof of concept.
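The DAM's role of allocating files across multiple file servers and retrieving them transparently can be sketched as follows. This is a hypothetical simplification of the component described above: the class name, the in-memory location table, and the use of a simple round-robin cycle (ignoring disk-availability checks) are assumptions made for illustration only.

```python
from itertools import cycle


class DataAccessModule:
    """Sketch of DAM behavior: place each incoming file on the next
    disk in round-robin order and remember where it went so it can
    be retrieved transparently later.
    """

    def __init__(self, disks):
        self._disks = cycle(disks)   # circular iteration over the disk pool
        self._location = {}          # file name -> disk holding the file

    def store(self, filename):
        """Allocate the file to the next disk in the rotation."""
        disk = next(self._disks)
        self._location[filename] = disk
        return disk

    def retrieve(self, filename):
        """Look up which disk holds the file."""
        return self._location[filename]
```

With three disks, successive calls to `store` would place files on disk 1, disk 2, disk 3, then disk 1 again, so no single server accumulates all of the files. A production DAM would also need to skip unavailable disks and persist the location table.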
REPLICATION TECHNIQUES

High availability is one of the most important features of a storage service deployed in the cloud. Replication techniques have been the most useful approach to achieving it. DAM is the component that provides different levels of data availability. It currently includes the following replication policies: no-replication, mirroring, total-replication, and IDA-based replication.

• No-Replication. This policy represents the data availability method with the lowest level of fault tolerance: only the original version of a file is stored in the disk pool. It follows a round-robin allocation policy whereby load assignment is based on a circularly linked list, taking disk availability into account. This policy prevents all files from being allocated to the same server, providing minimal fault tolerance in case of a server failure.

• Mirroring. This replication technique is a simple way to ensure higher availability without high resource consumption. Every time a file is stored on a disk, the DAM creates a copy and places it on a different disk.

• Total-replication. This represents the highest data availability approach: a copy of the file is stored on all of the available file servers. Total-replication also requires the highest consumption of resources.

• IDA-based replication. To provide higher data availability with less impact on resource consumption, an alternative approach based on information-dispersal techniques can be used. The Information Dispersal Algorithm (IDA) is an example of this strategy.7 When a file (of size |F|) is to be stored using the IDA, the file is partitioned into n fragments of size |F|/m, where m