Does Hadoop Offer an Alternative to the Use of RDBMS?

Cloud Computing

Definition

Cloud computing as a concept has gained a lot of attention in both industry and academia (Donno, 2019). The term has become more confusing because it is a “perfect marketing buzzword” (Wayner, 2008) or, as Weiss labeled it, “a buzzword almost designed to be vague” (Weiss, 2007). Cloud computing is undoubtedly a neologism that every person working in the field of information technology must know, and one that is notorious for its various and ambiguous definitions (Rimal & Choi, 2009). A reading of the cloud computing literature shows that companies and IT industry professionals define cloud computing in ways that reflect their own views, understanding and business goals. As Donno (2019) mentions, a clear and neat definition of the cloud computing paradigm is hard to find in the literature, which makes it difficult for researchers new to this area to get a concrete picture of the paradigm. According to Liming (2008), cloud computing is both the delivery of a resource and a usage pattern; in other words, it is a way of obtaining a resource (hardware or software) through a network. The network in this particular case is called the “Cloud”.

The hardware resources in the network appear infinitely extensible and can be used anytime and anywhere. Another proposed definition of cloud computing comes from Foster et al. (Foster, Yong, Raicu and Lu, 2008): “Cloud computing is a large-scale distributed technology that is driven mainly by economies of scale, where a group of services such as storage, platforms, and computing power are dynamically scalable and delivered on demand to external customers through the internet”. The basic concepts expressed by the term are quite simple. The word computing refers to any activity that involves computer processing or storage (Shackelford et al., 2006). A computer manipulates and stores data, which commonly resides on the computer's hard drive or on other hardware such as NAS (network-attached storage), commodity hardware (affordable and readily available hardware), or a mainframe. This storage can hold anything, including structured data, unstructured data, software and databases. In cloud computing terminology, these capabilities are designed as services, and these services are offered in a cloud, from and to any place where an Internet connection is available. In other words, their location does not matter.

The confusion begins, and so many different definitions arise, when definitions try to include: 1) different perspectives (e.g., Infrastructure versus Software Engineering); 2) too many technical details; 3) a specific technology point of view. For example, some definitions include concepts like billing features, type of access, security issues, ownership of data, and even quality features associated with the technology. Since these concepts vary depending on the technology and can evolve, the definition of cloud computing can become broader and fuzzier over time. One of the broadly adopted definitions of cloud computing (Bohn et al., 2011) has been proposed by the National Institute of Standards and Technology (NIST): “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability, and is composed of five essential characteristics, three service models, and four deployment models” (Mell & Grance, 2011).

The NIST definition (Mell & Grance, 2011) implies that the terminology of the essential characteristics, service models, and deployment models must also be precisely defined. Table 1.2 shows how these three concepts are defined in the NIST proposal. The NIST definition shown in Table 1.2 was initially intended to serve as a means for broad comparisons of cloud services and deployment strategies, and to provide a baseline for discussion—from what cloud computing is, to how best to use it. The NIST definition raises a key issue. Cloud computing is complex, especially considering the nature of its components. Defining it requires that every essential characteristic, service model, and deployment model be well defined and that these elements do not change over time.

Relational Database

Limitations

Relational databases offer many benefits, including robustness, simplicity, flexibility, scalability and compatibility. However, a relational database is not necessarily the best choice when compared with a solution that focuses on only a few benefits, such as fault tolerance and scalability. Scalability has become a primary need, especially since the emergence of cloud computing, which dramatically increases the number of users who continuously store, transfer and move data on a SaaS application. A cloud-based application can see its data loads and calculations double in a matter of days, as was the case for YouTube (J.-D. Cryans, A. April and A. Abran, 2008), and this is difficult to manage with a relational database hosted on a single server. Relational databases scale very well as long as the database remains on a single server. When that server reaches its limits in computational power and storage space, distribution across several servers becomes inevitable, and this is where the technology shows its limits. Distributing a relational database over several hundred or several thousand servers is not an easy task. Distribution adds a lot of complexity to the data model because of the relationships between the tables, and it undermines the characteristics that made the database robust and simple to use, dramatically reducing its ability to manage a large amount of data and calculations across a vast pool of interconnected servers. A relational database faces a number of challenges when attempting to scale:

• If the service grows in popularity, too many reads will hit the database and a memory cache will have to be added for the common queries. Reads served from the cache will no longer have the ACID (Atomicity, Consistency, Isolation and Durability) properties;
• If the service continues to gain in popularity and too many writes hit the database, vertical scaling (a server upgrade) will be required, which means that the cost will rise because new hardware will have to be purchased.

At this point, scaling horizontally (adding more servers) is needed, either by attempting to build some sort of partitioning on the largest tables or by looking into some of the commercial solutions that provide multiple-master capabilities. Ultimately, conversion from a single or sharded relational database to a shared, remotely hosted database using a NoSQL schema may be considered. Many examples of this progression are provided in the literature, YouTube being one of them. YouTube first used a relational database (MySQL) with master-slave replication, but eventually arrived at a point where the writes were using all the capacity of the slaves. Like many other organizations facing this situation, they tried partitioning their tables into shards so that the sets of machines hosting the various databases were optimized for their tasks (J.-D. Cryans, A. April and A. Abran, 2008). Ultimately, they had to convert from relational database technology to NoSQL database technology. The next section provides an overview of the NoSQL distributed database model.
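The partitioning step mentioned above can be illustrated with a minimal sketch. It assumes hash-based placement of rows by key, with plain in-memory dictionaries standing in for separate database servers. The names `shard_for`, `ShardedTable` and `moved_keys` are hypothetical and introduced only for illustration; this is not the scheme YouTube used.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a row key to a shard index with a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedTable:
    """One logical table split across several stores.

    Each dict stands in for a separate database server (an assumption
    made to keep the sketch self-contained).
    """

    def __init__(self, num_shards: int) -> None:
        self.num_shards = num_shards
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, row: dict) -> None:
        # A keyed write touches exactly one shard.
        self.shards[shard_for(key, self.num_shards)][key] = row

    def get(self, key: str):
        # A keyed read also touches exactly one shard.
        return self.shards[shard_for(key, self.num_shards)].get(key)

    def scan(self, predicate):
        # A query that is not keyed has to visit every shard.
        return [
            row
            for shard in self.shards
            for row in shard.values()
            if predicate(row)
        ]


def moved_keys(keys, old_shards: int, new_shards: int) -> int:
    """Count keys whose shard changes when the number of shards changes."""
    return sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )


# Illustrative data only.
videos = ShardedTable(num_shards=4)
keys = [f"video-{i}" for i in range(1000)]
for i, key in enumerate(keys):
    videos.put(key, {"id": i, "views": i * 10})

print(videos.get("video-42"))
print("keys relocated when going from 4 to 5 shards:", moved_keys(keys, 4, 5))
```

In this sketch, keyed reads and writes stay cheap, but any query that spans keys must consult all shards, and with simple modulo placement adding one server relocates a large share of the rows. Both effects hint at why partitioning a relational database by hand becomes hard to sustain as the number of servers grows.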

Database Conversion

Literature Review

Only a few published research papers present case studies of conversion from relational databases to NoSQL databases. Most of the proposals found share the same goal: how to efficiently convert an existing relational database to a NoSQL database. Given that RDB technology is rooted in mathematical theory, it should be easy to convert any specific relational database implementation, such as a MySQL database, to any other relational database, such as Oracle or MSSQL (Li, Xu, Zhao and Deng, 2011). NoSQL databases such as HBase, on the other hand, do not use relational algebra and have a completely different schema design (Lars, 2013b). Another author highlights that NoSQL databases are typically designed around a specific use case, that is, around queries and access patterns, rather than through a general relational and normalization process. There is very little information available in the literature on how to conduct schema translation. In this research, a schema translation takes an existing relational data model as input and produces a non-relational (e.g., NoSQL) data model as output. NoSQL databases fall into many categories, such as Wide Column Store/Column Families, Document Store, and Key Value/Tuple Store, among many others. This research focuses on the Wide Column Store/Column Families category, and more specifically on the popular Hadoop database called HBase. It has also been reported in the literature that RDB-trained software engineers have difficulty when the time comes to convert an existing legacy system based on RDB technology to NoSQL database technologies.
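To make the notion of schema translation concrete, the sketch below takes a small relational model (a parent table and a child table linked by a foreign key) as input and produces a wide-column layout in the style of HBase as output: row key, then column family, then column qualifier, then value. It uses plain Python dictionaries rather than an HBase client. The function `to_wide_column`, the table and column names, and the denormalization choice (embedding child rows in the parent row) are all assumptions made for illustration; this is one possible translation, not a rule proposed by this research.

```python
def to_wide_column(parents, children, parent_key, foreign_key, child_key,
                   parent_family="info", child_family="orders"):
    """Translate two related relational tables into one wide-column table.

    The result is shaped as {row_key: {column_family: {qualifier: value}}}.
    All names are hypothetical; values are stored as strings because a
    wide-column store such as HBase keeps uninterpreted bytes.
    """
    table = {}

    # Each parent row becomes one wide row keyed by its primary key.
    for row in parents:
        row_key = str(row[parent_key])
        table[row_key] = {
            parent_family: {
                col: str(val) for col, val in row.items() if col != parent_key
            },
            child_family: {},
        }

    # Each child row is folded into its parent's row instead of being joined
    # at query time. The child's key becomes part of the column qualifier.
    for row in children:
        row_key = str(row[foreign_key])
        if row_key not in table:
            continue  # skip child rows without a matching parent
        for col, val in row.items():
            if col in (foreign_key, child_key):
                continue
            qualifier = f"{row[child_key]}:{col}"
            table[row_key][child_family][qualifier] = str(val)

    return table


# Illustrative relational input: customers (parent) and orders (child).
customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.0},
]

wide = to_wide_column(customers, orders,
                      parent_key="id",
                      foreign_key="customer_id",
                      child_key="order_id")
for row_key, families in wide.items():
    print(row_key, families)
```

The design choice here follows the access pattern "fetch a customer together with its orders": one row lookup replaces a join. A different dominant query would likely call for a different row key and column layout, which is part of why a general-purpose translation is hard to define.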

Table of Contents

INTRODUCTION
CHAPTER 1 OVERVIEW OF THE CLOUD COMPUTING CONCEPT AND DEFINITION
1.1 Cloud Computing Definition
1.2 Cloud Computing Usage Model: Computer Utilities
1.3 Cloud Computing Types
1.4 Cloud Computing and Other Similar Concepts
1.4.1 Cloud Computing and Grid Computing
1.4.2 SaaS and Cloud Computing
1.5 Conclusion
CHAPTER 2 INTRODUCTION TO BIG DATA TECHNOLOGIES THAT ARE PROMOTED BY CLOUD COMPUTING TECHNOLOGIES
2.1 Relational vs. Non-Relational Database
2.1.1 The Relational Database Model
2.1.2 Relational Database Limitations
2.1.3 Distributed Database Model
2.2 Hadoop Project
2.2.1 The Initial Need that Led to the Creation of Hadoop
2.2.2 Does Hadoop Offer an Alternative to the Use of RDBMS?
2.2.3 Hadoop and Volunteer Computing
2.3 HBase Characteristics
2.3.1 Associative Table (a MAP)
2.3.2 Persistent
2.3.3 Distributed
2.3.4 Sparse
2.3.5 Column Oriented
2.3.6 High Availability and High Performance
2.4 Why HBase?
2.5 HBase Architecture
2.6 HBase Accessibility
2.7 Conclusion
CHAPTER 3 LITERATURE REVIEW OF DATABASE CONVERSION BETWEEN RELATIONAL AND NON-RELATIONAL DATABASES
3.1 Introduction
3.2 Overview of Typical RDB to NoSQL Conversion Steps
3.3 Database Conversion State of the Art
3.4 Books, Blogs and Web Discussions
3.5 Conclusion
CHAPTER 4 RESEARCH METHODOLOGY, ACTIVITIES AND EXPECTED RESULTS
4.1 Research Methodology
4.2 Definition Phase
4.3 Planning Phase
4.4 Operation Phase
4.5 Interpretation Phase
4.6 Summary of the Research Methodology
CHAPTER 5 CLARIFYING THE CLOUD COMPUTING DEFINITION FOR OUR RESEARCH
5.1 Introduction
5.2 ISO and NIST Cloud Computing Definitions
5.3 The Car Analogy
5.3.1 Common Factors Observed when Converting to Cloud Computing
5.3.2 NIST Definition Clarifications for this Research
5.4 Conclusion
CHAPTER 6 RDB TO NOSQL CONVERSION PROBLEM
6.1 Background
6.2 HBase Schema Basics and Design Fundamentals
6.2.1 Row Key Design in HBase
6.2.2 Columns and Column Family in HBase
6.2.3 HBase Design Fundamentals
6.3 Experiment Description and Results
6.3.1 Experiment 1
6.3.2 Conversion Design Patterns
6.3.3 Experiment 2
6.3.4 Rules Extraction
6.4 Conclusions and Future Research
CONCLUSION
APPENDIX I THE NIST DEFINITION OF CLOUD COMPUTING
APPENDIX II ISO/IEC JTC 1 N9687 – A STANDARDIZATION INITIATIVE FOR CLOUD COMPUTING
APPENDIX III EXPERIMENT 1 – SURVEY
APPENDIX IV EXPERIMENT 2 – ASSIGNMENT DESCRIPTION
BIBLIOGRAPHY