Investigating storage solutions for large data
A comparison of well performing and scalable data storage
solutions for real time extraction and batch insertion of data
Master of Science Thesis
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Göteborg, Sweden, 2010
The Author grants to Chalmers University of Technology and University of
Gothenburg the non-exclusive right to publish the Work electronically and in a non-commercial purpose make it accessible on the Internet.
The Author warrants that he/she is the author to the Work, and warrants that the Work does not contain text, pictures or other material that violates copyright law.
The Author shall, when transferring the rights of the Work to a third party (for example a publisher or a company), acknowledge the third party about this agreement. If the Author has signed a copyright agreement with a third party regarding the Work, the Author warrants hereby that he/she has obtained any necessary permission from this third party to let Chalmers University of Technology and University of Gothenburg store the Work electronically and make it accessible on the Internet.
Investigating storage solutions for large data
ADAM LITH
This thesis describes a systematic exploration and testing of possible solutions, with the goal of recommending one of these for Burt AB. Inherent properties of the dataset itself are investigated, and a set of very different database management systems is combined with a set of database schemas in order to form a total of eleven potential solutions of interest.
We show that the relational model suits the data well and that the maturity of MySQL gives us confidence when recommending it over the more recently developed systems. Furthermore, indexing using an inverted index is found to yield the best results.
Sammanfattning
(In English:) There is a large number of systems developed to solve the problem of handling very large amounts of data, but which solution is best depends on what type of data one has.
Preface
Adam Lith and Jakob Mattsson both study Computer Science at Chalmers University of Technology in Gothenburg, Sweden. This thesis is part of their master's degree. The work was performed at Burt's office in Gothenburg during spring 2010.
Acknowledgements
The authors would especially like to thank the supervisors—Graham Kemp at Chalmers and John Sjölander at Burt—for their help and support. The authors would also like to thank the entire staff at Burt for their support during the thesis work.
List of Figures

4.1 The CAP theorem and the investigated systems' relations to it
7.1 Insertion in Cassandra, inverted indexing
7.2 Querying in Cassandra, inverted indexing
7.3 Insertion in Cassandra, inverted indexing, batch insertion
7.4 Querying in Cassandra, inverted indexing, batch insertion
7.5 Insertion in HBase, inverted indexing
7.6 Querying in HBase, inverted indexing
7.7 Insertion in MongoDB, inverted indexing, string hashing
7.8 Querying in MongoDB, inverted indexing, string hashing
7.9 Insertion in MongoDB, inverted indexing
7.10 Querying in MongoDB, inverted indexing
7.11 Insertion in MySQL, duplication, string hashing
7.12 Querying in MySQL, duplication, string hashing
7.13 Insertion in MySQL, inverted indexing, string hashing
7.14 Querying in MySQL, inverted indexing, string hashing
7.15 Insertion in MySQL, inverted indexing
7.16 Querying in MySQL, inverted indexing
7.17 Insertion in MySQL, multiple indexes, string hashing
7.18 Querying in MySQL, multiple indexes, string hashing
7.19 Insertion in Voldemort DB, inverted indexing, string hashing
7.20 Querying in Voldemort DB, inverted indexing, string hashing
7.21 Insertion in MongoDB, inverted indexing, string hashing
7.22 Querying in MongoDB, inverted indexing, string hashing
7.23 Insertion in MongoDB, inverted indexing
7.24 Querying in MongoDB, inverted indexing
7.25 Insertion in MySQL, duplication, string hashing
7.26 Querying in MySQL, duplication, string hashing
7.27 Insertion in MySQL, inverted indexing, string hashing
7.28 Querying in MySQL, inverted indexing, string hashing
7.29 Insertion in MySQL, inverted indexing
7.30 Querying in MySQL, inverted indexing
7.31 Insertion in MySQL, multiple indexes, string hashing
7.32 Querying in MySQL, multiple indexes, string hashing
7.33 Insertion in MySQL, duplication, string hashing, multiple tables
7.34 Querying in MySQL, duplication, string hashing, multiple tables
7.35 Insertion in MySQL, duplication, string hashing, multiple tables
7.36 Querying in MySQL, duplication, string hashing, multiple tables
7.37 Insertion in MySQL, inverted indexing
7.38 Querying in MySQL, inverted indexing

List of Tables

8.1 Schema alteration possibilities for the tested systems

Chapter 1
Introduction
1.1 Burt and Rich

The main stakeholder for this master's thesis is a technology startup named Burt.
As described on their web page, "Burt creates software to help advertisers and agencies improve the efficiency and effect of their online campaigns"[1]. One of their products is a metrics tool called Rich, which "gives agencies faster implementation, a more focused feature set and above all different - and better - metrics. The result being a better understanding of the online advertising environment, leading to more effective ads and increased ROI for all parts of the ecosystem"[2]. In simple terms, Rich gathers advertisement traffic and produces reports that are easy to understand.

From a technical point of view, Rich is non-trivial. The amount of data gathered is immense and the performance requirements are extremely high, with billions of visits processed every day and live reports available instantly over the web.
The parsed data are then transformed into the final presentation data using Hadoop, "a Java software framework that supports data-intensive distributed applications [...] inspired by Google's MapReduce"[3]. This readily calculated data is stored in a relational database in order to be accessible to an end user via a web application.
There are two reasons for doing all of this processing, rather than simply storing the logged user events directly in the database and performing calculations on the fly when requested from the web application:
• Storing individual user data is not only superfluous, as this kind of data is never presented; it is also ethically unjustifiable.
[1] http://www.byburt.com, retrieved on May 20, 2010
[2] http://richmetrics.com, retrieved on May 20, 2010
[3] http://wiki.apache.org/hadoop/ProjectDescription, retrieved on May 20, 2010
• According to the CTO at Burt, a single large global advertisement campaign may be viewed on the order of 10⁹ times during a month. Calculating statistical properties of this data on the fly is simply not viable.
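To put the second point in perspective, a rough back-of-envelope sketch in Python. The per-event size and disk throughput figures are assumptions for illustration, not measurements from Burt's system:

```python
views = 10**9                 # monthly views of one large campaign (from the text)
bytes_per_view = 100          # assumed average size of one logged event
scan_rate = 200 * 10**6       # assumed sequential read rate, bytes per second

total_bytes = views * bytes_per_view     # roughly 100 GB of raw events
scan_seconds = total_bytes / scan_rate   # one full scan for a single query
print(f"{total_bytes / 10**9:.0f} GB, {scan_seconds / 60:.1f} min per query")
```

Even under these generous assumptions, a single on-the-fly query over the raw events would take minutes, far from the sub-second responses a live report requires.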
1.2 Understanding the existing system

Today Rich arranges the gathered session data into two different groups: categories and metrics. For one tuple of data there are around 35 to 45 measurements. These measurements are divided into 20 to 30 "categories" and around 15 other measurements we will call "metrics". The differentiation between these two groups is just a simplification for Burt to handle the data, in the sense that questions on the data are only asked about the categories, whereas the metrics are simply interesting as output. What kind of measured value is considered a category or a metric is decided by a domain expert at Burt, since there is no inherent property of the data that puts it in either group. An example category is website, and an example metric is the number of times an advertisement has been clicked.
What is then performed is a MapReduce job (described in section 2.4.1), where the mapping is done on the tuples containing date, campaign id and a category, for each possible category. The reduction step aggregates the metrics in a non-trivial way, i.e. it is not simply an additive process, but the details are not of interest for this project. The result of the MapReduce is stored in accordance with a very minimalistic MySQL schema, with a column for date, campaign id, category name, value of the category, and one column for each of the metrics.
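The shape of this job can be sketched in plain Python. This is a hypothetical reconstruction for illustration only: the field names are assumed, and the reducer here simply sums the metrics, whereas Burt's actual aggregation is, as noted above, not purely additive:

```python
from collections import defaultdict

# One session record: category values plus metric values (names assumed).
sessions = [
    {"date": "2010-05-20", "campaign_id": 1,
     "categories": {"website": "news.example", "browser": "Firefox"},
     "metrics": {"clicks": 2, "exposures": 10}},
    {"date": "2010-05-20", "campaign_id": 1,
     "categories": {"website": "news.example", "browser": "Opera"},
     "metrics": {"clicks": 1, "exposures": 5}},
]

def map_phase(session):
    # Emit one key per category: (date, campaign id, category name, value).
    for name, value in session["categories"].items():
        yield ((session["date"], session["campaign_id"], name, value),
               session["metrics"])

def reduce_phase(pairs):
    # Aggregate the metrics per key (plain addition here; the real
    # reduction at Burt is more involved).
    result = defaultdict(lambda: defaultdict(int))
    for key, metrics in pairs:
        for metric, v in metrics.items():
            result[key][metric] += v
    return result

rows = reduce_phase(p for s in sessions for p in map_phase(s))
# Each key/value pair corresponds to one row in the MySQL schema:
# (date, campaign_id, category name, category value, metric columns...)
```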
1.3 Initial problem statement

Even with a system like the one described above, the management of data of this magnitude is a challenge. Due to this, Burt has currently limited the final presentation of the data to only include metrics that are highly relevant and relatively easy to calculate, as well as limiting themselves to only being able to specify, or fixate, at most one of the categories at a time. The problem is that agencies using Rich cannot request information that is out of the ordinary. To make things worse, "the ordinary" is quite narrow. Hence our problem was to find a way of redesigning the highlighted sections of figure 1.2 to allow for more complex questions to be answered by the data.
Figure 1.2: The parts of the Rich system that we are to redesign
The first delimitation we have to make is what kind of more complex questions we want to be able to answer. This was relatively straightforward to determine, since the request was to enable the querying of data where more than one category was specified and fixed. In other words, we want to ask a question where we specify a set of categories and fix their values. This kind of question we will henceforth call a closed question. The second kind of query is one where a set of categories is specified and all but one of them are fixed. This kind of question we will henceforth call an open question. The number of categories in a question is by Burt called the drill-down depth.
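The two question types can be illustrated with ordinary SQL (here via Python's sqlite3 for the sake of a runnable sketch). The wide table below, with one column per category, is hypothetical and chosen purely to make the distinction concrete; it is not Burt's schema:

```python
import sqlite3

# Hypothetical wide table: one column per category plus a metric column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stats (date TEXT, campaign_id INT, "
            "website TEXT, browser TEXT, clicks INT)")
con.executemany("INSERT INTO stats VALUES (?,?,?,?,?)", [
    ("2010-05-20", 1, "news.example", "Firefox", 2),
    ("2010-05-20", 1, "news.example", "Opera", 1),
    ("2010-05-20", 1, "blog.example", "Firefox", 4),
])

# Closed question (drill-down depth 2): every specified category is fixed.
closed = con.execute(
    "SELECT SUM(clicks) FROM stats "
    "WHERE website = 'news.example' AND browser = 'Firefox'").fetchone()[0]

# Open question: website is fixed; browser is specified but left open,
# so we get one aggregate per browser value.
open_q = con.execute(
    "SELECT browser, SUM(clicks) FROM stats "
    "WHERE website = 'news.example' GROUP BY browser").fetchall()
```

Note that the existing schema described in section 1.2, with a single category per row, cannot answer the closed question above, since no row relates the website and browser categories to each other. That is precisely the limitation to be removed.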
Whatever changes are suggested, and whatever system is used, the solution is still subject to the following criteria:
• Any query of the type described above should be answerable in real time.
• Insertion of data will be done once every 24 hours, and will be in the order of 10⁹ rows.
• The system must be able to scale well horizontally (often also referred to as scaling out), meaning it should be possible to add more nodes, i.e. more computers, to the system.
Chapter 2
Concepts
2.1 Storage solutions

The term database is used to describe a variety of systems employed to organize storage of data. There are different types of systems, as well as models, that do this in different ways, and here we try to examine the fundamental differences between the more prominent ones.
First, we will explain the general concepts of row- versus column-oriented storage, which can be applied in a variety of scenarios. Then we will cover the differences between the traditional relational database systems that have been around for several decades and some systems that have emerged from the recent NoSQL movement, reflecting renewed interest in non-relational storage models.
2.1.1 Row-oriented

A row-oriented system concerns tuples and attributes. A tuple represents an object and an attribute represents one piece of information about an object. Usually, this is then thought of as (or even stored as) a table where the tuples correspond to rows and the attributes to columns. This is the primary way of thinking about data in the relational model and will be outlined further in section 2.2.
2.1.2 Column-oriented

A column-oriented database is similar to a row-oriented one, but as the name reveals it focuses on columns (attributes) rather than rows (tuples). A row-oriented database can easily operate on all the tuples in a system, while the column-oriented one is tuned to operate efficiently on a set of attributes for all tuples. To make this clearer, consider this table of 1980s heavy metal songs:

Song        Artist        Album        Year
Aces High   Iron Maiden   Powerslave   1984
An example tuple is (Aces High, Iron Maiden, Powerslave, 1984) and an example attribute is Album.
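The distinction between the two layouts can be sketched in Python. The second row below is a hypothetical placeholder, added only so the column lists contain more than one value:

```python
# Row-oriented: whole records stored together, one tuple after another.
rows = [
    ("Aces High", "Iron Maiden", "Powerslave", 1984),  # the tuple from the text
    ("song_b", "artist_b", "album_b", 1986),           # hypothetical filler row
]

# Column-oriented: each attribute stored contiguously instead.
columns = {
    "Song":   [r[0] for r in rows],
    "Artist": [r[1] for r in rows],
    "Album":  [r[2] for r in rows],
    "Year":   [r[3] for r in rows],
}

# Reading one attribute for every tuple touches a single contiguous
# list in the column store, but every full record in the row store.
albums = columns["Album"]
```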
In a row-oriented database this would be stored somewhat like this: