WWW.DISSERTATION.XLIBX.INFO
FREE ELECTRONIC LIBRARY - Dissertations, online materials
 
<< HOME
CONTACTS



Pages:   || 2 | 3 | 4 | 5 |   ...   | 6 |

«Investigating storage solutions for large data A comparison of well performing and scalable data storage solutions for real time extraction and batch ...»

-- [ Page 1 ] --

Investigating storage solutions for large data

A comparison of well performing and scalable data storage

solutions for real time extraction and batch insertion of data

Master of Science Thesis

ADAM LITH

JAKOB MATTSSON

Department of Computer Science and Engineering

CHALMERS UNIVERSITY OF TECHNOLOGY

G¨teborg, Sweden, 2010

o

The Author grants to Chalmers University of Technology and University of

Gothenburg the non-exclusive right to publish the Work electronically and in a non-commercial purpose make it accessible on the Internet.

The Author warrants that he/she is the author to the Work, and warrants that the Work does not contain text, pictures or other material that violates copyright law.

The Author shall, when transferring the rights of the Work to a third party (for example a publisher or a company), acknowledge the third party about this agreement. If the Author has signed a copyright agreement with a third party regarding the Work, the Author warrants hereby that he/she has obtained any necessary permission from this third party to let Chalmers University of Technology and University of Gothenburg store the Work electronically and make it accessible on the Internet.

Investigating storage solutions for large data ADAM LITH

JAKOB MATTSSON

c ADAM LITH, June 2010 c JAKOB MATTSSON, June 2010 Examiner: GRAHAM KEMP Department of Computer Science and Engineering Chalmers University of Technology SE-412 96 G¨teborg o Sweden Telephone + 46 (0)31-772 1000 Department of Computer Science and Engineering G¨teborg, Sweden, 2010 o Abstract There are several systems developed today to handle the problem of storing large amounts of data. But for each type of data and set of operations different systems differ in suitability. Burt AB stores a large dataset, enlarged in batches in a regular and controlled way, but never updated. Query times are critical and must have real-time performance.

This thesis describes a systematic exploration and testing of possible solutions, with the goal of recommending one of these for Burt AB. Inherent properties of the dataset itself are investigated and a set of very different database management systems are combined with a set of database schemas in order to form a total of eleven potential solutions of interest.

We show that the relational model suits the data well and that the maturity of MySQL gives us confidence when recommending it compared to the more recently developed systems. Furthermore, indexing using an inverted index is found to yield the best results.

Sammanfattning Det finns ett stort antal system som utvecklats f¨r att l¨sa problemet med att o o hantera mycket data, men vilken l¨sning som ¨r b¨st beror p˚ vilken typ av data o aa a man har. B

–  –  –

Preface Adam Lith and Jakob Mattsson both study Computer Science at Chalmers University of Technology in Gothenburg, Sweden. This thesis is part of their master’s degree. The work was performed at Burt’s office in Gothenburg during spring 2010.

Acknowledgements The authors would like to especially thank the supervisors—Graham Kemp at Chalmers and John Sj¨lander at Burt—for their help and support.

o The authors would also like to thank the entire staff at Burt for their support during the thesis work.

Contents

–  –  –

4.1 The CAP-theorem and the investigated systems relations to it 25

7.1 Insertion in Cassandra, inverted indexing........... 34

7.2 Querying in Cassandra, inverted indexing........... 34

7.3 Insertion in Cassandra, inverted indexing, batch insertion.. 35

7.4 Querying in Cassandra, inverted indexing, batch insertion.. 35

7.5 Insertion in HBase, inverted indexing.............. 36

7.6 Querying in HBase, inverted indexing............. 36

7.7 Insertion in MongoDB, inverted indexing, string hashing... 37

7.8 Querying in MongoDB, inverted indexing, string hashing... 37

7.9 Insertion in MongoDB, inverted indexing........... 38

7.10 Querying in MongoDB, inverted indexing........... 38

7.11 Insertion in MySQL, duplication, string hashing........ 39

7.12 Querying in MySQL, duplication, string hashing....... 39

7.13 Insertion in MySQL, inverted indexing, string hashing.... 40

7.14 Querying in MySQL, inverted indexing, string hashing.... 40

7.15 Insertion in MySQL, inverted indexing............. 41

7.16 Querying in MySQL, inverted indexing............. 41

7.17 Insertion in MySQL, multiple indexes, string hashing..... 42

7.18 Querying in MySQL, multiple indexes, string hashing.... 42

7.19 Insertion in Voldemort DB, inverted indexing, string hashing 43

7.20 Querying in Voldemort DB, inverted indexing, string hashing 43

7.21 Insertion in MongoDB, inverted indexing, string hashing... 45

7.22 Querying in MongoDB, inverted indexing, string hashing... 45

7.23 Insertion in MongoDB, inverted indexing........... 46

7.24 Querying in MongoDB, inverted indexing........... 46





7.25 Insertion in MySQL, duplication, string hashing........ 47

7.26 Querying in MySQL, duplication, string hashing....... 47

7.27 Insertion in MySQL, inverted indexing, string hashing.... 48

7.28 Querying in MySQL, inverted indexing, string hashing.... 48

7.29 Insertion in MySQL, inverted indexing............. 49

7.30 Querying in MySQL, inverted indexing............. 49

7.31 Insertion in MySQL, multiple indexes, string hashing..... 50

7.32 Querying in MySQL, multiple indexes, string hashing.... 50

7.33 Insertion in MySQL, duplication, string hashing, multiple tables 53

7.34 Querying in MySQL, duplication, string hashing, multiple tables 53

7.35 Insertion in MySQL, duplication, string hashing, multiple tables 55

7.36 Querying in MySQL, duplication, string hashing, multiple tables 55

7.37 Insertion in MySQL, inverted indexing............. 56

7.38 Querying in MySQL, inverted indexing............. 56 List of Tables

–  –  –

8.1 Schema alteration possibilities for the tested systems..... 59 Chapter 1 Introduction

1.1 Burt and Rich The main stakeholder for this masters thesis is a technology startup named Burt.

As described on their web page, ”Burt creates software to help advertisers and agencies improve the efficiency and effect of their online campaigns”1. One of their products is a metrics tool called Rich, which ”gives agencies faster implementation, a more focused feature set and above all different - and better

- metrics. The result being a better understanding of the online advertising environment, leading to more effective ads and increased ROI for all parts of the ecosystem”2. In simple terms, Rich gathers advertisement traffic and produces reports that are easy to understand. From a technical point of view, Rich is non-trivial. The amount of data gathered is immense and the performance requirements are extremely high, with billions of visits processed every day and live reports available instantly over the web.

The flow of the system is visualized in figure 1.1, where the first step in this process is the execution of a script inside an Internet advertisement using Rich. This is usually a Flash ActionScript or JavaScript triggered to respond to various user events, such as clicks and mouse movements, as well as timed events like how long the user saw the advertisement. The script then sends the associated data to a central log server. The parser reads these logs at scheduled times and interprets the logged events into a session aggregated form.

The parsed data are then transformed into the final presentation data using Hadoop, ”a Java software framework that supports data-intensive distributed applications [...] inspired by Google’s MapReduce”3. This readily calculated data is stored in a relational database in order to be accessible to an end user via a web application.

There are two reasons for doing all of this processing, rather than simply storing the logged user events directly in the database and perform calculations

on the fly when requested from the web application:

• Storing individual user data is not only superfluous as this kind of data is never presented, it is also ethically unjustifiable.

1 http://www.byburt.com, Retrieved on May 20, 2010 2 http://richmetrics.com,Retrieved on May 20, 2010 3 http://wiki.apache.org/hadoop/ProjectDescription, May 20, 2010

–  –  –

• According to the CTO at Burt, a single large global advertisementcampaign may be viewed in the order of 109 times during a month. Calculating statistical properties of this data on the fly is simply not viable.

1.2 Understanding the existing system Today Rich arranges the gathered session data into two different piles; categories and metrics. For one tuple of data there are around 35 to 45 measurements.

These measurements are divided as 20 to 30 ”categories”, and around 15 other measurements we will call ”metrics”. The differentiation of these two groups are just a simplification for Burt to handle the data, in the sense that questions upon the data are only made on the categories, whereas the metrics are simply interesting as output. What kind of measured value is considered a category or a metric is decided by a domain expert at Burt, since there is no inherent property of the data that puts it in either group. An example category is website, and an example metric is the number of times an advertisement has been clicked.

What is then performed is a MapReduce job (described in section 2.4.1), where the mapping is done on the tuples containing date, campaign id and a category, for each possible category. The reduction step aggregates the metrics in a non-trivial way, i.e. it is not simply an additive process, but the details are not of interest for the project. The result of the MapReduce is stored in accordance with a very minimalistic MySQL schema, with a column for date, campaign id, category name, value of the category and one column for each of the metrics.

1.3 Initial problem statement Even with a system like the one described above, the management of data of this magnitude is a challenge. Due to this, Burt has currently limited the final presentation of the data to only include metrics that are highly relevant and relatively easy to calculate, as well as limiting themselves to only being able to specify, or fixate, at most one of the categories at a time. The problem is that agencies using Rich can not request information that is out of the ordinary. To make things worse, ”the ordinary” is quite narrow. Hence our problem was to find a way of redesigning the highlighted sections of figure 1.2 to allow for more complex questions to be answered by the data.

Figure 1.2: The parts of the Rich system that we are to redesign

The first delimitation we have to make is what kind of more complex questions do we want to be able to answer. This was relatively straightforward to determine since the request was to enable the querying of data where more than one category was specified and fixed. In other words, we want to ask a question where we specify a set of categories, and fix their values. This kind of question we will henceforth call a closed question. The second kind of query is where we ask for data where a set of categories are specified, and all but one of them are fixed. This kind of question we will henceforth call an open question. The number of categories in a question is by Burt called the drill-down depth.

Whatever changes suggested, or whatever system to use, they are still subject

to the following criteria:

• Any query of the above described type should be able to be answered in real-time.

• Insertion of data will be done once every 24 hours, and will be in the order of 109 rows.

• The system must be able to scale well horizontally (often also refered to as scale out, which means it should be possible to add more nodes to the system, i.e. more computers).

Chapter 2 Concepts

2.1 Storage solutions The term database is used to describe a variety of systems employed to organize storage of data. There are different types of systems as well as models that do this in different ways and here we try to examine the fundamental differences of the more prominent ones.

First, we will explain the general concepts of row- vs column-oriented storage, that can be applied in a variety of scenarios. Then we will cover the differences between the traditional relational database systems that have been around for several decades and some systems that have emerged from the recent NoSQL movement, reflecting renewed interest in non-relational storage models.

2.1.1 Row-oriented A row oriented system concerns tuples and attributes. A tuple represents an object and an attribute represents one piece of information about an object.

Usually, this is then thought of (or even stored as) a table where the tuples corresponds to rows and the attributes to columns.[2] This is the primary way of thinking about data in the relational model and will be outlined further in section 2.2.

2.1.2 Column-oriented A column-oriented database is similar to a row-oriented, but as the name reveals it focuses on columns (attributes) rather than rows (tuples). A row-oriented database can easily operate on all the tuples in a system, while the columnoriented one is tuned to operate efficiently on a set of attributes for all tuples [4]. To make this clearer, consider this table of 1980’s heavy metal songs:

–  –  –

An example tuple is (Aces High, Iron Maiden, Powerslave, 1984) and an example attribute is Album.

In a row-oriented database this would be stored somewhat like this:



Pages:   || 2 | 3 | 4 | 5 |   ...   | 6 |


Similar works:

«eNews. Welcome to February's edition of eNews. You will see that we have changed Contents the content slightly to reflect the launch of Employment Insight our new website exclusively developed for clients. Employment Insight tracks and summarises major developments in employment legislation. From now on, News bites 1 Features 6 eNews will highlight a new legislative development and link you directly to Guidance from the the correct page of Employment Insight. Other developments in the law,...»

«MacBook User’s Guide Includes setup, expansion, and troubleshooting information for your MacBook computer K Apple Computer, Inc The Bluetooth® word mark and logos are owned by the © 2006 Apple Computer, Inc. All rights reserved. Bluetooth SIG, Inc. and any use of such marks by Apple Under the copyright laws, this manual may not be Computer, Inc. is under license. copied, in whole or in part, without the written consent Other company and product names mentioned herein of Apple. are...»

«edition one A BARRISTER’S GUIDE TO YOUR PERSONAL INJURY CLAIM A Legal Lifeline Julian Benson edition one A BARRISTER’S GUIDE TO YOUR PERSONAL INJURY CLAIM A Legal Lifeline Julian Benson A Barrister’s Guide to Your Injury Claim – First Edition, 1.0, August 2012 Published by Julian Benson Publishing Copyright © 2012 Julian Benson ISBN: 978-0-9574064-0-7 The guide is intended to provide accurate information. It cannot and does not pretend to give advice in any specific situation. It is...»

«From the Oath of the Athenian City State Inscribed in the lobby of Maxwell Hall We will ever strive for the ideals and sacred things of the city, both alone and with many; we will unceasingly seek to quicken the sense of public duty; we will revere and obey the city’s laws; we will transmit this city not only less, but greater, better and more beautiful than it was transmitted to us. Samantha Power Maxwell School of Syracuse University A More Beautiful City May 10, 2013 Good morning Maxwell...»

«CITY COUNCIL, CITY OF ROCKFORD JOURNAL OF PROCEEDINGS MARCH 30, 2009 COUNCIL CONVENED AT 6:17 P.M.1. The invocation was given by Ron Montanye, St. Sebastian Orthodox Catholic Church/Police Chaplain and the Pledge of Allegiance was led by Council Page Mercedes Martinez.2. Roll Call: Mayor Lawrence J. Morrissey Aldermen: Sosnowski, Curran, Mark, Wasco, Bell, Jacobson, Thompson-Kelly, Johnson, Timm, Beach, Holt, Beck, McNeely, Conness -14Absent: -03. Alderman Mark moved to accept the Journal of...»

«    [TEMPORARY DESIGN] Principles for guaranteeing diversity and pluralism in broadcasting and audiovisual communication services1 1. Every person has the right to investigate, seek, receive and disseminate information, opinions and ideas, without prior censorship, through radio, TV and other audiovisual communications services, besides any other procedure of their choice, in the framework of respect to the rule of law and human rights. This right includes the creation of mass media...»

«Kan barn tala? 1 2 Jeanette Sundhall Kan barn tala? En genusvetenskaplig undersökning av ålder i familjerättsliga utredningstexter 3 Avhandling för filosofie doktorsexamen i genusvetenskap, Göteborgs universitet 20120616 © Jeanette Sundhall 2012 Omslag och inlaga: Anna Ahlberg Författarfoto: Maria Kanflo Teglund Typsnitt: Adobe Garamond Pro Tryck: Ineko AB, 2012 ISBN: 978 91 85974 17 7 E-publikation: http:/hdl.handle.net/2077/29026 Distribution: Institutionen för Kulturvetenskaper,...»

«Sermon #2974 Metropolitan Tabernacle Pulpit 1 A WAFER OF HONEY NO. 2974 A SERMON PUBLISHED ON THURSDAY, FEBRUARY 8, 1906. DELIVERED BY C. H. SPURGEON, AT THE METROPOLITAN TABERNACLE, NEWINGTON, IN THE YEAR 1863. “My Grace is sufficient for you.” 2 Corinthians 12:9. LET no Christian imagine that he will ever have immunity from trouble while he continues in the body. Should you be favored with visions and Revelations of the Lord, caught up to the third Heaven, admitted into Paradise and...»

«Rule 6.01 to 6.02 6.00—The Batter 6.01 (a) Each player of the offensive team shall bat in the order that his name appears in his team’s batting order. (b) The first batter in each inning after the first inning shall be the player whose name follows that of the last player who legally completed his time at bat in the preceding inning. 6.02 (a) The batter shall take his position in the batter’s box promptly when it is his time at bat. (b) The batter shall not leave his position in the...»

«GUNSHANNON, ET AL. V. ALBERT/CAROL MUELLER T-A MCDONALDS, ET AL. EXPERT REPORT OF STEPHEN G. HARVEY I have been engaged as an expert witness by counsel (Caroselli Beachler McTiernan & Conboy, L.L.C.) for Natalie Gunshannon and other similarly situated persons (collectively “Plaintiffs”) in their suit against McDonald franchises owned or controlled by the Albert and Carol Mueller Limited Partnership and/or Albert and Carol Mueller (“McDonalds” or “Defendants”). At issue here is...»

«Abstracts | Code List KIF02151 Harnessing Assessment and Feedback to Assure Quality Outcomes for Graduate Capability Development: A Legal Education Case Study. Sally Kift Assistant Dean, Teaching and Learning QUT Faculty of Law, Queensland University of Technology, AUSTRALIA. Telephone: +61-7-3864 1098 Facsimile: +61-7-3864 2121 e-mail: s.kift@qut.edu.au Abstract In recent times, employers, graduates, government and professional bodies have all called upon tertiary educators to embrace a notion...»

«© 2013 Legal Center for Nonprofits, Inc. Information provided herein is not legal advice and is intended for informational purposes only. Consult an attorney before acting on the information contained herein. Attendance at this program does not result in an attorney-client relationship. Presentation references Massachusetts law; consult an attorney licensed in your own state for assistance with your specific situation. This program will not turn you into a Parliamentarian! This program...»





 
<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.