GoSpark: An In-Memory Distributed Computation Platform in Go

Kuan-Ting Yu, Jiasi Shen, Bolei Zhou
CSAIL, MIT



1 Introduction

Computation tasks in computer vision research are data-intensive. Such tasks usually involve performing repeated single operations, such as image resizing and feature extraction, on thousands of images. They also require training iterative algorithms, such as K-means clustering or logistic regression, on thousands of image feature vectors. To support such tasks, the CSAIL Computer Vision Group has been using 36 high-performance in-house computation servers1. However, there is currently no efficient way to distribute the processing of large data sets across these servers. Each user can only connect to a single server over Secure Shell (SSH) and perform computation tasks (in Matlab, Python, or C++) with data and intermediate results stored in the NFS space maintained by CSAIL TIG. The use of a centralized NFS space as data storage has made I/O the bottleneck for large-scale data processing tasks. Thus, a distributed computation platform with a distributed file system would greatly improve the performance of computations in computer vision research.

With the Hadoop Distributed File System (HDFS) installed and configured on selected vision servers, we build a distributed computation platform, GoSpark2, to process large data sets across these servers. GoSpark is an implementation of Spark [3], a fast in-memory cluster computing framework, in the Go programming language. Our version of the Resilient Distributed Dataset (RDD) in GoSpark supports a variety of transformations and actions. GoSpark is tolerant of worker failures and network disconnections.

In the following sections, we will first explain the design and implementation of GoSpark, and then demonstrate its functionality and fault tolerance with two image processing examples.

2 Design and Implementation

2.1 RDD Data Abstraction

RDD is the data abstraction in GoSpark. It defines the interfaces of transformations and actions, as listed in [3]. We implemented various kinds of operationType, including Map, FlatMap, Reduce, ReduceByKey, Filter, and Collect.

Code 1: RDD data structure

    type RDD struct {
        splits        []*Split
        dependType    string // Narrow or Wide
        splitType     string // hash partition or range partition
        operationType string // Map, ReduceByKey...
        fnName        string // function name for map, reduce...
        prevRDD1      *RDD
        prevRDD2      *RDD
        ...
    }

The RDD data structure maintains prevRDD1 and prevRDD2 to track the RDD(s) that the current RDD is derived from. prevRDD2 is used for join or union operations. fnName keeps the name3 of the input function literal. splits keeps the splitIDs of the RDD; each splitID is a unique 64-bit integer.
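To make the lineage bookkeeping concrete, here is a minimal, self-contained sketch of how the prevRDD pointers can be walked to recover every ancestor of an RDD. The types are simplified and the lineage helper is invented for this illustration; it is not part of the GoSpark API.

```go
package main

import "fmt"

// Simplified RDD mirroring the fields above; Split is omitted for brevity.
type RDD struct {
	operationType string
	prevRDD1      *RDD
	prevRDD2      *RDD // set only for join or union
}

// lineage walks the prevRDD pointers and returns the ancestor operation
// names, oldest first (a depth-first post-order traversal).
func lineage(r *RDD) []string {
	if r == nil {
		return nil
	}
	out := append(lineage(r.prevRDD1), lineage(r.prevRDD2)...)
	return append(out, r.operationType)
}

func main() {
	src := &RDD{operationType: "ReadHDFSSplit"}
	mapped := &RDD{operationType: "Map", prevRDD1: src}
	reduced := &RDD{operationType: "ReduceByKey", prevRDD1: mapped}
	fmt.Println(lineage(reduced)) // [ReadHDFSSplit Map ReduceByKey]
}
```

The scheduler traverses exactly this kind of chain when it rebuilds lost splits or plans a stage.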


Figure 1: The system design of GoSpark (Scheduler, Master, and Workers) and HDFS (name node, data nodes, and data splits). Due to the high performance of the vision servers, each vision server is both a data node in Hadoop HDFS and a worker in GoSpark at the same time.

2.2 System Design

Figure 1 shows the overall design of GoSpark. There are three main roles:

Scheduler builds the lineage of RDDs and tracks the lineage across a wide range of transformations.

Master maintains the status of workers and communicates with them through TCP connections.

Workers read input data from either HDFS or other workers, perform specified computation, and store results in memory. Their jobs are decided by the scheduler module and distributed by the master.

2.2.1 Scheduler

The scheduler is activated when an RDD action is triggered. When scheduling, it examines the full lineage of RDDs related to the target and identifies the execution order.


1 Server viewboard is at http://foxtrot.csail.mit.edu/cgi/machine_stats.cgi.

2 Our code is available at https://github.com/metalbubble/824project.

3 Please refer to Section 2.2.3 for more detail.

The scheduling process has the following steps:

1) Extract the stages, which are the RDDs before a wide dependency, together with the target RDD.
2) Build a directed acyclic graph (DAG) among the stages.
3) Topologically sort the DAG to obtain an execution order of the stages, and execute the stages one by one.
4) Execute each stage by computing each split in the stage in parallel. This includes recursively tracing back through the previous RDDs and executing the corresponding splits.
5) When all the previous splits that a split depends on are done, runThisSplit() is called to send the job specification, including opType, inputSplits, addresses, and outputSplits, through master.AssignJob() to the workers.
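Steps 2 and 3 above can be sketched as follows. This is a minimal illustration using Kahn's algorithm; the integer stage IDs and the topoSort helper are invented for the example, not taken from the GoSpark code.

```go
package main

import "fmt"

// topoSort orders stage IDs so that every stage runs after the stages it
// depends on (Kahn's algorithm). deps[s] lists the stages s waits for.
func topoSort(deps map[int][]int) []int {
	indeg := map[int]int{}
	children := map[int][]int{}
	for s, ds := range deps {
		if _, ok := indeg[s]; !ok {
			indeg[s] = 0
		}
		for _, d := range ds {
			indeg[s]++
			children[d] = append(children[d], s)
			if _, ok := indeg[d]; !ok {
				indeg[d] = 0
			}
		}
	}
	var ready, order []int
	for s, n := range indeg {
		if n == 0 {
			ready = append(ready, s)
		}
	}
	for len(ready) > 0 {
		s := ready[0]
		ready = ready[1:]
		order = append(order, s)
		for _, c := range children[s] {
			if indeg[c]--; indeg[c] == 0 {
				ready = append(ready, c)
			}
		}
	}
	return order
}

func main() {
	// Stage 3 (the target) depends on stages 1 and 2; both read stage 0.
	deps := map[int][]int{3: {1, 2}, 1: {0}, 2: {0}}
	fmt.Println(topoSort(deps)) // e.g. [0 1 2 3] (1 and 2 may swap)
}
```

Independent stages (here 1 and 2) may come out in either order, which is exactly what allows them to run in parallel.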

Data Locality The scheduler will consider the locality of data splits when distributing jobs to workers. Specifically, when the scheduler commands the master to assign a job to workers, the scheduler will provide a list of preferred workers based on data locality. The master will choose a worker based on this list. For example, at the beginning of a lineage, the scheduler assigns the workers that are close to the HDFS splits with ReadHDFSSplit jobs. These workers are responsible for reading data from HDFS. Other workers that need these data will send out GetSplit requests to them to fetch the splits.
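The preference-list handoff between scheduler and master can be sketched as below. This is a simplified stand-in: the pickWorker helper and its fallback behavior are assumptions, not the exact GoSpark logic.

```go
package main

import "fmt"

// pickWorker returns the first preferred worker that is currently
// available; if none of the preferred workers are up, it falls back to
// any available worker.
func pickWorker(preferred []string, available map[string]bool) (string, bool) {
	for _, w := range preferred {
		if available[w] {
			return w, true
		}
	}
	for w, up := range available {
		if up {
			return w, true
		}
	}
	return "", false
}

func main() {
	avail := map[string]bool{"vision24": true, "vision25": true}
	// The scheduler prefers vision30 (holds the split) then vision25.
	w, _ := pickWorker([]string{"vision30", "vision25"}, avail)
	fmt.Println(w) // vision25
}
```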

2.2.2 Master The master module builds the bridge between the scheduler and the workers. Specifically, when the scheduler needs to assign a job to a worker, it calls master.AssignJob(), which selects a worker and assigns it the job through the DoJob RPC. In addition, the master also provides a Register RPC for workers to register and to update their status.

Code 3: Master API

    type Master struct {
        MasterAddress string
        MasterPort    string
        workers       map[string]WorkerInfo
    }
    func MakeMaster(ip string, port string) *Master {...}
    func (mr *Master) StartRegistrationServer() {...}
    func (mr *Master) Register(args *RegisterArgs, res *RegisterReply) error {...}
    func (mr *Master) WorkersAvailable() map[string]WorkerInfo {...}
    func (mr *Master) AssignJob(workersPreferred []string, force bool, args *DoJobArgs, reply *DoJobReply) (bool, string) {
        ...
        ok := call(w, "Worker.DoJob", args, reply)
        ...
    }
    func (mr *Master) webHandler(ws *websocket.Conn) {...}

Load Balancing When assigning a job, the master selects an available worker based on the following information: a) the list of data-locality preferences that the scheduler provides, b) the number of CPUs on each worker machine, and c) the number of GoSpark jobs currently running on each worker. To demonstrate the data locality and load balancing features of GoSpark, we implemented a monitor page4 that displays the real-time CPU and memory usage of all the workers. Its implementation establishes WebSocket connections between JavaScript and Go. Figure 2 shows two snapshots of the status of the workers.
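A simple way to combine criteria b) and c) is a jobs-per-CPU ratio. The exact scoring used by GoSpark is not specified above, so the leastLoaded helper below is an illustrative assumption:

```go
package main

import "fmt"

// WorkerInfo holds the load statistics the master tracks per worker.
type WorkerInfo struct {
	NCPU        int
	RunningJobs int
}

// leastLoaded picks the candidate with the lowest jobs-per-CPU ratio.
// Candidates are assumed to have NCPU > 0.
func leastLoaded(candidates []string, info map[string]WorkerInfo) string {
	best, bestLoad := "", -1.0
	for _, w := range candidates {
		wi := info[w]
		load := float64(wi.RunningJobs) / float64(wi.NCPU)
		if best == "" || load < bestLoad {
			best, bestLoad = w, load
		}
	}
	return best
}

func main() {
	info := map[string]WorkerInfo{
		"vision24": {NCPU: 16, RunningJobs: 16}, // load 1.0
		"vision25": {NCPU: 8, RunningJobs: 2},   // load 0.25
	}
	fmt.Println(leastLoaded([]string{"vision24", "vision25"}, info)) // vision25
}
```

In practice the master would first filter candidates by the scheduler's locality preferences (criterion a) and only then apply such a tiebreaker.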

2.2.3 Workers

Workers directly conduct the data processing tasks specified in DoJob RPC arguments. Different jobs are processed in parallel.


Code 5: Worker API

    type Worker struct {
        unreliable bool // for testing
        mem        map[string]([]interface{})
        ...
    }
    func (wk *Worker) DoJob(args *DoJobArgs, res *DoJobReply) error {...}
    func (wk *Worker) Shutdown(args *ShutdownArgs, res *ShutdownReply) error {...}
    func MakeWorker(MasterAddress string, MasterPort string, me string, port string, unrel bool) *Worker {...}
    func (wk *Worker) kill() {...} // for testing

4 The worker status monitor is accessible at http://web.mit.edu/jiasi/www/ws.html.


To tolerate network failures, the workers not only register with the master during initialization, but also periodically re-register for as long as they are alive.

Memory management To achieve fast in-memory computation, each worker stores all computation results in its local memory mem until it is told to delete them by DelSplit jobs. For workers to cooperate, each intermediate result is uniquely identified by a SplitID. The SplitIDs are allocated by the scheduler in DoJob arguments. For example, when worker A needs input data from worker B, the scheduler tells it the address of worker B and the SplitID of the data. With this information, worker A can send a GetSplit request to worker B. Worker B then looks up the data in its local memory and replies to worker A with its contents.
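The worker-side split table can be sketched as a mutex-protected map, shown below. The splitStore type and its Put/Get/Del methods are simplifications invented for this example; in GoSpark the equivalent operations arrive as DoJob, GetSplit, and DelSplit requests.

```go
package main

import (
	"fmt"
	"sync"
)

// splitStore mimics a worker's in-memory split table: results are kept
// under a unique SplitID until a DelSplit-style delete removes them.
type splitStore struct {
	mu  sync.Mutex
	mem map[int64][]interface{}
}

func newSplitStore() *splitStore {
	return &splitStore{mem: map[int64][]interface{}{}}
}

// Put records a computed split under its scheduler-allocated SplitID.
func (s *splitStore) Put(id int64, data []interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.mem[id] = data
}

// Get serves a GetSplit-style request from another worker.
func (s *splitStore) Get(id int64) ([]interface{}, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	d, ok := s.mem[id]
	return d, ok
}

// Del frees the split once the scheduler no longer needs it.
func (s *splitStore) Del(id int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.mem, id)
}

func main() {
	st := newSplitStore()
	st.Put(42, []interface{}{"a", "b"})
	d, ok := st.Get(42)
	fmt.Println(ok, len(d)) // true 2
	st.Del(42)
	_, ok = st.Get(42)
	fmt.Println(ok) // false
}
```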

Function Literals Since Go cannot serialize function literals, it is impossible to pass their definitions from the master to workers through RPCs. Instead, we compile user functions together with the GoSpark code on worker machines. The master tells workers which function to run by passing function names as strings in DoJob RPC arguments. Workers then use Go's run-time reflection to call the desired functions by name.

Code 6: Call a function by name with Go reflection

    fn := reflect.ValueOf(&UserFunc{}).MethodByName(args.Function) // look up by name
    a1 := reflect.ValueOf(KeyValue{Value: line})                   // wrap arguments in data structures
    a2 := reflect.ValueOf(KeyValue{Value: args.Data})
    r := fn.Call([]reflect.Value{a1, a2})                          // call function by name
    s := r[0].Interface()                                          // result

There are still some technical restrictions, however. The reflect package requires a data structure on which to look up methods by name. Thus, we define UserFunc and require users to implement their function literals as methods of this type. Another restriction is that reflect does not support calling functions whose arguments have type interface{}. Thus, we also wrap user function arguments in a data structure (we chose KeyValue).

Unreliability In order to test the fault tolerance of GoSpark, we sometimes need to force network connections to fail. To this end, we use conn.Close() to discard RPC requests, and syscall.Shutdown() to force replies to be discarded after the requests have been processed.

2.3 Fault Tolerance

GoSpark has fault tolerance mechanisms at three different levels.

RDD fault tolerance At the execution stage of RDDs, the worker first checks the dependencies of the RDDs involved in the assigned job. If any splits are missing, the worker reports this to the scheduler. The scheduler then checks the dependencies of each missing split and re-launches the task of processing that split on some worker. Once all the RDD dependencies are complete, the failed job is assigned again.

Worker fault tolerance Workers may sometimes be disconnected from the master due to network failures or worker server crashes. If an RPC call from the master to a worker fails, the master retries a few times. If the worker is still unreachable, the master considers it crashed and removes it from the list of available workers. It then selects another worker from the list and assigns it the job. If the master cannot reach a worker that is in fact alive, the worker keeps calling the Register RPC on the master; when its registration requests eventually reach the master, the master marks the worker as available again.
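The retry-then-reassign policy can be sketched as follows. The retry count, the assignWithRetry helper, and modeling the DoJob RPC as a boolean callback are assumptions made for the illustration.

```go
package main

import "fmt"

// assignWithRetry tries each candidate worker in turn, retrying the RPC
// a few times before declaring that worker dead and moving on. call
// stands in for the DoJob RPC and reports whether it succeeded.
func assignWithRetry(workers []string, retries int, call func(w string) bool) (string, bool) {
	for _, w := range workers {
		for i := 0; i < retries; i++ {
			if call(w) {
				return w, true
			}
		}
		// w is considered crashed: fall through to the next worker.
	}
	return "", false
}

func main() {
	attempts := map[string]int{}
	rpc := func(w string) bool {
		attempts[w]++
		return w == "vision25" // only vision25 is reachable
	}
	w, done := assignWithRetry([]string{"vision24", "vision25"}, 3, rpc)
	fmt.Println(w, done, attempts["vision24"]) // vision25 true 3
}
```

Because every assigned job carries its full specification, handing the same DoJob arguments to a different worker is always safe.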

HDFS fault tolerance HDFS is configured to keep n replicas of each split among the workers. If a machine stays failed past a 10-minute time-out, the HDFS name node automatically recovers the lost replicas on other servers. Note that we assume the scheduler and master do not fail.

3 Demonstration

3.1 HDFS Setup

To build and test our distributed computation platform, we installed and configured HDFS (Hadoop 2.4) on 15 selected vision servers.

Since the vision servers use the shared NFS space as primary data storage, the default HDFS path is set to the local drive of each vision server, at /scratch/tmp, to avoid the data I/O bottleneck. The replication factor is set to 4. The Vision24 server is selected as the name node of Hadoop and the master/scheduler of GoSpark, while all the selected vision servers serve as data nodes and GoSpark workers.

HDFS partitions a large file into 128 MB data splits distributed among the data nodes. One issue here is the locality of data access between workers and HDFS: each worker should preferentially access the data splits stored on its own drive. An HDFS API is provided for workers to access the splits of a large file separately. However, since the official HDFS APIs are mainly in Java, we implemented an HDFS split reader in Java.

We wrote a wrapper in Go to call this Java reader. For example, for a large file stored at file := "hdfs://...", the scheduler first gets the information about the file's HDFS data splits by calling GetSplitInfo(file); the GoSpark master then knows where the splits are stored and assigns the HDFS link of each data split to a worker based on locality. Finally, each worker reads its data split from HDFS using ReadHDFSSplit(file, splitID) and conducts its data processing operations independently. Figure 1 illustrates the data access process in GoSpark.
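One plausible shape for such a wrapper is to shell out to the JVM with os/exec, as sketched below. The HDFSSplitReader class name and its argument order are hypothetical; the actual project layout is not specified here.

```go
package main

import (
	"fmt"
	"os/exec"
)

// readHDFSSplit shells out to a Java split reader and returns whatever
// the reader writes to stdout. The class name and arguments here are
// illustrative, not the exact GoSpark wrapper.
func readHDFSSplit(file string, splitID int) ([]byte, error) {
	cmd := exec.Command("java", "HDFSSplitReader", file, fmt.Sprint(splitID))
	return cmd.Output() // bytes of the requested split, or an error
}

func main() {
	// Without a local Java/HDFS setup the call simply returns an error.
	_, err := readHDFSSplit("hdfs://vision24/features.csv", 0)
	fmt.Println(err != nil) // true on a machine without the Java reader
}
```

A subprocess per read keeps the Go and Java sides decoupled, at the cost of JVM startup time on every call; a long-running reader process with a pipe protocol would amortize that cost.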

3.2 Image clustering using K-Means

We demonstrate image clustering with K-means on the GoSpark platform, using the SUN Database [2], a scene categorization benchmark, as the data. The database contains 108,754 images in total, drawn from 397 scene categories such as kitchen, coast, and mountain. The feature of each image is extracted as a 4096-dimensional vector by a pretrained neural network [1]. All the feature vectors are stored as a single 3.32 GB CSV file in HDFS.

K-means is a commonly used clustering method. Given an initial set of k centers chosen at random, the algorithm iterates between two steps: 1) the assignment step, which assigns each data vector to its closest center, and 2) the update step, which recalculates the k centers by taking the mean of the vectors assigned to each label.
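The two steps can be written down directly; the sketch below runs one iteration on small 2-D points rather than 4096-dimensional features, and the assign/update helpers are written for this illustration rather than taken from the GoSpark demo.

```go
package main

import "fmt"

// assign maps each vector to its nearest center (squared Euclidean
// distance): the K-means assignment step.
func assign(points, centers [][]float64) []int {
	labels := make([]int, len(points))
	for i, p := range points {
		best, bestD := 0, -1.0
		for c, ctr := range centers {
			d := 0.0
			for j := range p {
				diff := p[j] - ctr[j]
				d += diff * diff
			}
			if bestD < 0 || d < bestD {
				best, bestD = c, d
			}
		}
		labels[i] = best
	}
	return labels
}

// update recomputes each center as the mean of its assigned vectors:
// the K-means update step. Empty clusters keep a zero center here.
func update(points [][]float64, labels []int, k int) [][]float64 {
	dim := len(points[0])
	sums := make([][]float64, k)
	counts := make([]int, k)
	for c := range sums {
		sums[c] = make([]float64, dim)
	}
	for i, p := range points {
		c := labels[i]
		counts[c]++
		for j := range p {
			sums[c][j] += p[j]
		}
	}
	for c := range sums {
		if counts[c] > 0 {
			for j := range sums[c] {
				sums[c][j] /= float64(counts[c])
			}
		}
	}
	return sums
}

func main() {
	points := [][]float64{{0, 0}, {0, 1}, {10, 10}, {10, 11}}
	centers := [][]float64{{0, 0}, {10, 10}}
	labels := assign(points, centers)
	fmt.Println(labels)                    // [0 0 1 1]
	fmt.Println(update(points, labels, 2)) // [[0 0.5] [10 10.5]]
}
```

On GoSpark, the assignment step maps naturally onto a Map over the feature vectors and the update step onto a ReduceByKey keyed by cluster label, with the k centers re-broadcast between iterations.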
