Treffer: Primitive operations in computing massive data revisited

Title:
Primitive operations in computing massive data revisited
Contributors:
Wei, Hao (author.), Yu, Jeffrey Xu (thesis advisor.), Chinese University of Hong Kong Graduate School. Division of Systems Engineering and Engineering Management. (degree granting institution.)
Publication Year:
2017
Collection:
The Chinese University of Hong Kong: CUHK Digital Repository / 香港中文大學數碼典藏
Document Type:
Fachzeitschrift text
File Description:
electronic resource; remote; 1 online resource (xiii, 174 leaves) : illustrations (some color); computer; online resource
Language:
English
Chinese
Rights:
Use of this resource is governed by the terms and conditions of the Creative Commons "Attribution-NonCommercial-NoDerivatives 4.0 International" License (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Accession Number:
edsbas.83DC5016
Database:
BASE

Weitere Informationen

Ph.D. ; With the massive size of data from various sources, achieving high efficiency for even the primitive operations in big data can be challenging. In this thesis, we investigate the primitive operations in processing two widely used data types, graph and string. Graph can model the online social network, hyperlink graph, and knowledge graph etc., while string is widely used to represent textual data including document, message, and DNA sequence. ; We first investigate how to optimize the graph computing in a general way. During our research on various kinds of graph algorithms, we observe that CPU cache perfor mance is the key to the efficiency of graph computing. Motivated by this observation, we analyze the common access pattern of various primitive graph operations, such as Breadth First Search, Depth First Search, PageRank, etc., and formulate the graph ordering problem that aims to utilize the graph data locality to reduce the CPU cache miss ratio based on the common access pattern. We design efficient algorithms to compute the graph ordering and the experiments show that the optimized graph ordering can improve the CPU cache performance for most graph operations. Note that the graph algorithms can get speed up from the graph ordering without the requirement of modifying its implementation and data structures. ; We further study the probabilistic data structure and design hash-based labeling schemes for efficiently answering reachability query in graph data and similarity search query in string data. Graph reachability query is a primitive graph operation that checks whether a node can reach another node over the graph. We propose IP+ labeling, a novel randomized labeling approach designed based on min-wise independent permutation technique. The index size and index construction time of the proposed IP+ label are kept to be linear with the graph size. The IP+ label can guarantee that two unreachable nodes can be answered directly by their IP+ labels with a high probability and no graph traversal is ...