Treffer: Primitive operations in computing massive data revisited
Chinese
Weitere Informationen
Ph.D. ; With the massive size of data from various sources, achieving high efficiency for even the primitive operations in big data can be challenging. In this thesis, we investigate the primitive operations in processing two widely used data types, graph and string. Graph can model the online social network, hyperlink graph, and knowledge graph etc., while string is widely used to represent textual data including document, message, and DNA sequence. ; We first investigate how to optimize the graph computing in a general way. During our research on various kinds of graph algorithms, we observe that CPU cache perfor mance is the key to the efficiency of graph computing. Motivated by this observation, we analyze the common access pattern of various primitive graph operations, such as Breadth First Search, Depth First Search, PageRank, etc., and formulate the graph ordering problem that aims to utilize the graph data locality to reduce the CPU cache miss ratio based on the common access pattern. We design efficient algorithms to compute the graph ordering and the experiments show that the optimized graph ordering can improve the CPU cache performance for most graph operations. Note that the graph algorithms can get speed up from the graph ordering without the requirement of modifying its implementation and data structures. ; We further study the probabilistic data structure and design hash-based labeling schemes for efficiently answering reachability query in graph data and similarity search query in string data. Graph reachability query is a primitive graph operation that checks whether a node can reach another node over the graph. We propose IP+ labeling, a novel randomized labeling approach designed based on min-wise independent permutation technique. The index size and index construction time of the proposed IP+ label are kept to be linear with the graph size. The IP+ label can guarantee that two unreachable nodes can be answered directly by their IP+ labels with a high probability and no graph traversal is ...