Scalable, Workload Aware Indexing And Query Processing Over Unstructured Data

Nikolaos Papailiou

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/9064

Full metadata record

DC Field	Value	Language
dc.contributor.author	Nikolaos Papailiou
dc.date.accessioned	2018-07-22T22:50:28Z	-
dc.date.available	2018-07-22T22:50:28Z	-
dc.date.issued	2016-12-5
dc.date.submitted	2016-11-1
dc.identifier.uri	http://artemis-new.cslab.ece.ntua.gr:8080/jspui/handle/123456789/9064	-
dc.description.abstract	The pace at which data are described, queried and exchanged using unstructured data representations is constantly growing. Semantic Web technologies have emerged as one of the prevalent unstructured data sources. Utilizing the RDF description model, they attempt to encode and make openly available various World Wide Web datasets. Therefore, the constantly increasing volume of available data calls for efficient and scalable solutions for their management. In this thesis, we devise distributed algorithms and techniques for data management, which can scale and handle huge datasets. We introduce H2RDF+, a fully distributed RDF store that combines the MapReduce processing framework with a NoSQL distributed database. Creating 6 indexes over HBase tables, H2RDF+ can process complex queries, making adaptive decisions on both the join ordering and the join execution. Joins are executed using either distributed or centralized resources, depending on their cost. Furthermore, we present a novel system that addresses graph-based, workload-adaptive indexing of large RDF graphs by caching SPARQL query results. At the heart of the system lies a SPARQL query canonical labelling algorithm that is used to uniquely index and reference SPARQL query graphs as well as their isomorphic forms. We integrate our canonical labelling algorithm with a dynamic programming planner in order to generate the optimal join execution plan, examining the utilization of both primitive triple indexes and cached query results. By monitoring cache requests, our system is able to identify and cache SPARQL queries that, even if not explicitly issued, greatly reduce the average response time of a workload. The proposed cache is modular in design, allowing integration with different RDF stores. Another ever-increasing source of unstructured data is Internet traffic.
Network datasets collected at large networks such as Internet Exchange Points (IXPs) can be on the order of terabytes per hour. To handle analytics over such datasets, we present Datix, a fully decentralized, open-source analytics system for network traffic data that relies on smart partitioning storage schemes to support fast join algorithms and efficient execution of filtering queries. In brief, Datix manages to efficiently answer queries within minutes, compared to more than 24 hours of processing when executing existing Python-based code in single-node setups. Datix also achieves a nearly 70% speedup compared to baseline query implementations of popular big data analytics engines such as Hive and Shark.
dc.language	English
dc.subject	rdf
dc.subject	hadoop
dc.subject	hbase
dc.subject	distributed databases
dc.subject	caching
dc.subject	datix
dc.subject	h2rdf
dc.subject	h2rdf+
dc.title	Scalable, Workload Aware Indexing And Query Processing Over Unstructured Data
dc.type	PhD Thesis
dc.description.pages	158
dc.contributor.supervisor	Κοζύρης Νεκτάριος
dc.department	Division of Information Technology and Computers
dc.organization	NTUA, School of Electrical and Computer Engineering
Appears in Collections:	Ph.D. Theses

Files in This Item:

File	Size	Format
PD2016-0050.pdf	6.16 MB	Adobe PDF
