Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19623
Title: | Data Lakehouse Performance Study |
Authors: | Σιδέρης, Κωνσταντίνος; Τσουμάκος, Δημήτριος |
Keywords: | Big Data; Data Lakes; Batch Processing; Stream Processing; Delta Lake; Apache Hudi |
Issue Date: | 23-Jun-2025 |
Abstract: | As organisations increasingly adopt lakehouse architectures to support big data analytics, understanding the performance trade-offs of utilising enhanced storage layers instead of standard data lake architectures is essential. This master's dissertation aims to present a comprehensive performance evaluation of two leading data lakehouse solutions, Delta Lake and Apache Hudi, focusing on both batch and stream processing workloads. Through the benchmarking process, we compare Delta Lake and Hudi against standard data lake implementations, which consist of a simple storage layer queried by an analytics engine, in this case HDFS and Apache Spark. Being built on top of data lakes, lakehouses leverage their strengths while simultaneously introducing new features, such as ACID transactions, schema enforcement, schema evolution and data governance mechanisms, to address the issues data lakes face. Additionally, they introduce optimisations, such as indexing, data skipping and partition pruning, to further improve them. Throughout this thesis, we present these features and, through benchmarks, evaluate how they improve performance and whether the added functionalities justify the use of lakehouses, even in cases where they may underperform. |
URI: | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19623 |
Appears in Collections: | Diploma Theses (Διπλωματικές Εργασίες - Theses) |
Files in This Item:
File | Description | Size | Format
---|---|---|---
03118134_thesis.pdf | | 1.34 MB | Adobe PDF
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.