High-Performance Data Analytics with Ray

Γεωργίου, Χαρίδημος

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19229

Full metadata record

DC Field	Value	Language
dc.contributor.author	Γεωργίου, Χαρίδημος	-
dc.date.accessioned	2024-07-24T11:10:31Z	-
dc.date.available	2024-07-24T11:10:31Z	-
dc.date.issued	2024-07-23	-
dc.identifier.uri	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19229	-
dc.description.abstract	Big Data analysis and Machine Learning now play a significant role across many fields, driving demand for efficient processing and analysis tools. Performing complex computations effectively, and handling vast amounts of data well enough to extract meaningful insights, requires distributed computing frameworks. Frameworks such as Ray address the computational demands of Big Data analytics and ML tasks. Ray provides APIs that parallelize Python code, which makes it a powerful tool for distributed computing. The objective of this thesis is to analyze Ray's performance on various applications: ETL operations, graph processing, distributed ML training, and hyperparameter tuning. Apache Spark, a comparable framework that is widely used and recognized for its powerful data-processing capabilities and extensive library support, serves as the point of reference in this work. The experimental evaluation was carried out on a cluster, measuring execution time, CPU time, and memory usage across different data sizes and node configurations. The results show that Spark outperforms Ray on ETL and graph operations, owing to its more mature ecosystem and efficient memory usage. Ray, however, outperformed Spark on ML training and hyperparameter tuning, which showcases its parallel processing capabilities.	en_US
dc.language	en	en_US
dc.subject	Ray	en_US
dc.subject	Apache Spark	en_US
dc.subject	Distributed Computing	en_US
dc.subject	Machine Learning	en_US
dc.subject	Data Analytics	en_US
dc.subject	Graph Analytics	en_US
dc.title	High-Performance Data Analytics with Ray	en_US
dc.description.pages	86	en_US
dc.contributor.supervisor	Τσουμάκος Δημήτριος	en_US
dc.department	Division of Computer Science	en_US
Appears in Collections:	Diploma Theses

Files in This Item:

File	Description	Size	Format
diploma_thesis_Georgiou_Charidimos.pdf		2.7 MB	Adobe PDF