Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18958
Title: Multiple Resolutions in Semantic Image Segmentation (Πολλαπλές αναλύσεις στην σημασιολογική κατάτμηση εικόνων)
Authors: Μπενέτου, Σμαραγδή
Μαραγκός Πέτρος
Keywords: Semantic Image Segmentation
Machine Learning
Computer Vision
Transformers
Issue Date: 20-Oct-2023
Abstract: Computer vision is a field of computer science that aims to enhance the visual perception of computers. With numerous applications in areas such as autonomous driving, medical imaging, security, and agriculture, advances in computer vision attract considerable attention, and its tasks are steadily becoming more demanding. The field started with classification, which assigns a single label to an image. Object detection followed, which identifies and localizes all the objects in an image. Finally, semantic segmentation was introduced, which requires classifying every pixel in an image. Semantic segmentation is crucial in real-world applications because it enables complete perception of the environment. The earlier tasks were approached satisfactorily with convolutional models. With the move to transformers, object detection improved further, as transformers are more effective at detecting objects across multiple scales; their self-attention module meets this requirement and introduces contextual information that convolutions cannot capture. However, unlike classification or even object detection, semantic segmentation requires recognizing object shapes at multiple scales. Transformers can perform this task, but the architectural philosophy had to change in order to scale up performance on such a demanding task. Encoder-decoder architectures are holdovers from the classification task, since they compress information to a lower dimension to produce a single class label. Later approaches attempted to introduce multiple resolutions by using residual (skip) connections from the encoder to the decoder in order to prevent this loss of information. This technique, though, still cannot process information entirely without loss, and that is where multi-resolution architectures offer a solution. In the extended background research, multi-resolution models dominate the state of the art and improve on their respective single-resolution counterparts.
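The encoder-decoder pattern with skip connections described above can be illustrated with a toy sketch. This is not code from the thesis: it uses plain Python lists in place of feature tensors, and trivial subsampling in place of learned downsampling, purely to show how features saved by the encoder are fused back in during decoding so that spatial detail lost by downsampling can be recovered.

```python
# Illustrative toy sketch (not from the thesis): a U-Net-style
# encoder-decoder where each encoder stage's features are forwarded
# to the matching decoder stage via a skip connection.

def encode(x):
    """Downsample stage by stage, saving each stage's features for the skips."""
    skips = []
    while len(x) > 1:
        skips.append(x)      # save features before downsampling
        x = x[::2]           # toy downsampling: keep every other element
    return x, skips

def decode(x, skips):
    """Upsample stage by stage, fusing the matching encoder features."""
    for skip in reversed(skips):
        x = [v for v in x for _ in (0, 1)][:len(skip)]  # toy nearest-neighbour upsampling
        x = [a + b for a, b in zip(x, skip)]            # fuse via skip connection
    return x

out = decode(*encode([1, 2, 3, 4]))  # output has the input's full resolution
```

Without the `skips.append(x)` line, the decoder would have to reconstruct full-resolution output from the single bottleneck value alone, which is exactly the information loss the abstract describes.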
In theory, multiple resolutions can only improve a model, since they contribute information beyond what the original model produces. Mask2Former is a multi-purpose segmentation model that can be trained, without architectural changes, for semantic segmentation, instance segmentation, and panoptic segmentation. It is composed of a pixel-level module, a transformer decoder, and a segmentation head. The pixel-level module can be any feature-extraction model; until now, however, only encoder-decoder architectures have been used. The goal of this diploma thesis is therefore to introduce high resolution into the Mask2Former pixel-level module with the prospect of improving its semantic segmentation performance. With the multi-resolution architecture we achieved an improvement of 0.3 mIoU over the original model's performance on Cityscapes and 0.2 mIoU on ADE20K.
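The three-stage composition of Mask2Former described in the abstract can be sketched as a simple pipeline. All names and the stand-in callables below are hypothetical, not the authors' code; the point is only that the pixel-level module is an interchangeable first stage, so the thesis's multi-resolution backbone can replace it without touching the transformer decoder or the segmentation head.

```python
# Illustrative sketch (hypothetical names, not the authors' implementation):
# Mask2Former as described in the abstract chains three components.
def mask2former_forward(image, pixel_module, transformer_decoder, seg_head):
    features = pixel_module(image)           # stage 1: pixel-level features (swappable backbone)
    queries = transformer_decoder(features)  # stage 2: mask queries attend to the features
    return seg_head(queries, features)       # stage 3: per-pixel class/mask predictions

# Toy integer stand-ins just to exercise the interface; a real model
# would pass tensors through learned modules.
pred = mask2former_forward(
    image=[1, 3],
    pixel_module=lambda img: [v * 2 for v in img],
    transformer_decoder=lambda feats: sum(feats),
    seg_head=lambda q, f: (q, f),
)
```

Because stage 1 is just a callable producing features, replacing a single-resolution encoder-decoder with a multi-resolution pixel-level module, as the thesis proposes, changes only the `pixel_module` argument.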
Appears in Collections: Διπλωματικές Εργασίες (Diploma Theses)

Files in This Item:
File: DiplomaThesis_BenetouSmaragdi.pdf — Size: 13.67 MB — Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.