Abstract
Visible-band image analysis plays a crucial role in daily life and industrial applications, yet such images are inherently constrained by their limited spectral range. These limitations become particularly evident under poor illumination, where capturing fine detail and recognizing objects become difficult. Objects in natural environments, by contrast, emit electromagnetic radiation, known as thermal radiation, at frequencies invisible to the human eye. Infrared imaging covers a broader spectral range than visible imaging and is less sensitive to adverse environmental conditions such as low light, fog, or occlusion. Consequently, infrared images are highly valuable under unfavorable lighting, although they suffer from lower spatial resolution and a lack of color and texture detail compared to visible images. Fusing visible and infrared images is therefore an effective way to generate composite representations that combine the high spatial detail of visible images with the spectral advantages of infrared images. This research proposes a feature-level, deep learning-based approach to visible-infrared image fusion that addresses the challenges of redundant information and semantic understanding. Most existing methods emphasize the statistical features and visual quality of the fused image while overlooking its use in high-level computer vision tasks such as object detection, tracking, and scene understanding, which often leads to a loss of semantic information. The key idea of the proposed framework is to integrate an instance segmentation network with an image fusion network, embedding object-level semantic information into the fused image to enhance its utility for high-level vision tasks while keeping the fusion process executable in real time.
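Concretely, such a coupling between a fusion network and a segmentation network is commonly realized through a joint training objective; the following formulation is a schematic sketch with illustrative symbols, not quoted from the thesis:

\[
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{fusion}}\left(I_f, I_{\mathrm{vis}}, I_{\mathrm{ir}}\right) + \lambda \, \mathcal{L}_{\mathrm{seg}}\left(S(I_f), Y\right)
\]

where \(I_f\) is the fused image produced from the visible image \(I_{\mathrm{vis}}\) and the infrared image \(I_{\mathrm{ir}}\), \(S(\cdot)\) is the instance segmentation network, \(Y\) denotes the ground-truth instance masks, and \(\lambda\) balances fusion quality against segmentation accuracy.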
The proposed architecture is built on deep convolutional neural networks, with the visible and infrared images processed in parallel through two separate branches. The high-level features extracted in each branch are first enhanced by self-attention and then fused through a cross-modal attention mechanism. The fused features are reconstructed into a fused image and fed into an instance segmentation network, and the segmentation errors on object boundaries and regions are propagated back to the fusion network as supervisory signals during training. This mechanism not only improves boundary accuracy but also compels the fusion network to learn more effective combinations of spectral and semantic features at the object level. For training and evaluation, the Tokyo Multi-Spectral dataset was employed, and dedicated instance segmentation labels were generated for it to provide precise semantic supervision. Experimental results demonstrate that the proposed method significantly outperforms existing approaches on object detection tasks, achieving superior accuracy on metrics such as Intersection over Union (IoU). These findings highlight the effectiveness of the proposed architecture in integrating spectral and semantic information, yielding substantial improvements in high-level computer vision tasks.
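Purely as an illustrative sketch (not the thesis implementation; all module names, channel widths, and layer choices below are assumptions), the parallel-branch design with self-attention followed by cross-modal attention might look as follows in PyTorch:

# Illustrative sketch only: a minimal dual-branch fusion head with
# per-modality self-attention and cross-modal attention, loosely
# following the architecture described above. All names, channel
# sizes, and layer choices are hypothetical assumptions.
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    # Convolutional encoder plus self-attention for one modality.
    def __init__(self, in_ch, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):
        f = self.encoder(x)                     # (B, C, H, W)
        h, w = f.shape[2], f.shape[3]
        seq = f.flatten(2).transpose(1, 2)      # (B, H*W, C)
        seq, _ = self.self_attn(seq, seq, seq)  # self-attention enhancement
        return seq, (h, w)

class CrossModalFusion(nn.Module):
    # Two parallel branches fused by cross-modal attention, then
    # reconstructed into a fused image.
    def __init__(self, dim=64):
        super().__init__()
        self.vis = FusionBranch(in_ch=3, dim=dim)  # visible (RGB) branch
        self.ir = FusionBranch(in_ch=1, dim=dim)   # infrared (thermal) branch
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.reconstruct = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, visible, infrared):
        v, (h, w) = self.vis(visible)
        t, _ = self.ir(infrared)
        # Visible features query the infrared features; a fuller model
        # might also attend in the reverse direction.
        fused, _ = self.cross_attn(v, t, t)
        fused = fused.transpose(1, 2).reshape(fused.size(0), -1, h, w)
        return torch.sigmoid(self.reconstruct(fused))

# Example: fused = CrossModalFusion()(torch.rand(1, 3, 64, 64),
#                                      torch.rand(1, 1, 64, 64))
# In the full framework, the fused image would feed an instance
# segmentation network whose loss back-propagates into the fusion weights.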
Keywords: Image Fusion; Visible Image; Infrared Image; Instance Segmentation; Deep Learning; Object Detection.