Foundation Model Empowered Real-Time Video Conference With Semantic Communications
Document Type
Article
Source of Publication
IEEE Transactions on Image Processing
Publication Date
1-1-2026
Abstract
With the development of real-time video conferencing, interactive multimedia services have proliferated, leading to a surge in traffic. Interactivity is becoming one of the main features of future multimedia services, which poses a new challenge to Computer Vision (CV) for communications. Moreover, many CV directions for video, such as recognition, understanding, saliency segmentation, and coding, cannot satisfy the demands of multiple interactive tasks without integration. Meanwhile, with the rapid development of foundation models, we apply task-oriented semantic communications to address these demands. Therefore, we propose a novel framework, called Real-Time Video Conference with Foundation Model (RTVCFM), to satisfy the interactivity requirements of multimedia services. Firstly, at the transmitter, we perform causal understanding and spatiotemporal decoupling on interactive videos, using the Video Time-Aware Large Language Model (VTimeLLM), Iterated Integrated Attributions (IIA), and the Segment Anything Model 2 (SAM2), to accomplish video semantic segmentation. Secondly, during transmission, we propose a two-stage semantic transmission optimization driven by Channel State Information (CSI), which also accommodates the asymmetric weights of semantic information in real-time video, achieving a low bit rate and high semantic fidelity. Thirdly, at the receiver, RTVCFM performs multidimensional fusion of the complete semantic segmentation using the Diffusion Model for Foreground-Background Fusion (DMFBF), and then reconstructs the video streams. Finally, simulation results demonstrate that RTVCFM achieves a compression ratio as high as 95.6% while guaranteeing high semantic similarity of 98.73% in the Multi-Scale Structural Similarity Index Measure (MS-SSIM) and 98.35% in the Structural Similarity Index (SSIM), showing that the reconstructed video closely resembles the original.
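The abstract reports reconstruction quality in SSIM and MS-SSIM. For readers unfamiliar with these metrics, the following is a minimal NumPy sketch of the single-window SSIM formula (luminance, contrast, and structure terms combined); it is not the authors' evaluation code, and the multi-scale variant (MS-SSIM) additionally averages contrast/structure terms over progressively downsampled scales.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two grayscale images.

    Uses the standard stabilizing constants C1 = (0.01*L)^2 and
    C2 = (0.03*L)^2, where L is the dynamic range of pixel values.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()          # mean luminance
    vx, vy = x.var(), y.var()            # variance (contrast)
    cov = ((x - mx) * (y - my)).mean()   # covariance (structure)
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )
```

A score of 1.0 indicates identical images; distortions such as channel noise in the reconstructed frame pull the score below 1.0, which is why a value near 0.98 indicates close structural agreement with the original video.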
DOI Link
ISSN
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
Volume
35
First Page
1740
Last Page
1755
Disciplines
Computer Sciences
Keywords
foundation models, generative AI reconstruction, Interactivity, semantic communications, video semantic segmentation
Scopus ID
Recommended Citation
Chen, Mingkai; Ma, Wenbo; Zeng, Mujian; He, Xiaoming; Xiong, Jian; Wang, Lei; Al-Dulaimi, Anwer; and Mumtaz, Shahid, "Foundation Model Empowered Real-Time Video Conference With Semantic Communications" (2026). All Works. 7883.
https://zuscholars.zu.ac.ae/works/7883
Indexed in Scopus
yes
Open Access
no