Journal Articles
7 articles found.
1. Key Technologies and Applications of a Large-Model-Based Industrial Quality Inspection System (基于大模型的工业质检系统关键技术及应用)
Authors: 李苒笙, 李志强, 贾北洋. 《广东通信技术》 (Guangdong Communication Technology), 2025, Issue 5, pp. 35-39, 44 (6 pages).
The large-model-based industrial quality inspection system currently covers three key industries: textiles, electronics, and equipment manufacturing. It provides pre-trained models for multiple domains and uses data augmentation strategies to alleviate the scarcity of abnormal samples, improving model generalization and robustness. The system also supports customized fine-tuning of the pre-trained models and applies model distillation to generate lightweight versions for efficient deployment and inference.
Keywords: industrial quality inspection; CLIP (Contrastive Language-Image Pre-training); few-shot learning; data augmentation; customized fine-tuning
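The distillation step mentioned in this abstract can be illustrated with a minimal PyTorch sketch; the tensor shapes, temperature, and loss weighting below are illustrative assumptions, not the system's actual configuration.

```python
# Minimal sketch of response-based knowledge distillation, of the kind used to derive
# a lightweight student model from a large pre-trained teacher. All shapes, the
# temperature, and the loss weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Combine soft-target KL divergence with ordinary cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                 # rescale soft-target gradients by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random tensors standing in for defect-classification logits.
student_logits = torch.randn(8, 5)              # 8 images, 5 hypothetical defect classes
teacher_logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```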
2. Enhanced Panoramic Image Generation with GAN and CLIP Models
Authors: Shilong Li, Qiang Zhao. Journal of Beijing Institute of Technology, 2025, Issue 1, pp. 91-101 (11 pages).
Panoramic images, offering a 360-degree view, are essential in virtual reality (VR) and augmented reality (AR), enhancing realism with high-quality textures. However, acquiring complete and high-quality panoramic textures is challenging. This paper introduces a method using generative adversarial networks (GANs) and the contrastive language-image pretraining (CLIP) model to restore and control texture in panoramic images. The GAN model captures complex structures and maintains consistency, while CLIP enables fine-grained texture control via semantic text-image associations. GAN inversion optimizes latent codes for precise texture details. The resulting low dynamic range (LDR) images are converted to high dynamic range (HDR) using the Blender engine for seamless texture blending. Experimental results demonstrate the effectiveness and flexibility of this method in panoramic texture restoration and generation.
Keywords: panoramic images; environment texture; generative adversarial networks (GANs); contrastive language-image pretraining (CLIP) model; Blender engine; fine-grained control; texture generation
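The CLIP-guided GAN inversion idea described in this abstract can be sketched as follows; the generator, CLIP encoders, loss weights, and latent size are placeholder assumptions, not the paper's actual components or settings.

```python
# Minimal sketch of CLIP-guided GAN inversion: optimize a latent code so that the
# decoded image agrees with a text prompt under CLIP, plus a pixel reconstruction
# term on the observed (possibly incomplete) panorama. `generator`, `clip_image_enc`,
# and `clip_text_emb` are stand-ins for a pre-trained GAN and CLIP encoders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
generator = lambda w: torch.tanh(w.view(1, 3, 8, 8))        # stand-in GAN decoder
clip_image_enc = lambda img: img.mean(dim=(2, 3))            # stand-in CLIP image encoder
clip_text_emb = F.normalize(torch.randn(1, 3), dim=-1)       # stand-in prompt embedding

observed = torch.rand(1, 3, 8, 8)                            # partially observed panorama
w = torch.zeros(1, 3 * 8 * 8, requires_grad=True)            # latent code to optimize
opt = torch.optim.Adam([w], lr=0.05)

for step in range(200):
    img = generator(w)
    img_emb = F.normalize(clip_image_enc(img), dim=-1)
    clip_loss = 1.0 - (img_emb * clip_text_emb).sum()        # cosine distance to the prompt
    rec_loss = F.l1_loss(img, observed)                      # keep known regions consistent
    loss = clip_loss + 10.0 * rec_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```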
3. A Review on Vision-Language-Based Approaches: Challenges and Applications
Authors: Huu-Tuong Ho, Luong Vuong Nguyen, Minh-Tien Pham, Quang-Huy Pham, Quang-Duong Tran, Duong Nguyen Minh Huy, Tri-Hai Nguyen. Computers, Materials & Continua, 2025, Issue 2, pp. 1733-1756 (24 pages).
In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across various natural language processing tasks, such as visual question answering, and computer vision applications, including image captioning and image-text retrieval, highlighting their adaptability for complex, multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis is conducted to assess VLMs’ strengths, limitations, and applicability across tasks while examining challenges such as scalability, data quality, and fine-tuning complexities. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.
Keywords: Bootstrapping Language-Image Pre-training (BLIP); multimodal learning; vision-language model (VLM); vision-language pre-training (VLP)
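As one concrete entry point to the BLIP family surveyed in this review, a short captioning example with a publicly released Hugging Face checkpoint is sketched below; the checkpoint name is a generic Salesforce release and the image path is a placeholder, neither taken from the review itself.

```python
# Illustrative use of a public BLIP captioning checkpoint via Hugging Face Transformers.
# "photo.jpg" is a placeholder path for any input image.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("photo.jpg").convert("RGB")            # placeholder input image
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))  # generated caption
```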
4. Region-Aware Fashion Contrastive Learning for Unified Attribute Recognition and Composed Retrieval (cited by 1)
Authors: WANG Kangping, ZHAO Mingbo. Journal of Donghua University (English Edition) (CAS), 2024, Issue 4, pp. 405-415 (11 pages).
Clothing attribute recognition has become an essential technology that enables users to automatically identify the characteristics of clothes and search for clothing images with similar attributes. However, existing methods cannot recognize newly added attributes and may fail to capture region-level visual features. To address these issues, a region-aware fashion contrastive language-image pre-training (RaF-CLIP) model was proposed. This model aligns cropped and segmented images with category and multiple fine-grained attribute texts, matching fashion regions to their corresponding texts through contrastive learning. Clothing retrieval finds suitable clothing based on user-specified categories and attributes; to further improve retrieval accuracy, an attribute-guided composed network (AGCN) was introduced as an additional component on RaF-CLIP, specifically designed for composed image retrieval, i.e., modifying a reference image according to a textual expression to retrieve the expected target. By adopting a transformer-based bidirectional attention and gating mechanism, it fuses and selects image features and attribute text features. Experimental results show that the proposed model achieves a mean precision of 0.6633 on attribute recognition and a recall@10 of 39.18 on composed image retrieval (recall@k is the percentage of correct samples appearing in the top k retrieval results), satisfying users' needs for freely searching for clothing through images and texts.
Keywords: attribute recognition; image retrieval; contrastive language-image pre-training (CLIP); image-text matching; transformer
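The region-text alignment described in this abstract follows the standard symmetric contrastive (InfoNCE) pattern; a minimal sketch is given below, with feature dimensions and temperature chosen for illustration rather than taken from RaF-CLIP.

```python
# Minimal sketch of a symmetric contrastive (InfoNCE) objective that aligns region-level
# image features with attribute/category text features. Dimensions and temperature are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """region_feats, text_feats: (N, D) embeddings for N matched region-text pairs."""
    region_feats = F.normalize(region_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = region_feats @ text_feats.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(len(logits))                    # matched pairs sit on the diagonal
    loss_r2t = F.cross_entropy(logits, targets)            # region -> text direction
    loss_t2r = F.cross_entropy(logits.t(), targets)        # text -> region direction
    return 0.5 * (loss_r2t + loss_t2r)

# Toy usage with random embeddings standing in for encoded crops and attribute texts.
loss = region_text_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512))
```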
5. Facial Expression Generation from Text with FaceCLIP
Authors: Wen-Wen Fu, Wen-Juan Gong, Chen-Yang Yu, Wei Wang, Jordi Gonzàlez. Journal of Computer Science & Technology, 2025, Issue 2, pp. 359-377 (19 pages).
Facial expression generation from pure textual descriptions is widely applied in human-computer interaction, computer-aided design, assisted education, etc. However, this task is challenging due to the intricate facial structure and the complex mapping between texts and images. Existing methods face limitations in generating high-resolution images or capturing diverse facial expressions. In this study, we propose a novel generation approach, named FaceCLIP, to tackle these problems. The proposed method utilizes a CLIP-based multi-stage generative adversarial model to produce vivid facial expressions at high resolution. With strong semantic priors from multi-modal textual and visual cues, the proposed method effectively disentangles facial attributes, enabling attribute editing and semantic reasoning. To facilitate text-to-expression generation, we build a new dataset, the FET dataset, which contains facial expression images and corresponding textual descriptions. Experiments on the dataset demonstrate improved image quality and semantic consistency compared with state-of-the-art methods.
Keywords: facial expression generation; contrastive language-image pre-training (CLIP); multi-stage; generative adversarial network (GAN)
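One way to probe the text-image semantic consistency this abstract reports is to score generated faces against candidate descriptions with an off-the-shelf CLIP checkpoint; the sketch below uses a public OpenAI model, a placeholder image path, and made-up prompts, and is not FaceCLIP's actual evaluation protocol.

```python
# Illustrative text-image semantic-consistency check with a public CLIP checkpoint.
# "generated_face.png" and the prompts are placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_face.png").convert("RGB")        # placeholder generated image
prompts = ["a smiling young woman", "an angry old man", "a surprised child"]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)   # match score per prompt
print(dict(zip(prompts, probs[0].tolist())))
```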
6. CLIP-Flow: Decoding images encoded in CLIP space
Authors: Hao Ma, Ming Li, Jingyuan Yang, Or Patashnik, Dani Lischinski, Daniel Cohen-Or, Hui Huang. Computational Visual Media (CSCD), 2024, Issue 6, pp. 1157-1168 (12 pages).
This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from such information. Moreover, to bridge the embedding space of CLIP and the latent space of StyleGAN, Real NVP is employed and modified with activation normalization and invertible convolution. As the images and text in CLIP share the same representation space, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested on the public dataset Multi-Modal CelebA-HQ for text-to-image synthesis. Experiments validated that our approach can generate high-quality text-matching images and is comparable with state-of-the-art methods, both qualitatively and quantitatively.
Keywords: image-to-image; text-to-image; contrastive language-image pretraining (CLIP); flow; StyleGAN
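The Real NVP bridge described in this abstract is built from invertible coupling blocks; the sketch below shows one affine coupling layer (activation normalization and invertible 1x1 convolution omitted for brevity), with layer sizes chosen for illustration rather than taken from CLIP-Flow.

```python
# Minimal sketch of one Real NVP-style affine coupling block of the kind used to map
# between a CLIP embedding space and a StyleGAN latent space. Layer sizes are illustrative.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        # Small MLP predicts per-dimension log-scale and shift from the untouched half.
        self.net = nn.Sequential(
            nn.Linear(self.half, 256), nn.ReLU(),
            nn.Linear(256, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                    # keep scales bounded for stability
        y2 = x2 * torch.exp(log_s) + t               # affine transform of the second half
        return torch.cat([x1, y2], dim=-1), log_s.sum(dim=-1)  # output + log-det term

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)            # exact inverse of the forward map
        return torch.cat([y1, x2], dim=-1)

# Toy usage: a 512-d CLIP-like embedding passes through the block and is recovered exactly.
block = AffineCoupling(512)
z, logdet = block(torch.randn(4, 512))
x_rec = block.inverse(z)
```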
7. KnowER: Knowledge enhancement for efficient text-video retrieval
Authors: Hongwei Kou, Yingyun Yang, Yan Hua. Intelligent and Converged Networks (EI), 2023, Issue 2, pp. 93-105 (13 pages).
The widespread adoption of the mobile Internet and the Internet of Things (IoT) has led to a significant increase in the amount of video data. While video data are increasingly important, language and text remain the primary means of interaction in everyday communication, so text-based cross-modal retrieval has become a crucial demand in many applications. Most previous text-video retrieval works utilize the implicit knowledge of pre-trained models such as contrastive language-image pre-training (CLIP) to boost retrieval performance. However, implicit knowledge only records the co-occurrence relationships existing in the data and cannot help the model understand specific words or scenes. Another type of out-of-domain knowledge, explicit knowledge, which usually takes the form of a knowledge graph, can play an auxiliary role in understanding the content of different modalities. Therefore, we study the application of an external knowledge base in a text-video retrieval model for the first time, and propose KnowER, a model based on knowledge enhancement for efficient text-video retrieval. The knowledge-enhanced model achieves state-of-the-art performance on three widely used text-video retrieval datasets, i.e., MSRVTT, DiDeMo, and MSVD.
Keywords: text-video retrieval; knowledge graph; contrastive language-image pre-training (CLIP)
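The CLIP-style retrieval backbone that knowledge enhancement builds on can be sketched as mean-pooled frame embeddings ranked by cosine similarity against the query text embedding; the encoders below are random stand-ins, not KnowER's actual towers.

```python
# Minimal sketch of a CLIP-style text-video retrieval baseline: frame embeddings are
# mean-pooled into a video embedding and ranked against the query text embedding.
# The encoder lambdas are placeholders for pre-trained CLIP image/text towers.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
encode_frames = lambda frames: torch.randn(frames.shape[0], 512)  # stand-in image tower
encode_text = lambda query: torch.randn(1, 512)                   # stand-in text tower

def video_embedding(frames):
    """frames: (T, 3, H, W) sampled frames -> single (1, 512) video embedding."""
    feats = F.normalize(encode_frames(frames), dim=-1)
    return F.normalize(feats.mean(dim=0, keepdim=True), dim=-1)   # temporal mean pooling

videos = [torch.rand(8, 3, 224, 224) for _ in range(5)]           # 5 candidate clips
gallery = torch.cat([video_embedding(v) for v in videos])         # (5, 512)
query = F.normalize(encode_text("a dog catching a frisbee"), dim=-1)

scores = gallery @ query.t()                                       # cosine similarity per clip
ranking = scores.squeeze(1).argsort(descending=True)
print(ranking.tolist())                                            # best-matching clips first
```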