Journal Articles
14 articles found
1. Enhanced sparse RCNN for transmission line bolt defect detection via text-to-image data augmentation and quality filtering
Authors: Chen Zhenyu, Yan Huaguang, Du Jianguang, Xue Meng, Zhao Shuai. High Technology Letters, 2026, No. 1, pp. 11-20.
To address the issues of inconsistent image quality and data scarcity in bolt defect detection for transmission lines, this paper proposes an improved sparse region-based convolutional neural network (RCNN) detection framework integrating image quality evaluation and text-to-image data augmentation. First, a HyperNetwork-based image quality assessment module is introduced to filter low-quality inspection images in terms of clarity and structural integrity, resulting in a high-quality training dataset. Second, a text-to-image diffusion model is utilized for sample augmentation. By designing text prompts that describe various bolt defect types under diverse lighting and viewing conditions, the model automatically generates realistic synthetic samples. The generated images are further filtered using a combination of quality and perceptual similarity metrics to ensure consistency with the real data distribution. Building upon the sparse RCNN baseline, a dynamic label assignment mechanism and a random decision path detection head are incorporated to enhance bounding box matching and prediction accuracy. Experimental results demonstrate that the proposed method significantly improves detection accuracy (mAP@0.5) over the original sparse RCNN while maintaining low computational cost, enabling more efficient and intelligent inspection of transmission line components.
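The filtering step described above (quality plus perceptual similarity against real data) can be sketched generically. The function and thresholds below are illustrative stand-ins, not the paper's actual metrics; in practice a no-reference IQA model and LPIPS would be plausible choices for `quality_score` and `perceptual_distance`:

```python
# Hypothetical post-generation filter: keep a synthetic image only if it
# passes both a quality threshold and a perceptual-similarity threshold
# to its nearest real sample. quality_score and perceptual_distance are
# stand-ins for the paper's (unspecified) metrics.
def filter_synthetic(images, real_refs, quality_score, perceptual_distance,
                     q_min=0.6, d_max=0.35):
    kept = []
    for img in images:
        if quality_score(img) < q_min:
            continue  # too blurry or structurally broken
        # Distance to the closest real reference approximates
        # consistency with the real data distribution.
        d = min(perceptual_distance(img, ref) for ref in real_refs)
        if d <= d_max:
            kept.append(img)
    return kept
```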
Keywords: sparse region-based convolutional neural network, HyperNetwork, image quality assessment, text-to-image generation, data augmentation, bolt defect detection, transmission line inspection
2. EvilPromptFuzzer: generating inappropriate content based on text-to-image models
Authors: Juntao He, Haoran Dai, Runqi Sui, Xuejing Yuan, Dun Liu, Hao Feng, Xinyue Liu, Wenchuan Yang, Baojiang Cui, Kedan Li. Cybersecurity, 2025, No. 4, pp. 99-118.
Text-to-image (TTI) models provide huge innovation ability for many industries, while the content security risks they trigger have also attracted wide attention. Considerable research has focused on content security threats of large language models (LLMs), yet comprehensive studies on the content security of TTI models are notably scarce. This paper introduces a systematic tool, named EvilPromptFuzzer, designed to fuzz evil prompts in TTI models. For 15 kinds of fine-grained risks, EvilPromptFuzzer employs the strong knowledge-mining ability of LLMs to construct seed banks, in which the seeds cover various types of characters, interrelations, actions, objects, expressions, body parts, locations, surroundings, etc. Subsequently, these seeds are fed into the LLMs to build scene-diverse prompts, which can weaken the semantic sensitivity related to the fine-grained risks. Hence, the prompts can bypass the content audit mechanism of the TTI model and ultimately help to generate images with inappropriate content. For the risks of violence, horrible, disgusting, animal cruelty, religious bias, political symbol, and extremism, the efficiency of EvilPromptFuzzer for generating inappropriate images based on DALL·E 3 is greater than 30%, namely, more than 30 generated images are malicious among 100 prompts. Specifically, the efficiency for horrible, disgusting, political symbols, and extremism reaches 58%, 64%, 71%, and 50%, respectively. Additionally, we analyzed the vulnerability of existing popular content audit platforms, including Amazon, Google, Azure, and Baidu. Even the most effective, the Google SafeSearch cloud platform, identifies only 33.85% of malicious images across three distinct categories.
Keywords: Risks of AI-generated content, Inappropriate content, text-to-image models
3. Feature-Grounded Single-Stage Text-to-Image Generation (Cited: 1)
Authors: Yuan Zhou, Peng Wang, Lei Xiang, Haofeng Zhang. Tsinghua Science and Technology (SCIE, EI, CAS, CSCD), 2024, No. 2, pp. 469-480.
Recently, Generative Adversarial Networks (GANs) have become the mainstream text-to-image (T2I) framework. However, a standard normal distribution noise of inputs cannot provide sufficient information to synthesize an image that approaches the ground-truth image distribution. Moreover, the multistage generation strategy results in complex T2I applications. Therefore, this study proposes a novel feature-grounded single-stage T2I model, which considers the "real" distribution learned from training images as one input and introduces a worst-case-optimized similarity measure into the loss function to enhance the model's generation capacity. Experimental results on two benchmark datasets demonstrate the competitive performance of the proposed model in terms of the Fréchet inception distance and inception score compared to those of some classical and state-of-the-art models, showing the improved similarities among the generated image, text, and ground truth.
Keywords: text-to-image (T2I), feature-grounded, single-stage generation, Generative Adversarial Network (GAN)
4. A Comprehensive Pipeline for Complex Text-to-Image Synthesis
Authors: Fei Fang, Fei Luo, Hong-Pan Zhang, Hua-Jian Zhou, Alix L.H. Chow, Chun-Xia Xiao. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2020, No. 3, pp. 522-537.
Synthesizing a complex scene image with multiple objects and a background according to a text description is a challenging problem. It needs to solve several difficult tasks across the fields of natural language processing and computer vision. We model it as a combination of semantic entity recognition, object retrieval and recombination, and objects' status optimization. To reach a satisfactory result, we propose a comprehensive pipeline to convert the input text to its visual counterpart. The pipeline includes text processing, foreground object and background scene retrieval, image synthesis using constrained MCMC, and post-processing. Firstly, we roughly divide the objects parsed from the input text into foreground objects and background scenes. Secondly, we retrieve the required foreground objects from the foreground object dataset segmented from the Microsoft COCO dataset, and retrieve an appropriate background scene image from the background image dataset extracted from the Internet. Thirdly, in order to ensure the rationality of foreground objects' positions and sizes in the image synthesis step, we design a cost function and use the Markov Chain Monte Carlo (MCMC) method as the optimizer to solve this constrained layout problem. Finally, to make the image look natural and harmonious, we further use Poisson-based and relighting-based methods to blend foreground objects and the background scene image in the post-processing step. The synthesized results and comparison results based on the Microsoft COCO dataset prove that our method outperforms some of the state-of-the-art methods based on generative adversarial networks (GANs) in visual quality of generated scene images.
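The abstract does not give the cost function itself. A minimal sketch of constrained layout search via a Metropolis-style MCMC sampler, with illustrative overlap and out-of-bounds penalties standing in for the paper's actual cost terms, might look like:

```python
import math
import random

# Hypothetical layout state: each object is (x, y, w, h) on a unit canvas.
# The cost terms below are illustrative; the paper's cost function is not
# specified in the abstract.

def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (x, y, w, h)."""
    dx = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    dy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(dx, 0.0) * max(dy, 0.0)

def cost(boxes):
    """Penalize pairwise overlap and out-of-canvas placement."""
    c = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            c += overlap_area(boxes[i], boxes[j])
        x, y, w, h = boxes[i]
        c += max(0.0, -x) + max(0.0, -y)                      # left/top bounds
        c += max(0.0, x + w - 1.0) + max(0.0, y + h - 1.0)    # right/bottom bounds
    return c

def mcmc_layout(boxes, steps=5000, sigma=0.02, temperature=0.01, seed=0):
    """Metropolis sampler: perturb one box at a time; accept downhill moves
    always and uphill moves with Boltzmann probability."""
    rng = random.Random(seed)
    current = [list(b) for b in boxes]
    c_cur = cost(current)
    for _ in range(steps):
        proposal = [b[:] for b in current]
        k = rng.randrange(len(proposal))
        proposal[k][0] += rng.gauss(0.0, sigma)  # jitter x
        proposal[k][1] += rng.gauss(0.0, sigma)  # jitter y
        c_new = cost(proposal)
        if c_new <= c_cur or rng.random() < math.exp((c_cur - c_new) / temperature):
            current, c_cur = proposal, c_new
    return current, c_cur

boxes = [(0.1, 0.1, 0.4, 0.4), (0.2, 0.2, 0.4, 0.4), (0.5, 0.5, 0.6, 0.6)]
final, c = mcmc_layout(boxes)
print("final cost:", round(c, 4))
```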
Keywords: image synthesis, scene generation, text-to-image conversion, Markov Chain Monte Carlo (MCMC)
5. CRD-CGAN: category-consistent and relativistic constraints for diverse text-to-image generation
Authors: Tao HU, Chengjiang LONG, Chunxia XIAO. Frontiers of Computer Science (SCIE, EI, CSCD), 2024, No. 1, pp. 61-75.
Generating photo-realistic images from a text description is a challenging problem in computer vision. Previous works have shown promising performance in generating synthetic images conditioned on text with Generative Adversarial Networks (GANs). In this paper, we focus on category-consistent and relativistic diverse constraints to optimize the diversity of synthetic images. Based on those constraints, a category-consistent and relativistic diverse conditional GAN (CRD-CGAN) is proposed to synthesize K photo-realistic images simultaneously. We use an attention loss and a diversity loss to improve the sensitivity of the GAN to word attention and noise. Then, we employ a relativistic conditional loss to estimate the probability that synthetic images are relatively real or fake, which improves the performance of the basic conditional loss. Finally, we introduce a category-consistent loss to alleviate over-category issues among the K synthetic images. We evaluate our approach using the Caltech-UCSD Birds-200-2011, Oxford 102 flower, and MS COCO 2014 datasets, and extensive experiments demonstrate the superiority of the proposed method in comparison with state-of-the-art methods in terms of the photorealism and diversity of the generated synthetic images.
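The abstract does not reproduce the relativistic conditional loss. A common unconditional form such losses build on, the relativistic average GAN loss (Jolicoeur-Martineau, 2018), can be sketched as follows; the text-conditioning term is omitted here:

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Relativistic average discriminator loss: score real samples
    relative to the mean fake score, and vice versa."""
    real_rel = real_logits - fake_logits.mean()
    fake_rel = fake_logits - real_logits.mean()
    return (F.binary_cross_entropy_with_logits(real_rel, torch.ones_like(real_rel))
            + F.binary_cross_entropy_with_logits(fake_rel, torch.zeros_like(fake_rel)))

def relativistic_g_loss(real_logits, fake_logits):
    """Generator side: flip the relativistic targets."""
    real_rel = real_logits - fake_logits.mean()
    fake_rel = fake_logits - real_logits.mean()
    return (F.binary_cross_entropy_with_logits(fake_rel, torch.ones_like(fake_rel))
            + F.binary_cross_entropy_with_logits(real_rel, torch.zeros_like(real_rel)))

# Toy usage with random critic outputs (no text conditioning).
real, fake = torch.randn(8, 1), torch.randn(8, 1)
print(relativistic_d_loss(real, fake).item(), relativistic_g_loss(real, fake).item())
```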
Keywords: text-to-image, diverse conditional GAN, relativistic, category-consistent
6. CAFE-GAN: CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination
Authors: Xuanhong Wang, Hongyu Guo, Jiazhen Li, Mingchen Wang, Xian Wang, Yijun Zhang. Computers, Materials & Continua, 2026, No. 1, pp. 1742-1760.
Over the past decade, large-scale pre-trained autoregressive and diffusion models rejuvenated the field of text-guided image generation. However, these models require enormous datasets and parameters, and their multi-step generation processes are often inefficient and difficult to control. To address these challenges, we propose CAFE-GAN, a CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination, which incorporates a pretrained CLIP model along with several key architectural innovations. First, we embed a coordinate attention mechanism into the generator to capture long-range dependencies and enhance feature representation. Second, we introduce a trainable linear projection layer after the CLIP text encoder, which aligns textual embeddings with the generator's semantic space. Third, we design a multi-scale discriminator that leverages pre-trained visual features and integrates a feature regularization strategy, thereby improving training stability and discrimination performance. Experiments on the CUB and COCO datasets demonstrate that CAFE-GAN outperforms existing text-to-image generation methods, achieving lower Fréchet Inception Distance (FID) scores and generating images with superior visual quality and semantic fidelity, with FID scores of 9.84 and 5.62 on the CUB and COCO datasets, respectively, surpassing current state-of-the-art text-to-image models by varying degrees. These findings offer valuable insights for future research on efficient, controllable text-to-image synthesis.
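The trainable linear projection bridging the CLIP text encoder and the generator is straightforward to sketch; the dimensions and module name below are assumptions for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Hypothetical bridge module: a trainable linear projection placed after a
# (typically frozen) CLIP text encoder. Dimensions here are assumptions.
class TextProjection(nn.Module):
    def __init__(self, clip_dim=512, gen_dim=256):
        super().__init__()
        self.proj = nn.Linear(clip_dim, gen_dim)

    def forward(self, clip_text_emb):
        # Map CLIP text embeddings into the generator's semantic space.
        return self.proj(clip_text_emb)

bridge = TextProjection()
emb = torch.randn(4, 512)   # stand-in for CLIP text features
print(bridge(emb).shape)    # torch.Size([4, 256])
```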
Keywords: Large vision language models, deep learning, computer vision, text-to-image generation
7. Generate Corresponding Image from Text Description Using Modified GAN-CLS Algorithm
Authors: GONG Fuzhou, XIA Zigeng. Journal of Systems Science & Complexity, 2026, No. 1, pp. 410-431.
Synthesizing images or texts automatically has become a useful research area in artificial intelligence. Generative adversarial networks (GANs), proposed by Goodfellow, et al. in 2014, make this task more efficient by using deep neural networks (DNNs). The authors consider generating corresponding images from a single-sentence input text description using a GAN. Specifically, the authors analyze the GAN-CLS algorithm, an advanced GAN method proposed by Reed, et al. in 2016. In this paper, the authors show a theoretical problem with this algorithm and correct it by modifying the objective function of the model. Experiments are performed on the Oxford-102 dataset and the CUB dataset to support the theoretical results. Since the proposed modification can be seen as an idea that can be used to improve all such GAN models, the authors try two models, GAN-CLS and AttnGAN_(GPT). In both models, the proposed modified algorithm is more stable and can generate images that are more plausible than those of the original algorithm. Also, some of the generated images match the input texts better, and the proposed modified algorithm performs better on quantitative indicators including FID and Inception Score. Finally, the authors propose some future application prospects of the modification idea, especially in the area of large language models.
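For context, the original GAN-CLS discriminator objective that the paper analyzes (Reed et al., 2016) scores matching real pairs as real, and both mismatched-text and fake-image pairs as fake; the paper's correction modifies this objective, which the abstract does not reproduce. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def gan_cls_d_loss(s_real_match, s_real_mismatch, s_fake_match):
    """Original GAN-CLS discriminator objective (Reed et al., 2016):
    real image + matching text -> real; real image + wrong text and
    fake image + matching text -> fake (the two fake terms averaged)."""
    ones = torch.ones_like(s_real_match)
    zeros = torch.zeros_like(s_real_match)
    return (F.binary_cross_entropy_with_logits(s_real_match, ones)
            + 0.5 * (F.binary_cross_entropy_with_logits(s_real_mismatch, zeros)
                     + F.binary_cross_entropy_with_logits(s_fake_match, zeros)))

# Toy usage with random discriminator logits.
print(gan_cls_d_loss(torch.randn(8, 1), torch.randn(8, 1), torch.randn(8, 1)).item())
```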
Keywords: Deep learning, generative adversarial networks, negative examples, text-to-image synthesis
8. Exploring the Latest Applications of OpenAI and ChatGPT: An In-Depth Survey (Cited: 3)
Authors: Hong Zhang, Haijian Shao. Computer Modeling in Engineering & Sciences (SCIE, EI), 2024, No. 3, pp. 2061-2102.
OpenAI and ChatGPT, as state-of-the-art language models driven by cutting-edge artificial intelligence technology, have gained widespread adoption across diverse industries. In the realm of computer vision, these models have been employed for intricate tasks including object recognition, image generation, and image processing, leveraging their advanced capabilities to fuel transformative breakthroughs. Within the gaming industry, they have found utility in crafting virtual characters and generating plots and dialogues, thereby enabling immersive and interactive player experiences. Furthermore, these models have been harnessed in the realm of medical diagnosis, providing invaluable insights and support to healthcare professionals in the realm of disease detection. The principal objective of this paper is to offer a comprehensive overview of OpenAI, OpenAI Gym, ChatGPT, DALL·E, stable diffusion, the pre-trained CLIP model, and other pertinent models in various domains, encompassing CLIP Text-to-Image, education, medical imaging, computer vision, social influence, natural language processing, software development, coding assistance, and chatbots, among others. Particular emphasis will be placed on comparative analysis and examination of popular text-to-image and text-to-video models under diverse stimuli, shedding light on the current research landscape, emerging trends, and existing challenges within the domains of OpenAI and ChatGPT. Through a rigorous literature review, this paper aims to deliver a professional and insightful overview of the advancements, potentials, and limitations of these pioneering language models.
Keywords: OpenAI, ChatGPT, DALL·E, stable diffusion, OpenAI Gym, text-to-image, text-to-video
9. Novel Framework for Generating Criminals Images Based on Textual Data Using Identity GANs
Authors: Mohamed Fathallah, Mohamed Sakr, Sherif Eletriby. Computers, Materials & Continua (SCIE, EI), 2023, No. 7, pp. 383-396.
Text-to-image generation is a vital task in different fields, such as combating crime and terrorism and quickly arresting lawbreakers. For several years, due to a lack of deep learning and machine learning resources, police officials required artists to draw the face of a criminal. Traditional methods of identifying criminals are inefficient and time-consuming. This paper presents a new hybrid model for converting text into the nearest images, then ranking the produced images according to the available data. The framework contains two main steps: generation of the image using an Identity Generative Adversarial Network (IGAN), and ranking of the images according to the available data using multi-criteria decision-making based on neutrosophic theory. The IGAN has the same architecture as classical Generative Adversarial Networks (GANs), but with modifications such as adding a non-linear identity block, smoothing the standard GAN loss function by using a modified loss function and label smoothing, and using mini-batch training. The model achieves efficient results in Fréchet Inception Distance (FID) and Inception Score (IS) when compared with other GAN architectures for generating images from text. The IGAN achieves 42.16 as FID and 14.96 as IS. When it comes to ranking the generated images using neutrosophic theory, the framework also performs well in the case of missing information and missing data.
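For reference, FID fits a Gaussian to the Inception features of each image set and measures the Fréchet distance between the two Gaussians. A minimal sketch follows; a real pipeline would extract features with Inception-v3 rather than the random stand-ins used here:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(features_real, features_gen):
    """Fréchet Inception Distance between two feature sets (rows = samples):
    FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
    mu_r, mu_g = features_real.mean(0), features_gen.mean(0)
    s_r = np.cov(features_real, rowvar=False)
    s_g = np.cov(features_gen, rowvar=False)
    covmean = sqrtm(s_r @ s_g)
    if np.iscomplexobj(covmean):   # numerical noise can yield tiny
        covmean = covmean.real     # imaginary parts; drop them
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2.0 * covmean))

# Toy usage with random "features" standing in for Inception activations.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(256, 64)), rng.normal(0.1, 1.0, size=(256, 64))))
```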
Keywords: GAN, deep learning, text-to-image, identity GAN
10. A survey on text-driven visual generation: advances, frameworks, and future directions
Authors: Haoting Jv. Advances in Engineering Innovation, 2026, No. 2, pp. 1-11.
After Denoising Diffusion Probabilistic Models (DDPM) outperformed Generative Adversarial Networks (GANs), diffusion models have evolved into the backbone of text-guided visual generation, with Stable Diffusion and DALL·E 2 alleviating key technical constraints. Despite remarkable advances in Text-to-Image (T2I) and Text-to-Video (T2V) tasks, critical gaps remain unaddressed. This paper conducts a systematic review of diffusion-based T2I and T2V technologies, synthesises the latest advances in related technologies, and proposes a "Technical Module-Application-Evaluation" framework to link technical breakthroughs with real-world applications. It also highlights under-researched fields and corresponding evaluation benchmarks, offering an integrated technical landscape to guide the equitable and reliable industrialisation of text-driven visual generation technologies.
Keywords: diffusion models, text-to-image generation, text-to-video generation, controllable visual generation, physics-aware generation
11. A new frontier in design studio: AI and human collaboration in conceptual design (Cited: 1)
Authors: Derya Karadağ, Betül Ozar. Frontiers of Architectural Research, 2025, No. 6, pp. 1536-1550.
This study explores the role of artificial intelligence (AI) in the conceptual design phase of interior design education, focusing on AI's potential to help students visualise and refine creative ideas. Conducted within a design studio course, the research integrates text-to-image generators, particularly Midjourney, to support students' design processes. Implemented in the fourth week of a 14-week course, a structured workshop introduced students to Midjourney, with surveys conducted both at this stage and during the final submission to capture changes in student perspectives. Using a two-phase case study involving a workshop, surveys, and interviews among senior undergraduate students in the bachelor's program of the Interior Architecture and Environmental Design Department, the study assesses the impact of AI prompts, from simple keywords to detailed narratives, on concept development and project outcomes. Findings indicate that AI broadens design possibilities, facilitates iterative ideation, and improves conceptual precision through high-fidelity visualizations. While students view AI as a valuable addition to their creative process, they also express concerns about ethics and the need to balance AI's benefits with preserving design authenticity. This research contributes to the broader discussion on AI's role in design, advocating for a balanced integration that respects both technological potential and human creativity.
Keywords: Artificial intelligence, Design process, Conceptual development, Interior design education, text-to-image generation
12. Personalized image generation with deep generative models: A decade survey
Authors: Yuxiang Wei, Yiheng Zheng, Yabo Zhang, Ming Liu, Zhilong Ji, Lei Zhang, Wangmeng Zuo. Computational Visual Media, 2025, No. 6, pp. 1141-1194.
Recent advances in generative models have significantly facilitated the development of personalized content creation. Given a small set of images containing a user-specific concept, personalized image generation allows the user to create images that incorporate that concept while adhering to provided text descriptions. The technologies used for personalization have evolved alongside the development of generative models, with their distinct and interrelated components. In this survey, we present a comprehensive review of generalized personalized image generation across various generative models, including traditional GANs, contemporary text-to-image diffusion models, and emerging multi-modal autoregressive (AR) models. We first define a unified framework that standardizes the personalization process across different generative models, encompassing three key components: inversion spaces, inversion methods, and personalization schemes. This unified framework offers a structured approach to dissecting and comparing personalization techniques across different generative architectures. Building upon our framework, we provide an in-depth analysis of personalization techniques within each generative model, highlighting their unique contributions and innovations. Through comparative analysis, we elucidate the current landscape of personalized image generation, identifying commonalities and distinguishing features of existing methods. Finally, we discuss open challenges in the field and propose potential directions for future research. We keep a bibliography of related works at https://github.com/csyxwei/Awesome-Personalized-Image-Generation.
Keywords: personalized image generation, generative models, generative adversarial networks (GANs), text-to-image diffusion models, multi-modal AutoRegressive models
13. An inter-semiotic analysis of ideational meaning in text-prompted AI-generated images
Authors: Arash Ghazvineh. Language and Semiotic Studies, 2024, No. 1, pp. 17-42.
This paper explores the inter-semiotic analysis of ideational meaning in images generated by the text-to-image AI tool, Bing Image Creator. It adopts Kress and Van Leeuwen's Grammar of Visual Design as its theoretical framework, as the original grounding of the framework in systemic functional grammar (SFG) ensures a solid theoretical basis for undertaking analyses that involve the incorporation of textual and visual components. The integration of an AI generative model within the analytical framework enables a systematic connection between language and visual representations. This incorporation offers the potential to generate well-regulated pictorial representations that are systematically grounded in controlled textual prompts. This approach introduces a novel avenue for re-examining inter-semiotic processes, leveraging the power of AI technology. The paper argues that visual representations possess unique structural devices that surpass the limitations of verbal or written communication, as they readily accommodate larger amounts of information in contrast to the limitations of the linear nature of alphabetic writing. Moreover, this paper extends its contribution by critically evaluating specific aspects of the Grammar of Visual Design.
Keywords: inter-semiotic analysis, AI text-to-image generator, systemic functional linguistics, grammar of visual designs
14. CLIP-Flow: Decoding images encoded in CLIP space
Authors: Hao Ma, Ming Li, Jingyuan Yang, Or Patashnik, Dani Lischinski, Daniel Cohen-Or, Hui Huang. Computational Visual Media (CSCD), 2024, No. 6, pp. 1157-1168.
This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from such information. Moreover, to bridge the embedding space of CLIP and the latent space of StyleGAN, real NVP is employed and modified with activation normalization and invertible convolution. As the images and text in CLIP share the same representation space, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested on the public dataset Multi-Modal CelebA-HQ for text-to-image synthesis. Experiments validated that our approach can generate high-quality text-matching images, and is comparable with state-of-the-art methods, both qualitatively and quantitatively.
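The abstract names real NVP modified with activation normalization and invertible convolution; the affine coupling layer at the core of such a flow can be sketched as follows. The dimensions and the small conditioner MLP are assumptions, and actnorm and the invertible 1×1 convolution are omitted:

```python
import torch
import torch.nn as nn

# Minimal RealNVP-style affine coupling layer, of the kind used to bridge
# a CLIP embedding space and a StyleGAN latent space. Dimensions and the
# conditioner MLP below are assumptions for illustration.
class AffineCoupling(nn.Module):
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(half, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # outputs scale and shift for the other half
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)        # keep scales bounded for stability
        y2 = x2 * torch.exp(log_s) + t   # affine transform of one half
        return torch.cat([x1, y2], dim=-1), log_s.sum(-1)  # output + log|det J|

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        log_s, t = self.net(y1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)  # exact inverse, by construction
        return torch.cat([y1, x2], dim=-1)

layer = AffineCoupling()
z = torch.randn(4, 512)                  # stand-in for a CLIP embedding
w, logdet = layer(z)
print(torch.allclose(layer.inverse(w), z, atol=1e-5))  # invertibility check
```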
Keywords: image-to-image, text-to-image, contrastive language-image pretraining (CLIP), flow, StyleGAN