To address the issues of inconsistent image quality and data scarcity in bolt defect detection for transmission lines, this paper proposes an improved sparse region-based convolutional neural network (RCNN) detection framework integrating image quality evaluation and text-to-image data augmentation. First, a HyperNetwork-based image quality assessment module is introduced to filter low-quality inspection images in terms of clarity and structural integrity, resulting in a high-quality training dataset. Second, a text-to-image diffusion model is utilized for sample augmentation. By designing text prompts that describe various bolt defect types under diverse lighting and viewing conditions, the model automatically generates realistic synthetic samples. The generated images are further filtered using a combination of quality and perceptual similarity metrics to ensure consistency with the real data distribution. Building upon the sparse RCNN baseline, a dynamic label assignment mechanism and a random decision path detection head are incorporated to enhance bounding-box matching and prediction accuracy. Experimental results demonstrate that the proposed method significantly improves detection accuracy (mAP@0.5) over the original sparse RCNN while maintaining low computational cost, enabling more efficient and intelligent inspection of transmission-line components.
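The quality-plus-similarity filtering of synthetic samples can be sketched as follows. The score names (`quality`, `lpips`) and the thresholds are illustrative assumptions; the abstract does not specify which metrics or cut-offs the paper uses:

```python
def filter_synthetic_samples(samples, quality_min=0.6, sim_max=0.35):
    """Keep synthetic samples whose quality score is high enough and whose
    perceptual distance to the real-data distribution (e.g. an LPIPS-style
    score) is low enough. Both thresholds are hypothetical examples."""
    kept = []
    for s in samples:
        if s["quality"] >= quality_min and s["lpips"] <= sim_max:
            kept.append(s)
    return kept

candidates = [
    {"id": 1, "quality": 0.82, "lpips": 0.21},  # sharp and on-distribution: kept
    {"id": 2, "quality": 0.41, "lpips": 0.18},  # too low quality: rejected
    {"id": 3, "quality": 0.77, "lpips": 0.52},  # off-distribution: rejected
]
print([s["id"] for s in filter_synthetic_samples(candidates)])  # [1]
```

Filtering on both axes at once matters: a sharp but off-distribution image can hurt the detector as much as a blurry one.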
Text-to-image (TTI) models provide huge innovation potential for many industries, while the content-security risks they introduce have also attracted wide attention. Considerable research has focused on content-security threats of large language models (LLMs), yet comprehensive studies on the content security of TTI models are notably scarce. This paper introduces a systematic tool, named EvilPromptFuzzer, designed to fuzz evil prompts in TTI models. For 15 kinds of fine-grained risks, EvilPromptFuzzer employs the strong knowledge-mining ability of LLMs to construct seed banks, in which the seeds cover various types of characters, interrelations, actions, objects, expressions, body parts, locations, surroundings, etc. Subsequently, these seeds are fed into the LLMs to build scene-diverse prompts, which can weaken the semantic sensitivity related to the fine-grained risks. Hence, the prompts can bypass the content-audit mechanism of the TTI model and ultimately help to generate images with inappropriate content. For the risks of violence, horror, disgust, animal cruelty, religious bias, political symbols, and extremism, the efficiency of EvilPromptFuzzer in generating inappropriate images with DALL·E 3 exceeds 30%; that is, more than 30 of the images generated from 100 prompts are malicious. Specifically, the efficiencies for horror, disgust, political symbols, and extremism reach 58%, 64%, 71%, and 50%, respectively. Additionally, we analyzed the vulnerability of existing popular content-audit platforms, including Amazon, Google, Azure, and Baidu. Even the most effective of these, Google's SafeSearch cloud platform, identifies only 33.85% of malicious images across three distinct categories.
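The seed-bank-and-composition idea behind EvilPromptFuzzer can be sketched minimally. The seed categories and templates below are tiny, deliberately benign stand-ins; the actual tool uses an LLM both to build much larger seed banks and to compose far richer scene descriptions:

```python
import random

# Hypothetical miniature seed banks; the paper mines these with an LLM
# across categories such as characters, actions, locations, etc.
SEED_BANK = {
    "character": ["a lone traveler", "an old fisherman"],
    "action": ["walking slowly", "resting quietly"],
    "location": ["on a foggy pier", "in a sunlit meadow"],
}

def build_prompt(rng):
    # The fuzzer's core idea: embed seeds in diverse, innocuous scene
    # descriptions so that sensitive semantics are diluted by context.
    c = rng.choice(SEED_BANK["character"])
    a = rng.choice(SEED_BANK["action"])
    loc = rng.choice(SEED_BANK["location"])
    return f"A detailed photograph of {c} {a} {loc}."

rng = random.Random(0)  # fixed seed for reproducible sampling
prompts = [build_prompt(rng) for _ in range(3)]
for p in prompts:
    print(p)
```

The combinatorics are the point: even three small banks yield 2 x 2 x 2 distinct scene contexts, and realistic banks multiply this into thousands of semantically diverse probes per risk category.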
Recently, Generative Adversarial Networks (GANs) have become the mainstream text-to-image (T2I) framework. However, a standard normal-distribution noise input cannot provide sufficient information to synthesize an image that approaches the ground-truth image distribution. Moreover, the multistage generation strategy results in complex T2I applications. Therefore, this study proposes a novel feature-grounded single-stage T2I model, which takes the "real" distribution learned from training images as one input and introduces a worst-case-optimized similarity measure into the loss function to enhance the model's generation capacity. Experimental results on two benchmark datasets demonstrate the competitive performance of the proposed model in terms of the Fréchet inception distance and inception score compared with those of classical and state-of-the-art models, showing improved similarities among the generated image, text, and ground truth.
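One plausible reading of a worst-case-optimized similarity measure (an illustrative interpretation, not the paper's exact formulation) is a min-over-batch objective: instead of averaging image-text similarity, the loss is driven by the least similar pair, so gradient pressure concentrates on the hardest example:

```python
def worst_case_similarity(scores):
    """Worst-case aggregation: the batch is only as good as its worst
    image-text similarity. `scores` are per-pair similarities in [0, 1]."""
    return min(scores)

def worst_case_loss(scores):
    # Maximizing the worst-case similarity is equivalent to
    # minimizing one minus that similarity.
    return 1.0 - worst_case_similarity(scores)

batch_scores = [0.92, 0.81, 0.40]  # hypothetical per-pair similarities
loss = worst_case_loss(batch_scores)  # dominated by the 0.40 outlier
```

Compared with a mean-based loss (which here would be about 0.29), the worst-case loss stays high until every pair in the batch is well aligned.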
Synthesizing a complex scene image with multiple objects and a background according to a text description is a challenging problem. It requires solving several difficult tasks across the fields of natural language processing and computer vision. We model it as a combination of semantic entity recognition, object retrieval and recombination, and optimization of the objects' status. To reach a satisfactory result, we propose a comprehensive pipeline to convert the input text to its visual counterpart. The pipeline includes text processing, foreground-object and background-scene retrieval, image synthesis using constrained Markov Chain Monte Carlo (MCMC), and post-processing. Firstly, we roughly divide the objects parsed from the input text into foreground objects and background scenes. Secondly, we retrieve the required foreground objects from a foreground-object dataset segmented from the Microsoft COCO dataset, and retrieve an appropriate background scene image from a background-image dataset extracted from the Internet. Thirdly, to ensure the rationality of the foreground objects' positions and sizes in the image-synthesis step, we design a cost function and use the MCMC method as the optimizer to solve this constrained layout problem. Finally, to make the image look natural and harmonious, we use Poisson-based and relighting-based methods to blend the foreground objects and the background scene image in the post-processing step. The synthesized results and comparisons based on the Microsoft COCO dataset show that our method outperforms some state-of-the-art methods based on generative adversarial networks (GANs) in the visual quality of generated scene images.
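The constrained-MCMC layout step can be illustrated with a minimal Metropolis-Hastings sketch. The overlap-and-boundary cost below is a toy stand-in for the paper's cost function, which additionally encodes position and size constraints parsed from the text:

```python
import math
import random

def layout_cost(objs, canvas=(100.0, 100.0)):
    """Toy layout cost: penalize pairwise overlap (objects modeled as
    circles (x, y, r)) and placement outside the canvas. Purely
    illustrative; the paper's cost is text-derived."""
    cost = 0.0
    for i, (x, y, r) in enumerate(objs):
        cost += max(0.0, r - x) + max(0.0, x + r - canvas[0])
        cost += max(0.0, r - y) + max(0.0, y + r - canvas[1])
        for x2, y2, r2 in objs[i + 1:]:
            d = math.hypot(x - x2, y - y2)
            cost += max(0.0, (r + r2) - d)  # circles overlap when d < r + r2
    return cost

def mcmc_layout(objs, steps=3000, temp=5.0, seed=0):
    """Metropolis-Hastings: perturb one object at a time, always accept
    improvements, occasionally accept worse states to escape local
    minima, and remember the best layout seen."""
    rng = random.Random(seed)
    cur = [list(o) for o in objs]
    cur_cost = layout_cost(cur)
    best, best_cost = [o[:] for o in cur], cur_cost
    for _ in range(steps):
        cand = [o[:] for o in cur]
        j = rng.randrange(len(cand))
        cand[j][0] += rng.gauss(0.0, 3.0)
        cand[j][1] += rng.gauss(0.0, 3.0)
        c = layout_cost(cand)
        # acceptance exponent is <= 0 here, so math.exp cannot overflow
        if c < cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = [o[:] for o in cand], c
    return best, best_cost

# two overlapping objects near a corner, one object poking past the edge
start = [(10.0, 10.0, 15.0), (15.0, 12.0, 15.0), (90.0, 90.0, 20.0)]
layout, final_cost = mcmc_layout(start)
```

Simulated annealing (decaying `temp`) would be a natural refinement, but a fixed temperature with best-so-far tracking already resolves this small constrained layout.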
Generating photo-realistic images from a text description is a challenging problem in computer vision. Previous works have shown promising performance in generating synthetic images conditioned on text with Generative Adversarial Networks (GANs). In this paper, we focus on category-consistent and relativistic diverse constraints to optimize the diversity of synthetic images. Based on those constraints, a category-consistent and relativistic diverse conditional GAN (CRD-CGAN) is proposed to synthesize K photo-realistic images simultaneously. We use an attention loss and a diversity loss to improve the sensitivity of the GAN to word attention and noise. Then, we employ a relativistic conditional loss to estimate the probability that a synthetic image is relatively real or fake, which improves on the basic conditional loss. Finally, we introduce a category-consistent loss to alleviate over-category issues among the K synthetic images. We evaluate our approach on the Caltech-UCSD Birds-200-2011, Oxford 102 Flower, and MS COCO 2014 datasets, and extensive experiments demonstrate the superiority of the proposed method over state-of-the-art methods in terms of the photorealism and diversity of the generated synthetic images.
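The relativistic loss follows the relativistic-average GAN idea: the discriminator estimates whether a real image is more realistic than the average fake, and vice versa, rather than scoring each image in isolation. A minimal unconditional sketch (the paper's version additionally conditions on the text):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relativistic_d_loss(real_logits, fake_logits):
    """Relativistic-average discriminator loss (sketch): compare each
    real logit against the mean fake logit and each fake logit against
    the mean real logit, then average the binary cross-entropy terms."""
    mean_fake = sum(fake_logits) / len(fake_logits)
    mean_real = sum(real_logits) / len(real_logits)
    loss = 0.0
    for r in real_logits:
        loss += -math.log(sigmoid(r - mean_fake))      # real should beat avg fake
    for f in fake_logits:
        loss += -math.log(1.0 - sigmoid(f - mean_real))  # fake should lose to avg real
    return loss / (len(real_logits) + len(fake_logits))

loss_separated = relativistic_d_loss([2.0, 3.0], [-2.0, -3.0])  # well separated: low
loss_collapsed = relativistic_d_loss([0.0, 0.0], [0.0, 0.0])    # indistinguishable: log 2
```

Because only the relative margin matters, the generator receives a useful gradient even when the discriminator confidently scores all fakes low in absolute terms.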
Over the past decade, large-scale pre-trained autoregressive and diffusion models have rejuvenated the field of text-guided image generation. However, these models require enormous datasets and parameter counts, and their multi-step generation processes are often inefficient and difficult to control. To address these challenges, we propose CAFE-GAN, a CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination, which incorporates a pretrained CLIP model along with several key architectural innovations. First, we embed a coordinate attention mechanism into the generator to capture long-range dependencies and enhance feature representation. Second, we introduce a trainable linear projection layer after the CLIP text encoder, which aligns textual embeddings with the generator's semantic space. Third, we design a multi-scale discriminator that leverages pre-trained visual features and integrates a feature-regularization strategy, thereby improving training stability and discrimination performance. Experiments on the CUB and COCO datasets demonstrate that CAFE-GAN outperforms existing text-to-image generation methods, achieving Fréchet Inception Distance (FID) scores of 9.84 and 5.62 on CUB and COCO, respectively, and generating images with superior visual quality and semantic fidelity, surpassing current state-of-the-art text-to-image models by varying degrees. These findings offer valuable insights for future research on efficient, controllable text-to-image synthesis.
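The trainable linear projection after the CLIP text encoder is just an affine map from the text-embedding space into the generator's conditioning space. A dimension-reduced sketch (real CLIP text embeddings are typically 512-d; the toy sizes and random initialization here are illustrative):

```python
import random

def linear_projection(weights, bias, v):
    """Affine map y = W v + b, written out over plain lists. In CAFE-GAN
    this layer is trained jointly with the generator so that frozen CLIP
    embeddings land in the generator's semantic space."""
    return [sum(w_ij * x for w_ij, x in zip(row, v)) + b_i
            for row, b_i in zip(weights, bias)]

rng = random.Random(0)
d_in, d_out = 4, 3   # toy dimensions; e.g. 512 -> conditioning size in practice
W = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
b = [0.0] * d_out
text_emb = [0.5, -0.2, 0.1, 0.9]  # stand-in for a CLIP text embedding
cond = linear_projection(W, b, text_emb)
```

Keeping CLIP frozen and training only this projection is a cheap way to adapt a general-purpose text encoder to a specific generator, at the cost of limiting how much the text representation itself can specialize.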
Automatically synthesizing images or texts has become a useful research area in artificial intelligence. Generative adversarial networks (GANs), proposed by Goodfellow, et al. in 2014, allow this task to be performed more efficiently using deep neural networks (DNNs). The authors consider generating corresponding images from a single-sentence input text description using a GAN. Specifically, the authors analyze the GAN-CLS algorithm, an advanced GAN-based method proposed by Reed, et al. in 2016. In this paper the authors identify a theoretical problem with this algorithm and correct it by modifying the objective function of the model. Experiments are performed on the Oxford-102 dataset and the CUB dataset to support the theoretical results. Since the proposed modification can be seen as an idea for improving all such GAN models, the authors try two models, GAN-CLS and AttnGAN_(GPT). In both models, the proposed modified algorithm is more stable and generates more plausible images than the original algorithm. Some of the generated images also match the input texts better, and the modified algorithm performs better on quantitative indicators including FID and Inception Score. Finally, the authors discuss future application prospects of the modification idea, especially in the area of large language models.
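The GAN-CLS discriminator objective being analyzed (Reed et al., 2016) scores three pairings: real image with matching text, real image with mismatched text, and fake image with matching text, weighting the two "false" terms equally. A minimal numeric sketch of that baseline objective (the paper's correction modifies this objective and is not reproduced here):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gan_cls_d_loss(s_real_match, s_real_mismatch, s_fake_match):
    """GAN-CLS discriminator loss (sketch): maximize the score of
    (real image, matching text) while pushing down both
    (real image, wrong text) and (fake image, matching text).
    Inputs are raw discriminator logits; the two negative terms
    are averaged, following the original formulation."""
    return -(math.log(sigmoid(s_real_match))
             + 0.5 * math.log(1.0 - sigmoid(s_real_mismatch))
             + 0.5 * math.log(1.0 - sigmoid(s_fake_match)))

loss_good = gan_cls_d_loss(3.0, -3.0, -3.0)  # well-trained D: low loss
loss_init = gan_cls_d_loss(0.0, 0.0, 0.0)    # uninformative D: 2*log 2
```

The mismatched-text term is what distinguishes GAN-CLS from a vanilla conditional GAN: the discriminator must learn text-image consistency, not just image realism.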
OpenAI and ChatGPT, as state-of-the-art language models driven by cutting-edge artificial intelligence technology, have gained widespread adoption across diverse industries. In the realm of computer vision, these models have been employed for intricate tasks including object recognition, image generation, and image processing, leveraging their advanced capabilities to fuel transformative breakthroughs. Within the gaming industry, they have found utility in crafting virtual characters and generating plots and dialogues, thereby enabling immersive and interactive player experiences. Furthermore, these models have been harnessed in medical diagnosis, providing invaluable insights and support to healthcare professionals in disease detection. The principal objective of this paper is to offer a comprehensive overview of OpenAI, OpenAI Gym, ChatGPT, DALL·E, Stable Diffusion, the pre-trained CLIP model, and other pertinent models in various domains, encompassing CLIP text-to-image generation, education, medical imaging, computer vision, social influence, natural language processing, software development, coding assistance, and chatbots, among others. Particular emphasis is placed on comparative analysis and examination of popular text-to-image and text-to-video models under diverse stimuli, shedding light on the current research landscape, emerging trends, and existing challenges within the domains of OpenAI and ChatGPT. Through a rigorous literature review, this paper aims to deliver a professional and insightful overview of the advancements, potentials, and limitations of these pioneering language models.
Text-to-image generation is a vital task in different fields, such as combating crime and terrorism and quickly arresting lawbreakers. For several years, due to a lack of deep learning and machine learning resources, police officials required artists to draw the face of a criminal. Traditional methods of identifying criminals are inefficient and time-consuming. This paper presents a new hybrid model for converting text into the nearest images and then ranking the produced images according to the available data. The framework contains two main steps: generation of the image using an Identity Generative Adversarial Network (IGAN), and ranking of the images according to the available data using multi-criteria decision-making based on neutrosophic theory. The IGAN has the same architecture as classical Generative Adversarial Networks (GANs), but with modifications such as adding a non-linear identity block, smoothing the standard GAN loss by using a modified loss function and label smoothing, and using mini-batch training. The model achieves efficient results in Fréchet Inception Distance (FID) and Inception Score (IS) compared with other GAN architectures for generating images from text, reaching an FID of 42.16 and an IS of 14.96. When ranking the generated images with the neutrosophic approach, the framework also performs well in cases of missing information and missing data.
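One-sided label smoothing, one of the IGAN's training modifications, can be sketched as follows. The smoothing amount eps=0.1 is a conventional choice for GAN training, not a value taken from the paper:

```python
def smooth_labels(labels, eps=0.1):
    """One-sided label smoothing (sketch): replace hard positive targets
    1.0 with 1 - eps so the discriminator is not pushed toward extreme
    confidence on real samples; negative targets stay at 0.0, since
    smoothing fakes upward is known to hurt GAN training."""
    return [lab * (1.0 - eps) if lab == 1.0 else lab for lab in labels]

print(smooth_labels([1.0, 0.0, 1.0]))  # [0.9, 0.0, 0.9]
```

Softer real targets keep the discriminator's gradients informative for the generator instead of saturating early in training.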
After Denoising Diffusion Probabilistic Models (DDPM) outperformed Generative Adversarial Networks (GANs), diffusion models evolved into the backbone of text-guided visual generation, with Stable Diffusion and DALL·E 2 alleviating key technical constraints. Despite remarkable advances in Text-to-Image (T2I) and Text-to-Video (T2V) tasks, critical gaps remain unaddressed. This paper conducts a systematic review of diffusion-based T2I and T2V technologies, synthesises the latest advances in related technologies, and proposes a "Technical Module-Application-Evaluation" framework to link technical breakthroughs with real-world applications. It also highlights under-researched fields and corresponding evaluation benchmarks, offering an integrated technical landscape to guide the equitable and reliable industrialisation of text-driven visual generation technologies.
This study explores the role of artificial intelligence (AI) in the conceptual design phase of interior design education, focusing on AI's potential to help students visualise and refine creative ideas. Conducted within a design studio course, the research integrates text-to-image generators, particularly Midjourney, to support students' design processes. Implemented in the fourth week of a 14-week course, a structured workshop introduced students to Midjourney, with surveys conducted both at this stage and at final submission to capture changes in student perspectives. Using a two-phase case study involving a workshop, surveys, and interviews among senior undergraduate students in the bachelor's program of the Interior Architecture and Environmental Design Department, the study assesses the impact of AI prompts, from simple keywords to detailed narratives, on concept development and project outcomes. Findings indicate that AI broadens design possibilities, facilitates iterative ideation, and improves conceptual precision through high-fidelity visualizations. While students view AI as a valuable addition to their creative process, they also express concerns about ethics and the need to balance AI's benefits with preserving design authenticity. This research contributes to the broader discussion of AI's role in design, advocating for a balanced integration that respects both technological potential and human creativity.
Recent advances in generative models have significantly facilitated the development of personalized content creation. Given a small set of images containing a user-specific concept, personalized image generation allows the user to create images that incorporate that concept while adhering to provided text descriptions. The technologies used for personalization have evolved alongside the development of generative models, with distinct and interrelated components. In this survey, we present a comprehensive review of generalized personalized image generation across various generative models, including traditional GANs, contemporary text-to-image diffusion models, and emerging multi-modal autoregressive (AR) models. We first define a unified framework that standardizes the personalization process across different generative models, encompassing three key components: inversion spaces, inversion methods, and personalization schemes. This unified framework offers a structured approach to dissecting and comparing personalization techniques across different generative architectures. Building upon our framework, we provide an in-depth analysis of personalization techniques within each generative model, highlighting their unique contributions and innovations. Through comparative analysis, we elucidate the current landscape of personalized image generation, identifying commonalities and distinguishing features of existing methods. Finally, we discuss open challenges in the field and propose potential directions for future research. We keep a bibliography of related works at https://github.com/csyxwei/Awesome-Personalized-Image-Generation.
This paper explores the inter-semiotic analysis of ideational meaning in images generated by the text-to-image AI tool Bing Image Creator. It adopts Kress and Van Leeuwen's Grammar of Visual Design as its theoretical framework, as the framework's original grounding in systemic functional grammar (SFG) ensures a solid theoretical basis for analyses that incorporate textual and visual components. The integration of an AI generative model within the analytical framework enables a systematic connection between language and visual representations. This incorporation offers the potential to generate well-regulated pictorial representations that are systematically grounded in controlled textual prompts, introducing a novel avenue for re-examining inter-semiotic processes by leveraging AI technology. The paper argues that visual representations possess unique structural devices that surpass the limitations of verbal or written communication, as they readily accommodate larger amounts of information in contrast to the linear nature of alphabetic writing. Moreover, the paper extends its contribution by critically evaluating specific aspects of the Grammar of Visual Design.
This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from such information. Moreover, to bridge the embedding space of CLIP and the latent space of StyleGAN, Real NVP is employed and modified with activation normalization and invertible convolution. As images and text in CLIP share the same representation space, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested on the public Multi-Modal CelebA-HQ dataset for text-to-image synthesis. Experiments validated that our approach can generate high-quality text-matching images and is comparable with state-of-the-art methods, both qualitatively and quantitatively.
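The Real NVP bridge relies on affine coupling layers, which are invertible by construction: one half of the vector passes through unchanged and conditions a shift-and-scale of the other half. A minimal sketch, with fixed toy shift/log-scale functions standing in for the learned networks:

```python
import math

def coupling_forward(x, t, s):
    """One affine coupling step (Real NVP sketch): keep the first half x1,
    and transform the second half as y2 = x2 * exp(s(x1)) + t(x1)."""
    h = len(x) // 2
    x1, x2 = x[:h], x[h:]
    y2 = [a * math.exp(s(x1)) + t(x1) for a in x2]
    return x1 + y2

def coupling_inverse(y, t, s):
    """Exact inverse: y1 equals x1, so the same t and s values can be
    recomputed and undone, x2 = (y2 - t(y1)) * exp(-s(y1))."""
    h = len(y) // 2
    y1, y2 = y[:h], y[h:]
    x2 = [(b - t(y1)) * math.exp(-s(y1)) for b in y2]
    return y1 + x2

# toy conditioning functions standing in for the learned shift/scale nets
t = lambda v: sum(v)
s = lambda v: 0.5 * sum(v)

x = [0.3, -1.2, 0.7, 2.0]
y = coupling_forward(x, t, s)
x_back = coupling_inverse(y, t, s)
```

Stacking such layers (alternating which half is transformed, plus the activation normalization and invertible convolutions the paper adds) yields an expressive yet exactly invertible map between CLIP's embedding space and StyleGAN's latent space.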
Funding (bolt defect detection study): Supported by the Science and Technology Project from State Grid Corporation of China (No. 5700-202490330A-2-1-ZX).
Funding (EvilPromptFuzzer study): Supported by the Youth Fund Project of the National Natural Science Foundation of China (62202064).
Funding (feature-grounded single-stage T2I study): Supported by the National Natural Science Foundation of China (No. 61872187).
基金supported by the Key Technological Innovation Projects of Hubei Province of China under Grant No.2018AAA062the Wuhan Science and Technology Plan Project of Hubei Province of China under Grant No.2017010201010109,the National Key Research and Development Program of China under Grant No.2017YFB1002600the National Natural Science Foundation of China under Grant Nos.61672390 and 61972298.
Funding (CRD-CGAN study): Supported by the National Natural Science Foundation of China (Grant Nos. 61972298 and 61962019), the National Cultural and Tourism Science and Technology Innovation Project (2021064), and the Training Program of High Level Scientific Research Achievements of Hubei Minzu University under Grant PY22011.
Funding (GAN-CLS study): Supported by the National Natural Science Foundation of China under Grant No. 12288201.
Funding: The National Natural Science Foundation of China (No. 62001197).
Abstract: OpenAI and ChatGPT, as state-of-the-art language models driven by cutting-edge artificial intelligence technology, have gained widespread adoption across diverse industries. In computer vision, these models have been employed for intricate tasks including object recognition, image generation, and image processing, leveraging their advanced capabilities to fuel transformative breakthroughs. Within the gaming industry, they have found utility in crafting virtual characters and generating plots and dialogues, thereby enabling immersive and interactive player experiences. Furthermore, these models have been harnessed for medical diagnosis, providing invaluable insights and support to healthcare professionals in disease detection. The principal objective of this paper is to offer a comprehensive overview of OpenAI, OpenAI Gym, ChatGPT, DALL·E, Stable Diffusion, the pre-trained CLIP model, and other pertinent models across various domains, encompassing CLIP-based text-to-image generation, education, medical imaging, computer vision, social influence, natural language processing, software development, coding assistance, and chatbots, among others. Particular emphasis is placed on comparative analysis and examination of popular text-to-image and text-to-video models under diverse stimuli, shedding light on the current research landscape, emerging trends, and existing challenges within the domains of OpenAI and ChatGPT. Through a rigorous literature review, this paper aims to deliver a professional and insightful overview of the advancements, potentials, and limitations of these pioneering language models.
Abstract: Text-to-image generation is a vital task in fields such as combating crime and terrorism and quickly apprehending lawbreakers. For many years, owing to a lack of deep learning and machine learning resources, police officials required artists to draw the face of a criminal; such traditional methods of identifying criminals are inefficient and time-consuming. This paper presents a new hybrid model for converting text into the nearest matching images and then ranking the produced images according to the available data. The framework contains two main steps: generating the image using an Identity Generative Adversarial Network (IGAN), and ranking the images according to the available data using multi-criteria decision-making based on neutrosophic theory. The IGAN has the same architecture as classical Generative Adversarial Networks (GANs) but with several modifications, such as adding a non-linear identity block, smoothing the standard GAN loss function by using a modified loss function and label smoothing, and using mini-batch training. The model achieves competitive results in Fréchet Inception Distance (FID) and Inception Score (IS) compared with other GAN architectures for generating images from text: the IGAN achieves an FID of 42.16 and an IS of 14.96. When ranking the generated images with the neutrosophic approach, the framework also performs well in the presence of missing information and missing data.
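Label smoothing, one of the IGAN training tweaks mentioned above, is a small change to the discriminator's binary cross-entropy: real labels of 1.0 are softened (e.g. to 0.9) so the discriminator is penalized for overconfident real-image scores. The sketch below is a generic one-sided label-smoothing loss in numpy; the smoothing value 0.1 is an assumption for illustration, not a figure from the paper.

```python
import numpy as np

def smoothed_bce(pred, target, smoothing=0.1):
    """Binary cross-entropy with one-sided label smoothing: real
    labels of 1.0 are softened to 1 - smoothing (here 0.9), which
    discourages the discriminator from growing overconfident on
    real samples; fake labels of 0.0 are left unchanged."""
    pred = np.clip(pred, 1e-12, 1.0 - 1e-12)
    soft = target * (1.0 - smoothing)  # 1.0 -> 0.9, 0.0 stays 0.0
    return float(-(soft * np.log(pred)
                   + (1.0 - soft) * np.log(1.0 - pred)).mean())

# With a softened target of 0.9, an overconfident prediction on a
# real sample now incurs a higher loss than a moderate one.
confident = smoothed_bce(np.array([0.999]), np.array([1.0]))
moderate = smoothed_bce(np.array([0.9]), np.array([1.0]))
print(moderate < confident)  # True
```

The loss is minimized at a prediction of 0.9 rather than 1.0, which in GAN training tends to keep discriminator gradients informative for the generator longer.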
Abstract: After Denoising Diffusion Probabilistic Models (DDPMs) outperformed Generative Adversarial Networks (GANs), diffusion models evolved into the backbone of text-guided visual generation, with Stable Diffusion and DALL·E 2 alleviating key technical constraints. Despite remarkable advances in Text-to-Image (T2I) and Text-to-Video (T2V) tasks, critical gaps remain unaddressed. This paper conducts a systematic review of diffusion-based T2I and T2V technologies, synthesises the latest advances in related techniques, and proposes a "Technical Module-Application-Evaluation" framework to link technical breakthroughs with real-world applications. It also highlights under-researched fields and corresponding evaluation benchmarks, offering an integrated technical landscape to guide the equitable and reliable industrialisation of text-driven visual generation technologies.
Abstract: This study explores the role of artificial intelligence (AI) in the conceptual design phase of interior design education, focusing on AI's potential to help students visualise and refine creative ideas. Conducted within a design studio course, the research integrates text-to-image generators, particularly Midjourney, to support students' design processes. Implemented in the fourth week of a 14-week course, a structured workshop introduced students to Midjourney, with surveys conducted both at this stage and at final submission to capture changes in student perspectives. Using a two-phase case study involving a workshop, surveys, and interviews among senior undergraduate students in the bachelor's program of the Interior Architecture and Environmental Design Department, the study assesses the impact of AI prompts, from simple keywords to detailed narratives, on concept development and project outcomes. Findings indicate that AI broadens design possibilities, facilitates iterative ideation, and improves conceptual precision through high-fidelity visualisations. While students view AI as a valuable addition to their creative process, they also express concerns about ethics and the need to balance AI's benefits with preserving design authenticity. This research contributes to the broader discussion on AI's role in design, advocating for a balanced integration that respects both technological potential and human creativity.
Funding: Supported by the National Key R&D Program of China (2022YFA1004100).
Abstract: Recent advances in generative models have significantly facilitated the development of personalized content creation. Given a small set of images containing a user-specific concept, personalized image generation allows the user to create images that incorporate that concept while adhering to provided text descriptions. The technologies used for personalization have evolved alongside the development of generative models, each with distinct yet interrelated components. In this survey, we present a comprehensive review of generalized personalized image generation across various generative models, including traditional GANs, contemporary text-to-image diffusion models, and emerging multi-modal autoregressive (AR) models. We first define a unified framework that standardizes the personalization process across different generative models, encompassing three key components: inversion spaces, inversion methods, and personalization schemes. This unified framework offers a structured approach to dissecting and comparing personalization techniques across different generative architectures. Building upon our framework, we provide an in-depth analysis of personalization techniques within each generative model, highlighting their unique contributions and innovations. Through comparative analysis, we elucidate the current landscape of personalized image generation, identifying commonalities and distinguishing features of existing methods. Finally, we discuss open challenges in the field and propose potential directions for future research. A bibliography of related works is maintained at https://github.com/csyxwei/Awesome-Personalized-Image-Generation.
Abstract: This paper explores the inter-semiotic analysis of the ideational meaning in images generated by the text-to-image AI tool Bing Image Creator. It adopts Kress and Van Leeuwen's Grammar of Visual Design as its theoretical framework, as the framework's original grounding in systemic functional grammar (SFG) ensures a solid theoretical basis for analyses that incorporate textual and visual components. Integrating an AI generative model into the analytical framework enables a systematic connection between language and visual representations, offering the potential to generate well-regulated pictorial representations that are systematically grounded in controlled textual prompts. This approach introduces a novel avenue for re-examining inter-semiotic processes, leveraging the power of AI technology. The paper argues that visual representations possess unique structural devices that surpass the limitations of verbal or written communication, as they readily accommodate larger amounts of information in contrast to the linear nature of alphabetic writing. Moreover, the paper extends its contribution by critically evaluating specific aspects of the Grammar of Visual Design.
Funding: Supported in part by the National Natural Science Foundation of China (62161146005, U21B2023), the Shenzhen Science and Technology Program (KQTD20210811090044003, RCJC20200714114435012), and the Israel Science Foundation.
Abstract: This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopt Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from that information. Moreover, to bridge the embedding space of CLIP and the latent space of StyleGAN, Real NVP is employed and modified with activation normalization and invertible convolution. As images and text share the same representation space in CLIP, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested text-to-image synthesis on the public Multi-Modal CelebA-HQ dataset. Experiments validate that our approach can generate high-quality text-matching images and is comparable with state-of-the-art methods, both qualitatively and quantitatively.
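The reason Real NVP can serve as the bridge between the CLIP embedding space and the StyleGAN latent space is that its affine coupling layers are invertible by construction: half the dimensions pass through unchanged and parameterize an affine transform of the other half, which can be undone exactly. The numpy sketch below shows a single generic coupling layer with toy linear "networks" for scale and translation; the dimension and these stand-in networks are illustrative assumptions, not the paper's architecture (which additionally modifies the flow with activation normalization and invertible convolution).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension (illustrative, not from the paper)

# Stand-in scale (log-scale) and translation networks: fixed random
# linear maps, where a real flow would use small neural networks.
Ws = rng.normal(scale=0.1, size=(D // 2, D // 2))
Wt = rng.normal(scale=0.1, size=(D // 2, D // 2))
s = lambda h: np.tanh(h @ Ws)  # log-scale, bounded for stability
t = lambda h: h @ Wt           # translation

def coupling_forward(x):
    """Affine coupling layer (Real NVP): the first half of the
    dimensions is copied through and conditions an invertible
    affine map applied to the second half."""
    x1, x2 = x[:D // 2], x[D // 2:]
    y2 = x2 * np.exp(s(x1)) + t(x1)
    return np.concatenate([x1, y2])

def coupling_inverse(y):
    """Exact inverse: undo the affine map using the untouched half."""
    y1, y2 = y[:D // 2], y[D // 2:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2])

x = rng.normal(size=D)
assert np.allclose(coupling_inverse(coupling_forward(x)), x)
```

Stacking such layers (with the roles of the two halves alternating) yields a bijection whose forward pass can map CLIP embeddings toward a latent code and whose inverse maps latent codes back, which is what lets a single trained flow connect the two spaces in both directions.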