Funding: Supported by the National Natural Science Foundation of China (NSFC) (Grant No. 72171095), the National Social Science Foundation of China (Grant No. 22VRC153), and the Wuhan Textile University Fund (Grant Nos. 2024289 and 2024380).
Abstract: Although generative conversational artificial intelligence (AI) can answer questions well and hold conversations like a person, the semantic ambiguity inherent in text-based communication poses challenges to effective use. Effective use reflects users' utilization of generative conversational AI to achieve their goals, which has not been previously studied. Drawing on media naturalness theory, we examined how the content naturalness and style naturalness of generative conversational AI affect effective use. A two-wave survey was conducted to collect data from 565 users of generative conversational AI. Two techniques were used in this study. First, partial least squares structural equation modeling (PLS-SEM) was applied to determine the variables that significantly affected the mechanisms (i.e., cognitive effort and communication ambiguity) and effective use. Second, an artificial neural network model was used to evaluate the relative importance of the significant predictors of the mechanisms and of effective use identified in the PLS-SEM analysis. The results revealed that the naturalness of content and the naturalness of style differed in their effects on cognitive effort and communication ambiguity. Additionally, cognitive effort and communication ambiguity negatively affected effective use. This study advances the literature on effective use by uncovering its underlying psychological mechanisms and their antecedents. In addition, this study offers insights into the design of generative conversational AI.
Abstract: Multimodal dialogue systems often fail to maintain coherent reasoning over extended conversations and suffer from hallucination due to limited context modeling capabilities. Current approaches struggle with cross-modal alignment, temporal consistency, and robust handling of noisy or incomplete inputs across multiple modalities. We propose Multi Agent-Chain of Thought (CoT), a novel multi-agent chain-of-thought reasoning framework in which specialized agents for the text, vision, and speech modalities collaboratively construct shared reasoning traces through inter-agent message passing and consensus voting mechanisms. Our architecture incorporates self-reflection modules, conflict resolution protocols, and dynamic rationale alignment to enhance consistency, factual accuracy, and user engagement. The framework employs a hierarchical attention mechanism with cross-modal fusion and implements adaptive reasoning depth based on dialogue complexity. Comprehensive evaluations on Situated Interactive Multi-Modal Conversations (SIMMC) 2.0, VisDial v1.0, and newly introduced challenging scenarios demonstrate statistically significant improvements in grounding accuracy (p<0.01), chain-of-thought interpretability, and robustness to adversarial inputs compared to state-of-the-art monolithic transformer baselines and existing multi-agent approaches.
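The abstract does not specify how the consensus voting among modality agents is carried out. As a rough illustration only, the sketch below shows one plausible form, a confidence-weighted vote. The names `AgentVote` and `consensus`, the `min_share` threshold, and the weighting scheme are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AgentVote:
    agent: str         # hypothetical agent label, e.g. "text", "vision", "speech"
    answer: str        # the candidate answer this agent proposes
    confidence: float  # self-reported confidence in [0, 1]


def consensus(votes, min_share=0.5):
    """Return the answer holding the largest confidence-weighted share.

    Returns None when there are no votes, when all confidences are zero,
    or when the leading answer does not exceed `min_share` of the total
    weight (a stand-in for "no consensus, escalate to conflict resolution").
    """
    weights = Counter()
    for vote in votes:
        weights[vote.answer] += vote.confidence
    total = sum(weights.values())
    if total == 0:
        return None
    answer, weight = weights.most_common(1)[0]
    return answer if weight / total > min_share else None


votes = [
    AgentVote("text", "red chair", 0.9),
    AgentVote("vision", "red chair", 0.7),
    AgentVote("speech", "blue chair", 0.6),
]
print(consensus(votes))
```

In this sketch a tie or a weak plurality yields `None` rather than an arbitrary winner, which is where a conflict resolution step like the one the abstract mentions would take over.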
Funding: This publication is part of the TrustBoost project, which has received funding from MICIU/AEI/10.13039/501100011033 and from FEDER, UE. It is a coordinated project by a multidisciplinary team from the Universidad Politécnica de Madrid (UPM) and the University of Granada (UGR), with two subprojects that address TrustBoost's objectives: "Enhancing Trustworthiness in Conversational AI through Multimodal Affective Awareness" (TrustBoost-UPM, ref. PID2023-150584OB-C21) and "Breaking the Duality of Conversational AI: Going beyond Guided Conversations While Ensuring Compliance with Domain Rules and Constraints" (TrustBoost-UGR, ref. PID2023-150584OB-C22).
Abstract: Building reliable intent-based, task-oriented dialog systems typically requires substantial manual effort: designers must derive intents, entities, responses, and control logic from raw conversational data, then iterate until the assistant behaves consistently. This paper investigates how far large language models (LLMs) can automate this development. We use two reference corpora, Let's Go (English, public transport) and MEDIA (French, hotel booking), to prompt four LLM families (GPT-4o, Claude, Gemini, Mistral Small) and generate the core specifications required by the Rasa platform. These include intent sets with example utterances, entity definitions with slot mappings, response templates, and basic dialog flows. To structure this process, we introduce a model- and platform-agnostic pipeline with two phases. The first normalizes and validates LLM-generated artifacts, enforcing cross-file consistency and making slot usage explicit. The second uses a lightweight dialog harness that runs scripted tests and incrementally patches failure points until conversations complete reliably. Across eight projects, all models required some targeted repairs before training. After applying our pipeline, all reached ≥70% task completion (many above 84%), while NLU performance ranged from mid-0.6 to 1.0 macro-F1 depending on domain breadth. These results show that, with modest guidance, current LLMs can produce workable end-to-end dialog prototypes directly from raw transcripts. Our main contributions are: (i) a reusable bootstrap method aligned with industry domain-specific languages (DSLs), (ii) a small set of high-impact corrective patterns, and (iii) a simple but effective harness for closed-loop refinement across conversational platforms.
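The abstract does not detail the validation phase, so the following is only a minimal sketch of what a cross-file consistency check could look like. It assumes the generated artifacts have already been parsed into plain Python collections; the function name `check_consistency` and its three inputs are hypothetical and do not correspond to any Rasa API or to the authors' pipeline.

```python
def check_consistency(domain_intents, nlu_examples, flow_intents):
    """Report intents that are inconsistent across generated artifacts.

    domain_intents: iterable of intent names declared in the domain spec
    nlu_examples:   dict mapping intent name -> list of example utterances
    flow_intents:   iterable of intent names referenced by dialog flows
    Returns a list of human-readable problem descriptions (empty if clean).
    """
    declared = set(domain_intents)
    with_examples = {name for name, examples in nlu_examples.items() if examples}
    problems = []
    for intent in sorted(with_examples - declared):
        problems.append(f"intent '{intent}' has examples but is not declared")
    for intent in sorted(declared - with_examples):
        problems.append(f"intent '{intent}' is declared but has no examples")
    for intent in sorted(set(flow_intents) - declared):
        problems.append(f"intent '{intent}' is used in a flow but is not declared")
    return problems


problems = check_consistency(
    domain_intents=["greet", "ask_schedule"],
    nlu_examples={"greet": ["hello", "hi there"], "book_room": ["I need a room"]},
    flow_intents=["greet", "ask_schedule", "goodbye"],
)
for line in problems:
    print(line)
```

A check of this kind would run before training, so that each reported problem can be patched in the generated files rather than surfacing later as a failed conversation.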
Funding: This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (Grant Number: IMSIU-RP23008).
Abstract: This paper presents an innovative approach to enhance the querying capability of ChatGPT, a conversational artificial intelligence model, by incorporating voice-based interaction and a convolutional neural network (CNN)-based impaired vision detection model. The proposed system aims to improve user experience and accessibility by allowing users to interact with ChatGPT using voice commands. Additionally, a CNN-based model is employed to detect impairments in user vision, enabling the system to adapt its responses and provide appropriate assistance. This research tackles head-on the challenges of user experience and inclusivity in artificial intelligence (AI). It underscores our commitment to overcoming these obstacles, making ChatGPT more accessible and valuable for a broader audience. The integration of voice-based interaction and impaired vision detection represents a novel approach to conversational AI. Notably, this innovation transcends novelty; it carries the potential to profoundly impact the lives of users, particularly those with visual impairments. The modular approach to system design ensures adaptability and scalability, which are critical for the practical implementation of these advancements. Crucially, the solution places the user at its core. Customizing responses for those with visual impairments demonstrates AI's potential to not only understand but also accommodate individual needs and preferences.
Abstract: Around 2010, academic circles witnessed a surge in AI research fueled by breakthroughs such as the ImageNet project, a publicly available large-scale image database. The field reached a tipping point in 2016, when Google's AlphaGo defeated Go world champion Lee Se-dol, and gained widespread public attention with the release of OpenAI's ChatGPT in November 2022. Just one year after ChatGPT's debut, Chinese AI firm DeepSeek launched its open-source general large model, a milestone in the evolution of AI technology.
Funding: Supported by the Basic Science Research Program through the NRF (National Research Foundation of Korea), by the MSIT (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW supervised by the IITP (Institute for Information & communications Technology Promotion), and by the Gachon University research fund of 2019 (Nos. NRF2019R1A2C1008412, 2015-0-00932, GCU-2019-0773).
Abstract: People occasionally interact with each other through conversation. In particular, we communicate through dialogue and exchange emotions and information through it. Emotions are essential characteristics of natural language. Conversational artificial intelligence is an integral part of all the technologies that allow computers to communicate like humans. For a computer to interact like a human being, it must understand the emotions inherent in the conversation and generate appropriate responses. However, existing dialogue systems focus only on improving the quality of natural language understanding or natural language generation, and they exclude emotions. We propose a chatbot based on emotion, which is an essential element of conversation. EP-Bot (an Empathetic PolarisX-based chatbot) is an empathetic chatbot that can better understand a person's utterance by utilizing PolarisX, an auto-growing knowledge graph. PolarisX extracts new relationship information and expands the knowledge graph automatically, which helps computers understand a person's common sense. The proposed EP-Bot extracts knowledge graph embeddings using PolarisX and detects the emotion and dialog act of the utterance. It then generates the next utterance using the embeddings. EP-Bot can understand and create a conversation that reflects the person's common sense, emotion, and intention. We verify the novelty and accuracy of EP-Bot through experiments.
Abstract: Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multimodal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater, and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released at https://github.com/htqin/GoogleBard-VisUnderstand.