Funding: Project supported by the National Natural Science Foundation of China (No. 62372468), the Shandong Natural Science Foundation (No. ZR2023MF008), the Major Basic Research Projects in Shandong Province (No. ZR2023ZD32), and the Qingdao Natural Science Foundation (No. 23-2-1-161-zyyd-jch).
Abstract: Large language models (LLMs) exhibit remarkable capabilities in various natural language processing tasks, such as machine translation. However, the large number of LLM parameters incurs significant costs during inference. Previous studies have attempted to train translation-tailored LLMs with moderately sized models by fine-tuning them on translation data. Nevertheless, when performing translations in zero-shot directions that are absent from the fine-tuning data, the problem of ignoring instructions and thus producing translations in the wrong language (i.e., the off-target translation issue) remains unresolved. In this work, we design a two-stage fine-tuning algorithm to improve the instruction-following ability of translation-tailored LLMs, particularly for maintaining accurate translation directions. We first fine-tune LLMs on translation data to elicit basic translation capabilities. At the second stage, we construct instruction-conflicting samples by randomly replacing the instructions with incorrect ones. Then, we introduce an extra unlikelihood loss to reduce the probability assigned to those samples. Experiments on two benchmarks using the LLaMA 2 and LLaMA 3 models, spanning 16 zero-shot directions, demonstrate that, compared to the competitive baseline of translation-fine-tuned LLaMA, our method effectively reduces the off-target translation ratio (by up to 62.4 percentage points), thus improving translation quality (by up to +9.7 BLEU, bilingual evaluation understudy). Analysis shows that our method preserves the model's performance on other tasks, such as supervised translation and general tasks. Code is released at https://github.com/alphadl/LanguageAware_Tuning.
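The unlikelihood term described in this abstract can be sketched in plain Python. This is an illustrative sketch only, not the authors' released implementation: the token probabilities are assumed inputs, and the weighting hyperparameter `alpha` is hypothetical.

```python
import math

def nll_loss(probs):
    """Standard likelihood term: average -log p over target-token probabilities."""
    return -sum(math.log(p) for p in probs) / len(probs)

def unlikelihood_loss(probs):
    """Unlikelihood term: average -log(1 - p) over probabilities assigned to
    tokens the model should NOT produce (here, instruction-conflicting samples)."""
    return -sum(math.log(1.0 - p) for p in probs) / len(probs)

def second_stage_loss(correct_probs, conflicting_probs, alpha=1.0):
    """Second-stage objective: maintain likelihood on correct translation pairs
    while pushing probability mass away from instruction-conflicting samples."""
    return nll_loss(correct_probs) + alpha * unlikelihood_loss(conflicting_probs)
```

As the conflicting-sample probabilities approach zero, the unlikelihood term vanishes, so a model that already ignores wrong-direction instructions pays no extra penalty.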
Funding: Key Scientific Research Project of the Hunan Provincial Department of Education (23A312).
Abstract: Objective To investigate methods for constructing a high-quality instruction dataset for traditional Chinese medicine (TCM) mental disorders and to validate its efficacy. Methods We proposed the Fine-Med-Mental-T&P methodology for constructing high-quality instruction datasets in TCM mental disorders. This approach integrates theoretical knowledge and practical case studies through a dual-track strategy. (i) Theoretical track: textbooks and guidelines on TCM mental disorders were manually segmented. Initial responses were generated using DeepSeek-V3, followed by refinement by the Qwen3-32B model to align the expression with human preferences. A screening algorithm was then applied to select 16,000 high-quality instruction pairs. (ii) Practical track: starting from over 600 real clinical case seeds, diagnostic and therapeutic instruction pairs were generated using DeepSeek-V3 and subsequently screened through manual evaluation, resulting in 4,000 high-quality practice-oriented instruction pairs. The integration of both tracks yielded the Med-Mental-Instruct-T&P dataset, comprising a total of 20,000 instruction pairs. To validate the dataset's effectiveness, three experimental evaluations (both manual and automated) were conducted: (i) comparative studies to compare the performance of models fine-tuned on different datasets; (ii) benchmarking against mainstream TCM-specific large language models (LLMs); (iii) a data ablation study to investigate the relationship between data volume and model performance. Results Experimental results demonstrate the superior performance of the T&P-model fine-tuned on the Med-Mental-Instruct-T&P dataset. In the comparative study, the T&P-model significantly outperformed the baseline models trained solely on self-generated or purely human-curated baseline data. This superiority was evident in both automated metrics (ROUGE-L > 0.55) and expert manual evaluations (scoring above 7/10 on accuracy). In benchmark comparisons, the T&P-model also excelled against existing mainstream TCM LLMs (e.g., HuatuoGPT and ZuoyiGPT). It showed particularly strong capabilities in handling diverse clinical presentations, including challenging disorders such as insomnia and coma, showcasing its robustness and versatility. Data ablation studies showed that T&P-model performance had an overall upward trend with minor fluctuations as training data increased from 10% to 50%; beyond 50%, performance improvement slowed significantly, with metrics plateauing and approaching a saturation point.
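The abstract does not specify the screening algorithm used to select the 16,000 pairs, so the following is only a hypothetical sketch of the kind of filter such a pipeline might apply: length bounds on responses plus crude near-duplicate removal on normalized instructions. Function and threshold names are illustrative assumptions.

```python
def screen_pairs(pairs, min_len=20, max_len=2000):
    """Illustrative instruction-pair screening: drop responses outside length
    bounds and drop pairs whose normalized instruction was already seen."""
    seen = set()
    kept = []
    for instruction, response in pairs:
        # Normalize whitespace/case and truncate to form a crude dedup key.
        key = " ".join(instruction.lower().split())[:80]
        if not (min_len <= len(response) <= max_len):
            continue  # response too short or too long
        if key in seen:
            continue  # near-duplicate instruction
        seen.add(key)
        kept.append((instruction, response))
    return kept
```

A real pipeline would likely add model-based quality scoring on top of such cheap heuristics; the sketch only shows the filtering skeleton.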
Funding: Fundamental Research Funds for the Central Universities, China (No. 2232021A-10); Shanghai Pujiang Program, China (No. 22PJ1423400).
Abstract: Visual entailment (VE) is a prototypical task in multimodal visual reasoning, where current methods frequently utilize large language models (LLMs) as the knowledge base to assist in answering questions. These methods rely heavily on the textual modality, which inherently cannot capture the full extent of the information contained within images. We propose a context-aware visual entailment (CAVE) model, which introduces a novel aggregation module designed to extract high-level semantic features from images. This module integrates lower-level semantic image features into high-level visual tokens, formatting them similarly to text tokens so that they can serve as inputs for LLMs. The CAVE model compensates for the loss of image information and integrates it more effectively with textual comprehension. Additionally, the CAVE model incorporates a new input format and training methodology rooted in instruction tuning and in-context learning techniques. The objective of this research is to maximize the inherent logical reasoning capabilities of LLMs. Experimental results on the e-SNLI-VE dataset show that the proposed CAVE model exhibits outstanding performance.
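The idea of collapsing many low-level patch features into a few LLM-ready visual tokens can be illustrated with a minimal sketch. This is not the CAVE module itself (the abstract gives no architectural details); it is a hypothetical mean-pooling aggregator showing only the shape of the operation.

```python
def aggregate_patches(patch_features, num_tokens=4):
    """Illustrative aggregation: average consecutive groups of patch vectors
    into a small number of high-level visual tokens, each shaped like a text
    token embedding so it can be interleaved with text inputs to an LLM."""
    group = max(1, len(patch_features) // num_tokens)
    tokens = []
    for i in range(0, len(patch_features), group):
        chunk = patch_features[i:i + group]
        dim = len(chunk[0])
        # Element-wise mean over the patch vectors in this group.
        tokens.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return tokens[:num_tokens]
```

A learned module would replace the fixed mean with attention or a projection, but the input/output contract (many patches in, few token-like vectors out) is the same.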
Abstract: The problem of fake news detection (FND) is becoming increasingly important in the field of natural language processing (NLP) because of the rapid dissemination of misleading information on the web. Large language models (LLMs) such as GPT-4.0 excel in natural language understanding tasks but can still struggle to distinguish between fact and fiction, particularly when applied in the wild. A key challenge of existing FND methods is that they consider only unimodal data (e.g., images), while richer multimodal data (e.g., user behaviour, temporal dynamics) are neglected, even though the latter are crucial for full-context understanding. To overcome these limitations, we introduce M3-FND (Multimodal Misinformation Mitigation for False News Detection), a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments. Our method proposes a hybrid system that combines image-text alignment, user credibility profiling, and temporal pattern recognition, strengthened through a natural feedback loop that provides real-time feedback for correcting downstream errors. We use contextual reinforcement learning to schedule prompt updating and update the classifier threshold based on the latest multimodal input, which enables the model to better adapt to changing misinformation attack strategies. M3-FND is tested on three diverse datasets, FakeNewsNet, Twitter15, and Weibo, which contain both textual and visual social media content. Experiments show that M3-FND significantly outperforms conventional and LLM-based baselines in terms of accuracy, F1-score, and AUC on all benchmarks. Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
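The feedback-driven threshold update mentioned in the abstract can be pictured with a toy sketch. The abstract does not describe the actual update rule, so the additive step, the error encoding, and the learning rate `lr` below are all illustrative assumptions.

```python
def update_threshold(threshold, feedback_errors, lr=0.05):
    """Illustrative adaptive thresholding: nudge the decision threshold up
    when recent feedback reports false positives (over-flagging) and down
    on false negatives (missed fake news)."""
    for err in feedback_errors:  # +1 = false positive, -1 = false negative
        threshold += lr * err
    # Keep the threshold a valid probability cutoff.
    return min(max(threshold, 0.0), 1.0)
```

A reinforcement-learning scheduler, as the abstract suggests, would choose when and how much to update rather than applying a fixed step, but the clamped-adjustment pattern is the same.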
Funding: Supported by the National Key R&D Program of China [2022YFF0902703] and the State Administration for Market Regulation Science and Technology Plan Project (2024MK033).
Abstract: Recommendation systems are key to boosting user engagement, satisfaction, and retention, particularly on media platforms where personalized content is vital. Sequential recommendation systems learn from user-item interactions to predict future items of interest. However, many current methods rely on unique user and item IDs, limiting their ability to represent users and items effectively, especially in zero-shot learning scenarios where training data is scarce. With the rapid development of Large Language Models (LLMs), researchers are exploring their potential to enhance recommendation systems. However, there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems, where items are typically indexed by IDs. Moreover, most research focuses on item representations, neglecting personalized user modeling. To address these issues, we propose a sequential recommendation framework using LLMs, called CIT-Rec, a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations. Specifically, by aligning intuitive image information with text containing semantic features, we can represent items more accurately, improving item representation quality. We focus not only on item representations but also on user representations. To capture users' personalized preferences more precisely, we use traditional sequential recommendation models trained on users' historical interaction data, effectively capturing behavioral patterns. Finally, by combining LLMs and traditional sequential recommendation models, we allow the LLM to understand linguistic semantics while capturing collaborative semantics. Extensive evaluations on real-world datasets show that our model outperforms baseline methods, effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.
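Aligning image and text representations of an item, as this abstract describes, typically boils down to measuring modality agreement and fusing the two embeddings. The sketch below is a hypothetical illustration (cosine agreement plus a weighted blend), not the CIT-Rec architecture; `weight` is an assumed fusion hyperparameter.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align_item(image_emb, text_emb, weight=0.5):
    """Illustrative item representation: report how well the two modalities
    agree (cosine), then fuse them into one vector by weighted averaging."""
    sim = cosine(image_emb, text_emb)
    fused = [weight * a + (1 - weight) * b for a, b in zip(image_emb, text_emb)]
    return sim, fused
```

In practice, the alignment is learned (e.g., with a contrastive objective that raises `sim` for matching image-text pairs), and the fused vector is what stands in for the item ID when prompting the LLM.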
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62177033) and sponsored by the Huawei Innovation Research Program.
Abstract: 1 Introduction Large Language Models (LLMs) possess massive parameters and are trained on vast datasets, demonstrating exceptional proficiency in various tasks. The remarkable advancements in LLMs have also inspired the exploration of leveraging LLMs as recommenders (LLMRec), whose effectiveness stems from the extensive open-world knowledge and reasoning ability of LLMs [1]. LLMRec obtains its recommendation ability through instruction tuning on user interaction data. But in many cases, it is also crucial for LLMRec to forget specific user data, which is referred to as recommendation unlearning [2], as shown in Fig. 1.
Funding: Supported by the National Natural Science Foundation of China (Nos. 62302167, U23A20343, and 72192821).
Abstract: In the past years, multimodal large language models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering and visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, this survey summarizes the timeline of representative efficient MLLMs, the current state of research in structures and strategies, and their applications. Finally, the limitations of current efficient MLLM research and promising future directions are discussed.
Funding: Supported by the National Key R&D Program of China (Nos. 2022ZD0160102 and 2022ZD0161300) and the National Natural Science Foundation of China (Nos. 62376134 and 62372223).
Abstract: Multi-modal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a wide range of domains. However, the large model scale and associated high computational cost pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1 billion to 4 billion, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we are developing a unified adaptation framework for Mini-InternVL, which enables our models to transfer to and outperform specialized models in downstream tasks, including autonomous driving, medical image processing, and remote sensing. We believe that our models can provide valuable insights and resources to advance the development of efficient and effective MLLMs.