Funding: Supported by the National Natural Science Foundation of China under Grant No. 62176115.
Abstract: Large visual language models (LVLMs) have revolutionized the multimodal domain, demonstrating exceptional performance in tasks that require fusing visual and textual information. However, current evaluation benchmarks fail to adequately assess the knowledge alignment between images and text, focusing primarily on answer accuracy rather than the reasoning processes behind the answers. To address this gap and deepen the understanding of LVLMs' capabilities, we introduce KnowBench, a novel benchmark designed to assess the alignment of knowledge between images and text for LVLMs. KnowBench comprises 1081 image-question pairs, each with four options and four pieces of corresponding knowledge, spanning 11 major categories. We evaluate mainstream LVLMs on KnowBench, including proprietary models such as Gemini, Claude, and GPT, and open-source models such as LLaVA, Qwen-VL, and InternVL. Our experiments reveal a notable discrepancy between the models' ability to select correct answers and their ability to select the corresponding knowledge, regardless of whether the models are open-source or proprietary. This indicates that a significant gap remains in current LVLMs' knowledge alignment between images and text. Furthermore, our analysis shows that model performance on KnowBench improves with larger parameter counts and successive version iterations, indicating that scaling laws have a significant impact on multimodal knowledge alignment and that model iteration by researchers also has a positive effect. We anticipate that KnowBench will foster the development of LVLMs and motivate researchers to build more reliable models. Our dataset is publicly available at https://doi.org/10.57760/sciencedb.29672.
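As a concrete illustration of the dual scoring this abstract describes, the sketch below shows one plausible way to represent a KnowBench-style item and to report answer accuracy and knowledge-selection accuracy separately, so the alignment gap between the two becomes visible. All field names and the prediction format are hypothetical assumptions; the released dataset defines the actual schema.

```python
# Hypothetical sketch of a KnowBench-style record and dual-accuracy scoring.
# Field names and the (answer_idx, knowledge_idx) prediction format are
# assumptions for illustration, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class KnowBenchItem:
    image_path: str        # image of the image-question pair
    question: str          # question about the image
    options: list[str]     # four answer options
    knowledge: list[str]   # four candidate knowledge statements
    answer_idx: int        # index of the correct option
    knowledge_idx: int     # index of the knowledge supporting the answer
    category: str          # one of the 11 major categories

def score(items, predictions):
    """Report answer accuracy and knowledge accuracy separately; the gap
    between the two reflects the image-text knowledge alignment discrepancy.
    predictions: list of (answer_idx, knowledge_idx) tuples from the model."""
    n = len(items)
    ans_hits = sum(a == it.answer_idx for it, (a, _) in zip(items, predictions))
    kno_hits = sum(k == it.knowledge_idx for it, (_, k) in zip(items, predictions))
    return {"answer_acc": ans_hits / n, "knowledge_acc": kno_hits / n}
```

Reporting the two accuracies side by side, rather than a single combined score, is what exposes the discrepancy the abstract highlights: a model can score well on answers while still selecting the wrong supporting knowledge.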
Funding: Supported by the National Natural Science Foundation of China (Project Nos. 42271331 and 42022060) and The Hong Kong Polytechnic University (Project Nos. 4-ZZND and Q-CDBP).
Abstract: Spatiotemporal data fusion technologies have been widely used for land surface phenology (LSP) monitoring because they provide a low-cost way to obtain fine-resolution satellite time series. However, the reliability of fused images is strongly affected by land surface heterogeneity and by the input data, and it remains unclear whether data fusion can truly benefit LSP studies at fine scales. To explore this question, this study designed a sophisticated simulation experiment to quantify the effectiveness of two representative data fusion algorithms, the pair-based Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM) and the time series-based Spatiotemporal fusion method to Simultaneously generate Full-length normalized difference vegetation Index Time series (SSFIT), which fuse Landsat and Moderate Resolution Imaging Spectroradiometer (MODIS) data, in extracting pixel-wise spring phenology (i.e., the start of the growing season, SOS) together with its spatial gradient and temporal variation. Our results reveal that: (a) compared with using Landsat images alone, STARFM can improve the accuracy of pixel-wise SOS by up to 74.47% and of temporal variation by up to 59.13%, but it can hardly improve the retrieval of spatial gradient, whereas SSFIT can improve the accuracy of pixel-wise SOS, spatial gradient, and temporal variation by up to 139.20%, 26.36%, and 162.30%, respectively; (b) the accuracy improvement introduced by the fusion algorithms decreases with the number of available Landsat images per year and varies widely even for the same number of available images; and (c) this large variation is closely related to the temporal distribution of the available Landsat images, suggesting that fusion algorithms improve SOS accuracy only when cloud-free Landsat images fail to capture the key vegetation growth period. This study calls for caution in the use of data fusion for LSP studies at fine scales.
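For readers unfamiliar with SOS retrieval, the sketch below illustrates a common amplitude-threshold approach to extracting SOS from an annual NDVI time series. The abstract does not state which retrieval method the study uses, so the 50%-of-amplitude rule and all names here are illustrative assumptions rather than the paper's method.

```python
# Minimal sketch of threshold-based SOS extraction from an NDVI time series,
# assuming the common 50%-of-amplitude rule; the paper's actual retrieval
# method is not specified in the abstract, so this is illustrative only.
import numpy as np

def extract_sos(doy, ndvi, threshold=0.5):
    """Return the day of year at which NDVI first rises through `threshold`
    of its seasonal amplitude during spring green-up (NaN if never)."""
    doy = np.asarray(doy, dtype=float)
    ndvi = np.asarray(ndvi, dtype=float)
    level = ndvi.min() + threshold * (ndvi.max() - ndvi.min())
    peak = int(np.argmax(ndvi))                  # restrict to the green-up limb
    rising = np.flatnonzero(ndvi[:peak + 1] >= level)
    if rising.size == 0:
        return np.nan                            # threshold never reached
    i = rising[0]
    if i == 0:
        return doy[0]
    # linear interpolation between the two observations bracketing the crossing
    frac = (level - ndvi[i - 1]) / (ndvi[i] - ndvi[i - 1])
    return doy[i - 1] + frac * (doy[i] - doy[i - 1])

# Example: a synthetic green-up curve sampled every 16 days (MODIS-like cadence)
doy = np.arange(1, 366, 16)
ndvi = 0.2 + 0.5 / (1 + np.exp(-(doy - 130) / 12))
print(extract_sos(doy, ndvi))                    # ~ day 130
```

Note that the crossing is located by interpolating between the two observations that bracket the threshold, so the result degrades when clear observations are sparse or poorly timed around green-up, which is consistent with the abstract's finding that fusion helps mainly when cloud-free Landsat images miss the key vegetation growth period.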