Funding: Funded by the National Natural Science Foundation of China (No. 62076032).
Abstract: Referring expression comprehension (REC) aims to locate a specific region in an image described by a natural language expression. Existing two-stage methods generate multiple candidate proposals in the first stage, then select one of these proposals as the grounding result in the second stage. However, the number of candidate proposals generated in the first stage significantly exceeds the number of ground-truth objects, and the recall of critical objects is inadequate, which severely limits overall network performance. To address these issues, the authors propose a method termed Separate Non-Maximum Suppression (Sep-NMS) for two-stage REC. Sep-NMS models information from the two stages both independently and collaboratively, improving comprehension and identification of the target objects. Specifically, the authors propose a Ref-Relatedness module that filters referent proposals rigorously, reducing their redundancy. A CLIP-Relatedness module, built on robust multimodal pre-trained encoders, assesses the relevance between the language and the proposals to improve the recall of critical objects. The authors state that they are the first to use a multimodal pre-training model for proposal filtering in the first stage. In addition, an Information Fusion module is designed to combine the multimodal information across the two stages so that the available information is fully utilised. Extensive experiments demonstrate that the approach achieves performance competitive with previous state-of-the-art methods. The datasets used are publicly available: RefCOCO and RefCOCO+ (https://doi.org/10.1007/978-3-319-46475-6_5) and RefCOCOg (https://doi.org/10.1109/CVPR.2016.9).
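For orientation, the following is a minimal sketch of first-stage proposal filtering: a score threshold followed by standard greedy non-maximum suppression. It is not the authors' Sep-NMS. The function names, the thresholds, and the `relatedness` scores (standing in for whatever a language-to-proposal relatedness module would output) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(
    boxes: Sequence[Box],
    relatedness: Sequence[float],
    score_thresh: float = 0.3,  # assumed value, not from the abstract
    iou_thresh: float = 0.5,    # assumed value, not from the abstract
) -> List[int]:
    """Return indices of proposals kept after thresholding and greedy NMS."""
    # Drop proposals whose relatedness to the expression is too low,
    # then visit the rest from most to least related.
    order = sorted(
        (i for i, s in enumerate(relatedness) if s >= score_thresh),
        key=lambda i: relatedness[i],
        reverse=True,
    )
    kept: List[int] = []
    for i in order:
        # Keep a proposal only if it does not heavily overlap a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
    demo_scores = [0.9, 0.8, 0.7, 0.1]
    print(filter_proposals(demo_boxes, demo_scores))
```

In the demo, the second box overlaps the first heavily and is suppressed, and the last box falls below the score threshold, so fewer, less redundant proposals reach the second stage.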
Funding: This work was supported in part by audio-visual new media laboratory operation and maintenance of the Academy of Broadcasting Science (Grant No. 200304) and in part by the National Key Research and Development Program of China (Grant No. 2019YFB1406201).
Abstract: Referring expression comprehension is the task of locating the image region described by a natural language expression, which refers to the properties of the region or its relationships with other regions. Most previous work handles this problem by selecting the most relevant regions from a set of candidate regions; these methods are inefficient when the set contains many candidates. Inspired by the recent success of deep learning methods in image captioning, in this paper we propose a framework that understands referring expressions through multiple steps of reasoning. We present a model for referring expression comprehension that selects the most relevant region directly from the image. The core of our model is a recurrent attention network, which can be seen as an extension of the Memory Network. The proposed model is capable of improving its results through multiple computational hops. We evaluate the proposed model on two referring expression datasets: Visual Genome and Flickr30k Entities. The experimental results demonstrate that the proposed model outperforms previous state-of-the-art methods in both accuracy and efficiency. We also conduct an ablation experiment showing that the performance of the model does not improve as the number of attention layers increases.
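The idea of refining attention over image regions through several computational hops can be sketched as follows. This is a generic memory-network-style loop, not the paper's recurrent attention network: the dot-product scoring, the additive state update, the feature vectors, and all names are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def multi_hop_attention(query: Vector, regions: Sequence[Vector], hops: int = 2) -> List[float]:
    """Return attention weights over regions after `hops` rounds of refinement."""
    state = list(query)  # starts as the (hypothetical) expression embedding
    weights = [1.0 / len(regions)] * len(regions)
    for _ in range(hops):
        # Score every region against the current state.
        weights = softmax([dot(state, r) for r in regions])
        # Read out the attention-weighted average of the region features.
        read = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(len(state))]
        # Fold the read-out into the state so the next hop can refine the choice.
        state = [s + v for s, v in zip(state, read)]
    return weights


def select_region(query: Vector, regions: Sequence[Vector], hops: int = 2) -> int:
    """Index of the region with the highest final attention weight."""
    weights = multi_hop_attention(query, regions, hops)
    return max(range(len(weights)), key=lambda i: weights[i])


if __name__ == "__main__":
    demo_regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(select_region([2.0, 0.0], demo_regions, hops=2))
```

Each hop re-scores the regions against a state that already contains what the previous hop attended to, which is the sense in which additional hops can sharpen the selection.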
Abstract: Based on a review of theories of meaning in the fields of philosophy, semantics, and pragmatics, this paper suggests that language meaning, which is subject to constant change in real communication, is a concept distinct from the relatively static sense and reference, which are mutually interdependent. Sense and reference constitute the personal images of any given word. Three sources of meaning are elaborated: the dictionary sense, the personal images of any given word, and the dynamic context. A combination of traditional semantic meaning analysis and dynamic pragmatic meaning analysis is suggested by examining the meaning derivation process in the use of referring expressions in definite descriptions, indefinite descriptions, and incomplete sentences.
Funding: STI2030-Major Projects (2022ZD0212000); Key Research and Development Program of Zhejiang (2024SSYS0014); Beijing Natural Science Foundation (Z240009); National Natural Science Foundation of China (2021MG1BI01, 62475129, 21927813, T2322001); Strategic Precision Surgery Project at the Institute for Intelligent Healthcare (Tsinghua University); Innovation Fund of the Tsinghua-Foshan Institute of Advanced Manufacturing; National Key Research and Development Program of China (2022ZD0211900, 2024YFC3406603).
Abstract: The integration of near-infrared genetically encoded reporters (NIR-GERs) with photoacoustic (PA) imaging enables visualization of the deep-seated functions of specific cell populations at high resolution, though the imaging depth is primarily constrained by the reporters' PA response intensity. Directed evolution can optimize the performance of NIR-GERs for PA imaging, yet precisely quantifying the PA responses of mutant proteins expressed in E. coli colonies across iterative rounds challenges the imaging speed and quantification capabilities of screening platforms. Here, we present self-calibrated photoacoustic screening (SCAPAS), an imaging-based platform that can detect samples in parallel within 5 s (equivalent to 50 ms per colony), achieving a quantification accuracy of approximately 2.8% and a quantification precision of about 6.47%. SCAPAS incorporates co-expressed reference proteins in sample preparation and employs a ring transducer array with switchable illumination for rapid, wide-field, dual-wavelength PA imaging, enabling precise calculation of the PA response using the self-calibration method. Numerical simulations validated the image optimization strategy, the quantification process, and the noise robustness. Tests with co-expression samples confirmed SCAPAS's superior screening speed and quantification capabilities. We believe that SCAPAS will facilitate the development of novel NIR-GERs suitable for PA imaging and has the potential to significantly advance PA probes and molecular imaging.