The two phenomena of similar objects having different spectra and different objects having similar spectra often make it difficult to separate and identify all types of geographical objects using spectral information alone. There is therefore a need to incorporate the spatial structural and spatial association properties of object surfaces into image processing to improve the classification accuracy of remotely sensed imagery. In this article, a new method based on the principle of multiple-point statistics is proposed for combining spectral and spatial information in image classification. The method was validated in a case study on road extraction from Landsat TM imagery taken over the Chinese Yellow River delta on August 8, 1999. The classification results show that the new method provides overall better results than traditional methods such as the maximum likelihood classifier (MLC).
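The core multiple-point idea, conditioning a pixel's class on the joint pattern of its neighbours rather than on single-pixel statistics, can be illustrated with a minimal sketch. This is a toy illustration, not the authors' algorithm: the 3x3 window, the tiny training image, and the fallback to the spectral label are all assumptions.

```python
from collections import Counter, defaultdict

def neighbour_pattern(img, r, c):
    """8-neighbour classes around (r, c), read in a fixed raster order."""
    return tuple(img[r + dr][c + dc]
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if not (dr == 0 and dc == 0))

def train_mps(training_img):
    """Count how often each neighbour pattern co-occurs with each centre class."""
    stats = defaultdict(Counter)
    rows, cols = len(training_img), len(training_img[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            stats[neighbour_pattern(training_img, r, c)][training_img[r][c]] += 1
    return stats

def classify(stats, pattern, spectral_class):
    """Most probable centre class for a pattern; fall back to the spectral label."""
    if pattern in stats:
        return stats[pattern].most_common(1)[0][0]
    return spectral_class

# Toy classified training image: 0 = background, 1 = road (a horizontal strip)
train = [[0, 0, 0, 0],
         [1, 1, 1, 1],
         [1, 1, 1, 1],
         [0, 0, 0, 0]]
stats = train_mps(train)
```

A pixel whose neighbourhood matches a learned road pattern is labelled road even if its spectral label disagrees; unseen patterns defer to the spectral classifier.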
The study of induced polarization (IP) information extraction from magnetotelluric (MT) sounding data is of great practical significance to the exploitation of deep mineral, oil and gas resources. The linear inversion method, which has been given priority in previous research on IP information extraction, has three main problems: 1) dependency on the initial model, 2) easily falling into local minima, and 3) serious non-uniqueness of solutions. Taking the nonlinearity and nonconvexity of IP information extraction into consideration, a two-stage CO-PSO minimum structure inversion method using the Compute Unified Device Architecture (CUDA) is proposed. On the one hand, a novel Cauchy oscillation particle swarm optimization (CO-PSO) algorithm is applied to extract nonlinear IP information from MT sounding data, implemented as a parallel algorithm within the CUDA computing architecture; on the other hand, the impact of polarizability on the observed data is strengthened by introducing a second-stage inversion process, and a regularization parameter is applied in the fitness function of the PSO algorithm to mitigate the non-uniqueness of the inversion. Inversion simulations of polarization layers in different strata of various geoelectric models show that smooth models of resistivity and IP parameters can be obtained by the proposed algorithm, and the results are relatively stable and accurate. Experiments with added noise indicate that the method is robust to Gaussian white noise. Compared with traditional PSO and GA algorithms, the proposed algorithm is more efficient and yields better inversion results.
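The Cauchy-oscillation idea, perturbing the swarm's global best with heavy-tailed Cauchy noise so the search can jump out of local minima, can be sketched in a serial toy form. The test objective, coefficients, and perturbation scale below are assumptions, not the authors' CUDA implementation; in the paper's setting the fitness would be a regularized data misfit rather than the sphere function used here.

```python
import math
import random

def pso_minimize(f, dim, n_particles=30, iters=200, seed=0):
    """Plain PSO with a Cauchy perturbation of the global best each iteration."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            v = f(pos[i])
            if v < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], v
                if v < gbest_val:
                    gbest, gbest_val = pos[i][:], v
        # Cauchy oscillation: heavy-tailed trial move of the global best,
        # accepted only if it improves the fitness
        trial = [x + 0.1 * math.tan(math.pi * (rng.random() - 0.5)) for x in gbest]
        tv = f(trial)
        if tv < gbest_val:
            gbest, gbest_val = trial, tv
    return gbest, gbest_val

sphere = lambda x: sum(v * v for v in x)  # stand-in for a regularized misfit
best, val = pso_minimize(sphere, dim=3)
```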
Due to the need for rapid and sustainable development of China's coastal zones, high-resolution information extraction using data mining technology has become an urgent research focus. However, traditional pixel-based image analysis methods cannot meet the needs of this development trend. This paper presents an information extraction approach based on image segmentation with an object-oriented algorithm for high-resolution remote sensing images. One aim of the authors' research is to establish a "pixel-primitive-object" identification system. Through extraction and combination of micro-scale coastal zone features, objects such as tidal flats, water lines, sea walls, and mariculture ponds are classified or recognized. Firstly, the authors extract various internal features of relatively homogeneous primitive objects using an image segmentation algorithm based on both spectral and shape information. Secondly, the features of those primitives are analyzed to ascertain an optimal object by adopting certain feature rules. The results indicate that the model is practical and that the extraction accuracy of coastal information is significantly improved compared with traditional approaches. This study therefore provides a potential way to serve the monitoring, management, development and utilization of highly dynamic coastal zones.
Information extraction (IE) aims to automatically identify and extract information of specific interest from raw texts. Despite the abundance of solutions based on fine-tuning pretrained language models, IE in few-shot and zero-shot scenarios remains highly challenging due to the scarcity of training data. Large language models (LLMs), on the other hand, can generalize well to unseen tasks with few-shot demonstrations or even zero-shot instructions, and have demonstrated impressive ability on a wide range of natural language understanding and generation tasks. Nevertheless, it is unclear whether such effectiveness can be replicated in IE, where the target tasks involve specialized schemas and highly abstract entity and relation concepts. In this paper, we first examine the validity of LLMs in executing IE tasks with an established prompting strategy and further propose multiple types of augmented prompting methods, including the structured fundamental prompt (SFP), the structured interactive reasoning prompt (SIRP), and the voting-enabled structured interactive reasoning prompt (VESIRP). The experimental results demonstrate that while direct prompting yields inferior performance, the proposed augmented prompting methods significantly improve extraction accuracy, achieving comparable or even better performance (e.g., zero-shot FewNERD, FewNERD-INTRA) than state-of-the-art methods that require large-scale training samples. This study represents a systematic exploration of employing instruction-following LLMs for IE. It not only establishes a performance benchmark for this novel paradigm but, more importantly, validates a practical technical pathway through the proposed prompt enhancement methods, offering a viable solution for efficient IE in low-resource settings.
Processing police incident data in public security involves complex natural language processing (NLP) tasks, including information extraction. This data contains extensive entity information, such as people, locations, and events, while also involving reasoning tasks like personnel classification, relationship judgment, and implicit inference. Moreover, using models to extract information from police incident data faces a significant challenge: data scarcity, which limits the effectiveness of traditional rule-based and machine-learning methods. To address these challenges, we propose TIPS. In collaboration with public security experts, we used de-identified police incident data to create templates that enable large language models (LLMs) to populate data slots and generate simulated data, enhancing data density and diversity. We then designed schemas to efficiently manage complex extraction and reasoning tasks, constructed a high-quality dataset, and fine-tuned multiple open-source LLMs. Experiments showed that the fine-tuned ChatGLM-4-9B model achieved an F1 score of 87.14%, nearly 30% higher than the base model, significantly reducing error rates. Manual corrections further improved performance by 9.39%. This study demonstrates that combining large-scale pre-trained models with limited high-quality domain-specific data can greatly enhance information extraction in low-resource environments, offering a new approach for intelligent public security applications.
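The template-and-slot mechanism behind the synthetic data generation can be sketched minimally. The templates, slot names, and slot values below are invented placeholders (not the project's de-identified data), and a random sampler stands in for the LLM that would fill the slots.

```python
import random

# Hypothetical incident templates with named slots; in TIPS an LLM would
# populate the slots, here a seeded sampler plays that role.
TEMPLATES = [
    "{person} reported a {incident} near {location} at {time}.",
    "At {time}, officers responded to a {incident} at {location} involving {person}.",
]
SLOTS = {
    "person": ["resident A", "shop owner B"],
    "incident": ["theft", "dispute"],
    "location": ["the east market", "block 7"],
    "time": ["09:15", "22:40"],
}

def generate(n, seed=0):
    """Produce n simulated incident records by filling template slots."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        out.append(template.format(**{k: rng.choice(v) for k, v in SLOTS.items()}))
    return out

samples = generate(4)
```

Each generated record carries known gold slot values, so the same machinery yields labeled training pairs for fine-tuning.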
Background: Acquiring relevant information about procurement targets is fundamental to procuring medical devices. Although traditional Natural Language Processing (NLP) and Machine Learning (ML) methods have improved information retrieval efficiency to a certain extent, they exhibit significant limitations in adaptability and accuracy when dealing with procurement documents characterized by diverse formats and a high degree of unstructured content. The emergence of Large Language Models (LLMs) offers new possibilities for efficient procurement information processing and extraction. Methods: This study collected procurement transaction documents from public procurement websites and proposes a procurement Information Extraction (IE) method based on LLMs. Unlike traditional approaches, this study systematically explores the applicability of LLMs to both structured and unstructured entities in procurement documents, addressing the challenges posed by format variability and content complexity. Furthermore, an optimized prompt framework tailored to procurement document extraction tasks is developed to enhance the accuracy and robustness of IE. The aim is to process and extract key information from medical device procurement quickly and accurately, meeting stakeholders' demands for precision and timeliness in information retrieval. Results: Experimental results demonstrate that, compared to traditional methods, the proposed approach achieves an F1 score of 0.9698, a 4.85% improvement over the best baseline model. Moreover, both recall and precision are close to 97%, significantly outperforming other models and exhibiting exceptional overall recognition capability. Notably, further analysis reveals that the proposed method consistently maintains high performance across both structured and unstructured entities in procurement documents while balancing recall and precision effectively, demonstrating its adaptability to varying document formats. Ablation experiments validate the effectiveness of the proposed prompting strategy. Conclusion: This study also explores the challenges and potential improvements of the proposed method in IE tasks and provides insights into its feasibility for real-world deployment and application directions, further clarifying its adaptability and value. The method not only exhibits significant advantages in medical device procurement but also holds promise for providing new approaches to information processing and decision support in other domains.
Tectono-geochemical analysis is one of the key technical methods for deep prospecting and prediction, but extracting information on weak and low degrees of mineralization remains a significant challenge. Taking the Maoping super-large germanium-rich lead-zinc deposit in northeastern Yunnan as an example, this study systematically analyzes the mineralization element assemblages and their anomaly distribution characteristics, extracts information on low and weak anomalies at depth, clarifies the spatial distribution of ore-forming element anomalies and fluid migration patterns, and establishes tectono-geochemical deep anomaly evaluation criteria and prospecting models, thereby proposing directions for deep prospecting in the deposit. The research shows that the mineralization element assemblage of the F1 factor (Cd-Cu-Ge-Zn-Sb-In-Pb-Sr(-)-As-Hg) anomalies represents near-ore halos; the element assemblage of the F2 factor (Ni-Co-Cr-Rb-Ga) anomalies represents tail halos; the element assemblage of the F3 factor (Rb-Mo-Tl-As) anomalies represents front halos; and the element assemblage of the F4 factor (Ba-Ga) anomalies represents barite alteration anomalies. Elements such as Zn and Pb exhibit significant anomalies near the lead-zinc ore bodies. In the study area, vertical anomalies in the eastern region of the Luoze River indicate that ore-forming fluids migrated from the SE at depth to the NW at shallower levels, whereas in the western region, ore-forming fluids migrated from the SW at depth to the NE at shallower levels. The lateral extensions of different ore bodies in the eastern and western regions of the river have thus been determined. On this basis, tectono-geochemical deep anomaly evaluation criteria for the deposit are established, and directions for deep prospecting are proposed. This study provides scientific value and practical significance for deep prospecting and exploration engineering planning in similar lead-zinc deposits.
Although Named Entity Recognition (NER) in cybersecurity has historically concentrated on threat intelligence, vital security data can be found in a variety of sources, such as open-source intelligence and unprocessed tool outputs. When dealing with technical language, the coexistence of structured and unstructured data poses serious issues for traditional BERT-based techniques. We introduce a three-phase approach for improved NER in multi-source cybersecurity data that makes use of large language models (LLMs). To ensure thorough entity coverage, our method starts with an identification module that uses dynamic prompting techniques. To reduce hallucinations, the extraction module uses confidence-based self-assessment and cross-checking with regex validation. The tagging module links to knowledge bases for contextual validation and uses SecureBERT in conjunction with conditional random fields to detect entity boundaries precisely. Our framework creates efficient natural language segments by utilizing decoder-based LLMs with 10B parameters. Compared to baseline SecureBERT implementations, evaluation across four cybersecurity data sources shows notable gains, with 9.4%-25.21% greater recall and a 6.38%-17.3% better F1-score. Our refined model matches larger models and achieves a 2.6%-4.9% better F1-score for technical phrase recognition than the state-of-the-art alternatives Claude 3.5 Sonnet, Llama3-8B, and Mixtral-7B. The three-stage identification-extraction-tagging pipeline tackles important cybersecurity NER issues. Through efficient architectures, these developments preserve deployability while setting a new standard for entity extraction in challenging security scenarios. The findings show how targeted enhancements in hybrid recognition, validation procedures, and prompt engineering raise NER performance above monolithic LLM approaches in cybersecurity applications, especially for technical entity extraction from heterogeneous sources where conventional techniques fall short. Because of its modular nature, the framework can be upgraded at the component level as new methods are developed.
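The regex cross-check used to curb hallucinated extractions can be sketched as follows. The two entity types and their patterns are illustrative choices, not the paper's full validator: machine-verifiable formats such as CVE identifiers and IPv4 addresses lend themselves to this kind of filter.

```python
import re

# Format checks for machine-verifiable entity types; an LLM extraction that
# fails its type's pattern is discarded instead of being trusted.
PATTERNS = {
    "CVE": re.compile(r"CVE-\d{4}-\d{4,}"),
    "IPv4": re.compile(
        r"(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"),
}

def validate(entities):
    """Keep (type, value) pairs whose value fully matches the type's pattern."""
    return [(t, v) for t, v in entities
            if t in PATTERNS and PATTERNS[t].fullmatch(v)]

extracted = [("CVE", "CVE-2021-44228"),   # well-formed, kept
             ("CVE", "CVE-21-1"),          # malformed year, dropped
             ("IPv4", "192.168.0.300"),    # octet out of range, dropped
             ("IPv4", "10.0.0.1")]         # well-formed, kept
kept = validate(extracted)
```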
The development of precision agriculture demands high accuracy and efficiency in cultivated land information extraction. As a new means of ground monitoring in recent years, unmanned aerial vehicle (UAV) low-altitude remote sensing, which is flexible, efficient, low-cost and high-resolution, is widely applied to investigating various resources. On this basis, a novel extraction method for cultivated land information based on a Deep Convolutional Neural Network and Transfer Learning (DTCLE) was proposed. First, linear features (roads, ridges, etc.) were excluded based on a Deep Convolutional Neural Network (DCNN). Next, the feature extraction method learned from the DCNN was applied to cultivated land information extraction by introducing a transfer learning mechanism. Last, the extraction results of DTCLE were compared with those of eCognition-based cultivated land extraction (ECLE). Sites in Pengzhou County and Guanghan County, Sichuan Province, were selected for the experiments. The experimental results showed that the overall precision of extracting cultivated land from experimental images 1, 2 and 3 with the DTCLE method was 91.7%, 88.1% and 88.2% respectively, while the overall precision of ECLE was 90.7%, 90.5% and 87.0%, respectively. The accuracy of DTCLE was equivalent to that of ECLE, and DTCLE also outperformed ECLE in terms of integrity and continuity.
Web information extraction is viewed as a classification process, and a competing classification method is presented to extract Web information directly through classification. Web fragments are represented with three general features, and the similarities between fragments are then defined on the basis of these features. Through competitions of fragments for different slots in information templates, the method classifies fragments into slot classes and filters out noise information. Far fewer annotated samples are needed than with rule-based methods, so the method has strong portability. Experiments show that the method performs well and is superior to the DOM-based method in information extraction. Key words: information extraction; competing classification; feature extraction; wrapper induction. CLC number: TP 311. Foundation item: Supported by the National Natural Science Foundation of China (60303024). Biography: LI Xiang-yang (1974-), male, Ph.D. candidate; research directions: information extraction, natural language processing.
Information extraction plays a vital role in natural language processing for extracting named entities and events from unstructured data. Due to exponential data growth in the agricultural sector, extracting significant information has become a challenging task. Though existing deep learning-based techniques have been applied in smart agriculture for crop cultivation, crop disease detection, weed removal, and yield production, it is still difficult to find the semantics between extracted pieces of information due to the varying effects of weather, soil, pest, and fertilizer data. This paper consists of two parts: a first phase, which proposes a data preprocessing technique for removing ambiguity from input corpora, and a second phase, which proposes a novel deep learning-based long short-term memory network with a rectified Adam optimizer and a multilayer perceptron to perform agricultural named entity recognition, event detection, and extraction of the relations between them. The proposed algorithm has been trained and tested on four input corpora: agriculture, weather, soil, and pest & fertilizers. The experimental results have been compared with existing techniques, and it was observed that the proposed algorithm outperforms Weighted-SOM, LSTM+RAO, PLR-DBN, KNN, and Naïve Bayes on standard parameters like accuracy, sensitivity, and specificity.
In order to use data from the Internet, it is necessary to extract data from web pages. An HTT tree model representing HTML pages is presented. Based on the HTT model, a wrapper generation algorithm, AGW, is proposed. The AGW algorithm utilizes a comparing-and-correcting technique to generate the wrapper from the native characteristics of the HTT tree structure. The AGW algorithm can not only generate the wrapper automatically, but also rebuild the data schema easily and reduce computational complexity.
Electronic medical records (EMRs) containing rich biomedical information have great potential in disease diagnosis and biomedical research. However, EMR information is usually in the form of unstructured text, which raises the cost of use and hinders application. In this work, an effective named entity recognition (NER) method is presented for information extraction from Chinese EMRs, achieved by word-embedding-bootstrapped deep active learning, to promote the acquisition of medical information from Chinese EMRs and to release its value. Deep active learning with a bi-directional long short-term memory network followed by a conditional random field (Bi-LSTM+CRF) is used to capture the characteristics of different kinds of information from a labeled corpus, and the continuous bag-of-words and skip-gram word embedding models are combined with the above model to capture the text features of Chinese EMRs from an unlabeled corpus. To evaluate the performance of the method, NER tasks on Chinese EMRs with "medical history" content were used. Experimental results show that the word-embedding-bootstrapped deep active learning method using an unlabeled medical corpus can achieve better performance than other models.
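The core of any such active-learning loop, querying labels for the samples the current model is least sure about, can be sketched independently of the Bi-LSTM+CRF model. The entropy criterion, the toy probability distributions, and the batch size are illustrative assumptions, not the paper's exact query strategy.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool, k):
    """Pick the k unlabeled samples whose predictions are most uncertain.

    pool: list of (sample_id, class_probability_distribution).
    """
    ranked = sorted(pool, key=lambda item: entropy(item[1]), reverse=True)
    return [sid for sid, _ in ranked[:k]]

pool = [
    ("s1", [0.98, 0.01, 0.01]),   # confident prediction: low query priority
    ("s2", [0.34, 0.33, 0.33]),   # near-uniform: highest priority
    ("s3", [0.70, 0.20, 0.10]),   # moderately uncertain
]
queried = select_for_labeling(pool, k=2)
```

After annotating the queried samples, the model is retrained and the selection repeats, which is what lets the method squeeze value out of a largely unlabeled corpus.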
Synthetic aperture radar (SAR) provides a large amount of image data for the observation and research of oceanic eddies. Using SAR images to automatically depict the shape of eddies and extract eddy information is of great significance to the study of oceanic eddies and the application of SAR eddy images. In this paper, a method of automatic shape depiction and information extraction for oceanic eddies in SAR images is proposed, targeting spiral eddies. Firstly, a skeleton image is obtained by skeletonizing the SAR image. Secondly, the logarithmic spirals detected in the skeleton image are drawn on the SAR image to depict the shape of the eddies. Finally, the eddy information is extracted based on the results of shape depiction. Sentinel-1 SAR eddy images of the Black Sea area were used for the experiments. The experimental results show that the proposed method can automatically depict the shape of eddies and extract eddy information. The shape depiction results are consistent with the actual shapes of the eddies, and the extracted eddy information is consistent with reference information extracted manually. The validity of the method is thereby verified.
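A logarithmic spiral r = a·e^(bθ) becomes a straight line in (θ, ln r) coordinates, so candidate skeleton points can be checked against the spiral model with a simple least-squares fit. This is a toy sketch of the detection idea under that assumption, not the paper's pipeline; the synthetic sample points stand in for skeleton pixels converted to polar coordinates.

```python
import math

def fit_log_spiral(points):
    """Least-squares fit of ln r = ln a + b*theta to (theta, r) samples."""
    xs = [theta for theta, _ in points]
    ys = [math.log(r) for _, r in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    ln_a = my - b * mx
    return math.exp(ln_a), b

# Synthetic skeleton points sampled from r = 2 * exp(0.3 * theta)
pts = [(t / 10, 2.0 * math.exp(0.3 * t / 10)) for t in range(63)]
a, b = fit_log_spiral(pts)
```

A low residual in the (θ, ln r) fit marks a skeleton branch as part of a spiral arm; the fitted a and b then summarize the eddy's size and winding.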
Because of the developed economy and lush vegetation in southern China, the following obstacles exist in remote sensing land surface classification: 1) diverse surface composition types; 2) undulating terrain; 3) small fragmented parcels of land; 4) indistinguishable shadows of surface objects. It is a top priority to clarify how to use the concept of big data (data mining technology) and various new technologies and methods to make complex-surface remote sensing information extraction develop in the direction of automation, refinement and intelligence. To achieve these research objectives, this paper takes data from China's Gaofen-2 satellite as the data source and complex-surface remote sensing information extraction as the research object, and intelligently analyzes the remote sensing information of complex surfaces after completing data collection and preprocessing. The specific extraction methods are as follows: 1) extraction of fractal texture features based on Brownian motion; 2) extraction of color features; 3) extraction of vegetation indices; 4) construction of feature vectors and the corresponding classification. In this paper, fractal texture features, color features, vegetation features and spectral features of remote sensing images are combined into a composite feature vector. The higher feature dimensionality increases the separability of ground objects, which is more conducive to their classification and thus improves the classification accuracy of remote sensing images. The approach is suitable for remote sensing information extraction of complex surfaces in southern China and can be extended to other complex surface areas in the future.
Due to higher demands on product diversity, flexible shifting between the production of different products on a single piece of equipment has become a popular solution, resulting in multiple operation modes within a single process. In order to handle such multi-mode processes, a novel double-layer structure is proposed, and the original data are decomposed into common and specific characteristics according to the relationships between variables in each mode. In addition, both low- and high-order information are considered in each layer. The common and specific information within each mode can be captured and separated into several subspaces according to the different order information. The performance of the proposed method is further validated through a numerical example and the Tennessee Eastman (TE) benchmark. Compared with previous methods, the superiority of the proposed method is validated by its better monitoring results.
Key information extraction can reduce dimensional effects while evaluating the correct preferences of users during semantic data analysis. Currently, classifiers are used to maximize the performance of web-page recommendation in terms of precision and satisfaction. A recent method disambiguates contextual sentiment using conceptual prediction with robustness; however, the conceptual prediction method is not able to yield the optimal solution. Context-dependent terms are primarily evaluated by constructing a linear space of context features, presuming that if terms come together in certain consumer-related reviews, they are semantically dependent. Moreover, the more frequently they coexist, the greater the semantic dependency. However, the influence of terms that coexist with each other can only partly be captured by the frequency of their semantic dependence, as they are non-integrative and their individual meanings cannot be derived. In this work, we treat the strength of a term and the influence of a term as a combinatorial optimization, called Combinatorial Optimized Linear Space Knapsack for Information Retrieval (COLSK-IR). COLSK-IR is formulated as a knapsack problem with the total weight being the "term influence" and the total value being the "term frequency" for semantic data analysis. The procedure by which term influence and term frequency are jointly considered to identify optimal solutions is called combinatorial optimization. Thus, we cast the problem as a knapsack-based integer program and perform multiple experiments in the linear space through combinatorial optimization to identify the possible optimum solutions. Our experimental results show that COLSK-IR provides better results than previous methods in detecting strongly dependent snippets with minimum ambiguity that are related to inter-sentential context during semantic data analysis.
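The knapsack formulation described above, maximizing total term frequency (value) under a budget on total term influence (weight), reduces to standard 0/1 dynamic programming. A minimal sketch with invented term scores follows; integer influence weights and the example terms are assumptions for illustration, not COLSK-IR itself.

```python
def knapsack_terms(terms, budget):
    """0/1 knapsack over terms = [(name, influence, frequency)].

    Maximizes total frequency subject to total influence <= budget.
    Returns (best total frequency, list of selected term names).
    """
    n = len(terms)
    # dp[i][cap] = best value using the first i terms within influence cap
    dp = [[0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, w, v) in enumerate(terms, start=1):
        for cap in range(budget + 1):
            dp[i][cap] = dp[i - 1][cap]
            if w <= cap:
                dp[i][cap] = max(dp[i][cap], dp[i - 1][cap - w] + v)
    # Backtrack to recover the chosen terms
    chosen, cap = [], budget
    for i in range(n, 0, -1):
        if dp[i][cap] != dp[i - 1][cap]:
            name, w, _ = terms[i - 1]
            chosen.append(name)
            cap -= w
    return dp[n][budget], chosen[::-1]

# Hypothetical review terms: (name, influence weight, frequency value)
terms = [("battery", 3, 7), ("screen", 2, 4), ("price", 4, 9), ("shipping", 1, 2)]
best_freq, picked = knapsack_terms(terms, budget=6)
```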
In order to explore how to extract more transport information from current fluctuations, a theoretical extraction scheme is presented for a single-barrier structure based on exclusion models, which include the counter-flows model and the tunnel model. The first four cumulants of these two exclusion models are computed for a single-barrier structure, and their characteristics are obtained. A scheme based on the first three cumulants is devised to check whether a transport process follows the counter-flows model, the tunnel model, or neither. Time series generated by Monte Carlo techniques are adopted to validate the extraction procedure, and the results are reasonable.
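The first four cumulants follow from the raw moments of the counted time series via the standard moment-cumulant relations. A self-contained sketch (the sample series is an illustrative stand-in for Monte Carlo transfer counts, not the paper's data):

```python
def raw_moment(xs, k):
    """k-th raw moment m_k = <x^k> of a sample."""
    return sum(x ** k for x in xs) / len(xs)

def first_four_cumulants(xs):
    """Cumulants k1..k4 from raw moments m1..m4 (standard relations)."""
    m1, m2, m3, m4 = (raw_moment(xs, k) for k in (1, 2, 3, 4))
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    k4 = m4 - 4 * m1 * m3 - 3 * m2 ** 2 + 12 * m1 ** 2 * m2 - 6 * m1 ** 4
    return k1, k2, k3, k4

# Illustrative transfer counts per time window (e.g. from a Monte Carlo run)
series = [0, 1, 1, 2, 0, 1, 2, 1]
k1, k2, k3, k4 = first_four_cumulants(series)
```

Ratios such as k2/k1 and k3/k1 computed this way are the quantities one would compare against the two exclusion models' predictions.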
Aiming at the problem that virtual machine information cannot be extracted completely, we extend the typical virtual machine information extraction model and propose a perception mechanism for virtualization systems based on a storage covert channel to overcome the semantic gap. Taking advantage of the undetectability of the covert channel, a secure channel is established between the VIP machine and the virtual machine monitor to pass data directly. The VIP machine can pass control information about malicious processes to the virtual machine monitor by using the VMCALL instruction and shared memory. By parsing critical information in the process control structure, the virtual machine monitor can terminate the malicious processes. Test results show that the proposed mechanism can clear user-level malicious programs in the virtual machine effectively and covertly. Meanwhile, its performance overhead is about the same as that of other mainstream monitoring modes.
A two-step information extraction method is presented to capture specific-index-related information more accurately. In the first step, the overall process variables are separated into two sets based on the Pearson correlation coefficient: one contains process variables strongly related to the specific index, and the other contains process variables weakly related to it. Performing principal component analysis (PCA) on the two sets changes the directions of the latent variables. In other words, the correlation between latent variables in the strongly related set and the specific index may become weaker, while the correlation between latent variables in the weakly related set and the specific index may be enhanced. In the second step, each of the two sets is further divided, from the perspective of latent variables and again using the Pearson correlation coefficient, into a subset strongly related to the specific index and a subset weakly related to it. The two strongly related subsets form a new subspace related to the specific index. Then, a hybrid monitoring strategy based on the specific index predicted by partial least squares (PLS) and a T² statistics-based method is proposed for specific-index-related process monitoring using this comprehensive information. The predicted specific index reflects real-time information about the specific index, and T² statistics are used to monitor index-related information. Finally, the proposed method is applied to the Tennessee Eastman (TE) process, and the results indicate the effectiveness of the proposed method.
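The first-step variable split can be sketched with plain Pearson correlations against the quality index. The threshold and toy process data are assumptions for illustration, and the subsequent PCA/PLS/T² stages are omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_variables(X, index, threshold=0.5):
    """Split process variables into strongly and weakly index-related sets.

    X: dict of variable name -> observations; index: the specific quality index.
    """
    strong, weak = [], []
    for name, series in X.items():
        (strong if abs(pearson(series, index)) >= threshold else weak).append(name)
    return strong, weak

# Toy process data: one variable tracks the index, one is roughly unrelated
index = [1.0, 2.0, 3.0, 4.0, 5.0]
X = {
    "temp": [1.1, 2.0, 2.9, 4.2, 5.1],
    "flow": [5.0, 1.0, 4.0, 2.0, 3.0],
}
strong, weak = split_variables(X, index, threshold=0.8)
```

In the full method, PCA is then run separately on each set, and the split is repeated on the resulting latent variables before the PLS/T² monitoring stage.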
Funding: Supported by the National Natural Science Foundation of China (No. 40671136) and the National High Technology Research and Development Program of China (Nos. 2006AA06Z115, 2006AA120106)
Abstract: The two phenomena of similar objects with different spectra and different objects with similar spectra often make it difficult to separate and identify all types of geographical objects using spectral information alone. Therefore, spatial structural and spatial association properties of object surfaces need to be incorporated into image processing to improve the classification accuracy of remotely sensed imagery. In this article, a new method based on the principle of multiple-point statistics is proposed for combining spectral and spatial information in image classification. The method was validated by applying it to a case study on road extraction from a Landsat TM image taken over the Chinese Yellow River delta on August 8, 1999. The classification results show that the new method provides overall better results than traditional methods such as the maximum likelihood classifier (MLC).
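As a baseline reference, the maximum likelihood classifier mentioned above can be sketched for a single band with class-conditional Gaussians; the band values and class labels below are hypothetical, and a real TM workflow would use multivariate Gaussians over all bands.

```python
import math

def train_mlc(samples):
    """Fit a 1-D Gaussian per class: mean and variance from training pixels."""
    params = {}
    for label, values in samples.items():
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        params[label] = (mean, var)
    return params

def classify(x, params):
    """Assign x to the class with the highest Gaussian log-likelihood."""
    def log_likelihood(mean, var):
        return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
    return max(params, key=lambda c: log_likelihood(*params[c]))

# Hypothetical single-band training data: road pixels darker than vegetation.
training = {"road": [52, 55, 49, 58, 51], "vegetation": [120, 131, 125, 118, 127]}
model = train_mlc(training)
```

A dark pixel (e.g. value 54) is then assigned to "road" and a bright one (e.g. 124) to "vegetation"; the paper's contribution is to augment exactly this spectral-only decision with multiple-point spatial statistics.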
Funding: Projects (41604117, 41204054) supported by the National Natural Science Foundation of China; Projects (20110490149, 2015M580700) supported by the Research Fund for the Doctoral Program of Higher Education, China; Project (2015zzts064) supported by the Fundamental Research Funds for the Central Universities, China; Project (16B147) supported by the Scientific Research Fund of Hunan Provincial Education Department, China
Abstract: The study of induced polarization (IP) information extraction from magnetotelluric (MT) sounding data is of great practical significance to the exploitation of deep mineral, oil, and gas resources. The linear inversion method, which has been given priority in previous research on IP information extraction, has three main problems: 1) dependency on the initial model, 2) a tendency to fall into local minima, and 3) serious non-uniqueness of solutions. Taking the nonlinearity and nonconvexity of IP information extraction into consideration, a two-stage CO-PSO minimum structure inversion method using the compute unified device architecture (CUDA) is proposed. On one hand, a novel Cauchy oscillation particle swarm optimization (CO-PSO) algorithm is applied to extract nonlinear IP information from MT sounding data and is implemented as a parallel algorithm within the CUDA computing architecture; on the other hand, the impact of polarizability on the observed data is strengthened by introducing a second-stage inversion process, and a regularization parameter is applied in the fitness function of the PSO algorithm to mitigate the non-uniqueness of the inversion. Inversion simulations of polarization layers in different strata of various geoelectric models show that the proposed algorithm yields smooth resistivity and IP parameter models whose results are stable and accurate. Experiments with added noise indicate that the method is robust to Gaussian white noise. Compared with the traditional PSO and GA algorithms, the proposed algorithm is more efficient and produces better inversion results.
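The PSO core of such an inversion can be illustrated with a minimal plain-Python sketch; the Cauchy-oscillation step, the CUDA parallelism, and the actual MT misfit are omitted, and the toy objective below (a quadratic data term plus a regularization penalty weighted by `lam`) is an assumption for illustration only.

```python
import random

def pso(objective, bounds, n_particles=20, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=42):
    """Plain particle swarm optimization; the paper's CO-PSO additionally
    perturbs particles with Cauchy-distributed oscillations."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    gbest = min(zip(pbest_val, pbest))[1][:]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest_val[i], pbest[i] = val, pos[i][:]
                if val < objective(gbest):
                    gbest = pos[i][:]
    return gbest

# Toy fitness: data misfit plus a regularization term, mimicking the role of
# the regularization parameter in the paper's fitness function.
lam = 0.1
f = lambda m: (m[0] - 3.0) ** 2 + (m[1] + 1.0) ** 2 + lam * (m[0] ** 2 + m[1] ** 2)
best = pso(f, [(-10, 10), (-10, 10)])
```

For this quadratic the regularized minimum is analytically (3/1.1, -1/1.1), which the swarm recovers closely; the regularization term is what pulls the solution toward a small (smooth) model, as in the minimum-structure inversion described above.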
Funding: The "973" Project of China under contract No. 2006CB701305; the "863" Project of China under contract No. 2009AA12Z148; the National Natural Science Foundation of China under contract No. 40971224
Abstract: Due to the need for rapid and sustainable development of China's coastal zones, high-resolution information extraction using data mining technology has become an urgent research focus. However, traditional pixel-based image analysis methods cannot meet the needs of this development trend. This paper presents an information extraction approach based on image segmentation with an object-oriented algorithm for high-resolution remote sensing images. An aim of the authors' research is to establish a "pixel-primitive-object" identification system. Through extraction and combination of micro-scale coastal zone features, objects such as tidal flats, water lines, sea walls, and mariculture ponds are classified or recognized. Firstly, the authors extract various internal features of relatively homogeneous primitive objects using an image segmentation algorithm based on both spectral and shape information. Secondly, the features of those primitives are analyzed to ascertain an optimal object by adopting certain feature rules. The results indicate that the model is practical to realize and that the extraction accuracy of coastal information is significantly improved compared with traditional approaches. This study therefore provides a potential way to support the monitoring, management, development, and utilization of highly dynamic coastal zones.
Funding: Supported by the National Natural Science Foundation of China (62222212).
Abstract: Information extraction (IE) aims to automatically identify and extract information of specific interest from raw texts. Despite the abundance of solutions based on fine-tuning pretrained language models, IE in few-shot and zero-shot scenarios remains highly challenging due to the scarcity of training data. Large language models (LLMs), on the other hand, can generalize well to unseen tasks with few-shot demonstrations or even zero-shot instructions, and have demonstrated impressive ability on a wide range of natural language understanding and generation tasks. Nevertheless, it is unclear whether such effectiveness can be replicated in IE, where the target tasks involve specialized schemas and quite abstract entity or relation concepts. In this paper, we first examine the validity of LLMs in executing IE tasks with an established prompting strategy, and further propose multiple types of augmented prompting methods, including the structured fundamental prompt (SFP), the structured interactive reasoning prompt (SIRP), and the voting-enabled structured interactive reasoning prompt (VESIRP). The experimental results demonstrate that while direct prompting yields inferior performance, the proposed augmented prompting methods significantly improve extraction accuracy, achieving comparable or even better performance (e.g., zero-shot FewNERD, FewNERD-INTRA) than state-of-the-art methods that require large-scale training samples. This study represents a systematic exploration of employing instruction-following LLMs for the task of IE. It not only establishes a performance benchmark for this novel paradigm but, more importantly, validates a practical technical pathway through the proposed prompt enhancement methods, offering a viable solution for efficient IE in low-resource settings.
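A structured prompt of the kind the paper proposes might be assembled as below; the template wording and the `structured_prompt` helper are hypothetical illustrations, not the paper's actual SFP format.

```python
def structured_prompt(schema, text, demos=()):
    """Assemble an SFP-style instruction: task description, entity schema,
    optional few-shot demonstrations, then the query text."""
    lines = ["Extract entities from the text and answer in JSON.",
             "Schema (entity type: description):"]
    for etype, desc in schema.items():
        lines.append(f"- {etype}: {desc}")
    for demo_text, demo_answer in demos:   # optional few-shot demonstrations
        lines.append(f"Text: {demo_text}\nAnswer: {demo_answer}")
    lines.append(f"Text: {text}\nAnswer:")
    return "\n".join(lines)

schema = {"PER": "person names", "LOC": "geographic locations"}
demo = ("Alice flew to Paris.", '{"PER": ["Alice"], "LOC": ["Paris"]}')
prompt = structured_prompt(schema, "Bob lives in Berlin.", demos=[demo])
```

The design point mirrors the abstract's finding: encoding the schema and demonstrations explicitly in the prompt is what lifts zero-shot/few-shot IE above direct, unstructured prompting.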
Abstract: Processing police incident data in public security involves complex natural language processing (NLP) tasks, including information extraction. This data contains extensive entity information, such as people, locations, and events, while also involving reasoning tasks like personnel classification, relationship judgment, and implicit inference. Moreover, utilizing models to extract information from police incident data poses a significant challenge: data scarcity, which limits the effectiveness of traditional rule-based and machine-learning methods. To address these issues, we propose TIPS. In collaboration with public security experts, we used de-identified police incident data to create templates that enable large language models (LLMs) to populate data slots and generate simulated data, enhancing data density and diversity. We then designed schemas to efficiently manage complex extraction and reasoning tasks, constructing a high-quality dataset and fine-tuning multiple open-source LLMs. Experiments showed that the fine-tuned ChatGLM-4-9B model achieved an F1 score of 87.14%, nearly 30% higher than the base model, significantly reducing error rates. Manual corrections further improved performance by 9.39%. This study demonstrates that combining large-scale pre-trained models with limited high-quality domain-specific data can greatly enhance information extraction in low-resource environments, offering a new approach for intelligent public security applications.
Abstract: Background: Acquiring relevant information about procurement targets is fundamental to procuring medical devices. Although traditional Natural Language Processing (NLP) and Machine Learning (ML) methods have improved information retrieval efficiency to a certain extent, they exhibit significant limitations in adaptability and accuracy when dealing with procurement documents characterized by diverse formats and a high degree of unstructured content. The emergence of Large Language Models (LLMs) offers new possibilities for efficient procurement information processing and extraction. Methods: This study collected procurement transaction documents from public procurement websites and proposes a procurement Information Extraction (IE) method based on LLMs. Unlike traditional approaches, this study systematically explores the applicability of LLMs to both structured and unstructured entities in procurement documents, addressing the challenges posed by format variability and content complexity. Furthermore, an optimized prompt framework tailored for procurement document extraction tasks is developed to enhance the accuracy and robustness of IE. The aim is to process and extract key information from medical device procurement quickly and accurately, meeting stakeholders' demands for precision and timeliness in information retrieval. Results: Experimental results demonstrate that, compared to traditional methods, the proposed approach achieves an F1 score of 0.9698, representing a 4.85% improvement over the best baseline model. Moreover, both recall and precision are close to 97%, significantly outperforming other models and exhibiting exceptional overall recognition capability. Notably, further analysis reveals that the proposed method consistently maintains high performance across both structured and unstructured entities in procurement documents while balancing recall and precision effectively, demonstrating its adaptability to varying document formats. The results of ablation experiments validate the effectiveness of the proposed prompting strategy. Conclusion: Additionally, this study discusses the challenges and potential improvements of the proposed method in IE tasks and provides insights into its feasibility for real-world deployment and application directions, further clarifying its adaptability and value. This method not only exhibits significant advantages in medical device procurement but also holds promise for providing new approaches to information processing and decision support in various domains.
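Scores of the kind reported above follow from entity-level counts via the standard precision/recall/F1 definitions; the counts below are hypothetical, chosen only to illustrate values near 97%.

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from entity-level true/false positive counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts: 97 correct extractions, 3 spurious, 3 missed.
p, r, f1 = prf1(tp=97, fp=3, fn=3)
```

With these counts precision, recall and F1 all equal 0.97, matching the pattern in the abstract where the two rates are balanced and F1 sits between them.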
Funding: National Natural Science Foundation of China (42472127, 42172086); Yunnan Major Science and Technological Projects (202202AG050014); the Yunnan Major Project of Basic Research (202401BN070001-002); Yunnan Mineral Resources Prediction and Evaluation Engineering Research Center (2011); Yunnan Provincial Geological Process and Mineral Resources Innovation Team (2012).
Abstract: Tectono-geochemical analysis is one of the key technical methods for deep prospecting and prediction, but the extraction of information on weak and low degrees of mineralization remains a significant challenge. This study takes the Maoping super-large germanium-rich lead-zinc deposit in northeastern Yunnan as an example, systematically analyzes the mineralization element assemblages and their anomaly distribution characteristics, extracts information on low and weak anomalies at depth, clarifies the spatial distribution of ore-forming element anomalies and fluid migration patterns, and establishes tectono-geochemical deep anomaly evaluation criteria and prospecting models, thereby proposing directions for deep prospecting in the deposit. This research shows that the mineralization element assemblage of the F1 factor (Cd-Cu-Ge-Zn-Sb-In-Pb-Sr(-)-As-Hg) anomalies represents near-ore halos; the element assemblage of the F2 factor (Ni-Co-Cr-Rb-Ga) anomalies represents tail halos; the element assemblage of the F3 factor (Rb-Mo-Tl-As) anomalies represents front halos; and the element assemblage of the F4 factor (Ba-Ga) anomalies represents barite alteration anomalies. Elements such as Zn and Pb exhibit significant anomalies near the lead-zinc ore bodies. In the study area, vertical anomalies in the eastern region of the Luoze River indicate that ore-forming fluids migrated from the SE at depth to the NW at shallower levels, whereas in the western region, ore-forming fluids migrated from the SW at depth to the NE at shallower levels. Thus, the lateral extensions of different ore bodies in the eastern and western regions of the river have been determined. On this basis, tectono-geochemical deep anomaly evaluation criteria for the deposit are established, and directions for deep prospecting are proposed. This study provides scientific value and practical significance for deep prospecting and exploration engineering planning for similar lead-zinc deposits.
Abstract: Although Named Entity Recognition (NER) in cybersecurity has historically concentrated on threat intelligence, vital security data can be found in a variety of sources, such as open-source intelligence and unprocessed tool outputs. When dealing with technical language, the coexistence of structured and unstructured data poses serious issues for traditional BERT-based techniques. We introduce a three-phase approach for improved NER in multi-source cybersecurity data that makes use of large language models (LLMs). To ensure thorough entity coverage, our method starts with an identification module that uses dynamic prompting techniques. To lessen hallucinations, the extraction module uses confidence-based self-assessment and cross-checking via regex validation. The tagging module links to knowledge bases for contextual validation and uses SecureBERT in conjunction with conditional random fields to detect entity boundaries precisely. Our framework creates efficient natural language segments by utilizing decoder-based LLMs with 10B parameters. Compared to baseline SecureBERT implementations, evaluation across four cybersecurity data sources shows notable gains, with 9.4%–25.21% greater recall and a 6.38%–17.3% better F1-score. Our refined model matches larger models and achieves a 2.6%–4.9% better F1-score for technical phrase recognition than the state-of-the-art alternatives Claude 3.5 Sonnet, Llama3-8B, and Mixtral-7B. The three-stage identification-extraction-tagging pipeline tackles important cybersecurity NER issues. Through efficient architectures, these developments preserve deployability while setting a new standard for entity extraction in challenging security scenarios. The findings show how targeted enhancements in hybrid recognition, validation procedures, and prompt engineering raise NER performance above monolithic LLM approaches in cybersecurity applications, especially for technical entity extraction from heterogeneous sources where conventional techniques fall short. Because of its modular nature, the framework can be upgraded at the component level as new methods are developed.
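The regex cross-checking step can be sketched as a per-type validator applied to LLM-extracted candidates; the entity types and patterns below are illustrative assumptions, not the paper's actual rule set.

```python
import re

# Hypothetical per-type validators; a real deployment would cover many more
# cybersecurity entity types and stricter patterns (e.g. octet range checks).
VALIDATORS = {
    "CVE": re.compile(r"^CVE-\d{4}-\d{4,}$"),
    "IPV4": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "MD5": re.compile(r"^[0-9a-f]{32}$", re.IGNORECASE),
}

def validate(entity_type, value):
    """Reject an LLM-extracted entity whose surface form fails its regex,
    filtering out hallucinated or malformed mentions."""
    pattern = VALIDATORS.get(entity_type)
    return bool(pattern and pattern.match(value))

candidates = [("CVE", "CVE-2021-44228"), ("CVE", "CVE-21-1"), ("IPV4", "10.0.0.300x")]
kept = [(t, v) for t, v in candidates if validate(t, v)]
```

Only the well-formed CVE identifier survives the filter; this is the cheap, deterministic backstop that complements the model's confidence-based self-assessment.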
Funding: Supported by the Fundamental Research Funds for the Central Universities of China (Grant No. 2013SCU11006); the Key Laboratory of Digital Mapping and Land Information Application of the National Administration of Surveying, Mapping and Geoinformation of China (Grant No. DM2014SC02); the Key Laboratory of Geospecial Information Technology, Ministry of Land and Resources of China (Grant No. KLGSIT201504)
Abstract: The development of precision agriculture demands high accuracy and efficiency in cultivated land information extraction. As a new means of ground monitoring in recent years, unmanned aerial vehicle (UAV) low-altitude remote sensing, which is flexible, efficient, low-cost, and high-resolution, is widely applied to investigating various resources. On this basis, a novel extraction method for cultivated land information based on a Deep Convolutional Neural Network and Transfer Learning (DTCLE) is proposed. First, linear features (roads, ridges, etc.) were excluded based on a Deep Convolutional Neural Network (DCNN). Next, the feature extraction method learned by the DCNN was applied to cultivated land information extraction by introducing a transfer learning mechanism. Last, cultivated land information extraction results were produced by DTCLE and by eCognition for cultivated land information extraction (ECLE). Pengzhou County and Guanghan County, Sichuan Province, were selected as the experimental areas. The experimental results showed that the overall precision of extracting cultivated land from experimental images 1, 2, and 3 with the DTCLE method was 91.7%, 88.1%, and 88.2%, respectively, while that of ECLE was 90.7%, 90.5%, and 87.0%, respectively. The accuracy of DTCLE was equivalent to that of ECLE, and DTCLE also outperformed ECLE in terms of integrity and continuity.
Funding: Supported by the National Natural Science Foundation of China (60303024)
Abstract: Web information extraction is viewed as a classification process, and a competing classification method is presented to extract Web information directly through classification. Web fragments are represented with three general features, and the similarities between fragments are then defined on the basis of these features. Through competition among fragments for different slots in information templates, the method classifies fragments into slot classes and filters out noise information. Far fewer annotated samples are needed than with rule-based methods, so the method has strong portability. Experiments show that the method performs well and is superior to a DOM-based method in information extraction. Key words: information extraction; competing classification; feature extraction; wrapper induction. CLC number: TP 311. Biography: LI Xiang-yang (1974-), male, Ph.D. candidate; research directions: information extraction, natural language processing.
Funding: This work was supported by the Deanship of Scientific Research at King Khalid University through a General Research Project under Grant Number GRP/41/42.
Abstract: Information extraction plays a vital role in natural language processing for extracting named entities and events from unstructured data. Due to exponential data growth in the agricultural sector, extracting significant information has become a challenging task. Though existing deep learning-based techniques have been applied in smart agriculture for crop cultivation, crop disease detection, weed removal, and yield production, it is still difficult to find the semantics between extracted information due to the unswerving effects of weather, soil, pest, and fertilizer data. This paper consists of two parts: an initial phase, which proposes a data preprocessing technique for removal of ambiguity in input corpora, and a second phase, which proposes a novel deep learning-based long short-term memory with rectification in the Adam optimizer and a multilayer perceptron to find agricultural named entity recognition, events, and the relations between them. The proposed algorithm has been trained and tested on four input corpora: agriculture, weather, soil, and pest & fertilizers. The experimental results have been compared with existing techniques, and it was observed that the proposed algorithm outperforms Weighted-SOM, LSTM+RAO, PLR-DBN, KNN, and Naïve Bayes on standard parameters like accuracy, sensitivity, and specificity.
Funding: The National Grand Fundamental Research 973 Program of China (G1998030414)
Abstract: In order to use the data information on the Internet, it is necessary to extract data from web pages. An HTT tree model representing HTML pages is presented. Based on the HTT model, a wrapper generation algorithm, AGW, is proposed. The AGW algorithm uses comparing and correcting techniques to generate the wrapper from the native characteristics of the HTT tree structure. The AGW algorithm can not only generate the wrapper automatically, but also rebuild the data schema easily and reduce the complexity of the computation.
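A minimal tag tree in the spirit of the HTT model can be built with Python's standard `html.parser`; the node layout and the `texts` helper are assumptions for illustration, not the AGW algorithm itself.

```python
from html.parser import HTMLParser

class TagTreeBuilder(HTMLParser):
    """Build a simple nested tag tree (a stand-in for the HTT model) while
    parsing, keeping a stack of open nodes."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1]["text"] += data.strip()

def texts(node, tag):
    """Depth-first collection of the text of every node with the given tag --
    a toy 'wrapper' that extracts repeated records from the tree."""
    found = [node["text"]] if node["tag"] == tag else []
    for child in node["children"]:
        found += texts(child, tag)
    return found

builder = TagTreeBuilder()
builder.feed("<ul><li>price: 10</li><li>price: 12</li></ul>")
```

Here `texts(builder.root, "li")` pulls the repeated record texts out of the tree; wrapper generation proper would learn which tag paths to extract by comparing sibling subtrees, as the AGW description suggests.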
Funding: The Artificial Intelligence Innovation and Development Project of Shanghai Municipal Commission of Economy and Information (No. 2019-RGZN-01081).
Abstract: Electronic medical records (EMRs) containing rich biomedical information have great potential in disease diagnosis and biomedical research. However, EMR information is usually in the form of unstructured text, which increases the cost of use and hinders its applications. In this work, an effective named entity recognition (NER) method is presented for information extraction from Chinese EMRs, achieved by word-embedding-bootstrapped deep active learning, to promote the acquisition of medical information from Chinese EMRs and to release its value. Deep active learning of a bi-directional long short-term memory network followed by a conditional random field (Bi-LSTM+CRF) is used to capture the characteristics of different information from a labeled corpus, and the word embedding models of continuous bag-of-words and skip-gram are combined with the above model to capture the text features of Chinese EMRs from an unlabeled corpus. To evaluate the performance of this method, NER tasks on Chinese EMRs with "medical history" content were used. Experimental results show that the word-embedding-bootstrapped deep active learning method using an unlabeled medical corpus can achieve better performance than other models.
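The skip-gram side of the word-embedding step starts from (center, context) training pairs generated over a sliding window, which can be sketched as follows; the tokenized sentence is a hypothetical stand-in for EMR text.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs as fed to a skip-gram model; CBOW
    (continuous bag-of-words) would instead predict the center word from the
    pooled context words."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical tokenized medical-history fragment.
sentence = ["patient", "denies", "chest", "pain"]
pairs = skipgram_pairs(sentence, window=1)
```

With a window of 1 each interior token contributes two pairs and each boundary token one, giving six pairs here; training an embedding on such pairs over the unlabeled corpus is what bootstraps the Bi-LSTM+CRF features described above.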
Abstract: Synthetic aperture radar (SAR) provides a large amount of image data for the observation and research of oceanic eddies. Using SAR images to automatically depict the shape of eddies and extract eddy information is of great significance to the study of oceanic eddies and the application of SAR eddy images. In this paper, a method of automatic shape depiction and information extraction for oceanic eddies in SAR images is proposed, targeting spiral eddies. Firstly, the skeleton image is obtained by skeletonization of the SAR image. Secondly, the logarithmic spirals detected in the skeleton image are drawn on the SAR image to depict the shape of the eddies. Finally, the eddy information is extracted based on the results of shape depiction. Sentinel-1 SAR eddy images of the Black Sea area were used for the experiment. The experimental results show that the proposed method can automatically depict the shape of eddies and extract the eddy information. The shape depiction results are consistent with the actual shapes of the eddies, and the extracted eddy information is consistent with reference information extracted manually. As a result, the validity of the method is verified.
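Fitting a logarithmic spiral r = a·e^(b·theta) to skeleton points reduces to a linear least-squares fit of ln r against theta, which can be sketched as below; the synthetic points and parameter values are illustrative, not the paper's detection pipeline.

```python
import math

def fit_log_spiral(points):
    """Least-squares fit of ln r = ln a + b*theta for points (theta, r)."""
    xs = [t for t, _ in points]
    ys = [math.log(r) for _, r in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic spiral skeleton points with a = 2, b = 0.15 (hypothetical values).
pts = [(t / 10, 2.0 * math.exp(0.15 * t / 10)) for t in range(0, 63)]
a, b = fit_log_spiral(pts)
```

On noise-free points the fit recovers a and b essentially exactly; the recovered parameters are exactly the kind of eddy information (winding tightness, size) the extraction step reports.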
Abstract: Because of the developed economy and lush vegetation in southern China, the following obstacles exist in remote sensing land surface classification: 1) diverse surface composition types; 2) undulating terrain; 3) small, fragmented land parcels; 4) indistinguishable shadows of surface objects. Clarifying how to use the concept of big data (data mining technology) and various new technologies and methods to move complex-surface remote sensing information extraction toward automation, refinement, and intelligence is a top priority. To achieve these research objectives, this paper takes Gaofen-2 satellite data produced in China as the data source and complex-surface remote sensing information extraction technology as the research object, and intelligently analyzes remote sensing information of complex surfaces after completing data collection and preprocessing. The specific extraction methods are as follows: 1) extraction of fractal texture features of Brownian motion; 2) extraction of color features; 3) extraction of vegetation indices; 4) construction of feature vectors and the corresponding classification. In this paper, fractal texture features, color features, vegetation features, and spectral features of remote sensing images are combined into a composite feature vector, which raises the feature dimension, enhances the differences between remote sensing features, facilitates their classification, and thus improves the classification accuracy of remote sensing images. The method is suitable for remote sensing information extraction of complex surfaces in southern China and can be extended to other complex surface areas in the future.
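As one concrete instance of the vegetation-index step, NDVI can be computed per pixel from near-infrared and red reflectance; the reflectance values below are hypothetical, and the abstract does not specify which index the study uses.

```python
def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index for one pixel; eps guards
    against division by zero over dark pixels."""
    return (nir - red) / (nir + red + eps)

veg = ndvi(0.5, 0.1)    # lush vegetation: high NIR, low red
bare = ndvi(0.3, 0.25)  # bare/built surface: similar NIR and red
```

Vegetated pixels land well above zero while bare surfaces stay near it, which is why a vegetation index adds a discriminative dimension to the composite feature vector described above.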
Funding: The National Natural Science Foundation of China (61903352); China Postdoctoral Science Foundation (2020M671721); Zhejiang Province Natural Science Foundation of China (LQ19F030007); Natural Science Foundation of Jiangsu Province (BK20180594); Project of the Department of Education of Zhejiang Province (Y202044960); Project of Zhejiang Tongji Vocational College of Science and Technology (TRC1904); Foundation of the Key Laboratory of Advanced Process Control for Light Industry (Jiangnan University), Ministry of Education, P.R. China (APCLI1803).
Abstract: Due to higher demands on product diversity, flexible switching between the production of different products on a single piece of equipment has become a popular solution, resulting in multiple operation modes in a single process. In order to handle such multi-mode processes, a novel double-layer structure is proposed, and the original data are decomposed into common and specific characteristics according to the relationships between variables within each mode. In addition, both low-order and high-order information are considered in each layer. The common and specific information within each mode can be captured and separated into several subspaces according to the different order information. The performance of the proposed method is validated on a numerical example and the Tennessee Eastman (TE) benchmark. Compared with previous methods, the superiority of the proposed method is demonstrated by better monitoring results.
Abstract: Key information extraction can reduce dimensional effects while evaluating the correct preferences of users during semantic data analysis. Currently, classifiers are used to maximize the performance of web-page recommendation in terms of precision and satisfaction. A recent method disambiguates contextual sentiment using conceptual prediction with robustness; however, the conceptual prediction method cannot yield the optimal solution. Context-dependent terms are primarily evaluated by constructing a linear space of context features, presuming that if terms come together in certain consumer-related reviews, they are semantically dependent; moreover, the more frequently they co-occur, the greater the semantic dependency. However, the influence of terms that coexist with each other can be part of the frequency of the terms of their semantic dependence, as they are non-integrative and their individual meaning cannot be derived. In this work, we treat the strength of a term and the influence of a term as a combinatorial optimization, called Combinatorial Optimized Linear Space Knapsack for Information Retrieval (COLSK-IR). COLSK-IR is formulated as a knapsack problem with the total weight being the "term influence" and the total value being the "term frequency" for semantic data analysis. The approach of identifying optimal solutions by jointly considering term influence and term frequency is called combinatorial optimization. Thus, we solve the knapsack formulation as an integer programming problem and perform multiple experiments on the linear space through combinatorial optimization to identify possible optimal solutions. It is evident from our experimental results that COLSK-IR provides better results than previous methods for detecting strongly dependent snippets with minimal ambiguity related to inter-sentential context during semantic data analysis.
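The knapsack formulation (total value = term frequency, total weight = term influence) can be sketched with the standard 0/1 dynamic program; the term scores and capacity below are hypothetical.

```python
def knapsack(values, weights, capacity):
    """0/1 knapsack DP over integer weights: pick terms maximizing total
    frequency (value) under a budget on total influence (weight)."""
    dp = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):   # reverse scan: each item once
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]

# Hypothetical terms: frequencies (values) and influence scores (weights).
freq = [10, 7, 12, 3]
infl = [4, 3, 6, 1]
best = knapsack(freq, infl, capacity=8)
```

With these numbers the optimum is 20 (terms 1, 2 and 4: influence 4+3+1 = 8, frequency 10+7+3 = 20), illustrating how the selection trades a high-value term (frequency 12, influence 6) against a cheaper combination.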
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 60676053) and Applied Material in Xi’an Innovation Funds, China (Grant No. XA-AM-200603)
Abstract: In order to explore how to extract more transport information from current fluctuations, a theoretical extraction scheme is presented for a single-barrier structure based on exclusion models, which include the counter-flows model and the tunnel model. The first four cumulants of these two exclusion models are computed for a single-barrier structure, and their characteristics are obtained. A scheme based on the first three cumulants is devised to check whether a transport process follows the counter-flows model, the tunnel model, or neither. Time series generated by Monte Carlo techniques are adopted to validate the extraction procedure, and the result is reasonable.
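The first four cumulants used in such a scheme follow from the sample moments via k1 = mean, k2 = mu2, k3 = mu3, k4 = mu4 - 3*mu2^2 (mu_p being the p-th central moment); a minimal implementation:

```python
def first_four_cumulants(xs):
    """Sample cumulants of a time series of counts: k1 = mean, k2 = variance,
    k3 = third central moment, k4 = mu4 - 3*mu2**2."""
    n = len(xs)
    mean = sum(xs) / n
    mu = lambda p: sum((x - mean) ** p for x in xs) / n
    mu2, mu3, mu4 = mu(2), mu(3), mu(4)
    return mean, mu2, mu3, mu4 - 3 * mu2 ** 2

# Sanity check on the simplest 0/1 (Bernoulli-like) series.
k1, k2, k3, k4 = first_four_cumulants([0, 1])
```

For the symmetric 0/1 series the cumulants are (0.5, 0.25, 0.0, -0.125), matching the known Bernoulli(1/2) values; applying the same estimator to Monte Carlo time series of transferred charge is how the first three cumulants can be compared against the two exclusion models.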
Funding: Supported by the National High Technology Research and Development Program of China (863 Program) (2009AA012200) and Henan Province Science and Technology Funding Projects (SP09JH11158)
Abstract: Aiming at the problem that virtual machine information cannot be extracted completely, we extend the typical virtual machine information extraction model and propose a perception mechanism for virtualization systems based on a storage covert channel to overcome the effect of the semantic gap. Taking advantage of the undetectability of the covert channel, a secure channel is established between the vip machine and the virtual machine monitor to pass data directly. The vip machine can pass the control information of a malicious process to the virtual machine monitor by using the VMCALL instruction and shared memory. By parsing critical information in the process control structure, the virtual machine monitor can terminate the malicious processes. The test results show that the proposed mechanism can clear user-level malicious programs in the virtual machine effectively and covertly. Meanwhile, its performance overhead is about the same as that of other mainstream monitoring modes.
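As a loose user-space analogy to the shared-memory control channel (not the VMCALL-based mechanism itself), a fixed-format control message can be placed in a shared buffer by one side and parsed by the other; the `KILL:<pid>` message layout is an invented example.

```python
from multiprocessing import shared_memory

# Writer side: place a fixed-format control message in the shared buffer.
shm = shared_memory.SharedMemory(create=True, size=64)
try:
    message = b"KILL:1337"              # hypothetical action and target PID
    shm.buf[:len(message)] = message

    # Reader side: parse the control message out of the same buffer.
    raw = bytes(shm.buf[:len(message)]).decode()
    action, pid = raw.split(":")
finally:
    shm.close()
    shm.unlink()
```

In the paper's setting the "reader" is the virtual machine monitor, which would act on the parsed process identifier by walking the guest's process control structures; here both sides run in one process purely to show the buffer-as-channel idea.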
Funding: Projects (61374140, 61673173) supported by the National Natural Science Foundation of China; Projects (222201717006, 222201714031) supported by the Fundamental Research Funds for the Central Universities, China
Abstract: A two-step information extraction method is presented to capture specific index-related information more accurately. In the first step, the overall process variables are separated into two sets based on the Pearson correlation coefficient: one contains process variables strongly related to the specific index, and the other contains process variables weakly related to it. Through performing principal component analysis (PCA) on the two sets, the directions of the latent variables change. In other words, the correlation between latent variables in the strongly correlated set and the specific index may become weaker, while the correlation between latent variables in the weakly correlated set and the specific index may be enhanced. In the second step, each of the two sets is further divided into a subset strongly related to the specific index and a subset weakly related to it, from the perspective of latent variables, again using the Pearson correlation coefficient. The two strongly related subsets form a new subspace related to the specific index. Then, a hybrid monitoring strategy based on the predicted specific index using partial least squares (PLS) and a T² statistics-based method is proposed for specific index-related process monitoring using comprehensive information. The predicted specific index reflects real-time information about the specific index, and T² statistics are used to monitor specific index-related information. Finally, the proposed method is applied to the Tennessee Eastman (TE) process. The results indicate the effectiveness of the proposed method.
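The first splitting step can be sketched with a plain Pearson coefficient and a correlation threshold; the variable names, series, and the 0.5 threshold are illustrative assumptions.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_by_correlation(variables, index, threshold=0.5):
    """Step 1 of the method: split process variables into strongly and weakly
    index-related sets by |r| against the specific index."""
    strong, weak = {}, {}
    for name, series in variables.items():
        (strong if abs(pearson(series, index)) >= threshold else weak)[name] = series
    return strong, weak

# Hypothetical specific-index samples and two process variables.
index = [1.0, 2.0, 3.0, 4.0, 5.0]
variables = {
    "temp": [1.1, 2.0, 2.9, 4.2, 5.1],    # tracks the index closely
    "noise": [3.0, -1.0, 2.5, 0.0, 1.0],  # essentially unrelated
}
strong, weak = split_by_correlation(variables, index)
```

PCA would then be run separately on each set, and the same thresholding repeated on the resulting latent variables to assemble the index-related subspace described above.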