Processing police incident data in public security involves complex natural language processing (NLP) tasks, including information extraction. This data contains extensive entity information—such as people, locations, and events—while also involving reasoning tasks like personnel classification, relationship judgment, and implicit inference. Moreover, using models to extract information from police incident data faces a significant challenge: data scarcity, which limits the effectiveness of traditional rule-based and machine-learning methods. To address these challenges, we propose TIPS. In collaboration with public security experts, we used de-identified police incident data to create templates that enable large language models (LLMs) to populate data slots and generate simulated data, enhancing data density and diversity. We then designed schemas to efficiently manage complex extraction and reasoning tasks, constructing a high-quality dataset and fine-tuning multiple open-source LLMs. Experiments showed that the fine-tuned ChatGLM-4-9B model achieved an F1 score of 87.14%, nearly 30% higher than the base model, significantly reducing error rates. Manual corrections further improved performance by 9.39%. This study demonstrates that combining large-scale pre-trained models with limited high-quality domain-specific data can greatly enhance information extraction in low-resource environments, offering a new approach for intelligent public security applications.
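To illustrate the template-and-slot idea, the sketch below populates hypothetical incident templates with sampled slot values. The template text, slot names, and values are all invented placeholders; in TIPS an LLM, rather than random sampling, would fill the slots and paraphrase the output to increase diversity.

```python
import random

# Hypothetical incident templates with named slots; the real TIPS templates
# were designed with public security experts on de-identified data.
TEMPLATES = [
    "At {time}, {person} reported a {incident_type} incident near {location}.",
    "{person} was involved in a {incident_type} dispute at {location} around {time}.",
]

SLOT_VALUES = {  # illustrative seed values for each slot
    "person": ["a local resident", "a shop owner", "a delivery driver"],
    "incident_type": ["theft", "traffic", "noise"],
    "location": ["the central market", "Riverside Road", "a parking garage"],
    "time": ["08:30", "14:05", "22:40"],
}

def fill_template(template: str) -> str:
    """Populate every slot in a template with a sampled value."""
    return template.format(**{k: random.choice(v) for k, v in SLOT_VALUES.items()})

# Simulate the slot-population step only; an LLM would add paraphrasing.
simulated = [fill_template(random.choice(TEMPLATES)) for _ in range(5)]
print("\n".join(simulated))
```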
Background: Acquiring relevant information about procurement targets is fundamental to procuring medical devices. Although traditional Natural Language Processing (NLP) and Machine Learning (ML) methods have improved information retrieval efficiency to a certain extent, they exhibit significant limitations in adaptability and accuracy when dealing with procurement documents characterized by diverse formats and a high degree of unstructured content. The emergence of Large Language Models (LLMs) offers new possibilities for efficient procurement information processing and extraction. Methods: This study collected procurement transaction documents from public procurement websites and proposed a procurement Information Extraction (IE) method based on LLMs. Unlike traditional approaches, this study systematically explores the applicability of LLMs to both structured and unstructured entities in procurement documents, addressing the challenges posed by format variability and content complexity. Furthermore, an optimized prompt framework tailored for procurement document extraction tasks is developed to enhance the accuracy and robustness of IE. The aim is to process and extract key information from medical device procurement quickly and accurately, meeting stakeholders' demands for precision and timeliness in information retrieval. Results: Experimental results demonstrate that, compared to traditional methods, the proposed approach achieves an F1 score of 0.9698, representing a 4.85% improvement over the best baseline model. Moreover, both recall and precision are close to 97%, significantly outperforming other models and exhibiting exceptional overall recognition capability. Notably, further analysis reveals that the proposed method consistently maintains high performance across both structured and unstructured entities while balancing recall and precision effectively, demonstrating its adaptability to varying document formats. Ablation experiments validate the effectiveness of the proposed prompting strategy. Conclusion: This study also examines the challenges and potential improvements of the proposed method in IE tasks and provides insights into its feasibility for real-world deployment and its application directions, further clarifying its adaptability and value. The method not only exhibits significant advantages in medical device procurement but also holds promise for informing information processing and decision support in other domains.
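A minimal sketch of what a prompt-framework extraction call could look like, assuming a generic chat-completion client and an invented field schema; the paper's actual prompt design is not reproduced here.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; the paper's model and
    decoding settings are not specified here."""
    raise NotImplementedError

# Illustrative prompt skeleton: task instruction + output schema + a
# few-shot example + the document, roughly the ingredients of an
# optimized IE prompt. Field names are hypothetical.
PROMPT_TEMPLATE = """You are an assistant extracting procurement information.
Return strict JSON with the fields: supplier, device_name, quantity,
unit_price, contract_date. Use null for fields absent from the text.

Example:
Text: "XYZ Hospital purchased 3 infusion pumps from MedCo on 2024-05-01."
JSON: {{"supplier": "MedCo", "device_name": "infusion pump", "quantity": 3,
        "unit_price": null, "contract_date": "2024-05-01"}}

Text: "{document}"
JSON:"""

def extract(document: str) -> dict:
    raw = call_llm(PROMPT_TEMPLATE.format(document=document))
    return json.loads(raw)  # downstream validation would catch malformed JSON
```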
With the rapid expansion of social media, analyzing emotions and their causes in texts has gained significant importance. Emotion-cause pair extraction enables the identification of causal relationships between emotions and their triggers within a text, facilitating a deeper understanding of expressed sentiments and their underlying reasons. This comprehension is crucial for making informed strategic decisions in various business and societal contexts. However, recent research approaches employing multi-task learning frameworks often face challenges such as the inability to simultaneously model extracted features and their interactions, or inconsistencies in label prediction between emotion-cause pair extraction and independent assistant tasks like emotion and cause extraction. To address these issues, this study proposes an emotion-cause pair extraction methodology that incorporates joint feature encoding and task alignment mechanisms. The model consists of two primary components. First, joint feature encoding simultaneously generates features for emotion-cause pairs and clauses, enhancing feature interactions between emotion clauses, cause clauses, and emotion-cause pairs. Second, a task alignment technique is applied to reduce the labeling distance between emotion-cause pair extraction and the two assistant tasks, capturing deep semantic interactions among tasks. The proposed method is evaluated on a Chinese benchmark corpus using 10-fold cross-validation, assessing key performance metrics such as precision, recall, and F1 score. Experimental results demonstrate that the model achieves an F1 score of 76.05%, surpassing the state of the art by 1.03%. The proposed model exhibits significant improvements in emotion-cause pair extraction (ECPE) and cause extraction (CE) compared to existing methods, validating its effectiveness. This research introduces a novel approach based on joint feature encoding and task alignment mechanisms, contributing to advancements in emotion-cause pair extraction. However, the study's limitation lies in the data sources, potentially restricting the generalizability of the findings.
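As one plausible reading of the joint feature-encoding step, the sketch below builds a grid of candidate emotion-cause pair representations from shared clause encodings; the dimensions and concatenation scheme are assumptions, not the paper's exact architecture.

```python
import torch

def build_pair_features(clause_repr: torch.Tensor) -> torch.Tensor:
    """clause_repr: (n_clauses, d) encodings from a shared encoder.
    Returns (n_clauses, n_clauses, 2d) candidate pair features, where
    pair (i, j) concatenates clause i (emotion) and clause j (cause)."""
    n, d = clause_repr.shape
    emo = clause_repr.unsqueeze(1).expand(n, n, d)  # row i: emotion clause i
    cau = clause_repr.unsqueeze(0).expand(n, n, d)  # col j: cause clause j
    return torch.cat([emo, cau], dim=-1)

# A pair scorer and two clause-level heads (emotion, cause) would share the
# encoder, and a task-alignment loss would pull their label distributions
# toward the pair-level predictions.
pairs = build_pair_features(torch.randn(4, 128))
print(pairs.shape)  # torch.Size([4, 4, 256])
```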
Web data extraction has become a key technology for extracting valuable data from websites. At present, most extraction methods based on rule learning, visual patterns, or tree matching have limited performance on complex web pages. By analyzing various statistical characteristics of HTML elements in web documents, this paper proposes an unsupervised web data extraction method based on statistical features: it first traverses the HTML DOM parse tree and computes a statistical matrix of the elements, then locates data records using clustering and heuristic rules that capture the inherent links between the visual characteristics of data-record areas and the statistical characteristics of the HTML nodes. The method suits data-record extraction from both single pages and multiple pages, has strong generality, and needs no training. Experiments show that its accuracy and efficiency both surpass current data extraction methods.
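A minimal sketch of the pipeline: traverse the DOM, compute per-element statistical features, and cluster them to locate record regions. The feature set and cluster count are illustrative assumptions, not the paper's statistical matrix.

```python
import numpy as np
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans

def element_stats(el, depth=0):
    """Yield (element, feature vector) for every tag in the DOM subtree.
    Features (an illustrative choice): subtree text length, number of
    child tags, number of links, and depth."""
    children = [c for c in el.children if getattr(c, "name", None)]
    yield el, [len(el.get_text(strip=True)), len(children),
               len(el.find_all("a")), depth]
    for c in children:
        yield from element_stats(c, depth + 1)

html = ("<html><body><div class='item'>A <a href='#'>x</a></div>"
        "<div class='item'>B <a href='#'>y</a></div><p>footer</p></body></html>")
soup = BeautifulSoup(html, "html.parser")
elements, feats = zip(*element_stats(soup.html))
labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.array(feats, dtype=float))
# Heuristic rules (e.g., repeated siblings with similar statistics) would
# then pick the cluster corresponding to the data-record region.
for el, lab in zip(elements, labels):
    print(lab, el.name, el.get("class"))
```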
The development of precision agriculture demands high accuracy and efficiency in cultivated land information extraction. As a new means of ground monitoring in recent years, the unmanned aerial vehicle (UAV) low-height remote sensing technique, which is flexible, efficient, low-cost, and high-resolution, is widely applied to investigating various resources. On this basis, a novel extraction method for cultivated land information based on Deep Convolutional Neural Network and Transfer Learning (DTCLE) was proposed. First, linear features (roads, ridges, etc.) were excluded using a Deep Convolutional Neural Network (DCNN). Next, the feature extraction method learned by the DCNN was applied to cultivated land information extraction by introducing a transfer learning mechanism. Last, cultivated land information was extracted with both DTCLE and eCognition-based cultivated land information extraction (ECLE) for comparison. Pengzhou County and Guanghan County, Sichuan Province, were selected as the experimental areas. The experimental results showed that the overall precision of DTCLE for experimental images 1, 2, and 3 was 91.7%, 88.1%, and 88.2% respectively, while that of ECLE was 90.7%, 90.5%, and 87.0%. The accuracy of DTCLE was thus equivalent to that of ECLE, and DTCLE also outperformed ECLE in terms of integrity and continuity.
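The transfer-learning mechanism can be sketched as reusing a pre-trained convolutional backbone and retraining only a new classification head. The backbone below (an ImageNet ResNet-18) is a stand-in for the paper's DCNN, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse convolutional features learned on another task and retrain only a
# new head to separate cultivated land from other cover types.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False                           # freeze learned features
backbone.fc = nn.Linear(backbone.fc.in_features, 2)  # cultivated vs. other

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

patches = torch.randn(8, 3, 224, 224)   # dummy batch of UAV image patches
labels = torch.randint(0, 2, (8,))
loss = criterion(backbone(patches), labels)
loss.backward()
optimizer.step()
print(float(loss))
```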
Web information extraction is viewed as a classification process, and a competing classification method is presented to extract Web information directly through classification. Web fragments are represented with three general features, and the similarities between fragments are then defined on the basis of these features. Through competition among fragments for different slots in information templates, the method classifies fragments into slot classes and filters out noise information. Far fewer annotated samples are needed than in rule-based methods, so the method has strong portability. Experiments show that the method performs well and is superior to a DOM-based method in information extraction.
Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities' websites. The information automatically extracted can potentially be updated with a frequency higher than once per year and be safe from manipulations or misinterpretations. Moreover, this approach allows flexibility in collecting indicators about the efficiency of universities' websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for "profiling" the analyzed universities. Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information needed to compute our indicators has been extracted from the universities' websites using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB in a semi-structured form to allow information to be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the Web has been combined with university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All the above was used to perform a clusterization of 79 Italian universities based on structural and digital indicators. Findings: The main findings of this study concern the evaluation of universities' potential for digitalization, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities' websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features using clustering techniques working with the above indicators. Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad. Practical implications: The approach proposed in this study and its illustration on Italian universities show the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems. Originality/value: This work applies to university websites, for the first time, some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition, and nontrivial text mining operations (Bruni & Bianchi, 2020).
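A toy sketch of the webometric pipeline: scrape simple indicators from each university's homepage, combine them with a structural indicator, and cluster the universities. URLs, indicators, and the cluster count are hypothetical placeholders; the study's indicator set and 79-university dataset are far richer.

```python
import numpy as np
import requests
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def site_indicators(url: str) -> list:
    """Scrape a homepage and compute toy digital indicators: page size,
    link count, and image count. The paper combines content, structure,
    and usage mining into much richer indicators."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [len(html), len(soup.find_all("a")), len(soup.find_all("img"))]

# Hypothetical inputs: homepage URL plus one structural indicator per
# university (e.g., enrolment taken from the ETER register).
universities = {
    "uni-a": ("https://www.uni-a.example", 25000),
    "uni-b": ("https://www.uni-b.example", 8000),
    "uni-c": ("https://www.uni-c.example", 41000),
}
rows = [site_indicators(url) + [students]
        for url, students in universities.values()]
X = StandardScaler().fit_transform(np.array(rows, dtype=float))
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(dict(zip(universities, labels)))  # digital/structural profile groups
```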
Information extraction plays a vital role in natural language processing for extracting named entities and events from unstructured data. Due to the exponential data growth in the agricultural sector, extracting significant information has become a challenging task. Though existing deep learning-based techniques have been applied in smart agriculture for crop cultivation, crop disease detection, weed removal, and yield production, it remains difficult to find the semantics between extracted information due to the intertwined effects of weather, soil, pest, and fertilizer data. This paper consists of two parts: an initial phase, which proposes a data preprocessing technique for removing ambiguity in input corpora, and a second phase, which proposes a novel deep learning-based long short-term memory model with a rectification in the Adam optimizer and a multilayer perceptron to identify agricultural named entities, events, and the relations between them. The proposed algorithm has been trained and tested on four input corpora: agriculture, weather, soil, and pest & fertilizers. The experimental results were compared with existing techniques, and it was observed that the proposed algorithm outperforms Weighted-SOM, LSTM+RAO, PLR-DBN, KNN, and Naïve Bayes on standard parameters such as accuracy, sensitivity, and specificity.
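A minimal sketch of the tagging backbone described here (a Bi-LSTM feeding a per-token multilayer perceptron), written in PyTorch under assumed dimensions and an assumed BIO-style label set; the paper's rectified Adam variant and its corpus-specific preprocessing are not reproduced.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Token-level NER tagger: embeddings -> Bi-LSTM -> per-token MLP."""
    def __init__(self, vocab_size, n_tags, emb=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_tags))

    def forward(self, token_ids):
        out, _ = self.lstm(self.emb(token_ids))
        return self.mlp(out)  # (batch, seq, n_tags) tag scores

# Toy batch: 2 sentences of 5 tokens; tags could follow a BIO scheme such
# as O / B-CROP / I-CROP / B-PEST ... (an illustrative label set).
model = BiLSTMTagger(vocab_size=1000, n_tags=5)
scores = model(torch.randint(0, 1000, (2, 5)))
print(scores.shape)  # torch.Size([2, 5, 5])
```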
This paper presents an innovative Soft Design Science Methodology for improving information systems security using a multi-layered security approach. The study applied Soft Design Science Methodology to address the problematic situation of how information systems security can be improved, and compounded it with a mixed research methodology; this holistic approach enabled research methodology triangulation. The study assessed security requirements, developed a framework for improving information systems security, and carried out a maturity level assessment to determine the security status quo in the education sector in Tanzania. The study identified the security requirements gap (IT security controls, IT security measures) using ISO/IEC 21827: Systems Security Engineering-Capability Maturity Model (SSE-CMM) with a rating scale of 0 - 5. The results show that the maturity level across security domains is 0.44 out of 5, indicating that the implementation of IT security controls and security measures for ensuring security goals is lacking or conducted ad hoc. Thus, to improve the security of information systems, organisations should implement security controls and security measures in each security domain (multi-layer security). This research provides a framework for enhancing information systems security during the capture, processing, storage and transmission of information, and makes several practical contributions. Firstly, it contributes to the body of knowledge of information systems security by providing a set of security requirements for ensuring information systems security. Secondly, it contributes empirical evidence on how information systems security can be improved. Thirdly, it demonstrates the applicability of Soft Design Science Methodology to addressing the problematic situation in information systems security. The research findings can be used by decision makers and lawmakers to improve existing cyber security laws and to enact laws for data privacy and the sharing of open data.
Synthetic aperture radar (SAR) provides a large amount of image data for the observation and research of oceanic eddies. Using SAR images to automatically depict the shape of eddies and extract eddy information is of great significance to the study of oceanic eddies and the application of SAR eddy images. In this paper, a method of automatic shape depiction and information extraction for oceanic eddies in SAR images is proposed, targeting spiral eddies. Firstly, the skeleton image is obtained by skeletonization of the SAR image. Secondly, the logarithmic spirals detected in the skeleton image are drawn on the SAR image to depict the shape of the eddies. Finally, the eddy information is extracted based on the results of the shape depiction. Sentinel-1 SAR eddy images of the Black Sea area were used for the experiment. The experimental results show that the proposed method can automatically depict the shape of eddies and extract eddy information. The shape depiction results are consistent with the actual shape of the eddies, and the extracted eddy information is consistent with reference information extracted by manual operation, verifying the validity of the method.
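Since a logarithmic spiral satisfies r = a·e^(bθ), skeleton pixels around a candidate eddy center fall on a straight line in (θ, ln r) space, which suggests a simple detection test. The sketch below fits that line on synthetic points; the paper's skeletonization and detection pipeline is not reproduced.

```python
import numpy as np

def fit_log_spiral(xs, ys, cx, cy):
    """Fit r = a * exp(b * theta) to points about center (cx, cy) by a
    linear regression of ln r on the unwrapped polar angle theta."""
    theta = np.unwrap(np.arctan2(ys - cy, xs - cx))
    r = np.hypot(xs - cx, ys - cy)
    b, log_a = np.polyfit(theta, np.log(r), 1)      # ln r = b*theta + ln a
    residual = np.mean((np.log(r) - (b * theta + log_a)) ** 2)
    return np.exp(log_a), b, residual               # a, b, goodness of fit

# Synthetic spiral points stand in for SAR skeleton pixels.
t = np.linspace(0.5, 3 * np.pi, 200)
x = 2.0 * np.exp(0.15 * t) * np.cos(t)
y = 2.0 * np.exp(0.15 * t) * np.sin(t)
a, b, err = fit_log_spiral(x, y, 0.0, 0.0)
print(a, b, err)  # ~2.0, ~0.15, ~0 for a clean spiral
```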
Electronic medical records (EMRs) containing rich biomedical information have great potential in disease diagnosis and biomedical research. However, EMR information is usually in the form of unstructured text, which raises the cost of use and hinders its applications. In this work, an effective named entity recognition (NER) method is presented for information extraction from Chinese EMRs, achieved by word-embedding-bootstrapped deep active learning, to promote the acquisition of medical information from Chinese EMRs and to release its value. Deep active learning with a bi-directional long short-term memory network followed by a conditional random field (Bi-LSTM+CRF) is used to capture the characteristics of different information from the labeled corpus, and the word embedding models of continuous bag-of-words and skip-gram are combined into the above model to capture the text features of Chinese EMRs from the unlabeled corpus. To evaluate the performance of this method, NER tasks on Chinese EMRs with "medical history" content were used. Experimental results show that the word-embedding-bootstrapped deep active learning method using an unlabeled medical corpus achieves better performance than other models.
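The distinctive active-learning loop can be sketched as uncertainty sampling over unlabeled sentences. The `fit`/`predict_proba` interface and the least-confidence score below are hypothetical illustrations, not the paper's exact procedure.

```python
import numpy as np

def least_confidence(tag_probs: np.ndarray) -> float:
    """Sentence-level uncertainty: one minus the product of the most likely
    tag probabilities over all tokens (one common choice; the paper does
    not prescribe this exact score)."""
    return 1.0 - float(np.prod(tag_probs.max(axis=-1)))

def active_learning_round(model, labeled, unlabeled, budget=50):
    """One round of deep active learning for EMR NER annotation. `model`
    is any tagger exposing fit() and per-token predict_proba(), e.g. a
    Bi-LSTM+CRF wrapper (a hypothetical interface, for illustration)."""
    model.fit(labeled)
    scores = [least_confidence(model.predict_proba(sent)) for sent in unlabeled]
    picked = np.argsort(scores)[-budget:]   # most uncertain sentences
    # The picked sentences would be sent to annotators and moved into the
    # labeled pool before the next round.
    return picked
```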
Because of the developed economy and lush vegetation in southern China, the following obstacles or difficulties exist in remote sensing land surface classification: 1) diverse surface composition types; 2) undulating terrain; 3) small fragmented parcels; 4) indistinguishable shadows of surface objects. A top priority is therefore to clarify how big data concepts (data mining technology) and various new technologies and methods can push complex-surface remote sensing information extraction toward automation, refinement, and intelligence. To achieve these research objectives, the paper takes Gaofen-2 satellite data produced in China as the data source and complex-surface remote sensing information extraction technology as the research object, intelligently analyzing the remote sensing information of complex surfaces after completing data collection and preprocessing. The specific extraction methods are as follows: 1) extraction of fractal texture features of Brownian motion; 2) extraction of color features; 3) extraction of vegetation indices; 4) construction of feature vectors and the corresponding classification. In this paper, fractal texture features, color features, vegetation features, and spectral features of remote sensing images are combined into a composite feature vector. This raises the feature dimensionality and strengthens the differences among remote sensing features, which makes them easier to classify and thus improves the classification accuracy of remote sensing images. The approach is suitable for remote sensing information extraction of complex surfaces in southern China and can be extended to other complex surface areas in the future.
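The combined feature vector can be sketched as a simple concatenation of the four feature families. The stand-in features below (local variance in place of the fractal Brownian texture, band means for color, band standard deviations for spectra) and the assumed Gaofen-2 band order are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def combined_feature_vector(patch_bands: np.ndarray) -> np.ndarray:
    """Concatenate stand-ins for the four feature families from a
    (bands, H, W) patch, assuming blue/green/red/NIR band order."""
    red, nir = patch_bands[2], patch_bands[3]
    texture = [float(patch_bands.var())]                 # texture stand-in
    color = patch_bands.mean(axis=(1, 2)).tolist()       # per-band means
    ndvi = [float(((nir - red) / (nir + red + 1e-6)).mean())]
    spectral = patch_bands.std(axis=(1, 2)).tolist()     # spectral stand-in
    return np.array(texture + color + ndvi + spectral)

# Dummy patches: 4 bands of 8x8 pixels; a classifier is then trained on
# the combined vectors.
X = np.stack([combined_feature_vector(np.random.rand(4, 8, 8))
              for _ in range(20)])
y = np.random.randint(0, 3, 20)   # land-cover classes (dummy labels)
clf = SVC().fit(X, y)
print(clf.predict(X[:3]))
```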
GIS, a powerful tool for processing spatial data, is advantageous in its spatial overlaying. In this paper, GIS is applied to the extraction of geological information. Information associated with mineral resources is chosen to delineate geo-anomalies, the basis of ore-forming anomalies and of mineral deposit location. This application is illustrated with an example in the Weixi area, Yunnan Province.
OOV term translation plays an important role in natural language processing. Although many researchers have endeavored to solve the OOV term translation problem, no existing method offers definitions or context information for OOV terms, and none focuses on cross-language definition retrieval for OOV terms. Moreover, it has always been difficult to evaluate the correctness of an OOV term translation without domain-specific knowledge and correct references. Our English definition ranking method differentiates the types of OOV terms and applies different methods for translation extraction; it also extracts multilingual context information and monolingual definitions of OOV terms. In addition, we propose a novel cross-language definition retrieval system for OOV terms, together with an automatic re-evaluation method to assess the correctness of OOV translations and definitions. Our methods achieve high performance against existing methods.
Traditional pattern representations in information extraction lack the ability to represent domain-specific concepts and are therefore devoid of flexibility. To overcome these restrictions, an enhanced pattern representation is designed which includes ontological concepts, neighboring-tree structures, and soft constraints. An information-extraction inference engine based on hypothesis generation and conflict resolution is implemented. The proposed technique is successfully applied to an information extraction system for the Chinese-language query front-end of a job-recruitment search engine.
In order to explore how to extract more transport information from current fluctuations, a theoretical extraction scheme is presented for a single-barrier structure based on exclusion models, which include the counter-flows model and the tunnel model. The first four cumulants of these two exclusion models are computed in a single-barrier structure, and their characteristics are obtained. A scheme using the first three cumulants is devised to check whether a transport process follows the counter-flows model, the tunnel model, or neither of them. Time series generated by Monte Carlo techniques are adopted to validate the extraction procedure, and the result is reasonable.
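For reference, the first four cumulants of the transferred-particle statistics follow the standard definitions in terms of central moments of the particle number n; these are textbook relations, not taken from the paper:

```latex
\begin{aligned}
C_1 &= \langle n \rangle, \\
C_2 &= \langle (\delta n)^2 \rangle, \\
C_3 &= \langle (\delta n)^3 \rangle, \\
C_4 &= \langle (\delta n)^4 \rangle - 3\,\langle (\delta n)^2 \rangle^2,
\qquad \delta n = n - \langle n \rangle .
\end{aligned}
```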
Satellite remote sensing data are usually used to analyze the spatial distribution pattern of geological structures and generally serve as a significant means for the identification of alteration zones. Based on Landsat Enhanced Thematic Mapper (ETM+) data, which have better spectral resolution (8 bands) and spatial resolution (15 m in the PAN band), synthesis processing techniques were presented to fulfill alteration information extraction: data preparation, vegetation indices and band ratios, and expert-classifier-based classification. These techniques have been implemented in the MapGIS-RSP software (version 1.0), developed by the Wuhan Zondy Cyber Technology Co., Ltd, China. In applying them to extract alteration information in the Zhaoyuan (招远) gold mines, Shandong (山东) Province, China, several hydrothermally altered zones (including two new sites) were found after satellite imagery interpretation coupled with field surveys. It is concluded that these synthesis processing techniques are useful approaches applicable to a wide range of gold-mineralized alteration information extraction.
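As an illustration of the vegetation-index and band-ratio step, the sketch below computes NDVI and the classic ETM+ band 5/7 clay-alteration ratio on dummy arrays; the exact ratios and thresholds used in MapGIS-RSP are assumptions here.

```python
import numpy as np

# Standard ETM+ band roles: band 3 = red, band 4 = NIR, bands 5 and 7 = SWIR.
def ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    return (nir - red) / (nir + red + 1e-6)

def band_ratio(num: np.ndarray, den: np.ndarray) -> np.ndarray:
    return num / (den + 1e-6)

b3, b4, b5, b7 = (np.random.rand(64, 64) for _ in range(4))  # dummy scene
veg_mask = ndvi(b3, b4) > 0.3        # suppress vegetated pixels first
clay_like = band_ratio(b5, b7)       # 5/7 ratio highlights clay/hydroxyl alteration
candidates = (clay_like > np.percentile(clay_like, 95)) & ~veg_mask
print(int(candidates.sum()), "candidate alteration pixels")
```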
A two-step information extraction method is presented to capture specific index-related information more accurately. In the first step, the overall process variables are separated into two sets based on the Pearson correlation coefficient: process variables strongly related to the specific index and process variables weakly related to it. Performing principal component analysis (PCA) on the two sets changes the directions of the latent variables. In other words, the correlation between latent variables in the strongly correlated set and the specific index may become weaker, while the correlation between latent variables in the weakly correlated set and the specific index may be enhanced. In the second step, each of the two sets is further divided, from the perspective of latent variables and again using the Pearson correlation coefficient, into a subset strongly related to the specific index and a subset weakly related to it. The two strongly related subsets form a new subspace related to the specific index. Then, a hybrid monitoring strategy, based on the specific index predicted by partial least squares (PLS) together with a T² statistics-based method, is proposed for specific index-related process monitoring using comprehensive information: the predicted specific index reflects real-time information about the index, while the T² statistics monitor the index-related information. Finally, the proposed method is applied to the Tennessee Eastman (TE) process, and the results indicate its effectiveness.
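The two-step screening can be sketched directly: variables are first split by Pearson correlation with the index, each set is rotated by PCA, and the latent variables are re-screened against the index. The threshold and component counts below are illustrative, and the PLS prediction and T² monitoring stages are omitted.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # process variables
y = X[:, 0] * 2 + X[:, 3] - X[:, 7] + rng.normal(scale=0.1, size=500)

# Step 1: split variables by Pearson correlation with the specific index y.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
strong, weak = X[:, corr >= 0.3], X[:, corr < 0.3]   # threshold is illustrative

# Step 2: PCA each set, then re-screen the *latent* variables against y,
# since the rotation can strengthen or weaken index correlation.
def index_related_scores(block, target, thr=0.3):
    t = PCA(n_components=min(block.shape[1], 5)).fit_transform(block)
    keep = [j for j in range(t.shape[1])
            if abs(np.corrcoef(t[:, j], target)[0, 1]) >= thr]
    return t[:, keep]

subspace = np.hstack([index_related_scores(strong, y),
                      index_related_scores(weak, y)])
print(subspace.shape)   # latent variables forming the index-related subspace
```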
A cryptosystem based on computation of square roots of complex integers modulo composite n is described in this paper. The paper provides an algorithm for extracting a square root of a Gaussian integer. Various properties of square roots and a method for finding Gaussian generators are demonstrated; the generators can be instrumental in constructing other cryptosystems. It is also shown how to significantly reduce the average complexity of decryption per block of ciphertext.
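To make the underlying object concrete, the sketch below extracts a square root of a Gaussian integer modulo a small composite n by brute force. The paper's algorithm is far more efficient; this only illustrates the congruence being solved.

```python
def gaussian_sqrt_mod(c, d, n):
    """Find (a, b) with (a + b*i)^2 = c + d*i (mod n), if one exists.
    Uses (a + b*i)^2 = (a^2 - b^2) + (2*a*b)*i."""
    for a in range(n):
        for b in range(n):
            if (a * a - b * b) % n == c % n and (2 * a * b) % n == d % n:
                return a, b
    return None

n = 15                        # small composite modulus (3 * 5), for illustration
a, b = 4, 7                   # a known root
c, d = (a * a - b * b) % n, (2 * a * b) % n
print((c, d), "->", gaussian_sqrt_mod(c, d, n))  # recovers some square root
```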
Information extraction techniques on the Web are a current research hotspot. Many information extraction techniques based on different principles have appeared, with differing capabilities. We classify the existing information extraction techniques by the principle of information extraction and analyze the methods and principles of semantic information adding, schema defining, rule expression, semantic item locating, and object locating in these approaches. Based on this survey and analysis, several open problems are discussed.