Standardized datasets are foundational to healthcare informatization by enhancing data quality and unleashing the value of data elements. Using bibliometrics and content analysis, this study examines China's healthcare dataset standards from 2011 to 2025. It analyzes their evolution across types, applications, institutions, and themes, highlighting key achievements including substantial growth in quantity, optimized typology, expansion into innovative application scenarios such as health decision support, and broadened institutional involvement. The study also identifies critical challenges, including imbalanced development, insufficient quality control, and a lack of essential metadata—such as authoritative data element mappings and privacy annotations—which hampers the delivery of intelligent services. To address these challenges, the study proposes a multi-faceted strategy focused on optimizing the standard system's architecture, enhancing quality and implementation, and advancing both data governance—through authoritative tracing and privacy protection—and intelligent service provision. These strategies aim to promote the application of dataset standards, thereby fostering and securing the development of new productive forces in healthcare.
Deep neural networks provide accurate results for most applications. However, they need a big dataset to train properly. Providing a big dataset is a significant challenge in most applications. Image augmentation refers to techniques that increase the amount of image data. Common operations for image augmentation include changes in illumination, rotation, contrast, size, viewing angle, and others. Recently, Generative Adversarial Networks (GANs) have been employed for image generation. However, like image augmentation methods, GAN approaches can only generate images that are similar to the original images. Therefore, they also cannot generate new classes of data. Texture images present more challenges than general images, and generating textures is more complex than creating other types of images. This study proposes a gradient-based deep neural network method that generates a new class of texture. It is possible to rapidly generate new classes of textures using different kernels from pre-trained deep networks. After generating new textures for each class, the number of textures increases through image augmentation. During this process, several techniques are proposed to automatically remove incomplete and similar textures that are created. The proposed method is faster than some well-known generative networks by around 4 to 10 times. In addition, the quality of the generated textures surpasses that of these networks. The proposed method can generate textures that surpass those of some GANs and parametric models in certain image quality metrics. It can provide a big texture dataset to train deep networks. A new big texture dataset is created artificially using the proposed method. This dataset is approximately 2 GB in size and comprises 30,000 textures, each 150×150 pixels in size, organized into 600 classes. It is uploaded to the Kaggle site and Google Drive. This dataset is called BigTex. Compared to other texture datasets, the proposed dataset is the largest and can serve as a comprehensive texture dataset for training more powerful deep neural networks and mitigating overfitting.
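The automatic removal of similar generated textures is only summarized above; as a hedged, generic illustration (not the paper's actual procedure), near-duplicates could be filtered with a pairwise structural-similarity threshold. The function name and the 0.9 cutoff below are assumptions for illustration only.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def filter_similar_textures(textures, threshold=0.9):
    """Keep a texture only if its SSIM against every already-kept texture stays below the threshold."""
    kept = []
    for tex in textures:
        if all(ssim(tex, ref, data_range=1.0) < threshold for ref in kept):
            kept.append(tex)
    return kept

# Stand-in 150x150 grayscale textures in [0, 1), matching the size quoted for BigTex
textures = [np.random.default_rng(i).random((150, 150), dtype=np.float32) for i in range(10)]
unique_textures = filter_similar_textures(textures)
print(f"{len(unique_textures)} of {len(textures)} textures kept")
```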
Substantial advancements have been achieved in Tunnel Boring Machine (TBM) technology and monitoring systems, yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results. This study investigates the issue of missing data in extensive TBM datasets. Through a comprehensive literature review, we analyze the mechanism of missing TBM data and compare different imputation methods, including statistical analysis and machine learning algorithms. We also examine the impact of various missing patterns and rates on the efficacy of these methods. Finally, we propose a dynamic interpolation strategy tailored for TBM engineering sites. The results show that the K-Nearest Neighbors (KNN) and Random Forest (RF) algorithms achieve good interpolation results; as the missing rate increases, the interpolation performance of all methods declines; and block missingness is the hardest to interpolate, followed by mixed missingness, while sporadic missingness is interpolated best. On-site application results validate the proposed interpolation strategy's capability to achieve robust missing value interpolation, applicable in machine learning scenarios such as parameter optimization, attitude warning, and pressure prediction. These findings contribute to enhancing the efficiency of TBM missing data processing, offering more effective support for large-scale TBM monitoring datasets.
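As a rough illustration of the comparison described above (not the study's own code), the sketch below masks 20% of a synthetic numeric table and scores KNN imputation against a Random-Forest-based iterative imputer; the TBM-style column names and the missing rate are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
full = pd.DataFrame(rng.normal(size=(500, 4)),
                    columns=["thrust", "torque", "advance_rate", "rpm"])  # hypothetical TBM fields

# Sporadic (completely random) missingness at a 20% rate
mask = rng.random(full.shape) < 0.2
masked = full.mask(mask)

imputers = {
    "KNN": KNNImputer(n_neighbors=5),
    "RF": IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0),
                           random_state=0),
}
for name, imputer in imputers.items():
    filled = imputer.fit_transform(masked)
    rmse = np.sqrt(np.mean((filled[mask] - full.values[mask]) ** 2))  # error on masked entries only
    print(f"{name} imputation RMSE on masked entries: {rmse:.3f}")
```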
Industrial data mining usually deals with data from different sources. These heterogeneous datasets describe the same object in different views. However, samples from some of the datasets may be lost, and the remaining samples then no longer correspond one-to-one correctly. Mismatched datasets caused by missing samples make the industrial data unavailable for further machine learning. In order to align the mismatched samples, this article presents a cooperative iteration matching method (CIMM) based on modified dynamic time warping (DTW). The proposed method regards the sequentially accumulated industrial data as time series, and mismatched samples are aligned by the DTW. In addition, dynamic constraints are applied to the warping distance of the DTW process to make the alignment more efficient. A series of models is then trained iteratively with the accumulated samples. Several groups of numerical experiments on different missing patterns and missing locations are designed and analyzed to prove the effectiveness and the applicability of the proposed method.
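A compact dynamic time warping alignment of the kind the abstract builds on can be sketched as follows; this is plain textbook DTW on two short 1-D sequences, not the CIMM implementation, and it omits the dynamic warping-distance constraints described above.

```python
import numpy as np

def dtw_alignment(a, b):
    """Return the DTW warping path (index pairs) aligning 1-D sequences a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the corner to recover the matched index pairs
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else ((i - 1, j) if step == 1 else (i, j - 1))
    return path[::-1]

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.1, 2.1, 4.1, 5.2])   # one sample missing relative to a
print(dtw_alignment(a, b))
```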
Outcrop analogue studies play an important role in advancing our comprehension of reservoir architectures, offering insights into hidden reservoir rocks prior to drilling in a cost-effective manner. These studies contribute to the delineation of the three-dimensional geometry of geological structures, the characterization of petro- and thermo-physical properties, and the structural geological aspects of reservoir rocks. Nevertheless, several challenges, including inaccessible sampling sites, limited resources, and the dimensional constraints of different laboratories, hinder the acquisition of comprehensive datasets. In this study, we employ machine learning techniques to estimate missing data in a petrophysical dataset of fractured Variscan granites from the Cornubian Batholith in Southwest UK. The utilization of mean, k-nearest neighbors, and random forest imputation methods addresses the challenge of missing data, revealing the effectiveness of random forest imputation in providing realistic estimations. Subsequently, supervised classification models are trained to classify samples according to their pluton origins, with promising accuracy achieved by models trained with imputed values. Variable importance ranking of the models showed that the choice of imputation method influences the inferred importance of specific petrophysical properties. While porosity (POR) and grain density (GD) were among the important variables, variables with a high missingness ratio were not among the top variables. This study demonstrates the value of machine learning in enhancing petrophysical datasets, while emphasizing the importance of careful method selection and model validation for reliable results. The findings contribute to a more informed decision-making process in geothermal exploration and reservoir characterization efforts, thereby demonstrating the potential of machine learning in advancing subsurface characterization techniques.
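As a hedged, generic illustration of the workflow described above (impute missing petrophysical values, classify pluton origin, then rank variable importance), the sketch below uses scikit-learn on synthetic stand-in data; the column names, including POR and GD, are used only as labels and the data are not the study's.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["POR", "GD", "perm", "thermal_cond"])
y = rng.integers(0, 3, size=200)                 # stand-in pluton labels
X_missing = X.mask(rng.random(X.shape) < 0.15)   # 15% missing entries

# Random-forest-based imputation, then classification on the imputed table
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=1),
                           random_state=1)
X_imputed = pd.DataFrame(imputer.fit_transform(X_missing), columns=X.columns)

clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_imputed, y)
importance = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)   # variable importance ranking on the imputed table
```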
In the BESⅢ detector at the Beijing electron-positron collider, billions of events from e^(+)e^(-) collisions were recorded. These events, after passing through the trigger system, were saved in raw-data-format files. They play an important role in the study of physics in the τ-charm energy region. Here, we published an e^(+)e^(-) collision dataset containing both Monte Carlo simulation samples and real data collected by the BESⅢ detector. The data pass through the detector trigger system, file format conversion, and physics information extraction, and the physics information and detector response are finally saved in text-format files. This dataset is publicly available and is intended to provide interested scientists and those outside of the BESⅢ collaboration with event information from BESⅢ, which can be used to understand physics research in e^(+)e^(-) collisions and to develop visualization projects for physics education, public outreach, and science advocacy.
Founded in 2009, the China Dark Matter Experiment (CDEX) collaboration was dedicated to the detection of dark matter (DM) and neutrinoless double beta decay using high-purity germanium (HPGe) detectors in the China Jinping Underground Laboratory. HPGe detectors are characterized by high energy resolution, a low analysis threshold, and a low radioactive background, making them an ideal platform for the direct detection of DM. Over the years, CDEX has accumulated a massive amount of experimental data, based on which various results on DM detection and neutrinoless double beta decay have been presented. Because the dataset was collected in a low-background environment, apart from the analysis of DM-related physical channels, it has great potential as an indicator in searches for other rare physical events. Furthermore, by providing raw pulse shapes, the dataset can serve as a tool for effectively understanding the internal mechanisms of HPGe detectors.
Purpose: This paper examines African Journals Online (AJOL) as a bibliometric resource, providing a structured dataset of journal and publication metadata. In addition, it integrates AJOL data with OpenAlex to enhance metadata coverage and improve interoperability with other bibliometric sources. Design/methodology/approach: The journal list and publications indexed in AJOL were retrieved using web scraping techniques. This paper details the database construction process, highlighting its strengths and limitations, and presents a descriptive analysis of AJOL's indexed journals and publications. Findings: The publication analysis demonstrates a steady growth in the number of publications over time but reveals significant disparities in their distribution across African countries. This paper presents an example of the possibility of integrating both sources using author country data from OpenAlex. The analysis of author contributions reveals that African journals serve as both regional and international venues, confirming that African journals play a dual role in fostering both regional and global research engagement. Research limitations: While AJOL contains relevant information for identifying and providing insights about African publications and journals, its metadata are limited. Therefore, the kind of analysis that can be performed with the database presented here is also limited. The integration with OpenAlex aims to overcome some of the limitations. Finally, although some automatic citation procedures have been performed, the metadata has not been manually curated. Therefore, if errors or inaccuracies are present in AJOL, they may be reproduced in this database. Practical implications: The database introduced in this article contributes to the accessibility of African scholarly publications by providing structured, accessible metadata derived from AJOL. It facilitates bibliometric analyses that are more representative of African research activities. This contribution complements ongoing efforts to develop alternative data sources and infrastructure that better reflect the diversity of global knowledge production. Originality/value: This paper presents a novel database for bibliometric analysis and offers a detailed report of the retrieval and construction procedures. The inclusion of matched data with OpenAlex further enhances the database's utility. By showcasing AJOL's potential, this study contributes to the broader goal of fostering inclusivity and improving the representation of African research in global bibliometric analyses.
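As one hedged illustration of the OpenAlex enrichment step (not the authors' pipeline), a record with a DOI could be looked up against the public OpenAlex works endpoint and institution country codes collected from the authorships; the DOI below is a placeholder, error handling is minimal, and field names are assumed to follow the current OpenAlex schema.

```python
import requests

def country_codes_for_doi(doi):
    """Query the OpenAlex works endpoint and collect institution country codes for each authorship."""
    url = f"https://api.openalex.org/works/doi:{doi}"
    work = requests.get(url, timeout=30).json()
    codes = []
    for authorship in work.get("authorships", []):
        for inst in authorship.get("institutions", []):
            if inst.get("country_code"):
                codes.append(inst["country_code"])
    return codes

print(country_codes_for_doi("10.4314/example.doi"))   # placeholder DOI, not a real record
```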
Financial distress prediction (FDP) is a critical area of study for researchers, industry stakeholders, and regulatory authorities. However, FDP tasks present several challenges, including high-dimensional datasets, class imbalances, and the complexity of parameter optimization. These issues often hinder the predictive model's ability to accurately identify companies at high risk of financial distress. To mitigate these challenges, we introduce FinMHSPE—a novel multi-heterogeneous self-paced ensemble (MHSPE) FDP learning framework. The proposed model uses pairwise comparisons of data from multiple time frames combined with the maximum relevance and minimum redundancy method to select an optimal subset of features, effectively resolving the high dimensionality issue. Furthermore, the proposed framework incorporates the MHSPE model to iteratively identify the most informative majority-class data samples, effectively addressing the class imbalance issue. To optimize the model's parameters, we leverage the particle swarm optimization algorithm. The robustness of our proposed model is validated through extensive experiments performed on a financial dataset of Chinese listed companies. The empirical results demonstrate that the proposed model outperforms existing competing models in the field of FDP. Specifically, our FinMHSPE framework achieves the highest performance, with an area under the curve (AUC) value of 0.9574, considerably surpassing all existing methods. A comparative analysis of AUC values further reveals that FinMHSPE outperforms state-of-the-art approaches that rely on financial features as inputs. Furthermore, our investigation identifies several valuable features for enhancing FDP model performance, notably those associated with a company's information and growth potential.
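The maximum-relevance/minimum-redundancy selection step can be illustrated with a generic greedy variant on synthetic data; this is not the FinMHSPE code, and the feature budget of eight is an arbitrary choice for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])

# Relevance: mutual information with the label; redundancy: mean |correlation| with chosen features
relevance = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
corr = X.corr().abs()

selected, remaining = [], list(X.columns)
for _ in range(8):                                   # greedy pick of 8 features
    def score(f):
        redundancy = corr.loc[f, selected].mean() if selected else 0.0
        return relevance[f] - redundancy
    best = max(remaining, key=score)
    selected.append(best)
    remaining.remove(best)

print("Selected features:", selected)
```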
Background: The population of Fontan patients, patients born with a single functioning ventricle, is growing. There is a growing need to develop algorithms for this population that can predict health outcomes. Artificial intelligence models predicting short-term and long-term health outcomes for patients with the Fontan circulation are needed. Generative adversarial networks (GANs) provide a solution for generating realistic and useful synthetic data that can be used to train such models. Methods: Despite their promise, GANs have not been widely adopted in the congenital heart disease research community due, in some part, to a lack of knowledge on how to employ them. In this research study, a GAN was used to generate synthetic data from the Pediatric Heart Network Fontan I dataset. A subset of data consisting of the echocardiographic and BNP measures collected from Fontan patients was used to train the GAN. Two sets of synthetic data were created to understand the effect of data missingness on synthetic data generation. Synthetic data was created from real data in which the missing values were imputed using Multiple Imputation by Chained Equations (MICE) (referred to as synthetic from imputed real samples). In addition, synthetic data was created from real data in which the missing values were dropped (referred to as synthetic from dropped real samples). Both synthetic datasets were evaluated for fidelity by using visual methods, which involved comparing histograms and principal component analysis (PCA) plots. Fidelity was measured quantitatively by (1) comparing synthetic and real data using the Kolmogorov-Smirnov test to evaluate the similarity between two distributions and (2) training a neural network to distinguish between real and synthetic samples. Both synthetic datasets were evaluated for utility by training a neural network with synthetic data and testing the neural network on its ability to classify patients that have ventricular dysfunction using echocardiographic measures and serological measures. Results: Using histograms, associated probability density functions, and PCA, both synthetic datasets showed visual resemblance in distribution and variance to real Fontan data. Quantitatively, synthetic data from dropped real samples had higher similarity scores, as demonstrated by the Kolmogorov-Smirnov statistic, for all but one feature (age at Fontan) compared to synthetic data from imputed real samples, which demonstrated dissimilar scores for three features (Echo SV, Echo tda, and BNP). In addition, synthetic data from dropped real samples resembled real data to a larger extent (49.3% classification error) than synthetic data from imputed real samples (65.28% classification error). Classification errors approximating 50% represent datasets that are indistinguishable. In terms of utility, synthetic data created from real data in which the missing values were imputed classified ventricular dysfunction in real data with a classification error of 10.99%. Similarly, the utility of the generated synthetic data was demonstrated by showing that a neural network trained on synthetic data derived from real data in which the missing values were dropped could classify ventricular dysfunction in real data with a classification error of 9.44%. Conclusions: Although representing a limited subset of the vast data available on the Pediatric Heart Network, generative adversarial networks can create synthetic data that mimics the probability distribution of real Fontan echocardiographic measures. Clinicians can use these synthetic data to create models that predict health outcomes for Fontan patients.
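The per-feature Kolmogorov-Smirnov fidelity check described in the Methods can be illustrated as follows; the arrays stand in for the real and synthetic measures (the Pediatric Heart Network data are not reproduced here), and the feature names simply echo those mentioned in the abstract.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
features = ["Echo_SV", "Echo_tda", "BNP", "age_at_Fontan"]          # stand-in names
real = {f: rng.normal(size=300) for f in features}                   # stand-in "real" measures
synthetic = {f: rng.normal(loc=0.1, size=300) for f in features}     # stand-in "synthetic" measures

# Two-sample KS test per feature: small statistic / large p suggests similar distributions
for f in features:
    stat, p = ks_2samp(real[f], synthetic[f])
    print(f"{f}: KS statistic = {stat:.3f}, p = {p:.3f}")
```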
Power transmission lines are a critical component of the entire power system, and ice accretion incidents in various types of power systems can result in immeasurable harm. Currently, network models used for ice detection on power transmission lines require a substantial amount of sample data to support their training, and their detection accuracy is significantly affected by inaccurate annotations in the training dataset. Therefore, we propose a transformer-based detection model, structured into two stages, to collectively address the impact of inaccurate datasets on model training. In the first stage, a spatial similarity enhancement (SSE) module is designed to leverage spatial information to enhance the construction of the detection framework, thereby improving the accuracy of the detector. In the second stage, a target similarity enhancement (TSE) module is introduced to enhance object-related features, reducing the impact of inaccurate data on model training and thereby expanding global correlation. Additionally, by incorporating a multi-head adaptive attention window (MAAW), spatial information is combined with category information to achieve information interaction. Simultaneously, a quasi-wavelet structure, compatible with deep learning, is employed to highlight subtle features at different scales. Experimental results indicate that the proposed model outperforms existing mainstream detection models, demonstrating superior performance and stability.
To better manipulate the heterogeneous and distributed data in the data grid, a dataspace management framework for grid data is proposed based on in-depth research on grid technology. Combining technologies in dataspace management, such as the data model iDM and the query language iTrails, with the grid data access middleware OGSA-DAI, a grid dataspace management prototype system is built, in which tasks such as data access, abstraction, indexing, services management, and query answering are implemented by OGSA-DAI workflows. Experimental results show that it is feasible to apply a dataspace management mechanism to the grid environment. Dataspace meets grid data management needs in that it hides the heterogeneity and distribution of grid data and can adapt to the dynamic characteristics of the grid. The proposed grid dataspace management provides a new method for grid data management.
Automatic pavement crack detection is a critical task for maintaining pavement stability and driving safety. The task is challenging because shadows on the pavement may have intensity similar to that of cracks, which interferes with crack detection performance. To date, there is still a lack of efficient algorithm models and training datasets to deal with the interference brought by shadows. To fill in this gap, we make several contributions as follows. First, we propose a new pavement shadow and crack dataset, which contains a variety of shadow and pavement pixel size combinations. It also covers all common cracks (linear cracks and network cracks), placing higher demands on crack detection methods. Second, we design a two-step shadow-removal-oriented crack detection approach, SROCD, which improves performance by first removing the shadow and then detecting the crack. In addition to shadows, the method can cope with other noise disturbances. Third, we explore the mechanism of how shadows affect crack detection. Based on this mechanism, we propose a data augmentation method based on differences in brightness values, which can adapt to brightness changes caused by seasonal and weather variations. Finally, we introduce a residual feature augmentation algorithm to detect small cracks that can presage sudden disasters, and the algorithm improves the overall performance of the model. We compare our method with state-of-the-art methods on existing pavement crack datasets and the shadow-crack dataset, and the experimental results demonstrate the superiority of our method.
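The brightness-difference augmentation is described only at a high level above; the sketch below is a minimal stand-in, assuming simple gain and offset shifts on a uint8 image rather than the paper's exact settings.

```python
import numpy as np

def brightness_augment(image, gains=(0.7, 1.0, 1.3), offsets=(-20, 0, 20)):
    """Return copies of a uint8 pavement image with shifted brightness (gain * pixel + offset)."""
    out = []
    for g in gains:
        for o in offsets:
            aug = np.clip(image.astype(np.float32) * g + o, 0, 255).astype(np.uint8)
            out.append(aug)
    return out

image = (np.random.rand(256, 256) * 255).astype(np.uint8)   # stand-in pavement patch
augmented = brightness_augment(image)
print(f"Generated {len(augmented)} brightness variants")
```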
Version 4 (v4) of the Extended Reconstructed Sea Surface Temperature (ERSST) dataset is compared with its precedent, the widely used version 3b (v3b). The essential upgrades applied to v4 lead to remarkable differences in the characteristics of the sea surface temperature (SST) anomaly (SSTa) in both the temporal and spatial domains. First, the largest discrepancy of the global mean SSTa values around the 1940s is due to ship-observation corrections made to reconcile observations from buckets and engine intake thermometers. Second, differences in global and regional mean SSTa values between v4 and v3b exhibit a downward trend (around -0.032℃ per decade) before the 1940s, an upward trend (around 0.014℃ per decade) during the period of 1950–2015, and interdecadal oscillation with one peak around the 1980s and two troughs during the 1960s and 2000s, respectively. This does not derive from treatments of the polar or the other data-void regions, since the difference of the SSTa does not share the common features. Third, the spatial pattern of the ENSO-related variability of v4 exhibits a wider but weaker cold tongue in the tropical region of the Pacific Ocean compared with that of v3b, which could be attributed to differences in gap-filling assumptions, since the latter features satellite observations whereas the former features in situ ones. This intercomparison confirms that the structural uncertainty arising from underlying assumptions on the treatment of diverse SST observations, even in the same SST product family, is the main source of significant SST differences in the temporal domain. Why this uncertainty introduces artificial decadal oscillations remains unknown.
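Per-decade trends such as the -0.032℃ per decade figure quoted above are the kind of quantity obtainable from a least-squares fit to an annual difference series; the sketch below runs on synthetic numbers, not ERSST data, and is only meant to show the arithmetic.

```python
import numpy as np

years = np.arange(1900, 1941)                      # pre-1940s segment, as an example window
rng = np.random.default_rng(0)
# Synthetic SSTa-difference series with a small imposed downward drift plus noise
ssta_diff = -0.0032 * (years - years[0]) + rng.normal(scale=0.01, size=years.size)

slope_per_year = np.polyfit(years, ssta_diff, deg=1)[0]   # least-squares linear trend
print(f"Trend: {slope_per_year * 10:+.3f} degC per decade")
```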
To address the shortage of public datasets of customs X-ray images of contraband and the difficulties in deploying trained models in engineering applications, a method has been proposed that employs the Extract-Transform-Load (ETL) approach to create an X-ray dataset of contraband items. Initially, X-ray scatter image data is collected and cleaned. Using Kafka message queues and the Elasticsearch (ES) distributed search engine, the data is transmitted in real time to cloud servers. Subsequently, contraband data is annotated using a combination of neural networks and manual methods to improve annotation efficiency, and a mean hash algorithm is implemented for quick image retrieval. A method of integrating targets with backgrounds augments the X-ray contraband image data, increasing the number of positive samples. Finally, an Airport Customs X-ray dataset (ACXray) compatible with customs business scenarios has been constructed, featuring an increased number of positive contraband samples. Experimental tests using three datasets to train the Mask Region-based Convolutional Neural Network (Mask R-CNN) algorithm, evaluated on 400 real customs images, revealed that the recognition accuracy of algorithms trained with Security Inspection X-ray (SIXray) and Occluded Prohibited Items X-ray (OPIXray) decreased by 16.3% and 15.1%, respectively, while the accuracy of the algorithm trained on the ACXray dataset was almost unaffected. This indicates that the ACXray-trained algorithm possesses strong generalization capabilities and is more suitable for customs detection scenarios.
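The mean hash mentioned for quick image retrieval is commonly realized as an average hash; the sketch below shows the usual 8x8 variant with a Hamming-distance comparison, which may differ from the authors' exact settings.

```python
import numpy as np
from PIL import Image

def mean_hash(img, size=8):
    """Return a 64-bit boolean hash: thumbnail pixels above the thumbnail mean."""
    small = np.asarray(img.convert("L").resize((size, size)), dtype=np.float32)
    return (small > small.mean()).flatten()

def hamming(h1, h2):
    """Number of differing bits between two hashes (0 means visually very similar images)."""
    return int(np.count_nonzero(h1 != h2))

# Usage sketch with two in-memory stand-in images
a = Image.fromarray((np.random.rand(200, 200) * 255).astype(np.uint8))
b = a.copy()
print("Hamming distance:", hamming(mean_hash(a), mean_hash(b)))   # 0 for identical images
```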
For the first time, this article introduces a LiDAR Point Clouds Dataset of Ships composed of both collected and simulated data to address the scarcity of LiDAR data in maritime applications. The collected data are acquired using specialized maritime LiDAR sensors in both inland waterways and wide-open ocean environments. The simulated data are generated by placing a ship in the LiDAR coordinate system and scanning it with a redeveloped Blensor that emulates the operation of a LiDAR sensor equipped with various laser beams. Furthermore, we also render point clouds for foggy and rainy weather conditions. To describe a realistic shipping environment, a dynamic tail wave is modeled by iterating the wave elevation of each point in a time series. Finally, networks designed for small objects are migrated to ship applications by training them on our dataset. The positive effect of simulated data is described in object detection experiments, and the negative impact of tail waves as noise is verified in single-object tracking experiments. The dataset is available at https://github.com/zqy411470859/ship_dataset.
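The tail-wave modeling is described only briefly above; as a purely illustrative stand-in (not the paper's wave model), the sketch below iterates a simple travelling sinusoidal wave elevation for each point of a point cloud over a time series. All parameter values are assumptions.

```python
import numpy as np

def add_tail_wave(points, t, amplitude=0.2, wavelength=4.0, speed=1.5):
    """Offset the z-coordinate of each (x, y, z) point with a travelling sine wave at time t."""
    k = 2 * np.pi / wavelength              # wavenumber
    omega = speed * k                       # angular frequency
    z_offset = amplitude * np.sin(k * points[:, 0] - omega * t)
    out = points.copy()
    out[:, 2] += z_offset
    return out

cloud = np.random.rand(1000, 3) * 10.0      # stand-in water-surface points near a ship
frames = [add_tail_wave(cloud, t) for t in np.linspace(0.0, 5.0, 50)]
print(len(frames), frames[0].shape)
```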
The rapid growth of modern mobile devices leads to a large amount of distributed data, which is extremely valuable for learning models. Unfortunately, training a model by collecting all these original data on a centralized cloud server is not practical due to data privacy and communication cost concerns, hindering artificial intelligence from empowering mobile devices. Moreover, these data are not identically and independently distributed (Non-IID) because of their different contexts, which deteriorates the performance of the model. To address these issues, we propose a novel Distributed Learning algorithm based on hierarchical clustering and Adaptive Dataset Condensation, named ADC-DL, which learns a shared model by collecting the synthetic samples generated on each device. To tackle the heterogeneity of data distribution, we propose an entropy-TOPSIS comprehensive tiering model for hierarchical clustering, which distinguishes clients in terms of their data characteristics. Subsequently, synthetic dummy samples are generated based on the hierarchical structure using adaptive dataset condensation. The dataset condensation procedure can be adjusted adaptively according to the tier of the client. Extensive experiments demonstrate that ADC-DL outperforms existing algorithms in both prediction accuracy and communication cost.
An M_(S)6.4 earthquake occurred on 21 May 2021 in Yangbi county, Dali prefecture, Yunnan, China, at 21:48 Beijing Time (13:48 UTC). Earthquakes of M3.0 or higher occurred before and after the main shock. Seismic data analysis is essential for the in-depth investigation of the 2021 Yangbi M_(S)6.4 earthquake sequence and the seismotectonics of northwestern Yunnan. The Institute of Geophysics, China Earthquake Administration (CEA), has compiled a dataset of seismological observations from 157 broadband stations located within 500 km of the epicenter, and has made this dataset available to the earthquake science research community. The dataset (total file size: 329 GB) consists of event waveforms with a sampling frequency of 100 sps collected from 18 to 28 May 2021, 20-Hz and 100-Hz continuous waveforms collected from 12 to 31 May 2021, and seismic instrument response files. To promote data sharing, the dataset also includes the seismic event waveforms from 20 to 22 May 2021 recorded at 50 stations of the ongoing Binchuan Active Source Geophysical Observation Project, for which the data protection period has not expired. Sample waveforms of the main shock are included in the appendix of this article and can be downloaded from the Earthquake Science website. The event and continuous waveforms are available from the Earthquake Science Data Center website (www.esdc.ac.cn) on application.
One of the biggest dangers to society today is terrorism, where attacks have become one of the most significant risks to international peace and national security. Big data, information analysis, and artificial intelligence (AI) have become the basis for making strategic decisions in many sensitive areas, such as fraud detection, risk management, medical diagnosis, and counter-terrorism. However, there is still a need to assess how terrorist attacks are related, initiated, and detected. For this purpose, we propose a novel framework for classifying and predicting terrorist attacks. The proposed framework posits that neglected text attributes included in the Global Terrorism Database (GTD) can influence the accuracy of the model's classification of terrorist attacks, where each part of the data can provide vital information to enrich the ability of classifier learning. Each data point in a multiclass taxonomy has one or more tags attached to it, referred to as "related tags." We applied machine learning classifiers to classify terrorist attack incidents obtained from the GTD. A transformer-based technique called DistilBERT extracts and learns contextual features from text attributes to acquire more information from the text data. The extracted contextual features are combined with the "key features" of the dataset and used to perform the final classification. The study explored different experimental setups with various classifiers to evaluate the model's performance. The experimental results show that the proposed framework outperforms the latest techniques for classifying terrorist attacks, with an accuracy of 98.7% using a combined feature set and an extreme gradient boosting classifier.
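As a hedged sketch of the pipeline described above (DistilBERT contextual features concatenated with key features and passed to an extreme gradient boosting classifier), the snippet below uses placeholder texts, numeric features, and labels rather than GTD data; the checkpoint name and feature choices are assumptions, not the study's configuration.

```python
import numpy as np
import torch
from transformers import DistilBertTokenizerFast, DistilBertModel
from xgboost import XGBClassifier

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()

texts = ["explosion near a government building", "armed assault on a convoy"]   # placeholder summaries
key_features = np.array([[1975, 3], [1980, 1]], dtype=np.float32)               # e.g. year, weapon code
labels = np.array([0, 1])                                                        # placeholder attack classes

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state          # (batch, tokens, 768) contextual features
    text_emb = hidden[:, 0, :].numpy()                 # first-token embedding per text

X = np.hstack([text_emb, key_features])                # combined feature set
clf = XGBClassifier(n_estimators=50).fit(X, labels)    # extreme gradient boosting on combined features
print(clf.predict(X))
```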
文摘Standardized datasets are foundational to healthcare informatization by enhancing data quality and unleashing the value of data elements.Using bibliometrics and content analysis,this study examines China's healthcare dataset standards from 2011 to 2025.It analyzes their evolution across types,applications,institutions,and themes,highlighting key achievements including substantial growth in quantity,optimized typology,expansion into innovative application scenarios such as health decision support,and broadened institutional involvement.The study also identifies critical challenges,including imbalanced development,insufficient quality control,and a lack of essential metadata—such as authoritative data element mappings and privacy annotations—which hampers the delivery of intelligent services.To address these challenges,the study proposes a multi-faceted strategy focused on optimizing the standard system's architecture,enhancing quality and implementation,and advancing both data governance—through authoritative tracing and privacy protection—and intelligent service provision.These strategies aim to promote the application of dataset standards,thereby fostering and securing the development of new productive forces in healthcare.
基金supported via funding from Prince Sattam bin Abdulaziz University(PSAU/2025/R/1446)Princess Nourah bint Abdulrahman University(PNURSP2025R300)Prince Sultan University.
文摘Deep neural networks provide accurate results for most applications.However,they need a big dataset to train properly.Providing a big dataset is a significant challenge in most applications.Image augmentation refers to techniques that increase the amount of image data.Common operations for image augmentation include changes in illumination,rotation,contrast,size,viewing angle,and others.Recently,Generative Adversarial Networks(GANs)have been employed for image generation.However,like image augmentation methods,GAN approaches can only generate images that are similar to the original images.Therefore,they also cannot generate new classes of data.Texture images presentmore challenges than general images,and generating textures is more complex than creating other types of images.This study proposes a gradient-based deep neural network method that generates a new class of texture.It is possible to rapidly generate new classes of textures using different kernels from pre-trained deep networks.After generating new textures for each class,the number of textures increases through image augmentation.During this process,several techniques are proposed to automatically remove incomplete and similar textures that are created.The proposed method is faster than some well-known generative networks by around 4 to 10 times.In addition,the quality of the generated textures surpasses that of these networks.The proposed method can generate textures that surpass those of someGANs and parametric models in certain image qualitymetrics.It can provide a big texture dataset to train deep networks.A new big texture dataset is created artificially using the proposed method.This dataset is approximately 2 GB in size and comprises 30,000 textures,each 150×150 pixels in size,organized into 600 classes.It is uploaded to the Kaggle site and Google Drive.This dataset is called BigTex.Compared to other texture datasets,the proposed dataset is the largest and can serve as a comprehensive texture dataset for training more powerful deep neural networks and mitigating overfitting.
基金supported by the National Natural Science Foundation of China(Grant No.52409151)the Programme of Shenzhen Key Laboratory of Green,Efficient and Intelligent Construction of Underground Metro Station(Programme No.ZDSYS20200923105200001)the Science and Technology Major Project of Xizang Autonomous Region of China(XZ202201ZD0003G).
文摘Substantial advancements have been achieved in Tunnel Boring Machine(TBM)technology and monitoring systems,yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results.This study aims to investigate the issue of missing data in extensive TBM datasets.Through a comprehensive literature review,we analyze the mechanism of missing TBM data and compare different imputation methods,including statistical analysis and machine learning algorithms.We also examine the impact of various missing patterns and rates on the efficacy of these methods.Finally,we propose a dynamic interpolation strategy tailored for TBM engineering sites.The research results show that K-Nearest Neighbors(KNN)and Random Forest(RF)algorithms can achieve good interpolation results;As the missing rate increases,the interpolation effect of different methods will decrease;The interpolation effect of block missing is poor,followed by mixed missing,and the interpolation effect of sporadic missing is the best.On-site application results validate the proposed interpolation strategy's capability to achieve robust missing value interpolation effects,applicable in ML scenarios such as parameter optimization,attitude warning,and pressure prediction.These findings contribute to enhancing the efficiency of TBM missing data processing,offering more effective support for large-scale TBM monitoring datasets.
基金the Key National Natural Science Foundation of China(No.U1864211)the National Natural Science Foundation of China(No.11772191)the Natural Science Foundation of Shanghai(No.21ZR1431500)。
文摘Industrial data mining usually deals with data from different sources.These heterogeneous datasets describe the same object in different views.However,samples from some of the datasets may be lost.Then the remaining samples do not correspond one-to-one correctly.Mismatched datasets caused by missing samples make the industrial data unavailable for further machine learning.In order to align the mismatched samples,this article presents a cooperative iteration matching method(CIMM)based on the modified dynamic time warping(DTW).The proposed method regards the sequentially accumulated industrial data as the time series.Mismatched samples are aligned by the DTW.In addition,dynamic constraints are applied to the warping distance of the DTW process to make the alignment more efficient.Then a series of models are trained with the cumulated samples iteratively.Several groups of numerical experiments on different missing patterns and missing locations are designed and analyzed to prove the effectiveness and the applicability of the proposed method.
文摘Outcrop analogue studies play an important role in advancing our comprehension of reservoir architectures,offering insights into hidden reservoir rocks prior to drilling,in a cost-effective manner.These studies contribute to the delineation of the three-dimensional geometry of geological structures,the characterization of petro-and thermo-physical properties,and the structural geological aspects of reservoir rocks.Nevertheless,several challenges,including inaccessible sampling sites,limited resources,and the dimensional constraints of different laboratories hinder the acquisition of comprehensive datasets.In this study,we employ machine learning techniques to estimate missing data in a petrophysical dataset of fractured Variscan granites from the Cornubian Batholith in Southwest UK.The utilization of mean,k-nearest neighbors,and random forest imputation methods addresses the challenge of missing data,thereby revealing the effectiveness of random forest imputation in providing realistic estimations.Subsequently,supervised classification models are trained to classify samples according to their pluton origins,with promising accuracy achieved by models trained with imputed values.Variable importance ranking of the models showed that the choice of imputation method influences the inferred importance of specific petrophysical properties.While porosity(POR)and grain density(GD)were among important variables,variables with high missingness ratio were not among the top variables.This study demonstrates the value of machine learning in enhancing petrophysical datasets,while emphasizing the importance of careful method selection and model validation for reliable results.The findings contribute to a more informed decision-making process in geothermal exploration and reservoir tion characteriza-efforts,thereby demonstrating the potential of machine learning in advancing subsurface characterization techniques.
基金supported by the National Key Research and Development Program of China(No.2023YFA1606000)National Natural Science Foundation of China(Nos.12175321,11975021,and U1932101)National College Students Science and Technology Innovation Project of Sun Yat-sen University。
文摘In the BESⅢdetector at Beijing electron-positron collider,billions of events from e^(+)e^(-)collisions were recorded.These events passing through the trigger system were saved in raw data format files.They play an important role in the study of physics inτ-charm energy region.Here,we published an e^(+)e^(-)collision dataset containing both Monte Carlo simulation samples and real data collected by the BESⅢdetector.The data pass through the detector trigger system,file format conversion,and physics information extraction and finally save the physics information and detector response in text format files.This dataset is publicly available and is intended to provide interested scientists and those outside of the BESⅢcollaboration with event information from BESⅢ,which can be used to understand physics research in e^(+)e^(-)collisions,developing visualization projects for physics education,public outreach,and science advocacy.
基金supported by the National Key Research and Development Program of China(Nos.2023YFA1607100 and 2022YFA1605000)the National Natural Science Foundation of China(No.12322511)。
文摘Founded in 2009,the China Dark Matter Experiment(CDEX)collaboration was dedicated to the detection of dark matter(DM)and neutrinoless double beta decay using high-purity germanium(HPGe)detectors in the China Jinping Underground Laboratory.HPGe detectors are characterized by a high energy resolution,low analysis threshold,and low radioactive background,making them an ideal platform for the direct detection of DM.Over the years,CDEX has accumulated a massive amount of experimental data,based on which various results on DM detection and neutrinoless double beta decay have been presented.Because the dataset was collected in a low-background environment,apart from the analysis of DM-related physical channels,it has great potential as an indicator in other rare physical events searches.Furthermore,by providing raw pulse shapes,the dataset can serve as a tool for effectively understanding the internal mechanisms of HPGe detectors.
基金supported by a PIPF contract of the Madrid Education,Science and Universities Office(grant number:PIPF-2022/PH-HUM-25403).
文摘Purpose:This paper examines African Journals Online(AJOL)as a bibliometric resource,providing a structured dataset of journal and publication metadata.In addition,it integrates AJOL data with OpenAlex to enhance metadata coverage and improve interoperability with other bibliometric sources.Design/methodology/approach:The journal list and publications indexed in AJOL were retrieved using web scraping techniques.This paper details the database construction process,highlighting its strengths and limitations,and presents a descriptive analysis of AJOL’s indexed journals and publications.Findings:The publication analysis demonstrates a steady growth in the number of publications over time but reveals significant disparities in their distribution across African countries.This paper presents an example of the possibility of integrating both sources using author country data from OpenAlex.The analysis of author contributions reveals that African journals serve as both regional and international venues,confirming that African journals play a dual role in fostering both regional and global research engagement.Research limitations:While AJOL contains relevant information for identifying and providing insights about African publications and journals,its metadata are limited.Therefore,the kind of analysis that can be performed with the database presented here is also limited.The integration with OpenAlex aims to overcome some of the limitations.Finally,although some automatic citation procedures have been performed,the metadata has not been manually curated.Therefore,if errors or inaccuracies are present in the AJOL,they may be reproduced in this database.Practical implications:The database introduced in this article contributes to the accessibility of African scholarly publications by providing structured,accessible metadata derived from the AJOL.It facilitates bibliometric analyses that are more representative of African research activities.This contribution complements ongoing efforts to develop alternative data sources and infrastructure that better reflect the diversity of global knowledge production.Originality/value:This paper presents a novel database for bibliometric analysis and offers a detailed report of the retrieval and construction procedures.The inclusion of matched data with OpenAlex further enhances the database’s utility.By showcasing AJOL’s potential,this study contributes to the broader goal of fostering inclusivity and improving the representation of African research in global bibliometric analyses.
基金China Postdoctoral Science Foundation(No.2023M740237,2024M750254)Postdoctoral Fellowship Program of CPSF(No.GZB20230934)+1 种基金National Natural Science Foundation of China(No.71801113,72401029,72431003)China Scholarship Council(No.202006060162).
文摘Financial distress prediction(FDP)is a critical area of study for researchers,industry stakeholders,and regulatory authorities.However,FDP tasks present several challenges,including high-dimensional datasets,class imbalances,and the complexity of parameter optimization.These issues often hinder the predictive model’s ability to accurately identify companies at high risk of financial distress.To mitigate these challenges,we introduce FinMHSPE—a novel multi-heterogeneous self-paced ensemble(MHSPE)FDP learning framework.The proposed model uses pairwise comparisons of data from multiple time frames combined with the maximum relevance and minimum redundancy method to select an optimal subset of features,effectively resolving the high dimensionality issue.Furthermore,the proposed framework incorporates the MHSPE model to iteratively identify the most informative majority class data samples,effectively addressing the class imbalance issue.To optimize the model’s parameters,we leverage the particle swarm optimization algorithm.The robustness of our proposed model is validated through extensive experiments performed on a financial dataset of Chinese listed companies.The empirical results demonstrate that the proposed model outperforms existing competing models in the field of FDP.Specifically,our FinMHSPE framework achieves the highest performance,achieving an area under the curve(AUC)value of 0.9574,considerably surpassing all existing methods.A comparative analysis of AUC values further reveals that FinMHSPE outperforms state-of-the-art approaches that rely on financial features as inputs.Furthermore,our investigation identifies several valuable features for enhancing FDP model performance,notably those associated with a company’s information and growth potential.
文摘Background: The population of Fontan patients, patients born with a single functioningventricle, is growing. There is a growing need to develop algorithms for this population that can predicthealth outcomes. Artiffcial intelligence models predicting short-term and long-term health outcomes forpatients with the Fontan circulation are needed. Generative adversarial networks (GANs) provide a solutionfor generating realistic and useful synthetic data that can be used to train such models. Methods: Despitetheir promise, GANs have not been widely adopted in the congenital heart disease research communitydue, in some part, to a lack of knowledge on how to employ them. In this research study, a GAN was usedto generate synthetic data from the Pediatric Heart Network Fontan I dataset. A subset of data consistingof the echocardiographic and BNP measures collected from Fontan patients was used to train the GAN.Two sets of synthetic data were created to understand the effect of data missingness on synthetic datageneration. Synthetic data was created from real data in which the missing values were imputed usingMultiple Imputation by Chained Equations (MICE) (referred to as synthetic from imputed real samples). Inaddition, synthetic data was created from real data in which the missing values were dropped (referred to assynthetic from dropped real samples). Both synthetic datasets were evaluated for ffdelity by using visualmethods which involved comparing histograms and principal component analysis (PCA) plots. Fidelitywas measured quantitatively by (1) comparing synthetic and real data using the Kolmogorov-Smirnovtest to evaluate the similarity between two distributions and (2) training a neural network to distinguishbetween real and synthetic samples. Both synthetic datasets were evaluated for utility by training aneural network with synthetic data and testing the neural network on its ability to classify patients thathave ventricular dysfunction using echocardiograph measures and serological measures. Results: Usinghistograms, associated probability density functions, and (PCA), both synthetic datasets showed visualresemblance in distribution and variance to real Fontan data. Quantitatively, synthetic data from droppedreal samples had higher similarity scores, as demonstrated by the Kolmogorov–Smirnov statistic, for all butone feature (age at Fontan) compared to synthetic data from imputed real samples, which demonstrateddissimilar scores for three features (Echo SV, Echo tda, and BNP). In addition, synthetic data from droppedreal samples resembled real data to a larger extent (49.3% classiffcation error) than synthetic data fromimputed real samples (65.28% classiffcation error). Classiffcation errors approximating 50% represent datasetsthat are indistinguishable. In terms of utility, synthetic data created from real data in which the missingvalues were imputed classiffed ventricular dysfunction in real data with a classiffcation error of 10.99%.Similarly, utility of the generated synthetic data by showing that a neural network trained on synthetic dataderived from real data in which the missing values were dropped could classify ventricular dysfunction inreal data with a classiffcation error of 9.44%. Conclusions: Although representing a limited subset of thevast data available on the Pediatric Heart Network, generative adversarial networks can create syntheticdata that mimics the probability distribution of real Fontan echocardiographic measures. 
Clinicians can usethese synthetic data to create models that predict health outcomes for Fontan patients.
文摘Power transmission lines are a critical component of the entire power system,and ice accretion incidents caused by various types of power systems can result in immeasurable harm.Currently,network models used for ice detection on power transmission lines require a substantial amount of sample data to support their training,and their drawback is that detection accuracy is significantly affected by the inaccurate annotation among training dataset.Therefore,we propose a transformer-based detection model,structured into two stages to collectively address the impact of inaccurate datasets on model training.In the first stage,a spatial similarity enhancement(SSE)module is designed to leverage spatial information to enhance the construction of the detection framework,thereby improving the accuracy of the detector.In the second stage,a target similarity enhancement(TSE)module is introduced to enhance object-related features,reducing the impact of inaccurate data on model training,thereby expanding global correlation.Additionally,by incorporating a multi-head adaptive attention window(MAAW),spatial information is combined with category information to achieve information interaction.Simultaneously,a quasi-wavelet structure,compatible with deep learning,is employed to highlight subtle features at different scales.Experimental results indicate that the proposed model in this paper outperforms existing mainstream detection models,demonstrating superior performance and stability.
文摘To manipulate the heterogeneous and distributed data better in the data grid,a dataspace management framework for grid data is proposed based on in-depth research on grid technology.Combining technologies in dataspace management,such as data model iDM and query language iTrails,with the grid data access middleware OGSA-DAI,a grid dataspace management prototype system is built,in which tasks like data accessing,Abstraction,indexing,services management and answer-query are implemented by the OGSA-DAI workflows.Experimental results show that it is feasible to apply a dataspace management mechanism to the grid environment.Dataspace meets the grid data management needs in that it hides the heterogeneity and distribution of grid data and can adapt to the dynamic characteristics of the grid.The proposed grid dataspace management provides a new method for grid data management.
基金supported in part by the 14th Five-Year Project of Ministry of Science and Technology of China(2021YFD2000304)Fundamental Research Funds for the Central Universities(531118010509)Natural Science Foundation of Hunan Province,China(2021JJ40114)。
文摘Automatic pavement crack detection is a critical task for maintaining the pavement stability and driving safety.The task is challenging because the shadows on the pavement may have similar intensity with the crack,which interfere with the crack detection performance.Till to the present,there still lacks efficient algorithm models and training datasets to deal with the interference brought by the shadows.To fill in the gap,we made several contributions as follows.First,we proposed a new pavement shadow and crack dataset,which contains a variety of shadow and pavement pixel size combinations.It also covers all common cracks(linear cracks and network cracks),placing higher demands on crack detection methods.Second,we designed a two-step shadow-removal-oriented crack detection approach:SROCD,which improves the performance of the algorithm by first removing the shadow and then detecting it.In addition to shadows,the method can cope with other noise disturbances.Third,we explored the mechanism of how shadows affect crack detection.Based on this mechanism,we propose a data augmentation method based on the difference in brightness values,which can adapt to brightness changes caused by seasonal and weather changes.Finally,we introduced a residual feature augmentation algorithm to detect small cracks that can predict sudden disasters,and the algorithm improves the performance of the model overall.We compare our method with the state-of-the-art methods on existing pavement crack datasets and the shadow-crack dataset,and the experimental results demonstrate the superiority of our method.
Funding: Supported by the National Key Basic Research and Development Plan (No. 2015CB953900) and the Natural Science Foundation of China (Nos. 41330960 and 41776032).
Abstract: Version 4 (v4) of the Extended Reconstructed Sea Surface Temperature (ERSST) dataset is compared with its predecessor, the widely used version 3b (v3b). The essential upgrades applied in v4 lead to remarkable differences in the characteristics of the sea surface temperature (SST) anomaly (SSTa) in both the temporal and spatial domains. First, the largest discrepancy in global mean SSTa values, around the 1940s, is due to ship-observation corrections made to reconcile observations from buckets and engine-intake thermometers. Second, differences in global and regional mean SSTa values between v4 and v3b exhibit a downward trend (around -0.032℃ per decade) before the 1940s, an upward trend (around 0.014℃ per decade) during 1950–2015, and an interdecadal oscillation with one peak around the 1980s and two troughs during the 1960s and 2000s. These differences do not derive from the treatment of the polar or other data-void regions, since the SSTa differences in those regions do not share these features. Third, the spatial pattern of the ENSO-related variability in v4 exhibits a wider but weaker cold tongue in the tropical Pacific Ocean compared with that in v3b, which could be attributed to differences in gap-filling assumptions, since the latter relies on satellite observations whereas the former relies on in situ ones. This intercomparison confirms that the structural uncertainty arising from underlying assumptions in the treatment of diverse SST observations, even within the same SST product family, is the main source of significant SST differences in the temporal domain. Why this uncertainty introduces artificial decadal oscillations remains unknown.
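As a purely illustrative aid to the decadal trend figures quoted above, the sketch below fits a linear trend (in ℃ per decade) to the difference between two annual-mean SSTa series with NumPy; the synthetic input series are placeholders, not ERSST data.

```python
import numpy as np

def trend_per_decade(years: np.ndarray, series: np.ndarray) -> float:
    """Least-squares linear trend of `series`, expressed per decade."""
    slope_per_year = np.polyfit(years, series, deg=1)[0]
    return 10.0 * slope_per_year

if __name__ == "__main__":
    years = np.arange(1950, 2016)
    rng = np.random.default_rng(0)
    # Placeholder anomaly series standing in for the v4 and v3b global means.
    v4 = 0.010 * (years - 1950) + rng.normal(0, 0.05, years.size)
    v3b = 0.008 * (years - 1950) + rng.normal(0, 0.05, years.size)
    print(f"trend of (v4 - v3b): {trend_per_decade(years, v4 - v3b):+.3f} degC per decade")
```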
Funding: Supported by the National Natural Science Foundation of China (Grant No. 51605069).
Abstract: To address the shortage of public datasets of customs X-ray images of contraband and the difficulties in deploying trained models in engineering applications, a method is proposed that employs the Extract-Transform-Load (ETL) approach to create an X-ray dataset of contraband items. Initially, X-ray scatter image data are collected and cleaned. Using Kafka message queues and the Elasticsearch (ES) distributed search engine, the data are transmitted in real time to cloud servers. Subsequently, contraband data are annotated using a combination of neural networks and manual methods to improve annotation efficiency, and a mean hash algorithm is implemented for quick image retrieval. A target-background integration method augments the X-ray contraband image data, increasing the number of positive samples. Finally, an Airport Customs X-ray dataset (ACXray) compatible with customs business scenarios is constructed, featuring an increased number of positive contraband samples. In experiments, the Mask Region-based Convolutional Neural Network (Mask R-CNN) algorithm was trained on each of three datasets and tested on 400 real customs images: the recognition accuracy of models trained on Security Inspection X-ray (SIXray) and Occluded Prohibited Items X-ray (OPIXray) decreased by 16.3% and 15.1%, respectively, while the accuracy of the ACXray-trained model was almost unaffected. This indicates that the model trained on the ACXray dataset generalizes well and is better suited to customs detection scenarios.
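The mean hash mentioned above is a standard perceptual hashing technique. As a minimal illustrative sketch (not the authors' exact pipeline), the snippet below computes an 8x8 average hash with Pillow and NumPy and compares two images by Hamming distance; the hash size and the toy images are assumptions.

```python
import numpy as np
from PIL import Image

def average_hash(img: Image.Image, hash_size: int = 8) -> np.ndarray:
    """Mean/average hash: downscale, grayscale, threshold each pixel at the mean."""
    small = img.convert("L").resize((hash_size, hash_size), Image.LANCZOS)
    pixels = np.asarray(small, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    """Number of differing bits; a small distance suggests a near-duplicate image."""
    return int(np.count_nonzero(h1 != h2))

if __name__ == "__main__":
    a = Image.fromarray((np.random.rand(200, 200) * 255).astype(np.uint8))
    b = a.rotate(1)          # near-duplicate of the same image
    print(hamming(average_hash(a), average_hash(b)))
```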
Funding: Supported by the National Natural Science Foundation of China (62173103) and the Fundamental Research Funds for the Central Universities of China (3072022JC0402, 3072022JC0403).
Abstract: For the first time, this article introduces a LiDAR Point Clouds Dataset of Ships, composed of both collected and simulated data, to address the scarcity of LiDAR data in maritime applications. The collected data are acquired using specialized maritime LiDAR sensors in both inland waterways and wide-open ocean environments. The simulated data are generated by placing a ship in the LiDAR coordinate system and scanning it with a redeveloped Blensor that emulates the operation of a LiDAR sensor equipped with various laser beams. Furthermore, we also render point clouds for foggy and rainy weather conditions. To describe a realistic shipping environment, a dynamic tail wave is modeled by iterating the wave elevation of each point over a time series. Finally, networks designed for small objects are migrated to ship applications by training them on our dataset. The positive effect of the simulated data is demonstrated in object detection experiments, and the negative impact of tail waves as noise is verified in single-object tracking experiments. The dataset is available at https://github.com/zqy411470859/ship_dataset.
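The abstract states that the dynamic tail wave is modeled by iterating the wave elevation of each point over a time series, without specifying the wave model. Purely as an illustrative sketch, the snippet below iterates a simple superposition of sinusoidal components over a grid of surface points; the functional form and wave parameters are assumptions, not the dataset's actual model.

```python
import numpy as np

def wave_elevation(x: np.ndarray, y: np.ndarray, t: float) -> np.ndarray:
    """Toy wave field: sum of two directional sinusoids (elevation in meters)."""
    return (0.30 * np.sin(0.8 * x - 1.2 * t)
            + 0.15 * np.sin(0.5 * (x + y) - 0.9 * t + 0.6))

if __name__ == "__main__":
    # Grid of water-surface points around a ship, updated frame by frame.
    x, y = np.meshgrid(np.linspace(0, 20, 50), np.linspace(0, 20, 50))
    frames = [wave_elevation(x, y, t) for t in np.arange(0.0, 2.0, 0.1)]
    print(len(frames), frames[0].shape)   # 20 frames of a 50x50 elevation map
```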
Funding: The General Program of the National Natural Science Foundation of China (62072049).
Abstract: The rapid growth of modern mobile devices produces large amounts of distributed data, which are extremely valuable for training learning models. Unfortunately, training a model by collecting all of these original data on a centralized cloud server is not feasible due to data privacy and communication cost concerns, which hinders artificial intelligence from empowering mobile devices. Moreover, these data are not independently and identically distributed (non-IID) because of their differing contexts, which degrades model performance. To address these issues, we propose a novel Distributed Learning algorithm based on hierarchical clustering and Adaptive Dataset Condensation, named ADC-DL, which learns a shared model by collecting the synthetic samples generated on each device. To tackle the heterogeneity of the data distribution, we propose an entropy-TOPSIS comprehensive tiering model for hierarchical clustering, which distinguishes clients in terms of their data characteristics. Subsequently, synthetic dummy samples are generated based on the hierarchical structure using adaptive dataset condensation, and the condensation procedure is adjusted adaptively according to the tier of each client. Extensive experiments demonstrate that ADC-DL outperforms existing algorithms in both prediction accuracy and communication cost.
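The entropy-TOPSIS tiering model is not detailed in the abstract. The snippet below is a minimal, generic sketch of entropy weighting followed by TOPSIS scoring over a client feature matrix; the choice of client features, the toy data, and the three-tier split are placeholders, not the ADC-DL formulation.

```python
import numpy as np

def entropy_topsis(X: np.ndarray) -> np.ndarray:
    """Score rows (clients) of a benefit-criteria matrix X (clients x features)."""
    P = X / X.sum(axis=0, keepdims=True)                   # column-wise proportions
    ent = -np.sum(P * np.log(P + 1e-12), axis=0) / np.log(len(X))
    w = (1 - ent) / (1 - ent).sum()                        # entropy weights per feature
    Z = w * X / np.linalg.norm(X, axis=0, keepdims=True)   # weighted, vector-normalized
    best, worst = Z.max(axis=0), Z.min(axis=0)             # ideal and anti-ideal points
    d_best = np.linalg.norm(Z - best, axis=1)
    d_worst = np.linalg.norm(Z - worst, axis=1)
    return d_worst / (d_best + d_worst + 1e-12)            # closeness coefficient in [0, 1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clients = rng.uniform(0.1, 1.0, size=(8, 3))  # e.g. data size, label entropy, diversity
    scores = entropy_topsis(clients)
    tiers = np.digitize(scores, np.quantile(scores, [1 / 3, 2 / 3]))  # split into 3 tiers
    print(scores.round(3), tiers)
```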
Abstract: An M_(S)6.4 earthquake occurred on 21 May 2021 in Yangbi County, Dali Prefecture, Yunnan, China, at 21:48 Beijing Time (13:48 UTC). Earthquakes of M3.0 or higher occurred both before and after the main shock. Seismic data analysis is essential for in-depth investigation of the 2021 Yangbi M_(S)6.4 earthquake sequence and the seismotectonics of northwestern Yunnan. The Institute of Geophysics, China Earthquake Administration (CEA), has compiled a dataset of seismological observations from 157 broadband stations located within 500 km of the epicenter and has made this dataset available to the earthquake science research community. The dataset (total file size: 329 GB) consists of event waveforms with a sampling frequency of 100 sps collected from 18 to 28 May 2021, 20-Hz and 100-Hz continuous waveforms collected from 12 to 31 May 2021, and seismic instrument response files. To promote data sharing, the dataset also includes the seismic event waveforms recorded from 20 to 22 May 2021 at 50 stations of the ongoing Binchuan Active Source Geophysical Observation Project, for which the data protection period has not expired. Sample waveforms of the main shock are included in the appendix of this article and can be downloaded from the Earthquake Science website. The event and continuous waveforms are available on application from the Earthquake Science Data Center website (www.esdc.ac.cn).
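As a minimal sketch of how event waveforms and instrument response files like those described above are typically loaded, the snippet below uses ObsPy; the file names, file formats, and the choice of output units are placeholders and may not match the dataset's actual layout.

```python
from obspy import read, read_inventory

# Placeholder file names; the dataset's actual file layout may differ.
st = read("event_20210521_1348.mseed")           # event waveforms (100 sps)
inv = read_inventory("station_response.xml")     # instrument response files

st.detrend("demean")                              # remove the mean offset
st.remove_response(inventory=inv, output="VEL")   # convert counts to ground velocity
print(st)
```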
Abstract: One of the biggest dangers to society today is terrorism, and attacks have become one of the most significant risks to international peace and national security. Big data, information analysis, and artificial intelligence (AI) have become the basis for making strategic decisions in many sensitive areas, such as fraud detection, risk management, medical diagnosis, and counter-terrorism. However, there is still a need to assess how terrorist attacks are related, initiated, and detected. For this purpose, we propose a novel framework for classifying and predicting terrorist attacks. The proposed framework posits that neglected text attributes in the Global Terrorism Database (GTD) can influence the accuracy with which the model classifies terrorist attacks, since each part of the data can provide vital information that enriches classifier learning. Each data point in the multiclass taxonomy has one or more tags attached to it, referred to as "related tags." We applied machine learning classifiers to classify terrorist attack incidents obtained from the GTD. A transformer-based technique called DistilBERT extracts and learns contextual features from the text attributes to acquire more information from the text data. The extracted contextual features are combined with the "key features" of the dataset and used to perform the final classification. The study explored different experimental setups with various classifiers to evaluate the model's performance. The experimental results show that the proposed framework outperforms the latest techniques for classifying terrorist attacks, achieving an accuracy of 98.7% with the combined feature set and an extreme gradient boosting classifier.
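As a purely illustrative sketch of the pipeline described above (DistilBERT text embeddings combined with tabular "key features" and an extreme gradient boosting classifier), the snippet below uses the Hugging Face transformers and xgboost libraries; the toy incident texts, feature columns, pooling choice, and classifier settings are assumptions, not the authors' exact configuration.

```python
import numpy as np
import torch
from transformers import DistilBertTokenizer, DistilBertModel
from xgboost import XGBClassifier

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

def embed(texts):
    """Mean-pooled DistilBERT embeddings for a list of incident summaries."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()    # ignore padding tokens

# Toy stand-ins for GTD fields: a free-text summary plus numeric "key features".
texts = ["Explosion near a government building", "Armed assault on a convoy"]
key_features = np.array([[1.0, 0.0, 3.0], [0.0, 1.0, 1.0]])  # placeholder columns
labels = np.array([0, 1])                                     # attack-type classes

X = np.hstack([embed(texts), key_features])                   # combined feature set
clf = XGBClassifier(n_estimators=50, max_depth=4).fit(X, labels)
print(clf.predict(X))
```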