Many fields, such as neuroscience, are experiencing a vast proliferation of cellular data, underscoring the need for organizing and interpreting large datasets. A popular approach partitions data into manageable subsets via hierarchical clustering, but objective methods to determine the appropriate classification granularity are missing. We recently introduced a technique to systematically identify when to stop subdividing clusters, based on the fundamental principle that cells must differ more between than within clusters. Here we present the corresponding protocol to classify cellular datasets by combining data-driven unsupervised hierarchical clustering with statistical testing. These general-purpose functions are applicable to any cellular dataset that can be organized as a two-dimensional matrix of numerical values, including molecular, physiological, and anatomical datasets. We demonstrate the protocol using cellular data from the Janelia MouseLight project to characterize morphological aspects of neurons.
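The stopping principle stated in this abstract, that cells must differ more between candidate subclusters than within them, can be sketched minimally as follows. This is an illustrative mean-distance comparison only, not the protocol's actual statistical test, and the function names are hypothetical:

```python
import itertools
import statistics

def distance(a, b):
    # Euclidean distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def keep_split(cluster_a, cluster_b):
    """Decide whether a candidate split of a cluster into two subclusters
    should be kept: cells must differ more between the subclusters than
    within them. The published protocol uses a formal statistical test;
    this sketch simply compares mean pairwise distances."""
    within = [distance(a, b) for grp in (cluster_a, cluster_b)
              for a, b in itertools.combinations(grp, 2)]
    between = [distance(a, b) for a in cluster_a for b in cluster_b]
    if not within or not between:
        return False
    return statistics.mean(between) > statistics.mean(within)

# Two well-separated groups of 2-D "cells": the split is kept
a = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)]
b = [(5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
print(keep_split(a, b))  # True
```

In the actual protocol this decision is applied recursively down the hierarchical clustering tree, so that subdivision stops exactly where the between/within difference is no longer significant.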
Multi-view clustering is a critical research area in computer science aimed at effectively extracting meaningful patterns from complex, high-dimensional data that single-view methods cannot capture. Traditional fuzzy clustering techniques, such as Fuzzy C-Means (FCM), face significant challenges in handling uncertainty and the dependencies between different views. To overcome these limitations, we introduce a new multi-view fuzzy clustering approach, termed Multi-view Picture Fuzzy Clustering (MPFC), that integrates picture fuzzy sets with a dual-anchor graph method for multi-view data, aiming to enhance clustering accuracy and robustness. In particular, picture fuzzy set theory extends the capability to represent uncertainty by modeling three membership levels: membership degrees, neutral degrees, and refusal degrees. This allows for a more flexible representation of uncertain and conflicting data than traditional fuzzy models. Meanwhile, dual-anchor graphs exploit the similarity relationships between data points and integrate information across views. This combination improves stability, scalability, and robustness when handling noisy and heterogeneous data. Experimental results on several benchmark datasets demonstrate significant improvements in clustering accuracy and efficiency, outperforming traditional methods. Specifically, the MPFC algorithm demonstrates outstanding clustering performance on a variety of datasets, attaining a Purity (PUR) score of 0.6440 and an Accuracy (ACC) score of 0.6213 on the 3 Sources dataset, underscoring its robustness and efficiency. The proposed approach contributes significantly to fields such as pattern recognition, multi-view relational data analysis, and large-scale clustering problems. Future work will focus on extending the method to semi-supervised multi-view clustering, aiming to enhance adaptability, scalability, and performance in real-world applications.
Formation water samples in oil and gas fields may be polluted during testing, trial production, collection, storage, transportation, and analysis, so that the measured data do not truly reflect the properties of the formation water. This paper discusses identification methods and a data credibility evaluation method for formation water in the oil and gas fields of petroliferous basins within China. The results of the study show that: (1) The identification methods for formation water include basic single-factor methods, such as physical characteristics, water composition characteristics, water type characteristics, and characteristic coefficients, as well as a comprehensive data credibility evaluation method proposed on this basis, which mainly relies on correlation analysis of the sodium chloride coefficient and the desulfurization coefficient combined with geological background evaluation. (2) The basic identification methods enable preliminary identification of hydrochemical data and preliminary screening of data on site; the proposed comprehensive method performs the evaluation by classifying CaCl2-type water into types A-I to A-VI and NaHCO3-type water into types B-I to B-IV, so that researchers can evaluate the credibility of hydrochemical data in depth and analyze the influencing factors. (3) When the basic methods are used, formation water containing anions such as CO3^2-, OH^- and NO3^-, or formation water whose sodium chloride coefficient and desulfurization coefficient do not match the geological setting, has been invaded by surface water or polluted by working fluid. (4) When the comprehensive method is used, although A-I, A-II, B-I and B-II formation water is generally believed to have high credibility, its data credibility can be evaluated effectively and accurately only in combination with geological setting analysis of factors such as formation environment, sampling conditions, condensate water, acid fluid, leaching of ancient weathering crust, and ancient atmospheric fresh water.
Data hiding methods embed secret messages into cover objects to enable covert communication in a way that is difficult to detect. In data hiding methods based on image interpolation, the image size is reduced and then enlarged through interpolation, followed by the embedding of secret data into the newly generated pixels. A general approach for improving the embedding of secret messages is proposed. The approach may be regarded as a general model for enhancing the data embedding capacity of various existing image interpolation-based data hiding methods. This enhancement is achieved by expanding the range of pixel values available for embedding secret messages, removing the limitation of many existing methods, in which the range is restricted to powers of two to facilitate the direct embedding of bit-based messages. The improvement is accomplished by applying multiple-based number conversion to the secret message data. The method converts the message bits into a multiple-based number and uses an algorithm to embed each digit of this number into an individual pixel, thereby enhancing the message embedding efficiency, as proved by a theorem derived in this study. The proposed improvement method has been tested through experiments on three well-known image interpolation-based data hiding methods. The results show that the proposed method can enhance the three data embedding rates by approximately 14%, 13%, and 10%, respectively, create stego-images of good quality, and resist RS steganalysis attacks. These experimental results indicate that using the multiple-based number conversion technique to improve the three interpolation-based methods increases the number of message bits embedded in the images. For many other image interpolation-based data hiding methods that use power-of-two pixel-value ranges for message embedding, the proposed improvement method is also expected to be effective in enhancing their data embedding capabilities.
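The multiple-based (mixed-radix) number conversion at the heart of the improvement can be illustrated as follows. The per-pixel ranges (bases) below are assumed values for demonstration; the actual embedding algorithm and the capacity theorem are given in the paper:

```python
def to_mixed_radix(value, bases):
    """Convert an integer into the digits of a multiple-based
    (mixed-radix) number, one digit per pixel. Each base is that
    pixel's embeddable value range, which need not be a power of two,
    removing the power-of-two restriction of bit-based embedding."""
    digits = []
    for base in bases:
        digits.append(value % base)
        value //= base
    if value:
        raise ValueError("message too large for the given pixel capacities")
    return digits

def from_mixed_radix(digits, bases):
    """Recover the integer from its mixed-radix digits (extraction side)."""
    value, weight = 0, 1
    for d, base in zip(digits, bases):
        value += d * weight
        weight *= base
    return value

bases = [5, 7, 6, 9]                 # hypothetical per-pixel ranges
msg = int("100110101", 2)            # message bits as an integer (309)
digits = to_mixed_radix(msg, bases)  # one digit embedded per pixel
print(digits, from_mixed_radix(digits, bases))  # [4, 5, 2, 1] 309
```

Because the product of the bases (5*7*6*9 = 1890) exceeds 2^10, four such pixels can carry more message bits than four power-of-two slots of comparable distortion, which is the intuition behind the capacity gain.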
Influenza, an acute respiratory infectious disease caused by the influenza virus, exhibits distinct seasonal patterns in China, with peak activity occurring in winter and spring in northern regions, and in winter and summer in southern areas [1]. The World Health Organization (WHO) emphasizes that early warning and epidemic intensity assessments are critical public health strategies for influenza prevention and control. Internet-based flu surveillance, with real-time data and low costs, effectively complements traditional methods. The Baidu Search Index, which reflects flu-related queries, strongly correlates with influenza trends, aiding in regional activity assessment and outbreak tracking [2].
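The correlation between a search index series and reported influenza activity is typically quantified with a Pearson coefficient. A minimal sketch, using made-up weekly values rather than real Baidu Search Index data:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series,
    the usual way a search-index series is compared against reported
    influenza activity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# weekly flu-related search index vs reported case counts (synthetic)
index = [120, 150, 300, 480, 420, 260, 180]
cases = [40, 55, 110, 170, 150, 95, 60]
print(round(pearson(index, cases), 3))
```

A coefficient near 1 on lagged or contemporaneous series is what justifies using the search index as an early-warning proxy for regional flu activity.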
In response to the issue of fuzzy matching and association when optical observation data are matched with the orbital elements in a catalog database, this paper proposes a matching and association strategy based on the arc-segment difference method. First, a matching error threshold is set to match the observation data against the known catalog database. Second, the matching results for the same day are sorted on the basis of target identity and observation residuals. Different matching error thresholds and arc-segment dynamic association thresholds are then applied to categorize the observation residuals of the same target across different arc-segments, yielding matching results under various thresholds. Finally, the orbital residual is computed through orbit determination (OD), and the positional error is derived by comparing the OD results with the orbit track from the catalog database. The appropriate matching error threshold is then selected on the basis of these results, leading to the final matching and association of the fuzzy correlation data. Experimental results show that the correct matching rate for data arc-segments is 92.34% when the matching error threshold is set to 720″, with the arc-segment difference method achieving an average matching rate of 97.62% within 8 days. The remaining 5.28% of the fuzzy correlation data are correctly matched and associated, enabling identification of orbital maneuver targets through further processing and analysis. This method substantially enhances the efficiency and accuracy of space target cataloging, offering robust technical support for dynamic maintenance of the space target database.
The uniaxial compressive strength (UCS) of rocks is a vital geomechanical parameter widely used for rock mass classification, stability analysis, and engineering design in rock engineering. Various UCS testing methods and apparatuses have been proposed over the past few decades. The objective of the present study is to summarize the status and development of the theories, test apparatuses, and data processing of the existing testing methods for UCS measurement. It starts by elaborating the theories of these test methods. Then the test apparatuses and development trends for UCS measurement are summarized, followed by a discussion of rock specimens for the test apparatuses and of data processing methods. Next, recommendations are given for selecting a method for UCS measurement. The review reveals that the rock failure mechanisms in the UCS testing methods can be divided into compression-shear, compression-tension, composite failure mode, and no obvious failure mode. The trend of these apparatuses is towards automation, digitization, precision, and multi-modal testing. Two size correction methods are commonly used: one develops an empirical correlation between the measured indices and the specimen size; the other uses a standard specimen to calculate a size correction factor. Three to five input parameters are commonly utilized in soft computing models to predict the UCS of rocks. The test method for UCS measurement can be selected according to the testing scenario and the specimen size. Engineers can thus gain a comprehensive understanding of UCS testing methods and their potential developments in various rock engineering endeavors.
This article introduces the methodologies and instrumentation for data measurement and propagation at the Back-n white neutron facility of the China Spallation Neutron Source. The Back-n facility employs backscattering techniques to generate a broad spectrum of white neutrons. Equipped with advanced detectors such as the light particle detector array and the fission ionization chamber detector, the facility achieves high-precision data acquisition through a general-purpose electronics system. Data are managed and stored in a hierarchical system supported by the National High Energy Physics Science Data Center, ensuring long-term preservation and efficient access. The data from the Back-n experiments contribute significantly to nuclear physics, reactor design, astrophysics, and medical physics, enhancing the understanding of nuclear processes and supporting interdisciplinary research.
Deep learning algorithms, which have been increasingly applied in the field of petroleum geophysical prospecting, have achieved good results in improving efficiency and accuracy in test applications. To play a greater role in actual production, these algorithm modules must be integrated into software systems and used more often in production projects. Deep learning frameworks such as TensorFlow and PyTorch basically take Python as the core architecture, while application programs are mainly written in Java, C#, and other programming languages. During integration, the seismic data read by the Java and C# data interfaces must be transferred to the Python main program module. Data exchange methods between Java, C#, and Python include shared memory, shared directories, and so on. However, these methods have the disadvantages of low transmission efficiency and unsuitability for asynchronous networks. Considering the large volume of seismic data and the need for network support in deep learning, this paper proposes a Socket-based method for transmitting seismic data. By exploiting Socket's cross-network and efficient long-distance transmission, this approach solves the problem of inefficient transmission of underlying data when integrating a deep learning algorithm module into a software system. Furthermore, actual production applications show that this method effectively overcomes the shortcomings of data transmission via shared memory, shared directories, and other modes, while improving the transmission efficiency of massive seismic data across modules at the bottom of the software.
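A minimal sketch of Socket-based transfer of one seismic trace between language runtimes, using length-prefixed frames of float32 samples. The framing format here is an assumption for illustration; the paper's actual wire protocol is not specified in this abstract:

```python
import socket
import struct

def send_trace(sock, samples):
    """Send one seismic trace as a length-prefixed frame of
    little-endian float32 samples; a Java or C# peer can produce or
    parse the same layout."""
    payload = struct.pack(f"<{len(samples)}f", *samples)
    sock.sendall(struct.pack("<I", len(payload)) + payload)

def recv_trace(sock):
    """Receive one length-prefixed trace frame and unpack the samples."""
    def recv_exact(n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-frame")
            buf += chunk
        return buf
    (length,) = struct.unpack("<I", recv_exact(4))
    payload = recv_exact(length)
    return list(struct.unpack(f"<{length // 4}f", payload))

# loopback demonstration with a connected socket pair
a, b = socket.socketpair()
send_trace(a, [0.5, -1.25, 3.0])
print(recv_trace(b))  # [0.5, -1.25, 3.0]
a.close(); b.close()
```

The explicit length prefix is what makes the stream usable over real networks: TCP gives no message boundaries, so the receiver must know how many bytes belong to each trace.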
The accurate prediction of battery pack capacity in electric vehicles (EVs) is crucial for ensuring safety and optimizing performance. Despite extensive research on predicting cell capacity using laboratory data, predicting the capacity of onboard battery packs from field data remains challenging due to complex operating conditions and irregular EV usage in real-world settings. Most existing methods rely on extracting health feature parameters from raw data for capacity prediction of onboard battery packs; however, selecting specific parameters often results in a loss of critical information, which reduces prediction accuracy. To this end, this paper introduces a novel framework combining deep learning and data compression techniques to accurately predict battery pack capacity onboard. The proposed data compression method converts monthly EV charging data into feature maps, which preserve essential data characteristics while reducing the volume of raw data. To address missing capacity labels in field data, a capacity labeling method is proposed, which calculates monthly battery capacity by rearranging the ampere-hour integration formula and applying linear regression. Subsequently, a deep learning model is built for capacity prediction, using feature maps from historical months to predict the battery capacity of future months, thus enabling accurate forecasts. The proposed framework, evaluated using field data from 20 EVs, achieves a mean absolute error of 0.79 Ah, a mean absolute percentage error of 0.65%, and a root mean square error of 1.02 Ah, highlighting its potential for real-world EV applications.
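The capacity-labeling idea, rearranging the ampere-hour integration formula so that a charging segment with known start and end SOC yields a capacity estimate, can be sketched as follows. The helper name and the synthetic charging segment are illustrative; the paper additionally smooths monthly estimates with linear regression:

```python
def capacity_from_charge_segment(currents_a, times_s, soc_start, soc_end):
    """Estimate pack capacity (Ah) from one charging segment by
    rearranging the ampere-hour integration formula:
        delta_SOC = integral(I dt) / Q   =>   Q = integral(I dt) / delta_SOC
    where SOC values are percentages and currents are in amperes."""
    # trapezoidal integration of current over time, converted to Ah
    ah = 0.0
    for i in range(1, len(currents_a)):
        dt_h = (times_s[i] - times_s[i - 1]) / 3600.0
        ah += 0.5 * (currents_a[i] + currents_a[i - 1]) * dt_h
    delta_soc = (soc_end - soc_start) / 100.0
    return ah / delta_soc

# constant 50 A for one hour raising SOC from 20% to 70% -> 100 Ah pack
cap = capacity_from_charge_segment([50.0, 50.0], [0.0, 3600.0], 20.0, 70.0)
print(round(cap, 1))  # 100.0
```

One such label per month gives the supervision signal the deep learning model is trained against, without requiring any laboratory capacity test.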
In the face of data scarcity in the optimization of maintenance strategies for civil aircraft, traditional failure data-driven methods are encountering challenges owing to the increasing reliability of aircraft design. This study addresses this issue by presenting a novel combined data fusion algorithm, which enhances the accuracy and reliability of failure rate analysis for a specific aircraft model by integrating historical failure data from similar models as supplementary information. Through a comprehensive analysis of two different maintenance projects, this study illustrates the application process of the algorithm. Building upon the analysis results, this paper introduces the innovative equal integral value method as a replacement for the conventional equal interval method in maintenance schedule optimization. A Monte Carlo simulation example validates that the equal integral value method surpasses the traditional method by over 20% in terms of inspection efficiency ratio. This finding indicates that the equal integral value method not only upholds maintenance efficiency but also substantially decreases workload and maintenance costs. The findings of this study open up new perspectives for airlines grappling with data scarcity, offer fresh strategies for the optimization of aviation maintenance practices, and chart a course toward more efficient and cost-effective maintenance schedule optimization through refined data analysis.
Substantial advancements have been achieved in Tunnel Boring Machine (TBM) technology and monitoring systems, yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results. This study investigates the issue of missing data in extensive TBM datasets. Through a comprehensive literature review, we analyze the mechanisms behind missing TBM data and compare different imputation methods, including statistical analysis and machine learning algorithms. We also examine the impact of various missing patterns and missing rates on the efficacy of these methods. Finally, we propose a dynamic interpolation strategy tailored for TBM engineering sites. The results show that the K-Nearest Neighbors (KNN) and Random Forest (RF) algorithms achieve good interpolation results; that the interpolation performance of all methods decreases as the missing rate increases; and that block missing is the hardest pattern to impute, followed by mixed missing, while sporadic missing is imputed best. On-site application results validate the proposed interpolation strategy's capability to achieve robust missing value imputation, applicable in machine learning scenarios such as parameter optimization, attitude warning, and pressure prediction. These findings contribute to enhancing the efficiency of TBM missing data processing, offering more effective support for large-scale TBM monitoring datasets.
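The KNN imputation that performed well in the comparison can be sketched in a few lines for a numeric table with `None` gaps. This is a generic illustration of the method's idea, not the study's implementation:

```python
def knn_impute(rows, k=2):
    """Fill None gaps in a numeric table: for each missing cell, average
    that column's values from the k nearest rows, where nearness is
    measured over columns observed in both rows (a minimal sketch of
    KNN imputation, one of the methods compared in the study)."""
    def dist(r1, r2):
        pairs = [(a, b) for a, b in zip(r1, r2)
                 if a is not None and b is not None]
        if not pairs:
            return float("inf")
        return (sum((a - b) ** 2 for a, b in pairs) / len(pairs)) ** 0.5

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                donors = sorted(
                    (dist(row, other), other[j])
                    for idx, other in enumerate(rows)
                    if idx != i and other[j] is not None)[:k]
                filled[i][j] = sum(val for _, val in donors) / len(donors)
    return filled

# third row is missing its second column; nearest rows carry 10.0 and 11.0
data = [[1.0, 10.0], [1.1, 11.0], [0.9, None], [5.0, 50.0]]
print(knn_impute(data, k=2)[2][1])  # 10.5
```

This local-neighborhood averaging is also why KNN degrades on block missing: when whole contiguous stretches are absent, no nearby complete rows remain to act as donors.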
Industrial data mining usually deals with data from different sources. These heterogeneous datasets describe the same object from different views. However, samples from some of the datasets may be lost, so the remaining samples no longer correspond one-to-one. Mismatched datasets caused by missing samples make the industrial data unavailable for further machine learning. In order to align the mismatched samples, this article presents a cooperative iteration matching method (CIMM) based on modified dynamic time warping (DTW). The proposed method regards the sequentially accumulated industrial data as time series, and mismatched samples are aligned by DTW. In addition, dynamic constraints are applied to the warping distance of the DTW process to make the alignment more efficient. A series of models is then trained iteratively with the accumulated samples. Several groups of numerical experiments on different missing patterns and missing locations are designed and analyzed to prove the effectiveness and applicability of the proposed method.
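The DTW alignment core that CIMM builds on can be sketched plainly. This is textbook unconstrained DTW; the paper's method adds dynamic constraints on the warping distance and an iterative model-training loop around it:

```python
def dtw_align(seq_a, seq_b):
    """Plain dynamic time warping between two numeric sequences:
    returns the total warping distance and the index alignment path.
    A lost sample in one sequence shows up as one index of the other
    sequence matching twice."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = {(i - 1, j - 1): cost[i - 1][j - 1],
                 (i - 1, j): cost[i - 1][j],
                 (i, j - 1): cost[i][j - 1]}
        i, j = min(moves, key=moves.get)
    return cost[n][m], path[::-1]

# the second sequence has a duplicated sample; warping absorbs it
dist, path = dtw_align([1, 2, 3, 4], [1, 2, 2, 3, 4])
print(dist)  # 0.0
```

The returned path is exactly the sample correspondence needed to re-match the heterogeneous datasets before model training.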
As pivotal supporting technologies for smart manufacturing and digital engineering, model-based and data-driven methods have been widely applied in many industrial fields, such as product design, process monitoring, and smart maintenance. While promising, both methods have issues that need to be addressed. For example, model-based methods are limited by low computational accuracy and a high computational burden, while data-driven methods often suffer from poor interpretability and redundant features. To address these issues, the concept of data-model fusion (DMF) emerges as a promising solution. DMF integrates model-based methods with data-driven methods by incorporating big data into model-based methods or embedding relevant domain knowledge into data-driven methods. Despite growing efforts in the field of DMF, a unanimous definition of DMF remains elusive, and a general framework for DMF has rarely been discussed. This paper aims to address this gap by providing a thorough overview and categorization of both data-driven methods and model-based methods. Subsequently, this paper presents the definition and categorization of DMF and discusses a general DMF framework. Moreover, the seven primary applications of DMF are reviewed within the context of smart manufacturing and digital engineering. Finally, this paper outlines future directions for DMF.
Data collection serves as the cornerstone in the study of clinical research questions. Two types of data are commonly utilized in medicine: (1) qualitative; and (2) quantitative. Several methods are commonly employed to gather data, regardless of whether retrospective or prospective studies are used: (1) interviews; (2) observational methods; (3) questionnaires; (4) investigation parameters; (5) medical records; and (6) electronic chart reviews. Each source type has its own advantages and disadvantages in terms of the accuracy and availability of the data to be extracted. We will focus on two important parts of research methodology: (1) data collection; and (2) subgroup analyses. Errors in research can arise from various sources, including investigators, instruments, and subjects, making the validation and reliability of research tools crucial for ensuring the credibility of findings. Subgroup analyses can either be planned before treatment or emerge afterwards (post hoc). The interpretation of subgroup effects should consider the interaction between the treatment effect and various patient variables with caution.
With the rapid development of the industrial Internet, the network security environment has become increasingly complex and variable. Intrusion detection, a core technology for ensuring the security of industrial control systems, faces the challenge of unbalanced data samples, particularly low detection rates for minority-class attack samples. Therefore, this paper proposes a data enhancement method for intrusion detection in the industrial Internet based on a Self-Attention Wasserstein Generative Adversarial Network (SA-WGAN), addressing the low detection rates of minority-class attack samples in unbalanced intrusion detection scenarios. The proposed method integrates a self-attention mechanism with a Wasserstein Generative Adversarial Network (WGAN). The self-attention mechanism automatically learns important features from the input data and assigns different weights to emphasize the key features related to intrusion behaviors, providing strong guidance for subsequent data generation. The WGAN generates new data samples through adversarial training to expand the original dataset. In the SA-WGAN framework, the WGAN directs the data generation process based on the key features extracted by the self-attention mechanism, ensuring that the generated samples exhibit both diversity and similarity to real data. Experimental results demonstrate that the SA-WGAN-based data enhancement method significantly improves detection performance for minority-class attack samples, addresses issues of insufficient data and category imbalance, and enhances the generalization ability and overall performance of the intrusion detection model.
Missing data handling is vital for multi-sensor information fusion fault diagnosis of motors, to prevent accuracy decay or even model failure, and some promising results have been obtained in several recent studies. These studies, however, have the following limitations: 1) effective supervision is neglected for missing data across different fault types; and 2) imbalance in missing rates among fault types results in inadequate learning during model training. To overcome these limitations, this paper proposes a dynamic relative advantage-driven multi-fault synergistic diagnosis method to accomplish accurate fault diagnosis of motors under imbalanced missing data rates. First, a cross-fault-type generalized synergistic diagnostic strategy is established based on variational information bottleneck theory, which ensures sufficient supervision in handling missing data. Then, a dynamic relative advantage assessment technique is designed to reduce the diagnostic accuracy decay caused by imbalanced missing data rates. The proposed method is validated using multi-sensor data from motor fault simulation experiments, and the experimental results demonstrate its effectiveness and superiority in improving diagnostic accuracy and generalization under imbalanced missing data rates.
Snow cover in mountainous areas is characterized by high reflectivity, strong spatial heterogeneity, rapid changes, and susceptibility to cloud interference. However, due to the limitations of a single sensor, it is challenging to obtain high-resolution satellite remote sensing data for monitoring the dynamic changes of snow cover within a day. This study focuses on two typical data fusion methods for polar-orbiting satellites (Sentinel-3 SLSTR) and geostationary satellites (Himawari-9 AHI), and explores the snow cover detection accuracy of a multitemporal cloud-gap snow cover identification model (loose data fusion) and the ESTARFM (spatiotemporal data fusion). Taking the Qilian Mountains as the research area, the accuracy of the two data fusion results was verified against snow cover extracted from Landsat-8 SR products. The results showed that both data fusion models could effectively capture the spatiotemporal variations of snow cover, but the ESTARFM demonstrated superior performance: it not only produced fusion images at any target time, but also extracted snow cover closer to the spatial distribution of real satellite images. Therefore, the ESTARFM was used to fuse images for hourly reconstruction of the snow cover on February 14–15, 2023. It was found that the maximum snow cover area of this snowfall reached 83.84% of the Qilian Mountains area, and the snow melted extremely rapidly, with a change of up to 4.30% of the study area per hour. This study offers reliable high spatiotemporal resolution satellite remote sensing data for monitoring snow cover changes in mountainous areas, contributing to more accurate and timely assessments.
This research introduces a unique approach to segmenting breast cancer images using a U-Net-based architecture. Because the computational demand of image processing is very high, this research builds a system that enables image segmentation training on low-power machines. To accomplish this, the data are divided into several segments, each trained separately. For prediction, an initial output is predicted from each trained model for an input, and the ultimate output is selected by pixel-wise majority voting over the predicted outputs, which also preserves data privacy. In addition, this kind of distributed training system allows different computers to be used simultaneously, so the training process takes comparatively less time than typical training approaches. Even after training is complete, the proposed prediction system allows a newly trained model to be added to the system, so the prediction becomes consistently more accurate. We evaluated the effectiveness of the ultimate output based on four performance metrics: average pixel accuracy, mean absolute error, average specificity, and average balanced accuracy. The experimental results show scores of 0.9216, 0.0687, 0.9477, and 0.8674 for average pixel accuracy, mean absolute error, average specificity, and average balanced accuracy, respectively. In addition, the proposed method was compared with four other state-of-the-art models in terms of total training time and usage of computational resources, and it outperformed all of them in these aspects.
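The pixel-wise majority voting used to combine the separately trained models can be sketched for binary masks as follows. The strict-majority tie rule below (ties go to background) is an assumption; the abstract does not specify how ties are broken:

```python
def majority_vote(masks):
    """Combine binary segmentation masks from separately trained models
    by pixel-wise majority voting: a pixel is foreground only if more
    than half of the models predict foreground."""
    h, w = len(masks[0]), len(masks[0][0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            votes = sum(m[y][x] for m in masks)
            out[y][x] = 1 if votes * 2 > len(masks) else 0
    return out

# three models' 2x2 predictions; disagreements are settled per pixel
m1 = [[1, 0], [1, 1]]
m2 = [[1, 0], [0, 1]]
m3 = [[0, 0], [1, 1]]
print(majority_vote([m1, m2, m3]))  # [[1, 0], [1, 1]]
```

Because the vote only needs each model's output mask, a newly trained model can join the ensemble at any time simply by contributing one more vote per pixel, which is how the system stays extensible after training.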
An improved cycle-consistent generative adversarial network(CycleGAN) method for defect data augmentation based on feature fusion and self attention residual module is proposed to address the insufficiency of defect s...An improved cycle-consistent generative adversarial network(CycleGAN) method for defect data augmentation based on feature fusion and self attention residual module is proposed to address the insufficiency of defect sample data for light guide plate(LGP) in production,as well as the problem of minor defects.Two optimizations are made to the generator of CycleGAN:fusion of low resolution features obtained from partial up-sampling and down-sampling with high-resolution features,combination of self attention mechanism with residual network structure to replace the original residual module.Qualitative and quantitative experiments were conducted to compare different data augmentation methods,and the results show that the defect images of the LGP generated by the improved network were more realistic,and the accuracy of the you only look once version 5(YOLOv5) detection network for the LGP was improved by 5.6%,proving the effectiveness and accuracy of the proposed method.展开更多
Funding: Supported in part by NIH grants R01NS39600, U01MH114829, and RF1MH128693 (to GAA).
Abstract: Many fields, such as neuroscience, are experiencing a vast proliferation of cellular data, underscoring the need for organizing and interpreting large datasets. A popular approach partitions data into manageable subsets via hierarchical clustering, but objective methods to determine the appropriate classification granularity are missing. We recently introduced a technique to systematically identify when to stop subdividing clusters, based on the fundamental principle that cells must differ more between than within clusters. Here we present the corresponding protocol to classify cellular datasets by combining data-driven unsupervised hierarchical clustering with statistical testing. These general-purpose functions are applicable to any cellular dataset that can be organized as two-dimensional matrices of numerical values, including molecular, physiological, and anatomical datasets. We demonstrate the protocol using cellular data from the Janelia MouseLight project to characterize morphological aspects of neurons.
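The stopping principle described above, that cells must differ more between clusters than within them, can be sketched with off-the-shelf tools. This is an illustrative sketch using SciPy; the function name and the choice of a Mann-Whitney test are assumptions, not the protocol's exact statistics.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from scipy.stats import mannwhitneyu

def should_split(data, labels, alpha=0.05):
    """Test whether between-cluster distances exceed within-cluster
    distances; returns True when the split is statistically supported."""
    d = squareform(pdist(data))
    same = labels[:, None] == labels[None, :]
    iu = np.triu_indices_from(d, k=1)
    within = d[iu][same[iu]]
    between = d[iu][~same[iu]]
    # one-sided test: between-cluster distances larger than within
    _, p = mannwhitneyu(between, within, alternative="greater")
    return p < alpha

rng = np.random.default_rng(0)
# two well-separated synthetic "cell" groups, 5 features each
data = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(6, 1, (30, 5))])
labels = fcluster(linkage(data, method="ward"), t=2, criterion="maxclust")
print(should_split(data, labels))  # True for well-separated groups
```

For poorly separated groups the test fails and the subdivision would be rejected, which is the stopping criterion in action.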
Funding: Funded by Research Project THTETN.05/24-25, Vietnam Academy of Science and Technology.
Abstract: Multi-view clustering is a critical research area in computer science aimed at effectively extracting meaningful patterns from complex, high-dimensional data that single-view methods cannot capture. Traditional fuzzy clustering techniques, such as Fuzzy C-Means (FCM), face significant challenges in handling uncertainty and the dependencies between different views. To overcome these limitations, we introduce a new multi-view fuzzy clustering approach, termed Multi-view Picture Fuzzy Clustering (MPFC), that integrates picture fuzzy sets with a dual-anchor graph method for multi-view data to enhance clustering accuracy and robustness. In particular, picture fuzzy set theory extends the capability to represent uncertainty by modeling three membership levels: membership degrees, neutral degrees, and refusal degrees. This allows for a more flexible representation of uncertain and conflicting data than traditional fuzzy models. Meanwhile, dual-anchor graphs exploit the similarity relationships between data points and integrate information across views. This combination improves stability, scalability, and robustness when handling noisy and heterogeneous data. Experimental results on several benchmark datasets demonstrate significant improvements in clustering accuracy and efficiency over traditional methods. Specifically, the MPFC algorithm delivers outstanding clustering performance on a variety of datasets, attaining a Purity (PUR) score of 0.6440 and an Accuracy (ACC) score of 0.6213 on the 3 Sources dataset, underscoring its robustness and efficiency. The proposed approach contributes significantly to fields such as pattern recognition, multi-view relational data analysis, and large-scale clustering problems. Future work will focus on extending the method to semi-supervised multi-view clustering, aiming to enhance adaptability, scalability, and performance in real-world applications.
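In the standard picture fuzzy set formulation, each element carries positive, neutral, and negative membership degrees whose sum is at most 1, with the remainder interpreted as the refusal degree. A minimal sketch of that constraint (illustrative only, not the MPFC implementation):

```python
def refusal_degree(mu, eta, nu):
    """Return the refusal degree of a picture fuzzy element with
    positive (mu), neutral (eta) and negative (nu) membership degrees,
    or None if the degrees violate the constraint mu + eta + nu <= 1."""
    if min(mu, eta, nu) < 0 or mu + eta + nu > 1:
        return None
    return 1.0 - (mu + eta + nu)

print(round(refusal_degree(0.5, 0.2, 0.2), 6))  # 0.1
print(refusal_degree(0.6, 0.3, 0.3))            # None: degrees sum past 1
```

The extra neutral and refusal components are what let picture fuzzy models express "undecided" and "refused" evidence that ordinary fuzzy membership cannot.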
Funding: Supported by the PetroChina Science and Technology Project (2023ZZ0202).
Abstract: Formation water samples in oil and gas fields may be polluted during testing, trial production, collection, storage, transportation, and analysis, so that the properties of the formation water are not truly reflected. This paper discusses identification methods and a data credibility evaluation method for formation water in oil and gas fields of petroliferous basins within China. The results of the study show that: (1) The identification methods for formation water include basic single-factor methods based on physical characteristics, water composition characteristics, water type characteristics, and characteristic coefficients, as well as a comprehensive data credibility evaluation method proposed on this basis, which mainly relies on correlation analysis of the sodium chloride coefficient and the desulfurization coefficient combined with geological background evaluation. (2) The basic identification methods enable preliminary identification of hydrochemical data and preliminary screening of data on site; the proposed comprehensive method performs the evaluation by classifying CaCl2-type water into types A-I to A-VI and NaHCO3-type water into types B-I to B-IV, so that researchers can evaluate the credibility of hydrochemical data in depth and analyze the influencing factors. (3) When the basic methods are used, formation water containing anions such as CO_3^(2-), OH^- and NO_3^-, or formation water whose sodium chloride coefficient and desulfurization coefficient do not match the geological setting, has been invaded by surface water or polluted by working fluid. (4) When the comprehensive method is used, the data credibility of A-I, A-II, B-I, and B-II formation water can be evaluated effectively and accurately only when combined with geological setting analysis of factors such as formation environment, sampling conditions, condensate water, acid fluid, leaching of ancient weathering crusts, and ancient atmospheric fresh water, although such formation water is generally considered highly credible.
Abstract: Data hiding methods involve embedding secret messages into cover objects to enable covert communication in a way that is difficult to detect. In data hiding methods based on image interpolation, the image size is reduced and then enlarged through interpolation, followed by the embedding of secret data into the newly generated pixels. A general improving approach for embedding secret messages is proposed. The approach may be regarded as a general model for enhancing the data embedding capacity of various existing image interpolation-based data hiding methods. This enhancement is achieved by expanding the range of pixel values available for embedding secret messages, removing the limitation of many existing methods, where the range is restricted to powers of two to facilitate the direct embedding of bit-based messages. The improvement is accomplished through the application of multiple-based number conversion to the secret message data. The method converts the message bits into a multiple-based number and uses an algorithm to embed each digit of this number into an individual pixel, thereby enhancing the message embedding efficiency, as proved by a theorem derived in this study. The proposed improvement method has been tested through experiments on three well-known image interpolation-based data hiding methods. The results show that the proposed method can enhance the three data embedding rates by approximately 14%, 13%, and 10%, respectively, create stego-images of good quality, and resist RS steganalysis attacks. These experimental results indicate that using the multiple-based number conversion technique to improve the three interpolation-based methods increases the number of message bits embedded in the images. For the many other image interpolation-based data hiding methods that use power-of-two pixel-value ranges for message embedding, the proposed improvement method is also expected to be effective in enhancing their data embedding capabilities.
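The multiple-based number conversion at the heart of the approach can be sketched as a mixed-radix conversion, where each digit's base matches the embedding range of one pixel. The function names and example bases below are illustrative, not the paper's notation:

```python
def bits_to_mixed_radix(bits, bases):
    """Convert a bit string into digits of a mixed-radix number whose
    per-digit bases match each pixel's available embedding range,
    so ranges need not be powers of two."""
    value = int(bits, 2)
    digits = []
    for b in bases:  # least-significant digit first
        value, r = divmod(value, b)
        digits.append(r)
    if value:
        raise ValueError("capacity of the given bases exceeded")
    return digits

def mixed_radix_to_bits(digits, bases, nbits):
    """Inverse conversion: recover the original bit string."""
    value = 0
    for d, b in zip(reversed(digits), reversed(bases)):
        value = value * b + d
    return format(value, f"0{nbits}b")

msg = "101101"
digits = bits_to_mixed_radix(msg, [5, 7, 3])  # capacity 5*7*3 = 105 > 2**6
print(digits)
assert mixed_radix_to_bits(digits, [5, 7, 3], 6) == msg  # lossless round trip
```

Because the digit bases (here 5, 7, 3) need not be powers of two, every available pixel value participates in carrying message information, which is where the capacity gain comes from.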
Funding: Supported by the National Key Research and Development Program of China (Project No. 2023YFC2307500).
Abstract: Influenza, an acute respiratory infectious disease caused by the influenza virus, exhibits distinct seasonal patterns in China, with peak activity occurring in winter and spring in northern regions, and in winter and summer in southern areas [1]. The World Health Organization (WHO) emphasizes that early warning and epidemic intensity assessments are critical public health strategies for influenza prevention and control. Internet-based flu surveillance, with real-time data and low costs, effectively complements traditional methods. The Baidu Search Index, which reflects flu-related queries, strongly correlates with influenza trends, aiding in regional activity assessment and outbreak tracking [2].
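The claimed correlation between search queries and influenza activity is the kind of relationship a simple Pearson coefficient captures. A toy illustration with hypothetical weekly numbers (not real surveillance data):

```python
import numpy as np

# Hypothetical weekly series: reported flu cases and a search index.
cases = np.array([120, 150, 300, 800, 950, 700, 400, 200])
search_index = np.array([1000, 1300, 2600, 7000, 8200, 6100, 3500, 1800])

r = np.corrcoef(cases, search_index)[0, 1]
print(f"Pearson r = {r:.3f}")  # strong positive correlation expected
```

In practice such surveillance work also has to handle reporting lags and media-driven search spikes, which is why internet signals complement rather than replace traditional case reporting.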
Funding: Supported by the National Natural Science Foundation of China (12273080).
Abstract: In response to the issue of fuzzy matching and association when optical observation data are matched against the orbital elements in a catalog database, this paper proposes a matching and association strategy based on the arc-segment difference method. First, a matching error threshold is set to match the observation data with the known catalog database. Second, the matching results for the same day are sorted on the basis of target identity and observation residuals. Different matching error thresholds and arc-segment dynamic association thresholds are then applied to categorize the observation residuals of the same target across different arc segments, yielding matching results under various thresholds. Finally, the orbital residual is computed through orbit determination (OD), and the positional error is derived by comparing the OD results with the orbit track from the catalog database. The appropriate matching error threshold is then selected on the basis of these results, leading to the final matching and association of the fuzzy correlation data. Experimental results showed that the correct matching rate for data arc segments is 92.34% when the matching error threshold is set to 720″, with the arc-segment difference method achieving an average matching rate of 97.62% within 8 days. The remaining 5.28% of the fuzzy correlation data are correctly matched and associated, enabling identification of orbital maneuver targets through further processing and analysis. This method substantially enhances the efficiency and accuracy of space target cataloging, offering robust technical support for dynamic maintenance of the space target database.
Funding: The authors acknowledge the National Natural Science Foundation of China (Grant Nos. 52308403 and 52079068), the Yunlong Lake Laboratory of Deep Underground Science and Engineering (No. 104023005), and the China Postdoctoral Science Foundation (Grant No. 2023M731998) for funding provided to this work.
Abstract: The uniaxial compressive strength (UCS) of rocks is a vital geomechanical parameter widely used for rock mass classification, stability analysis, and engineering design in rock engineering. Various UCS testing methods and apparatuses have been proposed over the past few decades. The objective of the present study is to summarize the status of and developments in the theories, test apparatuses, and data processing of the existing testing methods for UCS measurement. It starts by elaborating the theories of these test methods. Then the test apparatuses and their development trends are summarized, followed by a discussion of rock specimens for the test apparatuses and of data processing methods. Next, recommendations are given for selecting a UCS measurement method. The review reveals that the rock failure mechanisms in UCS testing methods can be divided into compression-shear, compression-tension, composite, and no obvious failure modes. The apparatuses are trending toward automation, digitization, precision, and multi-modal testing. Two size correction methods are commonly used: one develops an empirical correlation between the measured indices and the specimen size; the other uses a standard specimen to calculate a size correction factor. Three to five input parameters are commonly utilized in soft computing models to predict the UCS of rocks. The test method for UCS measurement can be selected according to the testing scenario and the specimen size. Engineers can gain a comprehensive understanding of UCS testing methods and their potential developments in various rock engineering endeavors.
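The second size correction approach, normalizing a measured value to a standard specimen, is often written as a power law. A sketch using the widely cited Hoek-Brown form UCS_50 = UCS_d (d/50)^0.18, where the 0.18 exponent is an empirical assumption that varies by rock type:

```python
def ucs_at_50mm(measured_mpa, diameter_mm, exponent=0.18):
    """Correct a UCS measured on a specimen of diameter d (mm) to the
    50 mm reference size via UCS_50 = UCS_d * (d / 50) ** exponent.
    The 0.18 exponent is the empirical Hoek-Brown value; site-specific
    correlations may differ."""
    return measured_mpa * (diameter_mm / 50.0) ** exponent

print(ucs_at_50mm(100.0, 50.0))   # 100.0: no correction at reference size
print(round(ucs_at_50mm(100.0, 100.0), 1))  # larger specimen corrects upward
```

This is the "size correction factor" route; the alternative route fits the correlation directly from paired tests at multiple specimen sizes.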
基金supported by the National Key Research and Development Plan(No.2023YFA1606602)。
Abstract: This article introduces the methodologies and instrumentation for data measurement and propagation at the Back-n white neutron facility of the China Spallation Neutron Source. The Back-n facility employs backscattering techniques to generate a broad spectrum of white neutrons. Equipped with advanced detectors such as the light particle detector array and the fission ionization chamber detector, the facility achieves high-precision data acquisition through a general-purpose electronics system. Data are managed and stored in a hierarchical system supported by the National High Energy Physics Science Data Center, ensuring long-term preservation and efficient access. The data from the Back-n experiments contribute significantly to nuclear physics, reactor design, astrophysics, and medical physics, enhancing the understanding of nuclear processes and supporting interdisciplinary research.
Funding: Supported by the PetroChina Prospective, Basic, and Strategic Technology Research Project (Nos. 2021ZG03-02 and 2023DJ8402).
Abstract: Deep learning algorithms, which are increasingly applied in the field of petroleum geophysical prospecting, have achieved good results in improving efficiency and accuracy in test applications. To play a greater role in actual production, these algorithm modules must be integrated into software systems and used more often in actual production projects. Deep learning frameworks such as TensorFlow and PyTorch essentially take Python as the core architecture, while the application programs mainly use Java, C#, and other programming languages. During integration, the seismic data read by the Java and C# data interfaces must be transferred to the Python main program module. The data exchange methods between Java, C#, and Python include shared memory, shared directories, and so on. However, these methods have the disadvantages of low transmission efficiency and unsuitability for asynchronous networks. Considering the large volume of seismic data and the need for network support in deep learning, this paper proposes a method of transmitting seismic data based on Socket. By fully exploiting Socket's efficient cross-network, long-distance transmission, this approach solves the problem of inefficient transmission of underlying data when integrating deep learning algorithm modules into a software system. Furthermore, actual production applications show that this method effectively overcomes the shortcomings of data transmission via shared memory, shared directories, and other modes, while improving the transmission efficiency of massive seismic data across modules at the bottom of the software.
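A minimal sketch of what such a Socket-based exchange can look like on the Python side, assuming a simple length-prefixed framing (a 4-byte big-endian length followed by the raw trace bytes). The framing and function names are illustrative assumptions, not the paper's actual protocol:

```python
import socket
import struct
import threading

def recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return partial chunks."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf += chunk
    return buf

def serve_once(server):
    """Accept one connection, read a framed payload, ack its length."""
    conn, _ = server.accept()
    with conn:
        (length,) = struct.unpack(">I", recv_exact(conn, 4))
        data = recv_exact(conn, length)
        conn.sendall(struct.pack(">I", len(data)))

server = socket.socket()
server.bind(("127.0.0.1", 0))  # ephemeral port for the demo
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=serve_once, args=(server,))
t.start()

payload = bytes(range(256)) * 4  # stand-in for a block of seismic traces
client = socket.create_connection(("127.0.0.1", port))
client.sendall(struct.pack(">I", len(payload)) + payload)
(acked,) = struct.unpack(">I", recv_exact(client, 4))
client.close()
t.join()
server.close()
print(acked)  # 1024 bytes acknowledged
```

The Java or C# side would write the same 4-byte length prefix before the trace bytes; because the stream is framed explicitly, the exchange works across asynchronous networks where shared memory and shared directories cannot.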
Funding: Supported in part by the Science and Technology Department of Sichuan Province (Nos. 2025ZNSFSC0427 and 2024ZDZX0035), the Open Project Fund of the Vehicle Measurement, Control and Safety Key Laboratory of Sichuan Province (No. QCCK2024-004), and the Industrial and Educational Integration Project of Yibin (No. YB-XHU-20240001).
Abstract: The accurate prediction of battery pack capacity in electric vehicles (EVs) is crucial for ensuring safety and optimizing performance. Despite extensive research on predicting cell capacity using laboratory data, predicting the capacity of onboard battery packs from field data remains challenging due to complex operating conditions and irregular EV usage in real-world settings. Most existing methods rely on extracting health feature parameters from raw data for capacity prediction of onboard battery packs; however, selecting specific parameters often results in a loss of critical information, which reduces prediction accuracy. To this end, this paper introduces a novel framework combining deep learning and data compression techniques to accurately predict battery pack capacity onboard. The proposed data compression method converts monthly EV charging data into feature maps, which preserve essential data characteristics while reducing the volume of raw data. To address missing capacity labels in field data, a capacity labeling method is proposed, which calculates monthly battery capacity by transforming the ampere-hour integration formula and applying linear regression. Subsequently, a deep learning model is proposed to build a capacity prediction model, using feature maps from historical months to predict the battery capacity of future months, thus facilitating accurate forecasts. The proposed framework, evaluated using field data from 20 EVs, achieves a mean absolute error of 0.79 Ah, a mean absolute percentage error of 0.65%, and a root mean square error of 1.02 Ah, highlighting its potential for real-world EV applications.
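The capacity labeling step, transforming the ampere-hour integration formula, amounts to dividing the charge throughput of a partial charging segment by its state-of-charge change. A toy sketch of that transformation (illustrative, not the paper's exact pipeline, which additionally applies linear regression across segments):

```python
import numpy as np

def label_capacity_ah(current_a, soc, dt_s=1.0):
    """Capacity label from a partial charge segment: Q = (∫I dt) / ΔSOC.
    current_a: current samples (A); soc: matching state-of-charge
    samples on a 0-1 scale; dt_s: sample period (s)."""
    charged_ah = np.sum(current_a[:-1]) * dt_s / 3600.0  # rectangle rule
    return charged_ah / (soc[-1] - soc[0])

# constant 10 A charge for one hour raising SOC from 20% to 30%
current = np.full(3601, 10.0)
soc = np.linspace(0.2, 0.3, 3601)
print(round(label_capacity_ah(current, soc), 3))  # 100.0 Ah
```

Ten ampere-hours of charge moving the pack through 10% of its range implies a 100 Ah pack; repeating this over many monthly segments and regressing gives the monthly labels the model trains on.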
Abstract: In the face of data scarcity in the optimization of maintenance strategies for civil aircraft, traditional failure-data-driven methods are encountering challenges owing to the increasing reliability of aircraft designs. This study addresses this issue by presenting a novel combined data fusion algorithm, which enhances the accuracy and reliability of failure rate analysis for a specific aircraft model by integrating historical failure data from similar models as supplementary information. Through a comprehensive analysis of two different maintenance projects, this study illustrates the application process of the algorithm. Building upon the analysis results, this paper introduces the innovative equal integral value method as a replacement for the conventional equal interval method in the context of maintenance schedule optimization. A Monte Carlo simulation example validates that the equal integral value method surpasses the traditional method by over 20% in terms of inspection efficiency ratio. This finding indicates that the equal integral value method not only upholds maintenance efficiency but also substantially decreases workload and maintenance costs. The findings of this study open up novel perspectives for airlines grappling with data scarcity, offer fresh strategies for the optimization of aviation maintenance practices, and chart a new course toward more efficient and cost-effective maintenance schedule optimization through refined data analysis.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 52409151), the Programme of the Shenzhen Key Laboratory of Green, Efficient and Intelligent Construction of Underground Metro Stations (Programme No. ZDSYS20200923105200001), and the Science and Technology Major Project of Xizang Autonomous Region of China (XZ202201ZD0003G).
Abstract: Substantial advancements have been achieved in Tunnel Boring Machine (TBM) technology and monitoring systems, yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results. This study investigates the issue of missing data in extensive TBM datasets. Through a comprehensive literature review, we analyze the mechanisms behind missing TBM data and compare different imputation methods, including statistical analysis and machine learning algorithms. We also examine the impact of various missing patterns and missing rates on the efficacy of these methods. Finally, we propose a dynamic interpolation strategy tailored for TBM engineering sites. The results show that the K-Nearest Neighbors (KNN) and Random Forest (RF) algorithms achieve good interpolation results; that as the missing rate increases, the interpolation performance of all methods decreases; and that block missing is the hardest pattern to impute, followed by mixed missing, while sporadic missing is the easiest. On-site application results validate the proposed interpolation strategy's capability to achieve robust missing value interpolation, applicable in machine learning scenarios such as parameter optimization, attitude warning, and pressure prediction. These findings contribute to enhancing the efficiency of TBM missing data processing, offering more effective support for large-scale TBM monitoring datasets.
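As an illustration of the KNN result reported above, scikit-learn's KNNImputer fills sporadic gaps from the most similar rows. The synthetic channels below stand in for real TBM monitoring data; the setup is an assumption for demonstration only:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
# two correlated monitoring channels, e.g. thrust and torque
x0 = rng.normal(size=200)
X = np.column_stack([x0, 2.0 * x0 + 0.05 * rng.normal(size=200)])

X_missing = X.copy()
mask = rng.random(200) < 0.1        # ~10% sporadic gaps in channel 1
X_missing[mask, 1] = np.nan

X_filled = KNNImputer(n_neighbors=5).fit_transform(X_missing)
err = np.abs(X_filled[mask, 1] - X[mask, 1]).mean()
print(err < 0.5)  # small error relative to the channel's std of ~2
```

Sporadic gaps like these leave plenty of correlated neighbors to borrow from, which is exactly why the study finds block missing, where whole stretches of neighbors vanish, so much harder to impute.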
Funding: Supported by the Key National Natural Science Foundation of China (No. U1864211), the National Natural Science Foundation of China (No. 11772191), and the Natural Science Foundation of Shanghai (No. 21ZR1431500).
Abstract: Industrial data mining usually deals with data from different sources. These heterogeneous datasets describe the same object from different views. However, samples from some of the datasets may be lost, leaving the remaining samples without a correct one-to-one correspondence. Mismatched datasets caused by missing samples make the industrial data unavailable for further machine learning. To align the mismatched samples, this article presents a cooperative iteration matching method (CIMM) based on modified dynamic time warping (DTW). The proposed method regards the sequentially accumulated industrial data as time series, and mismatched samples are aligned by DTW. In addition, dynamic constraints are applied to the warping distance in the DTW process to make the alignment more efficient. A series of models is then trained iteratively with the accumulated samples. Several groups of numerical experiments on different missing patterns and missing locations are designed and analyzed to prove the effectiveness and applicability of the proposed method.
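The DTW core that CIMM builds on can be sketched in a few lines. This plain version omits the paper's dynamic constraints on the warping distance and is illustrative only:

```python
import numpy as np

def dtw_path(a, b):
    """Classic dynamic time warping: returns the alignment path between
    two 1-D sequences as (i, j) index pairs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# b is a copy of a with one sample dropped; DTW re-aligns the rest
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.0, 2.0, 4.0, 5.0]
print(dtw_path(a, b))  # [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)]
```

The dropped sample forces index 2 of `a` to share a partner with index 1 of `b`, while the later samples snap back into one-to-one correspondence, which is the alignment behavior CIMM exploits for mismatched industrial datasets.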
Funding: Supported in part by the National Natural Science Foundation of China (NSFC) under Grants 52275471 and 52120105008, the Beijing Outstanding Young Scientist Program, and the New Cornerstone Science Foundation through the XPLORER PRIZE.
Abstract: As pivotal supporting technologies for smart manufacturing and digital engineering, model-based and data-driven methods have been widely applied in many industrial fields, such as product design, process monitoring, and smart maintenance. While promising, both methods have issues that need to be addressed. For example, model-based methods are limited by low computational accuracy and a high computational burden, and data-driven methods often suffer from poor interpretability and redundant features. To address these issues, the concept of data-model fusion (DMF) emerges as a promising solution. DMF involves integrating model-based methods with data-driven methods, either by incorporating big data into model-based methods or by embedding relevant domain knowledge into data-driven methods. Despite growing efforts in the field of DMF, a unanimous definition of DMF remains elusive, and a general framework for DMF has rarely been discussed. This paper aims to address this gap by providing a thorough overview and categorization of both data-driven methods and model-based methods. Subsequently, this paper presents the definition and categorization of DMF and discusses a general DMF framework. Moreover, the primary seven applications of DMF are reviewed within the context of smart manufacturing and digital engineering. Finally, this paper outlines future directions for DMF.
Abstract: Data collection serves as the cornerstone in the study of clinical research questions. Two types of data are commonly utilized in medicine: (1) qualitative; and (2) quantitative. Several methods are commonly employed to gather data, regardless of whether retrospective or prospective studies are used: (1) interviews; (2) observational methods; (3) questionnaires; (4) investigation parameters; (5) medical records; and (6) electronic chart reviews. Each source type has its own advantages and disadvantages in terms of the accuracy and availability of the data to be extracted. We focus on two important parts of research methodology: (1) data collection; and (2) subgroup analyses. Errors in research can arise from various sources, including investigators, instruments, and subjects, making the validation and reliability of research tools crucial for ensuring the credibility of findings. Subgroup analyses can either be planned before treatment or emerge afterward (post hoc). The interpretation of subgroup effects should consider the interaction between the treatment effect and various patient variables with caution.
基金supported by the National Natural Science Foundation of China(62473341)Key Technologies R&D Program of Henan Province(242102211071,252102211086,252102210166).
Abstract: With the rapid development of the industrial Internet, the network security environment has become increasingly complex and variable. Intrusion detection, a core technology for ensuring the security of industrial control systems, faces the challenge of unbalanced data samples, particularly low detection rates for minority-class attack samples. Therefore, this paper proposes a data enhancement method for intrusion detection in the industrial Internet based on a Self-Attention Wasserstein Generative Adversarial Network (SA-WGAN) to address the low detection rates of minority-class attack samples in unbalanced intrusion detection scenarios. The proposed method integrates a self-attention mechanism with a Wasserstein Generative Adversarial Network (WGAN). The self-attention mechanism automatically learns important features from the input data and assigns different weights to emphasize the key features related to intrusion behaviors, providing strong guidance for subsequent data generation. The WGAN generates new data samples through adversarial training to expand the original dataset. In the SA-WGAN framework, the WGAN directs the data generation process based on the key features extracted by the self-attention mechanism, ensuring that the generated samples exhibit both diversity and similarity to real data. Experimental results demonstrate that the SA-WGAN-based data enhancement method significantly improves detection performance for minority-class attack samples, addresses issues of insufficient data and class imbalance, and enhances the generalization ability and overall performance of the intrusion detection model.
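The self-attention weighting step can be sketched with plain NumPy as scaled dot-product attention. The shapes and random projection matrices below are illustrative; in SA-WGAN the projections are learned by a trained neural module, not drawn at random:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: each sample re-weights all
    samples' value vectors by learned relevance, emphasizing features
    tied to intrusion behavior before the generator uses them.
    Shapes: X is (n, d); the projection matrices are (d, d_k)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 samples, 8 traffic features
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
print(out.shape, np.allclose(w.sum(axis=1), 1.0))  # (4, 8) True
```

The softmax rows are the "different weights" the abstract refers to: each generated sample is guided by a convex combination of the real samples' feature representations.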
Abstract: Missing data handling is vital for multi-sensor information fusion fault diagnosis of motors to prevent accuracy decay or even model failure, and some promising results have been obtained in several current studies. These studies, however, have the following limitations: (1) effective supervision is neglected for missing data across different fault types, and (2) imbalance in missing rates among fault types results in inadequate learning during model training. To overcome these limitations, this paper proposes a dynamic relative advantage-driven multi-fault synergistic diagnosis method to accomplish accurate fault diagnosis of motors under imbalanced missing data rates. First, a cross-fault-type generalized synergistic diagnostic strategy is established based on variational information bottleneck theory, which ensures sufficient supervision when handling missing data. Then, a dynamic relative advantage assessment technique is designed to reduce the diagnostic accuracy decay caused by imbalanced missing data rates. The proposed method is validated using multi-sensor data from motor fault simulation experiments, and the experimental results demonstrate its effectiveness and superiority in improving diagnostic accuracy and generalization under imbalanced missing data rates.
Funding: Funded by the National Natural Science Foundation of China (42361058) and supported by the Science and Technology Program of Gansu Province (22YF7FA074).
Abstract: Snow cover in mountainous areas is characterized by high reflectivity, strong spatial heterogeneity, rapid changes, and susceptibility to cloud interference. However, due to the limitations of a single sensor, it is challenging to obtain high-resolution satellite remote sensing data for monitoring the dynamic changes of snow cover within a day. This study focuses on two typical data fusion methods for polar-orbiting satellites (Sentinel-3 SLSTR) and geostationary satellites (Himawari-9 AHI), and explores the snow cover detection accuracy of a multitemporal cloud-gap snow cover identification model (loose data fusion) and the ESTARFM (spatiotemporal data fusion). Taking the Qilian Mountains as the research area, the accuracy of the two data fusion results was verified using snow cover extracted from Landsat-8 SR products. The results showed that both data fusion models could effectively capture the spatiotemporal variations of snow cover, but the ESTARFM demonstrated superior performance. It not only produced fusion images at any target time, but also extracted snow cover closer to the spatial distribution of real satellite images. Therefore, the ESTARFM was used to fuse images for hourly reconstruction of the snow cover on February 14-15, 2023. It was found that the maximum snow cover area of this snowfall reached 83.84% of the Qilian Mountains area, and the snow melted extremely rapidly, with a change of up to 4.30% of the study area per hour. This study offers reliable high spatiotemporal resolution satellite remote sensing data for monitoring snow cover changes in mountainous areas, contributing to more accurate and timely assessments.
Funding: The authors thank the Researchers Supporting Project, King Saud University, Saudi Arabia, for funding this research work through Project No. RSPD2025R951.
Abstract: This research introduces a unique approach to segmenting breast cancer images using a U-Net-based architecture. However, the computational demand for image processing is very high. Therefore, we conducted this research to build a system that enables image segmentation training on low-power machines. To accomplish this, all data are divided into several segments, each trained separately. For prediction, an initial output is produced by each trained model for a given input, and the ultimate output is selected by pixel-wise majority voting over the predicted outputs, which also ensures data privacy. In addition, this kind of distributed training system allows different computers to be used simultaneously, so the training process takes comparatively less time than typical training approaches. Even after training is complete, the proposed prediction system allows a newly trained model to be added, making the prediction consistently more accurate. We evaluated the effectiveness of the ultimate output using four performance metrics: average pixel accuracy, mean absolute error, average specificity, and average balanced accuracy, whose scores are 0.9216, 0.0687, 0.9477, and 0.8674, respectively. In addition, the proposed method was compared with four other state-of-the-art models in terms of total training time and computational resource usage, outperforming all of them in these respects.
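The pixel-wise majority voting step can be sketched directly. The tie-breaking rule toward foreground below is an illustrative choice, not necessarily the paper's:

```python
import numpy as np

def majority_vote(masks):
    """Pixel-wise majority vote over binary segmentation masks from the
    separately trained models; ties round toward foreground here."""
    stacked = np.stack(masks)  # shape (n_models, H, W)
    return (stacked.sum(axis=0) * 2 >= len(masks)).astype(np.uint8)

m1 = np.array([[1, 0], [1, 1]])
m2 = np.array([[1, 0], [0, 1]])
m3 = np.array([[0, 0], [1, 0]])
print(majority_vote([m1, m2, m3]))
```

Because only binary masks leave each model, no single participant ever sees another's training data, which is where the data-privacy claim comes from; adding a newly trained model is just one more mask in the stack.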
Funding: Supported by the Jiangsu Province IUR Cooperation Project (No. BY2021258) and the Wuxi Science and Technology Development Fund Project (No. G20212028).
Abstract: An improved cycle-consistent generative adversarial network (CycleGAN) method for defect data augmentation, based on feature fusion and a self-attention residual module, is proposed to address the insufficiency of defect sample data for light guide plates (LGP) in production, as well as the problem of minor defects. Two optimizations are made to the generator of CycleGAN: fusion of low-resolution features obtained from partial up-sampling and down-sampling with high-resolution features, and combination of a self-attention mechanism with the residual network structure to replace the original residual module. Qualitative and quantitative experiments were conducted to compare different data augmentation methods, and the results show that the LGP defect images generated by the improved network were more realistic, and the accuracy of the you-only-look-once version 5 (YOLOv5) detection network for the LGP was improved by 5.6%, proving the effectiveness and accuracy of the proposed method.