The Chang'e-3 (CE-3) mission is China's first lunar surface exploration mission to use a lander and a rover. The eight instruments that form the scientific payload have the following objectives: (1) investigate the morphological features and geological structures at the landing site; (2) perform integrated in-situ analysis of mineral and chemical compositions; (3) explore the structure of the lunar interior; and (4) explore the lunar-terrestrial space environment and the lunar surface environment, and acquire Moon-based ultraviolet astronomical observations. The Ground Research and Application System (GRAS) is in charge of data acquisition and pre-processing, management of the payload in orbit, and management of the data products and their applications. The Data Pre-processing Subsystem (DPS) is part of GRAS. Its task is to pre-process the raw data from the eight CE-3 instruments, which includes channel processing, unpacking, package sorting, calibration and correction, identification of geographical location, calculation of the probe azimuth angle, probe zenith angle, solar azimuth angle and solar zenith angle, and quality checks, among other steps. These processes produce Level 0, Level 1 and Level 2 data. The computing platform of this subsystem is a high-performance computing cluster comprising a real-time subsystem for processing Level 0 data and a post-time subsystem for generating Level 1 and Level 2 data. This paper describes the CE-3 data pre-processing method, the data pre-processing subsystem, data classification, data validity, and the data products used for scientific studies.
Observation data sets derived from integrated ocean observing network systems contain a considerable amount of dirty data. Thus, the data must be carefully and reasonably processed before they are used for forecasting or analysis. This paper proposes a data pre-processing model based on intelligent algorithms. First, we introduce the integrated network platform for ocean observation. Next, the data preprocessing model is presented, and an intelligent data cleaning model is proposed. Based on fuzzy clustering, the Kohonen clustering network is improved to perform fuzzy c-means clustering in parallel. The proposed dynamic algorithm can automatically find new cluster centers as the sample data are updated. The rapid and dynamic performance of the model makes it suitable for real-time calculation, and its efficiency and accuracy are demonstrated by tests on observation data.
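The abstract above does not give the update rules of the improved Kohonen network, so the following is only a minimal sketch of the standard fuzzy c-means iteration that it builds on, written for one-dimensional readings in plain Python. The function names, the fuzzifier `m = 2`, the fixed iteration count and the sample readings are all assumptions made for illustration, not details from the paper.

```python
def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update: u[i][j] is the degree to which
    point i belongs to cluster j (each row sums to 1)."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # Point coincides with a center: give it full membership there.
            hit = dists.index(0.0)
            row = [1.0 if j == hit else 0.0 for j in range(len(centers))]
        else:
            row = [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
        memberships.append(row)
    return memberships


def fcm_centers(points, memberships, m=2.0):
    """Standard FCM center update: membership-weighted mean of the points."""
    n_clusters = len(memberships[0])
    centers = []
    for j in range(n_clusters):
        weights = [row[j] ** m for row in memberships]
        centers.append(sum(w * x for w, x in zip(weights, points)) / sum(weights))
    return centers


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate the two updates for a fixed number of iterations."""
    for _ in range(iterations):
        memberships = fcm_memberships(points, centers, m)
        centers = fcm_centers(points, memberships, m)
    return centers, fcm_memberships(points, centers, m)


# Hypothetical sensor readings: two tight groups plus one value in between.
readings = [10.1, 9.9, 10.0, 20.2, 19.8, 20.0, 15.0]
final_centers, final_u = fuzzy_c_means(readings, centers=[9.0, 21.0])
```

In this toy run the reading 15.0 ends up with its membership split roughly evenly between the two clusters. A cleaning step could use such an ambiguous membership as one possible signal that a value deserves a closer look, though the abstract does not say whether the paper does so.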
POLAR is a compact space-borne detector initially designed to measure the polarization of hard X-rays emitted from Gamma-Ray Bursts in the energy range 50–500 keV. This instrument was launched successfully onboard the Chinese space laboratory Tiangong-2 (TG-2) on 2016 September 15. After it was switched on a few days later, tens of gigabytes of raw detection data were produced in orbit by POLAR and transferred to the ground every day. Before the launch date, a full pipeline and related software were designed and developed to quickly pre-process all the raw data from POLAR, which include both science data and engineering data, and then to generate high-level scientific data products suitable for later science analysis. This pipeline has been successfully used by the POLAR Science Data Center at the Institute of High Energy Physics (IHEP) since POLAR was launched and switched on. A detailed introduction to the pipeline and some of the core relevant algorithms are presented in this paper.
A distributed/parallel-processing system such as Sun Grid Engine (SGE), which utilizes multiple nodes/cores, is proposed for faster processing of large satellite image data. Verification shows that the distributed processing environment can improve pre-processing performance by up to 560.65% compared with a single processing system. This can improve analysis performance in various fields and, moreover, make near-real-time service achievable in the near future.
Microarray data are inherently noisy because noise from various sources contaminates the microarray slide during preparation, which greatly affects the accuracy of gene expression measurements. Eliminating the effect of this noise is a challenging problem in microarray analysis. Efficient denoising is often a necessary first step before the image data are analyzed, both to compensate for data corruption and to make effective use of the data. Hence, preprocessing of the microarray image is essential to eliminate background noise, enhance image quality and enable effective quantification. Existing denoising techniques based on the transform domain have been used for microarray noise reduction, each with its own limitations. The objective of this paper is to introduce novel preprocessing techniques, namely optimized spatial resolution (OSR) and spatial domain filtering (SDF), to reduce noise in microarray data and to reduce error during quantification, so that microarray spots can be estimated accurately to determine the expression levels of genes. In addition, a combination of optimized spatial resolution and spatial filtering is proposed and found to improve the denoising of microarray data with effective quantification of spots. The proposed method has been validated on microarray images of gene expression profiles of myeloid leukemia from the Stanford Microarray Database using various quality measures, such as signal-to-noise ratio, peak signal-to-noise ratio, image fidelity, structural content, absolute average difference and correlation quality. Quantitative analysis showed that the proposed technique is more efficient for denoising the microarray image, which makes it suitable for effective quantification.
The large-scale deployment of intelligent Internet of things (IoT) devices has brought increasing needs for computation support in wireless access networks. Applying machine learning (ML) algorithms at the network edge, i.e., edge learning, requires efficient training so that the models can adapt to the varying environment. However, transmitting the training data collected by devices requires huge wireless resources. To address this issue, we exploit the fact that data samples differ in their importance for training, and use an influence function to represent that importance. Based on the importance metric, we propose a data pre-processing scheme that combines data filtering, which reduces the size of the dataset, with data compression, which removes redundant information. As a result, both the number of data samples and the size of every data sample to be transmitted can be substantially reduced while the training accuracy is maintained. Furthermore, we propose device scheduling policies, including rate-based and Monte-Carlo-based policies, for multi-device multi-channel systems, which maximize the sum of the data importance of the scheduled devices. Experiments show that the proposed device scheduling policies bring more than 2% improvement in training accuracy.
Accurately assessing the relationship between tree growth and climatic factors is of great importance in dendrochronology. This study evaluated the consistency between alternative climate datasets (including station and gridded data) and actual climate data (fixed-point observations near the sampling sites) in northeastern China's warm temperate zone, and analyzed differences in their correlations with the tree-ring width index. The results were: (1) Gridded temperature data, as well as precipitation and relative humidity data from the Huailai meteorological station, were more consistent with the actual climate data; in contrast, gridded soil moisture content data showed significant discrepancies. (2) Horizontal distance had a greater impact on the representativeness of actual climate conditions than vertical elevation differences. (3) Differences in consistency between alternative and actual climate data also affected their correlations with tree-ring width indices. In some growing season months, correlation coefficients differed significantly, in both magnitude and sign, from those based on actual data. The selection of different alternative climate datasets can lead to biased results in assessing forest responses to climate change, which is detrimental to the management of forest ecosystems in harsh environments. Therefore, the scientific and rational selection of alternative climate data is essential for dendroecological and climatological research.
Photoacoustic-computed tomography is a novel imaging technique that combines high absorption contrast with deep tissue penetration, enabling comprehensive three-dimensional imaging of biological targets. However, the increasing demand for higher resolution and real-time imaging results in large data volumes, which limit the data storage, transmission and processing efficiency of the system. Therefore, there is an urgent need for an effective method to compress the raw data without compromising image quality. This paper presents a photoacoustic-computed tomography 3D data compression method and system based on Wavelet-Transformer. The method is based on a cooperative compression framework that integrates wavelet hard coding with deep learning-based soft decoding. It combines the multiscale analysis capability of wavelet transforms with the global feature modeling advantage of Transformers, achieving high-quality data compression and reconstruction. Experimental results using k-wave simulation suggest that the proposed compression system has advantages under extreme compression conditions, achieving a raw data compression ratio of up to 1:40. Furthermore, a three-dimensional data compression experiment on an in vivo mouse demonstrated that the maximum peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values of the reconstructed images reached 38.60 and 0.9583, effectively overcoming the detail loss and artifacts introduced by raw data compression. All the results suggest that the proposed system can significantly reduce storage requirements and hardware cost while enhancing computational efficiency and image quality. These advantages support the development of photoacoustic-computed tomography toward higher efficiency, real-time performance and intelligent functionality.
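For readers unfamiliar with the PSNR figure quoted above, the following is a minimal sketch of the standard definition, PSNR = 10·log10(MAX² / MSE), applied to flattened pixel lists. The function name and the default peak value of 255 are assumptions for illustration; the abstract does not state the dynamic range or the exact evaluation code used in the paper.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized,
    flattened images (lists of pixel values)."""
    if not reference or len(reference) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a reconstruction closer to the reference, which is why PSNR is commonly paired with SSIM when judging lossy compression.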
Missing data presents a crucial challenge in data analysis, especially in high-dimensional datasets, where it often leads to biased conclusions and degraded model performance. In this study, we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision. The proposed loss combines (i) a guided, masked mean squared error focusing on missing entries; (ii) a noise-aware regularization term to improve resilience against data corruption; and (iii) a variance penalty to encourage expressive yet stable reconstructions. We evaluate the proposed model across four missingness mechanisms, namely Missing Completely at Random, Missing at Random, Missing Not at Random, and Missing Not at Random with quantile censorship, under systematically varied feature counts, sample sizes, and missingness ratios ranging from 5% to 60%. Four publicly available real-world datasets (Stroke Prediction, Pima Indians Diabetes, Cardiovascular Disease, and Framingham Heart Study) were used, and the results show that the proposed model consistently outperforms baseline methods, including traditional and deep learning-based techniques. An ablation study reveals the additive value of each component in the loss function. Additionally, we assessed the downstream utility of imputed data through classification tasks, where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios. The model demonstrates strong scalability and robustness, improving performance with larger datasets and higher feature counts. These results underscore the capacity of the proposed method to produce imputations that are not only numerically accurate but also semantically useful, making it a promising solution for robust data recovery in clinical applications.
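The abstract does not spell out the loss terms, so the following is only a sketch of the general idea behind a masked mean squared error: the error is averaged over the entries flagged by a mask (for example, entries hidden during training whose true values are known) rather than over the whole table. The function name, the 0/1 mask convention and the handling of an empty mask are assumptions, and the paper's "guided" variant may differ.

```python
def masked_mse(predicted, target, mask):
    """Mean squared error over entries where mask is truthy.

    predicted, target and mask are equally shaped lists of rows.
    Returns 0.0 when no entry is masked (an assumed convention).
    """
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(predicted, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0
```

Restricting the average to masked entries keeps the many trivially reconstructed observed values from dominating the training signal.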
With the accelerating aging of China's population, the demand for community elderly care services has become diversified and personalized. However, problems such as insufficient total care service resources, uneven distribution, and prominent supply-demand contradictions have seriously affected service quality. Big data technology, with core advantages including data collection, analysis and mining, and accurate prediction, provides a new solution for the allocation of community elderly care service resources. This paper systematically studies the application value of big data technology in the allocation of community elderly care service resources from three aspects: resource allocation efficiency, service accuracy, and management intelligence. Combined with practical needs, it proposes optimal allocation strategies such as building a big data analysis platform and accurately grasping the care needs of the elderly, striving to provide operable path references for the construction of community elderly care service systems, promoting the early realization of the elderly care service goal of "adequate support and proper care for the elderly", and boosting the high-quality development of China's elderly care service industry.
Multivariate anomaly detection plays a critical role in maintaining the stable operation of information systems. However, in existing research, multivariate data are often influenced by various factors during the data collection process, resulting in temporal misalignment or displacement. Due to these factors, the node representations carry substantial noise, which reduces the adaptability of the multivariate coupled network structure and subsequently degrades anomaly detection performance. Accordingly, this study proposes a novel multivariate anomaly detection model grounded in graph structure learning. First, a recommendation strategy is employed to identify strongly coupled variable pairs, which are then used to construct a recommendation-driven multivariate coupling network. Second, a multi-channel graph encoding layer is used to dynamically optimize the structural properties of the multivariate coupling network, while a multi-head attention mechanism enhances the spatial characteristics of the multivariate data. Finally, unsupervised anomaly detection is conducted using a dynamic threshold selection algorithm. Experimental results demonstrate that effectively integrating the structural and spatial features of multivariate data significantly mitigates anomalies caused by temporal dependency misalignment.
As an important resource in data links, time slots should be strategically allocated to enhance transmission efficiency and resist eavesdropping, especially considering the tremendous increase in the number of nodes and the diversity of communication needs. It is crucial to design control sequences with robust randomness and conflict-freeness to properly address differentiated access control in data links. In this paper, we propose a hierarchical access control scheme based on control sequences to achieve high utilization of time slots and differentiated access control. A theoretical bound of the hierarchical control sequence set is derived to characterize the constraints on the parameters of the sequence set. Moreover, two classes of optimal hierarchical control sequence sets satisfying the theoretical bound are constructed, both of which enable the scheme to achieve maximum utilization of time slots. Compared with the fixed time slot allocation scheme, our scheme reduces the symbol error rate by up to 9%, which indicates a significant improvement in anti-interference and anti-eavesdropping capabilities.
Modern intrusion detection systems (MIDS) face persistent challenges in coping with the rapid evolution of cyber threats, high-volume network traffic, and imbalanced datasets. Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively. This study introduces an advanced, explainable machine learning framework for multi-class IDS using the KDD99 and IDS datasets, which reflect real-world network behavior through a blend of normal and diverse attack classes. The methodology begins with data preprocessing that incorporates both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions, ensuring standardized and model-ready inputs. Dimensionality reduction is achieved via the Harris Hawks Optimization (HHO) algorithm, a nature-inspired metaheuristic modeled on hawks' hunting strategies. HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance. Following feature selection, SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types. A stacked architecture is then employed, combining the strengths of XGBoost, SVM, and RF as base learners. This layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers. The model was evaluated using standard classification metrics: precision, recall, F1-score, and overall accuracy. The best overall performance was recorded with an accuracy of 99.44% for UNSW-NB15, demonstrating the model's effectiveness. After balancing, the model demonstrated a clear improvement in detecting the attacks. We tested the model on four datasets to show the effectiveness of the proposed approach and performed an ablation study to check the effect of each parameter. The proposed model is also computationally efficient. To support transparency and trust in decision-making, explainable AI (XAI) techniques are incorporated that provide both global and local insight into feature contributions and offer intuitive visualizations for individual predictions. This makes the framework suitable for practical deployment in cybersecurity environments that demand both precision and accountability.
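The abstract names scikit-learn's RobustScaler as one preprocessing step. By default that class centers each feature on its median and divides by its interquartile range, which keeps a few extreme values from dominating the scale. The following dependency-free sketch illustrates that idea for a single feature column; the helper names and the sample values are hypothetical, and it is not the paper's pipeline.

```python
import statistics


def fit_robust_scaler(column):
    """Return (median, IQR) for one feature column.

    Quartiles use linear interpolation between data points.
    A zero IQR is replaced by 1.0 so that scaling never divides by zero.
    """
    q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    return median, (iqr if iqr != 0 else 1.0)


def robust_scale(column, center, scale):
    """Apply previously fitted statistics to a column."""
    return [(x - center) / scale for x in column]


# Hypothetical feature with one extreme outlier (e.g. a byte count).
train_column = [1, 2, 3, 4, 100]
center, scale = fit_robust_scaler(train_column)
scaled = robust_scale(train_column, center, scale)
```

Here the median is 3 and the IQR is 2, so the four ordinary values map to the range -1 to 0.5 while the outlier maps to 48.5 without compressing the others. In a train/test setting the statistics would normally be fitted on the training split only and then reused on the test split.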
Reversible data hiding (RDH) enables secret data embedding while preserving complete cover image recovery, making it crucial for applications requiring image integrity. The pixel value ordering (PVO) technique used in multi-stego images provides good image quality but often results in low embedding capacity. To address these challenges, this paper proposes a high-capacity RDH scheme based on PVO that generates three stego images from a single cover image. The cover image is partitioned into non-overlapping blocks with pixels sorted in ascending order. Four secret bits are embedded into each block's maximum pixel value, while three additional bits are embedded into the second-largest value when the pixel difference exceeds a predefined threshold. A similar embedding strategy is also applied to the minimum side of the block, including the second-smallest pixel value. This design enables each block to embed up to 14 bits of secret data. Experimental results demonstrate that the proposed method achieves significantly higher embedding capacity and improved visual quality compared to existing triple-stego RDH approaches, advancing the field of reversible steganography.
Among the “three data rights,” the data utilization right has been persistently overlooked, much like a neglected “middle child” in the data rights family. However, it is precisely during the stages of processing and utilization that data undergoes its transformations and its economic value is ultimately created. A series of recent policy documents on treating data as a factor of production have emphasized that building a scientific data property rights system requires a fair and efficient mechanism for benefit distribution, one that gives reasonable preference to creators of data value and use value in terms of the income generated by data elements. Constrained by the inertial thinking of property right logic, the data utilization right is often regarded as a “transitional fulcrum” through which the holders of data resources authorize the operators of data products to realize data value. In the future structural design and implementation of the coordination mechanism for the property right system against the backdrop of the data factor-oriented reform, establishing data processing and utilization as an independent right will require two core initiatives: first, attaching importance to the independent protection of benefit distribution; second, implementing risk regulation for data security through optimization of governance. These two initiatives will be key to optimizing the data factor governance system and accelerating the release of data value.
The increasing complexity of China's electricity market creates substantial challenges for settlement automation, data consistency, and operational scalability. Existing provincial settlement systems are fragmented, lack a unified data structure, and depend heavily on manual intervention to process high-frequency and retroactive transactions. To address these limitations, a graph-based unified settlement framework is proposed to enhance automation, flexibility, and adaptability in electricity market settlements. A flexible attribute-graph model is employed to represent heterogeneous multi-market data, enabling standardized integration, rapid querying, and seamless adaptation to evolving business requirements. An extensible operator library is designed to support configurable settlement rules, and a suite of modular tools (including dataset generation, formula configuration, billing templates, and task scheduling) facilitates end-to-end automated settlement processing. A robust refund-clearing mechanism is further incorporated, utilizing sandbox execution, data-version snapshots, dynamic lineage tracing, and real-time change-capture technologies to enable rapid and accurate recalculations under dynamic policy and data revisions. Case studies based on real-world data from regional Chinese markets validate the effectiveness of the proposed approach, demonstrating marked improvements in computational efficiency, system robustness, and automation. Moreover, enhanced settlement accuracy and high temporal granularity improve price-signal fidelity, promote cost-reflective tariffs, and incentivize energy-efficient and demand-responsive behavior among market participants. The method not only supports equitable and transparent market operations but also provides a generalizable, scalable foundation for modern electricity settlement platforms in increasingly complex and dynamic market environments.
With the increasing emphasis on personal information protection, encryption through security protocols has emerged as a critical requirement in data transmission and reception processes. Nevertheless, IoT ecosystems comprise heterogeneous networks where outdated systems coexist with the latest devices, ranging from non-encrypted devices to fully encrypted ones. Given the limited visibility into payloads in this context, this study investigates AI-based attack detection methods that leverage encrypted traffic metadata, eliminating the need for decryption and minimizing system performance degradation, especially in light of these heterogeneous devices. Using the UNSW-NB15 and CICIoT-2023 datasets, encrypted and unencrypted traffic were categorized according to security protocol, and AI-based intrusion detection experiments were conducted for each traffic type based on metadata. To mitigate the problem of class imbalance, eight different data sampling techniques were applied. The effectiveness of these sampling techniques was then comparatively analyzed from various perspectives using two ensemble models and three Deep Learning (DL) models. The experimental results confirmed that metadata-based attack detection is feasible using only encrypted traffic. In the UNSW-NB15 dataset, the F1-score of encrypted traffic was approximately 0.98, which is 4.3% higher than that of unencrypted traffic (approximately 0.94). In addition, analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower F1-score of roughly 0.43, indicating that the quality of the dataset and the preprocessing approach have a substantial impact on detection performance. Furthermore, when data sampling techniques were applied to encrypted traffic, the recall in the UNSW-NB15 (Encrypted) dataset improved by up to 23.0%, and in the CICIoT-2023 (Encrypted) dataset by 20.26%, showing a similar level of improvement. Notably, in CICIoT-2023, the F1-score and Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) increased by 59.0% and 55.94%, respectively. These results suggest that data sampling can have a positive effect even in encrypted environments. However, the extent of the improvement may vary depending on data quality, model architecture, and sampling strategy.
While the Ordos Basin is recognized for its substantial hydrocarbon exploration prospects, its rugged loess tableland terrain has rendered seismic exploration exceptionally challenging [1-3]. Persistent obstacles such as complex 3D survey planning, low signal-to-noise ratio raw data, inadequate near-surface velocity modeling, and imaging inaccuracy have long hindered the advancement of seismic exploration across this region. Through a problem-solving approach rooted in geological target analysis, this research systematically investigates the behavioral patterns of nodal seismometer-based high-density seismic acquisition on the loess plateau. Tailored advancements in waveform enhancement and depth velocity modelling methodologies have been engineered. Field validations confirm that the optimized workflow demonstrates marked improvements in amplitude preservation and imaging resolution, offering novel insights for future reservoir characterization endeavors.
Automated essay scoring (AES) systems have gained significant importance in educational settings, offering a scalable, efficient, and objective method for evaluating student essays. However, developing AES systems for Arabic poses distinct challenges due to the language's complex morphology, diglossia, and the scarcity of annotated datasets. This paper presents a hybrid approach to Arabic AES that combines text-based, vector-based, and embedding-based similarity measures to improve essay scoring accuracy while minimizing the training data required. Using a large Arabic essay dataset categorized into thematic groups, the study conducted four experiments to evaluate the impact of feature selection, data size, and model performance. Experiment 1 established a baseline using a non-machine learning approach, selecting top-N correlated features to predict essay scores. The subsequent experiments employed 5-fold cross-validation. Experiment 2 showed that combining embedding-based, text-based, and vector-based features in a Random Forest (RF) model achieved an R² of 88.92% and an accuracy of 83.3% within a 0.5-point tolerance. Experiment 3 further refined the feature selection process, demonstrating that 19 correlated features yielded optimal results, improving R² to 88.95%. In Experiment 4, a data-efficient training approach was introduced in which the training data portion was increased from 5% to 50%. The study found that using just 10% of the data achieved near-peak performance, with an R² of 85.49%, emphasizing an effective trade-off between performance and computational cost. These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems, especially in low-resource environments, addressing linguistic challenges while ensuring efficient data usage.
Objective expertise evaluation of individuals,as a prerequisite stage for team formation,has been a long-term desideratum in large software development companies.With the rapid advancements in machine learning methods...Objective expertise evaluation of individuals,as a prerequisite stage for team formation,has been a long-term desideratum in large software development companies.With the rapid advancements in machine learning methods,based on reliable existing data stored in project management tools’datasets,automating this evaluation process becomes a natural step forward.In this context,our approach focuses on quantifying software developer expertise by using metadata from the task-tracking systems.For this,we mathematically formalize two categories of expertise:technology-specific expertise,which denotes the skills required for a particular technology,and general expertise,which encapsulates overall knowledge in the software industry.Afterward,we automatically classify the zones of expertise associated with each task a developer has worked on using Bidirectional Encoder Representations from Transformers(BERT)-like transformers to handle the unique characteristics of project tool datasets effectively.Finally,our method evaluates the proficiency of each software specialist across already completed projects from both technology-specific and general perspectives.The method was experimentally validated,yielding promising results.展开更多
Abstract: The Chang'e-3 (CE-3) mission is China's first exploration mission on the surface of the Moon that uses a lander and a rover. The eight instruments that form the scientific payload have the following objectives: (1) investigate the morphological features and geological structures at the landing site; (2) perform integrated in-situ analysis of minerals and chemical compositions; (3) perform integrated exploration of the structure of the lunar interior; (4) explore the lunar-terrestrial space environment and the lunar surface environment, and acquire Moon-based ultraviolet astronomical observations. The Ground Research and Application System (GRAS) is in charge of data acquisition and pre-processing, management of the payload in orbit, and management of the data products and their applications. The Data Pre-processing Subsystem (DPS) is a part of GRAS. The task of DPS is to pre-process raw data from the eight CE-3 instruments, including channel processing, unpacking, package sorting, calibration and correction, identification of geographical location, calculation of the probe azimuth angle, probe zenith angle, solar azimuth angle and solar zenith angle, and quality checks. These processes produce Level 0, Level 1 and Level 2 data. The computing platform of this subsystem is a high-performance computing cluster that includes a real-time subsystem for processing Level 0 data and a post-time subsystem for generating Level 1 and Level 2 data. This paper describes the CE-3 data pre-processing method, the data pre-processing subsystem, data classification, data validity and the data products that are used for scientific studies.
Funding: Key Science and Technology Project of the Shanghai Committee of Science and Technology, China (No. 06dz1200921); Major Basic Research Project of the Shanghai Committee of Science and Technology (No. 08JC1400100); Shanghai Talent Developing Foundation, China (No. 001); Specialized Foundation for Excellent Talent of Shanghai, China.
Abstract: Observation data sets derived from an integrated ocean observing network system contain a considerable amount of dirty data. The data must therefore be carefully and reasonably processed before they are used for forecasting or analysis. This paper proposes a data pre-processing model based on intelligent algorithms. First, we introduce the integrated network platform for ocean observation. Next, the data pre-processing model is presented, and an intelligent data cleaning model is proposed. Based on fuzzy clustering, the Kohonen clustering network is improved to perform the parallel calculation of fuzzy c-means clustering. The proposed dynamic algorithm can automatically find the new clustering center as the sample data are updated. The rapid and dynamic performance of the model makes it suitable for real-time calculation, and the efficiency and accuracy of the model are demonstrated by test results obtained from observation data analysis.
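The fuzzy c-means step named in the abstract above can be illustrated with a minimal sketch. This is the textbook fuzzy c-means iteration only, not the paper's improved Kohonen network or its parallel and dynamic variant; the function name `fuzzy_c_means`, its parameters and the toy data in the usage note are assumptions made for illustration.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means (illustrative sketch, not the paper's algorithm).

    points: list of equal-length lists of floats
    c:      number of clusters
    m:      fuzzifier (> 1); m = 2 is a common default
    Returns (centers, memberships), where memberships[i][j] is the degree
    to which point i belongs to cluster j.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Start from a random membership matrix whose rows sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / w_sum
                 for d in range(dim)]
            )

        # Membership update from the relative distances to every center.
        shift = 0.0
        for i in range(n):
            dists = [
                max(sum((points[i][d] - centers[j][d]) ** 2
                        for d in range(dim)) ** 0.5, 1e-12)
                for j in range(c)
            ]
            new_row = [
                1.0 / sum((dists[j] / dists[k]) ** (2.0 / (m - 1.0))
                          for k in range(c))
                for j in range(c)
            ]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row

        if shift < tol:  # memberships have stopped changing
            break
    return centers, u
```

For example, `fuzzy_c_means([[0.0], [0.2], [0.1], [10.0], [10.2], [9.9]], c=2)` should place one center near 0.1 and the other near 10. A point with no dominant membership in any cluster is a natural candidate for inspection during data cleaning, although how the paper flags dirty data is not stated in the abstract.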
Funding: Financial support from the Joint Research Fund in Astronomy under a cooperative agreement between the National Natural Science Foundation of China and the Chinese Academy of Sciences (Grant No. U1631242); the National Natural Science Foundation of China (Grant Nos. 11503028 and 11403028); the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB23040400); the National Basic Research Program (973 Program) of China (2014CB845800).
Abstract: POLAR is a compact space-borne detector initially designed to measure the polarization of hard X-rays emitted from Gamma-Ray Bursts in the energy range 50–500 keV. This instrument was launched successfully onboard the Chinese space laboratory Tiangong-2 (TG-2) on 2016 September 15. After being switched on a few days later, POLAR produced tens of gigabytes of raw detection data in orbit every day, which were transferred to the ground. Before the launch date, a full pipeline and related software were designed and developed to quickly pre-process all the raw data from POLAR, which include both science data and engineering data, and then to generate high-level scientific data products suitable for later science analysis. This pipeline has been successfully used by the POLAR Science Data Center at the Institute of High Energy Physics (IHEP) since POLAR was launched and switched on. A detailed introduction to the pipeline and some of the core relevant algorithms are presented in this paper.
Funding: Supported by the Sharing and Diffusion of National R&D Outcome funded by the Korea Institute of Science and Technology Information.
Abstract: A distributed/parallel processing system such as Sun Grid Engine (SGE), which utilizes multiple nodes and cores, is proposed for faster processing of large satellite image data. After verification, pre-processing performance in the distributed processing environment improved by up to 560.65% compared with a single processing system. This can improve analysis performance in various fields and, moreover, can make near-real-time service achievable in the near future.
Abstract: Microarray data are inherently noisy because noise from various sources contaminates the slide during its preparation, and this noise greatly affects the accuracy of the measured gene expression. Eliminating the effect of the noise is a challenging problem in microarray analysis. Efficient denoising is often a necessary first step before the image data are analyzed, both to compensate for data corruption and to allow effective use of the data. Preprocessing of microarray images is therefore essential to eliminate the background noise, enhance image quality and enable effective quantification. Existing denoising techniques based on transformed domains have been used for microarray noise reduction, each with its own limitations. The objective of this paper is to introduce novel preprocessing techniques, namely optimized spatial resolution (OSR) and spatial domain filtering (SDF), that reduce noise in microarray data and reduce error during quantification, so that microarray spots can be estimated accurately to determine the expression levels of genes. In addition, a combination of optimized spatial resolution and spatial filtering is proposed and is found to improve the denoising of microarray data while quantifying spots effectively. The proposed method has been validated on microarray images of gene expression profiles of Myeloid Leukemia from the Stanford Microarray Database using various quality measures: signal-to-noise ratio, peak signal-to-noise ratio, image fidelity, structural content, absolute average difference and correlation quality. Quantitative analysis showed that the proposed technique denoises microarray images more efficiently, which makes them suitable for effective quantification.
Funding: This work is sponsored in part by the National Natural Science Foundation of China under grants 62022049, 62111530197, and 61871254, and by Hitachi Ltd. Part of this work has been presented at IEEE ICC 2020 [1].
Abstract: The large-scale deployment of intelligent Internet of things (IoT) devices has brought increasing needs for computation support in wireless access networks. Applying machine learning (ML) algorithms at the network edge, i.e., edge learning, requires efficient training so that the models can adapt to the varying environment. However, transmitting the training data collected by devices requires huge wireless resources. To address this issue, we exploit the fact that data samples have different importance for training, and use an influence function to represent the importance. Based on this importance metric, we propose a data pre-processing scheme that combines data filtering, which reduces the size of the dataset, with data compression, which removes redundant information. As a result, both the number of data samples and the size of every data sample to be transmitted can be substantially reduced while maintaining the training accuracy. Furthermore, we propose device scheduling policies, including rate-based and Monte-Carlo-based policies, for multi-device multi-channel systems, maximizing the sum of the data importance of the scheduled devices. Experiments show that the proposed device scheduling policies bring more than a 2% improvement in training accuracy.
Funding: Supported by the International Partnership Program of the Chinese Academy of Sciences (170GJHZ2023074GC); National Natural Science Foundation of China (42425706 and 42488201); National Key Research and Development Program of China (2024YFF0807902); Beijing Natural Science Foundation (8242041); and China Postdoctoral Science Foundation (2025M770353).
Abstract: Accurately assessing the relationship between tree growth and climatic factors is of great importance in dendrochronology. This study evaluated the consistency between alternative climate datasets (including station and gridded data) and actual climate data (fixed-point observations near the sampling sites) in northeastern China's warm temperate zone, and analyzed differences in their correlations with the tree-ring width index. The results were: (1) Gridded temperature data, as well as precipitation and relative humidity data from the Huailai meteorological station, were more consistent with the actual climate data; in contrast, gridded soil moisture content data showed significant discrepancies. (2) Horizontal distance had a greater impact on the representativeness of actual climate conditions than vertical elevation differences. (3) Differences in consistency between alternative and actual climate data also affected their correlations with tree-ring width indices. In some growing season months, correlation coefficients differed significantly, in both magnitude and sign, from those based on actual data. The selection of different alternative climate datasets can lead to biased results in assessing forest responses to climate change, which is detrimental to the management of forest ecosystems in harsh environments. Therefore, the scientific and rational selection of alternative climate data is essential for dendroecological and climatological research.
Funding: Supported by the National Key R&D Program of China [Grant No. 2023YFF0713600]; the National Natural Science Foundation of China [Grant No. 62275062]; Project of Shandong Innovation and Startup Community of High-end Medical Apparatus and Instruments [Grant Nos. 2023-SGTTXM-002 and 2024-SGTTXM-005]; the Shandong Province Technology Innovation Guidance Plan (Central Leading Local Science and Technology Development Fund) [Grant No. YDZX2023115]; the Taishan Scholar Special Funding Project of Shandong Province; the Shandong Laboratory of Advanced Biomaterials and Medical Devices in Weihai [Grant No. ZL202402].
Abstract: Photoacoustic-computed tomography is a novel imaging technique that combines high absorption contrast and deep tissue penetration capability, enabling comprehensive three-dimensional imaging of biological targets. However, the increasing demand for higher resolution and real-time imaging results in significant data volume, limiting the data storage, transmission and processing efficiency of the system. Therefore, there is an urgent need for an effective method to compress the raw data without compromising image quality. This paper presents a photoacoustic-computed tomography 3D data compression method and system based on Wavelet-Transformer. The method is based on a cooperative compression framework that integrates wavelet hard coding with deep learning-based soft decoding. It combines the multiscale analysis capability of wavelet transforms with the global feature modeling advantage of Transformers, achieving high-quality data compression and reconstruction. Experimental results using k-Wave simulation suggest that the proposed compression system has advantages under extreme compression conditions, achieving a raw data compression ratio of up to 1:40. Furthermore, a three-dimensional data compression experiment on an in vivo mouse demonstrated that the maximum peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values of the reconstructed images reached 38.60 and 0.9583, effectively overcoming the detail loss and artifacts introduced by raw data compression. All the results suggest that the proposed system can significantly reduce storage requirements and hardware cost, enhancing computational efficiency and image quality. These advantages support the development of photoacoustic-computed tomography toward higher efficiency, real-time performance and intelligent functionality.
Abstract: Missing data presents a crucial challenge in data analysis, especially in high-dimensional datasets, where it often leads to biased conclusions and degraded model performance. In this study, we present a novel autoencoder-based imputation framework that integrates a composite loss function to enhance robustness and precision. The proposed loss combines (i) a guided, masked mean squared error focusing on missing entries; (ii) a noise-aware regularization term to improve resilience against data corruption; and (iii) a variance penalty to encourage expressive yet stable reconstructions. We evaluate the proposed model across four missingness mechanisms, namely Missing Completely at Random, Missing at Random, Missing Not at Random, and Missing Not at Random with quantile censorship, under systematically varied feature counts, sample sizes, and missingness ratios ranging from 5% to 60%. Four publicly available real-world datasets (Stroke Prediction, Pima Indians Diabetes, Cardiovascular Disease, and Framingham Heart Study) were used, and the results show that the proposed model consistently outperforms baseline methods, including traditional and deep learning-based techniques. An ablation study reveals the additive value of each component in the loss function. Additionally, we assessed the downstream utility of imputed data through classification tasks, where datasets imputed by the proposed method yielded the highest receiver operating characteristic area under the curve scores across all scenarios. The model demonstrates strong scalability and robustness, improving performance with larger datasets and higher feature counts. These results underscore the capacity of the proposed method to produce not only numerically accurate but also semantically useful imputations, making it a promising solution for robust data recovery in clinical applications.
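To make the first loss component in the abstract above concrete, the following is a minimal sketch of a masked mean squared error that scores a reconstruction only at entries marked as missing. It assumes the true values of those entries are known, as they are when values are hidden artificially for training or evaluation. The function name `masked_mse` and the mask convention (1 = missing) are illustrative assumptions; the abstract does not give the exact form of the paper's guided term, and the noise-aware and variance terms are not reproduced here.

```python
def masked_mse(original, reconstructed, missing_mask):
    """Mean squared error computed only over entries flagged as missing.

    original, reconstructed: 2-D lists of floats with the same shape
    missing_mask:            2-D list, truthy where the entry was hidden
    (Illustrative sketch; the mask convention is an assumption.)
    """
    total, count = 0.0, 0
    for row_x, row_r, row_m in zip(original, reconstructed, missing_mask):
        for x, r, is_missing in zip(row_x, row_r, row_m):
            if is_missing:
                total += (x - r) ** 2
                count += 1
    # With no masked entries there is nothing to score.
    return total / count if count else 0.0
```

For instance, with `original = [[1, 2], [3, 4]]`, `reconstructed = [[1, 0], [0, 4]]` and `missing_mask = [[0, 1], [1, 0]]`, only the two masked entries contribute, giving (4 + 9) / 2 = 6.5.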
Abstract: With the accelerating aging of China's population, the demand for community elderly care services has become diversified and personalized. However, problems such as insufficient total care service resources, uneven distribution, and prominent supply-demand contradictions have seriously affected service quality. Big data technology, with core advantages including data collection, analysis and mining, and accurate prediction, provides a new solution for the allocation of community elderly care service resources. This paper systematically studies the application value of big data technology in the allocation of community elderly care service resources from three aspects: resource allocation efficiency, service accuracy, and management intelligence. Based on practical needs, it proposes optimal allocation strategies such as building a big data analysis platform and accurately grasping the care needs of the elderly, with the aim of providing actionable references for the construction of community elderly care service systems, promoting the early realization of the elderly care service goal of “adequate support and proper care for the elderly”, and boosting the high-quality development of China's elderly care service industry.
Funding: Supported by the Natural Science Foundation of Qinghai Province (2025-ZJ-994M); the Scientific Research Innovation Capability Support Project for Young Faculty (SRICSPYF-BS2025007); the National Natural Science Foundation of China (62566050).
Abstract: Multivariate anomaly detection plays a critical role in maintaining the stable operation of information systems. However, in existing research, multivariate data are often influenced by various factors during the data collection process, resulting in temporal misalignment or displacement. Due to these factors, the node representations carry substantial noise, which reduces the adaptability of the multivariate coupled network structure and subsequently degrades anomaly detection performance. Accordingly, this study proposes a novel multivariate anomaly detection model grounded in graph structure learning. First, a recommendation strategy is employed to identify strongly coupled variable pairs, which are then used to construct a recommendation-driven multivariate coupling network. Second, a multi-channel graph encoding layer is used to dynamically optimize the structural properties of the multivariate coupling network, while a multi-head attention mechanism enhances the spatial characteristics of the multivariate data. Finally, unsupervised anomaly detection is conducted using a dynamic threshold selection algorithm. Experimental results demonstrate that effectively integrating the structural and spatial features of multivariate data significantly mitigates anomalies caused by temporal dependency misalignment.
Funding: Supported by the National Science Foundation of China (No. 62171387); the Science and Technology Program of Sichuan Province (No. 2024NSFSC0468); the China Postdoctoral Science Foundation (No. 2019M663475).
Abstract: As an important resource in a data link, time slots should be strategically allocated to enhance transmission efficiency and resist eavesdropping, especially considering the tremendous increase in the number of nodes and the diversity of communication needs. It is crucial to design control sequences with robust randomness and conflict-freeness to properly address differentiated access control in a data link. In this paper, we propose a hierarchical access control scheme based on control sequences to achieve high utilization of time slots and differentiated access control. A theoretical bound on the hierarchical control sequence set is derived to characterize the constraints on the parameters of the sequence set. Moreover, two classes of optimal hierarchical control sequence sets satisfying the theoretical bound are constructed, both of which enable the scheme to achieve maximum utilization of time slots. Compared with the fixed time slot allocation scheme, our scheme reduces the symbol error rate by up to 9%, which indicates a significant improvement in anti-interference and anti-eavesdropping capabilities.
Funding: Funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R104), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Modern intrusion detection systems (MIDS) face persistent challenges in coping with the rapid evolution of cyber threats, high-volume network traffic, and imbalanced datasets. Traditional models often lack the robustness and explainability required to detect novel and sophisticated attacks effectively. This study introduces an advanced, explainable machine learning framework for multi-class IDS using the KDD99 and IDS datasets, which reflect real-world network behavior through a blend of normal and diverse attack classes. The methodology begins with sophisticated data preprocessing, incorporating both RobustScaler and QuantileTransformer to address outliers and skewed feature distributions, ensuring standardized and model-ready inputs. Critical dimensionality reduction is achieved via the Harris Hawks Optimization (HHO) algorithm, a nature-inspired metaheuristic modeled on hawks' hunting strategies. HHO efficiently identifies the most informative features by optimizing a fitness function based on classification performance. Following feature selection, SMOTE is applied to the training data to resolve class imbalance by synthetically augmenting underrepresented attack types. A stacked architecture is then employed, combining the strengths of XGBoost, SVM, and RF as base learners. This layered approach improves prediction robustness and generalization by balancing bias and variance across diverse classifiers. The model was evaluated using standard classification metrics: precision, recall, F1-score, and overall accuracy. The best overall performance was recorded with an accuracy of 99.44% for UNSW-NB15, demonstrating the model's effectiveness. After balancing, the model demonstrated a clear improvement in detecting the attacks. We tested the model on four datasets to show the effectiveness of the proposed approach and performed an ablation study to check the effect of each parameter. The proposed model is also computationally efficient. To support transparency and trust in decision-making, explainable AI (XAI) techniques are incorporated that provide both global and local insight into feature contributions and offer intuitive visualizations for individual predictions. This makes the framework suitable for practical deployment in cybersecurity environments that demand both precision and accountability.
Funding: Funded by the University of Transport and Communications (UTC) under grant number T2025-CN-004.
Abstract: Reversible data hiding (RDH) enables secret data embedding while preserving complete cover image recovery, making it crucial for applications requiring image integrity. The pixel value ordering (PVO) technique used in multi-stego images provides good image quality but often results in low embedding capability. To address these challenges, this paper proposes a high-capacity RDH scheme based on PVO that generates three stego images from a single cover image. The cover image is partitioned into non-overlapping blocks with pixels sorted in ascending order. Four secret bits are embedded into each block's maximum pixel value, while three additional bits are embedded into the second-largest value when the pixel difference exceeds a predefined threshold. A similar embedding strategy is also applied to the minimum side of the block, including the second-smallest pixel value. This design enables each block to embed up to 14 bits of secret data. Experimental results demonstrate that the proposed method achieves significantly higher embedding capacity and improved visual quality compared to existing triple-stego RDH approaches, advancing the field of reversible steganography.
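As background for the abstract above, the following sketch shows the classical single-image PVO idea on the maximum side of one block: at most one bit is hidden in the difference between the largest and second-largest pixels, and the original block is recovered exactly on extraction. It is not the paper's scheme, which spreads four bits over three stego images and also uses the second-largest and minimum-side pixels. The function names are illustrative, and pixel overflow at 255 is ignored here, although a real scheme must handle it.

```python
def pvo_embed_max(block, bit):
    """Classical PVO embedding on the maximum pixel of one block (sketch).

    Returns (stego_block, embedded), where `embedded` tells the caller
    whether `bit` was actually consumed by this block.
    """
    # Sort indices by (value, index) so ties are ordered deterministically.
    order = sorted(range(len(block)), key=lambda i: (block[i], i))
    i_max, i_second = order[-1], order[-2]
    d = block[i_max] - block[i_second]
    out = list(block)
    if d == 1:            # embeddable: the difference becomes 1 + bit
        out[i_max] += bit
        return out, True
    if d > 1:             # not embeddable: shift so extraction is unambiguous
        out[i_max] += 1
    return out, False     # d == 0 leaves the block unchanged


def pvo_extract_max(stego):
    """Inverse of pvo_embed_max: returns (restored_block, bit or None)."""
    order = sorted(range(len(stego)), key=lambda i: (stego[i], i))
    i_max, i_second = order[-1], order[-2]
    d = stego[i_max] - stego[i_second]
    out = list(stego)
    if d == 1:            # bit 0 was embedded, pixel is already original
        return out, 0
    if d == 2:            # bit 1 was embedded, undo the +1
        out[i_max] -= 1
        return out, 1
    if d > 2:             # block was only shifted, undo the shift
        out[i_max] -= 1
    return out, None
```

Because only the maximum pixel is ever increased, the sorted order of the block is preserved, which is what lets the extractor locate the same two pixels and reverse the change. For the block `[10, 12, 11, 13]` and bit 1, embedding yields `[10, 12, 11, 14]`, and extraction returns the original block together with the bit.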
Abstract: Among the “three data rights,” the data utilization right has been persistently overlooked, much like a neglected “middle child” in the data rights family. However, it is precisely during the stages of processing and utilization that data undergoes its transformations and that its economic value is ultimately created. A series of recent policy documents on treating data as a factor of production have emphasized that building a scientific data property rights system requires a fair and efficient mechanism for benefit distribution, one that gives reasonable preference to the creators of data value and use value when distributing the income generated by data elements. Constrained by the inertial thinking of property right logic, the data utilization right is often regarded as a “transitional fulcrum” through which the holders of data resources must authorize the operators of data products in order to realize data value. In the future structural design and implementation of the coordination mechanism for the property right system against the backdrop of the data factor-oriented reform, establishing data processing and utilization as an independent right will require two core initiatives: first, attaching importance to the independent protection of benefit distribution; second, implementing risk regulation for data security through the optimization of governance. These two initiatives will be key to optimizing the data factor governance system and accelerating the release of data value.
Funding: Funded by the Science and Technology Project of State Grid Corporation of China (5108-202355437A-3-2-ZN).
Abstract: The increasing complexity of China's electricity market creates substantial challenges for settlement automation, data consistency, and operational scalability. Existing provincial settlement systems are fragmented, lack a unified data structure, and depend heavily on manual intervention to process high-frequency and retroactive transactions. To address these limitations, a graph-based unified settlement framework is proposed to enhance automation, flexibility, and adaptability in electricity market settlements. A flexible attribute-graph model is employed to represent heterogeneous multi-market data, enabling standardized integration, rapid querying, and seamless adaptation to evolving business requirements. An extensible operator library is designed to support configurable settlement rules, and a suite of modular tools (including dataset generation, formula configuration, billing templates, and task scheduling) facilitates end-to-end automated settlement processing. A robust refund-clearing mechanism is further incorporated, utilizing sandbox execution, data-version snapshots, dynamic lineage tracing, and real-time change-capture technologies to enable rapid and accurate recalculations under dynamic policy and data revisions. Case studies based on real-world data from regional Chinese markets validate the effectiveness of the proposed approach, demonstrating marked improvements in computational efficiency, system robustness, and automation. Moreover, enhanced settlement accuracy and high temporal granularity improve price-signal fidelity, promote cost-reflective tariffs, and incentivize energy-efficient and demand-responsive behavior among market participants. The method not only supports equitable and transparent market operations but also provides a generalizable, scalable foundation for modern electricity settlement platforms in increasingly complex and dynamic market environments.
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2023-00235509, Development of security monitoring technology based network behavior against encrypted cyber threats in ICT convergence environment).
Abstract: With the increasing emphasis on personal information protection, encryption through security protocols has emerged as a critical requirement in data transmission and reception processes. Nevertheless, IoT ecosystems comprise heterogeneous networks where outdated systems coexist with the latest devices, spanning a range from non-encrypted devices to fully encrypted ones. Given the limited visibility into payloads in this context, this study investigates AI-based attack detection methods that leverage encrypted traffic metadata, eliminating the need for decryption and minimizing system performance degradation, especially in light of these heterogeneous devices. Using the UNSW-NB15 and CICIoT-2023 datasets, encrypted and unencrypted traffic were categorized according to security protocol, and AI-based intrusion detection experiments were conducted for each traffic type based on metadata. To mitigate the problem of class imbalance, eight different data sampling techniques were applied. The effectiveness of these sampling techniques was then comparatively analyzed using two ensemble models and three Deep Learning (DL) models from various perspectives. The experimental results confirmed that metadata-based attack detection is feasible using only encrypted traffic. In the UNSW-NB15 dataset, the F1-score of encrypted traffic was approximately 0.98, which is 4.3% higher than that of unencrypted traffic (approximately 0.94). In addition, analysis of the encrypted traffic in the CICIoT-2023 dataset using the same method showed a significantly lower F1-score of roughly 0.43, indicating that the quality of the dataset and the preprocessing approach have a substantial impact on detection performance. Furthermore, when data sampling techniques were applied to encrypted traffic, the recall in the UNSW-NB15 (Encrypted) dataset improved by up to 23.0%, and in the CICIoT-2023 (Encrypted) dataset by 20.26%, showing a similar level of improvement. Notably, in CICIoT-2023, the F1-score and the Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) increased by 59.0% and 55.94%, respectively. These results suggest that data sampling can have a positive effect even in encrypted environments. However, the extent of the improvement may vary depending on data quality, model architecture, and sampling strategy.
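The abstract above does not name its eight sampling techniques. As one hedged illustration of what a data sampling step does, the sketch below implements plain random oversampling, which duplicates minority-class samples until every class matches the size of the largest one. The function name `random_oversample` and the toy labels in the usage note are assumptions for illustration, and nothing here is claimed to be one of the techniques the study evaluated.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Balance classes by resampling minority classes with replacement.

    Illustrative sketch of one simple sampling technique. It should be
    applied to the training split only, so that duplicated samples never
    leak into validation or test data.
    """
    rng = random.Random(seed)

    # Group samples by class label, preserving first-seen class order.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the shortfall at random from the existing class members.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

With five `"normal"` flows and one `"attack"` flow as input, the output contains five of each, with the single attack record repeated. Synthetic approaches such as SMOTE interpolate new minority samples instead of repeating existing ones.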
Abstract: While the Ordos Basin is recognized for its substantial hydrocarbon exploration prospects, its rugged loess tableland terrain has rendered seismic exploration exceptionally challenging [1-3]. Persistent obstacles such as complex 3D survey planning, low signal-to-noise ratio raw data, inadequate near-surface velocity modeling, and imaging inaccuracy have long hindered the advancement of seismic exploration across this region. Through a problem-solving approach rooted in geological target analysis, this research systematically investigates the behavioral patterns of nodal seismometer-based high-density seismic acquisition on the loess plateau. Tailored advancements in waveform enhancement and depth velocity modelling methodologies have been engineered. Field validations confirm that the optimized workflow demonstrates marked improvements in amplitude preservation and imaging resolution, offering novel insights for future reservoir characterization endeavors.
Funding: Funded by the Deanship of Graduate Studies and Scientific Research at Jouf University under grant No. (DGSSR-2024-02-01264).
Abstract: Automated essay scoring (AES) systems have gained significant importance in educational settings, offering a scalable, efficient, and objective method for evaluating student essays. However, developing AES systems for Arabic poses distinct challenges due to the language's complex morphology, diglossia, and the scarcity of annotated datasets. This paper presents a hybrid approach to Arabic AES by combining text-based, vector-based, and embedding-based similarity measures to improve essay scoring accuracy while minimizing the training data required. Using a large Arabic essay dataset categorized into thematic groups, the study conducted four experiments to evaluate the impact of feature selection, data size, and model performance. Experiment 1 established a baseline using a non-machine learning approach, selecting top-N correlated features to predict essay scores. The subsequent experiments employed 5-fold cross-validation. Experiment 2 showed that combining embedding-based, text-based, and vector-based features in a Random Forest (RF) model achieved an R2 of 88.92% and an accuracy of 83.3% within a 0.5-point tolerance. Experiment 3 further refined the feature selection process, demonstrating that 19 correlated features yielded optimal results, improving R2 to 88.95%. In Experiment 4, an optimal data efficiency training approach was introduced, where training data portions increased from 5% to 50%. The study found that using just 10% of the data achieved near-peak performance, with an R2 of 85.49%, emphasizing an effective trade-off between performance and computational costs. These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems, especially in low-resource environments, addressing linguistic challenges while ensuring efficient data usage.
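The two metrics reported in the abstract above can be computed as in the following sketch. It assumes that "accuracy within a 0.5-point tolerance" means the share of essays whose predicted score lies within 0.5 points of the human score, boundary included, and that R2 is the usual coefficient of determination; the abstract does not spell out either definition. The function names and example scores are illustrative.

```python
def tolerance_accuracy(y_true, y_pred, tol=0.5):
    """Fraction of predictions within `tol` points of the true score.

    The inclusive boundary (<=) is an assumption of this sketch.
    """
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return hits / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For human scores `[1, 2, 3, 4]` and predictions `[1.4, 2.6, 3.0, 3.5]`, three of the four predictions fall within the tolerance, so the tolerance accuracy is 0.75, and R2 is 1 - 0.77 / 5 = 0.846.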
Funding: Supported by the project “Romanian Hub for Artificial Intelligence-HRIA”, Smart Growth, Digitization and Financial Instruments Program, 2021–2027, MySMIS No. 334906.
Abstract: Objective expertise evaluation of individuals, as a prerequisite stage for team formation, has been a long-term desideratum in large software development companies. With the rapid advancements in machine learning methods, based on reliable existing data stored in project management tools' datasets, automating this evaluation process becomes a natural step forward. In this context, our approach focuses on quantifying software developer expertise by using metadata from task-tracking systems. For this, we mathematically formalize two categories of expertise: technology-specific expertise, which denotes the skills required for a particular technology, and general expertise, which encapsulates overall knowledge in the software industry. Afterward, we automatically classify the zones of expertise associated with each task a developer has worked on using Bidirectional Encoder Representations from Transformers (BERT)-like transformers to handle the unique characteristics of project tool datasets effectively. Finally, our method evaluates the proficiency of each software specialist across already completed projects from both technology-specific and general perspectives. The method was experimentally validated, yielding promising results.