Cross-matching is a key technique for fusing multi-band astronomical catalogs. Because of differences in equipment such as astronomical telescopes, the existence of measurement errors, and the proper motions of celestial bodies, the same celestial object appears at different positions in different catalogs, making it difficult to integrate multi-band or full-band astronomical data. In this study, we propose an online cross-matching method based on pseudo-spherical indexing techniques and develop a service, combined with a high-performance computing system (Taurus), to improve cross-matching efficiency; the service is designed for the Data Center of Xinjiang Astronomical Observatory. Specifically, we use the Quad Tree Cube scheme to divide the celestial sphere into blocks, map the 2D space composed of R.A. and decl. to a 1D space, and establish the correspondence between real celestial objects and spherical patches. Finally, we verify the performance of the service using the Gaia 3 and PPMXL catalogs. We also send the matching results to the VO tools TOPCAT and Aladin to obtain visual results. The experimental results show that the service effectively resolves the cross-matching speed bottleneck caused by frequent I/O and significantly improves the retrieval and matching speed of massive astronomical data.
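The 2D-to-1D mapping described above can be illustrated with a minimal sketch: project R.A./decl. onto one of six cube faces and interleave quadtree cell bits into a single integer, so that nearby sky positions land in nearby index values. This is only an illustration of the Quad Tree Cube idea, not the service's actual implementation; all function names and the depth parameter are hypothetical.

```python
import math

def radec_to_xyz(ra_deg, dec_deg):
    """Convert R.A./decl. (degrees) to a unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def cube_face_uv(x, y, z):
    """Project a unit vector onto its dominant cube face,
    returning (face_id, u, v) with u, v in [0, 1]."""
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        face, u, v = (0 if x > 0 else 1), y / ax, z / ax
    elif ay >= az:
        face, u, v = (2 if y > 0 else 3), x / ay, z / ay
    else:
        face, u, v = (4 if z > 0 else 5), x / az, y / az
    return face, (u + 1) / 2, (v + 1) / 2

def q3c_like_index(ra_deg, dec_deg, depth=12):
    """Interleave quadtree cell bits (Morton order) into one integer,
    prefixed by the cube face id."""
    face, u, v = cube_face_uv(*radec_to_xyz(ra_deg, dec_deg))
    iu = min(int(u * (1 << depth)), (1 << depth) - 1)
    iv = min(int(v * (1 << depth)), (1 << depth) - 1)
    code = 0
    for b in range(depth):
        code |= ((iu >> b) & 1) << (2 * b) | ((iv >> b) & 1) << (2 * b + 1)
    return (face << (2 * depth)) | code
```

Indexing both catalogs this way lets a cross-match become a 1D range query, which is what makes database-side matching fast.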
DEAR EDITOR, Since the first reported severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection in December 2019, coronavirus disease 2019 (COVID-19) has become a global pandemic, spreading to more than 200 countries and regions worldwide. With continued research progress and virus detection, SARS-CoV-2 genomes and sequencing data have been reported and accumulated at an unprecedented rate.
The Xinjiang Astronomical Observatory Data Center faces issues with delay-affected services: because transmission links are overloaded, these services cannot be delivered in a timely manner. In this paper, software-defined network technology is applied to the Xinjiang Astronomical Observatory Data Center Network (XAODCN). Specifically, a novel reconfiguration method is proposed to realise the software-defined Xinjiang Astronomical Observatory Data Center Network (SDXAO-DCN), and a network model is constructed. To overcome congestion, a traffic load-balancing algorithm is designed for fast transmission of service traffic by combining three factors: network structure, congestion level, and transmission service. The proposed algorithm is compared with load-balancing algorithms commonly used in data centers to verify its efficiency. Simulation experiments show that the algorithm improves transmission performance and transmission quality for the SDXAO-DCN.
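The three-factor idea above can be sketched as a composite path cost: a weighted sum of hop count (network structure), bottleneck-link utilization (congestion level), and a per-service delay-sensitivity weight (transmission service). The weights and normalization below are illustrative placeholders, not the paper's calibrated algorithm.

```python
def path_cost(hops, link_utils, service_weight, a=0.3, b=0.5, c=0.2):
    """Composite cost combining hop count, worst-link utilization,
    and the service's delay sensitivity. Weights a, b, c are illustrative."""
    congestion = max(link_utils)  # bottleneck link dominates
    return a * hops / 10 + b * congestion + c * service_weight * congestion

def pick_path(paths, service_weight):
    """Choose the lowest-cost candidate path for a flow."""
    return min(paths, key=lambda p: path_cost(p["hops"], p["utils"], service_weight))
```

A delay-sensitive flow (large `service_weight`) is steered away from congested links even when the uncongested path is longer.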
Genomic data serve as an invaluable resource for unraveling the intricacies of higher plant systems, including the constituent elements within and among species. Through various efforts in genomic data archiving, integrative analysis, and value-added curation, the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB), has established and currently maintains a vast collection of database resources. This dedicated initiative of the NGDC sustains a data-rich ecosystem that greatly strengthens and supports genomic research. Here, we present a comprehensive overview of the central repositories dedicated to archiving, presenting, and sharing plant omics data; introduce knowledgebases focused on variant- or gene-based functional insights; highlight species-specific multi-omics database resources; and briefly review the online application tools. We intend this review to serve as a guide map for plant researchers wishing to select effective data resources from the NGDC for their specific areas of study.
On October 18, 2017, the 19th National Congress Report called for the implementation of the Healthy China Strategy. The development of biomedical data plays a pivotal role in advancing this strategy. Since the 18th National Congress of the Communist Party of China, China has vigorously promoted the integration and implementation of the Healthy China and Digital China strategies. The National Health Commission has prioritized the development of health and medical big data, issuing policies to promote standardized applications and foster innovation in "Internet + Healthcare." Biomedical data have contributed significantly to precision medicine, personalized health management, drug development, disease diagnosis, public health monitoring, and epidemic prediction.
【Objective】Medical imaging data has great value, but it contains a significant amount of sensitive patient information. At present, laws and regulations regarding the de-identification of medical imaging data are not clearly defined around the world. This study aims to develop a tool that meets compliance-driven desensitization requirements tailored to diverse research needs. 【Methods】To enhance the security of medical image data, we designed and implemented a DICOM-format medical image de-identification system on the Windows operating system. 【Results】Our custom de-identification system is adaptable to the legal standards of different countries and can accommodate specific research demands. The system offers both web-based online and desktop offline de-identification capabilities, enabling customization of de-identification rules and facilitating batch processing to improve efficiency. 【Conclusions】This medical image de-identification system robustly strengthens the stewardship of sensitive medical data, aligning with data security protection requirements while facilitating the sharing and utilization of medical image data. This approach unlocks the intrinsic value inherent in such datasets.
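Customizable de-identification rules of the kind described above can be sketched as a per-tag action table applied to DICOM attributes. The rule set and tag names below are illustrative (in the spirit of common confidentiality profiles), not the system's actual configuration, and the code operates on a plain dict standing in for a parsed DICOM dataset.

```python
# Illustrative rule actions: "remove" drops the tag, "blank" empties it,
# "keep" retains it. Unknown tags fall back to a conservative default.
RULES = {
    "PatientName":      "blank",
    "PatientID":        "blank",
    "PatientBirthDate": "remove",
    "InstitutionName":  "remove",
    "Modality":         "keep",
    "StudyDate":        "keep",
}

def deidentify(tags, rules=RULES, default="remove"):
    """Apply per-tag rules; unknown tags use `default` so nothing
    sensitive slips through unreviewed."""
    out = {}
    for name, value in tags.items():
        action = rules.get(name, default)
        if action == "keep":
            out[name] = value
        elif action == "blank":
            out[name] = ""
    return out
```

Batch processing then amounts to mapping `deidentify` over a directory of datasets; rule tables per jurisdiction give the adaptability to different legal standards.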
In this study, we developed a high-resolution (3 arcsec, approximately 90 m) V_(S30) map and associated open-access dataset for the 140 km × 200 km region affected by the January 2025 M6.8 Dingri, Xizang, China earthquake. This map provides significantly finer resolution than existing V_(S30) maps, which typically use a 30 arcsec grid. The V_(S30) values were estimated using the Cokriging-based V_(S30) proxy model (SCK model), which integrates V_(S30) measurements as primary constraints and topographic slope as a secondary parameter. The findings indicate that V_(S30) values range from 200 to 250 m/s in the sedimentary deposit areas near the earthquake's epicenter and from 400 to 600 m/s in the surrounding mountainous regions. This study showcases the capability of the SCK model to efficiently generate V_(S30) estimates across various spatial resolutions and demonstrates its effectiveness in data-sparse regions.
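The role of topographic slope as a secondary parameter can be illustrated with a simple slope-to-V_(S30) lookup in the spirit of topographic-proxy methods. The bin edges and values below are placeholders chosen to match the range reported above (200-250 m/s in basins, 400-600 m/s in mountains); they are not the calibrated SCK model coefficients.

```python
# (max slope in m/m, proxy V_S30 in m/s) — illustrative bins only
SLOPE_BINS = [
    (1e-4, 200.0),          # flat sediment basins
    (2e-3, 300.0),
    (1e-2, 400.0),
    (5e-2, 500.0),
    (float("inf"), 600.0),  # steep mountainous terrain
]

def vs30_from_slope(slope):
    """Return the proxy V_S30 for a topographic slope value (m/m)."""
    for max_slope, vs30 in SLOPE_BINS:
        if slope <= max_slope:
            return vs30
```

In a cokriging setting, a surface like this supplies the secondary variable, while sparse in-situ V_(S30) measurements anchor the primary constraint.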
With the rise of data-intensive research, data literacy has become a critical capability for improving scientific data quality and achieving artificial intelligence (AI) readiness. In the biomedical domain, data are characterized by high complexity and privacy sensitivity, calling for robust and systematic data management skills. This paper reviews current trends in scientific data governance and the evolving policy landscape, highlighting persistent challenges such as inconsistent standards, semantic misalignment, and limited awareness of compliance. These issues are largely rooted in the lack of structured training and practical support for researchers. In response, this study builds on existing data literacy frameworks and integrates the specific demands of biomedical research to propose a comprehensive, lifecycle-oriented data literacy competency model with an emphasis on ethics and regulatory awareness. Furthermore, it outlines a tiered training strategy tailored to different research stages (undergraduate, graduate, and professional), offering theoretical foundations and practical pathways for universities and research institutions to advance data literacy education.
Astronomical spectroscopy is crucial for exploring the physical properties, chemical composition, and kinematic behavior of celestial objects. With continuous advancements in observational technology, astronomical spectroscopy faces the dual challenges of rapidly expanding data volumes and relatively lagging data-processing capabilities. In this context, the rise of artificial intelligence technologies offers an innovative solution. This paper analyzes the latest developments in the application of machine learning to astronomical spectral data mining and discusses future research directions for AI-based spectral studies. The application of machine learning nevertheless presents several challenges: highly complex models often lack interpretability, complicating scientific understanding, and large-scale computational demands place higher requirements on hardware resources, significantly increasing computational costs. AI-based astronomical spectroscopy research should advance in the following key directions. First, develop efficient data augmentation techniques to enhance model generalization. Second, explore more interpretable model designs to ensure the reliability and transparency of scientific conclusions. Third, optimize computational efficiency and lower the threshold for deep-learning applications through collaborative innovation in algorithms and hardware. Furthermore, promoting cross-band data processing is essential to achieve seamless integration and comprehensive analysis of multi-source data, providing richer, multidimensional information to uncover the mysteries of the universe.
Since meteorological conditions are the main factor driving the transport and dispersion of air pollutants, an accurate simulation of the meteorological field directly affects the accuracy with which an atmospheric chemical transport model simulates PM_(2.5). Based on the NASM joint chemical data assimilation system, the authors quantified the impacts of different meteorological fields on pollutant simulations and revealed the role of meteorological conditions in the accumulation, maintenance, and dissipation of heavy haze pollution. For the two heavy pollution episodes from 10 to 24 November 2018, meteorological fields obtained from NCEP FNL and ERA5 reanalysis data were each used to drive the WRF model, and the differences in the simulated PM_(2.5) concentrations were analyzed. The results show that the meteorological field strongly influences the concentration levels and spatial distribution of the simulated pollution. The ERA5 group had relatively small simulation errors and yielded more accurate PM_(2.5) results: its RMSE was 11.86 μg m^(-3) lower than that of the FNL group before assimilation, and 5.77 μg m^(-3) lower after joint assimilation. The authors then used the ERA5-driven PM_(2.5) simulations to discuss the role of the wind field and circulation patterns in the pollution process, to analyze the correlations of wind speed, temperature, relative humidity, and boundary layer height with pollutant concentrations, and to further clarify the key formation mechanism of this pollution episode.
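The RMSE figures quoted above are the standard root-mean-square error between simulated and observed concentrations, which can be computed directly (a generic sketch, not tied to the study's evaluation code):

```python
import math

def rmse(simulated, observed):
    """Root-mean-square error between simulated and observed values
    (e.g., PM2.5 concentrations in ug m^-3)."""
    assert len(simulated) == len(observed) and simulated
    return math.sqrt(sum((s - o) ** 2
                         for s, o in zip(simulated, observed)) / len(simulated))
```

Comparing `rmse(fnl_run, obs)` against `rmse(era5_run, obs)` over the same stations is the comparison the differences of 11.86 and 5.77 μg m^(-3) summarize.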
The corrosion degradation of organic coatings in tropical marine atmospheric environments results in substantial economic losses across various industries. The complexity of the dynamic environment, combined with high costs, extended experimental periods, and limited data, limits our comprehension of this process. This study addresses the challenge by investigating the corrosion degradation of damaged organic coatings in a tropical marine environment using an atmospheric corrosion monitoring sensor and a random forest (RF) model. To simulate damage, a polyurethane coating applied to a Fe/graphite corrosion sensor was intentionally scratched and exposed to the marine atmosphere for over one year. Pearson correlation analysis was performed for the collection and filtering of environmental and corrosion current data. According to the RF model, the following conditions contributed to accelerated degradation: relative humidity (RH) above 80% and temperatures below 22.5 °C, with the risk increasing significantly when RH exceeded 90%. High RH and temperature exhibited a cumulative effect on coating degradation, and the risk of corrosion was high at night. The RF model was also used to predict the coating degradation process with environmental data as input parameters, and its accuracy improved when the duration of influential environmental ranges was considered.
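The environmental thresholds reported above can be expressed as a simple risk rule, useful for flagging high-risk exposure periods in monitoring data. This is a hand-written paraphrase of the stated thresholds, not the fitted RF model, and the risk labels are hypothetical.

```python
def degradation_risk(rh, temp_c):
    """Classify coating-degradation risk from relative humidity (%)
    and temperature (deg C), following the reported thresholds:
    RH > 90% is high risk; RH > 80% with T < 22.5 C is elevated."""
    if rh > 90:
        return "high"
    if rh > 80 and temp_c < 22.5:
        return "elevated"
    return "low"
```

Applied hourly, a rule like this also reproduces the nighttime-risk observation, since RH typically peaks and temperature dips at night.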
Despite the widespread use of decision trees (DT) across various applications, their performance tends to suffer on imbalanced datasets, where some classes significantly outweigh others. Cost-sensitive learning is one strategy for this problem, and several cost-sensitive DT algorithms have been proposed. However, existing algorithms are heuristic: they greedily select a better splitting point or feature at each node, leading to local optima and ignoring the cost of the whole tree. In addition, determining the costs is difficult and often requires domain expertise. This study proposes a DT for imbalanced data, called the Swarm-based Cost-sensitive DT (SCDT), using the cost-sensitive learning strategy and an enhanced swarm-based algorithm. The DT is encoded using a hybrid individual representation, and a hybrid artificial bee colony approach is designed to optimize rules, incorporating the specified costs into an F-measure-based fitness function. Experimental comparisons with state-of-the-art DT algorithms show that SCDT achieved the highest performance on most datasets. SCDT also excels in other critical metrics, such as recall, precision, F1-score, and AUC, with notable average values of 83%, 87.3%, 85%, and 80.7%, respectively.
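An F-measure-based fitness of the general kind mentioned above can be sketched from confusion-matrix counts; the `beta` parameter (an assumption here, not necessarily part of SCDT) lets the fitness weight recall more heavily, which suits imbalanced data.

```python
def f_measure_fitness(tp, fp, fn, beta=1.0):
    """F-measure of a candidate rule set from true positives,
    false positives, and false negatives; beta > 1 favors recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

In a swarm search, each candidate tree is scored with a fitness like this on validation data, and the colony keeps and perturbs the best-scoring individuals.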
Thermal analysis of data centers is urgently needed to ensure that computer chips remain below their critical temperature while the energy consumed for cooling is reduced. It is difficult to obtain detailed hotspot locations and chip temperatures in large data centers containing hundreds of racks or more by direct measurement. In this paper, a multi-scale thermal analysis method is proposed that can predict the temperature distribution of chips and solder balls in data centers. The multi-scale model is divided into six scales: room, rack, server, Insulated-Gate Bipolar Transistor (IGBT), chip, and solder ball. A sub-model concept is proposed, and the six levels are organized into four simulation sub-models: Sub-model 1 contains Room, Rack, and Server (RRS); Sub-model 2 contains Server and IGBT (SI); Sub-model 3 contains IGBT and Chip (IC); and Sub-model 4 contains Chip and Solder-ball (CS). These four sub-models are one-way coupled by passing their results as boundary conditions between levels. A full-field simulation is used to verify the efficiency and accuracy of the multi-scale method for a single-rack data center: both simulations place the highest temperature in the same location, while the Single-rack Full-field Model (SRFFM) costs 2.5 times more computational time than the Single-rack Multi-scale Model (SRMSM). The deviations in the highest temperatures of chips and solder balls between the two models are 1.57 °C and 0.2 °C, respectively, indicating that the multi-scale simulation method holds good prospects for data center thermal simulation. Finally, the multi-scale thermal analysis method is applied to a ship data center with 15 racks.
With the rapid advancement of sequencing technologies and the growing volume of omics data in plants, there is much anticipation of mining the treasure in such big data and accordingly refining current agricultural practice for application in the near future. Toward this end, database resources that deliver web services for plant omics data submission, archiving, and integration are urgently needed. As part of the Beijing Institute of Genomics (BIG) of the Chinese Academy of Sciences (CAS), the BIG Data Center (http://bigd.big.ac.cn) provides open access to a suite of database resources (Table 1), with the aim of supporting plant research for domestic and international users in both academia and industry and translating big data into big discoveries (BIG Data Center Members, 2017; BIG Data Center Members, 2018; BIG Data Center Members, 2019). Here, we give a brief introduction to the plant-related database resources in the BIG Data Center and appeal to plant research communities to make full use of these resources for plant data submission, archiving, and integration.
Nitrogen dioxide (NO_(2)) poses a critical risk to environmental quality and public health, and a reliable machine learning (ML) forecasting framework can provide valuable information to support government decision-making. Based on data from 1609 air quality monitors across China from 2014 to 2020, this study designed an ensemble ML model for time-sensitive prediction over a wide range by integrating multiple types of spatial-temporal variables and three sub-models. The ensemble model adds a residual connection to the gated recurrent unit (GRU) network and combines the advantages of Transformer, extreme gradient boosting (XGBoost), and the GRU with residual connection, yielding a 4.1% ± 1.0% lower root mean square error than XGBoost on the test set. The ensemble model shows strong predictive performance, with coefficients of determination of 0.91, 0.86, and 0.77 for 1-hr, 3-hr, and 24-hr averages on the test set, respectively. In particular, the model achieves excellent performance with low spatial uncertainty in Central, East, and North China, the major site-dense zones. Through interpretability analysis based on Shapley values at different temporal resolutions, we found that atmospheric chemical processes contribute more to hourly predictions than to daily-scale predictions, whereas meteorological conditions are more prominent for the latter. Compared with existing models at different spatiotemporal scales, the present model can be deployed at any air quality monitoring station across China to deliver rapid and dependable NO_(2) forecasts, which will help develop effective control policies.
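The ensemble step itself, combining the three sub-models' predictions, can be sketched as a weighted average. This is only the generic combination pattern: the paper's actual ensemble and any learned weights are not reproduced here, and the default equal weighting is an assumption.

```python
def ensemble_predict(preds, weights=None):
    """Combine per-station prediction series from several sub-models
    (e.g., Transformer, XGBoost, residual GRU) with a weighted average."""
    n = len(preds)
    if weights is None:
        weights = [1.0 / n] * n  # equal weighting by default
    return [sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(len(preds[0]))]
```

In practice, the weights would be fitted on held-out data so that the ensemble's RMSE improves on the best single sub-model.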
Artificial Intelligence (AI) is an interdisciplinary research field with widespread applications. It aims to develop theoretical, methodological, technological, and applied systems that simulate, enhance, and assist human intelligence. Recently, notable accomplishments of AI technology have been achieved in astronomical data processing, establishing the technology as central to numerous astronomical research areas such as radio astronomy, stellar and Galactic (Milky Way) studies, exoplanet surveys, cosmology, and solar physics. This article systematically reviews representative applications of AI to astronomical data processing, with comprehensive descriptions of specific cases: pulsar candidate identification, fast radio burst detection, gravitational wave detection, spectral classification, and radio frequency interference mitigation. Furthermore, it discusses possible future applications to provide perspectives for astronomical research in the AI era.
Background: big data challenges. In the late 1980s and early 1990s, three major international biological data centers were created: the DNA Data Bank of Japan (DDBJ) [1], the European Bioinformatics Institute (EMBL-EBI) in the United Kingdom (UK) [2], and the National Center for Biotechnology Information (NCBI) in the United States (US) [3].
With the popularisation of intelligent power systems, power devices vary in shape, number, and specification. As a result, power data exhibits distributional variability, and the model learning process cannot sufficiently extract data features, which seriously affects the accuracy and performance of anomaly detection. Therefore, this paper proposes a deep learning-based anomaly detection model for power data that integrates a data alignment enhancement technique based on random sampling and an adaptive feature fusion method leveraging dimension reduction. To address the distributional variability of power data, we developed a sliding window-based data adjustment method for this model, which solves the problems of high-dimensional feature noise and low-dimensional missing data. To address insufficient feature fusion, an adaptive feature fusion method based on feature dimension reduction and dictionary learning is proposed to improve the model's anomaly detection accuracy. To verify the effectiveness of the proposed method, we conducted ablation experiments. The results show that, compared with traditional anomaly detection methods, the proposed method not only has an advantage in accuracy but also reduces the amount of parameter computation during feature matching and improves detection speed.
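The sliding-window adjustment mentioned above can be sketched as slicing a power-measurement series into fixed-size, possibly overlapping windows, with a simple padding policy for the short tail so every window aligns to the model's input size. The padding choice (repeat the last value) is an assumption for illustration, not the paper's method.

```python
def sliding_windows(series, size, step):
    """Slice a 1-D series into fixed-size windows with stride `step`,
    padding the final short window by repeating its last value."""
    windows = []
    for start in range(0, len(series), step):
        win = series[start:start + size]
        if len(win) < size:
            win = win + [win[-1]] * (size - len(win))
        windows.append(win)
        if start + size >= len(series):
            break
    return windows
```

Uniform window shapes let heterogeneous device streams feed one model, which is the alignment problem the distributional variability creates.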
For real-time processing of ultra-wide-bandwidth low-frequency pulsar baseband data, we designed and implemented an ultra-wide-bandwidth low-frequency pulsar data processing pipeline (UWLPIPE) based on a shared ring buffer and GPU parallel technology. UWLPIPE runs on a GPU cluster and can simultaneously receive multiple 128 MHz dual-polarization VDIF data packets preprocessed by the front-end FPGA. After aligning the dual-polarization data, the 128 MHz subband data are packaged into PSRDADA baseband data or multi-channel coherently dedispersed filterbank data, and multiple subband filterbank data can be spliced into wideband data after time alignment. We used the Nanshan 26 m radio telescope with the L-band receiver at 964~1732 MHz to observe multiple pulsars. Finally, we processed the data with the DSPSR software; the results showed that each subband correctly folds out the pulse profile and that the wideband pulse profile accumulated from multiple subbands is correctly aligned.
To address the real-time processing of ultra-wide-bandwidth pulsar baseband data, we designed and implemented a pulsar baseband data processing algorithm (PSRDP) based on GPU parallel computing technology. PSRDP performs operations such as baseband data unpacking, channel separation, coherent dedispersion, Stokes detection, phase and folding period prediction, and folding integration on GPU clusters. We tested the algorithm using the J0437-4715 pulsar baseband data generated by the CASPSR and Medusa backends of the Parkes telescope, and the J0332+5434 pulsar baseband data generated by the self-developed backend of the Nanshan Radio Telescope, and obtained pulse profiles from each baseband dataset. Experimental analysis shows that the pulse profiles generated by the PSRDP algorithm are essentially consistent with the results of the Digital Signal Processing Software for Pulsar Astronomy (DSPSR), which verifies the effectiveness of the algorithm. Furthermore, on the same baseband data, we compared the processing speed of PSRDP with DSPSR, and the results showed that PSRDP is not slower than DSPSR. The theoretical and technical experience gained from the PSRDP research lays a technical foundation for the real-time processing of ultra-wide-bandwidth pulsar baseband data from the QTT (Qitai radio Telescope).
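As background to the dedispersion step in pipelines like these, the frequency-dependent delay that dispersion imposes follows the standard cold-plasma formula, with the conventional dispersion constant 4.148808×10^3 MHz² pc⁻¹ cm³ s. This is a textbook evaluation of that formula, not code from PSRDP or UWLPIPE.

```python
D_CONST = 4.148808e3  # dispersion constant, MHz^2 pc^-1 cm^3 s

def dispersion_delay(dm, f_lo_mhz, f_hi_mhz):
    """Differential dispersive delay (seconds) between a lower and a
    higher observing frequency, for dispersion measure dm (pc cm^-3)."""
    return D_CONST * dm * (f_lo_mhz ** -2 - f_hi_mhz ** -2)
```

Coherent dedispersion removes this delay exactly within each channel by deconvolving the dispersion transfer function from the raw voltages, rather than merely shifting detected channels by these amounts.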
Funding (cross-matching study): supported by the National Key R&D Program of China (Nos. 2022YFF0711502 and 2021YFC2203502); the National Natural Science Foundation of China (NSFC) (12173077 and 12003062); the Tianshan Innovation Team Plan of Xinjiang Uygur Autonomous Region (2022D14020); the Tianshan Talent Project of Xinjiang Uygur Autonomous Region (2022TSYCCX0095); the Scientific Instrument Developing Project of the Chinese Academy of Sciences (grant No. PTYQ2022YZZD01); the China National Astronomical Data Center (NADC); the Operation, Maintenance and Upgrading Fund for Astronomical Telescopes and Facility Instruments, budgeted from the Ministry of Finance of China (MOF) and administrated by the Chinese Academy of Sciences (CAS); the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A360); and the Astronomical Big Data Joint Research Center, co-founded by the National Astronomical Observatories, Chinese Academy of Sciences.
Funding (SARS-CoV-2 correspondence): supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB38030200, XDB38050300, XDA19090116, XDA19050302) and the National Key R&D Program of China (2020YFC0848900, 2020YFC0847000).
Funding (SDXAO-DCN study): supported by the National Key R&D Program of China (No. 2021YFC2203502); the National Natural Science Foundation of China (NSFC) (11803080, 12173077, 11873082, 12003062); the Tianshan Innovation Team Plan of Xinjiang Uygur Autonomous Region (2022D14020); the Youth Innovation Promotion Association CAS; and the National Key R&D Program of China (No. 2018YFA0404704).
Funding: Supported by Technological Innovation 2030 (2022ZD0401701), the National Natural Science Foundation of China (32000475, 32030021), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA24040201), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Y2021038).
Abstract: Genomic data serve as an invaluable resource for unraveling the intricacies of higher plant systems, including the constituent elements within and among species. Through various efforts in genomic data archiving, integrative analysis, and value-added curation, the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB), has established and currently maintains a vast collection of database resources. This dedicated initiative of the NGDC fosters a data-rich ecosystem that greatly strengthens and supports genomic research. Here, we present a comprehensive overview of the central repositories dedicated to archiving, presenting, and sharing plant omics data, introduce knowledgebases focused on variants or gene-based functional insights, highlight species-specific multi-omics database resources, and briefly review the online application tools. We intend this review to serve as a guide map for plant researchers wishing to select effective data resources from the NGDC for their specific areas of study.
Abstract: On October 18, 2017, the 19th National Congress Report called for the implementation of the Healthy China Strategy. The development of biomedical data plays a pivotal role in advancing this strategy. Since the 18th National Congress of the Communist Party of China, China has vigorously promoted the integration and implementation of the Healthy China and Digital China strategies. The National Health Commission has prioritized the development of health and medical big data, issuing policies to promote standardized applications and foster innovation in "Internet + Healthcare." Biomedical data has contributed significantly to precision medicine, personalized health management, drug development, disease diagnosis, public health monitoring, and epidemic prediction capabilities.
Funding: CAMS Innovation Fund for Medical Sciences (CIFMS): "Construction of an Intelligent Management and Efficient Utilization Technology System for Big Data in Population Health Science" (2021-I2M-1-057), and Key Projects of the Innovation Fund of the National Clinical Research Center for Orthopedics and Sports Rehabilitation: "National Orthopedics and Sports Rehabilitation Real-World Research Platform System Construction" (23-NCRC-CXJJ-ZD4).
Abstract: [Objective] Medical imaging data has great value, but it contains a significant amount of sensitive patient information. At present, laws and regulations concerning the de-identification of medical imaging data are not clearly defined around the world. This study aims to develop a tool that meets compliance-driven desensitization requirements tailored to diverse research needs. [Methods] To enhance the security of medical image data, we designed and implemented a DICOM-format medical image de-identification system on the Windows operating system. [Results] Our custom de-identification system is adaptable to the legal standards of different countries and can accommodate specific research demands. The system offers both web-based online and desktop offline de-identification capabilities, enabling customization of de-identification rules and facilitating batch processing to improve efficiency. [Conclusions] This medical image de-identification system robustly strengthens the stewardship of sensitive medical data, aligning with data security protection requirements while facilitating the sharing and utilization of medical image data. This approach unlocks the intrinsic value inherent in such datasets.
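The customizable-rules idea can be sketched as a small rule table mapping tags to actions (remove, replace, or hash). The tag names and rule format below are illustrative assumptions; a real system would operate on DICOM datasets (e.g. via pydicom) rather than a plain dict, and would follow a jurisdiction-specific profile.

```python
import hashlib

# Illustrative rules: each entry names a tag and an action.
RULES = [
    ("PatientName",      "replace", "ANONYMOUS"),
    ("PatientID",        "hash",    None),  # keep linkability, drop identity
    ("PatientBirthDate", "remove",  None),
]

def deidentify(header, rules=RULES):
    """Apply de-identification rules to a dict standing in for a
    DICOM header. Tags not covered by any rule pass through unchanged.
    """
    out = dict(header)
    for tag, action, value in rules:
        if tag not in out:
            continue
        if action == "remove":
            del out[tag]
        elif action == "replace":
            out[tag] = value
        elif action == "hash":
            out[tag] = hashlib.sha256(str(out[tag]).encode()).hexdigest()[:16]
    return out
```

Keeping the rules as data rather than code is what lets one tool serve different legal standards: swapping the rule table changes the compliance profile without touching the engine.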
Funding: Supported by the National Natural Science Foundation of China (No. 42120104002).
Abstract: In this study, we developed a high-resolution (3 arcsec, approximately 90 m) V_(S30) map and an associated open-access dataset for the 140 km × 200 km region affected by the January 2025 M6.8 Dingri, Xizang, China earthquake. This map provides significantly finer resolution than existing V_(S30) maps, which typically use a 30 arcsec grid. The V_(S30) values were estimated using the Cokriging-based V_(S30) proxy model (SCK model), which integrates V_(S30) measurements as primary constraints and utilizes topographic slope as a secondary parameter. The findings indicate that V_(S30) values range from 200 to 250 m/s in the sedimentary deposit areas near the earthquake's epicenter and from 400 to 600 m/s in the surrounding mountainous regions. This study showcases the capability of the SCK model to efficiently generate V_(S30) estimates across various spatial resolutions and demonstrates its effectiveness in producing reliable estimates in data-sparse regions.
Abstract: With the rise of data-intensive research, data literacy has become a critical capability for improving scientific data quality and achieving artificial intelligence (AI) readiness. In the biomedical domain, data are characterized by high complexity and privacy sensitivity, calling for robust and systematic data management skills. This paper reviews current trends in scientific data governance and the evolving policy landscape, highlighting persistent challenges such as inconsistent standards, semantic misalignment, and limited awareness of compliance. These issues are largely rooted in the lack of structured training and practical support for researchers. In response, this study builds on existing data literacy frameworks and integrates the specific demands of biomedical research to propose a comprehensive, lifecycle-oriented data literacy competency model with an emphasis on ethics and regulatory awareness. Furthermore, it outlines a tiered training strategy tailored to different research stages (undergraduate, graduate, and professional), offering theoretical foundations and practical pathways for universities and research institutions to advance data literacy education.
Funding: Supported by the National Key R&D Program of China (2021YFC2203502 and 2022YFF0711502), the National Natural Science Foundation of China (NSFC) (12173077), the Tianshan Talent Project of Xinjiang Uygur Autonomous Region (2022TSYCCX0095 and 2023TSYCCX0112), the Scientific Instrument Developing Project of the Chinese Academy of Sciences (PTYQ2022YZZD01), the China National Astronomical Data Center (NADC), the Operation, Maintenance and Upgrading Fund for Astronomical Telescopes and Facility Instruments, budgeted from the Ministry of Finance of China (MOF) and administrated by the Chinese Academy of Sciences, and the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A360).
Abstract: Astronomical spectroscopy is crucial for exploring the physical properties, chemical composition, and kinematic behavior of celestial objects. With continuous advancements in observational technology, astronomical spectroscopy faces the dual challenges of rapidly expanding data volumes and relatively lagging data-processing capabilities. In this context, the rise of artificial intelligence technologies offers an innovative solution to these challenges. This paper analyzes the latest developments in the application of machine learning to astronomical spectral data mining and discusses future research directions in AI-based spectral studies. However, the application of machine learning presents several challenges. The high complexity of models often comes with insufficient interpretability, complicating scientific understanding. Moreover, the large-scale computational demands place higher requirements on hardware resources, leading to a significant increase in computational costs. AI-based astronomical spectroscopy research should therefore advance in the following key directions. First, develop efficient data augmentation techniques to enhance model generalization. Second, explore more interpretable model designs to ensure the reliability and transparency of scientific conclusions. Third, optimize computational efficiency and lower the threshold for deep-learning applications through collaborative innovation in algorithms and hardware. Furthermore, promoting cross-band data processing is essential to achieve seamless integration and comprehensive analysis of multi-source data, providing richer, multidimensional information to uncover the mysteries of the universe.
Funding: Supported by the Second Tibetan Plateau Scientific Expedition and Research Program of the Ministry of Science and Technology of the People's Republic of China (grant number 2022QZKK0101) and the Science and Technology Department of the Tibet Program (grant number XZ202301ZY0035G).
Abstract: Since meteorological conditions are the main factor driving the transport and dispersion of air pollutants, the accuracy of the simulated meteorological field directly affects the accuracy with which an atmospheric chemical transport model simulates PM_(2.5). Based on the NASM joint chemical data assimilation system, the authors quantified the impacts of different meteorological fields on the pollutant simulations and revealed the role of meteorological conditions in the accumulation, maintenance, and dissipation of heavy haze pollution. For the two heavy pollution episodes from 10 to 24 November 2018, meteorological fields were obtained from NCEP FNL and ERA5 reanalysis data, each used to drive the WRF model, to analyze the differences in the simulated PM_(2.5) concentration. The results show that the meteorological field strongly influences the concentration levels and spatial distribution in the pollution simulations. The ERA5 group had relatively small simulation errors and yielded more accurate PM_(2.5) results: its RMSE was 11.86 μg m^(-3) lower than that of the FNL group before assimilation, and 5.77 μg m^(-3) lower after joint assimilation. The authors used the PM_(2.5) simulation results obtained with the ERA5 data to discuss the role of the wind field and circulation pattern in the pollution process, to analyze the correlation of wind speed, temperature, relative humidity, and boundary layer height with pollutant concentrations, and to further clarify the key formation mechanism of this pollution process.
Funding: Supported by the National Key R&D Program of China (No. 2022YFB3808803), the National Natural Science Foundation of China (No. 52371049), and the National Science and Technology Resources Investigation Program of China (No. 2021FY100603).
Abstract: The corrosion degradation of organic coatings in tropical marine atmospheric environments results in substantial economic losses across various industries. The complexity of the dynamic environment, combined with high costs, extended experimental periods, and limited data, limits our comprehension of this process. This study addresses the challenge by investigating the corrosion degradation of damaged organic coatings in a tropical marine environment using an atmospheric corrosion monitoring sensor and a random forest (RF) model. To simulate damage, a polyurethane coating applied to a Fe/graphite corrosion sensor was intentionally scratched and exposed to the marine atmosphere for over one year. Pearson correlation analysis was performed for the collection and filtering of environmental and corrosion current data. According to the RF model, the following conditions contributed to accelerated degradation: relative humidity (RH) above 80% and temperatures below 22.5℃, with the risk increasing significantly when RH exceeded 90%. High RH and temperature exhibited a cumulative effect on coating degradation, and the risk of corrosion was high at night. The RF model was also used to predict the coating degradation process with environmental data as input parameters; accuracy improved when the duration of the influential environmental ranges was considered.
Abstract: Despite the widespread use of decision trees (DTs) across various applications, their performance tends to suffer on imbalanced datasets, where some classes heavily outnumber others. Cost-sensitive learning is a strategy for this problem, and several cost-sensitive DT algorithms have been proposed to date. However, existing algorithms are heuristic: they greedily select either a better splitting point or feature node, leading to local optima at tree nodes while ignoring the cost of the whole tree. In addition, determining the costs is difficult and often requires domain expertise. This study proposes a DT for imbalanced data, called the Swarm-based Cost-sensitive DT (SCDT), using the cost-sensitive learning strategy and an enhanced swarm-based algorithm. The DT is encoded using a hybrid individual representation, and a hybrid artificial bee colony approach is designed to optimize rules, considering specified costs in an F-measure-based fitness function. Experimental comparisons with state-of-the-art DT algorithms show that SCDT achieved the highest performance on most datasets. SCDT also excels in other critical metrics, such as recall, precision, F1-score, and AUC, with notable average values of 83%, 87.3%, 85%, and 80.7%, respectively.
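An F-measure-based fitness function of the kind the abstract mentions can be sketched as follows; the interface (raw confusion counts, a beta parameter to up-weight minority-class recall) is our illustrative choice, not necessarily the paper's exact formulation.

```python
def f_measure_fitness(tp, fp, fn, beta=1.0):
    """F-measure used as a fitness signal for candidate trees.

    With beta > 1, recall on the (costly) minority class is weighted
    more heavily, one way to encode misclassification cost without
    hand-picking a full cost matrix.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

In a swarm search, each candidate tree is evaluated on validation data to obtain (tp, fp, fn), and this scalar guides the bee colony toward trees that balance precision and recall globally rather than optimizing one split at a time.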
Funding: Supported by the National Natural Science Foundation of China (No. 51806167), the China Postdoctoral Science Foundation (2017M623166), the Science and Technology on Thermal Energy and Power Laboratory Open Foundation of China (No. TPL2017BA004), and the Fund of Xi'an Science and Technology Bureau (2019218714SYS002CG024).
Abstract: Thermal analysis of data centers is urgently needed to ensure that computer chips remain below their critical temperature while the energy consumed for cooling is reduced. Detailed hotspot locations and chip temperatures are difficult to obtain by direct measurement in large data centers containing hundreds of racks or more. In this paper, a multi-scale thermal analysis method is proposed that can predict the temperature distribution of chips and solder balls in data centers. The multi-scale model is divided into six scales: room, rack, server, Insulated-Gate Bipolar Transistor (IGBT), chip, and solder ball. A concept of sub-models is proposed, and the six levels are organized into four simulation sub-models: Sub-model 1 contains Room, Rack and Server (RRS); Sub-model 2 contains Server and IGBT (SI); Sub-model 3 contains IGBT and Chip (IC); and Sub-model 4 contains Chip and Solder-ball (CS). These four sub-models are one-way coupled by passing their results as boundary conditions between levels. A full-field simulation is employed to verify the efficiency and accuracy of the multi-scale method for a single-rack data center. The two simulations place the highest temperature in the same location, while the Single-rack Full-field Model (SRFFM) costs 2.5 times more computational time than the Single-rack Multi-scale Model (SRMSM). The deviations in the highest temperatures of the chips and solder balls between the two models are 1.57℃ and 0.2℃, which indicates that the multi-scale simulation method has good prospects for data center thermal simulation. Finally, the multi-scale thermal analysis method is applied to a ship data center with 15 racks.
Funding: Strategic Priority Research Program of the Chinese Academy of Sciences (XDA19050302 and XDA08020102, to Z.Z.), the National Natural Science Foundation of China (31871328, to Z.Z.), the K.C. Wong Education Foundation (to Z.Z.), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2017141, to S.S.).
Abstract: With the rapid advancement of sequencing technologies and the growing volume of omics data in plants, there is much anticipation of digging the treasure out of such big data and accordingly refining current agricultural practice for application in the near future. Toward this end, database resources that deliver web services for plant omics data submission, archiving, and integration are urgently needed. As a part of the Beijing Institute of Genomics (BIG) of the Chinese Academy of Sciences (CAS), the BIG Data Center (http://bigd.big.ac.cn) provides open access to a suite of database resources (Table 1), with the aim of supporting plant research activities for domestic and international users in both academia and industry to translate big data into big discoveries (BIG Data Center Members, 2017; BIG Data Center Members, 2018; BIG Data Center Members, 2019). Here, we give a brief introduction to the plant-related database resources in the BIG Data Center and appeal to plant research communities to make full use of these resources for plant data submission, archiving, and integration.
Funding: Supported by the Taishan Scholars (No. ts201712003).
Abstract: Nitrogen dioxide (NO_(2)) poses a critical potential risk to environmental quality and public health. A reliable machine learning (ML) forecasting framework would provide valuable information to support government decision-making. Based on data from 1609 air quality monitors across China from 2014-2020, this study designed an ensemble ML model for time-sensitive prediction over a wide range by integrating multiple types of spatial-temporal variables and three sub-models. The ensemble model incorporates a residual connection into the gated recurrent unit (GRU) network and combines the strengths of Transformer, extreme gradient boosting (XGBoost), and the GRU with residual connections, yielding a 4.1% ± 1.0% lower root mean square error than XGBoost on the test results. The ensemble model shows strong predictive performance, with coefficients of determination of 0.91, 0.86, and 0.77 for 1-hr, 3-hr, and 24-hr averages on the test results, respectively. In particular, the model achieves excellent performance with low spatial uncertainty in Central, East, and North China, the major site-dense zones. Through interpretability analysis based on the Shapley value at different temporal resolutions, we found that the contribution of atmospheric chemical processes is more important for hourly predictions, whereas the impact of meteorological conditions is more prominent for daily-scale predictions. Compared with existing models at different spatiotemporal scales, the present model can be implemented at any air quality monitoring station across China to facilitate rapid and dependable NO_(2) forecasts, which will help develop effective control policies.
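The abstract does not state how the three sub-model outputs are combined, so purely as a sketch, here is one common ensembling rule: weight each sub-model's predictions by the inverse of its validation RMSE. The function name and interface are our own assumptions.

```python
def ensemble_predict(preds, val_rmse):
    """Combine sub-model predictions with inverse-RMSE weights.

    preds    : list of prediction vectors, one per sub-model
               (e.g. Transformer, XGBoost, residual GRU).
    val_rmse : validation RMSE of each sub-model; a smaller error
               earns a larger weight.
    """
    weights = [1.0 / r for r in val_rmse]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalise to sum to 1
    n = len(preds[0])
    return [sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(n)]
```

Equal validation errors reduce this to a plain average; a sub-model that is twice as accurate gets twice the say.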
Funding: This work is supported by the National Key R&D Program of China (Nos. 2021YFC2203502 and 2022YFF0711502), the National Natural Science Foundation of China (NSFC) (12173077 and 12003062), the Tianshan Innovation Team Plan of Xinjiang Uygur Autonomous Region (2022D14020), the Tianshan Talent Project of Xinjiang Uygur Autonomous Region (2022TSYCCX0095), the Scientific Instrument Developing Project of the Chinese Academy of Sciences (Grant No. PTYQ2022YZZD01), the China National Astronomical Data Center (NADC), the Operation, Maintenance and Upgrading Fund for Astronomical Telescopes and Facility Instruments, budgeted from the Ministry of Finance of China (MOF) and administrated by the Chinese Academy of Sciences (CAS), and the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A360).
Abstract: Artificial Intelligence (AI) is an interdisciplinary research field with widespread applications, aiming to develop theoretical, methodological, technological, and applied systems that simulate, enhance, and assist human intelligence. Recently, notable accomplishments of AI technology have been achieved in astronomical data processing, establishing it as central to numerous areas of astronomical research such as radio astronomy, stellar and Galactic (Milky Way) studies, exoplanet surveys, cosmology, and solar physics. This article systematically reviews representative applications of AI to astronomical data processing, with comprehensive descriptions of specific cases: pulsar candidate identification, fast radio burst detection, gravitational wave detection, spectral classification, and radio frequency interference mitigation. Furthermore, it discusses possible future applications to provide perspectives for astronomical research in the AI era.
Funding: Funded by the "Strategic Priority Research Program" of CAS (Grant No. XDB38030200) and the Open Biodiversity and Health Big Data Programme of the International Union of Biological Sciences awarded to YB.
Abstract: Background: Big data challenges. In the late 1980s and early 1990s, three major international biological data centers were created: the DNA Data Bank of Japan (DDBJ) [1], the European Bioinformatics Institute (EMBL-EBI) in the United Kingdom (UK) [2], and the National Center for Biotechnology Information (NCBI) in the United States (US) [3].
Abstract: With the popularisation of intelligent power, power devices vary in shape, number, and specification. The resulting distributional variability of power data means the model learning process cannot sufficiently extract data features, which seriously degrades the accuracy and performance of anomaly detection. This paper therefore proposes a deep learning-based anomaly detection model for power data that integrates a data alignment enhancement technique based on random sampling and an adaptive feature fusion method leveraging dimension reduction. To address the distributional variability of power data, a sliding window-based data adjustment method is developed for the model, solving the problems of high-dimensional feature noise and low-dimensional missing data. To address insufficient feature fusion, an adaptive feature fusion method based on feature dimension reduction and dictionary learning is proposed to improve the model's anomaly detection accuracy. To verify the effectiveness of the proposed method, we conducted comparisons through ablation experiments. The experimental results show that, compared with traditional anomaly detection methods, the proposed method not only has an advantage in model accuracy but also reduces the amount of parameter computation during feature matching and improves detection speed.
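The sliding window-based adjustment for missing data and noise can be sketched as below; the window size, the clamp threshold, and the exact imputation rule are illustrative assumptions, not the paper's actual method.

```python
from statistics import mean, pstdev

def window_adjust(series, win=5, k=3.0):
    """Sliding-window data adjustment: fill missing values (None)
    with the trailing-window mean and clamp spikes beyond k standard
    deviations, addressing missing data and feature noise in one pass.
    """
    out = []
    for i, v in enumerate(series):
        lo = max(0, i - win)
        ctx = [x for x in series[lo:i] if x is not None]
        if not ctx:  # no history yet: pass through, zero-fill missing
            out.append(v if v is not None else 0.0)
            continue
        m, s = mean(ctx), pstdev(ctx)
        if v is None:
            out.append(m)  # impute the missing point
        elif s > 0 and abs(v - m) > k * s:
            out.append(m + k * s if v > m else m - k * s)  # clamp spike
        else:
            out.append(v)
    return out
```

Because the statistics are computed per window, the adjustment tracks slow drifts in the data distribution instead of assuming one global mean and variance.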
Funding: Supported by the National Key R&D Program of China (Nos. 2021YFC2203502 and 2022YFF0711502), the National Natural Science Foundation of China (NSFC) (12173077), the Tianshan Talent Project of Xinjiang Uygur Autonomous Region (2022TSYCCX0095 and 2023TSYCCX0112), the Scientific Instrument Developing Project of the Chinese Academy of Sciences (Grant No. PTYQ2022YZZD01), the China National Astronomical Data Center (NADC), the Operation, Maintenance and Upgrading Fund for Astronomical Telescopes and Facility Instruments, budgeted from the Ministry of Finance of China (MOF) and administrated by the Chinese Academy of Sciences (CAS), and the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A360).
Abstract: For real-time processing of ultra-wide-bandwidth low-frequency pulsar baseband data, we designed and implemented an ultra-wide-bandwidth low-frequency pulsar data processing pipeline (UWLPIPE) based on a shared ring buffer and GPU parallel technology. UWLPIPE runs on a GPU cluster and can simultaneously receive multiple 128 MHz dual-polarization VDIF data packets preprocessed by the front-end FPGA. After the dual-polarization data are aligned, the multiple 128 MHz subbands are packaged into PSRDADA baseband data or multi-channel coherent-dedispersion filterbank data, and multiple subband filterbank data can be spliced into wideband data after time alignment. We observed multiple pulsars with the Nanshan 26 m radio telescope using the L-band receiver at 964-1732 MHz. Finally, we processed the data with the DSPSR software; the results show that each subband correctly folds out the pulse profile, and the wideband pulse profile accumulated from multiple subbands is correctly aligned.
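The shared ring buffer between the receiving and processing stages follows a standard write/read discipline, sketched below. The real pipeline uses PSRDADA-style shared memory between processes; this in-process class only illustrates the bookkeeping.

```python
class RingBuffer:
    """Minimal fixed-size ring buffer: a writer deposits data blocks,
    a reader consumes them in order, and the fixed slot count bounds
    memory while decoupling the two stages' speeds.
    """
    def __init__(self, nslots):
        self.slots = [None] * nslots
        self.head = 0   # next slot to write
        self.tail = 0   # next slot to read
        self.count = 0  # blocks currently buffered

    def write(self, block):
        if self.count == len(self.slots):
            raise BufferError("ring full: reader too slow")
        self.slots[self.head] = block
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1

    def read(self):
        if self.count == 0:
            return None  # ring empty
        block = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return block
```

In a real-time system the "ring full" case is the critical design point: the processing stage must, on average, keep up with the data rate, or blocks are dropped.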
Funding: Supported by the National Key R&D Program of China (Nos. 2021YFC2203502 and 2022YFF0711502), the National Natural Science Foundation of China (NSFC) (12173077 and 12003062), the Tianshan Innovation Team Plan of Xinjiang Uygur Autonomous Region (2022D14020), the Tianshan Talent Project of Xinjiang Uygur Autonomous Region (2022TSYCCX0095), the Scientific Instrument Developing Project of the Chinese Academy of Sciences (Grant No. PTYQ2022YZZD01), the China National Astronomical Data Center (NADC), the Operation, Maintenance and Upgrading Fund for Astronomical Telescopes and Facility Instruments, budgeted from the Ministry of Finance of China (MOF) and administrated by the Chinese Academy of Sciences (CAS), and the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A360).
Abstract: To address the problem of real-time processing of ultra-wide-bandwidth pulsar baseband data, we designed and implemented a pulsar baseband data processing algorithm (PSRDP) based on GPU parallel computing technology. PSRDP can perform operations such as baseband data unpacking, channel separation, coherent dedispersion, Stokes detection, phase and folding period prediction, and folding integration on GPU clusters. We tested the algorithm using J0437-4715 baseband data generated by the CASPSR and Medusa backends of Parkes, and J0332+5434 baseband data generated by the self-developed backend of the Nan Shan Radio Telescope, obtaining pulse profiles for each dataset. Experimental analysis shows that the pulse profiles generated by the PSRDP algorithm are essentially consistent with the results of the Digital Signal Processing Software for Pulsar Astronomy (DSPSR), which verifies the effectiveness of PSRDP. Furthermore, on the same baseband data, we compared the processing speed of PSRDP with that of DSPSR, and the results showed that PSRDP is no slower than DSPSR. The theoretical and technical experience gained from the PSRDP research lays a technical foundation for the real-time processing of ultra-wide-bandwidth pulsar baseband data from the QTT (Qi Tai radio Telescope).
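The quantity that coherent dedispersion removes is the cold-plasma dispersion delay between frequencies; the standard relation (with the usual dispersion constant) can be computed as follows. The function itself is our illustrative helper, not part of PSRDP.

```python
def dispersion_delay(dm, f_lo_mhz, f_hi_mhz):
    """Cold-plasma dispersion delay (seconds) between two frequencies:

        delta_t = 4.148808e3 * DM * (f_lo**-2 - f_hi**-2)

    with DM in pc cm^-3 and frequencies in MHz. Coherent dedispersion
    removes this smearing within each channel by convolving the
    baseband voltages with the inverse of the dispersion chirp.
    """
    k_dm = 4.148808e3  # dispersion constant, MHz^2 pc^-1 cm^3 s
    return k_dm * dm * (f_lo_mhz ** -2 - f_hi_mhz ** -2)
```

For example, a DM of 10 pc cm^-3 delays a 1000 MHz signal by about 31 ms relative to 2000 MHz; across an ultra-wide band this delay spans many pulse periods, which is why dedispersion must precede folding.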