Dealing with data scarcity is the biggest challenge faced by Artificial Intelligence (AI), and it will be interesting to see how we overcome this obstacle in the future, but for now, "THE SHOW MUST GO ON!!!" As AI spreads and transforms more industries, the lack of data is a significant obstacle to the best methods for teaching machines how real-world processes work. This paper explores the considerable implications of data scarcity, which threatens to restrict the AI industry's growth and potential, and proposes plausible solutions and perspectives. In addition, this article focuses on the ethical considerations of privacy, consent, and non-discrimination during AI model development under limited-data conditions. The paper also investigates innovative techniques, including transfer learning, few-shot learning, and data augmentation, that adapt models for effective use in low-resource settings. It thus emphasizes the need for collaborative frameworks and sound methodologies that ensure applicability and fairness, tackling the technical and ethical challenges associated with data scarcity in AI. The article further discusses prospective approaches to dealing with data scarcity, emphasizing the blend of synthetic data with traditional models and the use of advanced machine learning techniques such as transfer learning and few-shot learning. These techniques aim to enhance the flexibility and effectiveness of AI systems across various industries while ensuring sustainable AI technology development amid ongoing data scarcity.
Funding: supported by the Internal Research Support Program (IRSPG202202).
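The abstract names transfer learning as a core remedy for data scarcity. Purely as a hedged illustration (none of this code comes from the paper), a common low-resource pattern is to reuse a backbone pretrained on a data-rich task and retrain only a small task-specific head:

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch of transfer learning under data scarcity: freeze a pretrained
# backbone and retrain only a small task head on the scarce target data.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False  # freeze the data-hungry feature extractor

# Replace the classification head; the new layer's weights stay trainable.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
# ...train only the head on the few labeled target-domain samples...
```

Few-shot learning and data augmentation follow the same spirit: squeeze more signal out of the few labeled samples that are available.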
Artificial intelligence (AI) has become an increasingly important propellant for energy materials and energy chemistry research, for example in accelerating the discovery of advanced energy materials [1], analyzing vast amounts of data from both experiments and computations [2], optimizing materials-synthesis processes, managing and monitoring energy storage devices such as lithium batteries, and forecasting grid loads with optimized algorithms. Looking back at recent pioneering works in AI-driven energy chemistry research, constructing a dataset of both large quantity and high quality is almost always the first step, and it largely determines the subsequent success of training AI models and resolving the corresponding scientific questions.
Funding: supported by the National Key Research and Development Program of China (2021YFB2500300), the National Natural Science Foundation of China (T2322015, 92472101, 22393903, 22393900, 52394170), and the Beijing Municipal Natural Science Foundation (L247015, L233004).
The widespread usage of rechargeable batteries in portable devices, electric vehicles, and energy storage systems has underscored the importance of accurately predicting their lifetimes. However, data scarcity often limits the accuracy of prediction models, a problem exacerbated by data incompleteness caused by issues such as sensor failures. To address these challenges, we propose a novel approach that accommodates data insufficiency by extracting additional information from incomplete data samples, which are usually discarded in existing studies. To fully unleash the predictive power of incomplete data, we investigate the Multiple Imputation by Chained Equations (MICE) method, which diversifies the training data by exploring potential data patterns. The experimental results demonstrate that the proposed method significantly outperforms the baselines in most of the considered scenarios, reducing the prediction root mean square error (RMSE) by up to 18.9%. Furthermore, we observe that incorporating incomplete data benefits the explainability of the prediction model by facilitating feature selection.
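MICE itself is a general-purpose technique. As a rough sketch of the idea (not the paper's implementation; the battery features and missingness pattern below are invented), scikit-learn's IterativeImputer provides a chained-equations imputer whose posterior sampling yields multiple plausible completions of a gappy dataset:

```python
# Minimal MICE-style imputation sketch with hypothetical battery features.
# IterativeImputer models each feature with missing values as a function of
# the others and cycles through them, mirroring chained equations.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
# Hypothetical per-cycle features: capacity, internal resistance, temperature.
X = rng.normal(loc=[1.0, 0.05, 25.0], scale=[0.02, 0.005, 2.0], size=(200, 3))
# Simulate sensor failures by masking 15% of entries at random.
mask = rng.random(X.shape) < 0.15
X_incomplete = X.copy()
X_incomplete[mask] = np.nan

# sample_posterior=True draws from the predictive distribution, so repeated
# runs yield multiple plausible imputations that diversify the training data.
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X_incomplete)
    for seed in range(5)
]
print(imputed_sets[0].shape)  # (200, 3)
```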
Landslide susceptibility evaluation plays an important role in disaster prevention and reduction. Feature-based transfer learning (TL) is an effective method for solving landslide susceptibility mapping (LSM) in target regions with no available samples. However, as the study area expands, the distribution of landslide types and triggering mechanisms becomes more diverse, leading to performance degradation in models relying on landslide evaluation knowledge from a single source domain due to domain feature shift. To address this, this study proposes a Multi-source Domain Adaptation Convolutional Neural Network (MDACNN), which combines the landslide prediction knowledge learned from two source domains to perform cross-regional LSM in complex large-scale areas. The method is validated through case studies in three regions located in southeastern coastal China and compared with single-source domain TL models (TCA-based models). The results demonstrate that MDACNN effectively integrates transfer knowledge from multiple source domains to learn diverse landslide-triggering mechanisms, thereby significantly reducing the prediction bias inherent to single-source domain TL models, achieving an average improvement of 16.58% across all metrics. Moreover, the landslide susceptibility maps generated by MDACNN accurately quantify the spatial distribution of landslide risks in the target area, providing a powerful scientific and technological tool for landslide disaster management and prevention.
Funding: supported by the National Natural Science Foundation of China (Grant Nos. 42301002 and 52109118), the Fujian Provincial Water Resources Science and Technology Project (Grant No. MSK202524), and the Guidance Fund for Science and Technology Program, Fujian Province (Grant No. 2024Y0002).
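The abstract does not detail MDACNN's architecture. As a loose sketch of the multi-source idea only, under our own assumptions (one CNN branch per source domain over stacked conditioning-factor rasters, predictions averaged for the target region):

```python
import torch
import torch.nn as nn

class MultiSourceLSM(nn.Module):
    """Loose sketch, not the paper's MDACNN: one small CNN branch per source
    domain; target-region susceptibility combines the branch outputs."""
    def __init__(self, in_channels: int = 8, n_sources: int = 2):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, 1),
            )
        self.branches = nn.ModuleList([branch() for _ in range(n_sources)])

    def forward(self, x):
        # Average the susceptibility logits learned from each source domain.
        logits = torch.stack([b(x) for b in self.branches], dim=0).mean(dim=0)
        return torch.sigmoid(logits)

patch = torch.randn(4, 8, 16, 16)  # 4 patches, 8 factor layers, 16x16 cells
print(MultiSourceLSM()(patch).shape)  # torch.Size([4, 1])
```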
The imbalance in global streamflow gauge distribution and regional data scarcity, especially in large transboundary basins, challenge regional water resource management. Effectively utilizing these limited data to construct reliable models is of crucial practical importance. This study employs a transfer learning (TL) framework to simulate daily streamflow in the Dulong-Irrawaddy River Basin (DIRB), a less-studied transboundary basin shared by Myanmar, China, and India. Our results show that TL significantly improves streamflow predictions: the optimal TL model achieves an average Nash-Sutcliffe efficiency (NSE) of 0.872, showing a marked improvement in the Hkamti sub-basin. Despite data scarcity, TL achieves a mean NSE of 0.817, surpassing the 0.655 of the process-based model MIKE SHE. Additionally, our study reveals the importance of source model selection in TL, as different parts of the flow are affected by the diversity and similarity of data in the source model. Deep learning models, particularly TL, exhibit complex sensitivities to meteorological inputs, capturing non-linear relationships among multiple variables more accurately than the process-based model. Integrated gradients (IG) analysis further illustrates TL's ability to capture spatial heterogeneity in upstream and downstream sub-basins and its adeptness in characterizing different flow regimes. This study underscores the potential of TL in enhancing the understanding of hydrological processes in large-scale catchments and highlights its value for water resource management in transboundary basins under data scarcity.
Funding: supported by the National Key Research and Development Program of China (Nos. 2022YFF1302405 and 2016YFA0601601), the National Natural Science Foundation of China (No. 42201040), and the China Postdoctoral Science Foundation (No. 2023M733006).
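For readers unfamiliar with the headline metric: the Nash-Sutcliffe efficiency scores a simulation against the observed mean as a baseline. A minimal sketch with toy numbers (not the study's data):

```python
import numpy as np

def nse(observed: np.ndarray, simulated: np.ndarray) -> float:
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; values <= 0 mean the
    model is no better than predicting the observed mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((simulated - observed) ** 2) / np.sum(
        (observed - observed.mean()) ** 2
    )

# Toy daily streamflow values (m^3/s), purely illustrative.
obs = np.array([120.0, 150.0, 180.0, 90.0, 60.0])
sim = np.array([110.0, 160.0, 170.0, 95.0, 70.0])
print(round(nse(obs, sim), 3))
```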
Intelligent condition monitoring of wind turbines is essential for reducing downtimes. Machine learning models trained on wind turbine operation data are commonly used to detect anomalies and, eventually, operation faults. However, data-driven normal behavior models (NBMs) require a substantial amount of training data, as NBMs trained with scarce data may result in unreliable fault detection. To overcome this limitation, we present a novel generative deep transfer learning approach that makes SCADA samples from one wind turbine lacking training data resemble SCADA data from wind turbines with representative training data. Through CycleGAN-based domain mapping, our method enables the application of an NBM trained on an existing wind turbine to a new one with severely limited data. We demonstrate our approach on field data, mapping SCADA samples across seven substantially different wind turbines (WTs). Our findings show significantly improved fault detection in wind turbines with scarce data. Our method achieves anomaly scores most similar to those of an NBM trained with abundant data, outperforming NBMs trained on scarce training data with improvements of +10.3% in F1-score when one month of training data is available and +16.8% when two weeks are available. The domain mapping approach outperforms conventional fine-tuning at all considered degrees of data scarcity, ranging from one to eight weeks of training data. The proposed technique enables earlier and more reliable fault detection in newly installed wind farms, demonstrating a novel and promising research direction for improving anomaly detection when faced with training data scarcity.
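The core of CycleGAN-based domain mapping is the cycle-consistency constraint. A minimal sketch of that one loss term (the placeholder generators and the 10-channel SCADA feature size are our assumptions, not the paper's networks):

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_loss(g_s2r, g_r2s, x_scarce, x_rich, weight=10.0):
    """CycleGAN cycle-consistency: mapping a SCADA sample to the other
    turbine's domain and back should reconstruct the original sample."""
    forward = l1(g_r2s(g_s2r(x_scarce)), x_scarce)
    backward = l1(g_s2r(g_r2s(x_rich)), x_rich)
    return weight * (forward + backward)

# Placeholder generators over 10-channel SCADA feature vectors.
g_s2r = nn.Linear(10, 10)  # scarce-data turbine -> data-rich turbine
g_r2s = nn.Linear(10, 10)  # data-rich turbine -> scarce-data turbine
print(cycle_loss(g_s2r, g_r2s, torch.randn(32, 10), torch.randn(32, 10)).item())
```

In the full method this term is trained jointly with adversarial losses, so mapped samples both look like the data-rich turbine's domain and remain invertible back to the original turbine.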
The phenomenon of sub-synchronous oscillation (SSO) poses significant threats to the stability of power systems. The advent of artificial intelligence (AI) has revolutionized SSO research through data-driven methodologies, which require a substantial collection of data for effective training, a requirement frequently unfulfilled in practical power systems due to limited data availability. To address the critical issue of data scarcity in training AI models, this paper proposes a novel transfer-learning-based (TL-based) Wasserstein generative adversarial network (WGAN) approach for generating synthetic SSO data for wind farms. To improve the capability of the WGAN to capture the bidirectional temporal features inherent in oscillation data, a bidirectional long short-term memory (BiLSTM) layer is introduced. Additionally, to address the training instability caused by few-shot learning scenarios, the discriminator is augmented with mini-batch discrimination (MBD) layers and gradient penalty (GP) terms. Finally, TL is leveraged to fine-tune the model, effectively bridging the gap between the training data and real-world system data. To evaluate the quality of the synthetic data, two indices are proposed based on dynamic time warping (DTW) and frequency-domain analysis, followed by a classification task. Case studies demonstrate the effectiveness of the proposed approach in swiftly generating a large volume of synthetic SSO data, thereby significantly mitigating the data scarcity prevalent in SSO research.
Funding: supported by the National Natural Science Foundation of China (No. 52377084) and the Zhishan Young Scholar Program of Southeast University, China (No. 2242024RCB0019).
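The gradient-penalty term is standard WGAN-GP machinery rather than something specific to this paper; a minimal PyTorch sketch of that one component (the paper's BiLSTM generator and MBD-augmented discriminator are not reproduced here):

```python
import torch

def gradient_penalty(discriminator, real, fake, device="cpu"):
    """WGAN-GP term: penalizes deviation of the critic's gradient norm
    from 1 along interpolates between real and synthetic samples."""
    batch = real.size(0)
    # One interpolation coefficient per sample, broadcast over time/features.
    alpha = torch.rand(batch, *([1] * (real.dim() - 1)), device=device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    grads = grads.reshape(batch, -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```

The penalty keeps the critic approximately 1-Lipschitz, which is what stabilizes Wasserstein-GAN training in the few-shot regimes the abstract describes.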
Building consumption data is integral to numerous applications, including retrofit analysis, Smart Grid integration and optimization, and load forecasting. Still, due to technical limitations, privacy concerns, and the proprietary nature of the industry, usable data is often unavailable for research and development. Generative adversarial networks (GANs), which generate synthetic instances that resemble those from an original training dataset, have been proposed to help address this issue. Previous studies use GANs to generate building sequence data, but the models are not typically designed for time series problems, they often require relatively large amounts of input data (at least 20,000 sequences), and it is unclear whether they correctly capture the temporal behaviour of the buildings. In this work we implement a conditional temporal GAN that addresses these issues, and we show that it exhibits state-of-the-art performance on small datasets. 22 different experiments that vary according to their data inputs are benchmarked using Jensen-Shannon divergence (JSD) and predictive forecasting validation error. Of these, the best performing is also evaluated using a curated set of metrics that extends those of previous work to include PCA, deep-learning-based forecasting, and measurements of trend and seasonality. Two case studies are included: one for residential and one for commercial buildings. The model achieves a JSD of 0.012 on the former data and 0.037 on the latter, using only 396 and 156 original load sequences, respectively.
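As a pointer for the benchmark metric, the Jensen-Shannon divergence between an original and a synthetic load distribution can be computed along these lines (the binning scheme below is our choice, not the paper's):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd(real_loads: np.ndarray, synth_loads: np.ndarray, bins: int = 50) -> float:
    """Jensen-Shannon divergence between two empirical load distributions.
    scipy returns the JS *distance*; squaring it gives the divergence."""
    lo = min(real_loads.min(), synth_loads.min())
    hi = max(real_loads.max(), synth_loads.max())
    p, _ = np.histogram(real_loads, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(synth_loads, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(p, q, base=2) ** 2

rng = np.random.default_rng(1)
print(round(jsd(rng.normal(5, 1, 1000), rng.normal(5.1, 1.1, 1000)), 4))
```

Lower is better; identical distributions give 0, which makes the reported 0.012 and 0.037 tight fits given only hundreds of training sequences.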
Recent studies indicate dwindling groundwater quantity and quality in the largest regional aquifer system in North West India, raising concern over freshwater availability for the roughly 182 million people residing in this region. Widespread agricultural activities have resulted in severe groundwater pollution in this area, demanding a systematic vulnerability assessment for proactive measures. Conventional vulnerability assessment models suffer from subjectivity, complexity, data prerequisites, and spatial-temporal constraints. This study incorporates isotopic information into a weighted-overlay framework to overcome these limitations and proposes a novel vulnerability assessment model. The isotope methodology provides crucial insights into groundwater recharge mechanisms (18O and 2H) and dynamics (3H), which are often ignored in vulnerability assessment. Isotopic characterisation of precipitation helped in establishing the Local Meteoric Water Line (LMWL) as well as in inferring the contrasting recharge mechanisms operating in different aquifers. The shallow aquifer (depth < 60 m) showed a significant evaporative signature, with evaporation loss accounting for up to 18.04% based on Rayleigh distillation equations. Inter-aquifer connections were apparent from Kernel Density Estimates (KDE) and isotope correlations. A weighted-overlay isotope-geospatial model was developed combining 18O, 3H, aquifer permeability, and water level data. The central and northern parts of the study area fall under the least (0.29%) and extremely (1.79%) vulnerable zones, respectively, while the majority of the study area falls under the moderately (42.71%) and highly (55.20%) vulnerable zones. Model validation was performed using groundwater NO3- concentration, which showed an overall accuracy of up to 82%. Monte Carlo Simulation (MCS) was performed for sensitivity analysis, and permeability was found to be the most sensitive input parameter, followed by 3H, 18O, and water level. Comparing the vulnerability map with Land Use Land Cover (LULC) and population density maps helped in precisely identifying the high-risk sites warranting prompt attention. The model developed in this study integrates isotopic information with vulnerability assessment and yields output with good accuracy, a sound scientific basis, and widespread relevance, highlighting its crucial role in formulating proactive water resource management plans, especially in less-explored, data-scarce locations.
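The Rayleigh relation behind the evaporation-loss estimate is, in its standard linearized form, delta = delta0 + epsilon * ln(f), where f is the residual water fraction and epsilon the isotopic enrichment factor. A hedged numeric sketch, with illustrative 18O values of our own rather than the study's data:

```python
import math

def rayleigh_evaporation_fraction(delta0: float, delta: float, epsilon: float) -> float:
    """Invert the Rayleigh relation delta = delta0 + epsilon * ln(f) for the
    residual water fraction f; evaporation loss is then 1 - f."""
    f = math.exp((delta - delta0) / epsilon)
    return 1.0 - f

# Illustrative 18O values (permil): initial recharge water, an evaporatively
# enriched sample, and an assumed effective enrichment factor epsilon.
loss = rayleigh_evaporation_fraction(delta0=-8.0, delta=-6.0, epsilon=-10.0)
print(f"evaporation loss ~ {loss:.1%}")  # ~18.1% for these assumed inputs
```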