Journal Articles
38 articles found
Synthetic data as an investigative tool in hypertension and renal diseases research
1
Authors: Aleena Jamal, Som Singh, Fawad Qureshi. World Journal of Methodology, 2025, Issue 1, pp. 9-13 (5 pages)
There is a growing body of clinical research on the utility of synthetic data derivatives, an emerging research tool in medicine. In nephrology, clinicians can use machine learning and artificial intelligence as powerful aids in their clinical decision-making while also preserving patient privacy. This is especially important given the epidemiology of chronic kidney disease, renal oncology, and hypertension worldwide. However, there remains a need to create a framework for guidance regarding how to better utilize synthetic data as a practical application in this research.
Keywords: synthetic data, artificial intelligence, nephrology, blood pressure, research, editorial
Real-time earthquake magnitude prediction using designed machine learning ensemble trained on real and CTGAN generated synthetic data
2
Authors: Anushka Joshi, Balasubramanian Raman, C. Krishna Mohan. Geodesy and Geodynamics, 2025, Issue 3, pp. 350-368 (19 pages)
The earthquake early warning (EEW) system provides advance notice of potentially damaging ground shaking. In EEW, early estimation of magnitude is crucial for timely rescue operations. A set of thirty-four features is extracted using the primary-wave earthquake precursor signal and site-specific information. In Japan's earthquake magnitude dataset, there is a chance of high imbalance concerning earthquakes above strong impact. This imbalance causes high prediction error while training advanced machine learning or deep learning models. In this work, Conditional Tabular Generative Adversarial Networks (CTGAN), a deep machine learning tool, is utilized to learn the characteristics of the first arrival of earthquake P-waves and generate a synthetic dataset based on this information. The results obtained using actual and mixed (synthetic and actual) datasets are used for training the stacked ensemble magnitude prediction model, MagPred, designed specifically for this study. There are 13,295, 3,989, and 1,710 records designated for training, testing, and validation, respectively. The mean absolute errors on the test dataset for single-station magnitude detection using the early three, four, and five seconds of the P wave are 0.41, 0.40, and 0.38 MJMA. The study demonstrates that Generative Adversarial Networks (GANs) can provide good results for single-station magnitude prediction, and the approach can be effective where less seismic data are available. The study shows that the machine learning method yields better magnitude detection results compared with several regression models. A multi-station magnitude prediction study has been conducted on the prominent Osaka, Off Fukushima, and Kumamoto earthquakes. Furthermore, to validate the performance of the model, an inter-region study has been performed on earthquakes of the India or Nepal region. The study demonstrates that GANs can discover effective magnitude estimation compared with non-GAN-based methods. This has high potential for wide application in earthquake early warning systems.
Keywords: magnitude, synthetic data, machine learning, earthquake, CTGAN
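The core move in the abstract above is to rebalance a magnitude dataset by mixing generator-sampled rows into the rare (large-magnitude) class before training. A minimal runnable sketch of that rebalancing step, using a per-class Gaussian sampler as a lightweight stand-in for CTGAN and entirely made-up feature values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy P-wave feature vectors for two magnitude classes; large-magnitude
# events are deliberately rare, mimicking the imbalance described above.
small = rng.normal([1.0, 0.5], 0.2, size=(950, 2))
large = rng.normal([3.0, 1.8], 0.3, size=(50, 2))

def sample_synthetic(x, n, rng):
    """Draw n synthetic rows from a Gaussian fitted to class x
    (a lightweight stand-in for a trained CTGAN generator)."""
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n)

# Oversample the rare class and build the mixed (real + synthetic) set.
synthetic_large = sample_synthetic(large, 900, rng)
mixed = np.vstack([small, large, synthetic_large])
```

The mixed array now has roughly balanced classes, which is the property the paper exploits when training its MagPred ensemble.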
Generating Synthetic Data for Machine Learning Models from the Pediatric Heart Network Fontan I Dataset
3
Authors: Vatche Bahudian, John Valdovinos. Congenital Heart Disease, 2025, Issue 1, pp. 115-127 (13 pages)
Background: The population of Fontan patients, patients born with a single functioning ventricle, is growing. There is a growing need to develop algorithms for this population that can predict health outcomes. Artificial intelligence models predicting short-term and long-term health outcomes for patients with the Fontan circulation are needed. Generative adversarial networks (GANs) provide a solution for generating realistic and useful synthetic data that can be used to train such models. Methods: Despite their promise, GANs have not been widely adopted in the congenital heart disease research community due, in some part, to a lack of knowledge on how to employ them. In this research study, a GAN was used to generate synthetic data from the Pediatric Heart Network Fontan I dataset. A subset of data consisting of the echocardiographic and BNP measures collected from Fontan patients was used to train the GAN. Two sets of synthetic data were created to understand the effect of data missingness on synthetic data generation. Synthetic data was created from real data in which the missing values were imputed using Multiple Imputation by Chained Equations (MICE) (referred to as synthetic from imputed real samples). In addition, synthetic data was created from real data in which the missing values were dropped (referred to as synthetic from dropped real samples). Both synthetic datasets were evaluated for fidelity by using visual methods which involved comparing histograms and principal component analysis (PCA) plots. Fidelity was measured quantitatively by (1) comparing synthetic and real data using the Kolmogorov-Smirnov test to evaluate the similarity between two distributions and (2) training a neural network to distinguish between real and synthetic samples. Both synthetic datasets were evaluated for utility by training a neural network with synthetic data and testing the neural network on its ability to classify patients that have ventricular dysfunction using echocardiographic measures and serological measures. Results: Using histograms, associated probability density functions, and PCA, both synthetic datasets showed visual resemblance in distribution and variance to real Fontan data. Quantitatively, synthetic data from dropped real samples had higher similarity scores, as demonstrated by the Kolmogorov-Smirnov statistic, for all but one feature (age at Fontan) compared to synthetic data from imputed real samples, which demonstrated dissimilar scores for three features (Echo SV, Echo tda, and BNP). In addition, synthetic data from dropped real samples resembled real data to a larger extent (49.3% classification error) than synthetic data from imputed real samples (65.28% classification error); classification errors approximating 50% represent datasets that are indistinguishable. In terms of utility, synthetic data created from real data in which the missing values were imputed classified ventricular dysfunction in real data with a classification error of 10.99%. Similarly, a neural network trained on synthetic data derived from real data in which the missing values were dropped could classify ventricular dysfunction in real data with a classification error of 9.44%. Conclusions: Although representing a limited subset of the vast data available on the Pediatric Heart Network, generative adversarial networks can create synthetic data that mimics the probability distribution of real Fontan echocardiographic measures. Clinicians can use these synthetic data to create models that predict health outcomes for Fontan patients.
Keywords: synthetic data, congenital heart disease, Fontan circulation
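One of the quantitative fidelity checks described above, the two-sample Kolmogorov-Smirnov test between a real and a synthetic feature column, can be sketched with SciPy. The arrays below are made-up stand-ins, not the Fontan measures:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Stand-ins for one echocardiographic feature: "real" measurements and a
# synthetic column drawn from a similar (here: identical) distribution.
real = rng.normal(55.0, 8.0, size=400)
synthetic = rng.normal(55.0, 8.0, size=400)  # GAN output would go here

# A small KS statistic / large p-value indicates the synthetic column
# tracks the real distribution for this feature.
stat, pvalue = ks_2samp(real, synthetic)
```

In practice this test is run per feature, as the paper does when comparing the "imputed" and "dropped" synthetic datasets.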
Generating Synthetic Data to Reduce Prediction Error of Energy Consumption
4
Authors: Debapriya Hazra, Wafa Shafqat, Yung-Cheol Byun. Computers, Materials & Continua (SCIE, EI), 2022, Issue 2, pp. 3151-3167 (17 pages)
Renewable and nonrenewable energy sources are widely incorporated; solar and wind energy produce electricity without increasing carbon dioxide emissions. Energy industries worldwide are trying hard to predict future energy consumption, which could eliminate over- or under-contracting of energy resources and unnecessary financing. Machine learning techniques for predicting energy are the trending solution to overcome the challenges faced by energy companies. The basic requirement for machine learning algorithms to be trained for accurate prediction is a considerable amount of data. Another critical factor is balancing the data for enhanced prediction. Data augmentation is a technique used for increasing the data available for training. Synthetic data is newly generated data which can be used in training to improve the accuracy of prediction models. In this paper, we propose a model that takes time-series energy consumption data as input, pre-processes the data, and then uses multiple augmentation techniques and generative adversarial networks to generate synthetic data which, when combined with the original data, reduces energy consumption prediction error. We propose TGAN-skip-Improved-WGAN-GP to generate synthetic energy consumption time-series tabular data. We modify TGAN with skip connections, then improve WGAN-GP by defining a consistency term, and finally use the architecture of improved WGAN-GP for training TGAN-skip. We used various evaluation metrics and visual representations to compare the performance of our proposed model. We also measured prediction accuracy along with the mean and maximum error generated while predicting with different variations of augmented and synthetic data combined with the original data. The mode collapse problem could be handled by the TGAN-skip-Improved-WGAN-GP model, and it also converged faster than existing GAN models for synthetic data generation. The experimental results show that our proposed technique of combining synthetic data with original data could significantly reduce the prediction error rate and increase the prediction accuracy of energy consumption.
Keywords: energy consumption, generative adversarial networks, synthetic data, time series data, TGAN, WGAN-GP, TGAN-skip, prediction error, augmentation
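The "multiple augmentation techniques" applied before GAN training can be illustrated with two classic time-series transforms, jittering and magnitude scaling. This is a hedged sketch on an invented load curve; the paper's actual pipeline uses TGAN-skip-Improved-WGAN-GP:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy hourly energy-consumption series (one week) standing in for real
# load data, with a daily sinusoidal cycle plus noise.
t = np.arange(24 * 7)
load = 10 + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.3, t.size)

def jitter(x, sigma, rng):
    """Additive Gaussian noise augmentation."""
    return x + rng.normal(0, sigma, x.shape)

def scale(x, sigma, rng):
    """Multiply the whole series by a random factor near 1."""
    return x * rng.normal(1.0, sigma)

# Stack two augmented variants; real pipelines generate many per series.
augmented = np.stack([jitter(load, 0.2, rng), scale(load, 0.05, rng)])
```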
Robust high frequency seismic bandwidth extension with a deep neural network trained using synthetic data
5
Authors: Paul Zwartjes, Jewoo Yoo. Artificial Intelligence in Geosciences, 2024, Issue 1, pp. 64-81 (18 pages)
Geophysicists interpreting seismic reflection data aim for the highest resolution possible, as this facilitates the interpretation and discrimination of subtle geological features. Various deterministic methods based on Wiener filtering exist to increase the temporal frequency bandwidth and compress the seismic wavelet in a process called spectral shaping. Auto-encoder neural networks with convolutional layers have been applied to this problem with encouraging results, but the problem of generalization to unseen data remains. Most published works have used supervised learning with training data constructed from field seismic data, or from synthetic seismic data generated based on measured well logs or on seismic wavefield modelling. This leads to satisfactory results on datasets similar to the training data but requires re-training of the networks for unseen data with different characteristics. In this work we seek to improve the generalization, not by experimenting with network architecture (we use a conventional U-net with some small modifications), but by adopting a different approach to creating the training data for the supervised learning process. Although the network is important, at this stage of development we see more improvement in prediction results by altering the design of the training data than by architectural changes. The approach we take is to create synthetic training data consisting of simple geometric shapes convolved with a seismic wavelet. We created a very diverse training dataset consisting of 9000 seismic images with between 5 and 300 seismic events resembling seismic reflections, with geophysically motivated perturbations in shape and character. The 2D U-net we have trained can robustly and recursively boost the dominant frequency by 50%. We demonstrate this on unseen field data with different bandwidths and signal-to-noise ratios. Additionally, this 2D U-net can handle non-stationary wavelets and overlapping events of different bandwidth without creating excessive ringing. It is also robust in the presence of noise. The significance of this result is that it simplifies the effort of bandwidth extension and demonstrates the usefulness of auto-encoder neural networks for geophysical data processing.
Keywords: resolution, seismic data, U-net, bandwidth extension, synthetic data
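The training-data recipe described above, simple reflectivity events convolved with a seismic wavelet, is easy to reproduce. The sketch below builds one synthetic input/target pair with a Ricker wavelet; all parameter values are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)

def ricker(f, dt, n):
    """Ricker wavelet with peak frequency f (Hz), sampled at dt seconds."""
    t = (np.arange(n) - n // 2) * dt
    a = (np.pi * f * t) ** 2
    return (1 - 2 * a) * np.exp(-a)

# Sparse reflectivity spikes standing in for the "simple geometric shapes";
# convolving with a wavelet gives a band-limited synthetic trace.
n, dt = 512, 0.002
reflectivity = np.zeros(n)
idx = rng.choice(n, size=20, replace=False)
reflectivity[idx] = rng.uniform(-1, 1, size=20)

# Input: low-bandwidth trace. Target: 50% higher dominant frequency,
# matching the boost the trained U-net is reported to deliver.
trace_low = np.convolve(reflectivity, ricker(25.0, dt, 101), mode="same")
trace_target = np.convolve(reflectivity, ricker(37.5, dt, 101), mode="same")
```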
Size or diversity? Synthetic dataset recommendations for machine learning heating energy prediction models in early design stages for residential buildings
6
Authors: Xinyue Wang, Yinan Yu, Robin Teigland, Alexander Hollberg. Energy and AI, 2025, Issue 3, pp. 480-501 (22 pages)
One promising means to reduce building energy for a more sustainable environment is to conduct early-stage building energy optimization using simulation, yet today's simulation engines are computationally intensive. Recently, machine learning (ML) energy prediction models have shown promise in replacing these simulation engines. However, it is often difficult to develop such ML models due to the lack of proper datasets. Synthetic datasets can provide a solution, but determining the optimal quantity and diversity of synthetic data remains a challenging task. Furthermore, there is a lack of understanding of the compatibility between different ML algorithms and the characteristics of synthetic datasets. To fill these gaps, this study conducted multiple ML experiments using residential buildings in Sweden to determine the best-performing ML algorithm, as well as the characteristics of the corresponding synthetic dataset. A parametric model was developed to generate a wide range of synthetic datasets varying in size and building shape, referred to as diversity. Five ML algorithms selected through a literature review were trained using the different datasets. Results show that the Support Vector Machine performed best overall. Multiple Linear Regression performed well with small, low-diversity datasets, while the Artificial Neural Network performed well with large, high-diversity datasets. We conclude that, when generating synthetic training datasets, developers should focus more on increasing diversity instead of size once the dataset size reaches around 1440. This study offers insights for researchers and practitioners, such as software tool developers, when developing ML building energy prediction models for early-stage optimization.
Keywords: machine learning, early-stage optimization, building energy, synthetic data, training size, data diversity
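The size-versus-accuracy trade-off behind the "~1440 samples" recommendation can be illustrated with a toy learning-curve experiment: fit ordinary least squares to synthetic "building" datasets of growing size and watch the test error plateau. All feature names and coefficients below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def make_dataset(n, rng):
    """Invented parametric 'building' data: 3 features -> heating demand."""
    X = rng.uniform(0, 1, size=(n, 3))  # e.g., area, glazing ratio, shape factor
    y = 50 + 30 * X[:, 0] - 10 * X[:, 1] + 5 * X[:, 2] + rng.normal(0, 1, n)
    return X, y

X_test, y_test = make_dataset(500, rng)

def mae_for_size(n, rng):
    """Fit ordinary least squares on n samples, score MAE on the test set."""
    X, y = make_dataset(n, rng)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = np.column_stack([np.ones(len(X_test)), X_test]) @ coef
    return float(np.abs(pred - y_test).mean())

# Beyond some size the error flattens; added data buys little, which is
# the point at which the paper says diversity matters more than size.
maes = {n: mae_for_size(n, rng) for n in (50, 200, 1440)}
```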
Transfer-learning-based BiLSTM-WGAN Approach for Synthetic Data Generation of Sub-synchronous Oscillations in Wind Farms
7
Authors: Shuang Feng, Zhirui Zhang, Yuhang Zheng, Jiaxing Lei, Yi Tang. Journal of Modern Power Systems and Clean Energy, 2025, Issue 4, pp. 1199-1210 (12 pages)
The phenomenon of sub-synchronous oscillation (SSO) poses significant threats to the stability of power systems. The advent of artificial intelligence (AI) has revolutionized SSO research through data-driven methodologies, which necessitate a substantial collection of data for effective training, a requirement frequently unfulfilled in practical power systems due to limited data availability. To address the critical issue of data scarcity in training AI models, this paper proposes a novel transfer-learning-based (TL-based) Wasserstein generative adversarial network (WGAN) approach for synthetic data generation of SSO in wind farms. To improve the capability of the WGAN to capture the bidirectional temporal features inherent in oscillation data, a bidirectional long short-term memory (BiLSTM) layer is introduced. Additionally, to address the training instability caused by few-shot learning scenarios, the discriminator is augmented with mini-batch discrimination (MBD) layers and gradient penalty (GP) terms. Finally, TL is leveraged to fine-tune the model, effectively bridging the gap between the training data and real-world system data. To evaluate the quality of the synthetic data, two indexes are proposed based on dynamic time warping (DTW) and frequency-domain analysis, followed by a classification task. Case studies demonstrate the effectiveness of the proposed approach in swiftly generating a large volume of synthetic SSO data, thereby significantly mitigating the issue of data scarcity prevalent in SSO research.
Keywords: sub-synchronous oscillation, wind farm, data scarcity, synthetic data generation, artificial intelligence, Wasserstein generative adversarial network, transfer learning
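One of the two proposed quality indexes is based on dynamic time warping. A minimal DTW implementation applied to made-up decaying oscillations (standing in for real and synthetic SSO waveforms, which the paper obtains from wind-farm models) looks like:

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two 1-D sequences,
    computed with the standard O(n*m) dynamic program."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Invented waveforms: a decaying ~12 Hz mode and a slightly phase-shifted
# "synthetic" copy; a small DTW score indicates high similarity.
t = np.linspace(0, 1, 200)
real_osc = np.sin(2 * np.pi * 12 * t) * np.exp(-2 * t)
synth_osc = np.sin(2 * np.pi * 12 * t + 0.1) * np.exp(-2 * t)
score = dtw_distance(real_osc, synth_osc)
```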
A Survey of Synthetic Data Augmentation Methods in Machine Vision [Cited by 1]
8
Authors: Alhassan Mumuni, Fuseini Mumuni, Nana Kobina Gerrar. Machine Intelligence Research (EI, CSCD), 2024, Issue 5, pp. 831-869 (39 pages)
The standard approach to tackling computer vision problems is to train deep convolutional neural network (CNN) models using large-scale image datasets that are representative of the target task. However, in many scenarios, it is often challenging to obtain sufficient image data for the target task. Data augmentation is a way to mitigate this challenge. A common practice is to explicitly transform existing images in desired ways to create the required volume and variability of training data necessary to achieve good generalization performance. In situations where data for the target domain are not accessible, a viable workaround is to synthesize training data from scratch, i.e., synthetic data augmentation. This paper presents an extensive review of synthetic data augmentation techniques. It covers data synthesis approaches based on realistic 3D graphics modelling, neural style transfer (NST), differential neural rendering, and generative modelling using generative adversarial networks (GANs) and variational autoencoders (VAEs). For each of these classes of methods, we focus on the important data generation and augmentation techniques, the general scope of application and specific use-cases, as well as existing limitations and possible workarounds. Additionally, we provide a summary of common synthetic datasets for training computer vision models, highlighting the main features, application domains and supported tasks. Finally, we discuss the effectiveness of synthetic data augmentation methods. Since this is the first paper to explore synthetic data augmentation methods in great detail, we hope to equip readers with the necessary background information and in-depth knowledge of existing methods and their attendant issues.
Keywords: data augmentation, generative modelling, neural rendering, data synthesis, synthetic data, neural style transfer (NST)
Sarve: synthetic data and local differential privacy for private frequency estimation [Cited by 1]
9
Authors: Gatha Varmal, Ritu Chauhan, Dhananjay Singh. Cybersecurity (EI, CSCD), 2022, Issue 4, pp. 97-116 (20 pages)
The collection of user attributes by service providers is a double-edged sword. They are instrumental in driving statistical analysis to train more accurate predictive models like recommenders. The analysis of the collected user data includes frequency estimation for categorical attributes. Nonetheless, the users deserve privacy guarantees against inadvertent identity disclosures. Therefore, algorithms called frequency oracles were developed to randomize or perturb user attributes and estimate the frequencies of their values. We propose Sarve, a frequency oracle that uses Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) and Hadamard Response (HR) for randomization in combination with fake data. The design of a service-oriented architecture must consider two types of complexities, namely computational and communication. The functions of such systems aim to minimize the two complexities, and therefore the choice of privacy-enhancing methods must be a calculated decision. The variant of RAPPOR we used was realized through Bloom filters. A Bloom filter is a memory-efficient data structure that offers a time complexity of O(1). On the other hand, HR has been proven to give the best communication costs, of the order of log(b) for b-bit communication. Therefore, Sarve is a step towards frequency oracles that demonstrates how the privacy provisions of existing methods can be combined with those of fake data to achieve statistical results comparable to the original data. Sarve also implemented an adaptive solution enhanced from the work of Arcolezi et al. The use of RAPPOR was found to provide better privacy-utility trade-offs for specific privacy budgets in both the high and general privacy regimes.
Keywords: synthetic data, differential privacy, frequency estimation, frequency oracle, privacy
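A frequency oracle in the local-DP family can be sketched with k-ary randomized response and its unbiased inversion. This is a generic illustration of the perturb-then-estimate pattern, not Sarve's RAPPOR/HR pipeline:

```python
import numpy as np

rng = np.random.default_rng(5)

# k-ary randomized response: each user keeps their true category with
# probability p and otherwise reports a uniformly random other category.
k, n, eps = 4, 20000, 1.0
p = np.exp(eps) / (np.exp(eps) + k - 1)
q = (1 - p) / (k - 1)

true_vals = rng.integers(0, k, size=n)
flip = rng.random(n) >= p
offset = rng.integers(1, k, size=n)  # uniform over the k-1 other categories
reported = np.where(flip, (true_vals + offset) % k, true_vals)

# Unbiased frequency estimator inverting the known perturbation:
# E[count_v] = n_v * p + (n - n_v) * q, so solve for n_v.
counts = np.bincount(reported, minlength=k)
estimates = (counts - n * q) / (p - q)
```

The estimates recover the true per-category counts in expectation, which is the statistical guarantee frequency oracles trade against the privacy budget eps.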
Transfer learning from synthetic data for open-circuit voltage curve reconstruction and state of health estimation of lithium-ion batteries from partial charging segments [Cited by 1]
10
Authors: Tobias Hofmann, Jacob Hamar, Bastian Mager, Simon Erhard, Jan Philipp Schmidt. Energy and AI (EI), 2024, Issue 3, pp. 80-97 (18 pages)
Data-driven models for battery state estimation require extensive experimental training data, which may not be available or suitable for specific tasks like open-circuit voltage (OCV) reconstruction and subsequent state of health (SOH) estimation. This study addresses this issue by developing a transfer-learning-based OCV reconstruction model using a temporal convolutional long short-term memory (TCN-LSTM) network trained on synthetic data from an automotive nickel cobalt aluminium oxide (NCA) cell generated through a mechanistic model approach. The data consist of voltage curves at constant temperature, C-rates between C/30 and 1C, and a SOH range from 70% to 100%. The model is refined via Bayesian optimization and then applied to four use cases with experimental nickel manganese cobalt oxide (NMC) cell training data that is progressively reduced for the higher use cases. The TL models' performances are compared with models trained solely on experimental data, focusing on different C-rates and voltage windows. The results demonstrate that the OCV reconstruction mean absolute error (MAE) within the average battery electric vehicle (BEV) home charging window (30% to 85% state of charge (SOC)) is less than 22 mV for the first three use cases across all C-rates. The SOH estimated from the reconstructed OCV exhibits a mean absolute percentage error (MAPE) below 2.2% for these cases. The study further investigates the impact of the source domain on TL by incorporating two additional synthetic datasets, a lithium iron phosphate (LFP) cell and an entirely artificial, non-existent cell, showing that merely shifting and scaling the gradient changes in the charging curve suffices to transfer knowledge, even between different cell chemistries. A key limitation with respect to extrapolation capability is identified and evidenced in our fourth use case, where the absence of such comprehensive data hindered the TL process.
Keywords: lithium-ion battery, state of health estimation, transfer learning, OCV curve, partial charging, synthetic data
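The two error metrics quoted above (MAE in millivolts within the 30-85% SOC home-charging window, and MAPE) are straightforward to compute. The sketch below uses an invented OCV curve purely to show the windowed scoring, not the paper's cell data:

```python
import numpy as np

# Toy OCV-vs-SOC curves: a "true" curve and a reconstructed one standing
# in for the TCN-LSTM output; the curve shape is made up.
soc = np.linspace(0.0, 1.0, 101)
ocv_true = 3.0 + 1.2 * soc - 0.2 * np.sin(2 * np.pi * soc)  # volts
ocv_recon = ocv_true + 0.01 * np.sin(5 * np.pi * soc)       # ~10 mV error

# Score only inside the 30-85% SOC window used in the paper
# (bounds nudged slightly to sidestep floating-point edge effects).
window = (soc > 0.299) & (soc < 0.851)
err = ocv_recon[window] - ocv_true[window]
mae_mv = float(np.abs(err).mean() * 1000)                     # millivolts
mape = float(np.abs(err / ocv_true[window]).mean() * 100)     # percent
```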
Synthetic Data Generation and Shuffled Multi-Round Training Based Offline Handwritten Mathematical Expression Recognition
11
Authors: Lan-Fang Dong, Han-Chao Liu, Xin-Ming Zhang. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2022, Issue 6, pp. 1427-1443 (17 pages)
Offline handwritten mathematical expression recognition is a challenging optical character recognition (OCR) task due to various ambiguities of handwritten symbols and complicated two-dimensional structures. Recent work in this area usually constructs deeper and deeper neural networks trained with end-to-end approaches to improve performance. However, the higher the complexity of the network, the more computing resources and time are required. To improve performance without additional computing requirements, we concentrate on the training data and the training strategy in this paper. We propose a data augmentation method which can generate synthetic samples with new LaTeX notations by only using the official training data of CROHME. Moreover, we propose a novel training strategy called Shuffled Multi-Round Training (SMRT) to regularize the model. With the generated data and the shuffled multi-round training strategy, we achieve the state-of-the-art result in expression accuracy, i.e., 59.74% and 61.57% on CROHME 2014 and 2016, respectively, by using attention-based encoder-decoder models for offline handwritten mathematical expression recognition.
Keywords: handwritten mathematical expression recognition, offline, synthetic data generation, training strategy
A Study of Using Synthetic Data for Effective Association Knowledge Learning
12
Authors: Yuchi Liu, Zhongdao Wang, Xiangxin Zhou, Liang Zheng. Machine Intelligence Research (EI, CSCD), 2023, Issue 2, pp. 194-206 (13 pages)
Association, aiming to link bounding boxes of the same identity in a video sequence, is a central component in multi-object tracking (MOT). To train association modules, e.g., parametric networks, real video data are usually used. However, annotating person tracks in consecutive video frames is expensive, and such real data, due to their inflexibility, offer us limited opportunities to evaluate the system performance w.r.t. changing tracking scenarios. In this paper, we study whether 3D synthetic data can replace real-world videos for association training. Specifically, we introduce a large-scale synthetic data engine named MOTX, where the motion characteristics of cameras and objects are manually configured to be similar to those of real-world datasets. We show that, compared with real data, association knowledge obtained from synthetic data can achieve very similar performance on real-world test sets without domain adaptation techniques. Our intriguing observation is credited to two factors. First and foremost, 3D engines can well simulate motion factors such as camera movement, camera view, and object movement, so that the simulated videos can provide association modules with effective motion features. Second, the experimental results show that the appearance domain gap hardly harms the learning of association knowledge. In addition, the strong customization ability of MOTX allows us to quantitatively assess the impact of motion factors on MOT, which brings new insights to the community.
Keywords: multi-object tracking (MOT), data association, synthetic data, motion simulation, association knowledge learning
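A minimal version of the association step these modules learn can be written as IoU-based minimum-cost matching between boxes in consecutive frames. This is a generic sketch of the matching primitive, not MOTX's learned association network; the boxes are invented:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Boxes in frame t and frame t+1 (each slightly moved), as an association
# module would see them; motion like this is what 3D engines simulate well.
prev_boxes = [(0, 0, 10, 20), (30, 5, 40, 25)]
next_boxes = [(31, 6, 41, 26), (1, 0, 11, 20)]

# Hungarian matching on cost = 1 - IoU links each identity across frames.
cost = np.array([[1 - iou(p, q) for q in next_boxes] for p in prev_boxes])
rows, cols = linear_sum_assignment(cost)
matches = list(zip(rows.tolist(), cols.tolist()))
```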
Review on synergizing the Metaverse and AI-driven synthetic data:enhancing virtual realms and activity recognition in computer vision
13
Authors: Megani Rajendran, Chek Tien Tan, Indriyati Atmosukarto, Aik Beng Ng, Simon See. Visual Intelligence, 2024, Issue 1, pp. 315-336 (22 pages)
The Metaverse's emergence is redefining digital interaction, enabling seamless engagement in immersive virtual realms. This trend's integration with AI and virtual reality (VR) is gaining momentum, albeit with challenges in acquiring extensive human action datasets. Real-world activities involve complex, intricate behaviors, making accurate capture and annotation difficult. VR compounds this difficulty by requiring meticulous simulation of natural movements and interactions. As the Metaverse bridges the physical and digital realms, the demand for diverse human action data escalates, requiring innovative solutions to enrich AI and VR capabilities. This need is underscored by state-of-the-art models that excel but are hampered by limited real-world data. The overshadowing of synthetic data's benefits further complicates the issue. This paper systematically examines both real-world and synthetic datasets for activity detection and recognition in computer vision. Introducing Metaverse-enabled advancements, we unveil SynDa's novel streamlined pipeline using photorealistic rendering and AI pose estimation. By fusing real-life video datasets, large-scale synthetic datasets are generated to augment training and mitigate real data scarcity and costs. Our preliminary experiments reveal promising results in terms of mean average precision (mAP): combining real data and synthetic video data generated using this pipeline to train models yields an improvement in mAP (32.35%) compared to the mAP of the same model trained on real data alone (29.95%). This demonstrates the transformative synergy between the Metaverse and AI-driven synthetic data augmentation.
Keywords: synthetic data, virtual reality, datasets, human action, Metaverse
“stppSim”: A Novel Analytical Tool for Creating Synthetic Spatio-Temporal Point Data
14
Authors: Monsuru Adepeju. Open Journal of Modelling and Simulation, 2023, Issue 4, pp. 99-116 (18 pages)
In crime science, understanding the dynamics and interactions between crime events is crucial for comprehending the underlying factors that drive their occurrences. Nonetheless, gaining access to detailed spatiotemporal crime records from law enforcement faces significant challenges due to confidentiality concerns. In response to these challenges, this paper introduces an innovative analytical tool named "stppSim," designed to synthesize fine-grained spatiotemporal point records while safeguarding the privacy of individual locations. By utilizing the open-source R platform, this tool ensures easy accessibility for researchers, facilitating download, re-use, and potential advancements in various research domains beyond crime science.
Keywords: open-source, synthetic data, crime, spatio-temporal patterns, data privacy
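The same idea can be sketched in a few lines of Python (this is not stppSim's R API): synthesize spatio-temporal point records by jittering events around a few persistent origins, which stand in for crime hotspots, and attaching timestamps:

```python
import numpy as np

rng = np.random.default_rng(6)

# A handful of fixed "origins" act as persistent hotspots; each synthetic
# event is assigned an origin, spatially jittered around it, and given a
# time of occurrence. No real locations are involved.
n_events, n_origins = 500, 5
origins = rng.uniform(0, 1, size=(n_origins, 2))   # unit study area
labels = rng.integers(0, n_origins, size=n_events)

xy = origins[labels] + rng.normal(0, 0.02, size=(n_events, 2))  # spatial spread
t = np.sort(rng.uniform(0, 365, size=n_events))                 # day-of-year

events = np.column_stack([xy, t])  # columns: x, y, time
```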
A Study of EM Algorithm as an Imputation Method: A Model-Based Simulation Study with Application to a Synthetic Compositional Data
15
Authors: Yisa Adeniyi Abolade, Yichuan Zhao. Open Journal of Modelling and Simulation, 2024, Issue 2, pp. 33-42 (10 pages)
Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data, or sums to a constant, like 100%. The statistical linear model is the most used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data is present. The linear regression model is a commonly used statistical modeling technique used in various applications to find relationships between variables of interest. When estimating linear regression parameters, which are useful for things like future prediction and partial effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, which can lead to costly and time-consuming data recovery. To address this issue, the expectation-maximization (EM) algorithm has been suggested as a solution for situations involving missing data. The EM algorithm repeatedly finds the best estimates of parameters in statistical models that depend on variables or data that have not been observed. This is called maximum likelihood or maximum a posteriori (MAP) estimation. Using the present estimate as input, the expectation (E) step constructs a log-likelihood function. Finding the parameters that maximize the expected log-likelihood, as determined in the E step, is the job of the maximization (M) step. This study examined how well the EM algorithm performed on a synthetic compositional dataset with missing observations, using both the robust least square and ordinary least square regression techniques. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
Keywords: Compositional data, Linear regression model, Least square method, Robust least square method, Synthetic data, Aitchison distance, Maximum likelihood estimation, Expectation-maximization algorithm, k-Nearest neighbor and mean imputation
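The E/M alternation described in the abstract can be made concrete in the simplest non-trivial setting: a bivariate normal sample with values missing at random in the second column. This is an illustrative stand-in, not the paper's compositional-data setup; the E step imputes each gap with its conditional mean, and the M step re-estimates the moments with a conditional-variance correction:

```python
import numpy as np

def em_impute(X, n_iter=100):
    """EM for a bivariate normal with missing values in column 1 of X."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X[:, 1])
    mu = np.nanmean(X, axis=0)
    cov = np.eye(2)
    for _ in range(n_iter):
        # E step: conditional expectation of the missing coordinate given x0.
        slope = cov[0, 1] / cov[0, 0]
        cond_var = cov[1, 1] - slope * cov[0, 1]
        Xf = X.copy()
        Xf[miss, 1] = mu[1] + slope * (X[miss, 0] - mu[0])
        # M step: re-estimate moments; add back the conditional variance
        # contributed by the imputed entries.
        mu = Xf.mean(axis=0)
        cov = np.cov(Xf.T, bias=True)
        cov[1, 1] += miss.mean() * cond_var
    return mu, cov, Xf

rng = np.random.default_rng(1)
true_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
data = rng.multivariate_normal([0.0, 2.0], true_cov, size=500)
data[rng.random(500) < 0.3, 1] = np.nan          # 30% missing at random
mu_hat, cov_hat, completed = em_impute(data)
```

Unlike mean imputation, the EM fill-in exploits the correlation between columns, which is why the abstract compares the methods via covariance recovery as well as distance.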
The Future of Artificial Intelligence in the Face of Data Scarcity
16
Authors: Hemn Barzan Abdalla, Yulia Kumar, Jose Marchena, Stephany Guzman, Ardalan Awlla, Mehdi Gheisari, Maryam Cheraghy. Computers, Materials & Continua, 2025, Issue 7, pp. 1073-1099 (27 pages)
Dealing with data scarcity is the biggest challenge faced by Artificial Intelligence (AI), and it will be interesting to see how we overcome this obstacle in the future, but for now, “THE SHOW MUST GO ON!!!” As AI spreads and transforms more industries, the lack of data is a significant obstacle to the best methods for teaching machines how real-world processes work. This paper explores the considerable implications of data scarcity for the AI industry, which threatens to restrict its growth and potential, and proposes plausible solutions and perspectives. In addition, this article focuses closely on different ethical considerations: privacy, consent, and non-discrimination principles during AI model development under limited conditions. Besides, innovative technologies are investigated throughout the paper, incorporating transfer learning, few-shot learning, and data augmentation to adapt models so they can be used effectively in low-resource settings. This emphasizes the need for collaborative frameworks and sound methodologies that ensure applicability and fairness, tackling the technical and ethical challenges associated with data scarcity in AI. The article also discusses prospective approaches to dealing with data scarcity, emphasizing the blend of synthetic data and traditional models and the use of advanced machine learning techniques such as transfer learning and few-shot learning. These techniques aim to enhance the flexibility and effectiveness of AI systems across various industries while ensuring sustainable AI technology development amid ongoing data scarcity.
Keywords: Data scarcity, Artificial intelligence, Application of artificial intelligence, Ethical considerations, Artificial general intelligence, Synthetic data
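Of the low-resource techniques the abstract lists, data augmentation is the easiest to show concretely. A hedged sketch of mixup-style augmentation for tabular data (function and parameter names are invented; transfer and few-shot learning require model code and are not shown):

```python
import numpy as np

def mixup(X, y, n_new, alpha=0.2, seed=0):
    """Create n_new synthetic rows as convex combinations of real rows."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    i = rng.integers(0, len(X), n_new)
    j = rng.integers(0, len(X), n_new)
    lam = rng.beta(alpha, alpha, size=n_new)[:, None]   # mixing weights in [0,1]
    X_new = lam * X[i] + (1 - lam) * X[j]
    y_new = (lam * y[i, None] + (1 - lam) * y[j, None]).ravel()
    return X_new, y_new

# Small hypothetical dataset with a linear target.
w_true = np.array([1.0, -2.0, 0.5, 0.0])
X = np.random.default_rng(0).normal(size=(50, 4))
y = X @ w_true
Xa, ya = mixup(X, y, n_new=200)
```

Because each synthetic row is a convex combination of two real rows, the augmented data stays inside the observed range, which is one reason mixup is a popular low-risk remedy for small datasets.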
Resolving co- and early post-seismic slip variations of the 2021 MW 7.4 Madoi earthquake in east Bayan Har block with a block-wide distributed deformation mode from satellite synthetic aperture radar data. Cited: 15
17
Authors: Shuai Wang, Chuang Song, ShanShan Li, Xing Li. Earth and Planetary Physics (CSCD), 2022, Issue 1, pp. 108-122 (15 pages)
On 21 May 2021 (UTC), an MW 7.4 earthquake jolted the east Bayan Har block in the Tibetan Plateau. The earthquake received widespread attention as it is the largest event in the Tibetan Plateau and its surroundings since the 2008 Wenchuan earthquake, and especially as it lies in proximity to the seismic gaps on the east Kunlun fault. Here we use satellite interferometric synthetic aperture radar data and subpixel offset observations along the range directions to characterize the coseismic deformation of the earthquake. Range offset displacements depict clear surface ruptures with a total length of ~170 km, involving two possible activated fault segments in the earthquake. Coseismic modeling results indicate that the earthquake was dominated by left-lateral strike-slip motions of up to 7 m within the top 12 km of the crust. The well-resolved slip variations are characterized by five major slip patches along strike and a 64% shallow slip deficit, suggesting a young seismogenic structure. Spatial-temporal changes of the postseismic deformation are mapped from early 6-day and 24-day InSAR observations, and are well explained by time-dependent afterslip models. Analysis of Global Navigation Satellite System (GNSS) velocity profiles and strain rates suggests that the eastward extrusion of the plateau is diffusely distributed across the east Bayan Har block, but exhibits significant lateral heterogeneities, as evidenced by magnetotelluric observations. The block-wide distributed deformation of the east Bayan Har block, along with the significant co- and post-seismic stress loadings from the Madoi earthquake, implies high seismic risks along regional faults, especially the Tuosuo Lake and Maqên-Maqu segments of the Kunlun fault that are known as seismic gaps.
Keywords: Madoi earthquake, Bayan Har block, Synthetic aperture radar data, Co- and post-seismic slip, Block-wide distributed deformation, Seismic risk
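The time-dependent afterslip models mentioned above are commonly parameterized with a logarithmic decay, u(t) = a·ln(1 + t/τ). A hedged numpy sketch of fitting that generic form to two post-seismic epochs (the displacement numbers are invented for illustration; this is not the paper's actual inversion):

```python
import numpy as np

def afterslip(t, a, tau):
    """Common afterslip parameterization: u(t) = a * ln(1 + t/tau)."""
    return a * np.log1p(t / tau)

# Hypothetical cumulative postseismic displacements (cm) at the two
# observation epochs (days) matching the 6-day and 24-day InSAR maps.
t_obs = np.array([6.0, 24.0])
u_obs = np.array([3.1, 5.6])

# Brute-force least squares over a small (a, tau) grid.
a_grid = np.linspace(0.5, 5.0, 91)
tau_grid = np.linspace(0.5, 10.0, 96)
A, T = np.meshgrid(a_grid, tau_grid)
misfit = sum((afterslip(t, A, T) - u) ** 2 for t, u in zip(t_obs, u_obs))
i = np.unravel_index(np.argmin(misfit), misfit.shape)
a_best, tau_best = float(A[i]), float(T[i])
```

With only two epochs the fit is exactly determined up to grid resolution; real afterslip studies invert many epochs and spatially variable slip, but the decay law is the same.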
Enhancing crop yield prediction in Senegal using advanced machine learning techniques and synthetic data
18
Authors: Mohammad Amin Razavi, A. Pouyan Nejadhashemi, Babak Majidi, Hoda S. Razavi, Josue Kpodo, Rasu Eeswaran, Ignacio Ciampitti, P. V. Vara Prasad. Artificial Intelligence in Agriculture, 2024, Issue 4, pp. 99-114 (16 pages)
In this study, we employ advanced data-driven techniques to investigate the complex relationships between the yields of five major crops and various geographical and spatiotemporal features in Senegal. We analyze how these features influence crop yields by utilizing remotely sensed data. Our methodology incorporates clustering algorithms and correlation matrix analysis to identify significant patterns and dependencies, offering a comprehensive understanding of the factors affecting agricultural productivity in Senegal. To optimize the model's performance and identify the optimal hyperparameters, we implemented a comprehensive grid search across four distinct machine learning regressors: Random Forest, Extreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and Light Gradient-Boosting Machine (LightGBM). Each regressor offers unique functionalities, enhancing our exploration of potential model configurations. The top-performing models were selected based on evaluating multiple performance metrics, ensuring robust and accurate predictive capabilities. The results demonstrated that XGBoost and CatBoost perform better than the other two. We introduce synthetic crop data generated using a Variational Auto Encoder to address the challenges posed by limited agricultural datasets. By achieving high similarity scores with real-world data, our synthetic samples enhance model robustness, mitigate overfitting, and provide a viable solution for small dataset issues in agriculture. Our approach distinguishes itself by creating a flexible model applicable to various crops together. By integrating five crop datasets and generating high-quality synthetic data, we improve model performance, reduce overfitting, and enhance realism. Our findings provide crucial insights into productivity drivers in key cropping systems, enabling robust recommendations and strengthening the decision-making capabilities of policymakers and farmers in data-scarce regions.
Keywords: Crop yield prediction, Variational auto encoder, Pattern recognition on spatiotemporal and physiographical variables, Synthetic tabular data generation, Ensemble learning
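The hyperparameter grid search the authors describe can be illustrated with a compact stand-in: k-fold cross-validation over a penalty grid for closed-form ridge regression, used here instead of XGBoost/CatBoost/LightGBM so the sketch stays numpy-only (the regressor and grid are invented, not the paper's configuration):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def grid_search_cv(X, y, lambdas, k=5):
    """Return the penalty with the lowest mean k-fold validation MSE."""
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for lam in lambdas:
        fold_mse = []
        for f in folds:
            train = np.setdiff1d(np.arange(len(X)), f)
            w = ridge_fit(X[train], y[train], lam)
            fold_mse.append(np.mean((X[f] @ w - y[f]) ** 2))
        scores.append(np.mean(fold_mse))
    best = int(np.argmin(scores))
    return lambdas[best], float(scores[best])

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=120)
best_lam, best_mse = grid_search_cv(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```

The same loop structure scales directly to gradient-boosting regressors: swap `ridge_fit` for the library's fit call and make `lambdas` a grid of tree depths, learning rates, and estimator counts.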
Spatial and temporal classification of synthetic satellite imagery: land cover mapping and accuracy validation. Cited: 3
19
Authors: Yong XU, Bo HUANG. Geo-Spatial Information Science (SCIE, EI), 2014, Issue 1, pp. 1-7 (7 pages)
This study focused on land cover mapping based on synthetic images, especially using the method of spatial and temporal classification as well as the accuracy validation of the results. Our experimental results indicate that the accuracy of land cover maps based on synthetic imagery and actual observation is comparable to that achieved with actual land cover survey data. These findings facilitate land cover mapping with synthetic data in areas where actual observation is missing. Furthermore, in order to improve the quality of the land cover mapping, this research employed the spatial and temporal Markov random field classification approach. Test results show that overall mapping accuracy can be increased by approximately 5% after applying spatial and temporal classification. This finding contributes towards higher-quality land cover mapping of areas with missing data by using spatial and temporal information.
Keywords: Land cover mapping, Synthetic data, Spatial and temporal classification
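The spatial smoothing step of a Markov random field classifier can be sketched with iterated conditional modes (ICM): each pixel's label is chosen to balance its per-class score against agreement with its neighbours. This is a generic 2D illustration with invented toy data, not the paper's spatial-temporal formulation:

```python
import numpy as np

def icm_smooth(unary, beta=0.8, n_iter=5):
    """ICM on a grid MRF: each pixel takes the label maximizing its
    unary score plus beta times the count of 4-neighbours sharing it."""
    labels = unary.argmax(axis=-1)
    H, W, _ = unary.shape
    for _ in range(n_iter):
        for y in range(H):
            for x in range(W):
                score = unary[y, x].copy()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        score[labels[ny, nx]] += beta   # neighbour agreement
                labels[y, x] = int(score.argmax())
    return labels

# Toy two-class scene: left half class 0, right half class 1, noisy scores.
rng = np.random.default_rng(0)
truth = np.zeros((20, 20), dtype=int)
truth[:, 10:] = 1
unary = rng.normal(0.0, 1.0, size=(20, 20, 2))
unary[np.arange(20)[:, None], np.arange(20)[None, :], truth] += 1.5
smoothed = icm_smooth(unary)
```

A temporal extension, as in the paper, simply adds neighbours along the time axis to the same agreement term; the smoothing is what recovers the few-percent accuracy gain over per-pixel classification.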
Storm surge model based on variational data assimilation method. Cited: 1
20
Authors: Shi-li HUANG, Jian XU, De-guan WANG, Dong-yan LU. Water Science and Engineering (EI, CAS), 2010, Issue 2, pp. 166-173 (8 pages)
By combining computation and observation information, the variational data assimilation method has the ability to eliminate errors caused by the uncertainty of parameters in practical forecasting. It was applied to a storm surge model based on unstructured grids with high spatial resolution, meant for improving the forecasting accuracy of the storm surge. By controlling the wind stress drag coefficient, the variation-based model was developed and validated through data assimilation tests in an actual storm surge induced by a typhoon. In the data assimilation tests, the model accurately identified the wind stress drag coefficient and obtained results close to the true state. Then, the actual storm surge induced by Typhoon 0515 was forecast by the developed model, and the results demonstrate its efficiency in practical application.
Keywords: Storm surge, Variational data assimilation, Synthetic data, Unstructured grid
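The control-variable idea, tuning the wind stress drag coefficient Cd until the model tracks the observations, can be sketched with a toy quadratic wind-stress surge relation and gradient descent on the misfit cost. The model form, scale constant, and numbers are invented for illustration; the paper's unstructured-grid scheme and adjoint are not reproduced:

```python
import numpy as np

K = 0.002  # toy wind-stress scale constant (invented)

def surge(cd, wind):
    """Toy surge model: setup proportional to Cd * wind^2."""
    return K * cd * wind ** 2

def assimilate_cd(wind, obs, cd0=1.0, lr=0.02, n_iter=500):
    """Gradient descent on the misfit cost J(Cd) = 0.5 * sum((model - obs)^2)."""
    cd = cd0
    for _ in range(n_iter):
        resid = surge(cd, wind) - obs
        cd -= lr * np.sum(resid * K * wind ** 2)   # dJ/dCd
    return float(cd)

rng = np.random.default_rng(0)
wind = rng.uniform(5.0, 25.0, size=40)              # wind speeds (m/s)
obs = surge(2.5, wind) + rng.normal(0.0, 0.01, 40)  # synthetic "truth", Cd = 2.5
cd_hat = assimilate_cd(wind, obs)
```

This mirrors the paper's synthetic ("identical twin") test: observations are generated with a known drag coefficient, and the variational loop is judged by whether it recovers that value before being trusted on real typhoon data.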