There is a growing body of clinical research on the utility of synthetic data derivatives, an emerging research tool in medicine. In nephrology, clinicians can use machine learning and artificial intelligence as powerful aids in their clinical decision-making while also preserving patient privacy. This is especially important given the epidemiology of chronic kidney disease, renal oncology, and hypertension worldwide. However, there remains a need to create a framework for guidance regarding how to better utilize synthetic data as a practical application in this research.
The earthquake early warning (EEW) system provides advance notice of potentially damaging ground shaking. In EEW, early estimation of magnitude is crucial for timely rescue operations. A set of thirty-four features is extracted using the primary-wave earthquake precursor signal and site-specific information. In Japan's earthquake magnitude dataset, there is a chance of a high imbalance concerning earthquakes above strong impact. This imbalance causes a high prediction error while training advanced machine learning or deep learning models. In this work, Conditional Tabular Generative Adversarial Networks (CTGAN), a deep machine learning tool, is utilized to learn the characteristics of the first arrival of earthquake P-waves and generate a synthetic dataset based on this information. The results obtained using actual and mixed (synthetic and actual) datasets are used for training the stacked ensemble magnitude prediction model, MagPred, designed specifically for this study. There are 13,295, 3,989, and 1,710 records designated for training, testing, and validation. The mean absolute errors of the test dataset for single-station magnitude detection using the early three, four, and five seconds of the P wave are 0.41, 0.40, and 0.38 MJMA. The study demonstrates that Generative Adversarial Networks (GANs) can provide good results for single-station magnitude prediction. The study can be effective where less seismic data are available. The study shows that the machine learning method yields better magnitude detection results compared with several regression models. The multi-station magnitude prediction study has been conducted on the prominent Osaka, Off-Fukushima, and Kumamoto earthquakes. Furthermore, to validate the performance of the model, an inter-region study has been performed on earthquakes of the India and Nepal regions. The study demonstrates that GANs can discover effective magnitude estimation compared with non-GAN-based methods. This has high potential for wide application in earthquake early warning systems.
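A minimal sketch of the augmentation idea described above, assuming the open-source `ctgan` package (sdv-dev); the file name, column names, and magnitude threshold are illustrative placeholders, and the MagPred stacked ensemble itself is not reproduced:

```python
import pandas as pd
from ctgan import CTGAN  # open-source CTGAN implementation

# Hypothetical table: 34 P-wave/site features plus a "magnitude" column.
real = pd.read_csv("pwave_features.csv")
rare = real[real["magnitude"] >= 6.0]        # under-represented strong events

model = CTGAN(epochs=300)
model.fit(rare)                              # learn the joint distribution of the rare class
synthetic = model.sample(5000)               # draw synthetic strong-event records

# The mixed set would then replace the purely real set when training the
# downstream magnitude regressor.
mixed = pd.concat([real, synthetic], ignore_index=True)
```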
Background: The population of Fontan patients, patients born with a single functioning ventricle, is growing. There is a growing need to develop algorithms for this population that can predict health outcomes. Artificial intelligence models predicting short-term and long-term health outcomes for patients with the Fontan circulation are needed. Generative adversarial networks (GANs) provide a solution for generating realistic and useful synthetic data that can be used to train such models. Methods: Despite their promise, GANs have not been widely adopted in the congenital heart disease research community due, in some part, to a lack of knowledge on how to employ them. In this research study, a GAN was used to generate synthetic data from the Pediatric Heart Network Fontan I dataset. A subset of data consisting of the echocardiographic and BNP measures collected from Fontan patients was used to train the GAN. Two sets of synthetic data were created to understand the effect of data missingness on synthetic data generation. Synthetic data were created from real data in which the missing values were imputed using Multiple Imputation by Chained Equations (MICE) (referred to as synthetic from imputed real samples). In addition, synthetic data were created from real data in which the missing values were dropped (referred to as synthetic from dropped real samples). Both synthetic datasets were evaluated for fidelity using visual methods, which involved comparing histograms and principal component analysis (PCA) plots. Fidelity was measured quantitatively by (1) comparing synthetic and real data using the Kolmogorov-Smirnov test to evaluate the similarity between two distributions and (2) training a neural network to distinguish between real and synthetic samples. Both synthetic datasets were evaluated for utility by training a neural network with synthetic data and testing the neural network on its ability to classify patients that have ventricular dysfunction using echocardiographic measures and serological measures. Results: Using histograms, associated probability density functions, and PCA, both synthetic datasets showed visual resemblance in distribution and variance to real Fontan data. Quantitatively, synthetic data from dropped real samples had higher similarity scores, as demonstrated by the Kolmogorov-Smirnov statistic, for all but one feature (age at Fontan) compared to synthetic data from imputed real samples, which demonstrated dissimilar scores for three features (Echo SV, Echo tda, and BNP). In addition, synthetic data from dropped real samples resembled real data to a larger extent (49.3% classification error) than synthetic data from imputed real samples (65.28% classification error). Classification errors approximating 50% represent datasets that are indistinguishable. In terms of utility, synthetic data created from real data in which the missing values were imputed classified ventricular dysfunction in real data with a classification error of 10.99%. Similarly, the utility of the generated synthetic data was demonstrated by showing that a neural network trained on synthetic data derived from real data in which the missing values were dropped could classify ventricular dysfunction in real data with a classification error of 9.44%. Conclusions: Although representing a limited subset of the vast data available on the Pediatric Heart Network, generative adversarial networks can create synthetic data that mimics the probability distribution of real Fontan echocardiographic measures. Clinicians can use these synthetic data to create models that predict health outcomes for Fontan patients.
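The two quantitative fidelity checks described above can be approximated with a short script: a per-feature Kolmogorov-Smirnov comparison and a real-versus-synthetic classifier whose error near 50% indicates indistinguishable datasets. This is a hedged sketch; the input arrays and network size are assumptions, not the study's exact configuration.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def fidelity_report(real: np.ndarray, synthetic: np.ndarray, feature_names):
    # (1) Kolmogorov-Smirnov statistic per feature (smaller = more similar).
    for j, name in enumerate(feature_names):
        stat, p = ks_2samp(real[:, j], synthetic[:, j])
        print(f"{name}: KS={stat:.3f} (p={p:.3f})")

    # (2) Train a classifier to tell real from synthetic; an error close to
    # 50% means the two sets are effectively indistinguishable.
    X = np.vstack([real, synthetic])
    y = np.r_[np.zeros(len(real)), np.ones(len(synthetic))]
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0).fit(Xtr, ytr)
    error = 1.0 - clf.score(Xte, yte)
    print(f"real-vs-synthetic classification error: {error:.1%}")
```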
Geophysicists interpreting seismic reflection data aim for the highest resolution possible, as this facilitates the interpretation and discrimination of subtle geological features. Various deterministic methods based on Wiener filtering exist to increase the temporal frequency bandwidth and compress the seismic wavelet in a process called spectral shaping. Auto-encoder neural networks with convolutional layers have been applied to this problem, with encouraging results, but the problem of generalization to unseen data remains. Most published works have used supervised learning with training data constructed from field seismic data or from synthetic seismic data generated based on measured well logs or on seismic wavefield modelling. This leads to satisfactory results on datasets similar to the training data but requires re-training of the networks for unseen data with different characteristics. In this work we seek to improve the generalization, not by experimenting with the network architecture (we use a conventional U-net with some small modifications), but by adopting a different approach to creating the training data for the supervised learning process. Although the network is important, at this stage of development we see more improvement in prediction results by altering the design of the training data than by architectural changes. The approach we take is to create synthetic training data consisting of simple geometric shapes convolved with a seismic wavelet. We created a very diverse training dataset consisting of 9000 seismic images with between 5 and 300 seismic events resembling seismic reflections, with geophysically motivated perturbations in shape and character. The 2D U-net we have trained can robustly and recursively boost the dominant frequency by 50%. We demonstrate this on unseen field data with different bandwidths and signal-to-noise ratios. Additionally, this 2D U-net can handle non-stationary wavelets and overlapping events of different bandwidth without creating excessive ringing. It is also robust in the presence of noise. The significance of this result is that it simplifies the effort of bandwidth extension and demonstrates the usefulness of auto-encoder neural networks for geophysical data processing.
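A minimal sketch of the training-data construction described above: reflectivity images built from simple dipping linear events and convolved trace-by-trace with a Ricker wavelet. The event counts, dips, and dominant frequency below are illustrative, not the paper's exact recipe.

```python
import numpy as np

def ricker(f, dt=0.002, length=0.128):
    # Zero-phase Ricker wavelet with dominant frequency f (Hz).
    t = np.arange(-length / 2, length / 2, dt)
    return (1.0 - 2.0 * (np.pi * f * t) ** 2) * np.exp(-((np.pi * f * t) ** 2))

def synthetic_image(n_traces=128, n_samples=256, n_events=20, f_dom=30.0, rng=None):
    rng = rng or np.random.default_rng()
    refl = np.zeros((n_samples, n_traces))
    for _ in range(n_events):
        t0 = rng.integers(10, n_samples - 10)   # intercept time (samples)
        dip = rng.uniform(-0.5, 0.5)            # dip in samples per trace
        amp = rng.uniform(-1.0, 1.0)
        for x in range(n_traces):
            t = int(round(t0 + dip * x))
            if 0 <= t < n_samples:
                refl[t, x] += amp
    w = ricker(f_dom)
    # Convolve each trace (column) with the wavelet to form the seismic image.
    return np.apply_along_axis(lambda tr: np.convolve(tr, w, mode="same"), 0, refl)

image = synthetic_image()   # one (n_samples x n_traces) training image
```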
Renewable and nonrenewable energy sources are widely incorporated, with renewable sources such as solar and wind producing electricity without increasing carbon dioxide emissions. Energy industries worldwide are trying hard to predict future energy consumption in order to eliminate over- or under-contracting of energy resources and unnecessary financing. Machine learning techniques for predicting energy are the trending solution to overcome the challenges faced by energy companies. Training machine learning algorithms for accurate prediction requires a considerable amount of data. Another critical factor is balancing the data for enhanced prediction. Data augmentation is a technique used for increasing the data available for training. Synthetic data generation produces new data that can be used in training to improve the accuracy of prediction models. In this paper, we propose a model that takes time-series energy consumption data as input, pre-processes the data, and then uses multiple augmentation techniques and generative adversarial networks to generate synthetic data which, when combined with the original data, reduces energy consumption prediction error. We propose TGAN-skip-Improved-WGAN-GP to generate synthetic energy consumption time-series tabular data. We modify TGAN with skip connections, then improve WGAN-GP by defining a consistency term, and finally use the architecture of improved WGAN-GP for training TGAN-skip. We used various evaluation metrics and visual representations to compare the performance of our proposed model. We also measured prediction accuracy along with the mean and maximum error generated while predicting with different variations of augmented and synthetic data combined with the original data. The mode collapse problem could be handled by the TGAN-skip-Improved-WGAN-GP model, and it also converged faster than existing GAN models for synthetic data generation. The experimental results show that our proposed technique of combining synthetic data with original data can significantly reduce the prediction error rate and increase the prediction accuracy of energy consumption.
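The core experiment above, comparing prediction error with and without synthetic augmentation, can be sketched as follows; the forecaster, window length, and data loading are placeholders rather than the TGAN-skip pipeline itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def make_windows(series, lag=24):
    # Turn a 1D consumption series into lagged feature windows and targets.
    X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
    y = series[lag:]
    return X, y

def compare(real_train, synthetic, real_test, lag=24):
    Xr, yr = make_windows(real_train, lag)
    Xs, ys = make_windows(synthetic, lag)
    Xt, yt = make_windows(real_test, lag)
    variants = {"real only": (Xr, yr),
                "real + synthetic": (np.vstack([Xr, Xs]), np.r_[yr, ys])}
    for name, (X, y) in variants.items():
        model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
        print(f"{name}: MAE = {mean_absolute_error(yt, model.predict(Xt)):.3f}")
```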
One promising means to reduce building energy use for a more sustainable environment is to conduct early-stage building energy optimization using simulation, yet today's simulation engines are computationally intensive. Recently, machine learning (ML) energy prediction models have shown promise in replacing these simulation engines. However, it is often difficult to develop such ML models due to the lack of proper datasets. Synthetic datasets can provide a solution, but determining the optimal quantity and diversity of synthetic data remains a challenging task. Furthermore, there is a lack of understanding of the compatibility between different ML algorithms and the characteristics of synthetic datasets. To fill these gaps, this study conducted multiple ML experiments using residential buildings in Sweden to determine the best-performing ML algorithm, as well as the characteristics of the corresponding synthetic dataset. A parametric model was developed to generate a wide range of synthetic datasets varying in size and building shape, referred to as diversity. Five ML algorithms selected through a literature review were trained using the different datasets. Results show that the Support Vector Machine performed the best overall. Multiple Linear Regression performed well with small and low-diversity datasets, while the Artificial Neural Network performed well with large and high-diversity datasets. We conclude that, when generating synthetic training datasets, developers should focus more on increasing diversity instead of size once the dataset size reaches around 1440. This study offers insights for researchers and practitioners, such as software tool developers, when developing ML building energy prediction models in early-stage optimization.
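The size-versus-algorithm comparison described above can be sketched as a simple sweep: train each candidate regressor on progressively larger slices of the synthetic dataset and watch where the hold-out error stops improving. The model choices and settings here are assumptions, not the study's exact setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def size_sweep(X, y, sizes=(360, 720, 1440, 2880)):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    models = {"MLR": LinearRegression(),
              "SVM": SVR(C=10.0, epsilon=0.01),
              "ANN": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)}
    for n in sizes:
        n = min(n, len(X_train))
        for name, model in models.items():
            model.fit(X_train[:n], y_train[:n])
            mae = mean_absolute_error(y_test, model.predict(X_test))
            print(f"n={n:5d}  {name}: MAE={mae:.3f}")
```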
The phenomenon of sub-synchronous oscillation (SSO) poses significant threats to the stability of power systems. The advent of artificial intelligence (AI) has revolutionized SSO research through data-driven methodologies, which necessitate a substantial collection of data for effective training, a requirement frequently unfulfilled in practical power systems due to limited data availability. To address the critical issue of data scarcity in training AI models, this paper proposes a novel transfer-learning-based (TL-based) Wasserstein generative adversarial network (WGAN) approach for synthetic data generation of SSO in wind farms. To improve the capability of the WGAN to capture the bidirectional temporal features inherent in oscillation data, a bidirectional long short-term memory (BiLSTM) layer is introduced. Additionally, to address the training instability caused by few-shot learning scenarios, the discriminator is augmented with mini-batch discrimination (MBD) layers and gradient penalty (GP) terms. Finally, TL is leveraged to fine-tune the model, effectively bridging the gap between the training data and real-world system data. To evaluate the quality of the synthetic data, two indexes are proposed based on dynamic time warping (DTW) and frequency-domain analysis, followed by a classification task. Case studies demonstrate the effectiveness of the proposed approach in swiftly generating a large volume of synthetic SSO data, thereby significantly mitigating the issue of data scarcity prevalent in SSO research.
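Of the stabilization tricks listed above, the gradient penalty is the most standard; a PyTorch sketch of that term alone follows. The BiLSTM layers, mini-batch discrimination, and transfer-learning fine-tuning of the paper are not reproduced here.

```python
import torch

def gradient_penalty(discriminator, real, fake, lambda_gp=10.0):
    # Standard WGAN-GP term: penalize deviations of the critic's gradient norm
    # from 1 at points interpolated between real and generated samples.
    batch = real.size(0)
    eps = torch.rand(batch, *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real.detach() + (1.0 - eps) * fake.detach()).requires_grad_(True)
    d_interp = discriminator(interp)
    grads = torch.autograd.grad(outputs=d_interp, inputs=interp,
                                grad_outputs=torch.ones_like(d_interp),
                                create_graph=True, retain_graph=True)[0]
    grad_norm = grads.view(batch, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```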
The standard approach to tackling computer vision problems is to train deep convolutional neural network (CNN) models using large-scale image datasets that are representative of the target task. However, in many scenarios, it is often challenging to obtain sufficient image data for the target task. Data augmentation is a way to mitigate this challenge. A common practice is to explicitly transform existing images in desired ways to create the required volume and variability of training data necessary to achieve good generalization performance. In situations where data for the target domain are not accessible, a viable workaround is to synthesize training data from scratch, i.e., synthetic data augmentation. This paper presents an extensive review of synthetic data augmentation techniques. It covers data synthesis approaches based on realistic 3D graphics modelling, neural style transfer (NST), differential neural rendering, and generative modelling using generative adversarial networks (GANs) and variational autoencoders (VAEs). For each of these classes of methods, we focus on the important data generation and augmentation techniques, general scope of application and specific use-cases, as well as existing limitations and possible workarounds. Additionally, we provide a summary of common synthetic datasets for training computer vision models, highlighting the main features, application domains and supported tasks. Finally, we discuss the effectiveness of synthetic data augmentation methods. Since this is the first paper to explore synthetic data augmentation methods in great detail, we are hoping to equip readers with the necessary background information and in-depth knowledge of existing methods and their attendant issues.
Data-driven models for battery state estimation require extensive experimental training data, which may not be available or suitable for specific tasks like open-circuit voltage (OCV) reconstruction and subsequent state of health (SOH) estimation. This study addresses this issue by developing a transfer-learning-based (TL-based) OCV reconstruction model using a temporal convolutional long short-term memory (TCN-LSTM) network trained on synthetic data from an automotive nickel cobalt aluminium oxide (NCA) cell generated through a mechanistic model approach. The data consist of voltage curves at constant temperature, C-rates between C/30 and 1C, and an SOH range from 70% to 100%. The model is refined via Bayesian optimization and then applied to four use cases with reduced experimental nickel manganese cobalt oxide (NMC) cell training data for the higher use cases. The TL models' performances are compared with models trained solely on experimental data, focusing on different C-rates and voltage windows. The results demonstrate that the OCV reconstruction mean absolute error (MAE) within the average battery electric vehicle (BEV) home charging window (30% to 85% state of charge (SOC)) is less than 22 mV for the first three use cases across all C-rates. The SOH estimated from the reconstructed OCV exhibits a mean absolute percentage error (MAPE) below 2.2% for these cases. The study further investigates the impact of the source domain on TL by incorporating two additional synthetic datasets, a lithium iron phosphate (LFP) cell and an entirely artificial, non-existing cell, showing that the shifting and scaling of gradient changes in the charging curve alone suffice to transfer knowledge, even between different cell chemistries. A key limitation with respect to extrapolation capability is identified and evidenced in our fourth use case, where the absence of such comprehensive data hindered the TL process.
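A minimal PyTorch sketch of the transfer-learning step implied above: freeze the backbone of a model pretrained on synthetic curves and fine-tune only its output head on the small experimental dataset. The model file, attribute names, and learning rate are hypothetical, not the paper's configuration.

```python
import torch
import torch.nn as nn

def prepare_for_finetuning(model: nn.Module, head_name: str = "head", lr: float = 1e-4):
    # Freeze everything except the parameters whose names start with head_name,
    # then return an optimizer over the remaining trainable parameters.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_name)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

# Hypothetical usage:
# model = torch.load("tcn_lstm_pretrained_on_synthetic_nca.pt")
# optimizer = prepare_for_finetuning(model, head_name="head")
# ...standard training loop on the small experimental NMC dataset...
```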
The Metaverse's emergence is redefining digital interaction, enabling seamless engagement in immersive virtual realms. This trend's integration with AI and virtual reality (VR) is gaining momentum, albeit with challenges in acquiring extensive human action datasets. Real-world activities involve complex, intricate behaviors, making accurate capture and annotation difficult. VR compounds this difficulty by requiring meticulous simulation of natural movements and interactions. As the Metaverse bridges the physical and digital realms, the demand for diverse human action data escalates, requiring innovative solutions to enrich AI and VR capabilities. This need is underscored by state-of-the-art models that excel but are hampered by limited real-world data. The overshadowing of synthetic data's benefits further complicates the issue. This paper systematically examines both real-world and synthetic datasets for activity detection and recognition in computer vision. Introducing Metaverse-enabled advancements, we unveil SynDa's novel streamlined pipeline using photorealistic rendering and AI pose estimation. By fusing real-life video datasets, large-scale synthetic datasets are generated to augment training and mitigate real data scarcity and costs. Our preliminary experiments reveal promising results in terms of mean average precision (mAP): combining real data with synthetic video data generated using this pipeline to train models improves mAP (32.35%) compared to the mAP of the same model when trained on real data alone (29.95%). This demonstrates the transformative synergy between the Metaverse and AI-driven synthetic data augmentation.
Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data that sum to a constant, like 100%. The statistical linear model is the most used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data are present. The linear regression model is a commonly used statistical modeling technique applied in various settings to find relationships between variables of interest. When estimating linear regression parameters, which are useful for tasks like future prediction and partial-effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, which can lead to costly and time-consuming data recovery. To address this issue, the expectation-maximization (EM) algorithm has been suggested as a solution for situations involving missing data. The EM algorithm repeatedly finds the best estimates of parameters in statistical models that depend on variables or data that have not been observed; this is called maximum likelihood or maximum a posteriori (MAP) estimation. Using the present estimate as input, the expectation (E) step constructs a log-likelihood function. Finding the parameters that maximize the expected log-likelihood, as determined in the E step, is the job of the maximization (M) step. This study looked at how well the EM algorithm worked on a simulated compositional dataset with missing observations, using both robust least squares and ordinary least squares regression techniques. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
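A hedged sketch of the comparison described above using scikit-learn imputers: `IterativeImputer` serves as a MICE-style stand-in for the EM imputation step (not a literal EM implementation), alongside k-NN and mean imputation, with the same OLS fit applied to each completed dataset. The compositional log-ratio transforms and Aitchison-distance scoring of the study are not reproduced.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression

def fit_after_imputation(X_missing, y):
    # Complete the design matrix with three different imputers, then compare
    # the regression coefficients obtained from each completed dataset.
    imputers = {"iterative (MICE-style)": IterativeImputer(max_iter=50, random_state=0),
                "k-NN": KNNImputer(n_neighbors=5),
                "mean": SimpleImputer(strategy="mean")}
    for name, imputer in imputers.items():
        X_complete = imputer.fit_transform(X_missing)
        coefs = LinearRegression().fit(X_complete, y).coef_
        print(name, np.round(coefs, 3))
```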
Dealing with data scarcity is the biggest challenge faced by Artificial Intelligence (AI), and it will be interesting to see how we overcome this obstacle in the future, but for now, "THE SHOW MUST GO ON!!!" As AI spreads and transforms more industries, the lack of data is a significant obstacle, since data underpin the best methods for teaching machines how real-world processes work. This paper explores the considerable implications of data scarcity for the AI industry, which threatens to restrict its growth and potential, and proposes plausible solutions and perspectives. In addition, this article focuses strongly on different ethical considerations: privacy, consent, and non-discrimination principles during AI model development under limited-data conditions. The paper also investigates innovative technologies that need implementation, incorporating transfer learning, few-shot learning, and data augmentation to adapt models so they can be used effectively in low-resource settings. This emphasizes the need for collaborative frameworks and sound methodologies that ensure applicability and fairness, tackling the technical and ethical challenges associated with data scarcity in AI. The article also discusses prospective approaches to dealing with data scarcity, emphasizing the blend of synthetic data and traditional models and the use of advanced machine learning techniques such as transfer learning and few-shot learning. These techniques aim to enhance the flexibility and effectiveness of AI systems across various industries while ensuring sustainable AI technology development amid ongoing data scarcity.
In this study, we employ advanced data-driven techniques to investigate the complex relationships between the yields of five major crops and various geographical and spatiotemporal features in Senegal. We analyze how these features influence crop yields by utilizing remotely sensed data. Our methodology incorporates clustering algorithms and correlation matrix analysis to identify significant patterns and dependencies, offering a comprehensive understanding of the factors affecting agricultural productivity in Senegal. To optimize model performance and identify the optimal hyperparameters, we implemented a comprehensive grid search across four distinct machine learning regressors: Random Forest, Extreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and Light Gradient-Boosting Machine (LightGBM). Each regressor offers unique functionalities, enhancing our exploration of potential model configurations. The top-performing models were selected based on evaluating multiple performance metrics, ensuring robust and accurate predictive capabilities. The results demonstrated that XGBoost and CatBoost perform better than the other two. We introduce synthetic crop data generated using a Variational Auto-Encoder to address the challenges posed by limited agricultural datasets. By achieving high similarity scores with real-world data, our synthetic samples enhance model robustness, mitigate overfitting, and provide a viable solution for small-dataset issues in agriculture. Our approach distinguishes itself by creating a flexible model applicable to various crops together. By integrating five crop datasets and generating high-quality synthetic data, we improve model performance, reduce overfitting, and enhance realism. Our findings provide crucial insights into productivity drivers in key cropping systems, enabling robust recommendations and strengthening the decision-making capabilities of policymakers and farmers in data-scarce regions.
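The grid search across the four regressors mentioned above might look roughly like the sketch below; the hyperparameter grids are illustrative, and the `xgboost`, `catboost`, and `lightgbm` packages are assumed to be installed.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

def select_best(X, y):
    # Small, illustrative grids for each regressor family; each search returns
    # its best cross-validated score and parameters.
    candidates = {
        "RandomForest": (RandomForestRegressor(random_state=0), {"n_estimators": [200, 500]}),
        "XGBoost": (XGBRegressor(random_state=0), {"n_estimators": [200, 500], "max_depth": [4, 8]}),
        "CatBoost": (CatBoostRegressor(verbose=0, random_state=0), {"depth": [4, 8]}),
        "LightGBM": (LGBMRegressor(random_state=0), {"num_leaves": [31, 63]}),
    }
    results = {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, cv=5, scoring="neg_mean_absolute_error").fit(X, y)
        results[name] = (search.best_score_, search.best_params_)
    return results
```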
The collection of user attributes by service providers is a double-edged sword. These attributes are instrumental in driving statistical analysis to train more accurate predictive models such as recommenders. The analysis of the collected user data includes frequency estimation for categorical attributes. Nonetheless, users deserve privacy guarantees against inadvertent identity disclosures. Therefore, algorithms called frequency oracles were developed to randomize or perturb user attributes and estimate the frequencies of their values. We propose Sarve, a frequency oracle that uses Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) and Hadamard Response (HR) for randomization in combination with fake data. The design of a service-oriented architecture must consider two types of complexity, namely computational and communication complexity. The functions of such systems aim to minimize both, and therefore the choice of privacy-enhancing methods must be a calculated decision. The variant of RAPPOR we used is realized through Bloom filters. A Bloom filter is a memory-efficient data structure that offers O(1) time complexity. On the other hand, HR has been proven to give the best communication costs, of the order of log(b) for b-bit communication. Therefore, Sarve is a step towards frequency oracles that show how the privacy provisions of existing methods can be combined with those of fake data to achieve statistical results comparable to the original data. Sarve also implements an adaptive solution enhanced from the work of Arcolezi et al. The use of RAPPOR was found to provide better privacy-utility trade-offs for specific privacy budgets in both high and general privacy regimes.
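For intuition, a much simpler local-DP frequency oracle than Sarve, k-ary (generalized) randomized response with its standard unbiased estimator, can be sketched as follows; this is a stand-in only and does not reproduce RAPPOR, Hadamard Response, or the fake-data mechanism.

```python
import numpy as np

def perturb(values, k, epsilon, rng=None):
    # Each user keeps their true value with probability p and otherwise reports
    # a uniformly chosen different value (generalized randomized response).
    rng = rng or np.random.default_rng()
    values = np.asarray(values)
    p = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    out = values.copy()
    flip = rng.random(len(out)) >= p
    r = rng.integers(0, k - 1, size=flip.sum())
    out[flip] = r + (r >= values[flip])          # uniform over the k-1 other values
    return out

def estimate_frequencies(reports, k, epsilon):
    # Unbiased frequency estimator for generalized randomized response.
    n = len(reports)
    p = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    q = 1.0 / (np.exp(epsilon) + k - 1)
    counts = np.bincount(reports, minlength=k) / n
    return (counts - q) / (p - q)
```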
Offline handwritten mathematical expression recognition is a challenging optical character recognition (OCR) task due to various ambiguities of handwritten symbols and complicated two-dimensional structures. Recent work in this area usually constructs deeper and deeper neural networks trained with end-to-end approaches to improve the performance. However, the higher the complexity of the network, the more computing resources and time are required. To improve the performance without additional computing requirements, we concentrate on the training data and the training strategy in this paper. We propose a data augmentation method which can generate synthetic samples with new LaTeX notations by only using the official training data of CROHME. Moreover, we propose a novel training strategy called Shuffled Multi-Round Training (SMRT) to regularize the model. With the generated data and the shuffled multi-round training strategy, we achieve the state-of-the-art result in expression accuracy, i.e., 59.74% and 61.57% on CROHME 2014 and 2016, respectively, by using attention-based encoder-decoder models for offline handwritten mathematical expression recognition.
Association, aiming to link bounding boxes of the same identity in a video sequence, is a central component in multi-object tracking (MOT). To train association modules, e.g., parametric networks, real video data are usually used. However, annotating person tracks in consecutive video frames is expensive, and such real data, due to their inflexibility, offer us limited opportunities to evaluate system performance w.r.t. changing tracking scenarios. In this paper, we study whether 3D synthetic data can replace real-world videos for association training. Specifically, we introduce a large-scale synthetic data engine named MOTX, where the motion characteristics of cameras and objects are manually configured to be similar to those of real-world datasets. We show that, compared with real data, association knowledge obtained from synthetic data can achieve very similar performance on real-world test sets without domain adaptation techniques. Our intriguing observation is credited to two factors. First and foremost, 3D engines can well simulate motion factors such as camera movement, camera view, and object movement, so that the simulated videos can provide association modules with effective motion features. Second, the experimental results show that the appearance domain gap hardly harms the learning of association knowledge. In addition, the strong customization ability of MOTX allows us to quantitatively assess the impact of motion factors on MOT, which brings new insights to the community.
In crime science, understanding the dynamics and interactions between crime events is crucial for comprehending the underlying factors that drive their occurrences. Nonetheless, gaining access to detailed spatiotemporal crime records from law enforcement faces significant challenges due to confidentiality concerns. In response to these challenges, this paper introduces an innovative analytical tool named “stppSim,” designed to synthesize fine-grained spatiotemporal point records while safeguarding the privacy of individual locations. By utilizing the open-source R platform, this tool ensures easy accessibility for researchers, facilitating download, re-use, and potential advancements in various research domains beyond crime science.
Recently, machine learning (ML) has been considered a powerful technological element in different areas of society. To transform the computer into a decision maker, several sophisticated methods and algorithms are constantly created and analyzed. In geophysics, both supervised and unsupervised ML methods have dramatically contributed to the development of seismic and well-log data interpretation. In well logging, ML algorithms are well suited for lithologic reconstruction problems, since there are no analytical expressions for computing the well-log data produced by a particular rock unit. Additionally, supervised ML methods are strongly dependent on an accurately labeled training dataset, which is not a simple task to achieve due to missing or corrupted data. When adequate supervision is performed, the classification outputs tend to be more accurate than those of unsupervised methods. This work presents a supervised version of a Self-Organizing Map, named SSOM, to solve a lithologic reconstruction problem from well-log data. First, we consider a more controlled problem and simulate well-log data directly from an interpreted geologic cross-section. We then define two specific training datasets composed of density (RHOB), sonic (DT), spontaneous potential (SP) and gamma-ray (GR) logs, all simulated through a Gaussian distribution function per lithology. Once the training dataset is created, we simulate a particular pseudo-well, referred to as the classification well, for defining controlled tests. The first test comprises a training dataset with no labeled log data from the simulated fault zone. In the second test, we intentionally improve the training dataset with the fault. To assess the obtained results for each test, we analyze confusion matrices, log plots, accuracy and precision. Apart from very thin layer misclassifications, the SSOM provides reasonable lithologic reconstructions, especially when the improved training dataset is considered for supervision. The set of numerical experiments shows that our SSOM is extremely well suited for supervised lithologic reconstruction, especially to recover lithotypes that are weakly sampled in the training log data. On the other hand, some misclassifications are also observed when the cortex could not group the slightly different lithologies.
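The training-set construction described above can be sketched directly: each lithology contributes labeled samples drawn from per-lithology Gaussian distributions for the RHOB, DT, SP, and GR logs. The means and standard deviations below are illustrative values, not calibrated petrophysical parameters.

```python
import numpy as np
import pandas as pd

LITHOLOGY_PARAMS = {  # (mean, std) per log, per lithology -- hypothetical values
    "shale":     {"RHOB": (2.45, 0.05), "DT": (90, 8), "SP": (-20, 5), "GR": (110, 15)},
    "sandstone": {"RHOB": (2.30, 0.05), "DT": (70, 6), "SP": (-60, 8), "GR": (45, 10)},
    "limestone": {"RHOB": (2.65, 0.04), "DT": (55, 5), "SP": (-10, 4), "GR": (25, 8)},
}

def simulate_training_logs(n_per_class=500, rng=None):
    rng = rng or np.random.default_rng(0)
    rows = []
    for lith, logs in LITHOLOGY_PARAMS.items():
        sample = {log: rng.normal(mu, sd, n_per_class) for log, (mu, sd) in logs.items()}
        sample["lithology"] = np.full(n_per_class, lith)
        rows.append(pd.DataFrame(sample))
    return pd.concat(rows, ignore_index=True)

training_set = simulate_training_logs()   # labeled data for supervising the SSOM
```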
Predicting the degradation trajectory of lithium-ion (Li-ion) batteries in the early phase is of great importance for arranging the maintenance of battery energy storage systems. However, under different operating conditions, Li-ion batteries present distinct degradation patterns, and it is challenging to capture the negligible capacity fade in early cycles. Although data-driven methods show promising performance, insufficient data is still a big issue, since ageing experiments on batteries are slow and expensive. In this study, we propose twin autoencoders integrated into a two-stage method to predict degradation trajectories from early cycles. The two-stage method predicts the degradation from coarse to fine. The twin autoencoders serve as a feature extractor and a synthetic data generator, respectively. Finally, a learning procedure based on the long short-term memory (LSTM) network is designed to hybridize the learning process between the real and synthetic data. The performance of the proposed method is verified on three datasets, and the experimental results show that the proposed method achieves accurate predictions compared to its competitors.
This study focused on land cover mapping based on synthetic images, in particular using spatial and temporal classification as well as accuracy validation of the results. Our experimental results indicate that the accuracy of a land cover map based on synthetic imagery and actual observation is comparable to that obtained with actual land cover survey data. These findings facilitate land cover mapping with synthetic data in areas where actual observations are missing. Furthermore, in order to improve the quality of the land cover mapping, this research employed the spatial and temporal Markov random field classification approach. Test results show that overall mapping accuracy can be increased by approximately 5% after applying spatial and temporal classification. This finding contributes towards higher-quality land cover mapping of areas with missing data by using spatial and temporal information.
文摘The Metaverse’s emergence is redefining digital interaction,enabling seamless engagement in immersive virtual realms.This trend’s integration with AI and virtual reality(VR)is gaining momentum,albeit with challenges in acquiring extensive human action datasets.Real-world activities involve complex intricate behaviors,making accurate capture and annotation difficult.VR compounds this difficulty by requiring meticulous simulation of natural movements and interactions.As the Metaverse bridges the physical and digital realms,the demand for diverse human action data escalates,requiring innovative solutions to enrich AI and VR capabilities.This need is underscored by state-of-the-art models that excel but are hampered by limited real-world data.The overshadowing of synthetic data benefits further complicates the issue.This paper systematically examines both real-world and synthetic datasets for activity detection and recognition in computer vision.Introducing Metaverse-enabled advancements,we unveil SynDa’s novel streamlined pipeline using photorealistic rendering and AI pose estimation.By fusing real-life video datasets,large-scale synthetic datasets are generated to augment training and mitigate real data scarcity and costs.Our preliminary experiments reveal promising results in terms of mean average precision(mAP),where combining real data and synthetic video data generated using this pipeline to train models presents an improvement in mAP(32.35%),compared to the mAP of the same model when trained on real data(29.95%).This demonstrates the transformative synergy between Metaverse and AI-driven synthetic data augmentation.
Abstract: Compositional data, such as relative information, is a crucial aspect of machine learning and related fields. It is typically recorded as closed data, i.e., data that sums to a constant such as 100%. The linear regression model is the most widely used statistical technique for identifying relationships between underlying random variables of interest, and maximum likelihood estimation (MLE) is the method of choice for estimating its parameters, which are useful for tasks such as prediction and analyzing the partial effects of independent variables. However, data quality is a significant challenge in machine learning, especially when observations are missing, and recovering missing data can be costly and time-consuming. To address this issue, the expectation-maximization (EM) algorithm has been suggested for situations involving missing data. The EM algorithm iteratively finds maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved variables. Using the current parameter estimates, the expectation (E) step constructs the expected log-likelihood function; the maximization (M) step then finds the parameters that maximize this expected log-likelihood. This study evaluated the performance of the EM algorithm on a simulated compositional dataset with missing observations, using both robust least squares and ordinary least squares regression. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
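To make the E/M alternation concrete, the sketch below implements the textbook EM algorithm for a multivariate normal dataset with missing entries: the E step replaces each missing block by its conditional expectation given the observed block under the current mean and covariance, and the M step re-estimates the mean and covariance from the completed data plus the conditional-covariance correction. This is a generic illustration with hypothetical toy data, not the authors' compositional-regression pipeline.

```python
import numpy as np

def em_mvn_impute(X, n_iter=50):
    """EM for a multivariate normal with missing values coded as np.nan."""
    X = X.copy()
    miss = np.isnan(X)
    # crude initialization: fill with column means, then estimate a covariance
    mu = np.nanmean(X, axis=0)
    X[miss] = np.take(mu, np.where(miss)[1])
    sigma = np.cov(X, rowvar=False, bias=True)

    for _ in range(n_iter):
        corr = np.zeros_like(sigma)          # accumulates conditional covariances
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                  # fully missing row: conditional on nothing
                X[i] = mu
                corr += sigma
                continue
            S_oo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
            S_mo = sigma[np.ix_(m, o)]
            # E step: conditional mean of the missing block given the observed block
            X[i, m] = mu[m] + S_mo @ S_oo_inv @ (X[i, o] - mu[o])
            # conditional covariance feeds the M-step covariance update
            corr[np.ix_(m, m)] += sigma[np.ix_(m, m)] - S_mo @ S_oo_inv @ S_mo.T
        # M step: re-estimate the parameters from the completed data
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False, bias=True) + corr / X.shape[0]
    return X, mu, sigma

# toy example: three correlated columns with roughly 20% of the values removed at random
rng = np.random.default_rng(0)
Z = rng.multivariate_normal([0.0, 1.0, 2.0],
                            [[1.0, 0.6, 0.3], [0.6, 1.0, 0.4], [0.3, 0.4, 1.0]],
                            size=200)
Z[rng.random(Z.shape) < 0.2] = np.nan
X_completed, mu_hat, sigma_hat = em_mvn_impute(Z)
```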
Funding: Supported by the Internal Research Support Program (IRSPG202202).
Abstract: Data scarcity is among the biggest challenges faced by Artificial Intelligence (AI); how this obstacle will be overcome remains an open question, but for now development must continue regardless. As AI spreads to and transforms more industries, the lack of data is a significant obstacle, because large, representative datasets are the best means of teaching machines how real-world processes work. This paper explores the considerable implications of data scarcity for the AI industry, which threatens to restrict its growth and potential, and proposes plausible solutions and perspectives. The article also focuses on ethical considerations, including privacy, consent, and non-discrimination, during AI model development under data-limited conditions. In addition, it investigates techniques such as transfer learning, few-shot learning, and data augmentation for adapting models to low-resource settings, and emphasizes the need for collaborative frameworks and sound methodologies that ensure applicability and fairness while tackling the technical and ethical challenges associated with data scarcity in AI. Finally, the article discusses prospective approaches, including blending synthetic data with traditional models, aimed at enhancing the flexibility and effectiveness of AI systems across industries while ensuring the sustainable development of AI technology amid ongoing data scarcity.
Abstract: In this study, we employ advanced data-driven techniques to investigate the complex relationships between the yields of five major crops and various geographical and spatiotemporal features in Senegal. We analyze how these features influence crop yields by utilizing remotely sensed data. Our methodology incorporates clustering algorithms and correlation matrix analysis to identify significant patterns and dependencies, offering a comprehensive understanding of the factors affecting agricultural productivity in Senegal. To optimize model performance and identify the best hyperparameters, we implemented a comprehensive grid search across four machine learning regressors: Random Forest, Extreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and Light Gradient-Boosting Machine (LightGBM). Each regressor offers unique functionality, broadening our exploration of potential model configurations. The top-performing models were selected by evaluating multiple performance metrics, ensuring robust and accurate predictions; the results show that XGBoost and CatBoost outperform the other two. To address the challenges posed by limited agricultural datasets, we introduce synthetic crop data generated using a variational autoencoder. By achieving high similarity scores with real-world data, our synthetic samples enhance model robustness, mitigate overfitting, and provide a viable solution to small-dataset issues in agriculture. Our approach distinguishes itself by creating a flexible model applicable to multiple crops jointly: by integrating five crop datasets and generating high-quality synthetic data, we improve model performance, reduce overfitting, and enhance realism. Our findings provide crucial insights into productivity drivers in key cropping systems, enabling robust recommendations and strengthening the decision-making capabilities of policymakers and farmers in data-scarce regions.
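The grid-search stage can be outlined with scikit-learn's GridSearchCV looping over a dictionary of candidate regressors. The sketch below uses stand-in data and only estimators shipped with scikit-learn; the study's XGBoost, CatBoost, and LightGBM models can be dropped into the same dictionary if those packages are installed, and the parameter grids shown are placeholders rather than the paper's search space.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              HistGradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import GridSearchCV

# Stand-in data; in the study these would be remotely sensed features and crop yields.
X, y = make_regression(n_samples=500, n_features=12, noise=5.0, random_state=0)

# One (estimator, parameter grid) pair per candidate model family.
# XGBRegressor, CatBoostRegressor, and LGBMRegressor can be added here if installed.
candidates = {
    "random_forest": (RandomForestRegressor(random_state=0),
                      {"n_estimators": [200, 500], "max_depth": [None, 10]}),
    "grad_boosting": (GradientBoostingRegressor(random_state=0),
                      {"learning_rate": [0.05, 0.1], "n_estimators": [200, 500]}),
    "hist_gbm":      (HistGradientBoostingRegressor(random_state=0),
                      {"learning_rate": [0.05, 0.1], "max_iter": [200, 500]}),
}

best = {}
for name, (est, grid) in candidates.items():
    search = GridSearchCV(est, grid, cv=5,
                          scoring="neg_mean_absolute_error", n_jobs=-1)
    search.fit(X, y)
    best[name] = (search.best_params_, -search.best_score_)

# report the model families from best to worst cross-validated MAE
for name, (params, mae) in sorted(best.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: CV MAE = {mae:.2f} with {params}")
```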
Abstract: The collection of user attributes by service providers is a double-edged sword. These attributes are instrumental in driving statistical analyses that train more accurate predictive models such as recommenders, and the analysis of collected user data includes frequency estimation for categorical attributes. Nonetheless, users deserve privacy guarantees against inadvertent identity disclosure. Algorithms called frequency oracles were therefore developed to randomize or perturb user attributes and estimate the frequencies of their values. We propose Sarve, a frequency oracle that uses Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) and Hadamard Response (HR) for randomization in combination with fake data. The design of a service-oriented architecture must consider two types of complexity, computational and communication; the functions of such systems aim to minimize both, so the choice of privacy-enhancing methods must be a calculated decision. The RAPPOR variant we use is realized through Bloom filters, memory-efficient data structures that offer O(1) time complexity. HR, on the other hand, has been proven to give the best communication costs, on the order of log(b) for b-bit communication. Sarve is thus a step towards frequency oracles that show how the privacy provisions of existing methods can be combined with fake data to achieve statistical results comparable to those on the original data. Sarve also implements an adaptive solution that extends the work of Arcolezi et al. The use of RAPPOR was found to provide better privacy-utility trade-offs for specific privacy budgets in both the high and general privacy regimes.
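As a minimal picture of what a frequency oracle does, the sketch below implements generalized randomized response (GRR), the simplest local-differential-privacy mechanism: each user reports their true category with probability p and a uniformly random other category otherwise, and the aggregator de-biases the observed counts. Sarve itself builds on RAPPOR (Bloom-filter based) and Hadamard Response combined with fake data, which are more elaborate than this toy example.

```python
import numpy as np

def grr_perturb(values, k, epsilon, rng):
    """Each user keeps their true category with probability p, otherwise reports a random other one."""
    p = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    reports = values.copy()
    flip = rng.random(values.size) > p
    # replacement drawn uniformly from the k-1 other categories
    offsets = rng.integers(1, k, size=flip.sum())
    reports[flip] = (values[flip] + offsets) % k
    return reports

def grr_estimate(reports, k, epsilon):
    """Unbiased frequency estimate: invert the known perturbation probabilities."""
    n = reports.size
    p = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    q = 1.0 / (np.exp(epsilon) + k - 1)
    counts = np.bincount(reports, minlength=k)
    return (counts - n * q) / (n * (p - q))      # estimated fraction per category

rng = np.random.default_rng(1)
k, epsilon = 8, 2.0
weights = np.linspace(1, 3, k)
true_values = rng.choice(k, size=50_000, p=weights / weights.sum())
estimates = grr_estimate(grr_perturb(true_values, k, epsilon, rng), k, epsilon)
print(np.round(estimates, 3))                                         # private estimates
print(np.round(np.bincount(true_values, minlength=k) / 50_000, 3))     # true frequencies
```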
Funding: The National Key Research and Development Program of China, No. 2020YFB1313602.
Abstract: Offline handwritten mathematical expression recognition is a challenging optical character recognition (OCR) task due to the many ambiguities of handwritten symbols and complicated two-dimensional structures. Recent work in this area usually constructs deeper and deeper neural networks trained end-to-end to improve performance. However, the higher the complexity of the network, the more computing resources and time are required. To improve performance without additional computing requirements, we concentrate on the training data and the training strategy. We propose a data augmentation method that can generate synthetic samples with new LaTeX notations using only the official CROHME training data. Moreover, we propose a novel training strategy called Shuffled Multi-Round Training (SMRT) to regularize the model. With the generated data and the shuffled multi-round training strategy, we achieve state-of-the-art expression accuracy, 59.74% and 61.57% on CROHME 2014 and 2016, respectively, using attention-based encoder-decoder models for offline handwritten mathematical expression recognition.
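As one illustration of notation-level augmentation (the abstract does not spell out the paper's generation scheme, so this is only an assumed flavour of it), existing LaTeX ground-truth strings can be rewritten by substituting interchangeable tokens, yielding new expression labels while leaving LaTeX commands intact.

```python
import random
import re

# LaTeX commands (e.g. \frac, \sqrt) are kept as single tokens; everything else is one character.
TOKEN = re.compile(r"\\[a-zA-Z]+|.")

def augment_latex(expr, rng):
    """Create a new ground-truth string by swapping digits and single-letter variables."""
    digits, variables = list("0123456789"), list("abcxyz")
    out = []
    for tok in TOKEN.findall(expr):
        if tok in digits:
            out.append(rng.choice(digits))
        elif tok in variables:
            out.append(rng.choice(variables))
        else:
            out.append(tok)
    return "".join(out)

rng = random.Random(0)
print(augment_latex(r"\frac{a+2}{x^{3}}", rng))   # e.g. \frac{c+7}{y^{5}}, depending on the seed
```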
Funding: Supported by the ARC Discovery Early Career Researcher Award (No. DE200101283) and the ARC Discovery Project (No. DP210102801).
Abstract: Association, which aims to link bounding boxes of the same identity across a video sequence, is a central component of multi-object tracking (MOT). Association modules, e.g., parametric networks, are usually trained on real video data. However, annotating person tracks in consecutive video frames is expensive, and such real data, due to its inflexibility, offers limited opportunities to evaluate system performance w.r.t. changing tracking scenarios. In this paper, we study whether 3D synthetic data can replace real-world videos for association training. Specifically, we introduce a large-scale synthetic data engine named MOTX, in which the motion characteristics of cameras and objects are manually configured to resemble those of real-world datasets. We show that, compared with real data, association knowledge obtained from synthetic data achieves very similar performance on real-world test sets without domain adaptation techniques. This intriguing observation is credited to two factors. First and foremost, 3D engines can simulate motion factors such as camera movement, camera view, and object movement well, so the simulated videos provide association modules with effective motion features. Second, the experimental results show that the appearance domain gap hardly harms the learning of association knowledge. In addition, the strong customization ability of MOTX allows us to quantitatively assess the impact of motion factors on MOT, which brings new insights to the community.
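For readers unfamiliar with the association step, the generic baseline below links detections to existing tracks by maximizing total IoU with the Hungarian algorithm; the learned, parametric association module that MOTX is used to train replaces such hand-crafted costs, and the boxes here are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_min=0.3):
    """Match current detections to existing tracks by maximizing total IoU."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_min]

# hypothetical boxes: two tracks from the previous frame, three detections in the current one
tracks = [(10, 10, 50, 80), (100, 20, 140, 90)]
detections = [(12, 12, 52, 82), (200, 30, 240, 100), (98, 18, 138, 88)]
print(associate(tracks, detections))   # -> [(0, 0), (1, 2)]
```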
Abstract: In crime science, understanding the dynamics and interactions between crime events is crucial for comprehending the underlying factors that drive their occurrence. Nonetheless, gaining access to detailed spatiotemporal crime records from law enforcement faces significant challenges due to confidentiality concerns. In response to these challenges, this paper introduces an analytical tool named “stppSim,” designed to synthesize fine-grained spatiotemporal point records while safeguarding the privacy of individual locations. Built on the open-source R platform, the tool is easily accessible to researchers, facilitating download, re-use, and potential advancements in various research domains beyond crime science.
Abstract: Recently, machine learning (ML) has come to be considered a powerful technological element in many areas of society. To turn the computer into a decision maker, sophisticated methods and algorithms are constantly created and analyzed. In geophysics, both supervised and unsupervised ML methods have contributed dramatically to the development of seismic and well-log data interpretation. In well logging, ML algorithms are well suited to lithologic reconstruction problems, since there is no analytical expression for computing the well-log response produced by a particular rock unit. Supervised ML methods, however, depend strongly on an accurately labeled training dataset, which is not simple to obtain due to missing or corrupted data. When adequate supervision is available, the classification outputs tend to be more accurate than those of unsupervised methods. This work presents a supervised version of a Self-Organizing Map, named SSOM, to solve a lithologic reconstruction problem from well-log data. First, we consider a more controlled problem and simulate well-log data directly from an interpreted geologic cross-section. We then define two specific training datasets composed of density (RHOB), sonic (DT), spontaneous potential (SP), and gamma-ray (GR) logs, all simulated through a Gaussian distribution function per lithology. Once the training dataset is created, we simulate a particular pseudo-well, referred to as the classification well, to define two controlled tests. The first comprises a training dataset with no labeled log data from the simulated fault zone; in the second test, we intentionally improve the training dataset with the fault. To assess the results of each test, we analyze confusion matrices, log plots, accuracy, and precision. Apart from misclassifications of very thin layers, the SSOM provides reasonable lithologic reconstructions, especially when the improved training dataset is used for supervision. The set of numerical experiments shows that our SSOM is extremely well suited to supervised lithologic reconstruction, especially for recovering lithotypes that are weakly sampled in the training log data. On the other hand, some misclassifications are also observed when the cortex could not properly group the slightly different lithologies.
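The controlled training set described above, one Gaussian per log and per lithology, is straightforward to reproduce in outline; the lithology names and the (mean, standard deviation) pairs below are hypothetical placeholders, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical per-lithology (mean, std) for each simulated log
LITHOLOGY_PARAMS = {
    #            RHOB (g/cc)         DT (us/ft)     SP (mV)         GR (API)
    "sandstone": {"RHOB": (2.35, 0.05), "DT": (85, 5), "SP": (-40, 10), "GR": (45, 10)},
    "shale":     {"RHOB": (2.55, 0.05), "DT": (95, 5), "SP": (-10, 5),  "GR": (110, 15)},
    "limestone": {"RHOB": (2.70, 0.04), "DT": (60, 4), "SP": (-20, 8),  "GR": (25, 8)},
}

def simulate_training_logs(n_per_litho=500):
    """Draw synthetic RHOB/DT/SP/GR samples from one Gaussian per lithology."""
    rows, labels = [], []
    for litho, params in LITHOLOGY_PARAMS.items():
        cols = [rng.normal(mu, sd, n_per_litho) for mu, sd in params.values()]
        rows.append(np.column_stack(cols))
        labels += [litho] * n_per_litho
    return np.vstack(rows), np.array(labels)

X_train, y_train = simulate_training_logs()
# X_train (1500 x 4) and y_train can now supervise a classifier such as the paper's SSOM.
```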
Funding: Financially supported by the National Natural Science Foundation of China under Grants 62372369, 52107229, and 62272383; the Key Research and Development Program of Shaanxi Province (2024GX-YBXM-442); and the Natural Science Basic Research Program of Shaanxi Province (2024JC-YBMS-477).
Abstract: Predicting the lithium-ion (Li-ion) battery degradation trajectory in the early phase is of great importance for arranging the maintenance of battery energy storage systems. However, under different operating conditions, Li-ion batteries present distinct degradation patterns, and it is challenging to capture the nearly negligible capacity fade of early cycles. Although data-driven methods show promising performance, insufficient data remains a major issue, since battery ageing experiments are slow and expensive. In this study, we propose twin autoencoders integrated into a two-stage method that predicts degradation trajectories from early cycles, refining the prediction from coarse to fine. The twin autoencoders serve as a feature extractor and a synthetic data generator, respectively. Finally, a learning procedure based on a long short-term memory (LSTM) network is designed to hybridize the learning process between real and synthetic data. The performance of the proposed method is verified on three datasets, and the experimental results show that it achieves accurate predictions compared with its competitors.
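A compact way to picture the twin-autoencoder arrangement, assuming a PyTorch implementation: one autoencoder compresses early-cycle capacity segments into features, the second decodes jittered latent codes into synthetic segments (one simple choice of generation scheme), and an LSTM maps the resulting features to the future trajectory. All layer sizes, the noise-based generation trick, and the placeholder tensors are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Small fully connected autoencoder over a fixed-length early-cycle capacity segment."""
    def __init__(self, seg_len=100, latent=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(seg_len, 64), nn.ReLU(), nn.Linear(64, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, seg_len))

    def forward(self, x):
        return self.decoder(self.encoder(x))

feature_ae = AE()    # twin 1: its encoder output serves as the degradation feature
generator_ae = AE()  # twin 2: decodes perturbed latents into synthetic segments

def synthesize(segments, noise_scale=0.1):
    """One simple generation scheme: jitter real latents and decode them back."""
    with torch.no_grad():
        z = generator_ae.encoder(segments)
        z = z + noise_scale * torch.randn_like(z)
        return generator_ae.decoder(z)

class TrajectoryLSTM(nn.Module):
    """Maps a sequence of per-cycle features to the remaining capacity trajectory."""
    def __init__(self, latent=16, hidden=64, horizon=500):
        super().__init__()
        self.lstm = nn.LSTM(latent, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, feats):              # feats: (batch, n_cycles, latent)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])       # predict the full future trajectory

# Sketch of the hybrid training input: real early-cycle segments plus synthetic ones.
real_segments = torch.rand(32, 100)                      # hypothetical placeholder data
mixed = torch.cat([real_segments, synthesize(real_segments)])
features = feature_ae.encoder(mixed).unsqueeze(1)        # (64, 1, 16) single-cycle example
trajectory = TrajectoryLSTM()(features)                  # (64, 500) predicted capacities
```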
Funding: Supported in part by the National High-Tech R&D Program (863 Program) under Grant 2009AA122004; the National Natural Science Foundation of China under Grant 60171009; and the Hong Kong Research Grants Council under Grant CUHK 444612.
Abstract: This study focused on land cover mapping based on synthetic images, in particular using spatial and temporal classification, and on validating the accuracy of the results. Our experimental results indicate that the accuracy of land cover maps based on synthetic imagery combined with actual observations is comparable to that of maps derived from actual land cover survey data. These findings facilitate land cover mapping with synthetic data in areas where actual observations are missing. Furthermore, to improve the quality of the land cover mapping, this research employed a spatial and temporal Markov random field classification approach. Test results show that overall mapping accuracy can be increased by approximately 5% after applying spatial and temporal classification. This finding contributes to higher-quality land cover mapping of areas with missing data by using spatial and temporal information.
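For reference, spatial and temporal Markov random field classification is commonly formulated as minimizing an energy that trades per-pixel spectral evidence against label agreement with spatial neighbours and with the same pixel at adjacent dates; the Potts-style form and the weights $\beta$ and $\gamma$ below are generic notation rather than the paper's exact model.

```latex
E(\mathbf{c}) = \sum_{i} -\ln p\left(\mathbf{x}_i \mid c_i\right)
  + \beta \sum_{i} \sum_{j \in \mathcal{N}_s(i)} \mathbf{1}\left[c_i \neq c_j\right]
  + \gamma \sum_{i} \sum_{t' \in \mathcal{N}_t(i)} \mathbf{1}\left[c_i \neq c_i^{(t')}\right]
```

Here $c_i$ is the candidate land cover label of pixel $i$, $\mathbf{x}_i$ its observed spectral vector, and $\mathcal{N}_s(i)$ and $\mathcal{N}_t(i)$ its spatial and temporal neighbourhoods.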