Artificial intelligence (AI) relies on data and algorithms. State-of-the-art (SOTA) AI smart algorithms have been developed to improve the performance of AI-oriented structures. However, model-centric approaches are limited by the absence of high-quality data. Data-centric AI is an emerging approach for solving machine learning (ML) problems. It is a collection of various data manipulation techniques that allow ML practitioners to systematically improve the quality of the data used in an ML pipeline. However, data-centric AI approaches are not well documented. Researchers have conducted various experiments without a clear set of guidelines. This survey highlights six major data-centric AI aspects that researchers are already using to intentionally or unintentionally improve the quality of AI systems. These include big data quality assessment, data preprocessing, transfer learning, semi-supervised learning, machine learning operations (MLOps), and the effect of adding more data. In addition, it highlights recent data-centric techniques adopted by ML practitioners. We address how adding data might harm datasets and how HoloClean can be used to restore and clean them. Finally, we discuss the causes of technical debt in AI. Technical debt builds up when software design and implementation decisions run into “or outright collide with” business goals and timelines. This survey lays the groundwork for future data-centric AI discussions by summarizing various data-centric approaches.
In a time characterized by the availability of vast amounts of data, the effective utilization of information is critical for timely decision-making in military operations. However, processing large amounts of data requires computational resources and time. Therefore, decision makers have used data-centric technologies to take advantage of public and private data sources to support military operations. This survey explores the integration and application of data-centric technologies, such as data analytics, data science, and machine learning, to optimize decision-making workflows within military contexts supporting the deployment of military assets and resources. To address the information gap, this article presents a literature review, specifically a survey. Our survey examines the use of these technologies to process and analyze information that contributes to the phases of situational awareness and planning in military environments. We then introduce a taxonomy of the approaches associated with implementing these technologies in military scenarios. Furthermore, we discuss relevant factors for the seamless integration of data-centric technologies into military decision-making processes, and reveal the importance of specialized personnel, architectures, and cybersecurity issues in the task of developing prototypes and models. The findings of this paper aim to provide valuable insights for military institutions, offering a deeper understanding of the use of data-centric technologies as innovative practices to enhance the effectiveness of military decision-making.
The purpose of this paper (presented online as a keynote lecture at the 25th Annual Indonesian Geotechnical Conference on 10 Nov 2021) is to broadly conceptualize the agenda for data-centric geotechnics, an emerging field that attempts to prepare geotechnical engineering for digital transformation. The agenda must include (1) development of methods that make sense of all real-world data (not selective input data for a physical model), (2) offering insights of significant value to critical real-world decisions for current or future practice (not decisions for an ideal world or decisions of minor concern to geotechnical engineers), and (3) sensitivity to the physical context of geotechnics (not abstract data-driven analysis connected to geotechnics in a peripheral way, i.e., engagement with the knowledge and experience base should be substantial). These three elements are termed “data centricity”, “fit for (and transform) practice”, and “geotechnical context” in the agenda. Given that knowledge of the site is central to any geotechnical engineering project, data-driven site characterization (DDSC) must constitute one key application domain in data-centric geotechnics, although other infrastructure lifecycle phases such as project conceptualization, design, construction, operation, and decommission/reuse would benefit from data-informed decision support as well. One part of DDSC that addresses numerical soil data in a site investigation report and soil property databases is pursued under Project DeepGeo. In principle, the source of data can also go beyond site investigation, and the type of data can go beyond numbers, such as categorical data, text, audio, images, videos, and expert opinion. The purpose of Project DeepGeo is to produce a 3D stratigraphic map of the subsurface volume below a full-scale project site and to estimate relevant engineering properties at each spatial point based on actual site investigation data and other relevant Big Indirect Data (BID). Uncertainty quantification is necessary, as current real-world data is insufficient, incomplete, and/or not directly relevant to construct a deterministic map. The value of a deterministic map for decision support is debatable. The computational cost to do this for a 3D true-scale subsurface volume must be reasonable. Ultimately, geotechnical structures need to be a part of a completely smart infrastructure that fits the circular economy and need to focus on delivering service to end-users and the community from project conceptualization to decommission/reuse with full integration to smart city and smart society. Although current geotechnical practice has been very successful in taking “calculated risk” informed by limited data, imperfect theories, prototype testing, and observations, among others, and in exercising judicious caution and engineering judgment, there is no clear pathway forward to leverage big data and digital technologies such as machine learning, BIM, and digital twin to meet more challenging needs such as sustainability and resilience engineering.
While novel artificial intelligence and machine learning techniques are evolving and disrupting established terrestrial technologies at an unprecedented speed, their adaptation onboard satellites is seemingly lagging. A major hindrance in this regard is the need for high-quality annotated data for training such systems, which makes the development process of machine learning solutions costly, time-consuming, and inefficient. This paper presents “the OPS-SAT case”, a novel data-centric competition that seeks to address these challenges. The powerful computational capabilities of the European Space Agency’s OPS-SAT satellite are utilized to showcase the design of machine learning systems for space by using only the small amount of available labeled data, relying on widely adopted and freely available open-source software. The generation of a suitable dataset, the design and evaluation of a public data-centric competition, and the results of an onboard experimental campaign using the competition winners’ machine learning model directly on OPS-SAT are detailed. The results indicate that adoption of open standards and deployment of advanced data augmentation techniques can retrieve meaningful onboard results comparatively quickly, simplifying and expediting an otherwise prolonged development period.
Funding: Supported by the Key Program of the National Natural Science Foundation of China under Grant No. 60533110; the National Natural Science Foundation of China under Grant No. 60473075; the National Grand Fundamental Research 973 Program of China under Grant No. 2006CB303000; the Program for New Century Excellent Talents in University of China under Grant No. NCET-05-0333; the Key Program of the Natural Science Foundation of Heilongjiang Province of China under Grant No. ZJG03-05; and the Heilongjiang Province Scientific and Technological Special Fund for Young Scholars of China under Grant No. QC06C033.