Funding: This study was financially supported by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), the Ministry of Health and Welfare (HI18C1216), and the Soonchunhyang University Research Fund.
Abstract: Big data applications in healthcare have provided a variety of solutions to reduce costs, errors, and waste. This work aims to develop a real-time system based on big medical data processing in the cloud for the prediction of health issues. In the proposed scalable system, medical parameters are sent to Apache Spark to extract attributes from the data and apply the proposed machine learning algorithm. In this way, healthcare risks can be predicted and sent as alerts and recommendations to users and healthcare providers. The proposed work also aims to provide an effective recommendation system by using streaming medical data, historical data from a user's profile, and a knowledge database to make the most appropriate real-time recommendations and alerts based on the sensors' measurements. The proposed scalable system works by tweeting the health status attributes of users. Their cloud profile receives the streaming healthcare data in real time, and a machine learning prediction algorithm extracts the health attributes to predict the users' health status. Subsequently, their status can be sent on demand to healthcare providers. Therefore, machine learning algorithms can be applied to streaming healthcare data from wearables and provide users with insights into their health status. These algorithms can help healthcare providers and individuals focus on health risks and health status changes and consequently improve quality of life.
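The alert path described above (map each streamed reading to a risk level, then notify) can be sketched in plain Python. The thresholds, field names, and risk labels below are illustrative assumptions, not the paper's actual algorithm; in the real system this per-reading mapping would run inside the Spark streaming job.

```python
# Hypothetical sketch: mapping streamed vitals to risk alerts.
def risk_level(heart_rate, spo2, temp_c):
    """Classify a single sensor reading; thresholds are illustrative only."""
    if spo2 < 90 or heart_rate > 140 or temp_c > 39.5:
        return "high"
    if spo2 < 94 or heart_rate > 110 or temp_c > 38.0:
        return "moderate"
    return "normal"

# A micro-batch of readings, as the streaming layer might deliver them.
readings = [
    {"user": "u1", "heart_rate": 72,  "spo2": 98, "temp_c": 36.8},
    {"user": "u2", "heart_rate": 118, "spo2": 93, "temp_c": 37.2},
    {"user": "u3", "heart_rate": 150, "spo2": 88, "temp_c": 39.9},
]
alerts = {r["user"]: risk_level(r["heart_rate"], r["spo2"], r["temp_c"])
          for r in readings}
print(alerts)  # {'u1': 'normal', 'u2': 'moderate', 'u3': 'high'}
```

Only "moderate" and "high" results would be forwarded as alerts; "normal" readings would simply update the user's cloud profile.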
Abstract: Big data analytics is a popular research topic due to its applicability in various real-time applications. Recent machine learning and deep learning models can be applied to analyze big data with better performance. Since big data involves numerous features and necessitates high computational time, feature selection methodologies using metaheuristic optimization algorithms can be adopted to choose an optimal subset of features and thereby improve the overall classification performance. This study proposes a new sigmoid butterfly optimization algorithm with an optimum gated recurrent unit (SBOA-OGRU) model for big data classification in Apache Spark. The SBOA-OGRU technique involves an SBOA-based feature selection method to choose an optimal subset of features. In addition, an OGRU-based classification model is employed to classify the big data into appropriate classes. The hyperparameters of the GRU model are tuned using the Adam optimizer. Furthermore, the Apache Spark platform is applied to process big data effectively. To validate the SBOA-OGRU technique, a wide range of experiments were performed, and the experimental results highlighted its superiority.
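The "sigmoid" in SBOA refers to a transfer function that turns the butterfly optimizer's continuous position vectors into binary feature masks. The abstract does not specify the exact binarization rule, so the sketch below uses a common variant that thresholds the squashed value at 0.5; the positions are made-up example values.

```python
import math

def sigmoid(x):
    """Standard logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, threshold=0.5):
    # Feature i is selected (1) when its squashed position exceeds the
    # threshold, dropped (0) otherwise. Stochastic variants compare
    # against a random draw instead of a fixed threshold.
    return [1 if sigmoid(p) > threshold else 0 for p in position]

# A candidate solution: strongly positive components select their feature.
mask = binarize([2.0, -3.0, 0.1, 4.0])
print(mask)  # [1, 0, 1, 1]
```

The resulting 0/1 mask picks the feature columns that are fed to the OGRU classifier; the optimizer then scores each mask by the classifier's accuracy.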
Abstract: To address the configuration difficulties caused by differences among students' computers in big data courses, this paper proposes a unified development environment solution based on VSCode Dev Containers and Docker containerization. The solution adopts a three-layer architecture and, through standardized configuration files and container images, enables students to quickly deploy a consistent Apache Spark development environment on their personal computers. Practice shows that the solution shortens environment preparation time from 90-180 minutes to 15 minutes, effectively eliminates cross-platform differences, guarantees the reproducibility of experimental results, and significantly improves the teaching efficiency of big data courses.
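A standardized configuration file of the kind described is a `devcontainer.json` checked into the course repository. The sketch below is a minimal illustrative example, not the course's actual file; the image name, ports, and extension list are assumptions.

```json
{
  "name": "bigdata-course-spark",
  "image": "jupyter/pyspark-notebook:latest",
  "forwardPorts": [4040, 8888],
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  }
}
```

When a student opens the repository in VSCode and chooses "Reopen in Container", Docker pulls the image and the same Spark environment comes up on Windows, macOS, or Linux alike.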
Funding: Supported by the National Natural Science Foundation of China [No. 42090010, No. 41971349] and the Fundamental Research Funds for the Central Universities, China [No. 2042022dx0001].
Abstract: Clustering by local direction centrality (CDC) is a newly proposed versatile algorithm adept at identifying clusters with heterogeneous density and weak connectivity. Its advantages in accuracy and robustness have been widely validated in computer science, bioscience, and geoscience. However, it has quadratic time complexity due to costly K-nearest neighbor (KNN) search and internal connection operations, which hinders its ability to handle large-scale datasets. To improve its computational efficiency and scalability, we propose a performance-enhanced distributed framework of CDC, named D-CDC, built on workflow-level algorithm optimization and distributed computational acceleration. Specifically, KD-tree spatial indexing is leveraged to reduce the KNN search complexity to logarithmic time, and KNN constraints and disjoint sets are introduced to decrease the computational cost of internal connection. Besides, to minimize cross-partition communication, we designed an Improved QuadTree (ImprovedQT) spatial partitioning method that considers cluster completeness and shape regularity. We then implemented D-CDC on the Apache Spark framework using Resilient Distributed Dataset (RDD) customization techniques. Experiments on six synthetic datasets demonstrate that D-CDC generally preserves the clustering accuracy of the original CDC and achieves up to 600-fold speedup, reducing the runtime from 142,590 s to 236 s on million-scale datasets. A real-world case study on over 2 million enterprise registration POI data in the Chinese mainland further validates that D-CDC can efficiently identify fine-grained and weakly connected aggregation patterns in large-scale geospatial data.
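The disjoint-set structure that cheapens the internal-connection step can be sketched as a standard union-find: each edge that passes CDC's connection test merges two components, and the final roots are the cluster labels. The edge list below is a hypothetical stand-in for those passing pairs.

```python
class DisjointSet:
    """Union-find with path halving, as used to merge connected points."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Hypothetical edges between points that passed the internal-connection test.
edges = [(0, 1), (1, 2), (3, 4)]
ds = DisjointSet(5)
for a, b in edges:
    ds.union(a, b)
labels = [ds.find(i) for i in range(5)]
print(labels)  # [0, 0, 0, 3, 3] -> two clusters: {0,1,2} and {3,4}
```

Each union/find is near-constant amortized time, so merging all passing pairs is effectively linear in the number of edges rather than quadratic in the number of points.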
Abstract: This article delves into the intricate relationship between big data, cloud computing, and artificial intelligence, shedding light on their fundamental attributes and interdependence. It explores the integration of AI methodologies within cloud computing and big data analytics, encompassing the development of a cloud computing framework built on the Hadoop platform and enriched by AI learning algorithms. Additionally, it examines the creation of a predictive model empowered by tailored artificial intelligence techniques. Simulations are conducted within the Hadoop environment to extract insights, evaluate the method, and assess its performance, confirming the precision of the proposed approach. The results and analysis section reveals findings derived from these simulations, demonstrating the efficacy of the Sport AI Model (SAIM) framework in enhancing the accuracy of sports-related outcome predictions. Through mathematical analyses and performance assessments, integrating AI with big data emerges as a powerful tool for optimizing decision-making in sports. The discussion section extends the implications of these results, highlighting the potential for SAIM to improve sports forecasting, strategic planning, and performance optimization for players and coaches. The combination of big data, cloud computing, and AI offers a promising avenue for future advancements in sports analytics. This research underscores the synergy between these technologies and paves the way for innovative approaches to sports-related decision-making and performance enhancement.
Funding: This research is funded by NASA (National Aeronautics and Space Administration) NCCS and AIST (NNX15AM85G), and the NSF I/UCRC, CSSI, and EarthCube Programs (1338925 and 1835507).
Abstract: Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, the distributed physical data storage model, and the data pipeline in distributed computing frameworks. To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with the Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting a chunking data structure; (2) keep workload balance and high data locality by building a global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTIFF) by building a local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune query parallelism while keeping high data locality. These strategies are implemented by developing customized RDDs and are evaluated by comparing their performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data queries with high efficiency.
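The local hash-table index over chunks (aspects 1 and 3 above) can be illustrated with a toy layout: each chunk key maps to a byte range in the file, and a window query reads only the chunks it overlaps. Chunk size, keys, and offsets below are made-up values for illustration, not the paper's actual layout.

```python
# Local index: (row_chunk, col_chunk) -> (byte offset, byte length)
# for a raster split into 256x256-cell chunks.
local_index = {
    (0, 0): (0, 4096),
    (0, 1): (4096, 4096),
    (1, 0): (8192, 4096),
    (1, 1): (12288, 4096),
}

def chunks_for_query(row_range, col_range, chunk_size=256):
    """Return the chunk keys that a query window overlaps."""
    r0, r1 = row_range
    c0, c1 = col_range
    return sorted(
        (r, c)
        for r in range(r0 // chunk_size, r1 // chunk_size + 1)
        for c in range(c0 // chunk_size, c1 // chunk_size + 1)
    )

# A window spanning rows 100-300, columns 10-20 touches two chunks,
# so only 8 KB of the file needs to be read instead of all 16 KB.
hits = chunks_for_query((100, 300), (10, 20))
bytes_to_read = sum(local_index[k][1] for k in hits)
print(hits, bytes_to_read)  # [(0, 0), (1, 0)] 8192
```

In the full system the global k-d tree would first route the query to the right HDFS blocks, and this per-file lookup would then select the chunks within each block.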