Funding: This study was financially supported by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), the Ministry of Health and Welfare (HI18C1216), and the Soonchunhyang University Research Fund.
Abstract: Big data applications in healthcare have provided a variety of solutions to reduce costs, errors, and waste. This work aims to develop a real-time system, based on big medical data processing in the cloud, for the prediction of health issues. In the proposed scalable system, medical parameters are sent to Apache Spark, which extracts attributes from the data and applies the proposed machine learning algorithm. In this way, healthcare risks can be predicted and sent as alerts and recommendations to users and healthcare providers. The proposed work also aims to provide an effective recommendation system that uses streaming medical data, historical data in a user's profile, and a knowledge database to make the most appropriate real-time recommendations and alerts based on the sensors' measurements. The proposed scalable system works by tweeting the health status attributes of users. Their cloud profile receives the streaming healthcare data in real time, and a machine learning prediction algorithm extracts the health attributes to predict the users' health status. Subsequently, their status can be sent on demand to healthcare providers. Therefore, machine learning algorithms can be applied to streaming healthcare data from wearables to provide users with insights into their health status. These algorithms can help healthcare providers and individuals focus on health risks and health status changes and, consequently, improve quality of life.
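The abstract does not include code; as a rough illustration of the pipeline it describes (streamed vitals → attribute extraction → risk prediction → alert), the sketch below uses a plain Python generator in place of a Spark stream, and a hypothetical threshold rule in place of the paper's trained classifier — the reading fields and thresholds are illustrative assumptions, not the authors' model:

```python
# Minimal sketch of the stream -> extract -> predict -> alert pipeline.
# Field names ("hr", "sys") and thresholds are invented for the example.

def extract_attributes(reading):
    """Pull the model's input attributes out of a raw sensor reading."""
    return {"heart_rate": reading["hr"], "systolic_bp": reading["sys"]}

def predict_risk(attrs):
    """Stand-in for the trained classifier: flag out-of-range vitals."""
    if attrs["heart_rate"] > 120 or attrs["systolic_bp"] > 160:
        return "high"
    return "normal"

def process_stream(readings):
    """Yield a (user, status) alert for each incoming reading."""
    for reading in readings:
        status = predict_risk(extract_attributes(reading))
        yield reading["user"], status

stream = [
    {"user": "u1", "hr": 72, "sys": 118},
    {"user": "u2", "hr": 140, "sys": 150},
]
alerts = list(process_stream(stream))
print(alerts)  # [('u1', 'normal'), ('u2', 'high')]
```

In the real system, `process_stream` would be a Spark streaming job and `predict_risk` a model fitted on historical profile data.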
Abstract: Big data analytics is a popular research topic due to its applicability in various real-time applications. Recent machine learning and deep learning models can be applied to analyze big data with better performance. Since big data involves numerous features and demands high computational time, feature selection methodologies using metaheuristic optimization algorithms can be adopted to choose an optimum set of features and thereby improve overall classification performance. This study proposes a new sigmoid butterfly optimization algorithm with an optimum gated recurrent unit (SBOA-OGRU) model for big data classification in Apache Spark. The SBOA-OGRU technique involves the design of an SBOA-based feature selection technique to choose an optimum subset of features. In addition, an OGRU-based classification model is employed to classify the big data into appropriate classes. The hyperparameters of the GRU model are tuned using the Adam optimizer. Furthermore, the Apache Spark platform is applied to process the big data effectively. To validate the SBOA-OGRU technique, a wide range of experiments were performed, and the experimental results highlighted the superiority of the SBOA-OGRU technique.
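The abstract does not spell out how the "sigmoid" variant binarizes the optimizer's search space; a common construction (and an assumption here) is a sigmoid transfer function that maps each butterfly's continuous position to a binary keep/drop mask over the features:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def position_to_mask(position, threshold=0.5):
    """Map a continuous position vector to a binary feature-selection
    mask: feature i is kept when sigmoid(x_i) exceeds the threshold.
    The fixed 0.5 threshold is an assumption for this sketch; some
    variants compare against a random draw instead."""
    return [1 if sigmoid(x) > threshold else 0 for x in position]

mask = position_to_mask([2.0, -3.0, 0.1, -0.1])
print(mask)  # [1, 0, 1, 0] -> features 0 and 2 are selected
```

The optimizer then scores each mask by the classification performance of the downstream model on the selected feature subset.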
Funding: Supported by the National Natural Science Foundation of China [Nos. 42090010 and 41971349] and the Fundamental Research Funds for the Central Universities, China [No. 2042022dx0001].
Abstract: Clustering by local direction centrality (CDC) is a newly proposed versatile algorithm adept at identifying clusters with heterogeneous density and weak connectivity. Its advantages in accuracy and robustness have been widely validated in computer science, bioscience, and geoscience. However, it has quadratic time complexity due to costly K-nearest neighbor (KNN) search and internal connection operations, which hinders its ability to handle large-scale datasets. To improve its computational efficiency and scalability, we propose a performance-enhanced distributed framework for CDC, named D-CDC, based on workflow-level algorithm optimization and distributed computational acceleration. Specifically, KD-tree spatial indexing is leveraged to reduce the KNN search complexity to logarithmic time, and KNN constraints and disjoint sets are introduced to decrease the computational cost of internal connection. Furthermore, to minimize cross-partition communication, we designed an Improved QuadTree (ImprovedQT) spatial partitioning method that considers cluster completeness and shape regularity. We then implemented D-CDC on the Apache Spark framework using Resilient Distributed Dataset (RDD) customization techniques. Experiments on six synthetic datasets demonstrate that D-CDC generally preserves the clustering accuracy of the original CDC and achieves up to 600-fold speedup, reducing the runtime from 142,590 s to 236 s on million-scale datasets. A real-world case study on over 2 million enterprise registration POI data in the Chinese mainland further validates that D-CDC can efficiently identify fine-grained and weakly connected aggregation patterns in large-scale geospatial data.
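As a toy illustration (not the authors' implementation) of how disjoint sets can replace pairwise internal-connection checks, the sketch below merges points that appear in each other's K-nearest neighbors and reads cluster labels off a small union-find; the mutual-KNN merge rule is an assumed simplification of CDC's actual connection step:

```python
import heapq

def knn(points, k):
    """Brute-force KNN by squared distance; a KD-tree would replace
    this O(n^2) search in the real framework."""
    nbrs = []
    for i, (xi, yi) in enumerate(points):
        dists = [((xi - xj) ** 2 + (yi - yj) ** 2, j)
                 for j, (xj, yj) in enumerate(points) if j != i]
        nbrs.append([j for _, j in heapq.nsmallest(k, dists)])
    return nbrs

def cluster(points, k):
    """Union points under a mutual-KNN constraint; return root labels."""
    parent = list(range(len(points)))

    def find(a):  # find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    nbrs = knn(points, k)
    for i, js in enumerate(nbrs):
        for j in js:
            if i in nbrs[j]:          # merge only mutual neighbors
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
labels = cluster(pts, k=2)
print(len(set(labels)))  # 2: the origin trio and the (10, *) pair
```

Each union/find is near-constant amortized time, so the connection pass costs roughly O(n·k) instead of checking all point pairs.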
Funding: This research is funded by NASA (National Aeronautics and Space Administration) NCCS and AIST (NNX15AM85G), and by the NSF I/UCRC, CSSI, and EarthCube Programs (1338925 and 1835507).
Abstract: Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to query these big raster data efficiently due to the inconsistency among the geospatial raster data model, the distributed physical data storage model, and the data pipeline in distributed computing frameworks. To process big geospatial data efficiently, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with the Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting a chunked data structure; (2) keep the workload balanced with high data locality by building a global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTIFF) by building a local index (hash table); (4) index the in-memory data to further speed up geospatial data queries; (5) develop a data repartition strategy to tune query parallelism while keeping high data locality. These strategies are implemented as customized RDDs and evaluated against Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data queries with high efficiency.
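As a simplified illustration of the chunking and hash-table local-index idea (array dimensions and chunk sizes here are made up, and real chunks would map to byte offsets in an HDF4/NetCDF4 file rather than row ranges), the sketch below tiles a 2-D array and answers a bounding-box query by touching only the intersecting chunks:

```python
def build_chunk_index(n_rows, n_cols, chunk):
    """Hash table mapping (chunk_row, chunk_col) -> row/col extent.
    Plays the role of the local index over on-disk chunks."""
    index = {}
    for r0 in range(0, n_rows, chunk):
        for c0 in range(0, n_cols, chunk):
            index[(r0 // chunk, c0 // chunk)] = (
                r0, min(r0 + chunk, n_rows),
                c0, min(c0 + chunk, n_cols),
            )
    return index

def query(index, r_lo, r_hi, c_lo, c_hi):
    """Return chunk keys whose extent intersects the bounding box,
    so only those chunks need to be read and deserialized."""
    hits = []
    for key, (r0, r1, c0, c1) in sorted(index.items()):
        if r0 < r_hi and r1 > r_lo and c0 < c_hi and c1 > c_lo:
            hits.append(key)
    return hits

idx = build_chunk_index(100, 100, chunk=50)
hits = query(idx, r_lo=40, r_hi=60, c_lo=0, c_hi=10)
print(hits)  # [(0, 0), (1, 0)] -- only 2 of the 4 chunks are touched
```

In the paper's three-layer design, a k-d tree over chunk extents would answer this intersection test at the global level, with the hash table resolving each hit to its physical location.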
Funding: Supported by the National Natural Science Foundation of China (No. 61272447); Sichuan Province Science and Technology Planning (Nos. 2016GZ0042, 16ZHSF0483, and 2017GZ0168); the Key Research Project of the Sichuan Provincial Department of Education (Nos. 17ZA0238 and 17ZA0200); and the Scientific Research Starting Foundation for Young Teachers of Sichuan University (No. 2015SCU11079).
Abstract: Extracting and analyzing network traffic features is fundamental to the design and implementation of network behavior anomaly detection methods. Traditional network traffic feature methods focus on the statistical features of traffic volume. However, this approach is not sufficient to reflect communication pattern features. A different approach is required to detect anomalous behaviors that do not exhibit traffic volume changes, such as low-intensity anomalous behaviors caused by Denial of Service/Distributed Denial of Service (DoS/DDoS) attacks, Internet worms and scanning, and botnets. We propose an efficient traffic feature extraction architecture that combines the benefits of traffic volume features and network communication pattern features. This method can detect both low-intensity anomalous network behaviors and conventional traffic volume anomalies. We implemented our approach on Spark Streaming and validated our feature set using a labelled real-world dataset collected from the Sichuan University campus network. Our results demonstrate that the traffic feature extraction approach is efficient in detecting both traffic variations and communication structure changes. In our evaluation on the MIT-DARPA dataset, the same detection approach achieves a detection precision of 82.3% with traffic volume features and 89.9% with communication pattern features, while our proposed combined feature set improves precision to 94%.
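As a minimal illustration (flow-record fields are invented for the example) of pairing a volume feature with a communication-pattern feature, the sketch below computes, per source host in a window, the packet count and the number of distinct destination peers; a scanner stands out by its peer count even when its volume is negligible:

```python
from collections import defaultdict

def window_features(flows):
    """Per-source features for one time window:
    volume  = total packets sent,
    pattern = number of distinct destination peers contacted."""
    packets = defaultdict(int)
    peers = defaultdict(set)
    for src, dst, pkts in flows:
        packets[src] += pkts
        peers[src].add(dst)
    return {s: (packets[s], len(peers[s])) for s in packets}

flows = [
    ("10.0.0.1", "10.0.0.2", 500),   # normal bulk transfer: high volume, one peer
    ("10.0.0.9", "10.0.0.2", 1),     # low-intensity scan: tiny volume...
    ("10.0.0.9", "10.0.0.3", 1),
    ("10.0.0.9", "10.0.0.4", 1),     # ...but many distinct peers
]
feats = window_features(flows)
print(feats["10.0.0.9"])  # (3, 3)
```

A volume-only detector would ignore `10.0.0.9` entirely; in the streaming architecture the same per-window aggregation runs as a keyed reduce over the flow stream.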
Abstract: To address the configuration difficulties caused by heterogeneous student computing environments in big data courses, this paper proposes a unified development environment solution based on VSCode Dev Containers and Docker containerization. The solution adopts a three-layer architecture and uses standardized configuration files and container images so that students can quickly deploy a consistent Apache Spark development environment on their personal computers. In practice, the solution shortened environment preparation time from 90–180 minutes to 15 minutes, effectively eliminated cross-platform differences, ensured the reproducibility of experimental results, and significantly improved the teaching efficiency of big data courses.
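The abstract does not reproduce the course's configuration files; a minimal `devcontainer.json` along the lines it describes might look like the sketch below, where the image name, extension list, and post-create command are illustrative assumptions rather than the course's actual setup:

```json
{
  "name": "spark-course",
  "image": "example/spark-course-env:latest",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  },
  "forwardPorts": [4040],
  "postCreateCommand": "spark-submit --version"
}
```

With such a file committed to the course repository, students open the folder in VS Code and the Dev Containers extension pulls the image and attaches the editor to it, so every machine runs the same Spark environment regardless of the host operating system.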