In the era of big data, huge volumes of data are generated by online social networks, sensor networks, mobile devices, and organizations’ enterprise systems. This gives organizations unprecedented opportunities to mine big data for valuable business intelligence. However, traditional business analytics methods may not be able to cope with the flood of big data. The main contribution of this paper is to illustrate the development of a novel big data stream analytics framework named BDSASA, which leverages a probabilistic language model to analyze the consumer sentiments embedded in hundreds of millions of online consumer reviews. In particular, an inference model is embedded into the classical language modeling framework to enhance the prediction of consumer sentiments. The practical implication of our research is that organizations can apply our big data stream analytics framework to analyze consumers’ product preferences and hence develop more effective marketing and production strategies.
The rising popularity of online social networks (OSNs), such as Twitter, Facebook, MySpace, and LinkedIn, in recent years has sparked great interest in sentiment analysis on their data. While many methods exist for identifying sentiment in OSNs, such as communication pattern mining and classification based on emoticons and parts of speech, the majority of them use a suboptimal batch-mode learning approach when analyzing large amounts of real-time data. As an alternative, we present a stream algorithm using Modified Balanced Winnow for sentiment analysis on OSNs. Tested on three real-world network datasets, our sentiment predictions perform close to batch learning while also detecting important features dynamically for sentiment analysis in data streams. These top features reveal key words important to the analysis of sentiment.
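To make the idea of a mistake-driven stream learner concrete, the following is a minimal sketch of a plain Balanced Winnow classifier over bag-of-words features. It is not the paper's Modified Balanced Winnow: the class name, the parameter values (`alpha`, `beta`, `margin`), the zero threshold, and the toy messages are all assumptions chosen for illustration.

```python
from collections import defaultdict


class BalancedWinnow:
    """Simplified Balanced Winnow for binary features (illustrative only)."""

    def __init__(self, alpha=1.5, beta=0.5, margin=0.0):
        self.alpha = alpha      # promotion factor, > 1 (assumed value)
        self.beta = beta        # demotion factor, between 0 and 1 (assumed value)
        self.margin = margin    # update whenever label * score <= margin
        self.pos = defaultdict(lambda: 1.0)  # positive weight per feature
        self.neg = defaultdict(lambda: 1.0)  # negative weight per feature

    def score(self, features):
        # Each active feature contributes the difference of its two weights.
        return sum(self.pos[f] - self.neg[f] for f in set(features))

    def predict(self, features):
        return 1 if self.score(features) > 0 else -1

    def learn_one(self, features, label):
        # Each instance is seen once; weights change only on a mistake
        # (or when the score falls inside the margin).
        if label * self.score(features) <= self.margin:
            up, down = (self.alpha, self.beta) if label == 1 else (self.beta, self.alpha)
            for f in set(features):
                self.pos[f] *= up
                self.neg[f] *= down

    def top_features(self, k=3):
        # Features with the largest absolute net weight so far.
        net = {f: self.pos[f] - self.neg[f] for f in list(self.pos)}
        return sorted(net, key=lambda f: abs(net[f]), reverse=True)[:k]


# Hypothetical toy stream of (tokens, label) pairs; 1 = positive, -1 = negative.
stream = [
    ({"great", "phone"}, 1),
    ({"terrible", "phone"}, -1),
    ({"great", "battery"}, 1),
    ({"terrible", "screen"}, -1),
]
model = BalancedWinnow()
for tokens, label in stream:
    model.learn_one(tokens, label)
```

Because the weights are updated multiplicatively and only on errors, the model never needs to revisit earlier instances, and the features with the largest net weight can be read off at any point in the stream.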
In this work we discuss SDSPbMM, an integrated Strategy for Data Stream Processing based on Measurement Metadata, applied to an outpatient monitoring scenario. The measures associated with the attributes of the patient (entity) under monitoring come from heterogeneous data sources as data streams, together with metadata associated with the formal definition of a measurement and evaluation project. Such metadata supports patient analysis and monitoring in a more consistent way, facilitating, for instance: i) the early detection of typical data problems such as missing values and outliers; and ii) risk anticipation by means of on-line classification models adapted to the patient. We also performed a simulation using a prototype developed for outpatient monitoring in order to analyze processing times and variable scalability empirically, which sheds light on the feasibility of applying the prototype to real situations. In addition, we statistically analyze the results of the simulation in order to detect the components that contribute the most variability to the system.
With the huge increase in popularity of Twitter in recent years, the ability to draw information regarding public sentiment from Twitter data has become an area of immense interest. Numerous methods of determining the sentiment of tweets, both in general and in regard to a specific topic, have been developed; however, most of these operate in a batch learning environment where instances may be passed over multiple times. Since Twitter data in real-world situations are far more similar to a stream environment, we propose several algorithms that classify the sentiment of tweets in a data stream. We were able to determine whether a tweet was subjective or objective with an error rate as low as 0.24 and an F-score as high as 0.85. For the determination of positive or negative sentiment in subjective tweets, an error rate as low as 0.23 and an F-score as high as 0.78 were achieved.
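As an illustration of how an error rate and an F-score can be obtained in a stream setting, the sketch below uses prequential (test-then-train) evaluation, in which each instance is first used for prediction and then for training. This is one common protocol and an assumption here; the abstract does not state which protocol the authors used. The `MajorityClass` learner, the function name, and the toy labels are hypothetical.

```python
from collections import Counter


class MajorityClass:
    """Hypothetical baseline: predicts the most frequent label seen so far."""

    def __init__(self, default=1):
        self.counts = Counter()
        self.default = default

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else self.default

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_eval(model, stream, positive=1):
    """Single pass over the stream: predict first, then train on the instance."""
    tp = fp = fn = errors = n = 0
    for x, y in stream:
        y_hat = model.predict(x)   # test on the unseen instance...
        model.learn_one(x, y)      # ...then use it for training
        n += 1
        errors += int(y_hat != y)
        tp += int(y_hat == positive and y == positive)
        fp += int(y_hat == positive and y != positive)
        fn += int(y_hat != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"error_rate": errors / n if n else 0.0, "f_score": f_score}


# Hypothetical toy stream: 1 = subjective, 0 = objective; features are unused here.
toy_stream = [(None, 1), (None, 1), (None, 0), (None, 1), (None, 0), (None, 0)]
result = prequential_eval(MajorityClass(), toy_stream)
```

Any learner exposing `predict` and `learn_one` could be dropped in place of the baseline, so the same loop can compare several stream classifiers on identical data.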
Automatic Identification System (AIS) data stream analysis is based on AIS data describing the behaviours of different vessels, including the vessels’ routes. When the AIS data contain outliers or noise, or are incomplete, the analysis of a vessel’s behaviour is not possible or is limited. When the data contain outliers, it is not possible to automatically assign the AIS data to a particular vessel. In this paper, a clustering method is proposed to support AIS data analysis, to qualify noise and outliers with respect to their suitability, and finally to aid the reconstruction of the vessel’s trajectory. Clustering results have been obtained using selected algorithms, including k-means, k-medoids, and fuzzy c-means. Based on the clustering results, it is possible to decide on the qualification of data with outliers and on their usefulness in the reconstruction of the vessel trajectory. The main aim of this paper is to answer how different distance measures used during clustering can influence AIS data clustering quality. The core question is whether or not they have an impact on the reconstruction of vessel trajectories when the data are damaged. The research question in the computational experiments was whether or not the distance measure influences AIS data clustering quality. The computational experiments have been carried out using original AIS data. In general, the experiments and the results confirm the usefulness of cluster-based analysis when the data include outliers that are derived from the natural environment. It is also possible to monitor and analyse AIS data using clustering when the data include outliers. The computational experiment results confirm that k-means with Euclidean distance has the best performance.
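The role of the distance measure in clustering can be sketched with a small k-means loop in which the assignment step accepts either Euclidean or Manhattan distance. This is an illustrative simplification, not the authors' experimental setup: the function names, the two metrics offered, the fixed initial centroids, and the toy coordinates are assumptions, and the centroid update always uses the mean, which is only the exact minimizer for Euclidean distance.

```python
import numpy as np


def pairwise_distance(points, centroids, metric="euclidean"):
    """Distance from every point to every centroid, shape (n_points, k)."""
    diff = points[:, None, :] - centroids[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unknown metric: {metric}")


def kmeans(points, init_centroids, metric="euclidean", n_iter=20):
    """Lloyd-style k-means with a selectable assignment metric."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid under the chosen metric.
        labels = pairwise_distance(points, centroids, metric).argmin(axis=1)
        # Update step: mean of the members (empty clusters keep their centroid).
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    labels = pairwise_distance(points, centroids, metric).argmin(axis=1)
    return labels, centroids


# Hypothetical 2-D positions standing in for AIS reports from two vessels.
positions = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                      [5.0, 5.0], [5.1, 5.2], [4.9, 5.0]])
labels_euclid, _ = kmeans(positions, positions[[0, 3]], metric="euclidean")
labels_manh, _ = kmeans(positions, positions[[0, 3]], metric="manhattan")
```

On well-separated toy data both metrics give the same partition; differences between distance measures would only be expected to show up on noisier data such as the damaged AIS records the abstract describes.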