Supervised machine learning techniques require labelled multivariate training datasets. Many approaches address the issue of unlabelled datasets by tightly coupling machine learning algorithms with interactive visualisations. Using appropriate techniques, analysts can play an active role in a highly interactive and iterative machine learning process to label the dataset and create meaningful partitions. While this principle has been implemented either for unsupervised, semi-supervised, or supervised machine learning tasks, the combination of all three methodologies remains challenging. In this paper, a visual analytics approach is presented, combining a variety of machine learning capabilities with four linked visualisation views, all integrated within the mVis (multivariate Visualiser) system. The available palette of techniques allows an analyst to perform exploratory data analysis on a multivariate dataset and divide it into meaningful labelled partitions, from which a classifier can be built. In the workflow, the analyst can label interesting patterns or outliers in a semi-supervised process supported by active learning. Once a dataset has been interactively labelled, the analyst can continue the workflow with supervised machine learning to assess to what degree the subsequent classifier has effectively learned the concepts expressed in the labelled training dataset. Using a novel technique called automatic dimension selection, interactions the analyst had with dimensions of the multivariate dataset are used to steer the machine learning algorithms. A real-world football dataset is used to show the utility of mVis for a series of analysis and labelling tasks, from initial labelling through iterations of data exploration, clustering, classification, and active learning to refine the named partitions, to finally producing a high-quality labelled training dataset suitable for training a classifier. The tool empowers the analyst with interactive visualisations including scatterplots, parallel coordinates, similarity maps for records, and a new similarity map for partitions.
Methods from supervised machine learning allow the classification of new data automatically and are tremendously helpful for data analysis. The quality of supervised machine learning depends not only on the type of algorithm used, but also on the quality of the labelled dataset used to train the classifier. Labelling instances in a training dataset is often done manually, relying on selections and annotations by expert analysts, and is often a tedious and time-consuming process. Active learning algorithms can automatically determine a subset of data instances for which labels would provide useful input to the learning process. Interactive visual labelling techniques are a promising alternative, providing effective visual overviews from which an analyst can simultaneously explore data records and select items to label. By putting the analyst in the loop, higher accuracy can be achieved in the resulting classifier. While initial results of interactive visual labelling techniques are promising in the sense that user labelling can improve supervised learning, many aspects of these techniques are still largely unexplored. This paper presents a study conducted using the mVis tool to compare three interactive visualisations, similarity map, scatterplot matrix (SPLOM), and parallel coordinates, with each other and with active learning for the purpose of labelling a multivariate dataset. The results show that all three interactive visual labelling techniques surpass active learning algorithms in terms of classifier accuracy, and that users subjectively prefer the similarity map over SPLOM and parallel coordinates for labelling. Users also employ different labelling strategies depending on the visualisation used.
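The abstract does not state which active learning strategy the study compared against. Purely as an illustration of how an active learning algorithm can pick the instances whose labels would be most useful, the sketch below assumes least-confidence uncertainty sampling, one common strategy. The function names and the probability values are hypothetical and are not taken from mVis or from the paper.

```python
from typing import List, Sequence


def least_confidence(probabilities: Sequence[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probabilities)


def select_queries(
    class_probabilities: Sequence[Sequence[float]], budget: int
) -> List[int]:
    """Return the indices of the `budget` most uncertain unlabelled records."""
    scores = [least_confidence(p) for p in class_probabilities]
    # Rank record indices by descending uncertainty; sorted() is stable,
    # so records with equal scores keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:budget]


# Hypothetical class probabilities predicted by some classifier
# for five unlabelled records (assumed values, for illustration only).
predicted = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.51, 0.49],
    [0.70, 0.30],
]

# Ask the analyst to label the two records the classifier is least sure about.
queries = select_queries(predicted, budget=2)
print(queries)
```

In an interactive setting, the records returned by `select_queries` would be the ones presented to the analyst for labelling, after which the classifier is retrained and the selection repeated.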