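Abstract: With the spread of high-definition cameras in urban traffic, the volume of vehicle-passing records collected at traffic checkpoints has grown explosively. This massive data set provides a rich basis for constructing traffic origin-destination (OD) matrices. At this scale, however, traditional serial computation is slow to process the data and slow to respond, and it cannot meet the demands of real-time analysis. To address this bottleneck, the paper designs a Spark-based Calculation of Traffic OD Matrix (Spark-CoTODM) method, which uses the distributed parallel computing capability of the Spark platform to parallelize OD matrix generation and thereby greatly shorten computation time. Test results show that the proposed method markedly improves execution efficiency when processing large-scale data sets.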
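The abstract does not spell out how OD pairs are derived from checkpoint records, so the following is only a minimal single-machine sketch of the general idea, not the Spark-CoTODM method itself. It assumes a hypothetical record layout of (plate, checkpoint, timestamp) and a hypothetical rule that a vehicle's earliest checkpoint is its origin and its latest checkpoint is its destination. On Spark, the two steps would plausibly map onto a group-by-key over plates followed by a count-by-key over OD pairs.

```python
from collections import Counter, defaultdict

# Hypothetical checkpoint records: (plate, checkpoint_id, timestamp_seconds).
records = [
    ("A001", "K1", 100), ("A001", "K2", 220), ("A001", "K3", 400),
    ("B002", "K2", 150), ("B002", "K3", 300),
    ("C003", "K1", 90), ("C003", "K3", 500),
    ("D004", "K2", 130),  # seen only once, so no trip can be formed
]


def od_matrix(records):
    """Count (origin, destination) pairs, one trip per vehicle (assumed rule)."""
    # Grouping step: collect the passing events of each vehicle.
    by_vehicle = defaultdict(list)
    for plate, checkpoint, ts in records:
        by_vehicle[plate].append((ts, checkpoint))

    # Counting step: earliest checkpoint is the origin, latest the destination.
    counts = Counter()
    for events in by_vehicle.values():
        if len(events) < 2:
            continue
        events.sort()
        origin, destination = events[0][1], events[-1][1]
        if origin != destination:
            counts[(origin, destination)] += 1
    return counts


matrix = od_matrix(records)
for (origin, destination), trips in sorted(matrix.items()):
    print(origin, destination, trips)
```

Because each vehicle's events are processed independently of every other vehicle's, the grouping key (the plate) is a natural unit for partitioning the work across a cluster.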
Funding: This work is supported by the Science Research Projects of Hunan Provincial Education Department (Nos. 18A174, 18C0262), the National Natural Science Foundation of China (No. 61772561), the Key Research & Development Plan of Hunan Province (Nos. 2018NK2012, 2019SK2022), the Degree & Postgraduate Education Reform Project of Hunan Province (No. 209), and the Postgraduate Education and Teaching Reform Project of Central South Forestry University (No. 2019JG013).
Abstract: Text topic clustering on a stand-alone architecture is slow at big-data scale. Taking news text as the research object, this paper proposes an LDA text topic clustering algorithm based on the Spark big data platform. Because the TF-IDF (term frequency-inverse document frequency) algorithm under Spark maps words irreversibly, the mapped word indexes cannot be traced back to the original words. The paper therefore proposes an optimized TF-IDF method under Spark that ensures the text words can be restored. First, text features are extracted by the proposed TF-IDF algorithm combined with CountVectorizer; the features are then input to the LDA (Latent Dirichlet Allocation) topic model for training; finally, the text topic clusters are obtained. Experimental results show that, for large data samples, running on Spark improves the processing speed of LDA topic model clustering. In addition, compared with an LDA topic model that takes word-frequency input, the proposed model achieves lower perplexity.
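The reversibility issue can be illustrated without Spark. The sketch below is a plain-Python illustration under stated assumptions, not the paper's implementation: it assumes the irreversible mapping refers to hashing-based term frequencies (as in Spark ML's HashingTF), and the toy corpus, the bucket count, and the helper names `hashed_index`, `build_vocabulary`, and `tfidf` are all hypothetical. A hashed index keeps only a bucket number, whereas a vocabulary-based index (the approach CountVectorizer takes) can be looked up to recover the word.

```python
import math
import zlib
from collections import Counter

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["spark", "lda", "topic"],
    ["spark", "news", "text"],
    ["topic", "news", "clustering"],
]


def hashed_index(word, num_features=8):
    """Hashing trick: map a word to a bucket; the word itself is not stored."""
    return zlib.crc32(word.encode("utf-8")) % num_features


def build_vocabulary(docs):
    """Vocabulary approach: every distinct word gets its own index in a list."""
    return sorted({word for doc in docs for word in doc})


def tfidf(docs, vocabulary):
    """TF-IDF per document with the smoothed IDF log((N + 1) / (df + 1))."""
    index = {word: i for i, word in enumerate(vocabulary)}
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {index[w]: tf[w] * math.log((n_docs + 1) / (df[w] + 1)) for w in tf}
        )
    return vectors


# With hashing, only a bucket number survives; distinct words may collide.
print(hashed_index("spark"))

# With a vocabulary, the highest-weighted index maps straight back to a word.
vocabulary = build_vocabulary(docs)
vectors = tfidf(docs, vocabulary)
top_index = max(vectors[0], key=vectors[0].get)
print(vocabulary[top_index])
```

Keeping the vocabulary alongside the feature vectors is what lets the top-weighted terms of each LDA topic be reported as words rather than as opaque indexes.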