
An Approach for Storing and Accessing Small Files on Hadoop

Cited by: 38
Abstract: HDFS (Hadoop Distributed File System) is widely used in cloud-computing applications because of its high fault tolerance, scalability, and low-cost storage. However, HDFS was designed primarily for streaming access to very large files; with massive numbers of small files, both storage and read performance suffer, largely because of NameNode memory overhead. This paper proposes HIFM (Hierarchy Index File Merging), an approach based on merging small files. HIFM takes into account both the correlations between small files and the directory structure of the data when merging small files into large ones, and it generates a hierarchical index. Index files are managed through a combination of centralized and distributed storage, and they are preloaded. In addition, HIFM uses a data prefetching mechanism to improve the efficiency of sequential access to small files. Experimental results show that HIFM effectively improves the efficiency of storing and reading small files and significantly reduces the memory overhead of the NameNode and DataNodes. It is suited to applications that store massive numbers of small files organized in some directory structure.
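Source: Computer Applications and Software (《计算机应用与软件》), indexed in CSCD and the Peking University Core Journals list, 2012, No. 11, pp. 95-100 (6 pages)
Funding: Major Science and Technology Project of Press and Publication (0610-1041BJNF2328/23); National Key Technology R&D Program of China (2011BAH14B02); Knowledge Innovation Program of the Chinese Academy of Sciences, directional project (KGCX2-YW-174)
Keywords: HDFS; small files; HIFM; hierarchical index; index preload; data prefetching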
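
The ideas named in the abstract (merging small files, a directory-based hierarchical index, index preloading, and prefetching) can be illustrated with a small in-memory sketch. This is not the authors' HIFM implementation and it does not use any Hadoop API: the names `merge_small_files` and `MergedFileReader`, the two-level index layout, and the "next files in the same directory" prefetch policy are all assumptions made for illustration. Sorting by directory stands in for the correlation-aware grouping, and keeping the index in a Python dict stands in for index preloading.

```python
import posixpath


def merge_small_files(files):
    """Concatenate small files into one blob and build a two-level index.

    files: dict mapping a POSIX-style path to its bytes content.
    Returns (blob, index) with index[directory][name] = (offset, length).
    Hypothetical layout, chosen only to illustrate a hierarchical index.
    """
    blob = bytearray()
    index = {}
    # Sorting by (directory, name) keeps files from one directory adjacent
    # in the merged blob, a simple stand-in for correlation-aware grouping.
    for path in sorted(files, key=posixpath.split):
        directory, name = posixpath.split(path)
        data = files[path]
        index.setdefault(directory, {})[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


class MergedFileReader:
    """Reads small files out of a merged blob, prefetching their neighbours."""

    def __init__(self, blob, index, prefetch_count=2):
        self.blob = blob
        self.index = index  # kept in memory, mimicking a preloaded index
        self.prefetch_count = prefetch_count
        self.cache = {}
        self.cache_hits = 0

    def _slice(self, directory, name):
        offset, length = self.index[directory][name]
        return self.blob[offset:offset + length]

    def read(self, path):
        # Serve from the prefetch cache when an earlier read already loaded it.
        if path in self.cache:
            self.cache_hits += 1
            return self.cache.pop(path)
        directory, name = posixpath.split(path)
        data = self._slice(directory, name)  # KeyError if the path is unknown
        # Prefetch the next few files of the same directory (assumed policy).
        names = list(self.index[directory])
        start = names.index(name) + 1
        for nxt in names[start:start + self.prefetch_count]:
            self.cache[posixpath.join(directory, nxt)] = self._slice(directory, nxt)
        return data


if __name__ == "__main__":
    demo = {"/tiles/z1/b.png": b"BB", "/tiles/z1/a.png": b"A", "/tiles/z2/c.png": b"CCC"}
    blob, index = merge_small_files(demo)
    reader = MergedFileReader(blob, index)
    print(reader.read("/tiles/z1/a.png"), reader.read("/tiles/z1/b.png"), reader.cache_hits)
```

In this sketch a sequential scan of one directory touches the blob only on a cache miss, which is the effect the abstract attributes to prefetching; a real deployment would store the blob and index in HDFS, and its choices of block placement and cache eviction are outside what this example models.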

