HFileOutputFormat and TotalOrderPartitioner

I recently needed to add random-read access to some data, so I chose to generate HFiles and bulk load them into HBase.

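At run time the map phase finished quickly, but the reduce phase spent a long time in its sort stage. The reducer was KeyValueSortReducer and there was only one of it, so the job was bottlenecked on a total sort in a single reducer. My idea was to use TotalOrderPartitioner so that the MR job could run multiple reducers, raising parallelism and removing the bottleneck.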

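So I started writing code: besides TotalOrderPartitioner, I also used InputSampler.RandomSampler to generate the partition file. The job ran into problems, though, and while looking for answers I happened to discover that HFileOutputFormat already uses TotalOrderPartitioner internally to do the total ordering: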

public static void configureIncrementalLoad(Job job, HTable table)
    throws IOException {
  Configuration conf = job.getConfiguration();
  Class<? extends Partitioner> topClass;
  try {
    topClass = getTotalOrderPartitionerClass();
  } catch (ClassNotFoundException e) {
    throw new IOException("Failed getting TotalOrderPartitioner", e);
  }
  job.setPartitionerClass(topClass);
  ......

 

The partition file contains the start key of each region (with the smallest one removed):

private static void writePartitions(Configuration conf, Path partitionsPath,
    List<ImmutableBytesWritable> startKeys) throws IOException {
  if (startKeys.isEmpty()) {
    throw new IllegalArgumentException("No regions passed");
  }
  // We're generating a list of split points, and we don't ever
  // have keys < the first region (which has an empty start key)
  // so we need to remove it. Otherwise we would end up with an
  // empty reducer with index 0
  // (No row key sorts before the smallest start key, so that key is dropped.)
  TreeSet<ImmutableBytesWritable> sorted =
      new TreeSet<ImmutableBytesWritable>(startKeys);
  ImmutableBytesWritable first = sorted.first();
  // If the smallest region start key is not the "official" minimum row key
  // (the empty key), throw an exception.
  if (!first.equals(HConstants.EMPTY_BYTE_ARRAY)) {
    throw new IllegalArgumentException(
        "First region of table should have empty start key. Instead has: "
        + Bytes.toStringBinary(first.get()));
  }
  sorted.remove(first);
  // Write the actual file
  FileSystem fs = partitionsPath.getFileSystem(conf);
  SequenceFile.Writer writer = SequenceFile.createWriter(fs,
      conf, partitionsPath, ImmutableBytesWritable.class, NullWritable.class);
  try {
    // Append each remaining start key to the partition file.
    for (ImmutableBytesWritable startKey : sorted) {
      writer.append(startKey, NullWritable.get());
    }
  } finally {
    writer.close();
  }
}

All of my tables were new and had only one region, so the job was bound to get only one reducer.
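To make the link between region start keys and reducers concrete, here is a minimal, self-contained sketch in plain Java with no Hadoop or HBase dependencies. It is not the library's code: the class PartitionSketch and its methods compare, splitPoints, getPartition, and bytes are hypothetical names introduced for illustration, and the sample keys are made up. It assumes keys are ordered as unsigned bytes and that a key equal to a split point belongs to the partition that starts at that split point.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PartitionSketch {
    // Helper (hypothetical): UTF-8 bytes of a string.
    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Same idea as writePartitions: sort the region start keys and drop the
    // empty first key, leaving one split point per region boundary.
    static byte[][] splitPoints(List<byte[]> startKeys) {
        List<byte[]> sorted = new ArrayList<>(startKeys);
        sorted.sort(PartitionSketch::compare);
        if (sorted.isEmpty() || sorted.get(0).length != 0) {
            throw new IllegalArgumentException("First region must have an empty start key");
        }
        return sorted.subList(1, sorted.size()).toArray(new byte[0][]);
    }

    // Binary search: the partition index is the number of split points <= key.
    static int getPartition(byte[] key, byte[][] splits) {
        int lo = 0;
        int hi = splits.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(key, splits[mid]) >= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Four regions with made-up start keys "", "g", "n", "t".
        List<byte[]> startKeys = new ArrayList<>();
        for (String s : new String[] {"", "g", "n", "t"}) {
            startKeys.add(bytes(s));
        }
        byte[][] splits = splitPoints(startKeys);
        for (String row : new String[] {"apple", "g", "melon", "zebra"}) {
            System.out.println(row + " -> reducer " + getPartition(bytes(row), splits));
        }
        // A new table has one region, hence no split points: every key maps to 0.
        System.out.println("single region -> reducer "
            + getPartition(bytes("zebra"), new byte[0][]));
    }
}
```

With four regions there are three split points and therefore four partitions; with a single region the split list is empty and every key lands in partition 0, which matches the single-reducer behavior described above.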

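So with HFileOutputFormat, the number of reducers equals the number of regions in the HTable. If you bulk load HFiles to import a huge amount of data, the best approach is to define the regions up front when you create the HTable. This technique is called Pre-Creating Regions (PCR), and it brings other benefits as well, such as fewer region splits. Some of Taobao's optimizations apply PCR and turn off automatic splitting, then split manually when the system is idle, so that splits do not add to the load while the system is busy.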

About Pre-Creating Regions: http://hbase.apache.org/book.html#precreate.regions

11.7.2. Table Creation: Pre-Creating Regions

Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too-many regions can actually degrade performance.

There are two different approaches to pre-creating splits. The first approach is to rely on the default HBaseAdmin strategy (which is implemented in Bytes.split)...

byte[] startKey = ...;       // your lowest key
byte[] endKey = ...;           // your highest key
int numberOfRegions = ...;    // # of regions to create
admin.createTable(table, startKey, endKey, numberOfRegions);

And the other approach is to define the splits yourself...

byte[][] splits = ...;   // create your own splits
admin.createTable(table, splits);
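As one possible way to fill in the `splits` array above, here is a minimal sketch in plain Java. It rests on an assumption the original post does not make: that row keys begin with a single, uniformly distributed byte (for example a hash prefix). The class SplitKeySketch and the method uniformSplits are hypothetical names introduced for illustration, not part of HBase.

```java
class SplitKeySketch {
    // Returns numRegions - 1 one-byte split keys that divide the range
    // 0x00..0xFF of the first key byte into numRegions roughly equal slices.
    // Assumption: the first byte of every row key is uniformly distributed.
    static byte[][] uniformSplits(int numRegions) {
        if (numRegions < 1 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be in 1..256");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Boundary of the i-th slice; integer division keeps values ascending.
            splits[i - 1] = new byte[] {(byte) (i * 256 / numRegions)};
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four regions need three boundaries.
        for (byte[] split : uniformSplits(4)) {
            System.out.printf("0x%02X%n", split[0] & 0xff);
        }
    }
}
```

For four regions this yields the boundaries 0x40, 0x80, and 0xC0. Since HFileOutputFormat uses one reducer per region, a table pre-split this way would give the bulk-load job four reducers. If the real keys are not uniformly distributed in their first byte, split points sampled from the data would likely balance the regions better.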

 

 

 

Reposted from: https://www.cnblogs.com/aprilrain/archive/2013/03/27/2985064.html
