HDFS Cache Commands
- View cache pool information
hdfs cacheadmin -listPools -stats
- View information about cached data (cache directives)
hdfs cacheadmin -listDirectives -stats
- Uncache an Impala table's data
alter table dw_crawler.bsl_zhongda_weibo_article_hive set uncached;
- Create a cache pool
hdfs cacheadmin -addPool article_pool2 -owner impala
- Show table stats
show table stats bsl_zhongda_weibo_article_hive;
- Cache a table (an end-to-end sketch follows this list)
alter table dw_crawler.bsl_zhongda_weibo_article_hive set cached in 'article_pool3';
- Cache a specific partition of a table
alter table dw_crawler.bsl_zhongda_weibo_article_hive partition(pt_created_date=20180101) set cached in 'article_pool3';
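Putting these together, a minimal end-to-end sketch (pool and table names are the ones above; run the hdfs commands as a user with cache admin rights):
# 1. Create the pool that will hold the cached blocks (optionally add -limit <bytes>)
hdfs cacheadmin -addPool article_pool3 -owner impala
# 2. From impala-shell, cache the table into the pool:
#    alter table dw_crawler.bsl_zhongda_weibo_article_hive set cached in 'article_pool3';
# 3. Caching runs in the background; poll until BYTES_CACHED catches up with BYTES_NEEDED
hdfs cacheadmin -listDirectives -stats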
Performance test results:
Cached | Rows | Data size | Time (concurrency 1) | Time (concurrency 3) | Time (concurrency 5) |
Yes | - | 116.5GB | 8s | 20s | 29s |
No | - | 116.5GB | 68.17s | 136s | 240s |
Yes | - | 72.7GB | 8.38s | 20.6s | 30.3s |
No | - | 72.7GB | 80s | 165s | 235s |
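How the concurrency runs were driven is not recorded above; the following is a minimal sketch of one way to do it, assuming a hypothetical query file query.sql and the impalad fwqzx002.zh:21000 that appears in the logs below:
#!/bin/bash
# Fire off N concurrent impala-shell sessions running the same query file,
# then report the wall-clock time until the slowest one finishes.
N=3
start=$(date +%s)
for i in $(seq 1 "$N"); do
  impala-shell -i fwqzx002.zh:21000 -f query.sql >/dev/null 2>&1 &
done
wait
echo "concurrency=$N elapsed=$(( $(date +%s) - start ))s"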
Troubleshooting
1. HDFS cache pool out of space
[fwqzx002.zh:21000] ods_crawler> alter table xxx partition (pt_created_date='201811') set cached in 'article_pool';
Query: alter table xxx partition (pt_created_date='201811') set cached in 'article_pool'
ERROR: ImpalaRuntimeException: Caching path /user/hive/warehouse/xxxx/pt_created_date=201811 of size 24274436556 bytes at replication 1 would exceed pool article_pool's remaining capacity of 20450868109 bytes.
    at org.apache.hadoop.hdfs.server.namenode.CacheManager.checkLimit(CacheManager.java:405)
    at org.apache.hadoop.hdfs.server.namenode.CacheManager.addDirective(CacheManager.java:531)
    at org.apache.hadoop.hdfs.server.namenode.FSNDNCacheOp.addCacheDirective(FSNDNCacheOp.java:45)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheDirective(FSNamesystem.java:6782)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addCacheDirective(NameNodeRpcServer.java:1883)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addCacheDirective(ClientNamenodeProtocolServerSideTranslatorPB.java:1265)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1685)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
CAUSED BY: InvalidRequestException: Caching path /user/hive/warehouse/xxxx/pt_created_date=201811 of size 24274436556 bytes at replication 1 would exceed pool article_pool's remaining capacity of 20450868109 bytes. (stack trace identical to the above)
CAUSED BY: RemoteException: Caching path /user/hive/warehouse/xxxx/pt_created_date=201811 of size 24274436556 bytes at replication 1 would exceed pool article_pool's remaining capacity of 20450868109 bytes. (stack trace identical to the above)
The cause: the pool was created with -limit 40000000000, i.e. 40,000,000,000 bytes ≈ 37.25GB, while the data to cache was 18.21GB (the partition that cached successfully) + 22.61GB (the partition that failed) = 40.82GB, which exceeds that limit.
The original pool-creation command:
hdfs cacheadmin -addPool article_pool -owner impala -limit 40000000000
Fix: change the pool limit to a value your cluster can actually hold (by default a pool is unlimited), as in the sketch below.
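A minimal sketch, assuming article_pool should grow to a hypothetical 80,000,000,000 bytes; -modifyPool is the standard cacheadmin subcommand for changing a pool's limit:
# Raise the existing pool's limit (value is in bytes)
hdfs cacheadmin -modifyPool article_pool -limit 80000000000
# Confirm the new limit
hdfs cacheadmin -listPools -stats article_pool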
2. HDFS did not cache all of the partition data
The Size column clearly does not match Bytes Cached:
[fwqzx002.zh:21000] ods_crawler> show table stats tableName;
Query: show table stats tableName
+-----------------+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------+
| pt_created_date | #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                 |
+-----------------+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------+
| 201810          | -1    | 75     | 18.21GB | 5.36GB       | 1                 | PARQUET | false             | hdfs://nameservice1/user/hive/warehouse/tableName/pt_created_date=201810 |
| 201811          | -1    | 94     | 22.61GB | 55.35MB      | 1                 | PARQUET | false             | hdfs://nameservice1/user/hive/warehouse/tableName/pt_created_date=201811 |
| 201812          | -1    | 141    | 33.70GB | 33.22GB      | 1                 | PARQUET | false             | hdfs://nameservice1/user/hive/warehouse/tableName/pt_created_date=201812 |
| Total           | -1    | 310    | 74.51GB | 38.63GB      |                   |         |                   |                                                                          |
+-----------------+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------+
Checking HDFS shows the same thing: the size that needs caching (BYTES_NEEDED) differs from what was actually cached (BYTES_CACHED):
[hdfs@fwqzx002 root]$ hdfs cacheadmin -listDirectives -stats
Found 4 entries
 ID POOL          REPL EXPIRY PATH                                                   BYTES_NEEDED BYTES_CACHED FILES_NEEDED FILES_CACHED
 20 article_pool3    1  never /user/hive/warehouse/tableName/pt_created_date=201812   36183104282  35666895768          141          139
 21 article_pool3    1  never /user/hive/warehouse/tableName                                    0            0            0            0
 22 article_pool3    1  never /user/hive/warehouse/tableName/pt_created_date=201810   19549131891   5751122434           75           22
 23 article_pool3    1  never /user/hive/warehouse/tableName/pt_created_date=201811   24274436556     58042919           94            1
No error was reported during the operation itself, so I suspected the HDFS cache had hit its ceiling and went to check the HDFS configuration.
The HDFS parameter "dfs.datanode.max.locked.memory" turned out to be 4GB. With 10 DataNode nodes in the cluster, the total cacheable memory is at most 40GB, and the ~38GB actually cached above is essentially at that ceiling (data is not distributed perfectly evenly, so a node whose share exceeds 4GB hits the cache limit first).
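One way to confirm this per node (my addition, not part of the original diagnosis): hdfs dfsadmin -report prints each DataNode's cache capacity and usage, which grep can pull out:
# Show per-DataNode cache capacity/usage lines from the cluster report
hdfs dfsadmin -report | grep -iE '^Name|cache'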
After raising "dfs.datanode.max.locked.memory" to 50GB (tune this to your own servers), all the data cached successfully.
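A minimal sketch of the corresponding hdfs-site.xml change (my rendering; the value is in bytes, so 50GB ≈ 53687091200, the DataNode user's locked-memory ulimit must be at least as large, and the DataNodes must be restarted to apply it):
<property>
  <name>dfs.datanode.max.locked.memory</name>
  <!-- 50GB in bytes; must not exceed the DataNode user's "ulimit -l" -->
  <value>53687091200</value>
</property>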
- Note: the difference between show table stats bsl_zhongda_weibo_article_hive and hdfs cacheadmin -listDirectives -stats is that show table stats displays Bytes Cached as -1 while caching is still incomplete, whereas the latter reports the space that has been allocated, i.e. the latter's figures do not mean caching has actually finished.