Introduction
History:
Hadoop Archives (HAR files) were introduced in Hadoop 0.18.0.
Purpose:
To pack small files stored in HDFS into a single archive file, much like zip/rar on Windows or tar on Linux: many files bundled into one.
Why it matters:
HAR was introduced to alleviate the problem of large numbers of small files consuming NameNode memory.
How it works:
A HAR file works by building a layered filesystem on top of HDFS.
A HAR file is created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the archive.
From the client's point of view, using a HAR file changes nothing; on the HDFS side, however, the number of files it has to track is reduced.
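To make the "layered filesystem" concrete, here is a minimal sketch of what an archive looks like on HDFS, assuming an archive named /temp.har like the one created in the walkthrough below; the path inside the archive is hypothetical:

```shell
# Sketch: what a HAR looks like on HDFS (assumes an archive /temp.har exists).
# A .har is really a directory holding two index files plus the packed data:
#
#   hadoop fs -ls /temp.har                 # requires a running cluster
#   # _index        maps each archived file to (part file, offset, length)
#   # _masterindex  points into _index so lookups need not scan all of it
#   # part-0        the small files' contents, concatenated
#
# An archived file is then addressed through the har:// scheme:
ARCHIVE=/temp.har
FILE=hadoop-yarn/staging/history            # hypothetical path inside the archive
echo "har://$ARCHIVE/$FILE"                 # -> har:///temp.har/hadoop-yarn/staging/history
```

The two index files are exactly the "two layers of index reads" that make HAR access slightly slower than plain HDFS reads.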
Read performance is modest:
Reading a file through a HAR is no faster than reading it directly from HDFS, and in practice is slightly slower, because every access to a file in a HAR must first read two layers of index files before reading the file data itself.
Although HAR files can be used as input to a MapReduce job, there is no special mechanism that lets the maps treat the files packed inside a HAR as a single HDFS file; each archived file is still processed individually.
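As a sketch of the point above: a har:// path can be handed to a job wherever an HDFS path is accepted, but the input is still split per archived file. Paths and the examples jar location here are assumptions:

```shell
# Sketch: using an archive as MapReduce input (paths and jar are assumptions).
# Any tool that accepts an HDFS path also accepts a har:// path, but the job
# still gets one input file per archived file -- packing does NOT merge splits.
IN="har:///temp.har/hadoop-yarn/staging"    # input directory inside the archive
OUT=/tmp/wc-out
#   hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
#       wordcount "$IN" "$OUT"              # requires a running cluster
echo "$IN"
```

So HAR helps the NameNode's memory footprint, not the small-files overhead of a MapReduce job itself.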
Creation command:
hadoop archive -archiveName xxx.har -p /src /dest
Full usage: hadoop archive -archiveName <NAME>.har -p <parent path> [-r <replication factor>] <src>* <dest>
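A worked example of the usage line above, with each parameter spelled out; all the paths are hypothetical placeholders:

```shell
# Worked example of the archive command (all paths are hypothetical).
PARENT=/user/hadoop      # -p: parent path; the sources below are relative to it
NAME=files.har           # archive name, must end in .har
DEST=/user/out           # directory in which the archive is created

#   hadoop archive -archiveName "$NAME" -p "$PARENT" -r 3 dir1 dir2 "$DEST"
#   # requires a running cluster; packs /user/hadoop/dir1 and /user/hadoop/dir2
#   # into /user/out/files.har; -r 3 sets the archive's replication (optional)

# Afterwards the archived content is addressed through the har:// scheme:
echo "har://$DEST/$NAME/dir1"               # -> har:///user/out/files.har/dir1
```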
Listing command:
hadoop fs -ls -R har:///path/xxx.har
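The har URI used above comes in two forms; the host and port below are placeholders for your NameNode:

```shell
# The har scheme in two forms (host/port are placeholders, file path hypothetical):
# 1) relative to the cluster's default filesystem:
SHORT="har:///user/out/files.har/dir1/a.txt"
# 2) fully qualified -- the underlying scheme and NameNode address are spliced
#    into the authority as <scheme>-<host>:<port>:
FULL="har://hdfs-namenode:8020/user/out/files.har/dir1/a.txt"
#   hdfs dfs -cat "$SHORT"                  # requires a running cluster
echo "$SHORT"
echo "$FULL"
```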
Walkthrough:
Note: only paths inside HDFS can be archived; pointing the command at a non-HDFS path raises an error.
1. hdfs dfs -ls /
drwx------ - hadoop supergroup 0 2016-04-14 22:19 /tmp
drwxr-xr-x - hadoop supergroup 0 2016-04-14 22:19 /wc
2. hadoop archive -archiveName temp.har -p /tmp /
This launches a MapReduce job:
16/08/13 00:41:16 INFO client.RMProxy: Connecting to ResourceManager at hello110/192.168.255.130:8032
16/08/13 00:41:18 INFO client.RMProxy: Connecting to ResourceManager at hello110/192.168.255.130:8032
16/08/13 00:41:18 INFO client.RMProxy: Connecting to ResourceManager at hello110/192.168.255.130:8032
16/08/13 00:41:18 INFO mapreduce.JobSubmitter: number of splits:1
16/08/13 00:41:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1471019987033_0001
16/08/13 00:41:19 INFO impl.YarnClientImpl: Submitted application application_1471019987033_0001
16/08/13 00:41:19 INFO mapreduce.Job: The url to track the job: http://hello110:8088/proxy/application_1471019987033_0001/
16/08/13 00:41:19 INFO mapreduce.Job: Running job: job_1471019987033_0001
16/08/13 00:41:35 INFO mapreduce.Job: Job job_1471019987033_0001 running in uber mode : false
16/08/13 00:41:35 INFO mapreduce.Job: map 0% reduce 0%
16/08/13 00:41:57 INFO mapreduce.Job: map 100% reduce 0%
16/08/13 00:42:21 INFO mapreduce.Job: map 100% reduce 100%
16/08/13 00:42:23 INFO mapreduce.Job: Job job_1471019987033_0001 completed successfully
3. hdfs dfs -ls /
drwxr-xr-x - hadoop supergroup 0 2016-08-13 00:42 /temp.har (newly created)
drwx------ - hadoop supergroup 0 2016-04-14 22:19 /tmp
drwxr-xr-x - hadoop supergroup 0 2016-04-14 22:19 /wc
4. hadoop fs -ls -R har:///temp.har
drwxr-xr-x - hadoop supergroup 0 2016-04-14 22:19 har:///temp.har/hadoop-yarn
drwxr-xr-x - hadoop supergroup 0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging
drwxr-xr-x - hadoop supergroup 0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging/hadoop
drwxr-xr-x - hadoop supergroup 0 2016-08-13 00:41 har:///temp.har/hadoop-yarn/staging/hadoop/.staging
drwxr-xr-x - hadoop supergroup 0 2016-08-13 00:41 har:///temp.har/hadoop-yarn/staging/hadoop/.staging/har_dj36hy
-rw-r--r-- 1 hadoop supergroup 1593 2016-08-13 00:41 har:///temp.har/hadoop-yarn/staging/hadoop/.staging/har_dj36hy/_har_src_files
drwxr-xr-x - hadoop supergroup 0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging/history
drwxr-xr-x - hadoop supergroup 0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hadoop supergroup 0 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop
-rw-r--r-- 1 hadoop supergroup 33303 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001-1460643581404-hadoop-wcount.jar-1460643608082-1-1-SUCCEEDED-default-1460643592087.jhist
-rw-r--r-- 1 hadoop supergroup 349 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001.summary
-rw-r--r-- 1 hadoop supergroup 115449 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001_conf.xml
5. hdfs dfs -cat har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001_conf.xml
<property><name>mapreduce.tasktracker.instrumentation</name><value>org.apache.hadoop.mapred.TaskTrackerMetricsInst</value><source>mapred-default.xml</source><source>job.xml</source></property>
<property><name>io.seqfile.sorter.recordlimit</name><value>1000000</value><source>core-default.xml</source><source>job.xml</source></property>
<property><name>yarn.sharedcache.webapp.address</name><value>0.0.0.0:8788</value><source>yarn-default.xml</source><source>job.xml</source></property>
<property><name>yarn.app.mapreduce.am.resource.mb</name><value>1536</value><source>mapred-default.xml</source><source>job.xml</source></property>
<property><name>mapreduce.framework.name</name><value>yarn</value><source>mapred-site.xml</source><source>job.xml</source></property>
<property><name>mapreduce.job.reduce.slowstart.completedmaps</name><value>0.05</value><source>mapred-default.xml</source><source>job.xml</source></property>
... (output truncated; the full conf.xml is much longer) ...
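One final point worth sketching: a HAR is immutable, so there is no command to add or remove files inside it. To "unpack" one, you copy its contents back out; the paths below are hypothetical:

```shell
# Sketch: unpacking and removing an archive (paths hypothetical).
# A HAR is immutable -- to change its contents you copy out and re-archive.
SRC="har:///temp.har"
RESTORE=/tmp/restored
#   hadoop fs -cp "$SRC/*" "$RESTORE"       # serial copy, requires a cluster
#   hadoop distcp "$SRC"   "$RESTORE"       # parallel copy via a MapReduce job
#   hadoop fs -rm -r /temp.har              # deleting = removing the .har directory
echo "$SRC -> $RESTORE"
```

distcp is the usual choice for large archives, since the copy itself runs as a MapReduce job.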