TAR is a widely used format for storing and distributing large collections of files, such as backup images and large datasets. Many of these files could be used as input to analytic jobs.
3
+
4
+
<hr/>
5
+
<small>
6
+
Version: 2.0_beta
7
+
</small>
8
+
<hr/>
9
+
10
+
TAR is a widely used format for storing backup images, distributing large datasets, and similar tasks. Many of these files could be used as input to analytic jobs.
4
11
5
12
Apache Hadoop, as of now, is not TAR aware. That is, it cannot directly read a file inside a TAR, nor can it run MapReduce jobs on those files. To run analytic jobs on a TAR, one first needs to copy it to local disk, un-TAR it, and then copy the extracted files back to the Hadoop file system, or else convert it to a sequence file or another Hadoop-aware format using a custom (Java) program. This procedure is time-consuming, and the user ends up with two copies of the data.
6
13
7
14
By using the TarFileSystem for Hadoop, Hadoop can directly read files inside a TAR and run analytic jobs on those files. This way, no conversion or extraction is required.
8
15
9
16
Building
10
17
---------
11
-
Run "mvn package" inside the project directory. The TarFileSystem distribution is created as a jar file at ./target/TarFileSystem-*.jar
18
+
Run "mvn package" inside the project directory. The TarFileSystem distribution is created as a jar file at `./target/hadoop-tarfs-2.0_beta.jar`
12
19
13
20
14
21
Distribution and Configuration
15
22
-------------------------------
16
-
TAR File System binary for Hadoop is distributed as a JAR library (TarFileSystem.jar). This JAR contains the main TarFileSystem class and other supporting classes. The user needs to copy this JAR to the HADOOP_HOME/lib directory (HDFS_HOME/lib for Hadoop 2.0) or add the jar to HADOOP_CLASSPATH environment variable.
23
+
The TAR File System binary for Hadoop is distributed as a JAR library (`hadoop-tarfs-*.jar`). This JAR contains all the classes required to support TarFileSystem. Copy this JAR to the `HADOOP_HOME/lib` directory (`HDFS_HOME/lib` for Hadoop 2.0) or add it to the `HADOOP_CLASSPATH` environment variable.
17
24
18
-
Next you need to expose tar:// uri schema to Hadoop by adding the following property in HADOOP_CONF_DIR/core-site.xml
25
+
Next, expose the `tar://` URI scheme to Hadoop by adding the following property to the `HADOOP_CONF_DIR/core-site.xml` file.
19
26
20
27
<property>
21
28
<name>fs.tar.impl</name>
@@ -24,7 +31,7 @@ Next you need to expose tar:// uri schema to Hadoop by adding the following prop
24
31
25
32
### Optional Configuration:
26
33
27
-
By default, TarFileSystem creates an .index file in the same directory where the tar file resides. Index writing may fail if you do not have sufficient permission in that directory. In that case you may specify a temporary directory where you have write permission and tell TarFileSystem to use that directory instead. You may specify the following property in core-site.xml for this:
34
+
By default, TarFileSystem creates an `.index` file in the same directory where the tar file resides. Index writing may fail if you do not have sufficient permission in that directory. In that case, you can specify a temporary directory where you have write permission and have TarFileSystem use it instead, by setting the following property in core-site.xml:
28
35
29
36
<property>
30
37
<name>tarfs.tmp.dir</name>
@@ -39,30 +46,29 @@ Hadoop can access a TAR archive using TAR URI SCHEMA (URI starting with tar://).
To access a file inside a TAR archive, append the name of the file after the TAR URI using a ‘+’ sign
62
+
To access a file inside a TAR archive, append the name of the file after the TAR URI using a ‘+’ sign. Sub-directory paths within a TAR archive are also written using the ‘+’ sign. For example, if the file is at path `dir1/dir2/file1.txt` within the TAR archive, use the following path to read it.
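As a minimal sketch of this naming rule, the shell helper below builds such a URI from an archive path and an inner file path. `tar_uri` is a hypothetical helper and is not part of TarFileSystem. It assumes that every `/` in the inner path is written as `+`, which is one reading of the rule above rather than a confirmed specification.

```shell
# Hypothetical helper (not shipped with TarFileSystem).
# Assumption: each '/' in the inner path is replaced by '+'.
tar_uri() {
    archive=$1   # absolute path of the TAR archive, e.g. /tardemo/archive.tar
    inner=$2     # path of the file inside the archive, e.g. dir1/dir2/file1.txt
    printf 'tar://%s+%s\n' "$archive" "$(printf '%s' "$inner" | tr '/' '+')"
}

# Prints the URI for dir1/dir2/file1.txt inside /tardemo/archive.tar
tar_uri /tardemo/archive.tar dir1/dir2/file1.txt
```

Under that assumption, the result could be passed to a command such as `bin/hadoop fs -cat`, in the same way as any other Hadoop path.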
13/07/15 20:38:35 INFO tar.TarFileSystem: *** Using Tar file system ***
59
66
This is the file content.
60
67
[...]
61
68
62
69
In TAR File System, the TAR archive is modeled as a directory, and all the files inside a TAR are modeled as files within that directory. One can run MapReduce jobs on files within a TAR archive just as on normal files.
63
70
64
-
[jd@morpheus hadoop-1.0.3]$ bin/hadoop jar hadoop*examples*.jar wordcount ↲
65
-
tar:///tardemo/archive.tar wc_out
71
+
[jd@node1 ~]$ bin/hadoop jar hadoop*examples*.jar wordcount tar:///tardemo/archive.tar wc_out ↲
66
72
13/07/15 20:43:05 INFO tar.TarFileSystem: *** Using Tar file system ***
67
73
13/07/15 20:43:05 INFO input.FileInputFormat: Total input paths to process : 3
68
74
13/07/15 20:43:05 INFO mapred.JobClient: Running job: job_201307151954_0001
@@ -71,4 +77,5 @@ In TAR File System, the TAR archive is modeled like a directory and all the file
71
77
72
78
# TO DO
73
79
1. Implement efficient seek in SeekableTarInputStream