Commit e699842

Merge pull request #4 from JDatta/pr/hadoop-2-3-0/2.0_beta

Support hadoop 2.3.x

2 parents 46043e7 + e0c616d

6 files changed: 116 additions & 56 deletions


README.md

Lines changed: 22 additions & 15 deletions
```diff
@@ -1,21 +1,28 @@
-Tar FileSystem for Hadopp
+Tar FileSystem for Hadoop
 ==========================
-TAR is a widely used format for storing and distributing large collections of files such as backup images, large datasets etc. TAR is also very popular format for storing backup images, distributing large datasets etc. Many of those files could be used as an input to analytic jobs.
+
+<hr/>
+<small>
+Version: 2.0_beta
+</small>
+<hr/>
+
+TAR is a widely used format for storing backup images, distributing large datasets etc. Many of those files could be used as an input to analytic jobs.
 
 Apache Hadoop, as of now, is not TAR aware. That is, it can not directly read a file inside a TAR. Neither it can run map-reduce on those files. To run analytic jobs on a TAR, one needs to first copy it to local disk, un-TAR it, then copy back to Hadoop file system. Or convert it to sequence file/other Hadoop aware format using custom (java) program. This procedure is time consuming and the user ends up having two copies of data.
 
 By using the TarFileSystem for Hadoop, Hadoop can directly read files inside a TAR and run analytic jobs on those file. This way, no conversion/extraction is required.
 
 Building
 ---------
-Run "mvn package" inside the project directory. The TarFileSystem distribution is created as a jar file at ./target/TarFileSystem-*.jar
+Run "mvn package" inside the project directory. The TarFileSystem distribution is created as a jar file at `./target/hadoop-tarfs-2.0_beta.jar`
 
 
 Distribution and Configuration
 -------------------------------
-TAR File System binary for Hadoop is distributed as a JAR library (TarFileSystem.jar). This JAR contains the main TarFileSystem class and other supporting classes. The user needs to copy this JAR to the HADOOP_HOME/lib directory (HDFS_HOME/lib for Hadoop 2.0) or add the jar to HADOOP_CLASSPATH environment variable.
+TAR File System binary for Hadoop is distributed as a JAR library (`hadoop-tarfs-*.jar`). This JAR contains all the required classes to support TarFileSystem. Copy this JAR to the `HADOOP_HOME/lib` directory (`HDFS_HOME/lib` for Hadoop 2.0) or add the jar to `HADOOP_CLASSPATH` environment variable.
 
-Next you need to expose tar:// uri schema to Hadoop by adding the following property in HADOOP_CONF_DIR/core-site.xml
+Next expose `tar://` uri schema to Hadoop by adding the following property in `HADOOP_CONF_DIR/core-site.xml` file.
 
     <property>
       <name>fs.tar.impl</name>
@@ -24,7 +31,7 @@ Next you need to expose tar:// uri schema to Hadoop by adding the following prop
 
 ### Optional Configuration:
 
-By default, TarFileSystem creates an .index file in the same directory where the tar file resides. Index writing may fail if you do not have sufficient permission in that directory. In that case you may specify a temporary directory where you have write permission and tell TarFileSystem to use that directory instead. You may specify the following property in core-site.xml for this:
+By default, TarFileSystem creates an `.index` file in the same directory where the tar file resides. Index writing may fail if you do not have sufficient permission in that directory. In that case you may specify a temporary directory where you have write permission and tell TarFileSystem to use that directory instead. You may specify the following property in core-site.xml for this:
 
     <property>
      <name>tarfs.tmp.dir</name>
@@ -39,30 +46,29 @@ Hadoop can access a TAR archive using TAR URI SCHEMA (URI starting with tar://).
 
 Following is a TAR inside Hadoop file System
 
-    [jd@morpheus hadoop-1.0.3]$ bin/hadoop fs -ls /tardemo/archive.tar ↲
+    [jd@node1 ~]$ bin/hadoop fs -ls /tardemo/archive.tar ↲
     Found 1 items
     -rw-r--r-- 1 jd supergroup 1751040 2013-07-15 20:30 /tardemo/archive.tar
 
 To access files inside this tar, simply prepone this with tar:// to make it a TAR File System URI
 
-    [jd@morpheus hadoop-1.0.3]$ bin/hadoop fs -ls tar:///tardemo/archive.tar ↲
+    [jd@node1 ~]$ bin/hadoop fs -ls tar:///tardemo/archive.tar ↲
     13/07/15 20:33:04 INFO tar.TarFileSystem: *** Using Tar file system ***
     Found 3 items
-    -rw-rw-r-- 1 jd jd 502760 2013-07-15 20:27 /tardemo/archive.tar+data/file2.txt
-    -rw-rw-r-- 1 jd jd 594933 2013-07-15 20:26 /tardemo/archive.tar+data/file1.txt
-    -rw-rw-r-- 1 jd jd 641720 2013-07-15 20:27 /tardemo/archive.tar+data/file3.txt
+    -rw-rw-r-- 1 jd jd 502760 2013-07-15 20:27 /tardemo/archive.tar+/data+file2.txt
+    -rw-rw-r-- 1 jd jd 594933 2013-07-15 20:26 /tardemo/archive.tar+/data+file1.txt
+    -rw-rw-r-- 1 jd jd 641720 2013-07-15 20:27 /tardemo/archive.tar+/data+file3.txt
 
-To access a file inside a TAR archive, append the name of the file after the TAR URI using a ‘+’ sign
+To access a file inside a TAR archive, append the name of the file after the TAR URI using a ‘+’ sign. All sub-directory paths within a TAR archive are also defined using ‘+’ sign. For example, if the file is in path `dir1/dir2/file1.txt` within tar archive, use the following path to read it.
 
-    [jd@morpheus hadoop-1.0.3]$ bin/hadoop fs -cat tar://hdfs-localhost:54310/tardemo/archive.tar+data/file1.txt ↲
+    [jd@node1 ~]$ bin/hadoop fs -cat tar://hdfs-localhost:54310/tardemo/archive.tar/+dir1+dir2+file1.txt ↲
     13/07/15 20:38:35 INFO tar.TarFileSystem: *** Using Tar file system ***
     This is the file content.
     [...]
 
 In TAR File System, the TAR archive is modeled like a directory and all the files inside a TAR are modeled like files within a directory. One can run mapreduce jobs on files within a TAR archive just like they do it on normal files.
 
-    [jd@morpheus hadoop-1.0.3]$ bin/hadoop jar hadoop*examples*.jar wordcount ↲
-    tar:///tardemo/archive.tar wc_out
+    [jd@node1 ~]$ bin/hadoop jar hadoop*examples*.jar wordcount tar:///tardemo/archive.tar wc_out ↲
     13/07/15 20:43:05 INFO tar.TarFileSystem: *** Using Tar file system ***
     13/07/15 20:43:05 INFO input.FileInputFormat: Total input paths to process : 3
     13/07/15 20:43:05 INFO mapred.JobClient: Running job: job_201307151954_0001
@@ -71,4 +77,5 @@ In TAR File System, the TAR archive is modeled like a directory and all the file
 
 # TO DO
 1. Implement efficient seek in SeekableTarInputStream
+2. Support compressed TAR archives
 
```
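Both `core-site.xml` fragments quoted in the README diff above are cut off at hunk boundaries. A complete configuration sketch might look like the following; note that the `fs.tar.impl` value is an assumption inferred from the source path `src/main/java/org/apache/hadoop/fs/tar/TarFileSystem.java` in this commit, and `/tmp/tarfs` is a placeholder path, not something the commit specifies:

```xml
<!-- Sketch of HADOOP_CONF_DIR/core-site.xml (values are assumptions, see above) -->
<configuration>
  <!-- Required: expose the tar:// URI scheme -->
  <property>
    <name>fs.tar.impl</name>
    <value>org.apache.hadoop.fs.tar.TarFileSystem</value>
  </property>
  <!-- Optional: where TarFileSystem writes its .index files when the
       directory containing the tar archive is not writable -->
  <property>
    <name>tarfs.tmp.dir</name>
    <value>/tmp/tarfs</value>
  </property>
</configuration>
```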

pom.xml

Lines changed: 21 additions & 24 deletions
```diff
@@ -1,33 +1,30 @@
 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
-  <modelVersion>4.0.0</modelVersion>
-  <groupId>org.apache.hadoop</groupId>
-  <artifactId>TarFileSystem</artifactId>
-  <packaging>jar</packaging>
-  <version>1.0</version>
-  <name>TarFileSystem</name>
-  <url>http://maven.apache.org</url>
-
-  <dependencies>
-    <dependency>
-      <groupId>org.apache.commons</groupId>
-      <artifactId>commons-compress</artifactId>
-      <version>1.5</version>
-    </dependency>
+  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+  <groupId>org.apache.hadoop</groupId>
+  <artifactId>hadoop-tarfs</artifactId>
+  <packaging>jar</packaging>
+  <version>2.0_beta</version>
+  <name>hadoop-tarfs</name>
+  <url>http://maven.apache.org</url>
 
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-core</artifactId>
-      <version>1.0.3</version>
-    </dependency>
-
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.commons</groupId>
+      <artifactId>commons-compress</artifactId>
+      <version>1.5</version>
+    </dependency>
+    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>2.3.0</version>
+    </dependency>
     <dependency>
       <groupId>junit</groupId>
       <artifactId>junit</artifactId>
       <version>4.8.1</version>
       <scope>test</scope>
     </dependency>
-
-  </dependencies>
-
+  </dependencies>
 </project>
```

src/main/java/org/apache/hadoop/fs/tar/TarFileSystem.java

Lines changed: 11 additions & 8 deletions
```diff
@@ -59,8 +59,9 @@ public class TarFileSystem extends FileSystem {
   private FileSystem underlyingFS = null;
   private Path workingDir;
 
-  private static final String TAR_URLPREFIX = "tar://";
+  private static final String TAR_URLPREFIX = "tar:/";
   private static final char TAR_INFILESEP = '+';
+  private static final String TAR_INFILESEP_STR = "\\+";
 
   public static final Log LOG = LogFactory.getLog(TarFileSystem.class);
 
@@ -97,10 +98,11 @@ public URI getUri() {
 
   private String getFileInArchive(Path tarPath) {
     String fullUri = tarPath.toUri().toString();
-    int i = fullUri.lastIndexOf(TAR_INFILESEP);
+    int i = fullUri.indexOf(TAR_INFILESEP);
     if (i == -1)
       return null;
-    return fullUri.substring(i + 1);
+    return fullUri.substring(i + 1)
+        .replaceAll(TAR_INFILESEP_STR, Path.SEPARATOR);
   }
 
   /**
@@ -123,7 +125,7 @@ private Path getBaseTarPath(Path tarPath) {
     // form the path component
     String basePath = uri.getPath();
     // strip the part containing inFile name
-    int lastPlusIndex = basePath.lastIndexOf(TAR_INFILESEP);
+    int lastPlusIndex = basePath.indexOf(TAR_INFILESEP);
     if (lastPlusIndex != -1)
       basePath = basePath.substring(0, lastPlusIndex);
 
@@ -226,10 +228,11 @@ public FileStatus[] listStatus(Path f) throws IOException {
         entry.getUserName(),
         entry.getGroupName(),
         new Path(
-            abs.toUri().toASCIIString()
-                + TAR_INFILESEP + entry.getName()
-            )
-        );
+            abs.toUri().toASCIIString()
+                + Path.SEPARATOR
+                + TAR_INFILESEP
+                + entry.getName()
+                    .replaceAll(Path.SEPARATOR, TAR_INFILESEP_STR)));
     ret.add(fstatus);
   }
 }
```
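The changes in TarFileSystem.java implement one scheme: the first `+` in a path splits the base tar archive from the member path (hence `lastIndexOf` becoming `indexOf`), and every `/` inside the member path is exposed as `+` when listing. The round trip can be sketched as a small standalone class; the helper names are hypothetical, and plain `char` replacement stands in for the commit's regex-based `replaceAll("\\+", ...)`:

```java
// Sketch of the '+'-separator path mapping used by the commit (helper
// names are illustrative, not from the source).
public class TarPathDemo {
    static final char SEP = '+';

    // member path as stored in the tar -> suffix exposed by listStatus
    static String encode(String memberPath) {
        return SEP + memberPath.replace('/', SEP);
    }

    // exposed path -> member path inside the tar; indexOf (not
    // lastIndexOf) finds the FIRST '+', as in the commit, so member
    // paths may themselves contain '+'-encoded separators
    static String decode(String exposedPath) {
        int i = exposedPath.indexOf(SEP);
        if (i == -1)
            return null;
        return exposedPath.substring(i + 1).replace(SEP, '/');
    }

    public static void main(String[] args) {
        System.out.println(encode("dir1/dir2/file1.txt"));
        // -> +dir1+dir2+file1.txt
        System.out.println(decode("/tardemo/archive.tar/+dir1+dir2+file1.txt"));
        // -> dir1/dir2/file1.txt
    }
}
```

This also shows why the README example reads `archive.tar/+dir1+dir2+file1.txt`: the archive path keeps its real `/` separators, and only the member path is `+`-encoded.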

src/test/java/org/apache/hadoop/fs/tar/TestTarFileSystem.java

Lines changed: 42 additions & 3 deletions
```diff
@@ -3,26 +3,65 @@
 import static org.junit.Assert.assertEquals;
 
 import java.io.IOException;
+import java.io.InputStream;
+import java.io.StringWriter;
 import java.net.URISyntaxException;
 
+import org.apache.commons.io.IOUtils;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.tar.test.TarFileSystemTestFramework;
 import org.apache.hadoop.fs.tar.test.TestUtils;
 import org.junit.Test;
 
+import junit.framework.Assert;
 
 public class TestTarFileSystem extends TarFileSystemTestFramework {
 
+  private static final String SAMPLE_TEXT =
+      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
+          + "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
+          + "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris "
+          + "nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in "
+          + "reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla "
+          + "pariatur. Excepteur sint occaecat cupidatat non proident, sunt in "
+          + "culpa qui officia deserunt mollit anim id est laborum.";
+
   @Override
   protected void createTarFile() throws IOException {
-    TestUtils.createLocalTarFile(getTestTarFile(), 10);
+    TestUtils.createLocalTarFile(
+        this.getTestTarFile(), "", SAMPLE_TEXT, 10);
   }
 
   @Test
   public void testListStatus() throws IOException, URISyntaxException {
+    assertEquals(this.getTarfs().listStatus(this.getTestTarPath()).length, 10);
+  }
 
-    FileStatus[] stuses = getTarfs().listStatus(getTestTarPath());
-    assertEquals(stuses.length, 10);
+  @Test
+  public void testGetFileStatus() throws IOException, URISyntaxException {
+    final FileStatus[] stats = this.getTarfs().listStatus(this.getTestTarPath());
+    assertEquals(stats.length, 10);
+    for (int i = 0; i < stats.length; i++) {
+      Assert.assertEquals(
+          stats[i], this.getTarfs().getFileStatus(stats[i].getPath()));
+    }
   }
 
+  @Test
+  public void testRead() throws IOException, URISyntaxException {
+    final FileStatus[] stats = this.getTarfs().listStatus(this.getTestTarPath());
+    assertEquals(stats.length, 10);
+    for (int i = 0; i < stats.length; i++) {
+      InputStream in = null;
+      try {
+        System.out.println(stats[i].getPath());
+        in = this.getTarfs().open(stats[i].getPath());
+        final StringWriter writer = new StringWriter();
+        IOUtils.copy(in, writer);
+        Assert.assertEquals(SAMPLE_TEXT + i, writer.toString());
+      } finally {
+        IOUtils.closeQuietly(in);
+      }
+    }
+  }
 }
```

src/test/java/org/apache/hadoop/fs/tar/test/TarFileSystemTestFramework.java

Lines changed: 1 addition & 1 deletion
```diff
@@ -31,7 +31,7 @@ public Path getTestTarPath() {
   }
 
   @Before
-  public void getTarFs() throws IOException {
+  public void setup() throws IOException {
 
     testTarFile = File.createTempFile("/tmp/", ".tar");
     testTarPath = new Path("tar://" + testTarFile.getAbsolutePath());
```

src/test/java/org/apache/hadoop/fs/tar/test/TestUtils.java

Lines changed: 19 additions & 5 deletions
```diff
@@ -7,17 +7,18 @@
 
 import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
 import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
+import org.apache.hadoop.fs.Path;
 
 public class TestUtils {
 
   public static void createLocalTarFile(File tarFile, int count)
       throws IOException {
 
-    createLocalTarFile(tarFile, "file_", count);
+    createLocalTarFile(tarFile, "files", "Lorem", count);
   }
 
   public static void createLocalTarFile(
-      File tarFile, String filePrefix, int count)
+      File tarFile, String prefix, String message, int count)
       throws IOException {
 
     TarArchiveOutputStream tarOutput = null;
@@ -26,9 +27,22 @@ public static void createLocalTarFile(
     tarOutput = new TarArchiveOutputStream(os);
 
     for (int i = 0; i < count; i++) {
-      String msg = "Lorem";
-      byte[] bytes = msg.getBytes();
-      TarArchiveEntry entry = new TarArchiveEntry(filePrefix + i);
+      String thisMessage = message + i;
+      byte[] bytes = thisMessage.getBytes();
+      // put the i-th file in i-th level directory
+      // i.e. 2nd file is placed in prefix/dir1/dir2/file
+      // 3rd file is placed in prefix/dir1/dir2/dir3/file and so on
+      StringBuilder thisPrefix = new StringBuilder(prefix);
+      for (int j = 0; j < i; j++) {
+        thisPrefix.append(Path.SEPARATOR);
+        thisPrefix.append("dir");
+        thisPrefix.append(j);
+      }
+      thisPrefix.append(Path.SEPARATOR);
+      thisPrefix.append("file_");
+      thisPrefix.append(i);
+
+      TarArchiveEntry entry = new TarArchiveEntry(thisPrefix.toString());
       entry.setSize(bytes.length);
       tarOutput.putArchiveEntry(entry);
       tarOutput.write(bytes);
```
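The `StringBuilder` loop added to TestUtils.java nests each test file one directory deeper than the previous one, exercising the `+`-encoded sub-directory support. Its naming logic can be isolated into a standalone sketch (method name hypothetical); note that the loop numbers directories from `dir0`, even though the commit's own comment describes `dir1/dir2`:

```java
// Sketch of the nested-entry naming scheme from TestUtils.createLocalTarFile.
public class EntryNames {
    // the i-th file is placed i directories deep:
    // i=0 -> prefix/file_0, i=2 -> prefix/dir0/dir1/file_2
    static String entryName(String prefix, int i) {
        StringBuilder name = new StringBuilder(prefix);
        for (int j = 0; j < i; j++) {
            name.append('/').append("dir").append(j);
        }
        return name.append('/').append("file_").append(i).toString();
    }

    public static void main(String[] args) {
        System.out.println(entryName("files", 0)); // -> files/file_0
        System.out.println(entryName("files", 2)); // -> files/dir0/dir1/file_2
    }
}
```

Combined with the listStatus change in TarFileSystem.java, the entry `files/dir0/dir1/file_2` would be listed as `...archive.tar/+files+dir0+dir1+file_2`, which is what the new `testRead` test opens and reads back.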
