Merge pull request #4 from JDatta/pr/hadoop-2-3-0/2.0_beta

JDatta · web-flow · commit e699842a460e · 2017-01-30T15:35:37.000+05:30
Support hadoop 2.3.x
diff --git a/README.md b/README.md
@@ -1,21 +1,28 @@
-Tar FileSystem for Hadopp
+Tar FileSystem for Hadoop
 ==========================
-TAR is a widely used format for storing and distributing large collections of files such as backup images, large datasets etc. TAR is also very popular format for storing backup images, distributing large datasets etc. Many of those files could be used as an input to analytic jobs.
+
+<hr/>
+<small>
+Version: 2.0_beta
+</small>
+<hr/>
+
+TAR is a widely used format for storing backup images, distributing large datasets etc. Many of those files could be used as an input to analytic jobs.
 
 Apache Hadoop, as of now, is not TAR aware. That is, it cannot directly read a file inside a TAR, nor can it run map-reduce on those files. To run analytic jobs on a TAR, one first needs to copy it to local disk, un-TAR it, then copy the contents back to the Hadoop file system, or convert it to a sequence file or another Hadoop-aware format using a custom (Java) program. This procedure is time consuming and the user ends up with two copies of the data.
 
 By using the TarFileSystem for Hadoop, Hadoop can directly read files inside a TAR and run analytic jobs on those files. This way, no conversion/extraction is required.
 
 Building
 ---------
-Run "mvn package" inside the project directory. The TarFileSystem distribution is created as a jar file at ./target/TarFileSystem-*.jar
+Run "mvn package" inside the project directory. The TarFileSystem distribution is created as a jar file at `./target/hadoop-tarfs-2.0_beta.jar`
 
 
 Distribution and Configuration
 -------------------------------
-TAR File System binary for Hadoop is distributed as a JAR library (TarFileSystem.jar). This JAR contains the main TarFileSystem class and other supporting classes. The user needs to copy this JAR to the HADOOP_HOME/lib directory (HDFS_HOME/lib for Hadoop 2.0) or add the jar to HADOOP_CLASSPATH environment variable. 
+TAR File System binary for Hadoop is distributed as a JAR library (`hadoop-tarfs-*.jar`). This JAR contains all the classes required to support TarFileSystem. Copy this JAR to the `HADOOP_HOME/lib` directory (`HDFS_HOME/lib` for Hadoop 2.0) or add it to the `HADOOP_CLASSPATH` environment variable.
 
-Next you need to expose tar:// uri schema to Hadoop by adding the following property in HADOOP_CONF_DIR/core-site.xml
+Next, expose the `tar://` URI scheme to Hadoop by adding the following property to the `HADOOP_CONF_DIR/core-site.xml` file.
 
 	<property>
 	  <name>fs.tar.impl</name>
@@ -24,7 +31,7 @@ Next you need to expose tar:// uri schema to Hadoop by adding the following prop
 
 ### Optional Configuration:
 
-By default, TarFileSystem creates an .index file in the same directory where the tar file resides. Index writing may fail if you do not have sufficient permission in that directory. In that case you may specify a temporary directory where you have write permission and tell TarFileSystem to use that directory instead. You may specify the following property in core-site.xml for this:
+By default, TarFileSystem creates an `.index` file in the same directory where the tar file resides. Index writing may fail if you do not have sufficient permission in that directory. In that case you may specify a temporary directory where you have write permission and tell TarFileSystem to use that directory instead. You may specify the following property in core-site.xml for this:
 
 	<property>
 	  <name>tarfs.tmp.dir</name>
@@ -39,30 +46,29 @@ Hadoop can access a TAR archive using TAR URI SCHEMA (URI starting with tar://).
 
 Following is a TAR inside Hadoop file System
 
-	[jd@morpheus hadoop-1.0.3]$ bin/hadoop fs -ls /tardemo/archive.tar ↲
+	[jd@node1 ~]$ bin/hadoop fs -ls /tardemo/archive.tar ↲
 	Found 1 items
 	-rw-r--r--   1 jd supergroup    1751040 2013-07-15 20:30 /tardemo/archive.tar
 
 To access files inside this tar, simply prepend tar:// to its path to make it a TAR File System URI
 
-	[jd@morpheus hadoop-1.0.3]$ bin/hadoop fs -ls tar:///tardemo/archive.tar ↲
+	[jd@node1 ~]$ bin/hadoop fs -ls tar:///tardemo/archive.tar ↲
 	13/07/15 20:33:04 INFO tar.TarFileSystem: *** Using Tar file system ***
 	Found 3 items
-	-rw-rw-r--   1 jd jd     502760 2013-07-15 20:27 /tardemo/archive.tar+data/file2.txt
-	-rw-rw-r--   1 jd jd     594933 2013-07-15 20:26 /tardemo/archive.tar+data/file1.txt
-	-rw-rw-r--   1 jd jd     641720 2013-07-15 20:27 /tardemo/archive.tar+data/file3.txt
+	-rw-rw-r--   1 jd jd     502760 2013-07-15 20:27 /tardemo/archive.tar+/data+file2.txt
+	-rw-rw-r--   1 jd jd     594933 2013-07-15 20:26 /tardemo/archive.tar+/data+file1.txt
+	-rw-rw-r--   1 jd jd     641720 2013-07-15 20:27 /tardemo/archive.tar+/data+file3.txt
 
-To access a file inside a TAR archive, append the name of the file after the TAR URI using a ‘+’ sign
+To access a file inside a TAR archive, append the name of the file after the TAR URI using a ‘+’ sign. Directory separators within the archive are also written as ‘+’. For example, if the file is at `dir1/dir2/file1.txt` within the tar archive, use the following path to read it.
 
-	[jd@morpheus hadoop-1.0.3]$ bin/hadoop fs -cat tar://hdfs-localhost:54310/tardemo/archive.tar+data/file1.txt ↲
+	[jd@node1 ~]$ bin/hadoop fs -cat tar://hdfs-localhost:54310/tardemo/archive.tar/+dir1+dir2+file1.txt ↲
 	13/07/15 20:38:35 INFO tar.TarFileSystem: *** Using Tar file system ***
 	This is the file content.
 	[...]
 
 In TAR File System, the TAR archive is modeled like a directory and all the files inside a TAR are modeled like files within a directory. One can run mapreduce jobs on files within a TAR archive just like they do it on normal files.
 
-	[jd@morpheus hadoop-1.0.3]$ bin/hadoop jar hadoop*examples*.jar wordcount ↲ 
-	tar:///tardemo/archive.tar wc_out
+	[jd@node1 ~]$ bin/hadoop jar hadoop*examples*.jar wordcount tar:///tardemo/archive.tar wc_out ↲ 
 	13/07/15 20:43:05 INFO tar.TarFileSystem: *** Using Tar file system ***
 	13/07/15 20:43:05 INFO input.FileInputFormat: Total input paths to process : 3
 	13/07/15 20:43:05 INFO mapred.JobClient: Running job: job_201307151954_0001
@@ -71,4 +77,5 @@ In TAR File System, the TAR archive is modeled like a directory and all the file
 
 # TO DO
   1. Implement efficient seek in SeekableTarInputStream
+  2. Support compressed TAR archives
 
diff --git a/pom.xml b/pom.xml
@@ -1,33 +1,30 @@
 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
-	<modelVersion>4.0.0</modelVersion>
-	<groupId>org.apache.hadoop</groupId>
-	<artifactId>TarFileSystem</artifactId>
-	<packaging>jar</packaging>
-	<version>1.0</version>
-	<name>TarFileSystem</name>
-	<url>http://maven.apache.org</url>
-	
-	<dependencies>
-		<dependency>
-			<groupId>org.apache.commons</groupId>
-			<artifactId>commons-compress</artifactId>
-			<version>1.5</version>
-		</dependency>
+  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+  <groupId>org.apache.hadoop</groupId>
+  <artifactId>hadoop-tarfs</artifactId>
+  <packaging>jar</packaging>
+  <version>2.0_beta</version>
+  <name>hadoop-tarfs</name>
+  <url>http://maven.apache.org</url>
 
-		<dependency>
-			<groupId>org.apache.hadoop</groupId>
-			<artifactId>hadoop-core</artifactId>
-			<version>1.0.3</version>
-		</dependency>
-		
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.commons</groupId>
+      <artifactId>commons-compress</artifactId>
+      <version>1.5</version>
+    </dependency>
+    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>2.3.0</version>
+    </dependency>
     <dependency>
       <groupId>junit</groupId>
       <artifactId>junit</artifactId>
       <version>4.8.1</version>
       <scope>test</scope>
     </dependency>
-		
-	</dependencies>
-
+  </dependencies>
 </project>
diff --git a/src/main/java/org/apache/hadoop/fs/tar/TarFileSystem.java b/src/main/java/org/apache/hadoop/fs/tar/TarFileSystem.java
@@ -59,8 +59,9 @@ public class TarFileSystem extends FileSystem {
   private FileSystem underlyingFS = null;
   private Path workingDir;
 
-  private static final String TAR_URLPREFIX = "tar://";
+  private static final String TAR_URLPREFIX = "tar:/";
   private static final char TAR_INFILESEP = '+';
+  private static final String TAR_INFILESEP_STR = "\\+";
 
   public static final Log LOG = LogFactory.getLog(TarFileSystem.class);
 
@@ -97,10 +98,11 @@ public URI getUri() {
 
   private String getFileInArchive(Path tarPath) {
     String fullUri = tarPath.toUri().toString();
-    int i = fullUri.lastIndexOf(TAR_INFILESEP);
+    int i = fullUri.indexOf(TAR_INFILESEP);
     if (i == -1)
       return null;
-    return fullUri.substring(i + 1);
+    return fullUri.substring(i + 1)
+      .replaceAll(TAR_INFILESEP_STR, Path.SEPARATOR);
   }
 
   /**
@@ -123,7 +125,7 @@ private Path getBaseTarPath(Path tarPath) {
     // form the path component
     String basePath = uri.getPath();
     // strip the part containing inFile name
-    int lastPlusIndex = basePath.lastIndexOf(TAR_INFILESEP);
+    int lastPlusIndex = basePath.indexOf(TAR_INFILESEP);
     if (lastPlusIndex != -1)
       basePath = basePath.substring(0, lastPlusIndex);
 
@@ -226,10 +228,11 @@ public FileStatus[] listStatus(Path f) throws IOException {
             entry.getUserName(),
             entry.getGroupName(),
             new Path(
-                abs.toUri().toASCIIString()
-                    + TAR_INFILESEP + entry.getName()
-            )
-            );
+              abs.toUri().toASCIIString()
+                + Path.SEPARATOR
+                + TAR_INFILESEP
+                + entry.getName()
+                  .replaceAll(Path.SEPARATOR, TAR_INFILESEP_STR)));
         ret.add(fstatus);
       }
     }
diff --git a/src/test/java/org/apache/hadoop/fs/tar/TestTarFileSystem.java b/src/test/java/org/apache/hadoop/fs/tar/TestTarFileSystem.java
@@ -3,26 +3,65 @@
 import static org.junit.Assert.assertEquals;
 
 import java.io.IOException;
+import java.io.InputStream;
+import java.io.StringWriter;
 import java.net.URISyntaxException;
 
+import org.apache.commons.io.IOUtils;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.tar.test.TarFileSystemTestFramework;
 import org.apache.hadoop.fs.tar.test.TestUtils;
 import org.junit.Test;
 
+import junit.framework.Assert;
 
 public class TestTarFileSystem extends TarFileSystemTestFramework {
 
+  private static final String SAMPLE_TEXT =
+    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
+      + "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
+      + "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris "
+      + "nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in "
+      + "reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla "
+      + "pariatur. Excepteur sint occaecat cupidatat non proident, sunt in "
+      + "culpa qui officia deserunt mollit anim id est laborum.";
+
   @Override
   protected void createTarFile() throws IOException {
-    TestUtils.createLocalTarFile(getTestTarFile(), 10);
+    TestUtils.createLocalTarFile(
+      this.getTestTarFile(), "", SAMPLE_TEXT, 10);
   }
 
   @Test
   public void testListStatus() throws IOException, URISyntaxException {
+    assertEquals(this.getTarfs().listStatus(this.getTestTarPath()).length, 10);
+  }
 
-    FileStatus[] stuses = getTarfs().listStatus(getTestTarPath());
-    assertEquals(stuses.length, 10);
+  @Test
+  public void testGetFileStatus() throws IOException, URISyntaxException {
+    final FileStatus[] stats = this.getTarfs().listStatus(this.getTestTarPath());
+    assertEquals(stats.length, 10);
+    for (int i = 0; i < stats.length; i++) {
+      Assert.assertEquals(
+        stats[i], this.getTarfs().getFileStatus(stats[i].getPath()));
+    }
   }
 
+  @Test
+  public void testRead() throws IOException, URISyntaxException {
+    final FileStatus[] stats = this.getTarfs().listStatus(this.getTestTarPath());
+    assertEquals(stats.length, 10);
+    for (int i = 0; i < stats.length; i++) {
+      InputStream in = null;
+      try {
+        System.out.println(stats[i].getPath());
+        in = this.getTarfs().open(stats[i].getPath());
+        final StringWriter writer = new StringWriter();
+        IOUtils.copy(in, writer);
+        Assert.assertEquals(SAMPLE_TEXT + i, writer.toString());
+      } finally {
+        IOUtils.closeQuietly(in);
+      }
+    }
+  }
 }
diff --git a/src/test/java/org/apache/hadoop/fs/tar/test/TarFileSystemTestFramework.java b/src/test/java/org/apache/hadoop/fs/tar/test/TarFileSystemTestFramework.java
@@ -31,7 +31,7 @@ public Path getTestTarPath() {
   }
 
   @Before
-  public void getTarFs() throws IOException {
+  public void setup() throws IOException {
 
     testTarFile = File.createTempFile("/tmp/", ".tar");
     testTarPath = new Path("tar://"+testTarFile.getAbsolutePath());
diff --git a/src/test/java/org/apache/hadoop/fs/tar/test/TestUtils.java b/src/test/java/org/apache/hadoop/fs/tar/test/TestUtils.java
@@ -7,17 +7,18 @@
 
 import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
 import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
+import org.apache.hadoop.fs.Path;
 
 public class TestUtils {
 
   public static void createLocalTarFile(File tarFile, int count)
       throws IOException {
 
-    createLocalTarFile(tarFile, "file_", count);
+    createLocalTarFile(tarFile, "files", "Lorem", count);
   }
 
   public static void createLocalTarFile(
-      File tarFile, String filePrefix, int count)
+    File tarFile, String prefix, String message, int count)
           throws IOException {
 
     TarArchiveOutputStream tarOutput = null;
@@ -26,9 +27,22 @@ public static void createLocalTarFile(
       tarOutput = new TarArchiveOutputStream(os);
 
       for (int i = 0; i < count; i++) {
-        String msg = "Lorem";
-        byte[] bytes = msg.getBytes();
-        TarArchiveEntry entry = new TarArchiveEntry(filePrefix + i);
+        String thisMessage = message + i;
+        byte[] bytes = thisMessage.getBytes();
+        // nest the i-th file i directories deep (directories are numbered from 0),
+        // e.g. i = 2 is placed in prefix/dir0/dir1/file_2
+        // and i = 3 in prefix/dir0/dir1/dir2/file_3, and so on
+        StringBuilder thisPrefix = new StringBuilder(prefix);
+        for (int j = 0; j < i; j++) {
+          thisPrefix.append(Path.SEPARATOR);
+          thisPrefix.append("dir");
+          thisPrefix.append(j);
+        }
+        thisPrefix.append(Path.SEPARATOR);
+        thisPrefix.append("file_");
+        thisPrefix.append(i);
+
+        TarArchiveEntry entry = new TarArchiveEntry(thisPrefix.toString());
         entry.setSize(bytes.length);
         tarOutput.putArchiveEntry(entry);
         tarOutput.write(bytes);
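
Review note: the `+` path mapping changed in `TarFileSystem.java` above can be tried in isolation. The following is a minimal sketch, not the project's code: the class `TarPathSketch` and its method names are hypothetical, it works on plain strings rather than Hadoop `Path` objects, and the trailing-slash trimming is an assumption standing in for the normalization that `Path` would perform. It splits at the first `+`, as the new `indexOf` calls do, and treats every later `+` as a directory separator.

```java
public class TarPathSketch {
    // Separator used inside tar:// paths; mirrors TAR_INFILESEP in the diff.
    private static final char SEP = '+';

    // Build the path reported for an archive entry:
    // <tar path> + "/" + "+" + <entry name with '/' written as '+'>.
    public static String toListingPath(String tarPath, String entryName) {
        return tarPath + "/" + SEP + entryName.replace('/', SEP);
    }

    // Everything before the first '+' names the archive itself.
    public static String baseTarPath(String path) {
        int i = path.indexOf(SEP);
        String base = (i == -1) ? path : path.substring(0, i);
        // Assumption: trim the trailing '/' left over from "archive.tar/+...".
        while (base.length() > 1 && base.endsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }
        return base;
    }

    // Everything after the first '+' is the entry name, with '+' mapped back to '/'.
    public static String fileInArchive(String path) {
        int i = path.indexOf(SEP);
        if (i == -1) {
            return null; // the path refers to the archive, not to an entry
        }
        return path.substring(i + 1).replace(SEP, '/');
    }

    public static void main(String[] args) {
        String listed = toListingPath("/tardemo/archive.tar", "dir1/dir2/file1.txt");
        System.out.println(listed);
        System.out.println(baseTarPath(listed));
        System.out.println(fileInArchive(listed));
    }
}
```

One consequence of splitting at the first `+` is that an archive path or entry name that itself contains a `+` cannot be told apart from a separator in this scheme; the sketch makes no attempt to escape such names.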
