Nashorn Java to JavaScript interoperability issues
This page documents, and discusses possible solutions for, shortcomings of the Java/JavaScript interoperability in Java 8.

Extending a Spark lambda function: Nashorn does not support extending a Java type that extends java.io.Serializable

Spark processing is accomplished by providing lambda functions to Spark classes, for example RDD:
```java
JavaRDD<String> complete_ratings_data = complete_ratings_raw_data.filter(new Function<String, Boolean>() {
    public Boolean call(String line) {
        if (line.equals(complete_ratings_raw_data_header)) {
            return false;
        } else {
            return true;
        }
    }
});
```
In the example above we are implementing the org.apache.spark.api.java.function.Function interface. The Nashorn documentation states that we can implement/extend a Java class in either of two ways:
```js
// This syntax is primarily used to support anonymous class-like syntax for
// Java interface implementation as shown below.
var r = new java.lang.Runnable() {
    run: function() { print("run"); }
}
```
or
```js
var ArrayList = Java.type("java.util.ArrayList")
var ArrayListExtender = Java.extend(ArrayList)
var printSizeInvokedArrayList = new ArrayListExtender() {
    size: function() { print("size invoked!"); }
}
```
So for org.apache.spark.api.java.function.Function we would have code like:
```js
var jsFunc = new org.apache.spark.api.java.function.Function() {
    call: function(line) {
        return line != "userId,movieId,rating,timestamp"; // complete_ratings_raw_data_header
    }
}
var xx = complete_ratings_raw_data_JavaObj.filter(jsFunc);
```
or
```js
var sparkFunction = Java.type("org.apache.spark.api.java.function.Function")
var sparkFunctionExtender = Java.extend(sparkFunction)
var boolFunctionExtender = new sparkFunctionExtender() {
    call: function(line) {
        return line != "userId,movieId,rating,timestamp"; // complete_ratings_raw_data_header
    }
}
var xx = complete_ratings_raw_data_JavaObj.filter(boolFunctionExtender);
```
Either approach will throw the exception:

```
Exception in thread "main" java.lang.RuntimeException: org.apache.spark.SparkException: Task not serializable
```
Nashorn parseInt returns a java.lang.Double
```js
var x = parseInt("3");
print(x.getClass()); // prints java.lang.Double
```
This causes ClassCastExceptions when running many MLlib classes. One way to fix this would be to add an RDD.map() at every place we run into a ClassCastException and use java.lang.Integer.parseInt() to ensure we have the correct integer types. But a better solution is to just "monkey patch" Nashorn's implementation of parseInt:
```js
/**
 * We need to replace Nashorn's implementation of parseInt because it returns
 * a java.lang.Double. Why, you ask? That is a good question!
 * Anyway, this really messes up Spark, as we need parseInt to return a
 * java.lang.Integer, so we will replace it globally with an implementation
 * that works for Spark.
 * @param string
 * @param radix
 * @returns {Number}
 * @private
 */
parseInt = function(string, radix) {
    var val = NaN;
    try {
        if (radix) {
            val = java.lang.Integer.parseInt(string, radix);
        } else {
            val = java.lang.Integer.parseInt(string);
        }
    } catch (e) {
        // bad parseInt value
    }
    return val;
};
```
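The same patch can be written so that the snippet can also be exercised outside Nashorn (for example when testing a script under Node.js, where the java package object does not exist). The guarded variant below is our own sketch, and the name patchedParseInt is hypothetical, not part of the original patch:

```javascript
// Guarded variant of the parseInt patch (hypothetical name patchedParseInt).
// Under Nashorn it delegates to java.lang.Integer.parseInt, so the result is
// a java.lang.Integer; on any other engine it falls back to native parseInt.
var patchedParseInt = function(string, radix) {
    if (typeof java !== "undefined" && java.lang && java.lang.Integer) {
        try {
            return radix ? java.lang.Integer.parseInt(string, radix)
                         : java.lang.Integer.parseInt(string);
        } catch (e) {
            return NaN; // bad parseInt value
        }
    }
    // Fallback for non-Nashorn engines: native parseInt returns a JS number.
    return radix ? parseInt(string, radix) : parseInt(string);
};
```

Assigning this function to the global parseInt, as the patch above does, replaces the built-in everywhere in the script engine.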
Converting arrays from JavaScript to Java is not automatic. If we have a JavaScript array that should become an int[] we must do

```js
ret = Java.to(l, "int[]");
```

for a double[]

```js
ret = Java.to(l, "double[]");
```

and for an Object[]

```js
ret = Java.to(l);
```

And to go the other way, from a Java array to a JavaScript array, we need to call

```js
var keys = Java.from(javaObj.keySet().toArray());
```
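These conversions can be collected in one place. The helpers below are our own sketch (toJava and fromJava are not EclairJS or Nashorn names): under Nashorn they call Java.to and Java.from, and on an engine without the Java object (such as Node.js, used here only to illustrate the call shape) they simply pass the array through:

```javascript
// Hypothetical convenience wrappers around Nashorn's Java.to / Java.from.
// typeName is a Java array type string such as "int[]" or "double[]";
// omit it to convert to Object[].
function toJava(jsArray, typeName) {
    if (typeof Java !== "undefined" && Java.to) {
        return typeName ? Java.to(jsArray, typeName) : Java.to(jsArray);
    }
    return jsArray; // no Java interop available; pass through unchanged
}

function fromJava(javaArray) {
    if (typeof Java !== "undefined" && Java.from) {
        return Java.from(javaArray);
    }
    return javaArray; // already a JS array outside Nashorn
}
```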
We have noticed that looping through an array with an indexed for loop is faster than passing a callback function to process the array:
```js
a.forEach(function(x) {
    args.push(x);
});
```
takes longer than
```js
for (var i = 1; i < arguments.length; i++) {
    args.push(Serialize.javaToJs(arguments[i]));
}
```
The above code snippet is from Utils_invoke, so it runs every time we set up to call a user's lambda function. I suspect the issue is the cost of setting up the stack frame for the anonymous callback; this would not be significant if it happened only once, but with large datasets like MovieLens (100M) the time adds up.
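A rough way to check the callback-versus-loop claim is a standalone micro-benchmark (a sketch of ours, not the EclairJS Utils_invoke code; absolute timings vary by engine and data size, so only the relative numbers are interesting):

```javascript
// Compare forEach-with-callback against an indexed for loop on the same data.
// log falls back to console.log when Nashorn's print() is not available.
var log = (typeof print === "function") ? print : console.log;

var data = [];
for (var i = 0; i < 1000000; i++) {
    data.push(i);
}

var t0 = Date.now();
var viaForEach = [];
data.forEach(function(x) {
    viaForEach.push(x);
});
var t1 = Date.now();

var viaLoop = [];
for (var j = 0; j < data.length; j++) {
    viaLoop.push(data[j]);
}
var t2 = Date.now();

log("forEach: " + (t1 - t0) + " ms, for loop: " + (t2 - t1) + " ms");
```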
Running the lambda functions in Nashorn has a cost. Consider code that just loads a large dataset and filters it, removing a single header string:
```js
var obj = complete_ratings_raw_data.getJavaObject();
var start = new Date().getTime();
var complete_ratings_data = obj.filter(new org.eclairjs.nashorn.JSFunctionTest());
print("There are recommendations in the complete dataset: " + complete_ratings_data.count());
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time + " milliseconds");
```
Running the filter in Java with
```java
package org.eclairjs.nashorn;

import org.apache.spark.api.java.function.Function;

public class JSFunctionTest implements Function {

    public JSFunctionTest() {
    }

    @SuppressWarnings({ "null", "unchecked" })
    @Override
    public Object call(Object l) {
        String line = (String) l;
        if (line.equals("userId,movieId,rating,timestamp")) {
            return false;
        } else {
            return true;
        }
    }
}
```
gives us a time of

```
There are recommendations in the complete dataset: 22884377
Execution time: 2379 milliseconds
```
Changing the lambda to run in Nashorn using
```java
package org.eclairjs.nashorn;

import org.apache.spark.api.java.function.Function;

import javax.script.Invocable;
import javax.script.ScriptEngine;

public class JSFunctionTest2 implements Function {

    private Object fn = null;

    public JSFunctionTest2() {
    }

    @SuppressWarnings({ "null", "unchecked" })
    @Override
    public Object call(Object l) throws Exception {
        ScriptEngine e = NashornEngineSingleton.getEngine();
        if (this.fn == null) {
            String func = "function myTestFunc(line) { return line != \"userId,movieId,rating,timestamp\";}";
            this.fn = e.eval(func);
        }
        Invocable invocable = (Invocable) e;
        return invocable.invokeFunction("myTestFunc", l);
    }
}
```
gives us a time of:

```
There are recommendations in the complete dataset: 22884378
Execution time: 8372 milliseconds
```
So just using Nashorn to run the equivalent JavaScript code, without our serialization, cost us 60 milliseconds.