The tokenizer package tokenizes Java and C/C++ source code at either the line level or the file level.
- UTFTokenizer: based on the tokenizer created for Terrier (see: UTF Tokeniser).
Every non-alphanumeric character acts as a delimiter; consecutive delimiters are kept together as one token, and only space delimiters are removed.
Available at line and file level.
- UTFWocTokenizer: same as UTFTokenizer, except that comments are removed (available only at file level).
- JavaLemmeTokenizer: tokenizer based on the lemmatization of Java code performed by JavaParser; available at file level and at line level.
- JavaLemmeWocTokenizer: same as JavaLemmeTokenizer, except that comments are removed; available only at file level.
- DepthFirst: tokenizes according to the AST generated by JavaParser, walking the tree depth first; each token corresponds to the text serialization of a node.
- BreadthFirst: tokenizes according to the AST generated by JavaParser, walking the tree breadth first; each token corresponds to the text serialization of a node.
- DepthFirstPruned: same as DepthFirst, except that the intermediate nodes of the tree are removed.
- BreadthFirstPruned: same as BreadthFirst, except that the intermediate nodes of the tree are removed.
- CPPLemmeTokenizer: tokenizer based on the lemmatization of C++ code performed by the ANTLR CPP14 parser; available at file level and at line level.
- CPPASTTokenizer: tokenizes according to the AST generated by Joern, walking the tree depth first; each token corresponds to the text serialization of a node.
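To illustrate how the depth-first, breadth-first and pruned variants differ in token order, here is a minimal sketch on a toy tree. It does not use JavaParser or Joern: `ToyNode` and `TraversalSketch` are hypothetical names introduced only for this example, each node simply carries its own text, and "pruned" is assumed to mean that only leaf nodes are emitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy tree node; a hypothetical stand-in for a parser's AST node. */
class ToyNode {
    final String text;
    final List<ToyNode> children;

    ToyNode(String text, ToyNode... kids) {
        this.text = text;
        this.children = new ArrayList<>(Arrays.asList(kids));
    }

    boolean isLeaf() {
        return children.isEmpty();
    }
}

class TraversalSketch {

    /** Pre-order depth-first walk; with pruned=true only leaves are emitted (assumption). */
    static List<String> depthFirst(ToyNode root, boolean pruned) {
        List<String> tokens = new ArrayList<>();
        Deque<ToyNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            ToyNode node = stack.pop();
            if (!pruned || node.isLeaf()) {
                tokens.add(node.text);
            }
            // Push children right to left so the leftmost child is visited first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return tokens;
    }

    /** Level-order (breadth-first) walk; with pruned=true only leaves are emitted (assumption). */
    static List<String> breadthFirst(ToyNode root, boolean pruned) {
        List<String> tokens = new ArrayList<>();
        Deque<ToyNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            ToyNode node = queue.remove();
            if (!pruned || node.isLeaf()) {
                tokens.add(node.text);
            }
            queue.addAll(node.children);
        }
        return tokens;
    }

    /** Toy tree for the expression "(a + b) * c". */
    static ToyNode sample() {
        return new ToyNode("(a + b) * c",
                new ToyNode("a + b", new ToyNode("a"), new ToyNode("b")),
                new ToyNode("c"));
    }

    public static void main(String[] args) {
        ToyNode root = sample();
        System.out.println(depthFirst(root, false));   // [(a + b) * c, a + b, a, b, c]
        System.out.println(breadthFirst(root, false)); // [(a + b) * c, a + b, c, a, b]
        System.out.println(depthFirst(root, true));    // [a, b, c]
        System.out.println(breadthFirst(root, true));  // [c, a, b]
    }
}
```

In the unpruned variants the intermediate node "a + b" appears as a token of its own, which is why the pruned variants yield shorter sequences.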
Dependencies are handled through Maven. All of them are downloaded automatically except joern-antlr, which must be installed first and can be found at this link
All tokenizers implement the AbstractTokenizer interface, which exposes the scope of the tokenizer and its type:
/**
 * AbstractTokenizer interface
 */
public interface AbstractTokenizer {
    /**
     * Scope of the tokenizer, lines or files
     * @return the Scope
     */
    Scope getScope();

    /**
     * Type of the Tokenizer
     * @return the type of tokenizer
     */
    String getType();
}

Then, depending on whether it is a file-level or a line-level tokenizer, a tokenizer extends either
AbstractLineTokenizer
package tokenizer.line;

import tokenizer.AbstractTokenizer;
import tokenizer.Scope;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public abstract class AbstractLineTokenizer implements AbstractTokenizer {
    /**
     * Tokenize
     */
    /**
     * Method to tokenize a reader
     *
     * @param reader to use
     * @return an iterable of lines, each line being an iterable of tokens
     * @throws IOException in case of exception from the reader
     */
    public abstract Iterable<Iterable<String>> tokenize(Reader reader) throws IOException;

    /**
     * Method to tokenize a string
     *
     * @param s string to tokenize
     * @return an iterable of lines, each line being an iterable of tokens
     * @throws IOException in case of exception from the reader
     */
    public Iterable<Iterable<String>> tokenize(String s) throws IOException {
        Reader r = new StringReader(s);
        Iterable<Iterable<String>> result = tokenize(r);
        r.close();
        return result;
    }

    @Override
    public Scope getScope() {
        return Scope.LINE;
    }
}

or AbstractFileTokenizer
package tokenizer.file;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import tokenizer.AbstractTokenizer;
import tokenizer.Scope;
import tokenizer.file.java.exception.UnparsableException;

public abstract class AbstractFileTokenizer implements AbstractTokenizer {
    /**
     * Tokenize
     */
    /**
     * Method to tokenize a reader
     *
     * @param reader to use
     * @return an iterable of tokens to which all registered preprocessors have been applied
     * @throws IOException in case of reader exception
     * @throws UnparsableException if the content of reader could not be parsed
     */
    public abstract Iterable<String> tokenize(Reader reader) throws IOException, UnparsableException;

    /**
     * Method to tokenize a string
     *
     * @param s string to tokenize
     * @return an iterable of tokens to which all registered preprocessors have been applied
     * @throws IOException in case of reader exception
     * @throws UnparsableException if the content of s could not be parsed
     */
    public Iterable<String> tokenize(String s) throws IOException, UnparsableException {
        Reader r = new StringReader(s);
        // Typed as Iterable<String> rather than the raw Iterable to avoid an unchecked conversion.
        Iterable<String> result = tokenize(r);
        r.close();
        return result;
    }

    @Override
    public Scope getScope() {
        return Scope.FILE;
    }
}

Both provide methods to tokenize either a String or a Reader, but they differ in their output: a file tokenizer returns an Iterable<String>, whereas a line tokenizer returns an Iterable<Iterable<String>>.
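The difference between the two output shapes can be sketched without the package itself. The example below is only an illustration under stated assumptions: `DelimiterRuleSketch`, `fileLevel` and `lineLevel` are hypothetical names, not part of the tokenizer package, and the splitting logic is one possible reading of the UTFTokenizer rule (alphanumeric runs and runs of consecutive delimiters each form a token, whitespace is dropped). It returns `List` values, which are `Iterable`s.

```java
import java.util.ArrayList;
import java.util.List;

class DelimiterRuleSketch {

    /** File-level shape: one flat sequence of tokens for the whole input. */
    static List<String> fileLevel(String source) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean currentIsWord = false;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            boolean isWord = Character.isLetterOrDigit(c);
            // Close the running token on whitespace, or when switching
            // between alphanumeric characters and delimiter characters.
            if (Character.isWhitespace(c) || (current.length() > 0 && isWord != currentIsWord)) {
                flush(current, tokens);
            }
            if (!Character.isWhitespace(c)) {
                current.append(c);
                currentIsWord = isWord;
            }
        }
        flush(current, tokens);
        return tokens;
    }

    /** Line-level shape: one sequence of tokens per input line. */
    static List<List<String>> lineLevel(String source) {
        List<List<String>> lines = new ArrayList<>();
        for (String line : source.split("\\R")) {
            lines.add(fileLevel(line));
        }
        return lines;
    }

    /** Moves the running token, if any, into the token list. */
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String source = "int a=b+1;\ni++;";
        System.out.println(fileLevel(source)); // [int, a, =, b, +, 1, ;, i, ++;]
        System.out.println(lineLevel(source)); // [[int, a, =, b, +, 1, ;], [i, ++;]]
    }
}
```

The file-level result loses line boundaries, while the line-level result keeps one inner sequence per source line, which matters when tokens need to be mapped back to line numbers.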
To obtain a tokenizer, several factories are provided:
CPPFileTokenizerFactory JavaFileTokenizerFactory
CPPLineTokenizerFactory JavaLineTokenizer
Once you have chosen a factory, call the method for the tokenizer you want, for example:
JavaFileTokenizerFactory.lemmeTokenizer();

Third-party dependencies and their licenses:
- JavaParser (LGPL)
- Joern (LGPL)
- ANTLR4 (BSD)