The tokenizer package

The tokenizer package tokenizes Java and C/C++ source code, either at the line level or at the file level.

What are the different tokenizers provided?

Generic

  • UTFTokenizer: based on the tokenizer created for Terrier, see: UTF Tokeniser

This tokenizer treats every non-alphanumeric character as a delimiter. Consecutive delimiters are kept together as a single token, and only whitespace delimiters are removed.

Available at line and file level.
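The delimiter rule above can be illustrated with a small sketch. This is not the UTFTokenizer implementation: the class name `DelimiterRunSketch` is hypothetical, and the code reflects only one reading of the description (runs of letters or digits form a token, runs of other non-whitespace characters form a token, whitespace is dropped).

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterRunSketch {
    // The kind of character run currently being accumulated.
    private enum Kind { NONE, WORD, DELIMITER }

    /** Splits one line into word runs and delimiter runs, dropping whitespace. */
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Kind kind = Kind.NONE;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            Kind next;
            if (Character.isWhitespace(c)) {
                next = Kind.NONE;          // whitespace only separates tokens
            } else if (Character.isLetterOrDigit(c)) {
                next = Kind.WORD;
            } else {
                next = Kind.DELIMITER;     // e.g. '+', '=', ';'
            }
            // A change of kind ends the current run.
            if (next != kind && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (next != Kind.NONE) {
                current.append(c);
            }
            kind = next;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Under these assumptions, `DelimiterRunSketch.tokenize("int x+=1;")` yields the tokens `int`, `x`, `+=`, `1` and `;`. The real tokenizer may treat some characters differently.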

Java

Code Based

  • UTFWocTokenizer: same as UTFTokenizer, except that comments are removed (available only at the file level)
  • JavaLemmeTokenizer: tokenizer based on the lemmatization of Java code performed by Java Parser (available at the file level and at the line level)
  • JavaLemmeWocTokenizer: same as JavaLemmeTokenizer, except that comments are removed (available only at the file level)

AST Based (only available at the file level)

  • DepthFirst: tokenizes according to the AST generated by JavaParser, traversing the tree depth first; each token corresponds to the text serialization of a node
  • BreadthFirst: tokenizes according to the AST generated by JavaParser, traversing the tree breadth first; each token corresponds to the text serialization of a node
  • DepthFirstPruned: same as DepthFirst, except that the intermediate nodes of the tree are removed
  • BreadthFirstPruned: same as BreadthFirst, except that the intermediate nodes of the tree are removed
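To make the difference between the four traversals concrete, here is a minimal sketch on a toy tree. It does not use JavaParser: `TraversalSketch` and its `Node` class are hypothetical stand-ins, and "pruned" is assumed to mean that only leaf nodes are emitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class TraversalSketch {
    /** Hypothetical stand-in for an AST node; not a JavaParser type. */
    public static final class Node {
        final String text;
        final List<Node> children;

        public Node(String text, Node... children) {
            this.text = text;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    /** Pre-order depth-first traversal; with pruned=true only leaves are kept. */
    public static List<String> depthFirst(Node root, boolean pruned) {
        List<String> tokens = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (!pruned || node.isLeaf()) {
                tokens.add(node.text);
            }
            // Push children right to left so the leftmost child is visited first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return tokens;
    }

    /** Level-by-level traversal; with pruned=true only leaves are kept. */
    public static List<String> breadthFirst(Node root, boolean pruned) {
        List<String> tokens = new ArrayList<>();
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            if (!pruned || node.isLeaf()) {
                tokens.add(node.text);
            }
            queue.addAll(node.children);
        }
        return tokens;
    }
}
```

For a tree `root` with children `A` (which has one child `A1`) and `B`, depth first gives `root, A, A1, B` and breadth first gives `root, A, B, A1`. The pruned variants give `A1, B` and `B, A1` respectively.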

C/C++

Code Based

  • CPPLemmeTokenizer: tokenizer based on the lemmatization of C++ code performed by the ANTLR CPP14 parser (available at the file level and at the line level)

AST Based (only available at the file level)

  • CPPASTTokenizer: tokenizes according to the AST generated by Joern, traversing the tree depth first; each token corresponds to the text serialization of a node

Requirements

Dependencies are managed through Maven. All of them are downloaded automatically except joern-antlr, which must be installed first and can be found at this link

Architecture

All tokenizers implement the AbstractTokenizer interface, which exposes the scope of the tokenizer (line or file) and its type:

/**
 * AbstractTokenizer interface
 */
public interface AbstractTokenizer {

    /**
     * Scope of the tokenizer, lines or files
     * @return the Scope
     */
    Scope getScope();

    /**
     * Type of the Tokenizer
     * @return the type of tokenizer
     */
    String getType();
}

Then, depending on whether it is a line-level or a file-level tokenizer, it extends either

AbstractLineTokenizer

package tokenizer.line;

import tokenizer.AbstractTokenizer;
import tokenizer.Scope;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public abstract class AbstractLineTokenizer implements AbstractTokenizer {
    /**
     * Tokenizes the content of a reader.
     *
     * @param reader the reader to tokenize
     * @return the tokens grouped by line: one inner Iterable per line
     * @throws IOException if the reader fails
     */
    public abstract Iterable<Iterable<String>> tokenize(Reader reader) throws IOException;

    /**
     * Tokenizes a string.
     *
     * @param s the string to tokenize
     * @return the tokens grouped by line: one inner Iterable per line
     * @throws IOException if reading the string fails
     */
    public Iterable<Iterable<String>> tokenize(String s) throws IOException {
        // try-with-resources closes the reader even if tokenize(Reader) throws
        try (Reader r = new StringReader(s)) {
            return tokenize(r);
        }
    }

    @Override
    public Scope getScope() {
        return Scope.LINE;
    }

}

or AbstractFileTokenizer

package tokenizer.file;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import tokenizer.AbstractTokenizer;
import tokenizer.Scope;
import tokenizer.file.java.exception.UnparsableException;


public abstract class AbstractFileTokenizer implements AbstractTokenizer {

    /**
     * Tokenizes the content of a reader.
     *
     * @param reader the reader to tokenize
     * @return the tokens, after every registered preprocessor has been applied
     * @throws IOException         if the reader fails
     * @throws UnparsableException if the content of the reader could not be parsed
     */
    public abstract Iterable<String> tokenize(Reader reader) throws IOException, UnparsableException;

    /**
     * Tokenizes a string.
     *
     * @param s the string to tokenize
     * @return the tokens, after every registered preprocessor has been applied
     * @throws IOException         if reading the string fails
     * @throws UnparsableException if the string could not be parsed
     */
    public Iterable<String> tokenize(String s) throws IOException, UnparsableException {
        // try-with-resources closes the reader even if tokenize(Reader) throws
        try (Reader r = new StringReader(s)) {
            return tokenize(r);
        }
    }

    @Override
    public Scope getScope() {
        return Scope.FILE;
    }
}

Both provide methods to tokenize either a String or a Reader, but they differ in their output: a file tokenizer returns an Iterable<String> (one flat sequence of tokens), whereas a line tokenizer returns an Iterable<Iterable<String>> (one sequence of tokens per line).
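The sketch below illustrates the two output shapes. It is self-contained and does not use the package itself: `LineTokenizer` and `FileTokenizer` are simplified stand-ins for the abstract classes above (`getType()` and `UnparsableException` are left out), `Scope` is assumed to be an enum, and the two whitespace tokenizers are hypothetical examples rather than tokenizers shipped with the package.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutputShapeSketch {
    // Simplified stand-ins for the package's types (assumption: Scope is an enum).
    public enum Scope { LINE, FILE }

    public abstract static class LineTokenizer {
        public abstract Iterable<Iterable<String>> tokenize(Reader reader) throws IOException;

        public Iterable<Iterable<String>> tokenize(String s) throws IOException {
            try (Reader r = new StringReader(s)) {
                return tokenize(r);
            }
        }

        public Scope getScope() { return Scope.LINE; }
    }

    public abstract static class FileTokenizer {
        public abstract Iterable<String> tokenize(Reader reader) throws IOException;

        public Iterable<String> tokenize(String s) throws IOException {
            try (Reader r = new StringReader(s)) {
                return tokenize(r);
            }
        }

        public Scope getScope() { return Scope.FILE; }
    }

    /** Splits text on whitespace; a blank line gives an empty list. */
    private static List<String> splitWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Hypothetical line-level tokenizer: one inner list per input line. */
    public static class WhitespaceLineTokenizer extends LineTokenizer {
        @Override
        public Iterable<Iterable<String>> tokenize(Reader reader) throws IOException {
            List<Iterable<String>> lines = new ArrayList<>();
            BufferedReader buffered = new BufferedReader(reader);
            String line;
            while ((line = buffered.readLine()) != null) {
                lines.add(splitWords(line));
            }
            return lines;
        }
    }

    /** Hypothetical file-level tokenizer: one flat list for the whole input. */
    public static class WhitespaceFileTokenizer extends FileTokenizer {
        @Override
        public Iterable<String> tokenize(Reader reader) throws IOException {
            List<String> tokens = new ArrayList<>();
            BufferedReader buffered = new BufferedReader(reader);
            String line;
            while ((line = buffered.readLine()) != null) {
                tokens.addAll(splitWords(line));
            }
            return tokens;
        }
    }

    /** Runs both tokenizers on the same source and renders the two results. */
    public static String describe(String source) {
        try {
            return "line=" + new WhitespaceLineTokenizer().tokenize(source)
                    + " file=" + new WhitespaceFileTokenizer().tokenize(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `OutputShapeSketch.describe("int x;\nreturn x;")` returns `line=[[int, x;], [return, x;]] file=[int, x;, return, x;]`: the line-level result keeps the line boundaries, while the file-level result is flat.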

How to use the tool

To obtain a tokenizer, use one of the provided factories:

File Level

  • CPPFileTokenizerFactory
  • JavaFileTokenizerFactory

Line Level

  • CPPLineTokenizerFactory
  • JavaLineTokenizer

Once you have chosen a factory, call the method for the tokenizer you need, for example:

JavaFileTokenizerFactory.lemmeTokenizer();

Third-party tools

  • Java Parser (LGPL)
  • Joern (LGPL)
  • ANTLR4 (BSD)