API
This document describes the basic classes and interfaces of the PT Pattern Matching Engine and shows how they interact.
- Workflow
- Source code repositories
- Pattern repositories
- Parsers and converters
- Basic classes and interfaces
- Logging
- UST nodes
- Errors and exceptions
- ANTLR utility classes
- JSON serialization
- Other utility classes
`Workflow` is the core class that combines the stages of reading and
parsing files, converting trees, and matching the UST against patterns.
The `Workflow` class also times these stages and supports parallel
processing. It is configured with the following parameters:
- `ISourceCodeRepository SourceCodeRepository` - a source code repository
- `LanguageFlags languages` - a list of languages to be parsed. Based on these languages, `Workflow` determines which `ILanguageParser` parsers and `IParseTreeToUstConverter` converters should be created.
- `IPatternsRepository PatternsRepository` - a pattern repository
- `Stage` - the final operational stage:
  - `Read` - reading a file
  - `Parse` - parsing
  - `Convert` - converting to UST
  - `Preprocess` - tree preprocessing (calculating arithmetic expressions, concatenating strings, simplifying patterns)
  - `Match` - pattern matching
  - `Patterns` - a mode for checking the parsing of patterns

  If a certain stage is specified, all stages preceding it are performed as well. This does not apply to the `Patterns` stage, which is responsible only for checking patterns.
- `ILogger Logger` - a logger that is passed to other internal objects implementing `ILoggable`.
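The stage rule described above can be sketched as follows. This is a minimal illustration, not the engine's code: the stage names come from the list, while the `Stage` enum layout and the `StagePlan.StagesToRun` helper are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Requesting Convert implies Read and Parse as well.
foreach (Stage stage in StagePlan.StagesToRun(Stage.Convert))
    Console.WriteLine(stage); // prints Read, Parse, Convert (one per line)

// Stage names are taken from the document; their order in the enum is an assumption.
enum Stage { Read, Parse, Convert, Preprocess, Match, Patterns }

// Hypothetical helper, not part of PT.PM.
static class StagePlan
{
    // Returns every stage that runs when `final` is requested.
    public static IReadOnlyList<Stage> StagesToRun(Stage final)
    {
        // Patterns only checks pattern parsing, so no other stage precedes it.
        if (final == Stage.Patterns)
            return new[] { Stage.Patterns };

        var stages = new List<Stage>();
        for (Stage s = Stage.Read; s <= final; s++)
            stages.Add(s);
        return stages;
    }
}
```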
Additional parameters that apply to ANTLR parsers only are also available:
- `MaxStackSize` - the maximum stack size in bytes. If the value is not zero, parsing is started in a separate `Thread` instance created with the specified stack size. This option is useful if a stack overflow exception occurs during ANTLR parsing.
- `MemoryConsumptionMb` - approximate memory consumption in megabytes after which the ANTLR cache is cleared (`ClearDFA` for the lexer and parser).
See also the paragraph "Memory consumption" in the article "Theory and practice of source code parsing with ANTLR and Roslyn."
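The idea behind `MaxStackSize` can be sketched with the standard `Thread(ThreadStart, int)` constructor. The `StackRunner` helper and the `Recurse` function are hypothetical stand-ins (the latter for a deeply recursive parser); how PT.PM wires this internally is not shown here.

```csharp
using System;
using System.Threading;

// 16 MB stack instead of the platform default.
int depth = StackRunner.Run(() => Recurse(20_000), maxStackSize: 16 * 1024 * 1024);
Console.WriteLine(depth); // prints 20000

// Deliberately recursive function standing in for a recursive parser.
static int Recurse(int n) => n == 0 ? 0 : 1 + Recurse(n - 1);

// Hypothetical helper, not part of PT.PM.
static class StackRunner
{
    public static T Run<T>(Func<T> work, int maxStackSize)
    {
        // Zero means "use the calling thread", mirroring the option's description.
        if (maxStackSize == 0)
            return work();

        T result = default;
        Exception error = null;
        var thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; }
        }, maxStackSize);
        thread.Start();
        thread.Join();

        if (error != null)
            throw new InvalidOperationException("Work item failed", error);
        return result;
    }
}
```

A real stack overflow terminates a .NET process and cannot be caught, which is why enlarging the stack up front is the practical remedy.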
The result of processing is obtained as follows:
`WorkflowResult result = workflow.Process()`. The resulting class has
the following properties:
- List of loaded source code files: `IReadOnlyList<SourceCodeFile> SourceCodeFiles`
- List of parse trees: `IReadOnlyList<ParseTree> ParseTrees`
- List of USTs: `IReadOnlyList<Ust> Usts`
- List of matching results: `IReadOnlyList<MatchingResult> MatchingResults`
- Timing for the stages:
  - `TotalReadTimeSpan` - total reading time
  - `TotalParseTimeSpan` - total parsing time
  - `TotalConvertTimeSpan` - total converting time
  - `TotalPreprocessTimeSpan` - total preprocessing time
  - `TotalMatchTimeSpan` - total matching time
  - `TotalPatternsTimeSpan` - total pattern processing time
  - `TotalLexerTicks` - total lexer processing time (ANTLR parsers only)
  - `TotalParserTicks` - total parser processing time (ANTLR parsers only)
- `TotalProcessedFileCount` - total number of files processed
- `TotalProcessedCharsCount` - total number of characters processed
- `TotalProcessedLinesCount` - total number of lines processed
- `ErrorCount` - total number of errors

`Workflow` also generates real-time logger messages.
The figure below illustrates the workflow:
[Figure: Workflow]
Source code repositories implement the `ISourceCodeRepository` interface;
they are located in the PT.PM.Common project.
- `FileCodeRepository` loads the source code from the `fileName` file.
- `FilesAggregatorCodeRepository` loads all files from the specified `filePath` folder.
- `MemoryCodeRepository` loads the source code from an in-memory string.
- `ZipAtUrlCachedCodeRepository` downloads a GitHub repository and unpacks it. It is used to test parsers and converters.
Pattern repositories implement the `IPatternsRepository` interface; they
are located in the PT.PM.Patterns project.
- `FilePatternsRepository` loads `PatternDto` objects from the `filePath` file.
- `MemoryPatternsRepository` loads `PatternDto` objects from memory.
- `JsonPatternsRepository` loads patterns from a JSON document.
- `DslPatternRepository` loads a DSL pattern from the `patternData` string for the `languageFlags` languages.
- `DefaultPatternRepository` contains default hardcoded patterns.
- `ILanguageParser` describes a parser that turns `SourceCode` into a `ParseTree`. Implementations:
  - `CSharpRoslynParser` - a Roslyn parser for C#
  - `JavaAntlrParser` - an ANTLR parser for Java
  - `PhpAntlrParser` - an ANTLR parser for PHP
  - `TSqlAntlrParser` - an ANTLR parser for T-SQL
  - `PlSqlAntlrParser` - an ANTLR parser for PL/SQL
  - `JavaScriptAntlrParser` - an ANTLR parser for JavaScript
- `IParseTreeToUstConverter` describes a converter from a `ParseTree` to a `UST`. Implementations:
  - `CSharpRoslynParseTreeConverter` - a converter for C#
  - `JavaAntlrParseTreeConverter` - a converter for Java
  - `PhpAntlrParseTreeConverter` - a converter for PHP
  - `TSqlAntlrConverter` - a converter for T-SQL
  - `PlSqlAntlrConverter` - a converter for PL/SQL
  - `JavaScriptAntlrParseTreeConverter` - a converter for JavaScript
- `IAstPatternMatcher<TPatternsDataStructure>` describes a matcher of a `UST` against `Pattern` patterns. It returns a collection of `MatchingResult` objects. So far there is a single implementation: `BruteForcePatternMatcher`, located in the PT.PM.Matching project.
- `SourceCodeFile` - a source code object containing:
  - `Name` - name
  - `RelativePath` - a path to the file relative to the root of the repository
  - `Code` - source code
- `ParseTree` - a parse tree obtained by parsing the `SourceCodeFile` source code object.
- `UST` (Universal Syntax Tree) - a universal syntax tree produced by converters implementing `IParseTreeToUstConverter`.
- `Pattern` - an object that stores a pattern in a structured format, ready for use in the pattern matching engine:
  - `Key` - a unique `string` pattern identifier
  - `LanguageFlags` - a list of languages to which this universal pattern applies. For example, a pattern that matches a `Random` class instance applies to C# and Java.
  - `PatternNode Data` - an AST fragment for the pattern, which is matched against the `UST`.
- `PatternDto` - a serialized `Pattern` intended for storage and transmission:
  - `Name` - name (optional)
  - `Key` - a unique `string` pattern identifier
  - `LanguageFlags` - a list of languages to which this universal pattern applies. For example, a pattern that matches a `Random` class instance applies to C# and Java.
  - `DataFormat` - a pattern format; `JSON` and `DSL` are available.
  - `Value` - the text of a pattern in `JSON` or `DSL` format (depends on `DataFormat`).
  - `CweId` - a Common Weakness Enumeration identifier
  - `Description` - a pattern description
  - `DebugInfo` - information that helps with debugging
- An example of a pattern serialized in JSON:
  { "Key": "96", "Name": "Hardcoded Password", "Languages": "CSharp, Java, PHP, PLSQL, TSQL", "DataFormat": "Dsl", "Value": "<[(?i)password]> = <[ \"\\w*\" || null ]>" }
- `MatchingResult` - a result of matching a `Pattern` against a `UST`. Contains the following properties:
  - `Pattern` - a reference to the pattern
  - `List<UstNode> Nodes` - a list of matched nodes. A list is used because a multi-line pattern can match several nodes.
  - `FileNode` - a reference to the source code file
- `MatchingResultDto` - a form of `MatchingResult` that can be used for serialization:
  - `MatchedCode` - a fragment of the matched code
  - `BeginLine`, `BeginColumn`, `EndLine`, `EndColumn` - coordinates in the line-column format
  - `PatternKey` - a pattern ID
  - `SourceFile` - a source code file
- An example of a matching result serialized in JSON:
  { "MatchedCode": "rand()", "BeginLine": 60, "BeginColumn": 30, "EndLine": 60, "EndColumn": 36, "PatternKey": "27", "SourceFile": "Patterns.php" }
- `TextSpan` - a linear text location that contains `Start` and `Length`. Used in all conversions except the output.
- `LineColumnTextSpan` - a text location that contains `BeginLine`, `BeginColumn`, `EndLine`, and `EndColumn`. Used to output the found pattern in a readable format.
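As an illustration of the serialized pattern shown above, the sketch below reads that JSON and turns its comma-separated `Languages` string into a flags value. It is only a sketch: PT.PM itself uses JSON.NET, whereas this example swaps in `System.Text.Json` to stay self-contained, and the `PatternDtoSketch` class and the `LanguageFlags` member names and values are assumptions.

```csharp
using System;
using System.Text.Json;

string json = @"{
  ""Key"": ""96"",
  ""Name"": ""Hardcoded Password"",
  ""Languages"": ""CSharp, Java, PHP, PLSQL, TSQL"",
  ""DataFormat"": ""Dsl"",
  ""Value"": ""<[(?i)password]> = <[ \""\\w*\"" || null ]>""
}";

PatternDtoSketch dto = JsonSerializer.Deserialize<PatternDtoSketch>(json);

// Enum.Parse accepts a comma-separated list for [Flags] enums.
LanguageFlags flags = Enum.Parse<LanguageFlags>(dto.Languages, ignoreCase: true);

Console.WriteLine(dto.Value);                               // <[(?i)password]> = <[ "\w*" || null ]>
Console.WriteLine(flags.HasFlag(LanguageFlags.Php));        // True
Console.WriteLine(flags.HasFlag(LanguageFlags.JavaScript)); // False

// Hypothetical flags enum; the real member names and values may differ.
[Flags]
enum LanguageFlags { None = 0, CSharp = 1, Java = 2, Php = 4, PlSql = 8, TSql = 16, JavaScript = 32 }

// Hypothetical DTO that mirrors only the keys present in the JSON example.
class PatternDtoSketch
{
    public string Key { get; set; }
    public string Name { get; set; }
    public string Languages { get; set; }
    public string DataFormat { get; set; }
    public string Value { get; set; }
}
```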
- `ILogger` - an interface abstracting the logging methods:
  - `LogError` - logs errors given as text or as an `Exception`. There are several types of exceptions: `ParsingException`, `ConversionException`, `MatchingException`, and a general `Exception`.
  - `LogInfo` - logs informational messages. Examples:
    - Command line arguments: -f Patterns.php --stage convert
    - File Patterns.php has been parsed (Elapsed: 00:00:00.6350338)
  - `LogDebug` - logs debug messages. They are not used in the Release configuration. Examples:
    - Arithmetic expression 60 * 60 has been folded to 3600
    - Strings "a" + "b" has been concatenated to "ab"
- The `ConsoleLogger` implementation is located in the PT.PM.Console project. This logger writes its output to the console and to a file. It uses NLog.
- `ILoggable` - an interface with the `ILogger Logger` property. It is implemented in all classes where logging is used. By default, all classes use the `DummyLogger` implementation, which contains empty methods. This removes the need to check that `Logger` is not `null` before calling `LogError`, `LogInfo`, etc.
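The `DummyLogger` default described above is an instance of the null object pattern. A minimal sketch follows; the interface and class names come from the document, but the method signatures, `ConsoleLoggerSketch`, and `ParserSketch` are assumptions made for illustration.

```csharp
using System;

var parser = new ParserSketch();
parser.Parse("a.php");                     // default DummyLogger: prints nothing, throws nothing
parser.Logger = new ConsoleLoggerSketch();
parser.Parse("b.php");                     // prints: File b.php has been parsed

// Assumed signatures; the real interface also accepts exceptions.
interface ILogger
{
    void LogError(string message);
    void LogInfo(string message);
    void LogDebug(string message);
}

interface ILoggable
{
    ILogger Logger { get; set; }
}

// Null object: every method is intentionally empty.
sealed class DummyLogger : ILogger
{
    public static readonly ILogger Instance = new DummyLogger();
    public void LogError(string message) { }
    public void LogInfo(string message) { }
    public void LogDebug(string message) { }
}

// Hypothetical logger that writes to the console only.
sealed class ConsoleLoggerSketch : ILogger
{
    public void LogError(string message) => Console.WriteLine("ERROR " + message);
    public void LogInfo(string message) => Console.WriteLine(message);
    public void LogDebug(string message) => Console.WriteLine("DEBUG " + message);
}

// Hypothetical loggable class; Logger is never null, so no null check is needed.
class ParserSketch : ILoggable
{
    public ILogger Logger { get; set; } = DummyLogger.Instance;

    public void Parse(string fileName) =>
        Logger.LogInfo($"File {fileName} has been parsed");
}
```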
- Children - the direct descendants of the current node
- Descendants - all descendants of the current node (children, grandchildren, etc.)
`UstNode` is the base class for all nodes. It has the following
properties:
- `NodeType` - the node type. It is used for matching nodes.
- `Parent` - the parent node; `null` for the root node.
- `Children` - a list of child nodes. It is used for traversing the `UST`.
- `TextSpan` - a location in the text in linear coordinates.

And methods:
- `CompareTo` - a basic implementation for matching two nodes against each other. More details are given in the article "Tree structures processing and unified AST" in the section "Algorithm for matching AST and patterns."
- `DoesAnyDescendantMatchPredicate` shows whether any descendant matches the passed predicate. For example, it can be used for `PatternExpressionInsideExpression`.
- `DoesAllDescendantsMatchPredicate` shows whether all descendants match the passed predicate.
- `ApplyActionToDescendants` applies an action to all descendants. It is used to add a text location offset when parsing island languages.
- `GetAllDescendants` returns all descendants of the given node.
- `ToString` returns a string representation of the node that resembles C# syntax.
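The descendant helpers listed above can be sketched on a simplified tree node. The `Node` class is a hypothetical stand-in for `UstNode`: it uses a string for `NodeType` and assumes that "descendants" excludes the node itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var root = new Node("InvocationExpression",
    new Node("IdToken"),
    new Node("ArgsNode", new Node("StringLiteral"), new Node("IntLiteral")));

Console.WriteLine(root.GetAllDescendants().Count());                                      // 4
Console.WriteLine(root.DoesAnyDescendantMatchPredicate(n => n.NodeType == "IntLiteral")); // True
Console.WriteLine(root.DoesAllDescendantsMatchPredicate(n => n.Parent != null));          // True

// Simplified, hypothetical stand-in for UstNode.
class Node
{
    public string NodeType { get; }
    public Node Parent { get; private set; }   // null for the root node
    public IReadOnlyList<Node> Children { get; }

    public Node(string nodeType, params Node[] children)
    {
        NodeType = nodeType;
        Children = children;
        foreach (Node child in children)
            child.Parent = this;
    }

    // Depth-first enumeration of children, grandchildren, and so on.
    public IEnumerable<Node> GetAllDescendants()
    {
        foreach (Node child in Children)
        {
            yield return child;
            foreach (Node descendant in child.GetAllDescendants())
                yield return descendant;
        }
    }

    public bool DoesAnyDescendantMatchPredicate(Func<Node, bool> predicate) =>
        GetAllDescendants().Any(predicate);

    public bool DoesAllDescendantsMatchPredicate(Func<Node, bool> predicate) =>
        GetAllDescendants().All(predicate);
}
```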
- `EntityDeclaration` - entity declaration:
  - `TypeDeclaration` - class or interface declaration
  - `ConstructorDeclaration` - constructor declaration
  - `FieldDeclaration` - field declaration
  - `MethodDeclaration` - method declaration
  - `ParameterDeclaration` - parameter declaration
  - `StatementDeclaration` - statement declaration
- `NamespaceDeclaration` - namespace declaration
- `Statement` - an instruction or statement (usually ends with a semicolon)
  - `BlockStatement` - example: `{ Statement* }`
  - `BreakStatement` - example: `break;`
  - `ContinueStatement` - example: `continue;`
  - `DoWhileStatement` - example: `do Statement while (Expression)`
  - `EmptyStatement` - example: `;`
  - `ExpressionStatement` - example: `Expression;`
  - `ForeachStatement` - example: `foreach (var e in Expression) Statement`
  - `ForStatement` - example: `for (int i = 0; i < n; i++) Statement`
  - `GotoStatement` - example: `goto Id;`
  - `IfElseStatement` - example: `if (Expression) Statement else Statement`
  - `ReturnStatement` - example: `return Expression;`
  - `ThrowStatement` - example: `throw Expression;`
  - `TypeDeclarationStatement` - a type declaration inside a `Statement`
  - `WhileStatement` - example: `while (Expression) Statement`
  - `WithStatement` - example: `using (Expression) Statement`
  - `WrapperStatement` - an artificial `Statement` that contains a node of the `UstNode` type.
- `Expression` - an expression that returns a result (for example, a function call, an arithmetic operator, etc.)
  - `AnonymousMethodExpression` - an anonymous function
  - `AssignmentExpression` - an assignment expression: `a = b`
  - `ArrayCreationExpression` - creation of an array: `a = new[5, 5]`
  - `BinaryOperatorExpression` - a binary expression: `a * b`, `a + b`
  - `CastExpression` - conversion to a type: `(type)expr`
  - `ConditionalExpression` - a conditional expression: `a ? b : c`
  - `IndexerExpression` - an indexer: `a[b]`
  - `InvocationExpression` - a function call: `Target(Args)`
  - `MemberReferenceExpression` - a reference to a class member: `A.B.C`
  - `MultichildExpression` - an artificial `Expression` containing several `Expression` nodes
  - `ObjectCreateExpression` - creation of an object: `new A()`
  - `UnaryOperatorExpression` - a unary expression: `++a`, `!a`
  - `VariableDeclarationExpression` - a variable declaration: `var a = b`
  - `WrapperExpression` - an artificial `Expression` containing a node of the `UstNode` type
- `Token` - a terminal (leaf) node:
  - `IdToken` - an identifier
  - `Literal` - a primitive constant value:
    - `BinaryOperatorLiteral` - a binary operator literal
    - `BooleanLiteral` - `true` or `false`
    - `CommentLiteral` - a comment: `// password="e@jf7!ke"`
    - `FloatLiteral` - a floating-point number: `42.42`
    - `IntLiteral` - an integer: `42`
    - `ModifierLiteral` - modifiers of types and type members: `static class A`
    - `ParameterModifierLiteral` - a parameter modifier: `int f(const int a)`
    - `NullLiteral` - `null`
    - `StringLiteral` - a string: `"hello world"`
    - `TypeTypeLiteral`
    - `UnaryOperatorLiteral` - a unary operator literal
  - `ThisReferenceToken` - `this`
- `CollectionNode<TAstNode> : UstNode` - a node collection
  - `EntitiesNode` - a collection of entities
  - `ArgsNode` - a collection of arguments: `invocation(a, b, c)`
- `DslNode` - a node for wrapping the DSL information (target languages, variable definitions)
- `PatternBooleanLiteral` - a boolean literal: `bool`, `true`, or `false`
- `PatternComment` finds comments by a regular expression; equivalent to `Comment: regex` in DSL.
- `PatternExpression` - a wildcard expression (any expression), equivalent to `#` in DSL. Negation for expressions can also be used.
- `PatternExpressionInsideExpression` wraps any expression and indicates that it can occur at any depth of the tree. Equivalent to `<{ expression }>` in DSL.
- `PatternExpressions` - a node for matching multiple expressions subject to constraints. For example, `HashBytes(^(md2|md4|md5)$, ...)`
- `PatternIdToken` - matches identifiers by a regular expression. For example, `<[\w+]>` matches any identifier.
- `PatternIntLiteral` - matches integers by range, for example, `<[..-20 || -10 || -5..5 || 010 || 0x10 || 30..]>`
- `PatternMultipleExpressions` - matches an arbitrary number of any expressions. Equivalent to `...` in DSL. It can be used for function arguments.
- `PatternStatement` wraps an instruction.
- `PatternStatements` wraps multiple instructions.
- `PatternStringLiteral` - matches strings by a regular expression. For example, `<[""]>` matches any string.
- `PatternTryCatchStatement` - an empty `try catch { }` construct
- `PatternVarDef` - definition of a variable that can take multiple values. It can be either named, `<[@pwd:password]>`, or unnamed, `<["\w*" || null]>`.
- `PatternVarRef` - a reference to a previously defined variable. For example, `Response.Cookies.Add(<[@cookie]>);`
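To make the range notation of `PatternIntLiteral` more concrete, the sketch below checks an integer against a set of ranges that correspond to `..-20 || -10 || -5..5 || 30..`. It assumes that both bounds are inclusive and that an omitted bound means "unbounded"; the `IntRange` and `IntRangePattern` classes are hypothetical, and parsing of the DSL text itself is left out.

```csharp
using System;
using System.Linq;

// Pre-parsed form of: ..-20 || -10 || -5..5 || 30..
var pattern = new IntRangePattern(
    new IntRange(long.MinValue, -20),
    new IntRange(-10, -10),
    new IntRange(-5, 5),
    new IntRange(30, long.MaxValue));

Console.WriteLine(pattern.Matches(-10)); // True
Console.WriteLine(pattern.Matches(7));   // False
Console.WriteLine(pattern.Matches(100)); // True

// Hypothetical inclusive range; not the engine's own type.
sealed class IntRange
{
    public long Min { get; }
    public long Max { get; }

    public IntRange(long min, long max) { Min = min; Max = max; }

    public bool Contains(long value) => Min <= value && value <= Max;
}

// A value matches if it falls into at least one alternative (the || operator).
sealed class IntRangePattern
{
    private readonly IntRange[] ranges;

    public IntRangePattern(params IntRange[] ranges) => this.ranges = ranges;

    public bool Matches(long value) => ranges.Any(r => r.Contains(value));
}
```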
The `IUstVisitor` and `IUstListener` interfaces for traversing the UST are
located in the PT.PM.UstPreprocessing project. The `UstVisitor`
class implements the `IUstVisitor` interface; by default, it performs a
deep copy of the tree. So far, dynamic dispatch is used.

`UstPreprocessor` is used to simplify the UST. It overrides some
`UstVisitor` methods. In effect, the preprocessing stage transforms the
UST into a simpler UST. See also Simplifying an AST.
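The kind of simplification performed at the preprocessing stage (for example, folding `60 * 60` to `3600` or concatenating `"a" + "b"` to `"ab"`) can be sketched on a tiny expression tree. The record types and the `Folder` class are hypothetical and far simpler than the real UST nodes and `UstPreprocessor`; the sketch requires C# 9 or later.

```csharp
using System;

Expr time = Folder.Fold(new Binary(new IntLit(60), "*", new IntLit(60)));
Expr text = Folder.Fold(new Binary(new StrLit("a"), "+", new StrLit("b")));
Expr kept = Folder.Fold(new Binary(new Id("x"), "+", new IntLit(1)));

Console.WriteLine(((IntLit)time).Value); // 3600
Console.WriteLine(((StrLit)text).Value); // ab
Console.WriteLine(kept is Binary);       // True (nothing to fold)

// Hypothetical, minimal expression nodes.
abstract record Expr;
record IntLit(long Value) : Expr;
record StrLit(string Value) : Expr;
record Id(string Name) : Expr;
record Binary(Expr Left, string Op, Expr Right) : Expr;

static class Folder
{
    // Rebuilds the tree bottom-up, replacing constant subtrees with literals.
    public static Expr Fold(Expr e)
    {
        if (e is not Binary b)
            return e;

        Expr left = Fold(b.Left);
        Expr right = Fold(b.Right);

        return (left, b.Op, right) switch
        {
            (IntLit x, "+", IntLit y) => (Expr)new IntLit(x.Value + y.Value),
            (IntLit x, "-", IntLit y) => new IntLit(x.Value - y.Value),
            (IntLit x, "*", IntLit y) => new IntLit(x.Value * y.Value),
            (StrLit x, "+", StrLit y) => new StrLit(x.Value + y.Value),
            _ => new Binary(left, b.Op, right),
        };
    }
}
```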
- `ParsingException` occurs when source code or a pattern is parsed and contains lexical or syntax errors. Examples:
  - `Error: no viable alternative at input '(?' at 1:1`
  - `Error: token recognition error at: '>' at 1:18`
- `ConversionException` occurs when the parse tree is converted to a universal AST (UST). It is also used when converting patterns. For example, a `NullReferenceException` thrown in some visitor method is wrapped in a `ConversionException`.
- `MatchingException` occurs during the execution of the algorithm for matching patterns against the UST nodes.
- `ShouldNotBeVisitedException` is used to state explicitly that a visitor method must not be called.
- `AntlrCaseInsensitiveInputStream` - a case-insensitive input stream, which is used, for example, for PHP, PL/SQL, and T-SQL.
- `AntlrDefaultVisitor` - an implementation of the default visitor for ANTLR parse trees.
  - `Visit` calls `tree.Accept(this)` for a particular node. Contains an exception handler.
  - `VisitChildren`:
    - If there is a single child, `Visit` is called for it.
    - Otherwise, it returns a `MultichildExpression`, which is a descendant of `Expression`.
  - `VisitTerminal` tries to parse the value with different regular expressions and determines the type on that basis (`String`, `Float`, `Int`, etc.)
- `AntlrHelper` - converts ANTLR text locations to unified ones and outputs the parse tree in a convenient text representation.
- `AntlrMemoryErrorListener` - logs ANTLR lexer and parser errors
- `AntlrParser` - a base class for all ANTLR parsers. Implements the following:
  - A virtual text preprocessing method (`PreprocessText`), which by default normalizes line breaks (replaces a single `\r` with `\n`).
  - Code parsing, which first uses the fast `SLL` algorithm and, in case of failure, the slower full `LL` algorithm.
  - Tracking the ANTLR cache and clearing it under certain conditions (if the memory consumption of the process exceeds a certain threshold).
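The default line break normalization attributed to `PreprocessText` above can be sketched with a regular expression. This is an assumption-laden illustration: the `LineBreaks` class is hypothetical, and it interprets "single `\r`" as a carriage return that is not followed by `\n`, leaving `\r\n` pairs untouched.

```csharp
using System;
using System.Text.RegularExpressions;

string normalized = LineBreaks.Normalize("a\rb\r\nc\nd");
Console.WriteLine(normalized == "a\nb\r\nc\nd"); // True

// Hypothetical helper, not the engine's PreprocessText implementation.
static class LineBreaks
{
    // Matches a carriage return that is not immediately followed by a line feed.
    private static readonly Regex LoneCarriageReturn =
        new Regex(@"\r(?!\n)", RegexOptions.Compiled);

    public static string Normalize(string text) =>
        LoneCarriageReturn.Replace(text, "\n");
}
```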
PT.PM actively uses the JSON.NET library. The following classes are used to interact with it:
- `USTJsonConverter` - a deserializer of UST trees. Creates children of the tree dynamically using `Activator.CreateInstance`, depending on the value of `NodeType`.
- `JsonUstNodeSerializer` allows enabling or disabling `TextSpan` text locations during serialization of trees, and using indents (which are not used by default).
- `PatternLanguageFlagsSafeConverter` deserializes `PatternDto` without throwing an exception in case of incorrect or unsupported `LanguageFlags` languages.
- `DummyLogger` - a dummy logger. It is used by default in all objects implementing `ILoggable`, so that `Logger.LogInfo` can be written instead of `Logger?.LogInfo` and a `NullReferenceException` is avoided.
- `LanguageInfo` - information about a language:
  - `Language Language` - the enumeration value for the given language (for example, CSharp).
  - `string Title` - the display name of the language (for example, C#)
  - `string[] Extensions` - file extensions of the language (for example, .cs)
  - `bool CaseInsensitive` shows whether the language is case-insensitive
  - `LanguageFlags DependentLanguages` - languages whose source code may be found within the source code of the given language (island languages). For example, `JavaScript` can occur within `PHP`.
  - `bool HaveAntlrParser` shows whether an ANTLR parser is used for this language.
- `LanguageExt` contains a static dictionary of supported `LanguageInfo` objects and methods for working with them.
- `LanguageDetector` - detects the language of a source code fragment. Has a single implementation, `ParserLanguageDetector`, which runs different parsers and selects the language with the fewest parsing errors.
- `TextHelper` - various utilities for working with text:
  - `LinearToLineColumn` - converts linear coordinates into two-dimensional (line-column) ones.
  - `LineColumnToLinear` - converts two-dimensional (line-column) coordinates into linear ones.
  - `GetLinesCount` returns the number of line breaks in a string.
  - `NormDirSeparator` normalizes directory separators. It is used to bring a file path to the correct format, because Windows uses backslashes and Linux uses forward slashes.
- `WorkflowLoggerHelper` - outputs information and statistics after the pattern matching process.
- `UstDotRenderer` - renders a UST tree to DOT, which can be visualized using Graphviz.
- `GraphvizGraph` saves the passed graph in `DOT` format to an image. Supported formats: `Bmp, Png, Jpg, Plain, Svg, Pdf, Gif, Dot`. `PNG` is used by default.
- `TestHelper` - various utilities for unit tests
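The coordinate conversions named under `TextHelper` above can be sketched as follows. The method names come from the document, but the signatures, the 1-based line and column numbering, and the `\n`-only line breaks are assumptions; the `TextCoords` class is hypothetical and performs no bounds checking.

```csharp
using System;

string code = "<?php\n$a = rand();\n";
int start = code.IndexOf("rand()", StringComparison.Ordinal);

var (line, column) = TextCoords.LinearToLineColumn(start, code);
Console.WriteLine($"{line}:{column}");                                         // 2:6
Console.WriteLine(TextCoords.LineColumnToLinear(line, column, code) == start); // True

// Hypothetical helper; assumes 1-based lines and columns and "\n" line breaks.
static class TextCoords
{
    public static (int Line, int Column) LinearToLineColumn(int index, string text)
    {
        int line = 1;
        int lineStart = 0;
        // Count the line breaks located before the requested offset.
        for (int i = 0; i < index; i++)
        {
            if (text[i] == '\n')
            {
                line++;
                lineStart = i + 1;
            }
        }
        return (line, index - lineStart + 1);
    }

    public static int LineColumnToLinear(int line, int column, string text)
    {
        int lineStart = 0;
        // Skip (line - 1) line breaks to find the offset where the line starts.
        for (int current = 1; current < line; current++)
            lineStart = text.IndexOf('\n', lineStart) + 1;
        return lineStart + column - 1;
    }
}
```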
