phd/representation.md at main · rsatrioadi/phd

A knowledge graph representation for software structure

Our knowledge representation is outlined in Enabling Analysis and Reasoning on Software Systems through Knowledge Graph Representation (MSR'23). Below is an excerpt describing the requirements for the representation:

R1: Abstract & Concrete. We want our knowledge representation to support an abstract view that is free from implementation details so as to enable architecture-level analysis. This means the representation does not have to be back-convertible into source code. For example, our representation should not differentiate whether a data-structure is a struct, class, or record. At the same time, we want to be able to support relating high level views to concrete implementation level.
R2: Rich in relational information. Dependency graphs are commonly used in software analyses. These are often based on method/function calls. However, there are many other types of relationships between code fragments other than invocations. We want our knowledge representation to capture, for example, specialization and composition information.
R3: Language-agnostic. We want the knowledge representation not to be bound to a single programming language. We imagine the reasoning should be generic and apply to different implementation languages. However, for this work, we set our scope to include only class-based object-oriented software systems.
R4: Handles partial and evolving knowledge. From the models we find in practice, we know that knowledge about a system may be incomplete, yet still sufficient to enable basic types of analyses or reasoning. We want our representation to be able to capture such partial knowledge. This contrasts with classical modeling methods from the area of Formal Methods that require complete models of a system before any reasoning can be done.
R5: Supports multiple knowledge sources. Software knowledge may come from diverse sources such as source code, architecture documents, or UML models. We want to accumulate this different types of knowledge from different kinds of sources in a single representation.
R6: Enrichable. In addition to supporting different defined data sources, our knowledge representation should also be flexible enough for possible integration with emerging types of knowledge. For example, when classes are classified into role stereotypes [3], it should be possible to use this new information in an analysis pipeline.

Based on the requirements, we describe our LPG schemas for source code structure:

LPG schema for detailed structure knowledge

With the node labels defined as:
- Type: either a Primitive type or a Structure.
- Primitive: a specific kind of Type that is defined by a programming language, e.g., int, boolean, etc.
- Structure: a specific kind of Type, including what languages may call structs, records, enumerations, etc. (The definitions of and Structure’s relations to Type and Primitive are therefore trivial for people familiar with object orientation.)
- Container: anything in the language/platform that contains Structures or allows organization of Structures, e.g., package, directory, module, or namespace. Structure extends Container because some programming languages allow nested classes.
- Variable is self-explanatory; however, we point out that it includes what languages may call field, attribute, property, function/method parameter, etc.
- Script: a code expression or a block of statements; lines of code that produce value or make effects.
- Operation: a specific kind of Script that is identified with a signature (name and parameters) and return type.
- Constructor: a specific kind of Operation whose sole purpose is the creation of a Structure instance. We consider a Constructor’s return type to be the Structure that it creates, even though, e.g., Java considers constructors to have no return type.
And the edge labels as follows:
- specializes is for inheritance, including Java’s extends and implements keywords, mixins, etc. This edge is defined from a Structure to another.
- invokes denotes invocations. A Script that does not have a signature (i.e., not an Operation) cannot be referred to, and therefore invoked by, another Script. An example is field initializers. This edge is defined from a Script to an Operation.
- instantiates is a convenience edge that can be derived from other terms in the graph. M instantiates S when Structure S has a Constructor C that is invoked by Script M. It is defined from a Script to a Structure.
- contains is a containment relationship defined for Container to Container or Container to Structure.
- type is for the type of a Variable, i.e., defined from a Variable to a Type.
- returnType is for the return type of an Operation, i.e., defined from an Operation to a Type.
- hasVariable is defined from a Structure to a Variable; the meaning is self-explanatory.
- hasScript is defined from a Structure to a Script; the meaning is self-explanatory.
- hasParameter is defined from an Operation to a Variable; the meaning is self-explanatory.
LPG schema for abstract structure knowledge

This more compact schema contains only Container, Type, Structure, and Primitive nodes.

When a detailed LPG is available, an abstracted version can be derived by applying the following definitions of edges and then removing Variable and Script nodes (and consequently their connected edges):
- S holds T when Structure S has a variable of Type T.
- S accepts T when Structure S has an operation that has a parameter of Type T.
- S returns T when Structure S has an operation with return Type T.
- S constructs T when Structure S has a script that instantiates Structure T.
- S calls T when Structure S has a script that calls an operation of Structure T. It is what is usually considered in a “dependency graph”.
- S accesses T when Structure S has a script that directly access a field of Structure T.
Finally, an example of an LPG that complies to the abstract schema, rendered in a UML-class-diagram-like styling:

We mainly use a particular JSON format to store our knowledge graphs. From the above example, taking just a subset of the graph involving the two topmost nodes: ChangeConnectionHandle and Figure, the corresponding JSON looks like:
```
{ "elements": {
    "nodes": [
      { "data": {
          "id": "node1",
          "labels": ["Structure"],
          "properties": {
            "simpleName": "ChangeConnectionHandle",
            "kind": "abstract" }}},
      { "data": {
          "id": "node2",
          "labels": ["Structure"],
          "properties": {
            "simpleName": "Figure",
            "kind": "interface" }}}],
    "edges": [
      { "data": {
          "id": "edge1",
          "label": "holds",
          "source": "node1",
          "target": "node2",
          "properties": {} }},
      { "data": {
          "id": "edge2",
          "label": "calls",
          "source": "node1",
          "target": "node2",
          "properties": {} }}]}}
```
About this format and more can be found in Knowledge graph extractors.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A knowledge graph representation for software structure

FilesExpand file tree

representation.md

Latest commit

History

representation.md

File metadata and controls

A knowledge graph representation for software structure