Commit 5470be3

Add information about the Merlin DAG
Define the important terms of the DAG.
1 parent 980e297 commit 5470be3

3 files changed

Lines changed: 79 additions & 0 deletions

docs/source/about-dag.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# About the Merlin Directed Acyclic Graph

Merlin uses a directed acyclic graph (DAG) to represent operations on data, such as filtering or bucketing, and operations in a recommender system, such as creating an ensemble or filtering candidate items during inference.

Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin.

## DAG Terminology

node
: A node in the DAG is a group of columns and at least one _operator_.
  The columns are specified with a _column selector_.
  A node has an _input schema_ and an _output schema_.
  Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.

column selector
: A column selector specifies the columns to select from a dataset using column names or _tags_.

operator
: An operator performs a transformation on data and returns a new _node_.
  The data is identified by the _column selector_.
  Some simple operators like `+` and `-` add or remove columns.
  More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.

schema
: A Merlin schema is metadata that describes the columns in a dataset.
  Each column has its own schema that identifies the column name and can specify _tags_ and properties.

tag
: A Merlin tag categorizes information about a column.
  Adding a tag to a column enables you to select columns for operations by tag rather than name.

  For example, you can add the `USER` and `ITEM` tags to columns.
  Modeling and inference operations can use that information to act accordingly on the dataset.
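
To make the relationship between schemas and tags concrete, the following sketch models a schema as a list of per-column records and selects columns by tag. It is plain Python with no Merlin dependency: `Tag`, `ColumnSchema`, and `select_by_tag` are hypothetical names invented for illustration and should not be read as Merlin's API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tag(Enum):
    # Hypothetical stand-in for a tag enumeration; not Merlin's Tags class.
    USER = "user"
    ITEM = "item"
    CATEGORICAL = "categorical"


@dataclass(frozen=True)
class ColumnSchema:
    # Per-column metadata: a column name plus an optional set of tags.
    name: str
    tags: frozenset = field(default_factory=frozenset)


def select_by_tag(schema, tag):
    """Return the names of the columns that carry the given tag."""
    return [col.name for col in schema if tag in col.tags]


# A toy schema with three columns (the column names are made up).
schema = [
    ColumnSchema("user_id", frozenset({Tag.USER, Tag.CATEGORICAL})),
    ColumnSchema("item_brand", frozenset({Tag.ITEM, Tag.CATEGORICAL})),
    ColumnSchema("price", frozenset({Tag.ITEM})),
]

# Select columns by tag instead of listing their names.
item_columns = select_by_tag(schema, Tag.ITEM)
user_columns = select_by_tag(schema, Tag.USER)
```

Selecting by tag keeps downstream code independent of the column names in a particular dataset, which is the benefit the tag definition describes.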

## Syntax and Sample Code

The following code block shows the typical syntax for building a workflow that operates on DAG components.

```{rubric} Syntax
```

```python
result = [column_selector, ...] >> op1 >> op2 >> ...
```

The brackets group one or more column selectors, such as `column_selector`, that identify columns in the input data.

`op1` and `op2` represent operators.
When an operator performs its operation on the input data, the operator returns a node.

The `result` object is the graph.
It contains the sequence of operations to perform.
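
One way that a `[columns] >> op1 >> op2` expression can produce a graph object in Python is through operator overloading. The sketch below shows that general technique under stated assumptions. It is not Merlin's implementation, and `Node` and `Op` are hypothetical classes written only for this example.

```python
class Node:
    """Hypothetical graph node: selected columns plus the operators queued on them."""

    def __init__(self, columns, ops=()):
        self.columns = list(columns)
        self.ops = tuple(ops)

    def __rshift__(self, op):
        # node >> op: return a new node with the operator appended.
        return Node(self.columns, self.ops + (op,))


class Op:
    """Hypothetical operator that only records its name."""

    def __init__(self, name):
        self.name = name

    def __rrshift__(self, columns):
        # ["a", "b"] >> op: a plain list does not implement >>, so Python
        # falls back to the right operand's __rrshift__ method.
        return Node(columns) >> self


# Building the expression only records the steps; no data is processed here.
result = ["a", "b"] >> Op("op1") >> Op("op2")
```

In this sketch, `result.columns` holds the selected column names and `result.ops` holds the operators in the order they were chained.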

```{rubric} Sample Code
```

```python
# Both operators are provided by NVTabular.
from nvtabular.ops import Categorify, TagAsItemFeatures

item_features = (
    ["item_category", "item_shop", "item_brand"] >> Categorify(dtype="int32") >> TagAsItemFeatures()
)
```

In the sample code, the column selector is created by specifying the item-related column names.

The {py:class}`~nvtabular.ops.Categorify` operator transforms the categorical features into unique integer values, adds the {py:attr}`~merlin.schema.Tags.CATEGORICAL` tag, and returns a node.

The {py:class}`~nvtabular.ops.TagAsItemFeatures` operator applies the {py:attr}`~merlin.schema.Tags.ITEM` tag and returns a node.

When the `item_features` variable is included in a workflow and applied to input data, the workflow traverses the nodes in order and applies the data transformations and tagging.
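
The idea of defining the steps first and applying them to data later, in order, can be illustrated with a toy sketch. It uses plain Python lists instead of a dataframe, and `categorify` and `apply_graph` are hypothetical helpers. In particular, the toy encoding is an assumption and does not reproduce how `Categorify` assigns integer values.

```python
def categorify(column):
    """Toy encoding: map each distinct value to an integer in order of first appearance."""
    codes = {}
    return [codes.setdefault(value, len(codes)) for value in column]


def apply_graph(columns, steps, data):
    """Run each step in order on the selected columns and leave other columns untouched."""
    out = dict(data)  # shallow copy so the input mapping is not modified
    for step in steps:
        for name in columns:
            out[name] = step(out[name])
    return out


# Made-up input data: column name -> list of values.
data = {"item_brand": ["acme", "zeta", "acme"], "price": [3, 5, 3]}

# Defining the graph does not touch the data.
graph = (["item_brand"], [categorify])

# The steps run only when the graph is applied.
transformed = apply_graph(*graph, data)
```

Only the selected column changes. The `price` column and the original `data` mapping are left as they were.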

docs/source/conf.py

Lines changed: 8 additions & 0 deletions
@@ -118,6 +118,14 @@
 
 autosummary_generate = True
 
+intersphinx_mapping = {
+    "python": ("https://docs.python.org/3", None),
+    "merlin-core": ("https://nvidia-merlin.github.io/core/main", None),
+    "merlin-systems": ("https://nvidia-merlin.github.io/systems/main", None),
+    "merlin-models": ("https://nvidia-merlin.github.io/models/main", None),
+    "NVTabular": ("https://nvidia-merlin.github.io/NVTabular/main", None),
+}
+
 copydirs_additional_dirs = ["../../examples/", "../../README.md"]
 
 copydirs_file_rename = {

docs/source/toc.yaml

Lines changed: 2 additions & 0 deletions
@@ -46,5 +46,7 @@ subtrees:
   title: Deploy the HugeCTR Model with Triton
 - file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
   title: Deploy the TensorFlow Model with Triton
+- file: about-dag.md
+  title: Merlin DAG
 - file: containers.rst
 - file: support_matrix/index.rst
