|
| 1 | +# About the Merlin Directed Acyclic Graph |
| 2 | + |
| 3 | +Merlin uses a directed acyclic graph (DAG) to represent operations on data, such as filtering or bucketing, and operations in a recommender system, such as creating an ensemble or filtering candidate items during inference. |
| 4 | + |
| 5 | +Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin. |
| 6 | + |
| 7 | +## DAG Terminology |
| 8 | + |
| 9 | +node |
| 10 | +: A node in the DAG is a group of columns and at least one _operator_. |
| 11 | + The columns are specified with a _column selector_. |
| 12 | + A node has an _input schema_ and an _output schema_. |
| 13 | + Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset. |
| 14 | + |
| 15 | +column selector |
| 16 | +: A column selector specifies the columns to select from a dataset using column names or _tags_. |
| 17 | + |
| 18 | +operator |
| 19 | +: An operator performs a transformation on data and returns a new _node_. |
| 20 | + The data is identified by the _column selector_. |
| 21 | + Some simple operators like `+` and `-` add or remove columns. |
| 22 | + More complex operations are applied by shifting the operators onto the column selector with the `>>` notation. |
| 23 | + |
| 24 | +schema |
| 25 | +: A Merlin schema is metadata that describes the columns in a dataset. |
| 26 | + Each column has its own schema that identifies the column name and can specify _tags_ and properties. |
| 27 | + |
| 28 | +tag |
| 29 | +: A Merlin tag categorizes information about a column. |
| 30 | + Adding a tag to a column enables you to select columns for operations by tag rather than name. |
| 31 | + |
| 32 | + For example, you can add the `USER` and `ITEM` tags to columns. |
| 33 | + Modeling and inference operations can use that information to act accordingly on the dataset. |
| 34 | + |
| 35 | +## Syntax and Sample Code |
| 36 | + |
| 37 | +The following code block shows the typical syntax for building a workflow that operates on DAG components. |
| 38 | + |
| 39 | +```{rubric} Syntax |
| 40 | +``` |
| 41 | + |
| 42 | +```python |
| 43 | +result = [column_selector, ...] >> op1 >> op2 >> ... |
| 44 | +``` |
| 45 | + |
| 46 | +Starting with the `column_selector`, the brackets group one or more column selectors that identify columns in the input data. |
| 47 | + |
| 48 | +Here, `op1` and `op2` represent operators. |
| 49 | +When an operator performs its operation on the input data, the operator returns a node. |
| 50 | + |
| 51 | +The `result` object is the graph. |
| 52 | +It contains the sequence of operations to perform. |
| 53 | + |
| 54 | +```{rubric} Sample Code |
| 55 | +``` |
| 56 | + |
| 57 | +```python |
| 58 | +item_features = ( |
| 59 | + ["item_category", "item_shop", "item_brand"] >> Categorify(dtype="int32") >> TagAsItemFeatures() |
| 60 | +) |
| 61 | +``` |
| 62 | + |
| 63 | +In the sample code, the column selector is created by specifying the item-related column names. |
| 64 | + |
| 65 | +The {py:class}`~nvtabular.ops.Categorify` operator transforms the categorical features into unique integer values, adds the {py:attr}`~merlin.schema.Tags.CATEGORICAL` tag, and returns a node. |
| 66 | + |
| 67 | +The {py:class}`~nvtabular.ops.TagAsItemFeatures` operator applies the {py:attr}`~merlin.schema.Tags.ITEM` tag and returns a node. |
| 68 | + |
| 69 | +When the `item_features` variable is included in a transformation that is applied to input data, the transformation traverses the nodes in order and applies each data transformation and tagging step. |
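
The `>>` chaining described above can be modeled with ordinary Python operator overloading. The following is a minimal, library-free sketch of that idea under stated assumptions, not Merlin's actual implementation: the `Node`, `Operator`, `Upper`, and `Suffix` classes are hypothetical names invented for illustration, and rows are plain dictionaries rather than a Merlin dataset.

```python
class Node:
    """Toy DAG node (hypothetical): selected columns plus the operators applied to them."""

    def __init__(self, columns, ops=()):
        self.columns = list(columns)
        self.ops = list(ops)

    def __rshift__(self, op):
        # `node >> op` returns a new node and leaves the original unchanged.
        return Node(self.columns, self.ops + [op])

    def transform(self, rows):
        # Apply each operator, in order, to the selected columns only.
        for op in self.ops:
            rows = [
                {
                    name: op.apply(value) if name in self.columns else value
                    for name, value in row.items()
                }
                for row in rows
            ]
        return rows


class Operator:
    """Toy operator base class (hypothetical)."""

    def __rrshift__(self, columns):
        # Called for `["col", ...] >> op`, because `list` does not define `>>`.
        return Node(columns) >> self

    def apply(self, value):
        raise NotImplementedError


class Upper(Operator):
    def apply(self, value):
        return value.upper()


class Suffix(Operator):
    def __init__(self, suffix):
        self.suffix = suffix

    def apply(self, value):
        return value + self.suffix


# Building the graph does not touch any data; it only records the operators.
graph = ["brand"] >> Upper() >> Suffix("!")

# Applying the graph walks the recorded operators in order.
rows = graph.transform([{"brand": "acme", "price": 3}])
```

In this sketch, building `graph` only records which operators to run, and the data is touched only when `transform` is called. That separation loosely mirrors the delayed schema resolution described for nodes in the terminology section.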