adding a new show command, debug operation, and NoOpDestination #89
Conversation
```diff
 nargs="?",
 type=str,
-help='the command to run: `run`, `compile`, `visualize`'
+help='the command to run: `deps`, `compile`, `run`, `show`, `visualize`'
```
I don't think we actually have a `visualize` command; maybe that should be removed.
```python
parser.add_argument("--transpose",
    action='store_true',
    help='transposes the output of `earthmover show`'
)
```
These three new CLI flags are only used by `earthmover show`.
| "results_file": "", | ||
| "function": "head", | ||
| "rows": 10, | ||
| "transpose": False, |
Default values if the flags are omitted. I think these defaults make sense, but happy to discuss alternatives.
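For context, here is a minimal argparse sketch of how these flags and defaults could fit together. Only `--transpose` appears in the quoted diff; the `--function` and `--rows` flag names are assumptions inferred from the default values above.

```python
# Sketch only: wiring the three show-related flags with the defaults quoted above.
# `--function` and `--rows` are assumed flag names; `--transpose` is from the diff.
import argparse

parser = argparse.ArgumentParser(prog="earthmover")
parser.add_argument("--function", type=str, default="head",
                    help="debug function to apply (e.g. head, tail, describe, columns)")
parser.add_argument("--rows", type=int, default=10,
                    help="number of rows for head/tail")
parser.add_argument("--transpose", action="store_true",
                    help="transposes the output of `earthmover show`")

args = parser.parse_args([])  # no flags passed: the defaults apply
print(args.function, args.rows, args.transpose)  # → head 10 False
```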
```python
active_graph = self.filter_graph_on_selector(self.graph, selector=f"{selector}_destination")
self.execute(active_graph)
```
Explaining what happens here; suppose you select the transformation node `my_node` (with `earthmover show -s my_node`):

- we create a transformation node `my_node_show` with `source=my_node` which contains just one debug operation (per the CLI inputs)
- we connect the new transformation node `my_node_show` to a new NoOpDestination node `my_node_destination` so earthmover wouldn't prune off our dangling transformation node
- we filter down the graph to just the selected `my_node` (and upstream and downstream nodes)
- we execute this subgraph

The result is that the debug operation will cause information about `my_node` to be output to the console.
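The steps above can be sketched with a plain adjacency-list graph. All node names and the `reachable` helper are illustrative, not earthmover's actual API; earthmover uses a real DAG library.

```python
# Illustrative sketch: inject a *_show transformation plus a no-op destination,
# then filter to the selected node and everything upstream/downstream of it.
graph = {"my_source": ["my_node"], "my_node": []}

selector = "my_node"
# 1. add a transformation node that holds the single debug operation
graph[selector].append(f"{selector}_show")
graph[f"{selector}_show"] = [f"{selector}_destination"]
# 2. the no-op destination keeps the dangling branch from being pruned
graph[f"{selector}_destination"] = []

# 3. collect nodes reachable from the selector (in either direction)
def reachable(g, start, neighbors):
    seen, stack = set(), [start]
    while stack:
        for m in neighbors(g, stack.pop()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

downstream = reachable(graph, selector, lambda g, n: g.get(n, []))
upstream = reachable(graph, selector, lambda g, n: [k for k, vs in g.items() if n in vs])
keep = {selector} | upstream | downstream  # 4. this subgraph is what gets executed
print(sorted(keep))  # → ['my_node', 'my_node_destination', 'my_node_show', 'my_source']
```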
```python
elif (type(config)==dict or type(config)==YamlMapping) and 'kind' not in config.keys():
    # default for backward compatibility
    return object.__new__(FileDestination)
# else: throw an error?
```
Because I'm adding a new Destination class, I need a way to instantiate it... since `type` is already a property of the Node superclass, and `class` is a reserved Python keyword, I landed on `kind` as the property one can use (in a destination's config) to pick a specific destination.
In the future, we can extend this to other destination kinds, such as `file.jsonl`, `file.csv`, `file.tsv`, `file.parquet`, perhaps even `database.snowflake` or `database.postgres` (with additional column-typing config).
The default destination kind (when unspecified by the user) is the existing `FileDestination`, for backward compatibility.
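A minimal sketch of this `kind`-based dispatch, assuming hypothetical class names and a plain-dict config (earthmover's real implementation differs):

```python
# Sketch only: __new__ picks the concrete destination class from config["kind"],
# falling back to the file destination when `kind` is omitted (backward compatibility).
class Destination:
    def __new__(cls, config):
        kinds = {"file": FileDestination, "noop": NoOpDestination}  # extensible later
        if isinstance(config, dict) and "kind" in config:
            return object.__new__(kinds[config["kind"]])
        return object.__new__(FileDestination)  # default for backward compatibility

class FileDestination(Destination):
    pass

class NoOpDestination(Destination):
    pass

print(type(Destination({"kind": "noop"})).__name__)     # → NoOpDestination
print(type(Destination({"some": "config"})).__name__)   # → FileDestination
```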
We've started using `extension` as this keyword. Maybe we can set `extension` to `"debug"` in this new model, instead of adding a new keyword field.
See my comment here: I think overloading `extension` is not a good solution. A file's extension is not one-to-one with its type.
```python
self.data = (
    self.upstream_sources[self.source].data
    .map_partitions(lambda x: x.apply(self.render_row, axis=1), meta=pd.Series('str'))
)
```
This new destination type does nothing, as the name suggests.
```diff
 self.error_handler.ctx.update(
-    file=self.config.__file__, line=self.config.__line__, node=self, operation=None
+    file=self.config.get("__file__",""), line=self.config.get("__line__",0), node=self, operation=None
```
Since `earthmover show` injects nodes into the graph which didn't come from a file, I had to modify the context update in a few places like so.
To circumvent needing to change these lines, we should just initialize the No-Op config block as a `YamlMapping` and set default values for the `__file__` and `__line__` attributes in the class.
But what would those default values be (when nodes are added programmatically, not from a YAML configuration file)?
Can `__file__` and `__line__` be `None`? That's the only default value I think would make sense, but I don't know how it would come through in the logging messages, or if it would cause errors.
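One possible shape for that idea, as a sketch. This `YamlMapping` is a stand-in for earthmover's actual class, and the `None` defaults are exactly the open question above:

```python
# Sketch only: class-level None defaults for provenance attributes, so nodes
# injected programmatically (with no YAML source) still satisfy the context update.
class YamlMapping(dict):
    __file__ = None  # no source file for injected nodes
    __line__ = None  # no source line either

config = YamlMapping(source="my_node")
print(config.__file__, config.__line__)  # → None None
```

Note that `None` would render as the string `None` in any f-string-based log message, which may or may not be acceptable.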
Sort rows by one or more columns.
```yaml
- operation: debug
  function: head | tail | describe | columns
```
What is the default function?
There isn't currently one. What do you think? Does `head` make sense?
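If `head` were made the default, one way to handle it would be a simple config fallback. This is a hypothetical sketch, not earthmover's actual config handling:

```python
# Sketch only: default the debug function to "head" when the config omits it.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
config = {"operation": "debug"}  # no "function" key given

func = config.get("function", "head")  # fall back to head
result = getattr(df, func)(2) if func in ("head", "tail") else getattr(df, func)()
print(list(result["a"]))  # → [1, 2]
```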
```python
self.earthmover.logger.info(f"debug ({self.func}{rows_str}{transpose_str}) for {transformation_name}:")

# call function and display debug info
if self.func == 'head':
```
A caveat: `.head()` will raise a warning if the dataframe contains fewer rows than the number specified. We should emulate the `head()` behavior in the current debug logic above the conditional.
I tested this and it doesn't seem to... not sure why, or maybe that's configurable.
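A quick check of the behavior discussed above, with plain pandas (this matches the "doesn't seem to warn" observation; dask may behave differently):

```python
# DataFrame.head(n) with n larger than the frame simply returns the whole frame;
# no warning is raised in stock pandas.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
out = df.head(10)
print(len(out))  # → 3
```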
I'm converting this PR to a draft pending further discussion.
This PR adds two new features:

- a new `debug` operation, to be used for debugging
- a new `earthmover show` command, likewise to be used for debugging

as well as a few under-the-hood changes such as a new NoOpDestination (earthmover prunes graph branches with no destination attached, so I had to create this in order to be able to programmatically add a debug operation for `earthmover show` without actually materializing any file to disk).

I will go through and do a self-review, adding some comments about various parts of the changed code.