adding a new show command, debug operation, and NoOpDestination #89
Conversation
```diff
 nargs="?",
 type=str,
-help='the command to run: `run`, `compile`, `visualize`'
+help='the command to run: `deps`, `compile`, `run`, `show`, `visualize`'
```
I don't think we actually have a `visualize` command; maybe that should be removed.
```python
parser.add_argument("--transpose",
    action='store_true',
    help='transposes the output of `earthmover show`'
)
```
These three new CLI flags are only used by `earthmover show`.
| "results_file": "", | ||
| "function": "head", | ||
| "rows": 10, | ||
| "transpose": False, |
Default values if the flags are omitted. I think these defaults make sense, but happy to discuss alternatives.
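For context, here is a minimal argparse sketch of how these flags and defaults could fit together. Only `--transpose` appears in the quoted diff; the `--function` and `--rows` flag names are assumptions inferred from the default values above.

```python
# Sketch only: wiring the three show-related flags with the defaults quoted above.
# `--function` and `--rows` are assumed flag names; `--transpose` is from the diff.
import argparse

parser = argparse.ArgumentParser(prog="earthmover")
parser.add_argument("--function", type=str, default="head",
                    help="debug function to apply (e.g. head, tail, describe, columns)")
parser.add_argument("--rows", type=int, default=10,
                    help="number of rows for head/tail")
parser.add_argument("--transpose", action="store_true",
                    help="transposes the output of `earthmover show`")

args = parser.parse_args([])  # no flags passed: the defaults apply
print(args.function, args.rows, args.transpose)  # → head 10 False
```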
```python
active_graph = self.filter_graph_on_selector(self.graph, selector=f"{selector}_destination")
self.execute(active_graph)
```
Explaining what happens here; suppose you select the transformation node `my_node` (with `earthmover show -s my_node`):

- we create a transformation node `my_node_show` with `source=my_node` which contains just one debug operation (per the CLI inputs)
- we connect the new transformation node `my_node_show` to a new NoOpDestination node `my_node_destination` so earthmover wouldn't prune off our dangling transformation node
- we filter down the graph to just the selected `my_node` (and upstream and downstream nodes)
- we execute this subgraph

The result is that the debug operation will cause information about `my_node` to be output to the console.
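The steps above can be sketched with a plain adjacency-list graph. All node names and the `reachable` helper are illustrative, not earthmover's actual API; earthmover uses a real DAG library.

```python
# Illustrative sketch: inject a *_show transformation plus a no-op destination,
# then filter to the selected node and everything upstream/downstream of it.
graph = {"my_source": ["my_node"], "my_node": []}

selector = "my_node"
# 1. add a transformation node that holds the single debug operation
graph[selector].append(f"{selector}_show")
graph[f"{selector}_show"] = [f"{selector}_destination"]
# 2. the no-op destination keeps the dangling branch from being pruned
graph[f"{selector}_destination"] = []

# 3. collect nodes reachable from the selector (in either direction)
def reachable(g, start, neighbors):
    seen, stack = set(), [start]
    while stack:
        for m in neighbors(g, stack.pop()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

downstream = reachable(graph, selector, lambda g, n: g.get(n, []))
upstream = reachable(graph, selector, lambda g, n: [k for k, vs in g.items() if n in vs])
keep = {selector} | upstream | downstream  # 4. this subgraph is what gets executed
print(sorted(keep))  # → ['my_node', 'my_node_destination', 'my_node_show', 'my_source']
```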
```python
elif (type(config)==dict or type(config)==YamlMapping) and 'kind' not in config.keys():
    # default for backward compatibility
    return object.__new__(FileDestination)
# else: throw an error?
```
Because I'm adding a new Destination class, I need a way to instantiate it... since `type` is already a property of the Node superclass, and `class` is a reserved Python keyword, I landed on `kind` as the property one can use (in a destination's config) to pick a specific destination.
In the future, we can extend this to other destination kinds, such as `file.jsonl`, `file.csv`, `file.tsv`, `file.parquet`, perhaps even `database.snowflake` or `database.postgres` (with additional column-typing config).
The default destination kind (when unspecified by the user) is the existing `FileDestination`, for backward compatibility.
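A minimal sketch of this `kind`-based dispatch, assuming hypothetical class names and a plain-dict config (earthmover's real implementation differs):

```python
# Sketch only: __new__ picks the concrete destination class from config["kind"],
# falling back to the file destination when `kind` is omitted (backward compatibility).
class Destination:
    def __new__(cls, config):
        kinds = {"file": FileDestination, "noop": NoOpDestination}  # extensible later
        if isinstance(config, dict) and "kind" in config:
            return object.__new__(kinds[config["kind"]])
        return object.__new__(FileDestination)  # default for backward compatibility

class FileDestination(Destination):
    pass

class NoOpDestination(Destination):
    pass

print(type(Destination({"kind": "noop"})).__name__)     # → NoOpDestination
print(type(Destination({"some": "config"})).__name__)   # → FileDestination
```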
We've started using `extension` as this keyword. Maybe we can set `extension` to `"debug"` in this new model, instead of adding a new keyword field.
See my comment here: I think overloading `extension` is not a good solution. A file's extension is not one-to-one with its type.
```python
self.data = (
    self.upstream_sources[self.source].data
    .map_partitions(lambda x: x.apply(self.render_row, axis=1), meta=pd.Series('str'))
)
```
This new destination type does nothing, as the name suggests.
```diff
 self.error_handler.ctx.update(
-    file=self.config.__file__, line=self.config.__line__, node=self, operation=None
+    file=self.config.get("__file__",""), line=self.config.get("__line__",0), node=self, operation=None
```
Since `earthmover show` injects nodes into the graph which didn't come from a file, I had to modify the context update in a few places like so.
To circumvent needing to change these lines, we should just initialize the No-Op config block as a `YamlMapping` and set default values for the `__file__` and `__line__` attributes in the class.
But what would those default values be (when nodes are added programmatically, not from a YAML configuration file)?
Can `__file__` and `__line__` be `None`? That's the only default value I think would make sense, but I don't know how it would come through in the logging messages, or if it would cause errors.
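One possible shape for that idea, as a sketch. This `YamlMapping` is a stand-in for earthmover's actual class, and the `None` defaults are exactly the open question above:

```python
# Sketch only: class-level None defaults for provenance attributes, so nodes
# injected programmatically (with no YAML source) still satisfy the context update.
class YamlMapping(dict):
    __file__ = None  # no source file for injected nodes
    __line__ = None  # no source line either

config = YamlMapping(source="my_node")
print(config.__file__, config.__line__)  # → None None
```

Note that `None` would render as the string `None` in any f-string-based log message, which may or may not be acceptable.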
Sort rows by one or more columns.
```yaml
- operation: debug
  function: head | tail | describe | columns
```
What is the default function?
There isn't currently one. What do you think? Does `head` make sense?
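If `head` were made the default, one way to handle it would be a simple config fallback. This is a hypothetical sketch, not earthmover's actual config handling:

```python
# Sketch only: default the debug function to "head" when the config omits it.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
config = {"operation": "debug"}  # no "function" key given

func = config.get("function", "head")  # fall back to head
result = getattr(df, func)(2) if func in ("head", "tail") else getattr(df, func)()
print(list(result["a"]))  # → [1, 2]
```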
```python
self.earthmover.logger.info(f"debug ({self.func}{rows_str}{transpose_str}) for {transformation_name}:")

# call function and display debug info
if self.func == 'head':
```
A caveat: `.head()` will raise a warning if the dataframe contains fewer rows than the number specified. We should emulate the `head()` behavior in the current debug logic above the conditional.
I tested this and it doesn't seem to... not sure why, or maybe that's configurable.
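A quick check of the behavior discussed above, with plain pandas (this matches the "doesn't seem to warn" observation; dask may behave differently):

```python
# DataFrame.head(n) with n larger than the frame simply returns the whole frame;
# no warning is raised in stock pandas.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
out = df.head(10)
print(len(out))  # → 3
```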
I'm converting this PR to a draft pending further discussion.
This PR adds two new features:

- a new `debug` operation, to be used for debugging
- a new `earthmover show` command, likewise to be used for debugging

as well as a few under-the-hood changes such as a new NoOpDestination (earthmover prunes graph branches with no destination attached, so I had to create this in order to be able to programmatically add a debug operation for `earthmover show` without actually materializing any file to disk).

I will go through and do a self-review, adding some comments about various parts of the changed code.