Skip to content

Conversation

@immanuelazn
Copy link
Collaborator

@immanuelazn immanuelazn commented Nov 13, 2024

Add an sklearn-like interface for creating pipelines in terms of fitting, projection, and combining operations on matrices.

Current thoughts:

  • I deviated from the design docs a little bit to make the inheritance make more sense. I made a default PipelineBase, with Pipeline inheriting form it. I then made a PipelineStep that inherits from PipelineBase. Estimator and Transformer both inherit from this class.

    • PipelineBase and Pipeline differ because Pipelines should have steps. This isn't true for single PipelineSteps
    • PipelineBase and PipelineStep differ to indicate each step has a step_name associated with it. Also to allow for shared interface for transformers/predictors in how they are printed and how they can be concatenated to create a pipeline
  • I had to change transform() to project(), given I found a generic base function with the same name. Additionally, I found predict() in the stats package, and changed it to estimate()

  • Tests will be added in another sister PR, as there are no transformers/estimators that are built to test functionality here.

  • I'm not sure which methods I should provide detail to, given that we are not sure how much of this we want to expose. I provided them to the generics themselves, to allow for a meta-look on how to use methods in both PipelineSteps and Pipeline. However, it isn't clear to me whether I need to continue providing an extensive docstring for every overriden method in child classes.

  • I'm not sure which Classes I should be exposing to the reference either. I found that previous BPCells classes (ie IterableMatrix) aren't heavily described in the reference. I provided some information on Pipeline, Estimator, and Transformer, and exposed them to the reference page. I also tried to provide information on how to create a Transformer, and Estimator yourself on the docstring.

  • I don't think I'm completely sold on using the show() method as an analog of the python __repr__() dunder. I think it could be more useful to make it act more similarly to what you used to display IterableMatrix, ie where we still have information on what steps are in a pipeline, but also macro information, like hyper params or details on what the step has fit to. In this case, we would have a __repr__() analog somewhere else.

  • Probably redundant to have both project() and estimate(). What do you think for just combining them into one?

@immanuelazn immanuelazn marked this pull request as ready for review November 18, 2024 23:35
@immanuelazn
Copy link
Collaborator Author

Closed as we are changing direction to #167

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants