Before training a model, you should split your data into a training set and a test set. Each dataset will go through the data cleaning and preprocessing steps before you put it in a machine learning model.
The Scikit-Learn Pipeline is a tool that chains all data-manipulation steps together. It also makes it easier to perform GridSearchCV without data leakage from the test set.
The Pipeline constructor takes a list of (name,estimator) pairs (2-tuples) defining a sequence of steps.
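As a minimal sketch of that constructor (the step names, columns, and imputation strategy here are illustrative assumptions, not taken from the text):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# each step is a (name, estimator) 2-tuple; every step except the last must be a transformer
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])
X_out = num_pipeline.fit_transform(X)  # imputes the NaN, then standardizes
```

The step names double as hyperparameter prefixes in GridSearchCV (e.g. "impute__strategy"), which is what makes tuning a whole pipeline convenient.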
ColumnTransformer will transform each group of dataframe columns separately and combine them later. This is useful in the data preprocessing process.
For example, the following ColumnTransformer will apply:
num_pipeline (defined above) to the numerical attributes
cat_pipeline to the categorical attribute
from sklearn.compose import ColumnTransformer

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ],
    remainder='passthrough',  # options: 'drop' or 'passthrough'
    n_jobs=-1,  # n_jobs=-1 means using all processors to run in parallel
)
If you don’t care about naming the transformers, you can use make_column_transformer()
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include="category")),
    remainder='passthrough',  # options: 'drop' or 'passthrough'
    n_jobs=-1,  # n_jobs=-1 means using all processors to run in parallel
)
You can get the column names using preprocessing.get_feature_names_out() and wrap the data in a nice DataFrame as we did before.
X = df.drop(columns='Target')
y = df['Target']

# train-test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

# pre-process with the pipeline
X_train_pre = preprocessing.fit_transform(X_train, y_train)
X_val_pre = preprocessing.transform(X_val)

column_names = preprocessing.get_feature_names_out()
X_train_pre = pd.DataFrame(X_train_pre, columns=column_names)
X_val_pre = pd.DataFrame(X_val_pre, columns=column_names)
Note 1: make_column_transformer() is not recommended for building the pipeline, as the default naming might not reflect the actual underlying transformation.
In the example below, we know that the Age column is processed by pipeline-1 via the transformed column's name pipeline-1__Age. However, we do not know which transformation was applied to the Age column, which in this case is binning.
Since listing all the column names is not very convenient, Scikit-Learn provides a make_column_selector() function that returns a selector function you can use to automatically select all the features of a given type, such as numerical or categorical.
Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom transformations, cleanup operations, or combining specific attributes.
Custom Function Transformer
For transformations that don’t require any training (i.e. don’t require .fit()), you can just write a function that takes a NumPy array as input and outputs the transformed array.
The inverse_func argument is optional. It lets you specify an inverse transform function
The transformation function can take hyperparameters as additional arguments
from sklearn.preprocessing import FunctionTransformer

def diff_mul(X: np.ndarray, multiplier: int) -> np.ndarray:
    return (X[:, 0] - X[:, 1]) * multiplier

# kw_args provides the "multiplier" input to diff_mul (as 2) on every call to transform()
diff_mul_transformer = FunctionTransformer(diff_mul, kw_args={"multiplier": 2})
diff_mul_transformer.transform(df[['ndp', 'discount']].values)
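The inverse_func argument can be sketched with a log/exp pair (the data values here are illustrative), so that inverse_transform() round-trips the input:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)

X = np.array([[1.0], [10.0], [100.0]])
X_log = log_transformer.fit_transform(X)           # fit() also verifies func/inverse_func round-trip (check_inverse=True)
X_back = log_transformer.inverse_transform(X_log)  # recovers the original values
```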
Custom Class Transformer
A Custom Class Transformer lets you learn trainable parameters in the fit() method and use them later in the transform() method.
Custom Class Transformer requires:
BaseEstimator as a base class (avoid using *args and **kwargs in your constructor); you also get two extra methods, get_params() and set_params(), which are useful for automatic hyperparameter tuning.
TransformerMixin as a base class to get .fit_transform() automatically
Defining a fit(X, y) method with y=None, and it must return self
Note: X should be np.ndarray, since when the transformer is a step in a Pipeline the data is passed as a NumPy array, not a Pandas DataFrame
Defining a transform() method
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
from typing import Union

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):  # [REQUIRED] no *args or **kwargs when using BaseEstimator as a base class
        self.with_mean = with_mean

    def fit(self, X: np.ndarray, y: np.ndarray = None):  # [REQUIRED] y is required even though we don't use it
        X = check_array(X)  # checks that X is an array with finite float values
        self.mean_ = X.mean(axis=0)  # [REQUIRED] learned attributes end with "_"
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]  # [REQUIRED] every estimator stores this in fit()
        return self  # [REQUIRED] always return self!

    def transform(self, X: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
        check_is_fitted(self)  # [REQUIRED] looks for learned attributes (with trailing _)
        X = check_array(X)
        assert self.n_features_in_ == X.shape[1]
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_
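To sanity-check the pattern, here is a condensed, standalone copy of the class fit on toy data (the data values are illustrative):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

# condensed copy of the StandardScalerClone pattern so this snippet runs standalone
class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):
        self.with_mean = with_mean

    def fit(self, X, y=None):
        X = check_array(X)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScalerClone()
X_scaled = scaler.fit_transform(X)  # fit_transform comes free from TransformerMixin

params = scaler.get_params()  # comes free from BaseEstimator, useful for GridSearchCV
```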