PreprocessingInspector
- class chemotools.inspector.PreprocessingInspector(pipeline: Pipeline, X_train: ndarray, y_train: ndarray | None = None, X_test: ndarray | None = None, y_test: ndarray | None = None, X_val: ndarray | None = None, y_val: ndarray | None = None, x_axis: ndarray | None = None)
Bases: SpectraMixin, _DataHoldingBase
Inspector for visualizing the effects of each preprocessing step in a pipeline.
The PreprocessingInspector takes a fitted scikit-learn Pipeline together with the datasets that were used for training (and, optionally, testing/validation). It walks through the pipeline steps, applies each preprocessing transformer cumulatively, and generates one plot per step so that users can visually inspect how each transformation modifies their data.
Steps that are model estimators (such as PCA, PLS, classifiers, or regressors) are automatically detected and excluded from the visualization, because they do not represent a preprocessing transformation.
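The separation of preprocessing steps from model steps can be illustrated with plain scikit-learn. The sketch below is not the library's detection logic: the helper name split_preprocessing_steps and its rule (a step is a model if scikit-learn reports it as a classifier or regressor, or if it is a PCA instance) are assumptions made for illustration only.

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def split_preprocessing_steps(pipeline, model_types=(PCA,)):
    """Hypothetical helper: separate preprocessing steps from model steps.

    Assumption: a step counts as a model if scikit-learn reports it as a
    classifier or regressor, or if it is an instance of ``model_types``.
    The rule used inside chemotools may differ.
    """
    preprocessing, models = [], []
    for name, step in pipeline.steps:
        is_model = (
            is_classifier(step)
            or is_regressor(step)
            or isinstance(step, model_types)
        )
        (models if is_model else preprocessing).append((name, step))
    return preprocessing, models


pipe = make_pipeline(StandardScaler(), MinMaxScaler(), PCA(n_components=3))
preprocessing, models = split_preprocessing_steps(pipe)

print([name for name, _ in preprocessing])  # ['standardscaler', 'minmaxscaler']
print([name for name, _ in models])         # ['pca']
```

The step names come from make_pipeline, which names each step after its lowercased class name.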
The class also inherits SpectraMixin, which provides the inspect_spectra() method for a quick raw vs. fully preprocessed comparison.
- Parameters:
pipeline (Pipeline) – A fitted scikit-learn Pipeline. All steps must already be fitted (i.e. pipeline.fit(X) has been called).
X_train (array-like of shape (n_samples, n_features)) – Training feature matrix (required).
y_train (array-like of shape (n_samples,), optional) – Training target values. Used for colouring plots when color_by='y' is passed to inspect().
X_test (array-like of shape (n_samples, n_features), optional) – Test feature matrix.
y_test (array-like of shape (n_samples,), optional) – Test target values.
X_val (array-like of shape (n_samples, n_features), optional) – Validation feature matrix.
y_val (array-like of shape (n_samples,), optional) – Validation target values.
x_axis (array-like of shape (n_features,), optional) – Feature names or axis values (e.g. wavenumbers). If None, integer indices are used.
- Variables:
pipeline (Pipeline) – The original fitted pipeline.
model (Pipeline) – Alias for pipeline (consistent with PCAInspector / PLSRegressionInspector).
preprocessing_steps (list of tuple) – (name, transformer) pairs for every step that will be visualised (model steps are excluded).
datasets (dict of str to InspectorDataset) – Dictionary of loaded datasets keyed by 'train', 'test', 'val'.
n_features_in (int) – Number of input features.
- Raises:
RuntimeError – If the pipeline has not been fitted.
ValueError – If X_train has a shape that is inconsistent with the other datasets.
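Both error conditions can be checked before the inspector is constructed. The following is a minimal sketch using only scikit-learn and NumPy; preflight is a hypothetical helper, not a chemotools function, and it raises scikit-learn's NotFittedError rather than the RuntimeError documented above. The x_axis length check is an assumption based on the documented shape (n_features,).

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted


def preflight(pipeline, X_train, X_other=None, x_axis=None):
    """Hypothetical pre-check (not a chemotools function)."""
    # check_is_fitted raises NotFittedError for any step that was never fitted.
    for _, step in pipeline.steps:
        check_is_fitted(step)

    n_features = X_train.shape[1]
    if X_other is not None and X_other.shape[1] != n_features:
        raise ValueError(
            f"Expected {n_features} features, got {X_other.shape[1]}."
        )
    # Assumption: x_axis needs one entry per feature, as its documented
    # shape (n_features,) suggests.
    if x_axis is not None and len(x_axis) != n_features:
        raise ValueError(
            f"x_axis has {len(x_axis)} entries for {n_features} features."
        )


rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 50))       # synthetic stand-in for spectra
X_test = rng.normal(size=(5, 50))
wavenumbers = np.linspace(4000, 400, 50)  # illustrative axis values

pipe = make_pipeline(StandardScaler()).fit(X_train)
preflight(pipe, X_train, X_test, wavenumbers)  # passes silently

try:
    preflight(make_pipeline(StandardScaler()), X_train)
except NotFittedError as exc:
    print("not fitted:", type(exc).__name__)
```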
Examples
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler
>>> from sklearn.decomposition import PCA
>>> from chemotools.inspector import PreprocessingInspector
>>>
>>> # X_train and y_train are assumed to be existing arrays of shape
>>> # (n_samples, n_features) and (n_samples,).
>>> pipe = make_pipeline(StandardScaler(), MinMaxScaler(), PCA(n_components=3))
>>> pipe.fit(X_train)
>>>
>>> inspector = PreprocessingInspector(pipe, X_train, y_train)
>>> figures = inspector.inspect()  # one plot per preprocessing step
>>> figures = inspector.inspect_spectra()  # raw vs. fully preprocessed
Attributes
model – Return the original pipeline.
n_features – Return the number of features in original data.
n_samples – Return the number of samples in each dataset.
pipeline – Return the original pipeline.
preprocessing_steps – Return the list of (name, transformer) preprocessing steps.
transformer – Return a pipeline containing only the preprocessing steps.
x_axis – Return the feature names/indices.
- property transformer: Pipeline | None
Return a pipeline containing only the preprocessing steps.
This is used by SpectraMixin to generate raw vs. preprocessed comparison plots via inspect_spectra().
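In plain scikit-learn, a preprocessing-only pipeline can be obtained by slicing a fitted Pipeline. The sketch below illustrates the idea and is not necessarily how the property is implemented; it assumes the model is the final step, and the data are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))  # synthetic stand-in for spectra

pipe = make_pipeline(StandardScaler(), MinMaxScaler(), PCA(n_components=3))
pipe.fit(X)

# Slicing returns a new Pipeline that reuses the already fitted steps,
# so no refitting is needed. Assumption: the model is the last step.
preprocessing_only = pipe[:-1]
X_preprocessed = preprocessing_only.transform(X)

print(X.shape, X_preprocessed.shape)  # (30, 20) (30, 20)
```

Because the sliced pipeline shares its estimator objects with the original, refitting one of them affects both.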
- property pipeline: Pipeline
Return the original pipeline.
- property model: Pipeline
Return the original pipeline.
Alias for pipeline, provided for consistency with PCAInspector and PLSRegressionInspector.
- property preprocessing_steps: List[Tuple[str, object]]
Return the list of (name, transformer) preprocessing steps.
- inspect(dataset: str | Sequence[str] = 'train', color_by: str | Dict[str, np.ndarray] | None = 'y', xlim: Tuple[float, float] | None = None, figsize: Tuple[float, float] = (12, 5), color_mode: Literal['continuous', 'categorical'] | None = None) → Dict[str, 'Figure']
Generate one plot per preprocessing step showing cumulative effects.
For a pipeline with steps [A, B, C, PCA] (where PCA is excluded), this method produces:
Raw – the original input data
After A – A.transform(X)
After A + B – B.transform(A.transform(X))
After A + B + C – C.transform(B.transform(A.transform(X)))
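The cumulative transformation listed above can be reproduced with scikit-learn alone. This is a rough sketch on synthetic data, not the method's implementation: it excludes the model by position (the last step) and borrows the 'raw' / 'step_<i>_<name>' key pattern from the return value of inspect() to label the intermediate arrays.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))  # synthetic stand-in for spectra

pipe = make_pipeline(StandardScaler(), MinMaxScaler(), PCA(n_components=3))
pipe.fit(X)

# Feed the output of each fitted preprocessing step into the next one and
# keep every intermediate result. Assumption: the model is the last step.
stages = {"raw": X}
current = X
for i, (name, step) in enumerate(pipe.steps[:-1], start=1):
    current = step.transform(current)
    stages[f"step_{i}_{name}"] = current

print(list(stages))  # ['raw', 'step_1_standardscaler', 'step_2_minmaxscaler']
```

Each array in stages corresponds to the data that one of the per-step plots would display.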
- Parameters:
dataset (str or sequence of str, default='train') – Dataset(s) to visualise. When a sequence is given, all datasets are overlaid on the same axes, coloured by dataset name.
color_by (str or dict, default='y') – Colouring specification (single-dataset mode only):
'y': colour by target values (if available)
'sample_index': colour by sample index
dict mapping dataset names to colour arrays
Ignored when multiple datasets are provided (colours by dataset instead).
xlim (tuple of float, optional) – X-axis limits for zooming into a spectral region.
figsize (tuple of float, default=(12, 5)) – Figure size (width, height) in inches for each subplot.
color_mode ({'continuous', 'categorical'}, optional) – Override automatic colour-mode detection.
- Returns:
figures – Dictionary mapping step names to matplotlib Figure objects. Keys follow the pattern 'raw', 'step_1_<name>', 'step_2_<name>', etc.
- Return type:
dict of str to Figure
Examples
>>> inspector = PreprocessingInspector(pipeline, X_train, y_train)
>>> figures = inspector.inspect()
>>> figures['raw'].savefig('raw_spectra.png')
>>> figures['step_1_standardscaler'].savefig('after_scaling.png')
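Since the return value is an ordinary dictionary, all figures can be written to disk in one loop instead of one savefig call per key. The helper below is a hypothetical convenience function, not part of chemotools; it only assumes that each value offers matplotlib's savefig(fname, dpi=...) method.

```python
from pathlib import Path


def save_figures(figures, out_dir, fmt="png", dpi=150):
    """Hypothetical helper: save every figure in a {name: Figure} mapping.

    Each file is named after its dictionary key, e.g. 'raw.png'.
    Returns a mapping from key to the path that was written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, fig in figures.items():
        path = out_dir / f"{name}.{fmt}"
        fig.savefig(path, dpi=dpi)
        paths[name] = path
    return paths
```

With an inspector at hand, a call such as save_figures(inspector.inspect(), "plots") would write one image per preprocessing step into the plots directory.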
- summary() → PreprocessingSummary
Return a summary of the pipeline and preprocessing steps.
- Returns:
summary – Typed summary dataclass. Printing the returned object produces a human-readable table (via __repr__).
- Return type:
PreprocessingSummary