PreprocessingInspector

Bases: SpectraMixin, _DataHoldingBase

Inspector for visualizing the effects of each preprocessing step in a pipeline.

The PreprocessingInspector takes a fitted scikit-learn Pipeline together with the datasets that were used for training (and, optionally, testing/validation). It walks through the pipeline steps, applies each preprocessing transformer cumulatively, and generates one plot per step so that users can visually inspect how each transformation modifies their data.

Steps that are model estimators — such as PCA, PLS, classifiers, or regressors — are automatically detected and excluded from the visualization, because they do not represent a preprocessing transformation.

The class also inherits from SpectraMixin, which provides the inspect_spectra() method for a quick comparison of raw and fully preprocessed data.

Parameters:

pipeline (Pipeline) – A fitted scikit-learn Pipeline. All steps must already be fitted (i.e. pipeline.fit(X) has been called).
X_train (array-like of shape (n_samples, n_features)) – Training feature matrix (required).
y_train (array-like of shape (n_samples,), optional) – Training target values. Used for colouring plots when color_by='y'.
X_test (array-like of shape (n_samples, n_features), optional) – Test feature matrix.
y_test (array-like of shape (n_samples,), optional) – Test target values.
X_val (array-like of shape (n_samples, n_features), optional) – Validation feature matrix.
y_val (array-like of shape (n_samples,), optional) – Validation target values.
x_axis (array-like of shape (n_features,), optional) – Feature names or axis values (e.g. wavenumbers). If None, integer indices are used.
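To make the shapes listed above concrete, the sketch below builds synthetic arrays and a wavenumber axis with plain NumPy. The data, the sizes and the 4000 to 400 range are assumptions chosen for the example; only the shape relationships come from the parameter descriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 200

# Synthetic stand-ins for spectra: rows are samples, columns are features.
X_train = rng.normal(size=(50, n_features))
y_train = rng.normal(size=50)
X_test = rng.normal(size=(20, n_features))
y_test = rng.normal(size=20)

# One axis value per feature, e.g. descending wavenumbers.
x_axis = np.linspace(4000, 400, n_features)

# Every feature matrix shares n_features; each y has one value per sample.
print(X_train.shape, X_test.shape, x_axis.shape)
```

Arrays like these would then be passed, together with a pipeline fitted on X_train, using the parameter names above (for example X_test=X_test, y_test=y_test, x_axis=x_axis).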

Variables:

pipeline (Pipeline) – The original fitted pipeline.
model (Pipeline) – Alias for pipeline (consistent with PCAInspector / PLSRegressionInspector).
preprocessing_steps (list of tuple) – (name, transformer) pairs for every step that will be visualised (model steps are excluded).
datasets (dict of str to InspectorDataset) – Dictionary of loaded datasets keyed by 'train', 'test', 'val'.
n_features_in (int) – Number of input features.

Raises:

TypeError – If pipeline is not a Pipeline.
RuntimeError – If the pipeline has not been fitted.
ValueError – If the shape of X_train is inconsistent with that of the other datasets.
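To check in advance whether a pipeline would pass the "fitted" requirement, one option is to test every step with scikit-learn's check_is_fitted. This is a minimal sketch of that idea, not the check the inspector itself performs; `all_steps_fitted` is a hypothetical helper.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.utils.validation import check_is_fitted


def all_steps_fitted(pipeline):
    """Hypothetical helper: True if every estimator step in the pipeline is fitted."""
    for _, step in pipeline.steps:
        if step is None or step == "passthrough":
            continue  # placeholder steps have nothing to fit
        try:
            check_is_fitted(step)
        except NotFittedError:
            return False
    return True


pipe = make_pipeline(StandardScaler(), MinMaxScaler())
fitted_before = all_steps_fitted(pipe)

X = np.random.default_rng(0).normal(size=(10, 4))
pipe.fit(X)
fitted_after = all_steps_fitted(pipe)

print(fitted_before, fitted_after)
```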

Examples

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler, MinMaxScaler
>>> from sklearn.decomposition import PCA
>>> from chemotools.inspector import PreprocessingInspector
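>>> import numpy as np
>>>
>>> # Synthetic stand-in data (assumption): 50 spectra with 200 features each
>>> rng = np.random.default_rng(0)
>>> X_train = rng.normal(size=(50, 200))
>>> y_train = rng.normal(size=50)
>>>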
>>> pipe = make_pipeline(StandardScaler(), MinMaxScaler(), PCA(n_components=3))
>>> pipe.fit(X_train)
>>>
>>> inspector = PreprocessingInspector(pipe, X_train, y_train)
>>> figures = inspector.inspect()          # one plot per preprocessing step
>>> figures = inspector.inspect_spectra()  # raw vs. fully preprocessed

Attributes

`model`	Return the original pipeline.
`n_features`	Return the number of features in original data.
`n_samples`	Return the number of samples in each dataset.
`pipeline`	Return the original pipeline.
`preprocessing_steps`	Return the list of `(name, transformer)` preprocessing steps.
`transformer`	Return a pipeline containing only the preprocessing steps.
`x_axis`	Return the feature names/indices.

property transformer: Pipeline | None

Return a pipeline containing only the preprocessing steps.

This is used by SpectraMixin to generate raw vs. preprocessed comparison plots via inspect_spectra().
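One way to picture this property is a sub-pipeline that holds only the preprocessing steps, or None when there are none. The sketch assumes the model is the final step and relies on scikit-learn's Pipeline slicing; `preprocessing_only` is a hypothetical helper, not the library's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def preprocessing_only(pipeline):
    """Hypothetical helper: drop the final (model) step; None if nothing is left."""
    if len(pipeline.steps) <= 1:
        return None
    # Slicing a Pipeline returns a new Pipeline that reuses the fitted steps.
    return pipeline[:-1]


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))

pipe = make_pipeline(StandardScaler(), MinMaxScaler(), PCA(n_components=3)).fit(X)

transformer = preprocessing_only(pipe)
X_preprocessed = transformer.transform(X)  # fully preprocessed, before PCA

# A pipeline consisting only of a model has no preprocessing part.
model_only = preprocessing_only(make_pipeline(PCA(n_components=3)))

print(X.shape, X_preprocessed.shape, model_only)
```

The pair (X, X_preprocessed) is the kind of raw vs. preprocessed data that a comparison plot would draw side by side.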

property pipeline: Pipeline

Return the original pipeline.

property model: Pipeline

Return the original pipeline.

Alias for pipeline, provided for consistency with PCAInspector and PLSRegressionInspector.

property preprocessing_steps: List[Tuple[str, object]]

Return the list of (name, transformer) preprocessing steps.

inspect(dataset: str | Sequence[str] = 'train', color_by: str | Dict[str, np.ndarray] | None = 'y', xlim: Tuple[float, float] | None = None, figsize: Tuple[float, float] = (12, 5), color_mode: Literal['continuous', 'categorical'] | None = None) → Dict[str, 'Figure']

Generate one plot per preprocessing step showing cumulative effects.

For a pipeline with steps [A, B, C, PCA] (where PCA is excluded), this method produces:

Raw – the original input data
After A – A.transform(X)
After A + B – B.transform(A.transform(X))
After A + B + C – C.transform(B.transform(A.transform(X)))
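The cumulative application listed above can be sketched with plain scikit-learn, without any plotting. The sketch assumes the model is the last step and stores each intermediate array under keys that mirror the 'raw' / 'step_<i>_<name>' pattern used for the returned figures; the `snapshots` dictionary is a name chosen for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))

pipe = make_pipeline(StandardScaler(), MinMaxScaler(), PCA(n_components=3))
pipe.fit(X)

# Assumption for this sketch: the final step is the model and is skipped.
preprocessing_steps = pipe.steps[:-1]

snapshots = {"raw": X}
current = X
for i, (name, transformer) in enumerate(preprocessing_steps, start=1):
    # Each fitted transformer receives the output of the previous one.
    current = transformer.transform(current)
    snapshots[f"step_{i}_{name}"] = current

for key, data in snapshots.items():
    print(key, data.shape)
```

Each entry in `snapshots` corresponds to the data one plot would show: the raw input first, then the result after each additional preprocessing step.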

Parameters:

dataset (str or sequence of str, default='train') – Dataset(s) to visualise. When a sequence is given, all datasets are overlaid on the same axes, coloured by dataset name.
color_by (str or dict, default='y') –
Colouring specification (single-dataset mode only):
- 'y': colour by target values (if available)
- 'sample_index': colour by sample index
- dict mapping dataset names to colour arrays
Ignored when multiple datasets are provided (colours by dataset instead).
xlim (tuple of float, optional) – X-axis limits for zooming into a spectral region.
figsize (tuple of float, default=(12, 5)) – Figure size (width, height) in inches for each subplot.
color_mode ({'continuous', 'categorical'}, optional) – Override automatic colour-mode detection.
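How the automatic colour-mode detection works is not specified here. Purely to illustrate why an override can be useful, the sketch below uses one plausible heuristic: non-numeric values, or numeric values with only a few distinct levels, are treated as categorical, and everything else as continuous. The function `guess_color_mode` and the threshold `max_categories` are hypothetical.

```python
import numpy as np


def guess_color_mode(values, max_categories=10):
    """Hypothetical heuristic, not the library's detection logic."""
    values = np.asarray(values)
    if not np.issubdtype(values.dtype, np.number):
        return "categorical"  # strings, booleans, objects
    if np.unique(values).size <= max_categories:
        return "categorical"  # few distinct numeric levels
    return "continuous"


labels = np.array(["A", "B", "A", "C"])
concentrations = np.linspace(0.0, 5.0, 50)
dose_levels = np.repeat([1.0, 2.0, 5.0], 10)

print(guess_color_mode(labels))
print(guess_color_mode(concentrations))
print(guess_color_mode(dose_levels))
```

Under a heuristic like this, a numeric target measured at only three levels would be coloured as categories; passing color_mode='continuous' is the documented way to force the other behaviour.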

Returns:

figures – Dictionary mapping step names to matplotlib Figure objects. Keys follow the pattern 'raw', 'step_1_<name>', 'step_2_<name>', etc.

Return type:

dict of str to Figure

Examples

>>> inspector = PreprocessingInspector(pipeline, X_train, y_train)
>>> figures = inspector.inspect()
>>> figures['raw'].savefig('raw_spectra.png')
>>> figures['step_1_standardscaler'].savefig('after_scaling.png')

summary() → PreprocessingSummary

Return a summary of the pipeline and preprocessing steps.

Returns:

summary – Typed summary dataclass. Printing the returned object produces a human-readable table (via __repr__).

Return type:

PreprocessingSummary
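The fields of PreprocessingSummary are not listed here. The general pattern of a dataclass whose __repr__ renders a table can be sketched as follows; `ToySummary` and its fields are hypothetical stand-ins and do not reflect the real class.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(repr=False)
class ToySummary:
    """Hypothetical stand-in for a typed summary object."""

    n_features: int
    steps: List[Tuple[str, str]] = field(default_factory=list)

    def __repr__(self):
        # Pad the first column to the longest step name.
        width = max([len("Step")] + [len(name) for name, _ in self.steps])
        lines = [f"Features: {self.n_features}", f"{'Step':<{width}}  Transformer"]
        lines += [f"{name:<{width}}  {cls}" for name, cls in self.steps]
        return "\n".join(lines)


summary = ToySummary(
    n_features=200,
    steps=[("standardscaler", "StandardScaler"), ("minmaxscaler", "MinMaxScaler")],
)

print(summary)             # table for reading
print(summary.n_features)  # typed field for programmatic use
```

The point of the pattern is that the same object serves both purposes: its fields can be accessed in code, while printing it gives a readable overview.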