PCAInspector#

class chemotools.inspector.PCAInspector(model: _BasePCA | Pipeline, X_train: ndarray, y_train: ndarray | None = None, X_test: ndarray | None = None, y_test: ndarray | None = None, X_val: ndarray | None = None, y_val: ndarray | None = None, x_axis: Sequence | None = None, confidence: float = 0.95)[source]

Bases: SpectraMixin, LatentVariableMixin, _BaseInspector

Inspector for PCA model diagnostics and visualization.

This class provides a unified interface for inspecting PCA models by creating multiple independent diagnostic plots. Instead of complex dashboards with many subplots, each method produces several separate figure windows that are easier to customize, save, and interact with individually.

The inspector provides convenience methods that create multiple independent plots:

  • inspect(): Creates all diagnostic plots (scores, loadings, explained variance)

  • inspect_spectra(): Creates raw and preprocessed spectra plots (if preprocessing exists)

Parameters:
  • model (_BasePCA or Pipeline) – Fitted PCA model or pipeline ending with PCA

  • X_train (array-like of shape (n_samples, n_features)) – Training data

  • y_train (array-like of shape (n_samples,), optional) – Training labels/targets (for coloring plots)

  • X_test (array-like of shape (n_samples, n_features), optional) – Test data

  • y_test (array-like of shape (n_samples,), optional) – Test labels/targets

  • X_val (array-like of shape (n_samples, n_features), optional) – Validation data

  • y_val (array-like of shape (n_samples,), optional) – Validation labels/targets

  • x_axis (array-like of shape (n_features,), optional) – Feature names (e.g., wavenumbers for spectroscopy). If None, feature indices are used.

  • confidence (float, default=0.95) – Confidence level for outlier detection limits (Hotelling’s T² and Q residuals). Must be between 0 and 1. Used to calculate critical values for diagnostic plots.

Variables:
  • model (_BasePCA or Pipeline) – The original model passed to the inspector

  • estimator (_BasePCA) – The PCA estimator

  • transformer (Pipeline or None) – Preprocessing pipeline before PCA (if model was a Pipeline)

  • n_components (int) – Number of principal components

  • n_features (int) – Number of features in original data

  • n_samples (dict) – Number of samples in each dataset

  • x_axis (ndarray) – Feature names/indices

  • confidence (float) – Confidence level for outlier detection

  • hotelling_t2_limit (float) – Critical value for Hotelling’s T² statistic (computed on training data)

  • q_residuals_limit (float) – Critical value for Q residuals statistic (computed on training data)

Examples

>>> from sklearn.decomposition import PCA
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from chemotools.datasets import load_fermentation_train
>>> from chemotools.inspector import PCAInspector
>>>
>>> # Load data
>>> X, y = load_fermentation_train()
>>> # Create and fit pipeline
>>> pipeline = make_pipeline(
...     StandardScaler(),
...     PCA(n_components=5)
... )
>>> pipeline.fit(X)
>>>
>>> # Create inspector
>>> inspector = PCAInspector(pipeline, X, y, x_axis=X.columns)
>>>
>>> # Print summary table
>>> inspector.summary()
>>>
>>> # Create all diagnostic plots (multiple independent figures)
>>> inspector.inspect()  # Creates scores, loadings, and variance plots
>>>
>>> # Compare preprocessing (creates 2 independent figures)
>>> inspector.inspect_spectra()
>>>
>>> # Access underlying data for custom analysis
>>> scores = inspector.get_scores('train')
>>> loadings = inspector.get_loadings([0, 1, 2])

Notes

Memory usage scales linearly with dataset size. For very large datasets (>100,000 samples), consider subsampling for initial exploration.
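
For instance, a minimal subsampling sketch with NumPy (X_large and y_large are hypothetical arrays standing in for a large dataset; the seed and subset size are arbitrary choices, not part of the API):

>>> import numpy as np
>>> rng = np.random.default_rng(42)
>>> idx = rng.choice(len(X_large), size=10_000, replace=False)  # random subset of rows
>>> inspector = PCAInspector(pipeline, X_large[idx], y_large[idx])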

Attributes

component_label

confidence

Return the confidence level for outlier detection.

estimator

Return the underlying estimator (PCA or PLS).

hotelling_t2_limit

Return the Hotelling's T² critical value at the specified confidence level.

model

Return the original model.

n_components

Return the number of latent variables/components.

n_features

Return the number of features in original data.

n_samples

Return the number of samples in each dataset.

q_residuals_limit

Return the Q residuals critical value at the specified confidence level.

transformer

Return the preprocessing transformer (if any).

x_axis

Return the feature names/indices.

component_label: str = 'PC'

summary() → PCASummary[source]

Get a summary of the PCA model.

Returns:

summary – Object containing model information

Return type:

PCASummary

get_latent_scores(dataset: str) → ndarray[source]

Hook for LatentVariableMixin; returns the scores.

get_latent_explained_variance() → ndarray | None[source]

Hook for LatentVariableMixin; returns the explained variance ratio.

get_latent_loadings() → ndarray[source]

Hook for LatentVariableMixin; returns the loadings.

get_scores(dataset: str = 'train') → ndarray[source]

Get PCA scores for specified dataset.

Parameters:

dataset ({'train', 'test', 'val'}, default='train') – Which dataset to get scores for

Returns:

scores – PCA scores

Return type:

ndarray of shape (n_samples, n_components)
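
For custom analysis outside the built-in plots, a minimal sketch (matplotlib is an assumption here, not a requirement of the API; inspector is the fitted instance from the class-level example):

>>> import matplotlib.pyplot as plt
>>> scores = inspector.get_scores('train')
>>> fig, ax = plt.subplots()
>>> ax.scatter(scores[:, 0], scores[:, 1])  # PC1 vs PC2
>>> ax.set_xlabel('PC1')
>>> ax.set_ylabel('PC2')
>>> plt.show()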

get_loadings(components: int | Sequence[int] | None = None) → ndarray[source]

Get PCA loadings.

Parameters:

components (int, list of int, or None, default=None) – Which components to return. If None, returns all components.

Returns:

loadings – PCA loadings (components transposed)

Return type:

ndarray of shape (n_features, n_components_selected)
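
As an illustration, a sketch of plotting the first two loading vectors against the stored x_axis attribute (assumes matplotlib is available):

>>> import matplotlib.pyplot as plt
>>> loadings = inspector.get_loadings([0, 1])  # shape (n_features, 2)
>>> fig, ax = plt.subplots()
>>> ax.plot(inspector.x_axis, loadings[:, 0], label='PC1')
>>> ax.plot(inspector.x_axis, loadings[:, 1], label='PC2')
>>> ax.legend()
>>> plt.show()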

get_explained_variance_ratio() → ndarray[source]

Get explained variance ratio for all components.

Returns:

explained_variance_ratio – Explained variance ratio

Return type:

ndarray of shape (n_components,)
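
A short sketch of a common follow-up computation with NumPy (the 0.95 threshold mirrors the default variance_threshold of inspect()):

>>> import numpy as np
>>> evr = inspector.get_explained_variance_ratio()
>>> cumulative = np.cumsum(evr)
>>> n_needed = int(np.searchsorted(cumulative, 0.95) + 1)  # components needed to reach 95%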

inspect(dataset: str | Sequence[str] = 'train', components_scores: int | Tuple[int, int] | Sequence[int | Tuple[int, int]] | None = None, loadings_components: int | Sequence[int] | None = None, variance_threshold: float = 0.95, color_by: str | Dict[str, np.ndarray] | Sequence | np.ndarray | None = None, annotate_by: str | Dict[str, np.ndarray] | Sequence | np.ndarray | None = None, plot_config: InspectorPlotConfig | None = None, color_mode: Literal['continuous', 'categorical'] = 'continuous', **kwargs) → Dict[str, Figure][source]

Create all diagnostic plots for the PCA model.

Parameters:
  • dataset (str or sequence of str, default='train') – Dataset(s) to visualize. Can be 'train', 'test', 'val', or a list.

  • components_scores (int, tuple, or sequence, optional) –

    Components to plot for scores.

    • Int: Creates one 1D scatter plot (e.g., 0 for PC1 vs sample index)

    • Single tuple (x, y): Creates one 2D scatter plot (e.g., (0, 1) for PC1 vs PC2)

    • Sequence: Creates multiple plots (e.g., ((0, 1), (1, 2), 0) or [0, 1, (0, 1)])

  • loadings_components (int, sequence of int, or None, optional) –

    Which components to show in loadings plot. If None (default), automatically selects all available components:

    • 1 component: 0

    • 2+ components: [0, 1, …, n_components-1] (all components)

  • variance_threshold (float, default=0.95) – Threshold line for explained variance plot

  • color_by (str or dict, optional) –

    Coloring specification:

    • 'y': Color by y values (if available)

    • 'sample_index': Color by sample index

    • dict: Map dataset names to color arrays

    • None: Color by dataset (for multi-dataset plots) or 'y' (for single dataset)

  • annotate_by (str or dict, optional) –

    Annotations for score plot points. Can be:

    • 'sample_index': Annotate with sample indices (0, 1, 2, …)

    • 'y': Annotate with y values (only for single dataset)

    • dict: Dictionary mapping dataset names to annotation arrays, e.g., {'train': ['A', 'B', 'C'], 'test': ['D', 'E']}

    If None (default), no annotations are added.

  • plot_config (InspectorPlotConfig, optional) – Configuration object for plot sizes and styles. If None, defaults are used.

  • color_mode (Literal["continuous", "categorical"], default="continuous") – Mode for coloring points.

  • **kwargs – Optional keyword arguments to override specific fields in plot_config (e.g., scores_figsize=(8, 8)).

Returns:

figures – Dictionary containing all created figures with keys:

  • 'scores_1', 'scores_2', …: Combined scores plots (95% confidence ellipses)

  • 'scores_1_train', 'scores_1_test', …: Dataset-specific copies of each scores plot (only when multiple datasets are provided); each plot uses a dedicated dataset color

  • 'loadings': Loadings plot

  • 'variance': Explained variance plot

  • 'distances': Diagnostic distances plot (Hotelling’s T² vs Q residuals)

  • 'raw_spectra', 'preprocessed_spectra': Spectra plots (if preprocessing exists)

Combined scores plots render all requested datasets on shared axes, colored by dataset. The number of 'scores_N*' entries depends on the components_scores parameter.

Return type:

dict

Examples

>>> inspector = PCAInspector(pca, X_train, y_train)
>>> # Default: 2 scores plots + loadings + variance + spectra (if preprocessing exists)
>>> figs = inspector.inspect()
>>> # Multiple datasets for comparison
>>> inspector.X_test = X_test
>>> inspector.y_test = y_test
>>> figs = inspector.inspect(dataset=["train", "test"])
>>> # Access individual figures
>>> figs["scores_1_train"].savefig("scores_1_train.png")
>>> figs["scores_1_test"].savefig("scores_1_test.png")
>>> # Single 2D scores plot (PC1 vs PC2)
>>> figs = inspector.inspect(components_scores=(0, 1))
>>> # Single 1D scores plot (PC1 vs sample index or y)
>>> figs = inspector.inspect(components_scores=0)
>>> # Three plots: 2D, 2D, and 1D
>>> figs = inspector.inspect(components_scores=((0, 1), (1, 2), 2))
>>> # Mix of 1D and 2D plots
>>> figs = inspector.inspect(components_scores=[0, 1, (0, 1)])
>>> # Save individual plots
>>> figs['scores_1'].savefig('scores_pc1_pc2.png')
>>> figs['loadings'].savefig('loadings.png')
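
A couple of additional call patterns built only from the parameters documented above (the scores_figsize override follows the example given under **kwargs):

>>> # Color points by target values and enlarge the scores figures
>>> figs = inspector.inspect(color_by='y', scores_figsize=(8, 8))
>>> # Annotate a single 2D scores plot with sample indices
>>> figs = inspector.inspect(components_scores=(0, 1), annotate_by='sample_index')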