PCAInspector#

class chemotools.inspector.PCAInspector(model: _BasePCA | Pipeline, X_train: ndarray, y_train: ndarray | None = None, X_test: ndarray | None = None, y_test: ndarray | None = None, X_val: ndarray | None = None, y_val: ndarray | None = None, x_axis: Sequence | None = None, confidence: float = 0.95)[source]

Bases: SpectraMixin, LatentVariableMixin, _BaseInspector

Inspector for PCA model diagnostics and visualization.

This class provides a unified interface for inspecting PCA models by creating multiple independent diagnostic plots. Instead of complex dashboards with many subplots, each method produces several separate figure windows that are easier to customize, save, and interact with individually.

The inspector provides convenience methods that create multiple independent plots:

  • inspect(): Creates all diagnostic plots (scores, loadings, explained variance)

  • inspect_spectra(): Creates raw and preprocessed spectra plots (if preprocessing exists)

Parameters:
  • model (_BasePCA or Pipeline) – Fitted PCA model or pipeline ending with PCA

  • X_train (array-like of shape (n_samples, n_features)) – Training data

  • y_train (array-like of shape (n_samples,), optional) – Training labels/targets (for coloring plots)

  • X_test (array-like of shape (n_samples, n_features), optional) – Test data

  • y_test (array-like of shape (n_samples,), optional) – Test labels/targets

  • X_val (array-like of shape (n_samples, n_features), optional) – Validation data

  • y_val (array-like of shape (n_samples,), optional) – Validation labels/targets

  • x_axis (array-like of shape (n_features,), optional) – Feature names (e.g., wavenumbers for spectroscopy). If None, feature indices are used.

  • confidence (float, default=0.95) – Confidence level for outlier detection limits (Hotelling’s T² and Q residuals). Must be between 0 and 1. Used to calculate critical values for diagnostic plots.

Variables:
  • model (_BasePCA or Pipeline) – The original model passed to the inspector

  • estimator (_BasePCA) – The PCA estimator

  • transformer (Pipeline or None) – Preprocessing pipeline before PCA (if model was a Pipeline)

  • n_components (int) – Number of principal components

  • n_features (int) – Number of features in original data

  • n_samples (dict) – Number of samples in each dataset

  • x_axis (ndarray) – Feature names/indices

  • confidence (float) – Confidence level for outlier detection

  • hotelling_t2_limit (float) – Critical value for Hotelling’s T² statistic (computed on training data)

  • q_residuals_limit (float) – Critical value for Q residuals statistic (computed on training data)

Examples

>>> from sklearn.decomposition import PCA
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from chemotools.datasets import load_fermentation_train
>>> from chemotools.inspector import PCAInspector
>>>
>>> # Load data
>>> X, y = load_fermentation_train()
>>> # Create and fit pipeline
>>> pipeline = make_pipeline(
...     StandardScaler(),
...     PCA(n_components=5)
... )
>>> pipeline.fit(X)
>>>
>>> # Create inspector
>>> inspector = PCAInspector(pipeline, X, y, x_axis=X.columns)
>>>
>>> # Print summary table
>>> inspector.summary()
>>>
>>> # Create all diagnostic plots (multiple independent figures)
>>> inspector.inspect()  # Creates scores, loadings, and variance plots
>>>
>>> # Compare preprocessing (creates 2 independent figures)
>>> inspector.inspect_spectra()
>>>
>>> # Access underlying data for custom analysis
>>> scores = inspector.get_scores('train')
>>> loadings = inspector.get_loadings([0, 1, 2])

Notes

Memory usage scales linearly with dataset size. For very large datasets (>100,000 samples), consider subsampling for initial exploration.
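
For instance, a minimal subsampling sketch with NumPy (X_large and y_large are hypothetical arrays standing in for a large dataset; the seed and subset size are arbitrary choices, not part of the API):

>>> import numpy as np
>>> rng = np.random.default_rng(42)
>>> idx = rng.choice(len(X_large), size=10_000, replace=False)  # random subset of rows
>>> inspector = PCAInspector(pipeline, X_large[idx], y_large[idx])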

Attributes

component_label

confidence

Return the confidence level for outlier detection.

estimator

Return the underlying estimator (PCA or PLS).

hotelling_t2_limit

Return the Hotelling's T² critical value at the specified confidence level.

model

Return the original model.

n_components

Return the number of latent variables/components.

n_features

Return the number of features in original data.

n_samples

Return the number of samples in each dataset.

q_residuals_limit

Return the Q residuals critical value at the specified confidence level.

transformer

Return the preprocessing transformer (if any).

x_axis

Return the feature names/indices.

component_label: str = 'PC'

summary() → PCASummary[source]

Get a summary of the PCA model.

Returns:

summary – Object containing model information

Return type:

PCASummary

get_latent_scores(dataset: str) → ndarray[source]

Hook for LatentVariableMixin; returns the scores.

get_latent_explained_variance() → ndarray | None[source]

Hook for LatentVariableMixin; returns the explained variance ratio.

get_latent_loadings() → ndarray[source]

Hook for LatentVariableMixin; returns the loadings.

get_scores(dataset: str = 'train') → ndarray[source]

Get PCA scores for specified dataset.

Parameters:

dataset ({'train', 'test', 'val'}, default='train') – Which dataset to get scores for

Returns:

scores – PCA scores

Return type:

ndarray of shape (n_samples, n_components)
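
For custom analysis outside the built-in plots, a minimal sketch (matplotlib is an assumption here, not a requirement of the API; inspector is the fitted instance from the class-level example):

>>> import matplotlib.pyplot as plt
>>> scores = inspector.get_scores('train')
>>> fig, ax = plt.subplots()
>>> ax.scatter(scores[:, 0], scores[:, 1])  # PC1 vs PC2
>>> ax.set_xlabel('PC1')
>>> ax.set_ylabel('PC2')
>>> plt.show()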

get_loadings(components: int | Sequence[int] | None = None) → ndarray[source]

Get PCA loadings.

Parameters:

components (int, list of int, or None, default=None) – Which components to return. If None, returns all components.

Returns:

loadings – PCA loadings (components transposed)

Return type:

ndarray of shape (n_features, n_components_selected)
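
As an illustration, a sketch of plotting the first two loading vectors against the stored x_axis attribute (assumes matplotlib is available):

>>> import matplotlib.pyplot as plt
>>> loadings = inspector.get_loadings([0, 1])  # shape (n_features, 2)
>>> fig, ax = plt.subplots()
>>> ax.plot(inspector.x_axis, loadings[:, 0], label='PC1')
>>> ax.plot(inspector.x_axis, loadings[:, 1], label='PC2')
>>> ax.legend()
>>> plt.show()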

get_explained_variance_ratio() → ndarray[source]

Get explained variance ratio for all components.

Returns:

explained_variance_ratio – Explained variance ratio

Return type:

ndarray of shape (n_components,)
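
A short sketch of a common follow-up computation with NumPy (the 0.95 threshold mirrors the default variance_threshold of inspect()):

>>> import numpy as np
>>> evr = inspector.get_explained_variance_ratio()
>>> cumulative = np.cumsum(evr)
>>> n_needed = int(np.searchsorted(cumulative, 0.95) + 1)  # components needed to reach 95%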

inspect(dataset: str | Sequence[str] = 'train', components_scores: int | Tuple[int, int] | Sequence[int | Tuple[int, int]] | None = None, loadings_components: int | Sequence[int] | None = None, variance_threshold: float = 0.95, color_by: str | Dict[str, np.ndarray] | Sequence | np.ndarray | None = None, annotate_by: str | Dict[str, np.ndarray] | Sequence | np.ndarray | None = None, plot_config: InspectorPlotConfig | None = None, color_mode: Literal['continuous', 'categorical'] = 'continuous', **kwargs) → Dict[str, Figure][source]

Create all diagnostic plots for the PCA model.

Parameters:
  • dataset (str or sequence of str, default='train') – Dataset(s) to visualize. Can be 'train', 'test', 'val', or a list.

  • components_scores (int, tuple, or sequence, optional) –

    Components to plot for scores.

    • Int: Creates one 1D scatter plot (e.g., 0 for PC1 vs sample index)

    • Single tuple (x, y): Creates one 2D scatter plot (e.g., (0, 1) for PC1 vs PC2)

    • Sequence: Creates multiple plots (e.g., ((0, 1), (1, 2), 0) or [0, 1, (0, 1)])

  • loadings_components (int, sequence of int, or None, optional) –

    Which components to show in loadings plot. If None (default), automatically selects all available components:

    • 1 component: 0

    • 2+ components: [0, 1, …, n_components-1] (all components)

  • variance_threshold (float, default=0.95) – Threshold line for explained variance plot

  • color_by (str or dict, optional) –

    Coloring specification:

    • 'y': Color by y values (if available)

    • 'sample_index': Color by sample index

    • dict: Map dataset names to color arrays

    • None: Color by dataset (for multi-dataset plots) or 'y' (for single dataset)

  • annotate_by (str or dict, optional) –

    Annotations for score plot points. Can be:

    • 'sample_index': Annotate with sample indices (0, 1, 2, …)

    • 'y': Annotate with y values (only for single dataset)

    • dict: Dictionary mapping dataset names to annotation arrays, e.g., {'train': ['A', 'B', 'C'], 'test': ['D', 'E']}

    If None (default), no annotations are added.

  • plot_config (InspectorPlotConfig, optional) – Configuration object for plot sizes and styles. If None, defaults are used.

  • color_mode (Literal["continuous", "categorical"], default="continuous") – Mode for coloring points.

  • **kwargs – Optional keyword arguments to override specific fields in plot_config (e.g., scores_figsize=(8, 8)).

Returns:

figures – Dictionary containing all created figures with keys:

  • 'scores_1', 'scores_2', …: Combined scores plots (95% confidence ellipses)

  • 'scores_1_train', 'scores_1_test', …: Dataset-specific copies of each scores plot (only when multiple datasets are provided); each plot uses a dedicated dataset color

  • 'loadings': Loadings plot

  • 'variance': Explained variance plot

  • 'distances': Diagnostic distances plot (Hotelling’s T² vs Q residuals)

  • 'raw_spectra', 'preprocessed_spectra': Spectra plots (if preprocessing exists)

Combined scores plots render all requested datasets on shared axes, colored by dataset. The number of 'scores_N*' entries depends on the components_scores parameter.

Return type:

dict

Examples

>>> inspector = PCAInspector(pca, X_train, y_train)
>>> # Default: 2 scores plots + loadings + variance + spectra (if preprocessing exists)
>>> figs = inspector.inspect()
>>> # Multiple datasets for comparison
>>> inspector.X_test = X_test
>>> inspector.y_test = y_test
>>> figs = inspector.inspect(dataset=["train", "test"])
>>> # Access individual figures
>>> figs["scores_1_train"].savefig("scores_1_train.png")
>>> figs["scores_1_test"].savefig("scores_1_test.png")
>>> # Single 2D scores plot (PC1 vs PC2)
>>> figs = inspector.inspect(components_scores=(0, 1))
>>> # Single 1D scores plot (PC1 vs sample index or y)
>>> figs = inspector.inspect(components_scores=0)
>>> # Three plots: 2D, 2D, and 1D
>>> figs = inspector.inspect(components_scores=((0, 1), (1, 2), 2))
>>> # Mix of 1D and 2D plots
>>> figs = inspector.inspect(components_scores=[0, 1, (0, 1)])
>>> # Save individual plots
>>> figs['scores_1'].savefig('scores_pc1_pc2.png')
>>> figs['loadings'].savefig('loadings.png')
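
A couple of additional call patterns built only from the parameters documented above (the scores_figsize override follows the example given under **kwargs):

>>> # Color points by target values and enlarge the scores figures
>>> figs = inspector.inspect(color_by='y', scores_figsize=(8, 8))
>>> # Annotate a single 2D scores plot with sample indices
>>> figs = inspector.inspect(components_scores=(0, 1), annotate_by='sample_index')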