PCAInspector
- class chemotools.inspector.PCAInspector(model: _BasePCA | Pipeline, X_train: ndarray, y_train: ndarray | None = None, X_test: ndarray | None = None, y_test: ndarray | None = None, X_val: ndarray | None = None, y_val: ndarray | None = None, x_axis: Sequence | None = None, confidence: float = 0.95)
Bases: SpectraMixin, LatentVariableMixin, _BaseInspector
Inspector for PCA model diagnostics and visualization.
This class provides a unified interface for inspecting PCA models by creating multiple independent diagnostic plots. Instead of complex dashboards with many subplots, each method produces several separate figure windows that are easier to customize, save, and interact with individually.
The inspector provides convenience methods that create multiple independent plots:
inspect(): Creates all diagnostic plots (scores, loadings, explained variance)
inspect_spectra(): Creates raw and preprocessed spectra plots (if preprocessing exists)
- Parameters:
model (_BasePCA or Pipeline) – Fitted PCA model or pipeline ending with PCA
X_train (array-like of shape (n_samples, n_features)) – Training data
y_train (array-like of shape (n_samples,), optional) – Training labels/targets (for coloring plots)
X_test (array-like of shape (n_samples, n_features), optional) – Test data
y_test (array-like of shape (n_samples,), optional) – Test labels/targets
X_val (array-like of shape (n_samples, n_features), optional) – Validation data
y_val (array-like of shape (n_samples,), optional) – Validation labels/targets
x_axis (array-like of shape (n_features,), optional) – Feature names (e.g., wavenumbers for spectroscopy). If None, feature indices are used.
confidence (float, default=0.95) – Confidence level for outlier detection limits (Hotelling’s T² and Q residuals). Must be between 0 and 1. Used to calculate critical values for diagnostic plots.
- Variables:
model (_BasePCA or Pipeline) – The original model passed to the inspector
estimator (_BasePCA) – The PCA estimator
transformer (Pipeline or None) – Preprocessing pipeline before PCA (if model was a Pipeline)
n_components (int) – Number of principal components
n_features (int) – Number of features in original data
n_samples (dict) – Number of samples in each dataset
x_axis (ndarray) – Feature names/indices
confidence (float) – Confidence level for outlier detection
hotelling_t2_limit (float) – Critical value for Hotelling’s T² statistic (computed on training data)
q_residuals_limit (float) – Critical value for Q residuals statistic (computed on training data)
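The two limits above bound per-sample statistics. As a rough illustration of what those statistics measure, the sketch below computes Hotelling's T² and Q residuals with plain NumPy under the common textbook definitions: T² as the sum of squared scores scaled by each component's variance, and Q as the squared reconstruction error. The function name `t2_and_q` and the random demo data are assumptions made for illustration, and this is not necessarily how chemotools computes these values.

```python
import numpy as np

def t2_and_q(X, n_components):
    """Textbook Hotelling's T2 and Q residuals for a PCA fitted on X itself."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # mean-center the features
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                          # loadings (n_features, n_components)
    T = Xc @ P                                       # scores (n_samples, n_components)
    score_var = S[:n_components] ** 2 / (X.shape[0] - 1)  # variance of each score column
    t2 = np.sum(T ** 2 / score_var, axis=1)          # distance within the model plane
    residual = Xc - T @ P.T                          # part of X the model does not rebuild
    q = np.sum(residual ** 2, axis=1)                # squared distance to the model plane
    return t2, q

# Hypothetical demo data, not a real spectral dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 8))
t2, q = t2_and_q(X_demo, n_components=3)
```

Samples whose T² or Q exceeds the corresponding limit are the ones a distances plot would flag as potential outliers.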
Examples
>>> from sklearn.decomposition import PCA
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from chemotools.datasets import load_fermentation_train
>>> from chemotools.inspector import PCAInspector
>>>
>>> # Load data
>>> X, y = load_fermentation_train()
>>> # Create and fit pipeline
>>> pipeline = make_pipeline(
...     StandardScaler(),
...     PCA(n_components=5)
... )
>>> pipeline.fit(X)
>>>
>>> # Create inspector
>>> inspector = PCAInspector(pipeline, X, y, x_axis=X.columns)
>>>
>>> # Print summary table
>>> inspector.summary()
>>>
>>> # Create all diagnostic plots (multiple independent figures)
>>> inspector.inspect()  # Creates scores, loadings, and variance plots
>>>
>>> # Compare preprocessing (creates 2 independent figures)
>>> inspector.inspect_spectra()
>>>
>>> # Access underlying data for custom analysis
>>> scores = inspector.get_scores('train')
>>> loadings = inspector.get_loadings([0, 1, 2])
Notes
Memory usage scales linearly with dataset size. For very large datasets (>100,000 samples), consider subsampling for initial exploration.
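For the subsampling suggested above, one option is to draw a random subset of rows without replacement before constructing the inspector. The sketch uses only NumPy; the helper name `subsample_rows`, the default of 10,000 rows, and the toy arrays are illustrative assumptions and not part of chemotools.

```python
import numpy as np

def subsample_rows(X, y=None, n_keep=10_000, seed=0):
    """Return a random subset of rows (and matching labels) without replacement."""
    X = np.asarray(X)
    n_keep = min(n_keep, X.shape[0])                 # never request more rows than exist
    rng = np.random.default_rng(seed)
    # Sort the drawn indices so the subset keeps the original sample order
    idx = np.sort(rng.choice(X.shape[0], size=n_keep, replace=False))
    y_sub = None if y is None else np.asarray(y)[idx]
    return X[idx], y_sub

# Toy data standing in for a large spectral matrix
X_big = np.arange(600).reshape(200, 3)
y_big = np.arange(200)
X_small, y_small = subsample_rows(X_big, y_big, n_keep=50)
```

The reduced arrays can then be passed as X_train and y_train for a first exploratory look.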
Attributes
component_label
confidence – Return the confidence level for outlier detection.
estimator – Return the underlying estimator (PCA or PLS).
hotelling_t2_limit – Return the Hotelling's T² critical value at the specified confidence level.
model – Return the original model.
n_components – Return the number of latent variables/components.
n_features – Return the number of features in original data.
n_samples – Return the number of samples in each dataset.
q_residuals_limit – Return the Q residuals critical value at the specified confidence level.
transformer – Return the preprocessing transformer (if any).
x_axis – Return the feature names/indices.
- component_label: str = 'PC'
- summary() → PCASummary
Get a summary of the PCA model.
- Returns:
summary – Object containing model information
- Return type:
PCASummary
- get_latent_explained_variance() → ndarray | None
Hook for LatentVariableMixin - returns explained variance ratio.
- get_scores(dataset: str = 'train') → ndarray
Get PCA scores for specified dataset.
- Parameters:
dataset ({'train', 'test', 'val'}, default='train') – Which dataset to get scores for
- Returns:
scores – PCA scores
- Return type:
ndarray of shape (n_samples, n_components)
- get_explained_variance_ratio() → ndarray
Get explained variance ratio for all components.
- Returns:
explained_variance_ratio – Explained variance ratio
- Return type:
ndarray of shape (n_components,)
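A common use of these ratios is to find how many components are needed to reach a cumulative threshold, such as the 0.95 default of variance_threshold in inspect(). A minimal sketch follows; the helper name `components_for_threshold` is hypothetical, and the ratios are made-up numbers rather than output from a real model.

```python
from itertools import accumulate

def components_for_threshold(ratios, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches `threshold`, or None if it never does."""
    for n, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return n
    return None

example_ratios = [0.60, 0.25, 0.08, 0.04, 0.02]      # illustrative values only
n_needed = components_for_threshold(example_ratios)
```

The helper accepts any sequence of floats, so an array of explained variance ratios could be passed in directly.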
- inspect(dataset: str | Sequence[str] = 'train', components_scores: int | Tuple[int, int] | Sequence[int | Tuple[int, int]] | None = None, loadings_components: int | Sequence[int] | None = None, variance_threshold: float = 0.95, color_by: str | Dict[str, np.ndarray] | Sequence | np.ndarray | None = None, annotate_by: str | Dict[str, np.ndarray] | Sequence | np.ndarray | None = None, plot_config: InspectorPlotConfig | None = None, color_mode: Literal['continuous', 'categorical'] = 'continuous', **kwargs) → Dict[str, Figure]
Create all diagnostic plots for the PCA model.
- Parameters:
dataset (str or sequence of str, default='train') – Dataset(s) to visualize. Can be ‘train’, ‘test’, ‘val’, or a list.
components_scores (int, tuple, or sequence, optional) –
Components to plot for scores.
Int: Creates one 1D scatter plot (e.g., 0 for PC1 vs sample index)
Single tuple (x, y): Creates one 2D scatter plot (e.g., (0, 1) for PC1 vs PC2)
Sequence: Creates multiple plots (e.g., ((0, 1), (1, 2), 0) or [0, 1, (0, 1)])
loadings_components (int, sequence of int, or None, optional) –
Which components to show in loadings plot. If None (default), automatically selects all available components:
1 component: 0
2+ components: [0, 1, …, n_components-1] (all components)
variance_threshold (float, default=0.95) – Threshold line for explained variance plot
color_by (str or dict, optional) –
Coloring specification:
‘y’: Color by y values (if available)
‘sample_index’: Color by sample index
dict: Map dataset names to color arrays
None: Color by dataset (for multi-dataset plots) or ‘y’ (for single dataset)
annotate_by (str or dict, optional) –
Annotations for score plot points. Can be:
‘sample_index’: Annotate with sample indices (0, 1, 2, …)
‘y’: Annotate with y values (only for single dataset)
dict: Dictionary mapping dataset names to annotation arrays, e.g., {'train': ['A', 'B', 'C'], 'test': ['D', 'E']}
If None (default), no annotations are added.
plot_config (InspectorPlotConfig, optional) – Configuration object for plot sizes and styles. If None, defaults are used.
color_mode (Literal["continuous", "categorical"], default="continuous") – Mode for coloring points.
**kwargs – Optional keyword arguments to override specific fields in plot_config (e.g., scores_figsize=(8, 8)).
- Returns:
figures – Dictionary containing all created figures, with keys:
‘scores_1’, ‘scores_2’, …: Combined scores plots (95% confidence ellipses)
‘scores_1_train’, ‘scores_1_test’, …: Dataset-specific copies of each scores plot (only when multiple datasets are provided); each plot uses a dedicated dataset colour
‘loadings’: Loadings plot
‘variance’: Explained variance plot
‘distances’: Diagnostic distances plot (Hotelling’s T² vs Q residuals)
‘raw_spectra’, ‘preprocessed_spectra’: Spectra plots (if preprocessing exists)
Combined scores plots render all requested datasets on shared axes, coloured by dataset. The number of ‘scores_N*’ entries depends on the components_scores parameter.
- Return type:
Dict[str, Figure]
Examples
>>> inspector = PCAInspector(pca, X_train, y_train)
>>> # Default: 2 scores plots + loadings + variance + spectra (if preprocessing exists)
>>> figs = inspector.inspect()
>>> # Multiple datasets for comparison
>>> inspector.X_test = X_test
>>> inspector.y_test = y_test
>>> figs = inspector.inspect(dataset=["train", "test"])
>>> # Access individual figures
>>> figs["scores_1_train"].savefig("scores_1_train.png")
>>> figs["scores_1_test"].savefig("scores_1_test.png")
>>> # Single 2D scores plot (PC1 vs PC2)
>>> figs = inspector.inspect(components_scores=(0, 1))
>>> # Single 1D scores plot (PC1 vs sample index or y)
>>> figs = inspector.inspect(components_scores=0)
>>> # Three plots: 2D, 2D, and 1D
>>> figs = inspector.inspect(components_scores=((0, 1), (1, 2), 2))
>>> # Mix of 1D and 2D plots
>>> figs = inspector.inspect(components_scores=[0, 1, (0, 1)])
>>> # Save individual plots
>>> figs['scores_1'].savefig('scores_pc1_pc2.png')
>>> figs['loadings'].savefig('loadings.png')
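The accepted shapes of components_scores can be easy to confuse. The following sketch shows one way such a specification could be normalised into a list of 1D and 2D plot requests, following the rules given in the parameter description. `normalize_components` is a hypothetical helper written for illustration and is not the library's implementation.

```python
def normalize_components(spec):
    """Turn an int, an (x, y) tuple, or a sequence of either into a list of tuples.

    A 1-tuple stands for a 1D plot (component vs sample index);
    a 2-tuple stands for a 2D plot (component x vs component y).
    """
    if isinstance(spec, int):
        return [(spec,)]                              # single 1D plot
    if (isinstance(spec, tuple) and len(spec) == 2
            and all(isinstance(c, int) for c in spec)):
        return [spec]                                 # single 2D plot
    plots = []
    for item in spec:                                 # sequence: one plot per item
        plots.append((item,) if isinstance(item, int) else tuple(item))
    return plots

print(normalize_components(0))                        # [(0,)]
print(normalize_components((0, 1)))                   # [(0, 1)]
print(normalize_components([0, 1, (0, 1)]))           # [(0,), (1,), (0, 1)]
print(normalize_components(((0, 1), (1, 2), 2)))      # [(0, 1), (1, 2), (2,)]
```

Under this reading, a bare tuple of two ints is a single 2D plot, while a list of two ints requests two separate 1D plots.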