DModX

class chemotools.outliers.DModX(model: _BasePCA | _PLS | Pipeline, confidence: float = 0.95, mean_centered: bool = True)

Bases: _ModelResidualsBase

Calculate Distance to Model (DModX) statistics.

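DModX measures the distance between an observation and the model plane in X-space, that is, the part of the observation that the PCA/PLS model fails to reconstruct. Observations with a large DModX are poorly described by the model and are candidate outliers.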
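As a rough illustration of "distance to the model plane", the sketch below fits a PCA plane with plain NumPy and measures how far new observations lie from it. The helper names `fit_pca_plane` and `residual_distance` and the synthetic data are hypothetical, and the code shows the plain residual norm, not necessarily the exact computation performed by chemotools.

```python
import numpy as np

def fit_pca_plane(X, n_components):
    """Return the training mean and the loadings of the first principal components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T  # loadings: (n_features, n_components)

def residual_distance(X, mean, loadings):
    """Euclidean distance of each row of X to the fitted model plane."""
    Xc = np.asarray(X, dtype=float) - mean
    # Residual = centered data minus its projection onto the plane
    E = Xc - Xc @ loadings @ loadings.T
    return np.linalg.norm(E, axis=1)

rng = np.random.default_rng(0)
# Training data lying close to a known 2-D plane in 5-D space
true_loadings = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 1.0, 0.0]])
X_train = rng.normal(size=(50, 2)) @ true_loadings + 0.01 * rng.normal(size=(50, 5))
mean, loadings = fit_pca_plane(X_train, n_components=2)

# One new point on the plane and one pushed 3 units off it
X_new = np.array([[2.0, 2.0, -1.0, -1.0, 0.0],
                  [2.0, 2.0, -1.0, -1.0, 3.0]])
dist = residual_distance(X_new, mean, loadings)
```

The first entry of `dist` stays near the noise level, while the second is close to 3, the size of the off-plane displacement.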

Parameters:
  • model (Union[ModelType, Pipeline]) – A fitted PCA/PLS model or Pipeline ending with such a model

  • confidence (float, default=0.95) – Confidence level for statistical calculations (between 0 and 1)

  • mean_centered (bool, default=True) – Indicates if the input data was mean-centered before modeling

Variables:
  • estimator (ModelType) – The fitted model of type _BasePCA or _PLS

  • transformer (Optional[Pipeline]) – Preprocessing steps before the model

  • n_features_in (int) – Number of features in the input data

  • n_components (int) – Number of components in the model

  • n_samples (int) – Number of samples used to train the model

  • critical_value (float) – The calculated critical value for outlier detection

  • train_sse (float) – The training sum of squared errors (SSE) for the model normalized by degrees of freedom

  • A0 (int) – Adjustment factor for degrees of freedom based on mean centering
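To make the roles of `train_sse` and `A0` more concrete, here is a minimal sketch of the textbook DModX normalization. It assumes that `A0` is 1 for mean-centered data and 0 otherwise, that the pooled training variance divides the SSE by (N - A - A0)(K - A), and that each sample's variance divides its own SSE by (K - A), where N is the number of samples, K the number of features, and A the number of components. The function `normalized_dmodx` is hypothetical, and chemotools may differ in detail.

```python
import numpy as np

def normalized_dmodx(E_train, E_new, n_components, mean_centered=True):
    """Textbook normalized DModX from residual matrices (illustrative only)."""
    E_train = np.asarray(E_train, dtype=float)
    E_new = np.asarray(E_new, dtype=float)
    n_samples, n_features = E_train.shape
    a0 = 1 if mean_centered else 0  # assumed meaning of A0
    dof_features = n_features - n_components
    dof_samples = n_samples - n_components - a0
    # Pooled residual variance of the training set
    pooled_var = np.sum(E_train ** 2) / (dof_samples * dof_features)
    # Residual variance of each new observation
    sample_var = np.sum(E_new ** 2, axis=1) / dof_features
    return np.sqrt(sample_var / pooled_var)

# Toy residuals: 5 training samples, 4 features, 2 components
E_train = np.ones((5, 4))
E_new = np.array([[1.0, 1.0, 1.0, 1.0],
                  [2.0, 2.0, 2.0, 2.0]])
values = normalized_dmodx(E_train, E_new, n_components=2)
```

Under these assumptions, doubling every residual of an observation doubles its normalized DModX.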

References

[1] Max Bylesjö, Mattias Rantalainen, Olivier Cloarec, Jeremy K. Nicholson, Elaine Holmes, Johan Trygg. “OPLS discriminant analysis: combining the strengths of PLS-DA and SIMCA classification.” Journal of Chemometrics 20 (8-10), 341-351 (2006).

Examples

>>> from chemotools.datasets import load_fermentation_train
>>> from chemotools.outliers import DModX
>>> from sklearn.decomposition import PCA
>>> # Load sample data
>>> X, _ = load_fermentation_train()
>>> # Instantiate the PCA model
>>> pca = PCA(n_components=3).fit(X)
>>> # Initialize DModX with the fitted PCA model
>>> dmodx = DModX(model=pca, confidence=0.95, mean_centered=True)
>>> # Fit on the training data; fit returns the estimator itself
>>> dmodx = dmodx.fit(X)
>>> # Predict outliers in the dataset
>>> outliers = dmodx.predict(X)
>>> # Calculate DModX residuals
>>> residuals = dmodx.predict_residuals(X)

fit(X: ndarray, y: ndarray | None = None) → DModX

Fit the model and compute training residual variance.

Parameters:
  • X (np.ndarray of shape (n_samples, n_features)) – The input data used to fit the model.

  • y (None) – Ignored; present for scikit-learn API consistency.

Returns:

self – Fitted estimator with computed training residuals and critical value.

Return type:

DModX

predict(X: ndarray, y: ndarray | None = None) → ndarray

Identify outliers in the input data.

Parameters:
  • X (np.ndarray of shape (n_samples, n_features)) – The input data to predict outliers for.

  • y (None) – Ignored; present for scikit-learn API consistency.

Returns:

outliers – Array indicating outliers (-1) and inliers (1).

Return type:

np.ndarray of shape (n_samples,)
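The -1/1 labels follow the convention used by scikit-learn outlier detectors. A minimal sketch of how such labels can be derived, assuming a sample is flagged when its statistic exceeds the critical value (the helper `label_outliers` and the numbers are hypothetical):

```python
import numpy as np

def label_outliers(statistics, critical_value):
    """Map per-sample statistics to -1 (outlier) or 1 (inlier)."""
    statistics = np.asarray(statistics, dtype=float)
    # Assumption: values strictly above the threshold count as outliers
    return np.where(statistics > critical_value, -1, 1)

labels = label_outliers([0.8, 1.1, 2.7], critical_value=1.5)
# labels holds 1, 1, -1: only the third sample exceeds the threshold
```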

predict_residuals(X: ndarray, y: ndarray | None = None, validate: bool = True) → ndarray

Calculate normalized DModX statistics for input data.

Parameters:
  • X (np.ndarray of shape (n_samples, n_features)) – The input data to calculate DModX statistics for.

  • y (None) – Ignored.

  • validate (bool, default=True) – If True, validate the input data.

Returns:

dmodx_values – The normalized DModX statistics for each sample.

Return type:

np.ndarray of shape (n_samples,)