.. _dynamic_transformers:

Dynamic transformers
====================

In conventional chemometrics practice, preprocessing is treated as a
**static** operation: a correction is determined from a calibration set and
then applied uniformly to every new spectrum. The preprocessing step holds
everything it needs — a stored mean spectrum, a fitted baseline, a set of
PLS loadings — and the data alone is sufficient at prediction time.

This works well for a large class of problems, but it breaks down when the
correction you need to apply depends on **information that is only available
at inference time**. In practice this information falls into two categories:

* **Measurement metadata** — instrument-level quantities recorded alongside
  the spectrum but not part of it: the x-axis calibration, laser power,
  integration time, detector temperature.
* **Process data** — sample- or batch-level context that changes between
  runs or samples: a fresh background measurement, a dilution factor, a reference
  standard collected just before the sample or other process parameters such as 
  temperature or humidity.

Some concrete examples from spectroscopy:

* The **x-axis grid** of the instrument drifted between calibration and
  deployment. Each new spectrum arrives on a slightly different wavenumber
  array.
* You want to **normalize by laser power** or integration time, values that
  are logged per measurement but are not part of the spectrum itself.
* A **background spectrum** is measured fresh before each sample batch and
  must be subtracted at inference, not at fit time.

``chemotools`` addresses this with a set of *dynamic transformers* — estimators
that accept additional **per-call parameters** alongside ``X`` at transform
time, delivered through `scikit-learn's metadata routing framework
<https://scikit-learn.org/stable/metadata_routing.html>`_.

.. list-table:: Dynamic transformers in ``chemotools``
   :widths: 35 30 35
   :header-rows: 1

   * - Transformer
     - Metadata argument
     - Use case
   * - :class:`~chemotools.adaptation.XAxisInterpolator`
     - ``x_axis``
     - Align spectra measured on different wavenumber grids
   * - ``ScaleBy`` *(coming soon)*
     - ``scale``
     - Normalize by laser power, integration time, temperature factor
   * - ``SubtractBackground`` *(coming soon)*
     - ``reference``
     - Subtract a freshly measured background at inference

Example: aligning spectra to a common grid
-------------------------------------------

In Raman spectroscopy, each instrument has a slightly different
pixel-to-wavenumber calibration. Spectra from different instruments share the
same chemistry but arrive on different x-axis grids — so they cannot be
stacked into a matrix until they are resampled onto a common one.

Five simulated spectra, each with a Gaussian peak at 1100 cm⁻¹ but on a
slightly different grid, illustrate the problem.

**Setting up the data**

.. code-block:: python

    import numpy as np
    import sklearn
    import matplotlib.pyplot as plt
    from chemotools.adaptation import XAxisInterpolator

    sklearn.set_config(enable_metadata_routing=True)  # explained in "How metadata routing works" below

    N       = 1000                   # pixels per spectrum
    sigma   = 20                     # peak width (pixels)
    offsets = [-10, -5, 0, 5, 10]   # pixel-grid offset per instrument

    raw_spectra, raw_x_axes = [], []

    for offset in offsets:
        peak = N // 2 + offset
        y = np.exp(-0.5 * ((np.arange(N) - peak) / sigma) ** 2)
        x = np.arange(N) + (1100 - peak)      # x[peak] == 1100 wn
        raw_spectra.append(y)
        raw_x_axes.append(x)

    raw_spectra = np.array(raw_spectra)   # shape (5, 1000)
    raw_x_axes  = np.array(raw_x_axes)    # shape (5, 1000)

**Step 1 — what the instrument gives you**

Each spectrum is delivered as an array of intensity values indexed by pixel
number. When you plot them on a common pixel axis, the peaks appear at
different positions — each instrument's zero point is slightly different.

.. code-block:: python

    zoom = 40

    fig, ax = plt.subplots(figsize=(6, 4))
    for y in raw_spectra:
        ax.plot(y)
    ax.set_xlim(N // 2 - zoom, N // 2 + zoom)
    ax.set(title="Raw spectra — pixel index", xlabel="Pixel index", ylabel="Intensity")
    plt.tight_layout()
    plt.show()

.. image:: ../_static/images/explore/dynamic_transformers/raw_pixel.png
   :alt: Raw spectra indexed by pixel number — peaks at different positions
   :align: center
   :width: 500

|

Peaks land at different pixel positions — the grids are misaligned. If you
stacked these rows into a matrix as-is and fed it to a PLS model, column *k*
would represent a different wavenumber for each instrument, so every learned
regression coefficient would point at the wrong feature.

**Step 2 — plot against wavenumber**

Each spectrum comes with its own wavenumber axis. Plotting against it shows
the peaks coincide at 1100 cm⁻¹, but the arrays are still all different.

.. code-block:: python

    fig, ax = plt.subplots(figsize=(6, 4))
    for y, x in zip(raw_spectra, raw_x_axes):
        ax.plot(x, y)
    ax.axvline(1100, color="k", linestyle="--", linewidth=1)
    ax.set_xlim(1100 - zoom, 1100 + zoom)
    ax.set(title="Raw spectra — wavenumber axis", xlabel="Wavenumber (cm⁻¹)", ylabel="Intensity")
    plt.tight_layout()
    plt.show()

.. image:: ../_static/images/explore/dynamic_transformers/raw_wavenumber.png
   :alt: Raw spectra on their own wavenumber axes — peaks align at 1100 cm⁻¹
   :align: center
   :width: 500

|

**Step 3 — interpolate onto a common grid**

:class:`~chemotools.adaptation.XAxisInterpolator` takes a ``common_x_axis``
defined once at construction time and, at every ``transform`` call, resamples
each row from its own ``x_axis`` onto that shared grid. The per-spectrum
axis is passed as metadata — not baked into the transformer — so it can
change freely between calls.

.. code-block:: python

    x_common = np.linspace(650, 1550, N)

    interpolator = (
        XAxisInterpolator(
            common_x_axis=x_common, method="linear", left=0, right=0
        )  # left/right fill values outside the grid
        .set_fit_request(x_axis=True)
        .set_transform_request(x_axis=True)
    )

    aligned_spectra = interpolator.fit_transform(raw_spectra, x_axis=raw_x_axes)

.. code-block:: python

    fig, ax = plt.subplots(figsize=(6, 4))
    for y in aligned_spectra:
        ax.plot(y)
    ax.set(
        title="Aligned spectra — common-axis index",
        xlabel="Common-axis index",
        ylabel="Intensity",
    )
    plt.tight_layout()
    ax.set_xlim(420, 580)
    plt.show()

.. image:: ../_static/images/explore/dynamic_transformers/aligned.png
   :alt: Aligned spectra on the common grid — peaks overlap perfectly
   :align: center
   :width: 500

|

All five peaks now sit at the same column index. The matrix ``aligned_spectra``
can be fed directly into any subsequent step or model.

How metadata routing works
---------------------------

The two method calls on the interpolator — ``set_fit_request(x_axis=True)``
and ``set_transform_request(x_axis=True)`` — register ``x_axis`` as a
metadata argument for the ``fit`` and ``transform`` phases respectively.
When you pass ``x_axis`` to a pipeline call, scikit-learn delivers it only
to the step that declared it; every other step is unaffected.

``set_fit_request`` covers ``fit`` and ``fit_transform``;
``set_transform_request`` covers ``transform``. Both are declared in the
example because a :class:`~sklearn.pipeline.Pipeline` calling
``fit_transform`` routes metadata through both phases.

Using it inside a Pipeline
---------------------------

The pipeline below continues from the same variables defined above:

.. code-block:: python

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from chemotools.scatter import MultiplicativeScatterCorrection

    pipe = Pipeline(
        [
            (
                "interpolate",
                XAxisInterpolator(
                    common_x_axis=x_common, method="linear", left=0, right=0
                )  # left/right fill values outside the grid
                .set_fit_request(x_axis=True)
                .set_transform_request(x_axis=True),
            ),
            ("msc", MultiplicativeScatterCorrection()),
            ("scaler", StandardScaler()),
        ]
    )

    # x_axis is routed to "interpolate" only; the other steps never see it
    X_preprocessed = pipe.fit_transform(raw_spectra, x_axis=raw_x_axes)

.. note::

   Only the step that declared ``set_transform_request(x_axis=True)`` receives
   ``x_axis``. The other steps in the pipeline are unaffected.

Shared vs. per-sample grids
-----------------------------

Not every batch comes from multiple instruments. When all spectra in a call
share the same grid, you can pass a single 1-D array instead of a matrix.
``x_axis`` accepts two shapes:

* **Shape** ``(n_features,)`` — the same grid for every spectrum in the
  call. Use this when all spectra in a batch come from the same instrument
  (e.g., a single measurement session where the grid is fixed).

* **Shape** ``(n_samples, n_features)`` — one grid per row. Use this when
  combining spectra from multiple instruments in one batch (as in the
  example above, where each of the five spectra has its own offset).

.. code-block:: python

    # Shared grid — all spectra measured on the same instrument axis
    x_shared = raw_x_axes[0]                          # shape (1000,)
    X_aligned_shared = interpolator.transform(raw_spectra, x_axis=x_shared)

    # Per-sample grids — each spectrum has its own axis
    X_aligned_per = interpolator.transform(raw_spectra, x_axis=raw_x_axes)  # shape (5, 1000)

Interpolation methods
----------------------

:class:`~chemotools.adaptation.XAxisInterpolator` supports three methods,
selectable via the ``method`` parameter:

.. list-table::
   :widths: 15 55 30
   :header-rows: 1

   * - ``method``
     - Description
     - When to use
   * - ``"linear"``
     - Piecewise linear interpolation.
     - Fast; best when spectra are smooth and grids are closely spaced.
   * - ``"cubic"``
     - Natural cubic spline (via :func:`scipy.interpolate.CubicSpline`).
     - Good all-round choice; smooth and accurate.
   * - ``"pchip"``
     - Piecewise cubic Hermite (via :func:`scipy.interpolate.PchipInterpolator`).
     - Preserves monotonicity; avoids overshooting near peaks.

Points outside the input grid are filled with ``left`` / ``right`` (both
default to :data:`numpy.nan`). You can change these to ``0.0`` or any other
sentinel value if your downstream steps cannot handle ``NaN``.

That is the core idea behind dynamic transformers: pipelines that stay
self-contained and reusable even when correction parameters are not known
until prediction time.

.. seealso::

   :doc:`XAxisInterpolator <../methods/generated/chemotools.adaptation.XAxisInterpolator>` — full API reference.