**Preprocessing pipelines** =========================== Pipelines are becoming increasingly popular in machine learning workflows. In essence, pipelines are a sequence of connected data processing steps, where the output of one step is the input of the next. They are very useful for: - automating complex workflows, - improving efficiency, - reducing errors in data processing and analysis and - simplifying model persistence and deployment. All preprocessing techniques in ``chemotools`` are compatible with ``scikit-learn`` and can be used in pipelines. As an example, we will study the case where we would like to apply the following preprocessing techniques to our spectra: - Range Cut - Linear Correction - Savitzky-Golay derivate - Mean Centering (Standard Scaler) - PLS regression **Traditional flow** -------------------------- In a traditional flow, would apply each preprocessing technique individually to the spectra as shown in the image below: .. image:: ./_figures/pipelines_no_pipeline.png :alt: Traditional workflow :align: center :width: 300 The code to perform this workflow would look like this: .. code-block:: python from chemotools.feature_selection import RangeCut from chemotools.baseline import LinearCorrection from chemotools.derivative import SavitzkyGolay from sklearn.cross_decomposition import PLSRegression from sklearn.preprocessing import StandardScaler # Range Cut # Define the Range Cut range_cut = RangeCut(start=950, end=1550, wavenumbers=wavenumbers) # Fit and apply Ranve Cut spectra_cut = range_cut.fit_transform(spectra) # Linear Correction # Define the Linear Correction linear_correction = LinearCorrection() # Fit and apply Linear Correction spectra_corrected = linear_correction.fit_transform(spectra_cut) # Savitzky-Golay # Define the Savitzky-Golay savitzky_golay = SavitzkyGolay(window_size=21, polynomial_order=2, derivate_order=1) # Fit and apply Savitzky-Golay spectra_derivate = savitzky_golay.fit_transform(spectra_corrected) # Mean Cetering (Standard Scaler) # Define the Standard Scaler standard_scaler = StandardScaler(with_mean=True, with_std=False) # Fit and apply Standard Scaler spectra_centered = standard_scaler.fit_transform(spectra_derivate) # PLS regression # Define the PLS regression pls = PLSRegression(n_components=2, scale=False) # Fit the model pls.fit(spectra_centered, reference) # Apply model to make predictions prediction = pls.predict(spectra_centered) This is a tedious and error-prone workflow, especially when the number of preprocessing steps increases. In addition, persisting the model and deploying it to a production environment is not straightforward, as each preprocessing step needs to be persisted and deployed individually. **Pipeline flow** -------------------------- In a pipeline flow, we can combine all preprocessing steps into a single object. This simplifies the workflow and reduces the risk of errors. The figure below shows the same workflow as above, but using a pipeline: .. image:: ./_figures/pipelines_pipeline.png :alt: Pipeline workflow :align: center :width: 800 An outline of the code to perform the pipeline is shown in the image below: .. image:: ./_figures/pipelines_code.png :alt: Pipeline code :align: center :width: 800 The code to perform the pipeline is shown below: .. code-block:: python from chemotools.feature_selection import RangeCut from chemotools.baseline import LinearCorrection from chemotools.derivative import SavitzkyGolay from sklearn.cross_decomposition import PLSRegression from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler # Define the pipeline pipeline = make_pipeline( RangeCut(start=950, end=1550, wavenumbers=wavenumbers), LinearCorrection(), SavitzkyGolay(window_size=21, polynomial_order=2, derivate_order=1), StandardScaler(with_mean=True, with_std=False), PLSRegression(n_components=2, scale=False) ) # Fit the model pipeline.fit(spectra, reference) # Apply model to make predictions prediction = pipeline.predict(spectra) It is now possible to visualize the pipeline and the different preprocessing steps that are applied to the spectra. .. raw:: html :file: _figures/pipelines_pipeline_visualization.html .. note:: Notice that in the traditional workflow, the different preprocessing objects had to be persisted individually. In the pipeline workflow, the entire pipeline can be persisted and deployed to a production environment. See the `Persisting your models`_ section for more information. .. _Persisting your models: #persisting-your-models