{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Glucose monitoring with PLS**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **NOTE:** This document is a Jupyter notebook. You can download the source file and run it in your Jupyter environment!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Introduction**\n", "This dataset provides information on lignocellulosic ethanol fermentation through spectroscopic data collected using attenuated total reflectance mid-infrared (ATR-MIR) spectroscopy, along with reference measurements from high-performance liquid chromatography (HPLC) for validation.\n", "\n", "The project contains two datasets:\n", "\n", "- **Training Dataset:** Contains spectral data with corresponding HPLC measurements used to train the partial least squares (PLS) models.\n", "- **Testing Dataset:** Includes a time series of spectra collected during fermentation, plus off-line HPLC measurements.\n", "\n", "For more information about these datasets and how they can be used to monitor fermentation, please see our article: \"Transforming Data to Information: A Parallel Hybrid Model for Real-Time State Estimation in Lignocellulosic Ethanol Fermentation.\" (Note that the data in the article differs from the data provided here.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Objective**\n", "In this exercise, you will build a PLS model to monitor glucose concentration in real time using ATR-MIR spectroscopy. You'll train the model using a small training set of spiked spectra, then test its performance with spectra from an actual fermentation process." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Before starting**\n", "Before we start, make sure the following dependencies are installed:\n", "\n", "```\n", "chemotools\n", "matplotlib\n", "numpy\n", "pandas\n", "scikit-learn\n", "```\n", "\n", "You can install them using:\n", "\n", "```bash\n", "pip install chemotools\n", "pip install matplotlib\n", "pip install numpy\n", "pip install pandas\n", "pip install scikit-learn\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Loading the training data**\n", "You can load the training data from the ```chemotools.datasets``` module with the ```load_fermentation_train()``` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from chemotools.datasets import load_fermentation_train\n", "\n", "spectra, hplc = load_fermentation_train()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```load_fermentation_train()``` function returns two ```pandas.DataFrame``` objects:\n", "\n", "- ```spectra```: This dataframe contains spectral data, with columns representing wavenumbers and rows representing samples.\n", "\n", "- ```hplc```: Here, you'll find reference HPLC measurements for the glucose concentration (in g/L) of each sample, stored in a single column labeled ```glucose```.\n", "\n", "> **NOTE:** If you are interested in working with ```polars.DataFrame``` you can simply use ```load_fermentation_train(set_output=\"polars\")```. Note that if you choose to work with ```polars.DataFrame``` the wavenumbers are given in the column names as ```str``` and not as ```float```. This is because ```polars``` does not support column names with types other than ```str```. To recover the wavenumbers as ```float``` from a ```polars.DataFrame```, convert the column names explicitly, for example with ```np.array(df.columns, dtype=np.float64)``` (in ```polars```, ```df.columns``` is a plain Python list of strings)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Exploring the training data**\n", "Before starting with data modeling, it's important to get familiar with your data. Let's start by answering some basic questions:\n", "\n", "- *How many samples are there?*\n", "- *How many wavenumbers are available?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of samples: 21\n", "Number of wavenumbers: 1047\n" ] } ], "source": [ "print(f\"Number of samples: {spectra.shape[0]}\")\n", "print(f\"Number of wavenumbers: {spectra.shape[1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the basics down, let's take a closer look at the data.\n", "\n", "For the spectral data, you can use the ```pandas.DataFrame.head()``` method to examine the first 5 rows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | 428.0 | 429.0 | 431.0 | 432.0 | 434.0 | 436.0 | 437.0 | 439.0 | 440.0 | 442.0 | ... | 1821.0 | 1823.0 | 1824.0 | 1825.0 | 1826.0 | 1828.0 | 1829.0 | 1830.0 | 1831.0 | 1833.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.493506 | -0.383721 | 0.356846 | 0.205714 | 0.217082 | 0.740331 | 0.581749 | 0.106719 | 0.507973 | 0.298643 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.108225 | 0.488372 | -0.037344 | 0.448571 | 0.338078 | 0.632597 | 0.368821 | 0.462451 | 0.530752 | 0.221719 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.095238 | 0.790698 | 0.174274 | 0.314286 | 0.106762 | 0.560773 | 0.182510 | 0.482213 | 0.482916 | 0.341629 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.666667 | 0.197674 | 0.352697 | 0.385714 | 0.405694 | 0.508287 | 0.463878 | 0.430830 | 0.455581 | 0.527149 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.532468 | -0.255814 | 0.078838 | 0.057143 | 0.238434 | 0.997238 | 0.399240 | 0.201581 | 0.533030 | 0.246606 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 1047 columns

|   | glucose |
|---|---|
| count | 21.000000 |
| mean | 19.063895 |
| std | 12.431570 |
| min | 0.000000 |
| 25% | 9.057189 |
| 50% | 18.395220 |
| 75% | 29.135105 |
| max | 38.053004 |
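
A summary with these rows (count, mean, std, min, quartiles, max) is the shape of output that `pandas.DataFrame.describe()` returns. The cell that produced the table is not shown here, so the following is only a minimal sketch under that assumption. The `hplc_demo` dataframe and its five values are hypothetical stand-ins, not the real HPLC measurements.

```python
import pandas as pd

# Hypothetical stand-in for the `hplc` dataframe: one `glucose` column in g/L.
# These five values are made up for illustration only.
hplc_demo = pd.DataFrame({"glucose": [0.0, 9.0, 18.0, 29.0, 38.0]})

# describe() reports count, mean, std, min, the quartiles and max per column.
summary = hplc_demo.describe()
print(summary)
```

With the real data you would call the same method on the `hplc` dataframe returned by `load_fermentation_train()`.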
PLSRegression(n_components=6, scale=False)