Working with DataFrames

For the pandas.DataFrame and polars.DataFrame lovers: by default, all scikit-learn and chemotools transformers output a numpy.ndarray. However, chemotools preprocessing methods can now be configured to produce a pandas.DataFrame or a polars.DataFrame as output instead. This is possible because chemotools implements the set_output() API from scikit-learn (>= 1.2.2 for pandas and >= 1.4.0 for polars); see the scikit-learn documentation for details. The same API that scikit-learn preprocessing methods such as StandardScaler() provide is therefore available for the chemotools transformers.

Note

From version 0.1.3 onwards, set_output() is available for all chemotools transformers!

The two examples below show how to use this API:

Example 1: Using the set_output() API with a single preprocessing method

1. Load your spectral data as a pandas.DataFrame

First, load your spectral data. In this case, we assume a file called spectra.csv where each row represents a spectrum and each column represents a wavenumber.

import pandas as pd
from chemotools.baseline import AirPls

# Load your data as a pandas DataFrame
spectra = pd.read_csv('data/spectra.csv', index_col=0)

The spectra variable is a pandas.DataFrame whose index holds the sample names and whose columns hold the wavenumbers.
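If no spectra.csv file is at hand, a small synthetic DataFrame with the same layout can stand in for it. The sketch below is only an illustration: the sample names, the wavenumber axis and the random intensities are all made up, and the column labels are stored as strings to mimic what pd.read_csv would typically produce from a header row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for spectra.csv: 5 samples x 100 wavenumbers.
rng = np.random.default_rng(seed=0)
wavenumbers = np.linspace(400, 1800, 100)           # made-up axis
sample_names = [f"sample_{i}" for i in range(5)]    # made-up sample names

spectra = pd.DataFrame(
    rng.random((5, 100)),                           # random intensities
    index=sample_names,                             # rows: one spectrum each
    columns=[f"{w:.1f}" for w in wavenumbers],      # columns: wavenumbers
)

print(spectra.shape)
```

Any DataFrame with one spectrum per row can be used in the same way in the steps that follow.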

2. Create a chemotools preprocessing object and set the output to pandas

Next, we create the AirPls object and set the output to pandas.

# Create an AirPls object and set the output to pandas
airpls = AirPls().set_output(transform='pandas')

The set_output() method accepts the following arguments:

  • transform: The output format. Can be 'default' (outputs a numpy.ndarray), 'pandas' or 'polars'.

Hint

To set the output to polars instead, pass transform='polars' to the set_output() method: AirPls().set_output(transform='polars').

3. Fit and transform the spectra

# Fit and transform the spectra
spectra_airpls = airpls.fit_transform(spectra)

The output of the fit_transform() method is now a pandas.DataFrame object.

Hint

Note that, by default, the index and the columns of the input data are not carried over to the output, so the spectra_airpls DataFrame has default index and column labels.
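If the original labels are needed downstream, one option is to copy them back by hand. The sketch below assumes the transformer keeps the row order and the number of columns unchanged, as a baseline correction is expected to. It uses made-up DataFrames as stand-ins for spectra and spectra_airpls so that it runs without chemotools.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the input spectra (labelled rows and columns).
spectra = pd.DataFrame(
    np.arange(6.0).reshape(2, 3),
    index=["sample_a", "sample_b"],
    columns=["1000", "1002", "1004"],
)

# Made-up stand-in for the transformer output: same shape, default labels.
spectra_airpls = pd.DataFrame(spectra.to_numpy() - 1.0)

# Copy the labels back; only valid if shape and ordering are unchanged.
assert spectra_airpls.shape == spectra.shape
spectra_airpls.index = spectra.index
spectra_airpls.columns = spectra.columns

print(spectra_airpls)
```

The shape check guards against relabelling the output of a transformer that drops or reorders columns.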

Example 2: Using the set_output() API with a pipeline

Similarly, the set_output() API can be used with pipelines. The following code shows how to create a pipeline that performs:

  • Multiplicative scatter correction

  • Standard scaling

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from chemotools.scatter import MultiplicativeScatterCorrection

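# Build the pipeline and set its output to pandas
pipeline = make_pipeline(MultiplicativeScatterCorrection(), StandardScaler())
pipeline.set_output(transform='pandas')

# Fit and transform the spectra (the DataFrame loaded in Example 1)
output = pipeline.fit_transform(spectra)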

Hint

To set the output to polars instead, pass transform='polars' to the set_output() method: pipeline.set_output(transform='polars').