**Working with DataFrames**
===========================

For the ``pandas.DataFrame`` and ``polars.DataFrame`` lovers.

By default, all ``scikit-learn`` and ``chemotools`` transformers output a ``numpy.ndarray``. However, you can now configure your ``chemotools`` preprocessing methods to return either a ``pandas.DataFrame`` or a ``polars.DataFrame`` instead. This is possible because ``chemotools`` implements the ``set_output()`` API from ``scikit-learn`` (>= 1.2.2 for ``pandas`` and >= 1.4.0 for ``polars``) (see the ``scikit-learn`` documentation). The same API that is implemented in ``scikit-learn`` preprocessing methods such as ``StandardScaler()`` is now available for the ``chemotools`` transformers.

.. note::
    From version 0.1.3, ``set_output()`` is available for all ``chemotools`` transformers!

Below are two examples of how to use this API:

Example 1: Using the ``set_output()`` API with a single preprocessing method
-----------------------------------------------------------------------------

1. Load your spectral data as a ``pandas.DataFrame``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First load your spectral data. In this case, we assume a file called ``spectra.csv`` in which each row represents a spectrum and each column represents a wavenumber.

.. code-block:: python

    import pandas as pd
    from chemotools.baseline import AirPls

    # Load your data as a pandas DataFrame; the first column becomes the index
    spectra = pd.read_csv('data/spectra.csv', index_col=0)

The ``spectra`` variable is a ``pandas.DataFrame`` object whose index holds the sample names and whose columns hold the wavenumbers.

2. Create a ``chemotools`` preprocessing object and set the output to ``pandas``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, we create the ``AirPls`` object and set the output to ``pandas``.

.. code-block:: python

    # Create an AirPls object and set the output to pandas
    airpls = AirPls().set_output(transform='pandas')

The ``set_output()`` method accepts the following argument:

- ``transform``: The output format. Can be ``'pandas'``, ``'polars'``, or ``'default'`` (the default format outputs a ``numpy.ndarray``).

.. hint::
    To set the output to ``polars``, use ``transform='polars'`` in the ``set_output()`` method (``AirPls().set_output(transform='polars')``).

3. Fit and transform the spectra
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Fit and transform the spectra
    spectra_airpls = airpls.fit_transform(spectra)

The output of the ``fit_transform()`` method is now a ``pandas.DataFrame`` object.

.. hint::
    Notice that, by default, the index and the columns of the input data are not carried over to the output, so the ``spectra_airpls`` DataFrame has a default index and default column names.

Example 2: Using the ``set_output()`` API with a pipeline
---------------------------------------------------------

Similarly, the ``set_output()`` API can be used with pipelines. The following code shows how to create a pipeline that performs:

- Multiplicative scatter correction
- Standard scaling

.. code-block:: python

    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from chemotools.scatter import MultiplicativeScatterCorrection

    # Load the spectra as in Example 1
    spectra = pd.read_csv('data/spectra.csv', index_col=0)

    # Build the pipeline and request pandas output from every step
    pipeline = make_pipeline(MultiplicativeScatterCorrection(), StandardScaler())
    pipeline.set_output(transform="pandas")

    output = pipeline.fit_transform(spectra)

.. hint::
    To set the output to ``polars``, use ``transform='polars'`` in the ``set_output()`` method (``pipeline.set_output(transform='polars')``).