Statistics

aircraftdetective.utility.statistics

_compute_polynomials_from_dataframe

_compute_polynomials_from_dataframe(
    df, col_name_x, list_col_names_y, degree, plot=False
)

Computes polynomial fits of a given degree for each column in a dataframe.

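Fits a polynomial of the specified degree to each column named in `list_col_names_y`, using the column `col_name_x` as the independent variable. Rows where the x value or the respective y value is missing are excluded from the fit. Returns a dictionary containing the polynomial fits and their corresponding R-squared values. If `plot` is `True`, the original data and the fitted curves are additionally displayed in a figure.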

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame containing the data for polynomial fitting.	required
`col_name_x`	`str`	Name of the column to be used as the x-axis for polynomial fitting.	required
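`list_col_names_y`	`list[str]`	Names of the columns to fit against the x-axis column.	required
`degree`	`int`	Degree of the polynomial to fit. Must be a non-negative integer smaller than the number of rows in `df`.	required
`plot`	`bool`	If `True`, displays a figure with the original data and the fitted curves.	`False`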

Returns:

Type	Description
`dict[str, Any]`	A dictionary with two entries per fitted column: the column name maps to the fitted Polynomial object, and the column name suffixed with `_r2` maps to the R-squared value of that fit. Of the kind: `{ 'Column1': Polynomial object, 'Column1_r2': float, 'Column2': Polynomial object, 'Column2_r2': float, ... }`
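Because fits and R-squared values share one flat dictionary, a caller will usually want to separate them again. The following is a minimal sketch of one way to do that. It does not call `aircraftdetective`; instead it builds a stand-in dictionary of the same shape by hand with NumPy, so the variable names (`result`, `fits`, `scores`) and the splitting logic are illustrative assumptions, not part of the library.

```python
import numpy as np

# Hypothetical stand-in for the returned dictionary, built by hand so the
# sketch runs without aircraftdetective installed.
x = np.array([2000.0, 2005.0, 2010.0, 2015.0])
y = np.array([10.0, 15.0, 20.0, 25.0])  # exactly linear: y = x - 1990

fit = np.polynomial.Polynomial.fit(x, y, deg=1)
rss = np.sum((y - fit(x)) ** 2)
tss = np.sum((y - y.mean()) ** 2)
result = {"Value1": fit, "Value1_r2": 1.0 - rss / tss}

# Split the flat dictionary into fitted polynomials and R-squared scores.
fits = {k: v for k, v in result.items() if not k.endswith("_r2")}
scores = {k[: -len("_r2")]: v for k, v in result.items() if k.endswith("_r2")}

# A fitted Polynomial object is callable and accepts x in the original units.
predicted_2020 = fits["Value1"](2020.0)
```

Note that this way of splitting would misclassify a data column whose own name ends in `_r2`.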

Example

import pandas as pd
from aircraftdetective.utility.statistics import _compute_polynomials_from_dataframe
data = {
    'Year': [2000, 2005, 2010, 2015],
    'Value1': [10, 15, 20, 25],
    'Value2': [30, 25, 20, 15]
}
df = pd.DataFrame(data)
_compute_polynomials_from_dataframe(df, 'Year', ['Value1', 'Value2'], degree=2)

Source code in aircraftdetective/utility/statistics.py

def _compute_polynomials_from_dataframe(
    df: pd.DataFrame,
    col_name_x: str,
    list_col_names_y: list[str],
    degree: int,
    plot: bool = False
) -> dict[str, Any]:
    r"""
    Computes polynomial fits of a given degree for each column in a dataframe.

    Given a dataframe with at least two columns, computes a polynomial fit of the specified degree
    for each column against the specified x-axis column. Returns a dictionary containing the polynomial
    fits and their corresponding R-squared values.

    See Also
    --------
    [`numpy.polynomial.Polynomial.fit`](https://numpy.org/doc/2.0/reference/generated/numpy.polynomial.polynomial.Polynomial.fit.html)  
    [`aircraftdetective.utility.statistics._r_squared`][]

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing the data for polynomial fitting.
    col_name_x : str
        Name of the column to be used as the x-axis for polynomial fitting.
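    list_col_names_y : list[str]
        Names of the columns to fit against the x-axis column.
    degree : int
        Degree of the polynomial to fit.
    plot : bool, optional
        If True, displays a figure with the original data and the fitted curves. Defaults to False.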

    Returns
    -------
    dict[str, Any]
        A dictionary where keys are column names and values are the corresponding polynomial fits.  
        Of the kind:  
        ```
        {
            'Column1': Polynomial object,
            'Column1_r2': float,
            'Column2': Polynomial object,
            'Column2_r2': float,
            ...
        }
        ```

    Example
    -------
    ```pyodide install='aircraftdetective'
    import pandas as pd
    from aircraftdetective.utility.statistics import _compute_polynomials_from_dataframe
    data = {
        'Year': [2000, 2005, 2010, 2015],
        'Value1': [10, 15, 20, 25],
        'Value2': [30, 25, 20, 15]
    }
    df = pd.DataFrame(data)
    _compute_polynomials_from_dataframe(df, 'Year', ['Value1', 'Value2'], degree=2)
    ```
    """
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df must be a Pandas DataFrame")
    if df.empty:
        raise ValueError("df cannot be empty")
    if col_name_x not in df.columns:
        raise ValueError(f"col_name_x '{col_name_x}' not found in df columns")
    if not isinstance(degree, int) or degree < 0:
        raise ValueError("degree must be a non-negative integer")
    if len(df) <= degree:
        raise ValueError("number of data points must be greater than degree")

    df_func = df.copy()

    df_func.sort_values(by=col_name_x, ascending=True, inplace=True)
    df_func.dropna(subset=[col_name_x], inplace=True)

    dict_polynomials = {}
    for col in list_col_names_y:
        x_unfiltered = df_func[col_name_x].astype("float64")
        y_unfiltered = df_func[col].astype("float64")
        mask = y_unfiltered.notna() # ensure all NaNs are removed, otherwise the fit will fail
        x = x_unfiltered[mask]
        y = y_unfiltered[mask]
        polynomial_fit = np.polynomial.Polynomial.fit(
            x=x,
            y=y,
            deg=degree,
        )
        _r_squared_polynomial = _r_squared(y, polynomial_fit(x))
        dict_polynomials[col] = polynomial_fit
        dict_polynomials[f'{col}_r2'] = _r_squared_polynomial

    if plot is True:
        fig = go.Figure()
        for col in list_col_names_y:
            if col not in dict_polynomials:
                continue

            x_data = df_func[col_name_x].astype("float64")
            y_data = df_func[col].astype("float64")
            fig.add_trace(go.Scatter(
                x=x_data,
                y=y_data,
                mode='markers',
                name=f'{col} (Original Data)',
                marker=dict(opacity=0.7)
            ))

            polynomial_fit = dict_polynomials[col]
            r_squared = dict_polynomials[f'{col}_r2']
            x_fit = np.linspace(x_data.min(), x_data.max(), num=200)
            y_fit = polynomial_fit(x_fit)
            fig.add_trace(go.Scatter(
                x=x_fit, 
                y=y_fit,
                mode='lines',
                name=f'{col} (Fit, R²={r_squared:.3f})',
                line=dict(width=3)
            ))

        fig.show()

    return dict_polynomials
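One property of `numpy.polynomial.Polynomial.fit`, which the source above relies on, is worth keeping in mind when reading the returned objects: the fit is carried out in a scaled variable (the x range is mapped onto the window `[-1, 1]`), so the `.coef` attribute does not hold the coefficients in the original x units. Calling the polynomial works in the original units regardless, and `.convert()` yields an equivalent polynomial with unscaled coefficients. A minimal sketch with made-up data:

```python
import numpy as np

x = np.array([2000.0, 2005.0, 2010.0, 2015.0])
y = np.array([10.0, 15.0, 20.0, 25.0])  # exactly linear: y = x - 1990

fit = np.polynomial.Polynomial.fit(x, y, deg=1)

# Coefficients in the scaled variable t = (x - 2007.5) / 7.5,
# i.e. y = 17.5 + 7.5 * t
scaled_coef = fit.coef

# Coefficients in the original x units, i.e. y = -1990 + 1 * x
unscaled_coef = fit.convert().coef

# Evaluation accepts x in the original units either way.
value_at_2010 = fit(2010.0)
```

Coefficients are ordered from the lowest degree to the highest.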

_r_squared

_r_squared(y, y_pred)

Given a NumPy ndarray of observed values and a NumPy ndarray of predicted values, determines the coefficient of determination ($R^2$):

$$ R^2 = 1 - \frac{RSS}{TSS} $$

with

$$ RSS = \sum_i (y_i - \hat{y}_i)^2 $$

$$ TSS = \sum_i (y_i - \bar{y})^2 $$

where:

Symbol	Description
$R^2$	Coefficient of determination
$RSS$	Residual sum of squares
$TSS$	Total sum of squares
$y_i$	Actual value
$\hat{y}_i$	Predicted value
$\bar{y}$	Mean of actual values
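The definition above translates directly into a few lines of NumPy. The function below is a sketch written from the formulas only; its name `r_squared_sketch` is hypothetical and it is not the source of `_r_squared`, which is not shown on this page.

```python
import numpy as np

def r_squared_sketch(y, y_pred):
    """Coefficient of determination, R^2 = 1 - RSS / TSS (illustrative only)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y - y_pred) ** 2)    # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - rss / tss

y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 5.0])

# RSS = 1 and TSS = 5, so R^2 = 1 - 1/5 = 0.8
score = r_squared_sketch(y_obs, y_hat)

# Predicting the mean everywhere gives RSS = TSS, so R^2 = 0
baseline = r_squared_sketch(y_obs, np.full_like(y_obs, y_obs.mean()))
```

The sketch does not guard against constant observations, for which $TSS = 0$ and the ratio is undefined.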

References

Eqn. (5.2.4) in Draper and Smith (1998)
Coefficient of Determination on Wikipedia