Skip to content

MESQUAL Pandas Util flatten_df

flatten_df

flatten_df(df: DataFrame) -> DataFrame

Transform a time-series DataFrame into a flat format with one value per row.

Converts a DataFrame with multi-level columns (objects/variables/properties) and time-based indices into a long-format DataFrame where each row contains a single value with its corresponding metadata.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with potentially multi-level columns and indices. Typically represents time-series data with multiple variables, objects, or properties.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Flattened DataFrame in long format where:

  • Each row represents one data point
  • Original index levels become columns
  • Original column levels become the 'variable' column
  • Data values are in the 'value' column

Examples:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample multi-level DataFrame
>>> dt_idx = pd.date_range('2024-01-01', periods=3, freq='h', name='datetime')
>>> cols = pd.MultiIndex.from_product([['DE', 'FR'], ['price']], names=['zone', 'type'])
>>> df = pd.DataFrame(np.random.rand(3, 2), index=dt_idx, columns=cols)
>>> print(df.head())
    zone                    DE            FR
    type                 price volume  price volume
    datetime
    2024-01-01 00:00:00  37.45  95.07  73.20  59.87
    2024-01-01 06:00:00  15.60  15.60   5.81  86.62
    2024-01-01 12:00:00  60.11  70.81   2.06  96.99
    2024-01-01 18:00:00  83.24  21.23  18.18  18.34
>>>
>>> # Flatten the DataFrame
>>> flat_df = flatten_df(df)
>>> print(flat_df.head())
                  datetime zone    type  value
    0  2024-01-01 00:00:00   DE   price  37.45
    1  2024-01-01 06:00:00   DE   price  15.60
    2  2024-01-01 12:00:00   DE   price  60.11
    3  2024-01-01 18:00:00   DE   price  83.24
    4  2024-01-01 00:00:00   DE  volume  95.07
    5  2024-01-01 06:00:00   DE  volume  15.60
    6  2024-01-01 12:00:00   DE  volume  70.81
    7  2024-01-01 18:00:00   DE  volume  21.23
    8  2024-01-01 00:00:00   FR   price  73.20
    9  2024-01-01 06:00:00   FR   price   5.81
    10 2024-01-01 12:00:00   FR   price   2.06
    11 2024-01-01 18:00:00   FR   price  18.18
    12 2024-01-01 00:00:00   FR  volume  59.87
    13 2024-01-01 06:00:00   FR  volume  86.62
Source code in submodules/mesqual/mesqual/utils/pandas_utils/flatten_df.py
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
def flatten_df(df: pd.DataFrame) -> pd.DataFrame:
    """Transform a time-series DataFrame into a flat format with one value per row.

    Converts a DataFrame with multi-level columns (objects/variables/properties)
    and time-based indices into a long-format DataFrame where each row contains
    a single value with its corresponding metadata.

    Args:
        df (pd.DataFrame): Input DataFrame with potentially multi-level columns
            and indices. Typically represents time-series data with multiple
            variables, objects, or properties.

    Returns:
        pd.DataFrame: Flattened DataFrame in long format where:

            - Each row represents one data point
            - Original index levels become columns
            - Original column levels become the 'variable' column
            - Data values are in the 'value' column

    Examples:

        >>> import pandas as pd
        >>> import numpy as np
        >>>
        >>> # Create sample multi-level DataFrame
        >>> dt_idx = pd.date_range('2024-01-01', periods=3, freq='h', name='datetime')
        >>> cols = pd.MultiIndex.from_product([['DE', 'FR'], ['price']], names=['zone', 'type'])
        >>> df = pd.DataFrame(np.random.rand(3, 2), index=dt_idx, columns=cols)
        >>> print(df.head())
            zone                    DE            FR
            type                 price volume  price volume
            datetime
            2024-01-01 00:00:00  37.45  95.07  73.20  59.87
            2024-01-01 06:00:00  15.60  15.60   5.81  86.62
            2024-01-01 12:00:00  60.11  70.81   2.06  96.99
            2024-01-01 18:00:00  83.24  21.23  18.18  18.34
        >>>
        >>> # Flatten the DataFrame
        >>> flat_df = flatten_df(df)
        >>> print(flat_df.head())
                          datetime zone    type  value
            0  2024-01-01 00:00:00   DE   price  37.45
            1  2024-01-01 06:00:00   DE   price  15.60
            2  2024-01-01 12:00:00   DE   price  60.11
            3  2024-01-01 18:00:00   DE   price  83.24
            4  2024-01-01 00:00:00   DE  volume  95.07
            5  2024-01-01 06:00:00   DE  volume  15.60
            6  2024-01-01 12:00:00   DE  volume  70.81
            7  2024-01-01 18:00:00   DE  volume  21.23
            8  2024-01-01 00:00:00   FR   price  73.20
            9  2024-01-01 06:00:00   FR   price   5.81
            10 2024-01-01 12:00:00   FR   price   2.06
            11 2024-01-01 18:00:00   FR   price  18.18
            12 2024-01-01 00:00:00   FR  volume  59.87
            13 2024-01-01 06:00:00   FR  volume  86.62
    """

    data = df.copy()
    if any(i is None for i in data.columns.names):
        if data.columns.nlevels == 1:
            data.columns.name = 'columns'
        else:
            data.columns.names = [f'column_level_{i}' if name is None else name for i, name in enumerate(df.columns.names)]
    if any(i is None for i in data.index.names):
        if data.index.nlevels == 1:
            data.index.name = 'index'
        else:
            data.index.names = [f'index_level_{i}' if name is None else name for i, name in enumerate(df.index.names)]

    depth_cols = data.columns.nlevels
    idx_names = list(data.index.names)
    if depth_cols > 1:
        idx_cols = [(i, ) + tuple('' for _ in range(depth_cols - 1)) for i in idx_names]
    else:
        idx_cols = idx_names
    data = data.reset_index().melt(id_vars=idx_cols)
    data = data.rename(columns={tup: name for tup, name in zip(idx_cols, idx_names)})

    return data