Pandas Pipe

The pandas pipe functionality lets you write clean data preparation steps. Instead of having variables such as df1, df2, ... flying around, pipe chains a series of function calls on one dataframe. The mental model goes like this:

df -> apply function -> apply function -> ...
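
As a minimal sketch of that mental model: `df.pipe(func, *args)` simply calls `func(df, *args)`, so a chain of pipes reads top to bottom instead of inside out. The toy dataframe and the functions `add_total` and `keep_above` are made up for illustration and are not part of the example further down.

```python
import pandas as pd

def add_total(df):
    # Return a new frame with a derived column; the input frame is left untouched.
    return df.assign(total=df["a"] + df["b"])

def keep_above(df, column, threshold):
    # Extra arguments passed to .pipe() are forwarded to the function.
    return df[df[column] > threshold]

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Chained version: each step receives the result of the previous one.
piped = toy.pipe(add_total).pipe(keep_above, "total", 15)

# Equivalent nested calls, which have to be read from the inside out.
nested = keep_above(add_total(toy), "total", 15)
```

Both variants produce the same frame; the piped one just keeps the order of the steps visible.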

Separating each step into its own function has the advantage that you can keep the steps in a separate Python file, where you can test them with unit tests. Below is a small example that illustrates the functionality. Note that a pipeline step may change the original dataframe, which is why the first step in the pipeline just returns a copy (there is probably a better way to do it). It should also be possible to use the logging module to get better insight into what the pipeline steps do.
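
One way the logging idea could look is a small decorator that reports the shape of the dataframe before and after each step. This is only a sketch: the decorator `log_step`, the logger name `pipeline`, and the two toy steps are assumptions chosen for illustration.

```python
import logging
from functools import wraps

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")  # hypothetical logger name

def log_step(func):
    # Wrap a pipeline step so that it logs how the dataframe shape changed.
    @wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        logger.info("%s: shape %s -> %s", func.__name__, df.shape, result.shape)
        return result
    return wrapper

@log_step
def start_pipeline(df):
    # Copy first so later steps cannot modify the caller's dataframe.
    return df.copy()

@log_step
def drop_missing(df):
    return df.dropna()

toy = pd.DataFrame({"x": [1.0, None, 3.0]})
cleaned = toy.pipe(start_pipeline).pipe(drop_missing)
```

Because the decorator uses `functools.wraps`, the decorated steps keep their original names, which keeps the log messages readable.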

Source https://calmcode.io/pandas-pipe/end.html

import pandas as pd

# read_html returns a list with one DataFrame per table found on the page
list_df = pd.read_html("https://de.wikipedia.org/wiki/Liste_der_L%C3%A4nder_nach_Bruttoinlandsprodukt?oldformat=true")

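def deal_first_col(df_pipe):
    # Name all four columns, then drop the unneeded first one.
    # Assigning to .columns modifies df_pipe in place, hence the copy step in the pipeline.
    df_pipe.columns = ['drop', 'Land', 'BIP in MIO US $ 2018', 'veränderung']
    return df_pipe.iloc[:, 1:]

def make_copy(df_pipe):
    # Work on a copy so the original dataframe stays untouched.
    return df_pipe.copy()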

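def set_dtypes(df_pipe, dtype_dict):
    # Normalise the percentage strings: decimal comma, non-breaking space, percent sign, Unicode minus.
    # regex=False treats every pattern as a literal string.
    df_pipe['veränderung'] = df_pipe['veränderung'].str.replace(",", ".", regex=False)
    df_pipe['veränderung'] = df_pipe['veränderung'].str.replace("\xa0", "", regex=False)
    df_pipe['veränderung'] = df_pipe['veränderung'].str.replace("%", "", regex=False)
    df_pipe['veränderung'] = df_pipe['veränderung'].str.replace("−", "-", regex=False)

    # Remove the thousands separator; read as a regex, "." would match any character.
    df_pipe['BIP in MIO US $ 2018'] = df_pipe['BIP in MIO US $ 2018'].str.replace(".", "", regex=False)

    return df_pipe.astype(dtype_dict)

df = list_df[0]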

(df
    .pipe(make_copy)
    .pipe(deal_first_col)
    .dropna()
    .pipe(set_dtypes, {'BIP in MIO US $ 2018': int,
                        'veränderung': float})
    )
     Land                   BIP in MIO US $ 2018  veränderung
0    Welt                               84929508         5.80
1    Vereinigte Staaten                 20580250         5.43
3    Europäische Union                  18736855         4.67
4    Volksrepublik ChinaA1              13368073        10.83
5    Japan                               4971767         2.30
...  ...                                     ...          ...
194  Palau                                   284        -0.70
195  Marshallinseln                          214         2.88
196  Kiribati                                189         1.61
197  Nauru                                   112         1.82
198  Tuvalu                                   42         5.00

195 rows × 3 columns
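
Because every step is a plain function, it can be checked on a tiny hand-built dataframe without fetching the Wikipedia page. The sketch below assumes the raw strings look like `"5,43\xa0%"` and `"20.580.250"`, which is inferred from the replacements in the pipeline rather than taken from the page itself. The function `clean_numbers`, the shortened column name `BIP`, and the test name are hypothetical.

```python
import pandas as pd

def clean_numbers(df):
    # Hypothetical, self-contained variant of the cleaning step for testing.
    out = df.copy()
    out["veränderung"] = (
        out["veränderung"]
        .str.replace(",", ".", regex=False)    # decimal comma -> decimal point
        .str.replace("\xa0", "", regex=False)  # non-breaking space
        .str.replace("%", "", regex=False)     # percent sign
        .str.replace("−", "-", regex=False)    # Unicode minus -> ASCII hyphen
    )
    out["BIP"] = out["BIP"].str.replace(".", "", regex=False)  # thousands separator
    return out.astype({"BIP": int, "veränderung": float})

def test_clean_numbers():
    raw = pd.DataFrame({
        "BIP": ["20.580.250", "284"],
        "veränderung": ["5,43\xa0%", "−0,70\xa0%"],
    })
    cleaned = clean_numbers(raw)
    assert cleaned["BIP"].tolist() == [20580250, 284]
    assert cleaned["veränderung"].tolist() == [5.43, -0.7]
    # The input frame must not be modified by the step.
    assert raw["BIP"].tolist() == ["20.580.250", "284"]

test_clean_numbers()
```

A test runner such as pytest would pick up `test_clean_numbers` automatically if the function lived in a test file; the direct call at the end is only there so the sketch runs on its own.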

Helper Functions

Plot for the Blog Post

Sources

References