
Run Regression Analysis On Multiple Subsets Of Pandas Columns Efficiently

I could have asked a shorter question that focuses only on the core problem here, which is list permutations. But the reason I'm bringing statsmodels and pandas into the q

Solution 1:

Based on the help I got here, I've been able to put together a function that takes all columns in a pandas dataframe, defines a dependent variable, and returns all unique combinations of the remaining variables. The result differs a bit from the desired result as defined above but makes more sense for practical use, I think. I'm still hoping that others will be able to post even better solutions.

Here it is:

# Imports
import pandas as pd
import numpy as np
import itertools

# A dataframe with random numbers
np.random.seed(123)
rows = 12
listVars= ['y','x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars) 
df_1 = df_1.set_index(rng)

# The function
def StepWise(columns, dependent):
    """ Takes the columns of a pandas dataframe, defines a dependent variable
        and returns all unique combinations of the remaining (independent) variables.

    """

    independent = columns.copy()
    independent.remove(dependent)

    # Collect every combination of the independent variables,
    # from single variables up to all of them together
    combos = []
    for i in range(1, len(independent) + 1):
        combos.extend(itertools.combinations(independent, i))

    # Turn each tuple into a list and pair it with the dependent variable
    combosIndependent = [list(elem) for elem in combos]
    combosAll = [[dependent, other] for other in combosIndependent]
    return combosAll

lExec = StepWise(columns = list(df_1), dependent = 'y')
print(lExec)
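
To make the shape of the result concrete, here is a small standard-library sketch that builds the same nested structure for the column names used above. The helper name `all_subsets` and the flattened variant `flat` are my own additions for illustration, not part of the function above; the flat form is only an assumption about what might be handier when each list is used directly to select dataframe columns.

```python
from itertools import chain, combinations

def all_subsets(independent):
    """Return every non-empty combination of the given names, as lists."""
    sizes = range(1, len(independent) + 1)
    return [list(c) for c in
            chain.from_iterable(combinations(independent, k) for k in sizes)]

dependent = 'y'
independent = ['x1', 'x2', 'x3']

# Same nested structure as StepWise: [dependent, [independent variables]]
nested = [[dependent, combo] for combo in all_subsets(independent)]

# Hypothetical flat variant: [dependent, x_a, x_b, ...]
flat = [[dependent] + combo for combo in all_subsets(independent)]

print(nested[0], nested[-1])  # ['y', ['x1']] ['y', ['x1', 'x2', 'x3']]
print(flat[0], flat[-1])      # ['y', 'x1'] ['y', 'x1', 'x2', 'x3']
print(len(nested))            # 7, i.e. 2**3 - 1 non-empty subsets
```

Note that the number of combinations grows as 2**n - 1 for n independent variables, so this approach is only practical for a modest number of columns.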


If you combine this with snippet 3 above, you can easily store the results of multiple regression analyses on a specified dependent variable in a pandas data frame.
