Big dataset for multi-class classification: as a dask dataframe it can't be split, as a pandas one it can't be loaded

I have a huge dataframe (550 MB), the Lending Club one available here, and I have to predict the class of the grades. One successful attempt is described here. The dask dataframe looks like this:

Unnamed: 0 Unnamed: 0.1 loan_amnt funded_amnt funded_amnt_inv term int_rate installment annual_inc issue_d ... addr_state_SD addr_state_TN addr_state_TX addr_state_UT addr_state_VA addr_state_VT addr_state_WA addr_state_WI addr_state_WV addr_state_WY
0 41131 931434 24000 24000 24000.0 0 8.49 757.51 80000.0 2015 ... 0 0 0 0 0 0 1 0 0 0
1 41132 942549 6000 6000 6000.0 0 11.22 197.06 52000.0 2015 ... 0 0 0 0 0 0 0 0 0 0
2 41135 931619 8000 8000 8000.0 0 9.80 257.39 55000.0 2015 ... 0 0 0 0 0 0 0 0 0 0
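(For reference, a dask frame like the preview above would presumably be built with dd.read_csv, which indexes the 550 MB file lazily instead of loading it all at once; the path below is a placeholder for the preprocessed CSV:)

import dask.dataframe as dd

# placeholder path for the ~550 MB preprocessed Lending Club CSV
result = dd.read_csv('lending_club.csv')
result.head(3)   # produces a preview like the one shown above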

So it's a multi-class classification problem. Yet if I try to import the data using pandas, it seems to freeze, and if I build a dask dataframe instead, I can't use train_test_split, as I get:

NotImplementedError: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.
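(For context: a dask DataFrame has no positional row indexing, and scikit-learn's train_test_split selects rows via .iloc internally, hence the error. A splitter that understands dask collections should sidestep it; a minimal sketch, assuming the dask-ml package is installed:)

from dask_ml.model_selection import train_test_split as dask_split

# splits the dask collections partition-wise, so no row-wise .iloc is needed
train_X, test_X, train_y, test_y = dask_split(X, Y, test_size=0.2, random_state=0)

# plain-dask alternative: fractionally split the whole frame (result is the dask dataframe above)
train_df, test_df = result.random_split([0.8, 0.2], random_state=0)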

So how can I run a classification on even a small portion of this dataset?
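(If a small portion is acceptable, one option is to sample the dask frame down into plain pandas and proceed normally from there; a minimal sketch, where the 5% fraction is an arbitrary choice:)

# sample ~5% of the rows into an ordinary in-memory pandas DataFrame
small = result.sample(frac=0.05, random_state=0).compute()

from sklearn.model_selection import train_test_split
X_small = small.loc[:, small.columns != 'TARGET']
y_small = small['TARGET']
train_X, test_X, train_y, test_y = train_test_split(
    X_small, y_small, test_size=0.2, random_state=0)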

Here is my code so far:

import dask.dataframe as dd
# predicting model
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import maxnorm
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib
from dask.distributed import Client

# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    # input layer
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    # hidden layer
    model.add(Dense(60, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    # output layer
    model.add(Dense(output_dim, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# load dataset (result is the dask dataframe shown above)
X = result.loc[:, result.columns != 'TARGET']
Y = result['TARGET']

input_dim = len(X.columns)          # number of feature columns, read by baseline_model()
output_dim = Y.nunique().compute()  # number of target classes

client = Client()
with joblib.parallel_backend('dask'):
    print("Before train test split")
    train_X, test_X, train_y, test_y = train_test_split(X, result['TARGET'], test_size=0.2, random_state=0)
    print("before one hot encoder 1")
    train_y = pd.get_dummies(train_y)
    print("before one hot encoder 2")
    test_y = pd.get_dummies(test_y)
    print("Before Keras Classifier")
    estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=1)
    kfold = KFold(n_splits=10, shuffle=True)
    Y_dummies = pd.get_dummies(Y)   # one-hot target for cross-validation
    results = cross_val_score(estimator, X, Y_dummies, cv=kfold)
    print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

Here is the full error message:

C:\Users\antoi\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)

Before train test split

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-8-7bcf8dee84cb> in <module>
      6 with joblib.parallel_backend('dask'):
      7     print("Before train test split")
----> 8     train_X, test_X, train_y, test_y = train_test_split(X,result['TARGET'], test_size = 0.2, random_state = 0)
      9     print("before one hot encoder 1")
     10     train_y = pd.get_dummies(train_y)

~\AppData\Roaming\Python\Python36\site-packages\sklearn\model_selection\_split.py in train_test_split(*arrays, **options)
   2122
   2123     return list(chain.from_iterable((safe_indexing(a, train),
-> 2124                                      safe_indexing(a, test)) for a in arrays))
   2125
   2126

~\AppData\Roaming\Python\Python36\site-packages\sklearn\model_selection\_split.py in <genexpr>(.0)
   2122
   2123     return list(chain.from_iterable((safe_indexing(a, train),
-> 2124                                      safe_indexing(a, test)) for a in arrays))
   2125
   2126

~\AppData\Roaming\Python\Python36\site-packages\sklearn\utils\__init__.py in safe_indexing(X, indices)
    206     # Pandas Dataframes and Series
    207     try:
--> 208         return X.iloc[indices]
    209     except ValueError:
    210         # Cython typed memoryviews internally used in pandas do not support

~\AppData\Roaming\Python\Python36\site-packages\dask\dataframe\indexing.py in __getitem__(self, key)
     52         )
     53         if not isinstance(key, tuple):
---> 54             raise NotImplementedError(msg)
     55
     56         if len(key) > 2:

NotImplementedError: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.

The only idea I have left is to process the file in chunks, like:

for chunk in pd.read_csv(<filepath>, chunksize=<your_chunksize_here>):
    do_processing()
    train_algorithm()
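(Fleshed out, that might look like the sketch below. Everything here is hypothetical: the path and chunk size are placeholders, the features are assumed to be purely numeric already, and the grade list is fixed up front so every chunk one-hot encodes to the same columns:)

import pandas as pd

grades = ['A', 'B', 'C', 'D', 'E', 'F', 'G']   # assumed fixed class list
model = baseline_model()                        # reuse the network defined above

for chunk in pd.read_csv('lending_club.csv', chunksize=100000):  # placeholder path and size
    X_chunk = chunk.drop(columns=['TARGET']).values
    # fixing the categories keeps the dummy columns identical across chunks
    y_chunk = pd.get_dummies(pd.Categorical(chunk['TARGET'], categories=grades)).values
    # one pass over each chunk updates the same model incrementally
    model.fit(X_chunk, y_chunk, epochs=1, batch_size=32, verbose=0)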
