Using Pandas

Pandas represents tabular data as DataFrames and provides many helper functions for reading data, such as read_csv.

In [1]:
import pandas as pd
import numpy as np
In [2]:
df = pd.read_csv('grades.csv')
df
Out[2]:
     name  Ex 1  Ex 2  Ex 3  Ex 4  passed
0    John    86    57    45    32    true
1    Mary    13    36    24    53   false
2   Alice    90    67    87    31    true
3     Bob    78    76    68    89    true
4  Claire    54    32    21    11   false
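
read_csv accepts a file-like object as well as a path, which is handy for experimenting without a file on disk. The sketch below assumes a small hand-written sample (`sample_csv` is a hypothetical stand-in, not the actual contents of grades.csv):

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
sample_csv = io.StringIO(
    "name,Ex 1,Ex 2\n"
    "John,86,57\n"
    "Mary,13,36\n"
)

# read_csv takes a path or any file-like object.
sample = pd.read_csv(sample_csv)

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # column labels come from the header line
```

The header line supplies the column labels, and the row index defaults to 0, 1, 2, ... as in the output above.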

Common operations on a DataFrame:

Drop a row (this returns a new DataFrame):

In [3]:
df.drop(2)
Out[3]:
     name  Ex 1  Ex 2  Ex 3  Ex 4  passed
0    John    86    57    45    32    true
1    Mary    13    36    24    53   false
3     Bob    78    76    68    89    true
4  Claire    54    32    21    11   false

Drop several rows:

In [4]:
df.drop([2, 3, 4])
Out[4]:
   name  Ex 1  Ex 2  Ex 3  Ex 4  passed
0  John    86    57    45    32    true
1  Mary    13    36    24    53   false

Drop a column:

In [5]:
df.drop('passed', axis=1)
Out[5]:
     name  Ex 1  Ex 2  Ex 3  Ex 4
0    John    86    57    45    32
1    Mary    13    36    24    53
2   Alice    90    67    87    31
3     Bob    78    76    68    89
4  Claire    54    32    21    11

Drop several columns:

In [6]:
df.drop(['Ex 2', 'Ex 3'], axis=1)
Out[6]:
     name  Ex 1  Ex 4  passed
0    John    86    32    true
1    Mary    13    53   false
2   Alice    90    31    true
3     Bob    78    89    true
4  Claire    54    11   false
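
The same drops can be written with the `index=` and `columns=` keywords of `DataFrame.drop`, which some readers find clearer than `axis=1`. A minimal sketch, assuming a hand-built copy of the grades table (the `grades` frame here is rebuilt for illustration rather than read from grades.csv):

```python
import pandas as pd

# Hand-built copy of the grades table, for illustration only.
grades = pd.DataFrame({
    'name': ['John', 'Mary', 'Alice', 'Bob', 'Claire'],
    'Ex 1': [86, 13, 90, 78, 54],
    'Ex 2': [57, 36, 67, 76, 32],
    'Ex 3': [45, 24, 87, 68, 21],
    'Ex 4': [32, 53, 31, 89, 11],
    'passed': ['true', 'false', 'true', 'true', 'false'],
})

# Drop rows by index label.
without_rows = grades.drop(index=[2, 3, 4])

# Drop columns by name, equivalent to drop([...], axis=1).
without_cols = grades.drop(columns=['Ex 2', 'Ex 3'])

print(list(without_rows.index))
print(list(without_cols.columns))
print(grades.shape)  # the original frame is left untouched
```

Both calls return new DataFrames, so `grades` keeps all five rows and six columns.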

Transform columns:

In [7]:
df['passed'].apply(lambda x: x == 'true')
Out[7]:
0     True
1    False
2     True
3     True
4    False
Name: passed, dtype: bool

Note that apply returns a new Series; the DataFrame itself is not modified:

In [8]:
df
Out[8]:
     name  Ex 1  Ex 2  Ex 3  Ex 4  passed
0    John    86    57    45    32    true
1    Mary    13    36    24    53   false
2   Alice    90    67    87    31    true
3     Bob    78    76    68    89    true
4  Claire    54    32    21    11   false

To modify it, assign the transformed column back to itself:

In [9]:
df['passed'] = df['passed'].apply(lambda x: x == 'true')
df
Out[9]:
     name  Ex 1  Ex 2  Ex 3  Ex 4  passed
0    John    86    57    45    32    True
1    Mary    13    36    24    53   False
2   Alice    90    67    87    31    True
3     Bob    78    76    68    89    True
4  Claire    54    32    21    11   False
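
For a simple equality test like this one, a vectorized comparison or a dictionary lookup with `Series.map` gives the same result without a lambda. A small sketch, assuming the column holds the strings 'true' and 'false' as shown above (the standalone `passed` Series is built by hand for illustration):

```python
import pandas as pd

# Hand-built stand-in for the 'passed' column before conversion.
passed = pd.Series(['true', 'false', 'true', 'true', 'false'], name='passed')

# Element-wise comparison: one boolean per row.
as_bool = passed == 'true'

# Explicit lookup table: each string is replaced by its mapped value.
via_map = passed.map({'true': True, 'false': False})

print(as_bool.tolist())
print(via_map.tolist())
```

As with apply, neither expression changes `passed`; the result has to be assigned back to take effect.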

Create new columns:

In [10]:
df['average'] = (df['Ex 1'] + df['Ex 2'] + df['Ex 3'] + df['Ex 4']) / 4
df
Out[10]:
     name  Ex 1  Ex 2  Ex 3  Ex 4  passed  average
0    John    86    57    45    32    True    55.00
1    Mary    13    36    24    53   False    31.50
2   Alice    90    67    87    31    True    68.75
3     Bob    78    76    68    89    True    77.75
4  Claire    54    32    21    11   False    29.50
In [11]:
df['mention'] = df['average'] > 70
df
Out[11]:
     name  Ex 1  Ex 2  Ex 3  Ex 4  passed  average  mention
0    John    86    57    45    32    True    55.00    False
1    Mary    13    36    24    53   False    31.50    False
2   Alice    90    67    87    31    True    68.75    False
3     Bob    78    76    68    89    True    77.75     True
4  Claire    54    32    21    11   False    29.50    False
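
When the number of exercise columns grows, summing them by hand gets tedious. One possible alternative is a row-wise mean over a list of column names. This is a sketch on a hand-built copy of the exercise scores; `exercise_cols` is a name introduced here for illustration:

```python
import pandas as pd

# Hand-built copy of the exercise scores, for illustration only.
grades = pd.DataFrame({
    'name': ['John', 'Mary', 'Alice', 'Bob', 'Claire'],
    'Ex 1': [86, 13, 90, 78, 54],
    'Ex 2': [57, 36, 67, 76, 32],
    'Ex 3': [45, 24, 87, 68, 21],
    'Ex 4': [32, 53, 31, 89, 11],
})

exercise_cols = ['Ex 1', 'Ex 2', 'Ex 3', 'Ex 4']

# axis=1 averages across the columns of each row.
grades['average'] = grades[exercise_cols].mean(axis=1)

# A comparison on a column yields a boolean column.
grades['mention'] = grades['average'] > 70

print(grades[['name', 'average', 'mention']])
```

Adding a fifth exercise then only requires extending `exercise_cols`.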

DataFrames can be passed directly to scikit-learn (sklearn) estimators and transformers:

In [12]:
from sklearn.preprocessing import StandardScaler
sScaler = StandardScaler()
In [13]:
firstTerm = df[['Ex 1', 'Ex 2']]
sScaler.fit(firstTerm)
Out[13]:
StandardScaler()
In [14]:
sScaler.transform(firstTerm)
Out[14]:
array([[ 0.76533169,  0.19834601],
       [-1.79747626, -1.02673227],
       [ 0.90575952,  0.78171661],
       [ 0.48447602,  1.30675016],
       [-0.35809097, -1.26008051]])
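
To see where these numbers come from, the same standardization can be reproduced in plain pandas: subtract each column's mean and divide by its population standard deviation (`ddof=0`). This is only a sketch of the arithmetic on a hand-built copy of the two columns, not a replacement for the scaler:

```python
import pandas as pd

# Hand-built copy of the two columns passed to the scaler above.
first_term = pd.DataFrame({
    'Ex 1': [86, 13, 90, 78, 54],
    'Ex 2': [57, 36, 67, 76, 32],
})

# z = (x - column mean) / column standard deviation.
# ddof=0 selects the population standard deviation; pandas defaults to ddof=1.
standardized = (first_term - first_term.mean()) / first_term.std(ddof=0)

print(standardized)
```

After this transformation each column has mean 0 and unit variance, which keeps columns on different scales comparable.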
In [15]:
from sklearn.neighbors import KNeighborsClassifier
In [16]:
kn = KNeighborsClassifier(n_neighbors=2)
kn.fit(df[['Ex 1', 'Ex 2', 'Ex 3', 'Ex 4']], df['passed'])
Out[16]:
KNeighborsClassifier(n_neighbors=2)
In [17]:
kn.predict([
    [10,10,10,10],
    [50,60,70,80]
])
Out[17]:
array([False,  True])
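
To make the prediction less of a black box, here is a rough hand-rolled version of a 2-nearest-neighbour vote using Euclidean distance. It is a simplified sketch, not scikit-learn's implementation: `knn_predict` is a hypothetical helper, and its tie-breaking rule (a tie counts as False) is this sketch's own choice.

```python
import numpy as np
import pandas as pd

features = ['Ex 1', 'Ex 2', 'Ex 3', 'Ex 4']

# Hand-built copy of the training data, for illustration only.
grades = pd.DataFrame({
    'Ex 1': [86, 13, 90, 78, 54],
    'Ex 2': [57, 36, 67, 76, 32],
    'Ex 3': [45, 24, 87, 68, 21],
    'Ex 4': [32, 53, 31, 89, 11],
    'passed': [True, False, True, True, False],
})


def knn_predict(train_X, train_y, queries, k=2):
    """Hypothetical helper: majority vote among the k closest training rows."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y, dtype=bool)
    predictions = []
    for query in np.asarray(queries, dtype=float):
        # Euclidean distance from the query to every training row.
        distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
        # Indices of the k smallest distances.
        nearest = np.argsort(distances)[:k]
        # Strict majority of True votes; a tie yields False in this sketch.
        predictions.append(bool(train_y[nearest].sum() * 2 > k))
    return predictions


predicted = knn_predict(
    grades[features],
    grades['passed'],
    [[10, 10, 10, 10], [50, 60, 70, 80]],
)
print(predicted)
```

For the first query the two closest students are Claire and Mary, who both failed; for the second they are Bob and John, who both passed, which agrees with the output above.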