
Using Pandas

Pandas represents tabular data as DataFrames: tables with labelled rows and columns. It also provides many helper functions for reading data, such as read_csv for CSV files.

import pandas as pd
import numpy as np

df = pd.read_csv('grades.csv')
df

	name	Ex 1	Ex 2	Ex 3	Ex 4	passed
0	John	86	57	45	32	true
1	Mary	13	36	24	53	false
2	Alice	90	67	87	31	true
3	Bob	78	76	68	89	true
4	Claire	54	32	21	11	false
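The file grades.csv is not reproduced in these notes. To follow along without it, the table above can be rebuilt in memory. This is a minimal sketch; it assumes the passed column holds the strings 'true' and 'false', as the printed output suggests.

```python
import pandas as pd

# Rebuild the table shown above by hand (a stand-in for grades.csv).
# Assumption: 'passed' is stored as text, not as booleans.
df = pd.DataFrame({
    'name': ['John', 'Mary', 'Alice', 'Bob', 'Claire'],
    'Ex 1': [86, 13, 90, 78, 54],
    'Ex 2': [57, 36, 67, 76, 32],
    'Ex 3': [45, 24, 87, 68, 21],
    'Ex 4': [32, 53, 31, 89, 11],
    'passed': ['true', 'false', 'true', 'true', 'false'],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # one dtype per column
```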

Common operations on a DataFrame:

Drop a row (returns a new DataFrame):

df.drop(2)

	name	Ex 1	Ex 2	Ex 3	Ex 4	passed
0	John	86	57	45	32	true
1	Mary	13	36	24	53	false
3	Bob	78	76	68	89	true
4	Claire	54	32	21	11	false
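Because drop returns a new DataFrame, the original keeps all of its rows unless the result is assigned back. The sketch below illustrates this on a shortened, hypothetical version of the table. Note that the argument to drop is an index label, not a position.

```python
import pandas as pd

# Shortened stand-in for the grades table.
df = pd.DataFrame({'name': ['John', 'Mary', 'Alice'],
                   'Ex 1': [86, 13, 90]})

kept = df.drop(2)    # new DataFrame without the row labelled 2

print(len(df))       # the original still has every row
print(len(kept))     # the copy has one row fewer
```

To make the change stick, assign the result back: df = df.drop(2).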

Drop several rows:

df.drop([2,3,4])

	name	Ex 1	Ex 2	Ex 3	Ex 4	passed
0	John	86	57	45	32	true
1	Mary	13	36	24	53	false

Drop a column:

df.drop('passed', axis=1)

	name	Ex 1	Ex 2	Ex 3	Ex 4
0	John	86	57	45	32
1	Mary	13	36	24	53
2	Alice	90	67	87	31
3	Bob	78	76	68	89
4	Claire	54	32	21	11

Drop several columns:

df.drop(['Ex 2', 'Ex 3'], axis=1)

	name	Ex 1	Ex 4	passed
0	John	86	32	true
1	Mary	13	53	false
2	Alice	90	31	true
3	Bob	78	89	true
4	Claire	54	11	false
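As an alternative to axis=1, drop also accepts a columns keyword, which some readers find easier to read. A small sketch on a reduced, hypothetical version of the table:

```python
import pandas as pd

# Reduced stand-in for the grades table.
df = pd.DataFrame({'name': ['John', 'Mary'],
                   'Ex 1': [86, 13],
                   'Ex 2': [57, 36],
                   'passed': ['true', 'false']})

by_axis = df.drop(['Ex 2', 'passed'], axis=1)     # style used above
by_name = df.drop(columns=['Ex 2', 'passed'])     # keyword style

print(by_axis.equals(by_name))   # both calls give the same table
```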

Transform columns:

df['passed'].apply(lambda x: x=='true')

0     True
1    False
2     True
3     True
4    False
Name: passed, dtype: bool

The DataFrame is not modified!

df

	name	Ex 1	Ex 2	Ex 3	Ex 4	passed
0	John	86	57	45	32	true
1	Mary	13	36	24	53	false
2	Alice	90	67	87	31	true
3	Bob	78	76	68	89	true
4	Claire	54	32	21	11	false

To modify it, assign the transformed column back to itself:

df['passed'] = df['passed'].apply(lambda x: x=='true')
df

	name	Ex 1	Ex 2	Ex 3	Ex 4	passed
0	John	86	57	45	32	True
1	Mary	13	36	24	53	False
2	Alice	90	67	87	31	True
3	Bob	78	76	68	89	True
4	Claire	54	32	21	11	False
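For a simple comparison like this one, apply with a lambda is not strictly needed: comparing the whole column to a value works element-wise and gives the same result. A minimal sketch, assuming the column holds the strings 'true' and 'false':

```python
import pandas as pd

# Stand-in for the 'passed' column as it appears after reading the file.
passed = pd.Series(['true', 'false', 'true', 'true', 'false'], name='passed')

via_apply = passed.apply(lambda x: x == 'true')   # one Python call per element
via_compare = passed == 'true'                    # element-wise comparison

print(via_apply.equals(via_compare))
```

In the DataFrame above, this would be written df['passed'] = df['passed'] == 'true'.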

Create new columns:

df['average'] = (df['Ex 1'] + df['Ex 2'] + df['Ex 3'] + df['Ex 4']) / 4
df

	name	Ex 1	Ex 2	Ex 3	Ex 4	passed	average
0	John	86	57	45	32	True	55.00
1	Mary	13	36	24	53	False	31.50
2	Alice	90	67	87	31	True	68.75
3	Bob	78	76	68	89	True	77.75
4	Claire	54	32	21	11	False	29.50

df['mention'] = df['average'] > 70
df

	name	Ex 1	Ex 2	Ex 3	Ex 4	passed	average	mention
0	John	86	57	45	32	True	55.00	False
1	Mary	13	36	24	53	False	31.50	False
2	Alice	90	67	87	31	True	68.75	False
3	Bob	78	76	68	89	True	77.75	True
4	Claire	54	32	21	11	False	29.50	False
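Adding the columns by hand gets unwieldy as the number of exercises grows. One alternative is to select the exercise columns and take the row-wise mean. This is a sketch only; the list name exercises is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Ex 1': [86, 13, 90, 78, 54],
    'Ex 2': [57, 36, 67, 76, 32],
    'Ex 3': [45, 24, 87, 68, 21],
    'Ex 4': [32, 53, 31, 89, 11],
})

exercises = ['Ex 1', 'Ex 2', 'Ex 3', 'Ex 4']      # columns to average

df['average'] = df[exercises].mean(axis=1)        # axis=1: mean across each row
df['mention'] = df['average'] > 70                # boolean column from a comparison

print(df[['average', 'mention']])
```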

DataFrames can be passed directly as input to scikit-learn (sklearn) functions.

from sklearn.preprocessing import StandardScaler
sScaler = StandardScaler()

firstTerm = df[['Ex 1', 'Ex 2']]
sScaler.fit(firstTerm)

StandardScaler()

sScaler.transform(firstTerm)

array([[ 0.76533169,  0.19834601],
       [-1.79747626, -1.02673227],
       [ 0.90575952,  0.78171661],
       [ 0.48447602,  1.30675016],
       [-0.35809097, -1.26008051]])
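To see where these numbers come from: StandardScaler subtracts each column's mean and divides by its standard deviation. The sketch below redoes that computation with pandas alone. It assumes the population standard deviation (ddof=0), which is the choice that reproduces the values printed above.

```python
import pandas as pd

first_term = pd.DataFrame({'Ex 1': [86, 13, 90, 78, 54],
                           'Ex 2': [57, 36, 67, 76, 32]})

# Column-wise standardization: (value - mean) / standard deviation.
# ddof=0 selects the population standard deviation.
manual = (first_term - first_term.mean()) / first_term.std(ddof=0)

print(manual.round(4))
```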

from sklearn.neighbors import KNeighborsClassifier

kn = KNeighborsClassifier(n_neighbors=2)
kn.fit(df[['Ex 1', 'Ex 2', 'Ex 3', 'Ex 4']], df['passed'])

KNeighborsClassifier(n_neighbors=2)

kn.predict([
    [10,10,10,10],
    [50,60,70,80]
])

array([False,  True])
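When a model is fitted on a DataFrame, recent scikit-learn versions may warn if predict is then given a plain list, because the column names are missing. One way to avoid this is to pass the new samples as a DataFrame with the same columns. A minimal sketch; the names features and new_students are introduced here for illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

features = ['Ex 1', 'Ex 2', 'Ex 3', 'Ex 4']

df = pd.DataFrame({
    'Ex 1': [86, 13, 90, 78, 54],
    'Ex 2': [57, 36, 67, 76, 32],
    'Ex 3': [45, 24, 87, 68, 21],
    'Ex 4': [32, 53, 31, 89, 11],
    'passed': [True, False, True, True, False],
})

kn = KNeighborsClassifier(n_neighbors=2)
kn.fit(df[features], df['passed'])

# New samples carry the same column names as the training data.
new_students = pd.DataFrame([[10, 10, 10, 10],
                             [50, 60, 70, 80]], columns=features)

predictions = kn.predict(new_students)
print(predictions)
```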