In [91]:
%reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
In [101]:
import os
cwd = os.getcwd()
In [100]:
cwd
Out[100]:
'D:\\Users\\ceviherdian'

DATA SCIENTIST

In this tutorial, I explain only what you need to become a data scientist, nothing more and nothing less.

A data scientist needs to have these skills:

1. Basic Tools: Python, R, or SQL. You do not need to know all of them; for this tutorial you only need to learn how to use Python.

2. Basic Statistics: Mean, median, standard deviation, and the like. If you know basic statistics, you can work with data in Python easily.

3. Data Munging: Working with messy and inconsistent data, such as irregular date and string formatting. As you might guess, Python helps us here.

4. Data Visualization: The title is self-explanatory. We will visualize data with Python libraries such as matplotlib and seaborn.

5. Machine Learning: You do not need to understand the math behind every machine learning technique. You only need to understand the basics of machine learning and learn how to implement them in Python.

In this part, you will learn:

•Diagnose data for cleaning

•Exploratory data analysis

•Visual exploratory data analysis

•Tidy data

•Pivoting data

•Concatenating data

•Data types

•Missing data and testing with assert

3. Cleaning Data

A. Diagnose data for cleaning

B. Exploratory data analysis

C. Visual exploratory data analysis

D. Tidy data

E. Pivoting data

F. Concatenating data

G. Data types

H. Missing data and testing with assert

A. Diagnose data for cleaning

We need to diagnose and clean the data before exploring it.

Signs of unclean data:

•Column name inconsistency, such as mixed upper/lower case letters or spaces between words

•Missing data

•Different languages

We will use the head, tail, columns, shape and info methods to diagnose the data.

Data & Package:

In [104]:
# Packages: matplotlib, seaborn, numpy, and pandas (for dataframe data structure manipulation)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool
In [95]:
data = pd.read_csv('pokemon.csv')
In [16]:
# head() shows the first 5 rows
data.head()
Out[16]:
   #           Name Type 1  Type 2  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1      Bulbasaur  Grass  Poison  45      49       49       65       65     45           1      False
1  2        Ivysaur  Grass  Poison  60      62       63       80       80     60           1      False
2  3       Venusaur  Grass  Poison  80      82       83      100      100     80           1      False
3  4  Mega Venusaur  Grass  Poison  80     100      123      122      120     80           1      False
4  5     Charmander   Fire     NaN  39      52       43       60       50     65           1      False
In [17]:
# tail() shows the last 5 rows
data.tail()
Out[17]:
       #            Name   Type 1 Type 2  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
795  796         Diancie     Rock  Fairy  50     100      150      100      150     50           6       True
796  797    Mega Diancie     Rock  Fairy  50     160      110      160      110    110           6       True
797  798  Hoopa Confined  Psychic  Ghost  80     110       60      150      130     70           6       True
798  799   Hoopa Unbound  Psychic   Dark  80     160       60      170      130     80           6       True
799  800       Volcanion     Fire  Water  80     110      120      130       90     70           6       True
In [18]:
# columns gives the column names of the features
data.columns
Out[18]:
Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')
In [19]:
# shape gives the number of rows and columns as a tuple
data.shape
Out[19]:
(800, 12)
In [20]:
# info gives the data type (DataFrame), the number of samples (rows),
# the number of features (columns), the feature types and the memory usage
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
#             800 non-null int64
Name          799 non-null object
Type 1        800 non-null object
Type 2        414 non-null object
HP            800 non-null int64
Attack        800 non-null int64
Defense       800 non-null int64
Sp. Atk       800 non-null int64
Sp. Def       800 non-null int64
Speed         800 non-null int64
Generation    800 non-null int64
Legendary     800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 69.6+ KB

B. Exploratory data analysis

value_counts(): Frequency counts

outliers: values that are considerably higher or lower than the rest of the data

•Let's say the value at 75% is Q3 and the value at 25% is Q1. (Q3 - Q1) is called the interquartile range (IQR).

•Outliers are values smaller than Q1 - 1.5*(Q3 - Q1) or bigger than Q3 + 1.5*(Q3 - Q1), as in the sketch below.
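
As a minimal sketch, here is the 1.5*IQR rule applied to the Attack column of the data loaded above:

In [ ]:
# Minimal sketch: flag Attack outliers with the 1.5*IQR rule
q1 = data['Attack'].quantile(0.25)
q3 = data['Attack'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data['Attack'] < lower) | (data['Attack'] > upper)]['Name'])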

We will use the describe() method, which includes:

•count: number of entries

•mean: average of the entries

•std: standard deviation

•min: minimum entry

•25%: first quartile (Q1)

•50%: median or second quartile

•75%: third quartile (Q3)

•max: maximum entry

What is a quartile?

•Consider the sequence 1, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17.

•The median is the number in the middle of the sequence; here it is 11.

•The lower quartile is the median of the numbers between the smallest value and the median, i.e. between 1 and 11, which is 6.

•The upper quartile is the median of the numbers between the median and the largest value, i.e. between 11 and 17, which is 14.
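
We can check this worked example with pandas; its default linear interpolation reproduces exactly these values:

In [ ]:
# Verify the quartile example above with pandas
s = pd.Series([1, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17])
print(s.quantile(0.25), s.median(), s.quantile(0.75))  # 6.0 11.0 14.0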

In [27]:
# For example, let's look at the frequency of pokemon types
print(data['Type 1'].value_counts(dropna=False))  # with dropna=False, NaN values are also counted
# As can be seen below, there are 112 Water pokemon and 70 Grass pokemon
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Ghost        32
Dragon       32
Ground       32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64
In [28]:
# For example, max HP is 255 and min Defense is 5
data.describe()  # ignores null entries
Out[28]:
              #          HP      Attack     Defense     Sp. Atk     Sp. Def       Speed  Generation
count  800.0000  800.000000  800.000000  800.000000  800.000000  800.000000  800.000000   800.00000
mean   400.5000   69.258750   79.001250   73.842500   72.820000   71.902500   68.277500     3.32375
std    231.0844   25.534669   32.457366   31.183501   32.722294   27.828916   29.060474     1.66129
min      1.0000    1.000000    5.000000    5.000000   10.000000   20.000000    5.000000     1.00000
25%    200.7500   50.000000   55.000000   50.000000   49.750000   50.000000   45.000000     2.00000
50%    400.5000   65.000000   75.000000   70.000000   65.000000   70.000000   65.000000     3.00000
75%    600.2500   80.000000  100.000000   90.000000   95.000000   90.000000   90.000000     5.00000
max    800.0000  255.000000  190.000000  230.000000  194.000000  230.000000  180.000000     6.00000

C. Visual exploratory data analysis

•Box plots: visualize basic statistics like outliers, min/max or quartiles
In [31]:
# For example: compare the Attack of pokemons that are legendary or not
# Black line at top is the max
# Blue line at top is the 75% quartile
# Red line is the median (50%)
# Blue line at bottom is the 25% quartile
# Black line at bottom is the min
# There are no outliers
data.boxplot(column='Attack', by='Legendary')
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0xa636898>
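
Optionally, the same comparison can be drawn with seaborn, which we imported above as sns; a minimal sketch, not part of the original flow:

In [ ]:
# Equivalent box plot drawn with seaborn
sns.boxplot(x='Legendary', y='Attack', data=data)
plt.show()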

D. Tidy data

We tidy data with melt(). Describing melt in the abstract is confusing, so let's work through an example.

In [34]:
# First, I create a new dataframe from the pokemon data to explain melt more easily.
data_new = data.head()    # I only take 5 rows into new data
data_new
Out[34]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
4 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
In [40]:
# let's melt
# id_vars = the columns we do not wish to melt
# value_vars = the columns we want to melt
melted = pd.melt(frame=data_new, id_vars='Name', value_vars=['Attack', 'Defense'])
melted
Out[40]:
            Name variable  value
0      Bulbasaur   Attack     49
1        Ivysaur   Attack     62
2       Venusaur   Attack     82
3  Mega Venusaur   Attack    100
4     Charmander   Attack     52
5      Bulbasaur  Defense     49
6        Ivysaur  Defense     63
7       Venusaur  Defense     83
8  Mega Venusaur  Defense    123
9     Charmander  Defense     43
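
As a side note, melt can also rename the generated 'variable' and 'value' columns; a small sketch with the hypothetical names 'stat' and 'score':

In [ ]:
# var_name / value_name rename the generated 'variable' and 'value' columns
pd.melt(frame=data_new, id_vars='Name', value_vars=['Attack', 'Defense'],
        var_name='stat', value_name='score')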

E. Pivoting data

Pivoting is the reverse of melting.

In [41]:
# index: 'Name' becomes the row index
# columns: the entries of 'variable' become the new columns
# values: the cells are filled from 'value'
melted.pivot(index='Name', columns='variable', values='value')
Out[41]:
variable       Attack  Defense
Name
Bulbasaur          49       49
Charmander         52       43
Ivysaur            62       63
Mega Venusaur     100      123
Venusaur           82       83
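
One caveat worth knowing: pivot() raises an error if an index/column pair occurs more than once. In that case pivot_table() can aggregate the duplicates, for example with the mean; a minimal sketch:

In [ ]:
# pivot_table aggregates duplicate index/column pairs instead of failing
melted.pivot_table(index='Name', columns='variable', values='value', aggfunc='mean')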

F. Concatenating data

In [43]:
# First, let's create 2 dataframes
data1 = data.head()
data2 = data.tail()
conc_data_row = pd.concat([data1, data2], axis=0, ignore_index=True)  # axis=0: stacks the dataframes vertically (row-wise)
conc_data_row
Out[43]:
     #            Name   Type 1  Type 2  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0    1       Bulbasaur    Grass  Poison  45      49       49       65       65     45           1      False
1    2         Ivysaur    Grass  Poison  60      62       63       80       80     60           1      False
2    3        Venusaur    Grass  Poison  80      82       83      100      100     80           1      False
3    4   Mega Venusaur    Grass  Poison  80     100      123      122      120     80           1      False
4    5      Charmander     Fire     NaN  39      52       43       60       50     65           1      False
5  796         Diancie     Rock   Fairy  50     100      150      100      150     50           6       True
6  797    Mega Diancie     Rock   Fairy  50     160      110      160      110    110           6       True
7  798  Hoopa Confined  Psychic   Ghost  80     110       60      150      130     70           6       True
8  799   Hoopa Unbound  Psychic    Dark  80     160       60      170      130     80           6       True
9  800       Volcanion     Fire   Water  80     110      120      130       90     70           6       True
In [44]:
data1 = data['Attack'].head()
data2 = data['Defense'].head()
conc_data_col = pd.concat([data1, data2], axis=1)  # axis=1: places the series side by side (column-wise)
conc_data_col
Out[44]:
   Attack  Defense
0      49       49
1      62       63
2      82       83
3     100      123
4      52       43

G. Data types

There are 5 basic data types: object (string), boolean, integer, float and categorical.

We can convert between data types, for example from str to categorical or from int to float.

Why the category type is important:

•it makes the dataframe smaller in memory

•it can be utilized for analysis, especially with sklearn (which we will learn later)
In [47]:
data.dtypes
Out[47]:
#              int64
Name          object
Type 1        object
Type 2        object
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object
In [48]:
# let's convert object (str) to categorical and int to float
data['Type 1'] = data['Type 1'].astype('category')
data['Speed'] = data['Speed'].astype('float')
In [49]:
# As you can see, Type 1 is converted from object to categorical
# and Speed is converted from int to float
data.dtypes
Out[49]:
#                int64
Name            object
Type 1        category
Type 2          object
HP               int64
Attack           int64
Defense          int64
Sp. Atk          int64
Sp. Def          int64
Speed          float64
Generation       int64
Legendary         bool
dtype: object
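
As a quick, optional check of the memory claim from the bullet list above, we can compare the converted category column with the same data stored as object:

In [ ]:
# Compare memory: the category column vs. the same data stored as object
print(data['Type 1'].memory_usage(deep=True))                    # category
print(data['Type 1'].astype('object').memory_usage(deep=True))  # object takes more bytes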

H. Missing data and testing with assert

If we encounter missing data, we can:

•leave it as is

•drop it with dropna()

•fill the missing values with fillna()

•fill the missing values with a test statistic like the mean (see the sketch below)
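
A minimal sketch of the last option, filling with the mean. The Speed column happens to have no NaNs here, so this is a no-op, but the pattern is the same for any numeric column:

In [ ]:
# Fill missing values of a numeric column with its mean
# (working on a copy so the original data is untouched)
data_tmp = data.copy()
data_tmp['Speed'] = data_tmp['Speed'].fillna(data_tmp['Speed'].mean())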

Assert statement: a check that you can turn on or turn off when you are done with testing the program.

In [96]:
# Let's check whether the pokemon data has NaN values
# As you can see there are 800 entries. However, Type 2 has 414 non-null objects, so it has 386 null entries.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
#             800 non-null int64
Name          799 non-null object
Type 1        800 non-null object
Type 2        414 non-null object
HP            800 non-null int64
Attack        800 non-null int64
Defense       800 non-null int64
Sp. Atk       800 non-null int64
Sp. Def       800 non-null int64
Speed         800 non-null int64
Generation    800 non-null int64
Legendary     800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 69.6+ KB
In [97]:
# Let's check Type 2
data["Type 2"].value_counts(dropna=False)
# As you can see, there are 386 NaN values
Out[97]:
NaN         386
Flying       97
Ground       35
Poison       34
Psychic      33
Fighting     26
Grass        25
Fairy        23
Steel        22
Dark         20
Dragon       18
Water        14
Ice          14
Rock         14
Ghost        14
Fire         12
Electric      6
Normal        4
Bug           3
Name: Type 2, dtype: int64
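
Another quick way to count the missing values, per column, is isnull().sum():

In [ ]:
# NaN count for every column at once
data.isnull().sum()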
In [77]:
# Let's drop the NaN values
data1 = data   # we will also use data to fill missing values, so data1 points at the same object
# Note: data1["Type 2"].dropna(inplace=True) would not update the dataframe, because the
# extracted column is detached from it; dropping the whole rows does what we want
data1.dropna(subset=["Type 2"], inplace=True)  # inplace = True means we do not assign the result to a new variable
# Changes are automatically applied to data, since data1 is data
# So does it work?
In [59]:
# Let's check with an assert statement
assert 1 == 1  # returns nothing because it is true
In [65]:
# A failing assert raises AssertionError and stops execution,
# so in order to run all the code we keep this line commented out:
# assert 1 == 2  # would raise AssertionError because the condition is false
In [66]:
assert data['Type 2'].notnull().all()  # returns nothing because we dropped the NaN values
In [67]:
data["Type 2"].fillna('empty',inplace = True)
In [68]:
assert data['Type 2'].notnull().all()  # returns nothing because there are no NaN values left
In [64]:
# With assert statements we can check many things. For example:
# assert data.columns[1] == 'Name'
# assert data.Speed.dtype == np.float64  # Speed was converted to float above

Cheers!

/itsmecevi