In [1]:
%reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
In [2]:
import os
cwd = os.getcwd()
In [3]:
cwd
Out[3]:
'D:\\Users\\ceviherdian'

DATA SCIENTIST

In this tutorial, I explain only what you need to be a data scientist, neither more nor less.

A data scientist needs to have these skills:

1. Basic Tools: like Python, R, or SQL. You do not need to know everything; all you need is to learn how to use Python.

2. Basic Statistics: like mean, median, or standard deviation. If you know basic statistics, you can use Python easily.

3. Data Munging: working with messy and difficult data, like inconsistent date and string formatting. As you might guess, Python helps us here.

4. Data Visualization: the title is self-explanatory. We will visualize the data with Python, using libraries like matplotlib and seaborn.

5. Machine Learning: you do not need to understand the math behind machine learning techniques. All you need is to understand the basics of machine learning and learn how to implement it using Python.

4. Pandas Foundation

Data structures & analysis

A. Review of pandas

B. Building data frames from scratch

C. Visual exploratory data analysis

D. Statistical exploratory data analysis

E. Indexing pandas time series

F. Resampling pandas time series

A. Review of pandas

As you noticed, we have already learned some basics of pandas; now we will go deeper.

•single column = Series

•NaN = not a number (a missing entry)

•dataframe.values = the underlying NumPy array
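Those three points can be sketched with a tiny hypothetical frame (made up just for illustration):

```python
import numpy as np
import pandas as pd

# A tiny made-up frame, standing in for the pokemon data
df = pd.DataFrame({"Attack": [49, 62], "Defense": [49, 63]})

print(type(df["Attack"]).__name__)  # Series  -> a single column is a Series
print(type(df.values).__name__)     # ndarray -> .values is the underlying NumPy array

# NaN marks a missing ("not a number") entry
s = pd.Series([1.0, np.nan])
print(s.isna().tolist())            # [False, True]
```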

B. Building data frames from scratch

•We can build data frames from csv as we did earlier.

•Also we can build dataframe from dictionaries

* zip() function: returns an iterator of tuples,
where the i-th tuple contains the i-th element from each of the argument sequences or iterables (wrap it in list() to materialize it).

•Adding new column

•Broadcasting: Create new column and assign a value to entire column
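A quick look at zip() on its own before we use it below (in Python 3 it returns an iterator, so we wrap it in list() to inspect it):

```python
labels = ["country", "population"]
values = [["Spain", "France"], [11, 12]]

# zip pairs the i-th elements of each iterable into tuples
pairs = list(zip(labels, values))
print(pairs)        # [('country', ['Spain', 'France']), ('population', [11, 12])]
print(dict(pairs))  # a dict ready to feed into pd.DataFrame
```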

Data & Package:

In [13]:
#Package: matplotlib, seaborn,numpy, and pandas (for dataframe data structure manipulation)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool
In [20]:
data = pd.read_csv('pokemon.csv')
In [14]:
# data frames from dictionary
country = ["Spain","France"]
population = ["11","12"]
list_label = ["country","population"]
list_col = [country,population]
zipped = list(zip(list_label,list_col))
data_dict = dict(zipped)
df = pd.DataFrame(data_dict)
df
Out[14]:
country population
0 Spain 11
1 France 12
In [15]:
# Add new columns
df["capital"] = ["madrid","paris"]
df
Out[15]:
country population capital
0 Spain 11 madrid
1 France 12 paris
In [16]:
# Broadcasting
df["income"] = 0 #Broadcasting entire column
df
Out[16]:
country population capital income
0 Spain 11 madrid 0
1 France 12 paris 0
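Broadcasting is not limited to assigning a scalar; arithmetic on whole columns is broadcast element-wise as well. A small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Spain", "France"]})
df["income"] = 0                    # scalar broadcast: every row gets 0
df["income"] = df["income"] + 100   # column arithmetic, also broadcast row by row
print(df["income"].tolist())        # [100, 100]
```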

C. Visual exploratory data analysis

•Plot

•Subplot

•Histogram:

◾bins: number of bins

◾range (tuple): min and max values of the bins

◾normed (boolean): normalize or not (deprecated; newer matplotlib uses density instead)

◾cumulative (boolean): compute cumulative distribution
In [21]:
# Plotting all data 
data1 = data.loc[:,["Attack","Defense","Speed"]]
data1.plot()
# it is confusing
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x9867940>
In [22]:
# subplots
data1.plot(subplots = True)
plt.show()
In [23]:
# scatter plot  
data1.plot(kind = "scatter",x="Attack",y = "Defense")
plt.show()
In [24]:
# hist plot  
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),normed = True)
D:\Users\ceviherdian\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0xae51128>
In [25]:
# histogram subplot with non cumulative and cumulative
fig, axes = plt.subplots(nrows=2,ncols=1)
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),normed = True,ax = axes[0])
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),normed = True,ax = axes[1],cumulative = True)
plt.savefig('graph.png')
plt
D:\Users\ceviherdian\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
D:\Users\ceviherdian\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Out[25]:
<module 'matplotlib.pyplot' from 'D:\\Users\\ceviherdian\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>

D. Statistical exploratory data analysis

Let's look one more time:

•count: number of entries

•mean: average of entries

•std: standard deviation

•min: minimum entry

•25%: first quartile

•50%: median (second quartile)

•75%: third quartile

•max: maximum entry
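Each of those statistics can also be computed individually; a quick check on a small made-up series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
print(s.count())         # 8
print(s.mean())          # 4.5
print(s.std())           # standard deviation
print(s.quantile(0.25))  # 2.75 -> first quartile
print(s.median())        # 4.5  -> same as s.quantile(0.5)
print(s.max())           # 8
```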

In [28]:
data.describe()
Out[28]:
# HP Attack Defense Sp. Atk Sp. Def Speed Generation
count 800.0000 800.000000 800.000000 800.000000 800.000000 800.000000 800.000000 800.00000
mean 400.5000 69.258750 79.001250 73.842500 72.820000 71.902500 68.277500 3.32375
std 231.0844 25.534669 32.457366 31.183501 32.722294 27.828916 29.060474 1.66129
min 1.0000 1.000000 5.000000 5.000000 10.000000 20.000000 5.000000 1.00000
25% 200.7500 50.000000 55.000000 50.000000 49.750000 50.000000 45.000000 2.00000
50% 400.5000 65.000000 75.000000 70.000000 65.000000 70.000000 65.000000 3.00000
75% 600.2500 80.000000 100.000000 90.000000 95.000000 90.000000 90.000000 5.00000
max 800.0000 255.000000 190.000000 230.000000 194.000000 230.000000 180.000000 6.00000

E. Indexing pandas time series

•datetime = object: dates read as text are plain strings (object dtype) until converted

•parse_dates (boolean): parse dates into datetime objects in ISO 8601 (yyyy-mm-dd hh:mm:ss) format

In [34]:
time_list = ["1992-03-08","1992-04-12"]
print(type(time_list[1])) # As you can see date is string
# however we want it to be datetime object
datetime_object = pd.to_datetime(time_list)
print(type(datetime_object))
<class 'str'>
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
In [32]:
# close warning
import warnings
warnings.filterwarnings("ignore")
# In order to practice, let's take the head of the pokemon data and add a date list to it
data2 = data.head()
date_list = ["1992-01-10","1992-02-10","1992-03-10","1993-03-15","1993-03-16"]
datetime_object = pd.to_datetime(date_list)
data2["date"] = datetime_object
# lets make date as index
data2= data2.set_index("date")
data2 
Out[32]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-01-10 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1992-02-10 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
1992-03-10 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
1993-03-15 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
1993-03-16 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
In [35]:
# Now we can select according to our date index
print(data2.loc["1993-03-16"])
print(data2.loc["1992-03-10":"1993-03-16"])
#                      5
Name          Charmander
Type 1              Fire
Type 2               NaN
HP                    39
Attack                52
Defense               43
Sp. Atk               60
Sp. Def               50
Speed                 65
Generation             1
Legendary          False
Name: 1993-03-16 00:00:00, dtype: object
            #           Name Type 1  Type 2  HP  Attack  Defense  Sp. Atk  \
date                                                                        
1992-03-10  3       Venusaur  Grass  Poison  80      82       83      100   
1993-03-15  4  Mega Venusaur  Grass  Poison  80     100      123      122   
1993-03-16  5     Charmander   Fire     NaN  39      52       43       60   

            Sp. Def  Speed  Generation  Legendary  
date                                               
1992-03-10      100     80           1      False  
1993-03-15      120     80           1      False  
1993-03-16       50     65           1      False  
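The same date-string selection works on any Series or DataFrame with a DatetimeIndex; a tiny synthetic sketch:

```python
import pandas as pd

idx = pd.to_datetime(["1992-03-10", "1993-03-15", "1993-03-16"])
s = pd.Series([3, 4, 5], index=idx)

print(s.loc["1993-03-16"])                        # 5 (exact date match)
print(s.loc["1992-03-10":"1993-03-16"].tolist())  # [3, 4, 5] (inclusive slice)
```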

F. Resampling pandas time series

•Resampling: statistical method over different time intervals

◾Needs a string to specify the frequency, like "M" = month end or "A" = year end

•Downsampling: reduce the rows to a slower frequency, like from daily to weekly

•Upsampling: increase the rows to a faster frequency, like from daily to hourly

•Interpolate: fill in values according to different methods like 'linear', 'time' or 'index'

◾https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
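The downsampling and upsampling ideas above, sketched on a tiny synthetic daily series (made up for illustration; "12h" means a 12-hour frequency):

```python
import pandas as pd

idx = pd.date_range("1992-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Downsampling: daily -> 2-day bins, aggregated with mean
print(s.resample("2D").mean().tolist())  # [1.5, 3.5, 5.5]

# Upsampling: daily -> 12-hourly; the new in-between rows are filled
# by linear interpolation
up = s.resample("12h").interpolate("linear")
print(up.head(3).tolist())               # [1.0, 1.5, 2.0]
```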
In [38]:
# We will use data2 that we created in the previous part
data2.resample("A").mean()
Out[38]:
# HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-12-31 2.0 61.666667 64.333333 65.0 81.666667 81.666667 61.666667 1.0 False
1993-12-31 4.5 59.500000 76.000000 83.0 91.000000 85.000000 72.500000 1.0 False
In [39]:
# Lets resample with month
data2.resample("M").mean()
# As you can see there are a lot of NaNs because data2 does not include all months
Out[39]:
# HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-01-31 1.0 45.0 49.0 49.0 65.0 65.0 45.0 1.0 0.0
1992-02-29 2.0 60.0 62.0 63.0 80.0 80.0 60.0 1.0 0.0
1992-03-31 3.0 80.0 82.0 83.0 100.0 100.0 80.0 1.0 0.0
1992-04-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-05-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-06-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-07-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-08-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-09-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-10-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-11-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-12-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1993-01-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1993-02-28 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1993-03-31 4.5 59.5 76.0 83.0 91.0 85.0 72.5 1.0 0.0
In [40]:
# In real life (with real data, not a frame we created like data2) we can solve this problem with interpolation
# We can interpolate from the first value of each month
data2.resample("M").first().interpolate("linear")
Out[40]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-01-31 1.000000 Bulbasaur Grass Poison 45.0 49.0 49.000000 65.000000 65.000000 45.0 1.0 0.0
1992-02-29 2.000000 Ivysaur Grass Poison 60.0 62.0 63.000000 80.000000 80.000000 60.0 1.0 0.0
1992-03-31 3.000000 Venusaur Grass Poison 80.0 82.0 83.000000 100.000000 100.000000 80.0 1.0 0.0
1992-04-30 3.083333 NaN NaN NaN 80.0 83.5 86.333333 101.833333 101.666667 80.0 1.0 0.0
1992-05-31 3.166667 NaN NaN NaN 80.0 85.0 89.666667 103.666667 103.333333 80.0 1.0 0.0
1992-06-30 3.250000 NaN NaN NaN 80.0 86.5 93.000000 105.500000 105.000000 80.0 1.0 0.0
1992-07-31 3.333333 NaN NaN NaN 80.0 88.0 96.333333 107.333333 106.666667 80.0 1.0 0.0
1992-08-31 3.416667 NaN NaN NaN 80.0 89.5 99.666667 109.166667 108.333333 80.0 1.0 0.0
1992-09-30 3.500000 NaN NaN NaN 80.0 91.0 103.000000 111.000000 110.000000 80.0 1.0 0.0
1992-10-31 3.583333 NaN NaN NaN 80.0 92.5 106.333333 112.833333 111.666667 80.0 1.0 0.0
1992-11-30 3.666667 NaN NaN NaN 80.0 94.0 109.666667 114.666667 113.333333 80.0 1.0 0.0
1992-12-31 3.750000 NaN NaN NaN 80.0 95.5 113.000000 116.500000 115.000000 80.0 1.0 0.0
1993-01-31 3.833333 NaN NaN NaN 80.0 97.0 116.333333 118.333333 116.666667 80.0 1.0 0.0
1993-02-28 3.916667 NaN NaN NaN 80.0 98.5 119.666667 120.166667 118.333333 80.0 1.0 0.0
1993-03-31 4.000000 Mega Venusaur Grass Poison 80.0 100.0 123.000000 122.000000 120.000000 80.0 1.0 0.0
In [41]:
# Or we can interpolate with mean()
data2.resample("M").mean().interpolate("linear")
Out[41]:
# HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-01-31 1.000 45.000000 49.0 49.0 65.00 65.00 45.000 1.0 0.0
1992-02-29 2.000 60.000000 62.0 63.0 80.00 80.00 60.000 1.0 0.0
1992-03-31 3.000 80.000000 82.0 83.0 100.00 100.00 80.000 1.0 0.0
1992-04-30 3.125 78.291667 81.5 83.0 99.25 98.75 79.375 1.0 0.0
1992-05-31 3.250 76.583333 81.0 83.0 98.50 97.50 78.750 1.0 0.0
1992-06-30 3.375 74.875000 80.5 83.0 97.75 96.25 78.125 1.0 0.0
1992-07-31 3.500 73.166667 80.0 83.0 97.00 95.00 77.500 1.0 0.0
1992-08-31 3.625 71.458333 79.5 83.0 96.25 93.75 76.875 1.0 0.0
1992-09-30 3.750 69.750000 79.0 83.0 95.50 92.50 76.250 1.0 0.0
1992-10-31 3.875 68.041667 78.5 83.0 94.75 91.25 75.625 1.0 0.0
1992-11-30 4.000 66.333333 78.0 83.0 94.00 90.00 75.000 1.0 0.0
1992-12-31 4.125 64.625000 77.5 83.0 93.25 88.75 74.375 1.0 0.0
1993-01-31 4.250 62.916667 77.0 83.0 92.50 87.50 73.750 1.0 0.0
1993-02-28 4.375 61.208333 76.5 83.0 91.75 86.25 73.125 1.0 0.0
1993-03-31 4.500 59.500000 76.0 83.0 91.00 85.00 72.500 1.0 0.0

Cheers!

/itsmecevi