In [1]:
%reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
In [2]:
import os
cwd = os.getcwd()
In [3]:
cwd
Out[3]:
'D:\\Users\\ceviherdian'

DATA SCIENTIST

In this tutorial, I explain only what you need to be a data scientist, neither more nor less.

A data scientist needs to have these skills:

1. Basic Tools: like Python, R, or SQL. You do not need to know everything; all you need is to learn how to use Python.

2. Basic Statistics: like mean, median, or standard deviation. If you know basic statistics, you can use Python easily.

3. Data Munging: working with messy and difficult data, like inconsistent date and string formatting. As you might guess, Python helps us here.

4. Data Visualization: the title is self-explanatory. We will visualize the data with Python, using libraries like matplotlib and seaborn.

5. Machine Learning: you do not need to understand the math behind machine learning techniques. All you need is to understand the basics of machine learning and learn how to implement it using Python.

4. Pandas Foundation

Data structures & analysis

A. Review of pandas

B. Building data frames from scratch

C. Visual exploratory data analysis

D. Statistical exploratory data analysis

E. Indexing pandas time series

F. Resampling pandas time series

A. Review of pandas

As you noticed, we have already learned some basics of pandas; now we will go deeper.

•single column = Series

•NaN = not a number (a missing entry)

•dataframe.values = the underlying NumPy array
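Those three points can be sketched with a tiny hypothetical frame (made up just for illustration):

```python
import numpy as np
import pandas as pd

# A tiny made-up frame, standing in for the pokemon data
df = pd.DataFrame({"Attack": [49, 62], "Defense": [49, 63]})

print(type(df["Attack"]).__name__)  # Series  -> a single column is a Series
print(type(df.values).__name__)     # ndarray -> .values is the underlying NumPy array

# NaN marks a missing ("not a number") entry
s = pd.Series([1.0, np.nan])
print(s.isna().tolist())            # [False, True]
```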

B. Building data frames from scratch

•We can build data frames from csv as we did earlier.

•Also we can build dataframe from dictionaries

* zip() function: returns an iterator of tuples,
where the i-th tuple contains the i-th element from each of the argument sequences or iterables (wrap it in list() to materialize it).

•Adding new column

•Broadcasting: Create new column and assign a value to entire column
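A quick look at zip() on its own before we use it below (in Python 3 it returns an iterator, so we wrap it in list() to inspect it):

```python
labels = ["country", "population"]
values = [["Spain", "France"], [11, 12]]

# zip pairs the i-th elements of each iterable into tuples
pairs = list(zip(labels, values))
print(pairs)        # [('country', ['Spain', 'France']), ('population', [11, 12])]
print(dict(pairs))  # a dict ready to feed into pd.DataFrame
```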

Data & Package:

In [13]:
#Package: matplotlib, seaborn,numpy, and pandas (for dataframe data structure manipulation)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool
In [20]:
data = pd.read_csv('pokemon.csv')
In [14]:
# data frames from dictionary
country = ["Spain","France"]
population = ["11","12"]
list_label = ["country","population"]
list_col = [country,population]
zipped = list(zip(list_label,list_col))
data_dict = dict(zipped)
df = pd.DataFrame(data_dict)
df
Out[14]:
country population
0 Spain 11
1 France 12
In [15]:
# Add new columns
df["capital"] = ["madrid","paris"]
df
Out[15]:
country population capital
0 Spain 11 madrid
1 France 12 paris
In [16]:
# Broadcasting
df["income"] = 0 #Broadcasting entire column
df
Out[16]:
country population capital income
0 Spain 11 madrid 0
1 France 12 paris 0
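Broadcasting is not limited to assigning a scalar; arithmetic on whole columns is broadcast element-wise as well. A small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Spain", "France"]})
df["income"] = 0                    # scalar broadcast: every row gets 0
df["income"] = df["income"] + 100   # column arithmetic, also broadcast row by row
print(df["income"].tolist())        # [100, 100]
```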

C. Visual exploratory data analysis

•Plot

•Subplot

•Histogram:

◾bins: number of bins

◾range (tuple): min and max values of the bins

◾normed (boolean): normalize or not (deprecated; newer matplotlib uses density instead)

◾cumulative (boolean): compute cumulative distribution
In [21]:
# Plotting all data 
data1 = data.loc[:,["Attack","Defense","Speed"]]
data1.plot()
# it is confusing
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x9867940>
In [22]:
# subplots
data1.plot(subplots = True)
plt.show()
In [23]:
# scatter plot  
data1.plot(kind = "scatter",x="Attack",y = "Defense")
plt.show()
In [24]:
# hist plot  
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),normed = True)
D:\Users\ceviherdian\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0xae51128>
In [25]:
# histogram subplot with non cumulative and cumulative
fig, axes = plt.subplots(nrows=2,ncols=1)
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),normed = True,ax = axes[0])
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),normed = True,ax = axes[1],cumulative = True)
plt.savefig('graph.png')
plt
D:\Users\ceviherdian\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
D:\Users\ceviherdian\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Out[25]:
<module 'matplotlib.pyplot' from 'D:\\Users\\ceviherdian\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>

D. Statistical exploratory data analysis

Let's look one more time:

•count: number of entries

•mean: average of entries

•std: standard deviation

•min: minimum entry

•25%: first quartile

•50%: median (second quartile)

•75%: third quartile

•max: maximum entry
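Each of those statistics can also be computed individually; a quick check on a small made-up series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
print(s.count())         # 8
print(s.mean())          # 4.5
print(s.std())           # standard deviation
print(s.quantile(0.25))  # 2.75 -> first quartile
print(s.median())        # 4.5  -> same as s.quantile(0.5)
print(s.max())           # 8
```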

In [28]:
data.describe()
Out[28]:
# HP Attack Defense Sp. Atk Sp. Def Speed Generation
count 800.0000 800.000000 800.000000 800.000000 800.000000 800.000000 800.000000 800.00000
mean 400.5000 69.258750 79.001250 73.842500 72.820000 71.902500 68.277500 3.32375
std 231.0844 25.534669 32.457366 31.183501 32.722294 27.828916 29.060474 1.66129
min 1.0000 1.000000 5.000000 5.000000 10.000000 20.000000 5.000000 1.00000
25% 200.7500 50.000000 55.000000 50.000000 49.750000 50.000000 45.000000 2.00000
50% 400.5000 65.000000 75.000000 70.000000 65.000000 70.000000 65.000000 3.00000
75% 600.2500 80.000000 100.000000 90.000000 95.000000 90.000000 90.000000 5.00000
max 800.0000 255.000000 190.000000 230.000000 194.000000 230.000000 180.000000 6.00000

E. Indexing pandas time series

•datetime = object: dates read as text are plain strings (object dtype) until converted

•parse_dates (boolean): parse dates into datetime objects in ISO 8601 (yyyy-mm-dd hh:mm:ss) format

In [34]:
time_list = ["1992-03-08","1992-04-12"]
print(type(time_list[1])) # As you can see date is string
# however we want it to be datetime object
datetime_object = pd.to_datetime(time_list)
print(type(datetime_object))
<class 'str'>
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
In [32]:
# close warning
import warnings
warnings.filterwarnings("ignore")
# In order to practice, let's take the head of the pokemon data and add a date list to it
data2 = data.head()
date_list = ["1992-01-10","1992-02-10","1992-03-10","1993-03-15","1993-03-16"]
datetime_object = pd.to_datetime(date_list)
data2["date"] = datetime_object
# lets make date as index
data2= data2.set_index("date")
data2 
Out[32]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-01-10 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1992-02-10 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
1992-03-10 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
1993-03-15 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
1993-03-16 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
In [35]:
# Now we can select according to our date index
print(data2.loc["1993-03-16"])
print(data2.loc["1992-03-10":"1993-03-16"])
#                      5
Name          Charmander
Type 1              Fire
Type 2               NaN
HP                    39
Attack                52
Defense               43
Sp. Atk               60
Sp. Def               50
Speed                 65
Generation             1
Legendary          False
Name: 1993-03-16 00:00:00, dtype: object
            #           Name Type 1  Type 2  HP  Attack  Defense  Sp. Atk  \
date                                                                        
1992-03-10  3       Venusaur  Grass  Poison  80      82       83      100   
1993-03-15  4  Mega Venusaur  Grass  Poison  80     100      123      122   
1993-03-16  5     Charmander   Fire     NaN  39      52       43       60   

            Sp. Def  Speed  Generation  Legendary  
date                                               
1992-03-10      100     80           1      False  
1993-03-15      120     80           1      False  
1993-03-16       50     65           1      False  
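The same date-string selection works on any Series or DataFrame with a DatetimeIndex; a tiny synthetic sketch:

```python
import pandas as pd

idx = pd.to_datetime(["1992-03-10", "1993-03-15", "1993-03-16"])
s = pd.Series([3, 4, 5], index=idx)

print(s.loc["1993-03-16"])                        # 5 (exact date match)
print(s.loc["1992-03-10":"1993-03-16"].tolist())  # [3, 4, 5] (inclusive slice)
```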

F. Resampling pandas time series

•Resampling: statistical method over different time intervals

◾Needs a string to specify the frequency, like "M" = month end or "A" = year end

•Downsampling: reduce the rows to a slower frequency, like from daily to weekly

•Upsampling: increase the rows to a faster frequency, like from daily to hourly

•Interpolate: fill in values according to different methods like 'linear', 'time' or 'index'

◾https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
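The downsampling and upsampling ideas above, sketched on a tiny synthetic daily series (made up for illustration; "12h" means a 12-hour frequency):

```python
import pandas as pd

idx = pd.date_range("1992-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Downsampling: daily -> 2-day bins, aggregated with mean
print(s.resample("2D").mean().tolist())  # [1.5, 3.5, 5.5]

# Upsampling: daily -> 12-hourly; the new in-between rows are filled
# by linear interpolation
up = s.resample("12h").interpolate("linear")
print(up.head(3).tolist())               # [1.0, 1.5, 2.0]
```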
In [38]:
# We will use data2 that we created in the previous part
data2.resample("A").mean()
Out[38]:
# HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-12-31 2.0 61.666667 64.333333 65.0 81.666667 81.666667 61.666667 1.0 False
1993-12-31 4.5 59.500000 76.000000 83.0 91.000000 85.000000 72.500000 1.0 False
In [39]:
# Lets resample with month
data2.resample("M").mean()
# As you can see there are a lot of NaNs because data2 does not include all months
Out[39]:
# HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-01-31 1.0 45.0 49.0 49.0 65.0 65.0 45.0 1.0 0.0
1992-02-29 2.0 60.0 62.0 63.0 80.0 80.0 60.0 1.0 0.0
1992-03-31 3.0 80.0 82.0 83.0 100.0 100.0 80.0 1.0 0.0
1992-04-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-05-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-06-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-07-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-08-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-09-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-10-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-11-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1992-12-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1993-01-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1993-02-28 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1993-03-31 4.5 59.5 76.0 83.0 91.0 85.0 72.5 1.0 0.0
In [40]:
# In real life (with real data, not a frame we created like data2) we can solve this problem with interpolation
# We can interpolate from the first value of each month
data2.resample("M").first().interpolate("linear")
Out[40]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-01-31 1.000000 Bulbasaur Grass Poison 45.0 49.0 49.000000 65.000000 65.000000 45.0 1.0 0.0
1992-02-29 2.000000 Ivysaur Grass Poison 60.0 62.0 63.000000 80.000000 80.000000 60.0 1.0 0.0
1992-03-31 3.000000 Venusaur Grass Poison 80.0 82.0 83.000000 100.000000 100.000000 80.0 1.0 0.0
1992-04-30 3.083333 NaN NaN NaN 80.0 83.5 86.333333 101.833333 101.666667 80.0 1.0 0.0
1992-05-31 3.166667 NaN NaN NaN 80.0 85.0 89.666667 103.666667 103.333333 80.0 1.0 0.0
1992-06-30 3.250000 NaN NaN NaN 80.0 86.5 93.000000 105.500000 105.000000 80.0 1.0 0.0
1992-07-31 3.333333 NaN NaN NaN 80.0 88.0 96.333333 107.333333 106.666667 80.0 1.0 0.0
1992-08-31 3.416667 NaN NaN NaN 80.0 89.5 99.666667 109.166667 108.333333 80.0 1.0 0.0
1992-09-30 3.500000 NaN NaN NaN 80.0 91.0 103.000000 111.000000 110.000000 80.0 1.0 0.0
1992-10-31 3.583333 NaN NaN NaN 80.0 92.5 106.333333 112.833333 111.666667 80.0 1.0 0.0
1992-11-30 3.666667 NaN NaN NaN 80.0 94.0 109.666667 114.666667 113.333333 80.0 1.0 0.0
1992-12-31 3.750000 NaN NaN NaN 80.0 95.5 113.000000 116.500000 115.000000 80.0 1.0 0.0
1993-01-31 3.833333 NaN NaN NaN 80.0 97.0 116.333333 118.333333 116.666667 80.0 1.0 0.0
1993-02-28 3.916667 NaN NaN NaN 80.0 98.5 119.666667 120.166667 118.333333 80.0 1.0 0.0
1993-03-31 4.000000 Mega Venusaur Grass Poison 80.0 100.0 123.000000 122.000000 120.000000 80.0 1.0 0.0
In [41]:
# Or we can interpolate with mean()
data2.resample("M").mean().interpolate("linear")
Out[41]:
# HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
date
1992-01-31 1.000 45.000000 49.0 49.0 65.00 65.00 45.000 1.0 0.0
1992-02-29 2.000 60.000000 62.0 63.0 80.00 80.00 60.000 1.0 0.0
1992-03-31 3.000 80.000000 82.0 83.0 100.00 100.00 80.000 1.0 0.0
1992-04-30 3.125 78.291667 81.5 83.0 99.25 98.75 79.375 1.0 0.0
1992-05-31 3.250 76.583333 81.0 83.0 98.50 97.50 78.750 1.0 0.0
1992-06-30 3.375 74.875000 80.5 83.0 97.75 96.25 78.125 1.0 0.0
1992-07-31 3.500 73.166667 80.0 83.0 97.00 95.00 77.500 1.0 0.0
1992-08-31 3.625 71.458333 79.5 83.0 96.25 93.75 76.875 1.0 0.0
1992-09-30 3.750 69.750000 79.0 83.0 95.50 92.50 76.250 1.0 0.0
1992-10-31 3.875 68.041667 78.5 83.0 94.75 91.25 75.625 1.0 0.0
1992-11-30 4.000 66.333333 78.0 83.0 94.00 90.00 75.000 1.0 0.0
1992-12-31 4.125 64.625000 77.5 83.0 93.25 88.75 74.375 1.0 0.0
1993-01-31 4.250 62.916667 77.0 83.0 92.50 87.50 73.750 1.0 0.0
1993-02-28 4.375 61.208333 76.5 83.0 91.75 86.25 73.125 1.0 0.0
1993-03-31 4.500 59.500000 76.0 83.0 91.00 85.00 72.500 1.0 0.0

Cheers!

/itsmecevi