In [61]:
import os
cwd = os.getcwd()
In [62]:
cwd
Out[62]:
'D:\\Users\\ceviherdian'

DATA SCIENTIST

In this tutorial, I explain only what you need to become a data scientist, neither more nor less.

A data scientist needs these skills:

1. Basic Tools: Python, R, or SQL. You do not need to know everything. You only need to learn how to use Python.

2. Basic Statistics: Mean, median, or standard deviation. If you know basic statistics, you can apply them easily in Python.

3. Data Munging: Working with messy and difficult data, such as inconsistent date and string formatting. As you might guess, Python helps us here.

4. Data Visualization: The title is self-explanatory. We will visualize data with Python libraries such as matplotlib and seaborn.

5. Machine Learning: You do not need to understand the math behind each machine learning technique. You only need to understand the basics of machine learning and learn how to implement it in Python.
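
As a small illustration of the basic statistics in point 2, here is a minimal sketch that uses Python's built-in statistics module. The list holds the HP values of the five rows that data.head(5) prints in this tutorial; the variable names are only for illustration.

```python
import statistics

# HP values of the five rows printed by data.head(5) in this tutorial
hp_values = [45, 60, 80, 80, 39]

mean_hp = statistics.mean(hp_values)      # arithmetic average
median_hp = statistics.median(hp_values)  # middle value of the sorted data
std_hp = statistics.stdev(hp_values)      # sample standard deviation (divides by n - 1)

print("mean:", mean_hp)
print("median:", median_hp)
print("standard deviation:", round(std_hp, 2))
```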

In this part, you will learn:

•how to import a CSV file

•how to draw line, scatter and histogram plots

•basic dictionary features

•basic pandas features such as filtering, which is used all the time and is essential for a data scientist

•while and for loops

1. INTRODUCTION TO PYTHON

A. MATPLOTLIB

B. DICTIONARY

C. PANDAS

D. Logic, control flow and filtering

E. Loop data structures

A. MATPLOTLIB

Matplotlib is a Python library that helps us plot data. The easiest and most basic plots are line, scatter and histogram plots.

•A line plot is better when the x axis is time.

•A scatter plot is better when we want to see the correlation between two variables.

•A histogram is better when we need to see the distribution of numerical data.

•Customization: colors, labels, line thickness, title, opacity, grid, figsize, axis ticks and linestyle
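
The three plot types and the customization options listed above can be sketched with plain matplotlib as follows. This is only an illustration on made-up random data; the function name make_demo_figure and all variable names are assumptions, not part of the tutorial's dataset.

```python
import matplotlib.pyplot as plt
import numpy as np


def make_demo_figure():
    """Draw a line, a scatter and a histogram plot from made-up data."""
    rng = np.random.default_rng(0)                      # fixed seed for reproducibility
    days = np.arange(30)                                # time on the x axis
    price = 100 + np.cumsum(rng.normal(size=30))        # random walk over time
    height = rng.normal(170, 10, size=200)
    weight = 0.5 * height + rng.normal(0, 5, size=200)  # roughly linear in height

    fig, (ax_line, ax_scatter, ax_hist) = plt.subplots(1, 3, figsize=(15, 4))

    # Line plot: color, thickness, linestyle, opacity, ticks, grid and legend
    ax_line.plot(days, price, color='g', linewidth=2, linestyle='--',
                 alpha=0.8, label='price')
    ax_line.set_xticks(list(range(0, 30, 5)))
    ax_line.set_xlabel('day')
    ax_line.set_ylabel('price')
    ax_line.set_title('Line plot')
    ax_line.grid(True)
    ax_line.legend(loc='upper left')

    # Scatter plot: two variables that are related to each other
    ax_scatter.scatter(height, weight, color='red', alpha=0.5)
    ax_scatter.set_xlabel('height')
    ax_scatter.set_ylabel('weight')
    ax_scatter.set_title('Scatter plot')

    # Histogram: distribution of one numerical variable
    ax_hist.hist(height, bins=20, color='blue', alpha=0.7)
    ax_hist.set_xlabel('height')
    ax_hist.set_title('Histogram')

    return fig, (ax_line, ax_scatter, ax_hist)


fig, axes = make_demo_figure()
```

In a script, call plt.show() afterwards to display the figure. A notebook normally shows it on its own.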

Data and Package:

In [34]:
# Packages: matplotlib, seaborn, numpy, and pandas (for DataFrame manipulation)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool
In [21]:
data = pd.read_csv('pokemon.csv')
In [22]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
#             800 non-null int64
Name          799 non-null object
Type 1        800 non-null object
Type 2        414 non-null object
HP            800 non-null int64
Attack        800 non-null int64
Defense       800 non-null int64
Sp. Atk       800 non-null int64
Sp. Def       800 non-null int64
Speed         800 non-null int64
Generation    800 non-null int64
Legendary     800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 69.6+ KB
In [23]:
data.head(5)
Out[23]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
4 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
In [26]:
data.columns
Out[26]:
Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')
In [27]:
# Line Plot
# color = color, label = label, linewidth = width of line, alpha = opacity, grid = grid, linestyle = style of line
data.Speed.plot(kind = 'line', color = 'g',label = 'Speed',linewidth=1,alpha = 0.5,grid = True,linestyle = ':')
data.Defense.plot(color = 'r',label = 'Defense',linewidth=1, alpha = 0.5,grid = True,linestyle = '-.')
plt.legend(loc='upper right')     # legend = puts label into plot
plt.xlabel('x axis')              # label = name of label
plt.ylabel('y axis')
plt.title('Line Plot')            # title = title of plot
plt.show()
In [28]:
# Scatter Plot 
# x = attack, y = defense
data.plot(kind='scatter', x='Attack', y='Defense',alpha = 0.5,color = 'red')
plt.xlabel('Attack')              # label = name of label
plt.ylabel('Defense')
plt.title('Attack Defense Scatter Plot')            # title = title of plot
Out[28]:
Text(0.5,1,'Attack Defense Scatter Plot')
In [30]:
# Histogram
# bins = number of bars in the figure
data.Speed.plot(kind = 'hist',bins = 50,figsize = (12,12),color='red')
plt.show()

More about matplotlib: https://matplotlib.org/index.html

B. DICTIONARY

Why do we need a dictionary?

•It has a 'key' and a 'value'

•Looking up a value by its key is faster than searching a list

What are key and value? Example:

•dictionary = {'spain' : 'madrid'}

•The key is spain.

•The value is madrid.

It's that easy.

Let's practice some other operations: keys(), values(), updating, adding, checking, removing a key, removing all entries and deleting the dictionary.

In [35]:
# create a dictionary and look at its keys and values
dictionary = {'spain' : 'madrid','usa' : 'vegas'}
print(dictionary.keys())
print(dictionary.values())
dict_keys(['spain', 'usa'])
dict_values(['madrid', 'vegas'])
In [36]:
# Keys have to be immutable objects like strings, booleans, floats, integers or tuples
# A list is mutable, so it cannot be a key
# Keys are unique
dictionary['spain'] = "barcelona"    # update existing entry
print(dictionary)
dictionary['france'] = "paris"       # add new entry
print(dictionary)
del dictionary['spain']              # remove entry with key 'spain'
print(dictionary)
print('france' in dictionary)        # check whether the key is in the dict
dictionary.clear()                   # remove all entries in dict
print(dictionary)
{'spain': 'barcelona', 'usa': 'vegas'}
{'spain': 'barcelona', 'usa': 'vegas', 'france': 'paris'}
{'usa': 'vegas', 'france': 'paris'}
True
{}
In [37]:
# The next line is commented out so that all the code runs; uncomment it to delete the dictionary
# del dictionary         # delete entire dictionary
print(dictionary)       # raises NameError once the dictionary is deleted; prints {} here
{}
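
A few more dictionary operations can be sketched as follows. The examples with get(), pop() and tuple keys go beyond the cells above and are only an illustration; the variable names are made up.

```python
capitals = {'spain': 'madrid', 'france': 'paris'}

# get() returns a default value instead of raising KeyError for a missing key
italy = capitals.get('italy', 'unknown')
print(italy)

# pop() removes a key and returns its value
removed = capitals.pop('spain')
print(removed, capitals)

# A tuple is immutable, so it can be used as a key
grid = {(0, 0): 'origin'}
print(grid[(0, 0)])

# A list is mutable, so using it as a key raises TypeError
try:
    bad = {[0, 0]: 'origin'}
    list_key_error = None
except TypeError as error:
    list_key_error = type(error).__name__
    print('TypeError:', error)
```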

C. PANDAS

What do we need to know about pandas?

•CSV: comma-separated values
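
To make the CSV format concrete, here is a minimal sketch that reads CSV text from memory instead of from a file. The two rows copy values shown for pokemon.csv earlier in this tutorial; the names csv_text and sample are assumptions for illustration.

```python
import io

import pandas as pd

# CSV text: the first line holds the column names, every other line is one row,
# and values are separated by commas. An empty field becomes NaN.
csv_text = """Name,Type 1,Type 2,HP
Bulbasaur,Grass,Poison,45
Charmander,Fire,,39
"""

sample = pd.read_csv(io.StringIO(csv_text))   # read_csv also accepts file-like objects
missing_type2 = sample['Type 2'].isna().sum() # number of missing 'Type 2' values

print(sample)
print(missing_type2)
```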

In [41]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
In [42]:
data = pd.read_csv('pokemon.csv')
In [43]:
series = data['Defense']        # data['Defense'] = series
print(type(series))
data_frame = data[['Defense']]  # data[['Defense']] = data frame
print(type(data_frame))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
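
The difference between single and double brackets can also be seen in the shape of the result. Below is a minimal sketch on a small hand-made frame (the name small and its three rows are for illustration only, with values taken from the head of the dataset).

```python
import pandas as pd

small = pd.DataFrame({'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur'],
                      'Attack': [49, 62, 82],
                      'Defense': [49, 63, 83]})

defense_series = small['Defense']            # one column label -> Series (1-D)
defense_frame = small[['Defense']]           # list of labels -> DataFrame (2-D)
two_columns = small[['Attack', 'Defense']]   # the list form can select several columns

print(defense_series.shape)
print(defense_frame.shape)
print(two_columns.shape)
```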

More about pandas: https://pandas.pydata.org/

D. Logic, control flow and filtering

Before continuing with pandas, we need to learn logic, control flow and filtering.

•Comparison operators: ==, <, >, <=

•Boolean operators: and, or, not

•Filtering pandas

In [45]:
# Comparison operator
print(3 > 2)
print(3!=2)
# Boolean operators
print(True and False)
print(True or False)
True
True
False
True
In [46]:
# 1 - Filtering Pandas data frame
x = data['Defense']>200     # Only 3 Pokemon have a defense value higher than 200
data[x]
Out[46]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
224 225 Mega Steelix Steel Ground 75 125 230 55 95 30 2 False
230 231 Shuckle Bug Rock 20 10 230 10 230 5 2 False
333 334 Mega Aggron Steel NaN 70 140 230 60 80 50 3 False
In [47]:
# 2 - Filtering pandas with logical_and
# Only 2 Pokemon have a defense value higher than 200 and an attack value higher than 100
data[np.logical_and(data['Defense']>200, data['Attack']>100 )]
Out[47]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
224 225 Mega Steelix Steel Ground 75 125 230 55 95 30 2 False
333 334 Mega Aggron Steel NaN 70 140 230 60 80 50 3 False
In [50]:
# This gives the same result as the previous line of code, so we can also use '&' for filtering.
data[(data['Defense']>200) & (data['Attack']>100)]
Out[50]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
224 225 Mega Steelix Steel Ground 75 125 230 55 95 30 2 False
333 334 Mega Aggron Steel NaN 70 140 230 60 80 50 3 False
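
The '&' filter has two companions, '|' for OR and '~' for NOT. The sketch below shows all three on a small hand-made frame (the name small is for illustration; the values come from rows shown in this tutorial). It also shows why Python's plain "and" does not work on a whole column.

```python
import pandas as pd

small = pd.DataFrame({'Name': ['Mega Steelix', 'Shuckle', 'Mega Aggron', 'Bulbasaur'],
                      'Attack': [125, 10, 140, 49],
                      'Defense': [230, 230, 230, 49]})

high_defense = small['Defense'] > 200   # boolean Series, one True/False per row
high_attack = small['Attack'] > 100

both = small[high_defense & high_attack]      # AND, element by element
either = small[high_defense | high_attack]    # OR, element by element
not_high_defense = small[~high_defense]       # NOT, element by element

# Python's "and" needs a single True/False for the whole Series, which is ambiguous
try:
    small[high_defense and high_attack]
    and_error = None
except ValueError as error:
    and_error = type(error).__name__
    print('ValueError:', error)

print(both)
```

When the comparisons are written inline, each one needs its own parentheses, as in (data['Defense']>200) & (data['Attack']>100), because '&' binds more tightly than '>'.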

E. Loop data structures

We will learn the most basic while and for loops.

In [54]:
# Stay in the loop while the condition (i is not equal to 5) is true
i = 0
while i != 5 :
    print('i is: ',i)
    i +=1 
print(i,' is equal to 5')
i is:  0
i is:  1
i is:  2
i is:  3
i is:  4
5  is equal to 5
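
One caution about the condition i != 5: it only ends the loop if i lands exactly on 5. The following sketch, which is an addition for illustration and not part of the original notebook, counts in steps of 2 and therefore uses i < 5 instead.

```python
# Counting in steps of 2 skips the value 5, so "while i != 5" would never stop.
# "while i < 5" ends as soon as i reaches or passes 5.
i = 0
visited = []
while i < 5:
    visited.append(i)
    i += 2

print(visited)   # [0, 2, 4]
print(i)         # 6
```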
In [55]:
# Visit each element of the list in order
lis = [1,2,3,4,5]
for i in lis:
    print('i is: ',i)
print('')
i is:  1
i is:  2
i is:  3
i is:  4
i is:  5

In [57]:
# Visit each element of the list in order
lis = [1,2,3,4,5]
for i in lis:
    print('i is: ',i)
print('')

# Enumerate index and value of list
# index : value = 0:1, 1:2, 2:3, 3:4, 4:5
for index, value in enumerate(lis):
    print(index," : ",value)
print('')   

# For dictionaries
# We can use a for loop to access the keys and values of a dictionary. We learned about keys and values in the dictionary part.
dictionary = {'spain':'madrid','france':'paris'}
for key,value in dictionary.items():
    print(key," : ",value)
print('')

# For pandas we can access the index and value of each row
for index,value in data[['Attack']][0:1].iterrows():
    print(index," : ",value)
i is:  1
i is:  2
i is:  3
i is:  4
i is:  5

0  :  1
1  :  2
2  :  3
3  :  4
4  :  5

spain  :  madrid
france  :  paris

0  :  Attack    49
Name: 0, dtype: int64
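
As the output above shows, the value that iterrows() yields is a whole row (a Series), not a single number. A minimal sketch of how to pull one value out of it, together with itertuples() as an alternative, could look like this; the frame small and the variable names are assumptions for illustration.

```python
import pandas as pd

small = pd.DataFrame({'Name': ['Bulbasaur', 'Ivysaur'],
                      'Attack': [49, 62]})

# iterrows() yields (index, row) pairs, where row is a Series holding that row's values
attacks = []
for index, row in small.iterrows():
    attacks.append(row['Attack'])     # pick one value out of the row by column label

# itertuples() yields one namedtuple per row; fields are read as attributes
names = [record.Name for record in small.itertuples()]

print(attacks)
print(names)
```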

Cheers!

/itsmecevi