import os
cwd = os.getcwd()

cwd

'D:\\Users\\ceviherdian'

DATA SCIENTIST

In this tutorial, I explain only what you need to become a data scientist, neither more nor less.

A data scientist needs these skills:

1. Basic Tools: Such as Python, R or SQL. You do not need to know everything; for this tutorial, learning how to use Python is enough.

2. Basic Statistics: Such as mean, median or standard deviation. If you know basic statistics, you can apply them easily in Python.

3. Data Munging: Working with messy and difficult data, such as inconsistent date and string formatting. As you might guess, Python helps us here.

4. Data Visualization: The title is self-explanatory. We will visualize data with Python libraries such as matplotlib and seaborn.

5. Machine Learning: You do not need to understand the math behind every machine learning technique. You only need to understand the basics of machine learning and learn how to implement it in Python.
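
As a small illustration of skills 2 and 3, here is a minimal sketch that uses only Python's standard library. The HP values are those of the first five Pokemon in the dataset used in this tutorial; the messy date and type strings are made-up examples, and `parse_date` is a hypothetical helper name, not part of any library.

```python
import statistics
from datetime import datetime

# Basic statistics on a small sample (HP of the first five Pokemon).
hp = [45, 60, 80, 80, 39]
mean_hp = statistics.mean(hp)      # arithmetic average
median_hp = statistics.median(hp)  # middle value of the sorted sample
std_hp = statistics.stdev(hp)      # sample standard deviation
print(mean_hp, median_hp, round(std_hp, 2))

# Data munging: the same date written in three inconsistent formats (made-up data).
raw_dates = ["2021-03-05", "05/03/2021", "March 5, 2021"]
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def parse_date(text):
    """Try each known format in turn and return a date (hypothetical helper)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

clean_dates = [parse_date(d) for d in raw_dates]
print(clean_dates)

# Inconsistent string formatting: strip whitespace and normalize case.
raw_types = [" Grass", "grass ", "GRASS"]
clean_types = [t.strip().lower() for t in raw_types]
print(clean_types)
```

After cleaning, all three dates compare equal and all three type strings become 'grass', which is what makes later grouping and filtering reliable.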

In this part, you will learn:

•how to import a CSV file

•how to plot line, scatter and histogram charts

•basic dictionary features

•basic pandas features such as filtering, which is used constantly and is central to a data scientist's work

•while and for loops

1. INTRODUCTION TO PYTHON

A. MATPLOTLIB

B. DICTIONARY

C. PANDAS

D. Logic, control flow and filtering

E. Loop data structures

A. MATPLOTLIB

Matplotlib is a Python library that helps us plot data. The simplest and most basic plots are line, scatter and histogram plots.

•Line plot is better when the x axis is time.

•Scatter plot is better when we want to check for correlation between two variables.

•Histogram is better when we need to see the distribution of numerical data.

•Customization: colors, labels, line thickness, title, opacity, grid, figsize, axis ticks and linestyle.
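
The customization options listed above correspond to arguments and methods in matplotlib. The following is a minimal sketch on made-up data (squares and multiples of ten), not on the Pokemon dataset; the function name `make_figure` is only an illustrative choice.

```python
import numpy as np
import matplotlib.pyplot as plt

def make_figure():
    """Build a line plot that exercises the customization options listed above."""
    x = np.arange(10)
    fig, ax = plt.subplots(figsize=(8, 4))   # figsize = (width, height) in inches
    # color, linewidth (thickness), linestyle, alpha (opacity) and label
    ax.plot(x, x ** 2, color="g", linewidth=2, linestyle=":", alpha=0.7, label="squares")
    ax.plot(x, 10 * x, color="r", linewidth=1, linestyle="-.", alpha=0.7, label="tens")
    ax.set_xticks(range(0, 10, 3))           # axis ticks at 0, 3, 6, 9
    ax.set_xlabel("x axis")
    ax.set_ylabel("y axis")
    ax.set_title("Customized Line Plot")
    ax.grid(True)
    ax.legend(loc="upper left")              # legend shows the labels
    return fig, ax

fig, ax = make_figure()
# plt.show()  # uncomment when running interactively
```

Working with the `ax` object instead of the global `plt` functions is optional, but it tends to be easier to follow once a notebook contains more than one figure.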

Data and Package:

#Package: matplotlib, seaborn,numpy, and pandas (for dataframe data structure manipulation)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool

data = pd.read_csv('pokemon.csv')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
#             800 non-null int64
Name          799 non-null object
Type 1        800 non-null object
Type 2        414 non-null object
HP            800 non-null int64
Attack        800 non-null int64
Defense       800 non-null int64
Sp. Atk       800 non-null int64
Sp. Def       800 non-null int64
Speed         800 non-null int64
Generation    800 non-null int64
Legendary     800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 69.6+ KB

data.head(5)

data.columns

Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

# Line Plot
# color = color, label = label, linewidth = width of line, alpha = opacity, grid = grid, linestyle = style of line
data.Speed.plot(kind = 'line', color = 'g',label = 'Speed',linewidth=1,alpha = 0.5,grid = True,linestyle = ':')
data.Defense.plot(color = 'r',label = 'Defense',linewidth=1, alpha = 0.5,grid = True,linestyle = '-.')
plt.legend(loc='upper right')     # legend = puts label into plot
plt.xlabel('x axis')              # label = name of label
plt.ylabel('y axis')
plt.title('Line Plot')            # title = title of plot
plt.show()

# Scatter Plot 
# x = attack, y = defense
data.plot(kind='scatter', x='Attack', y='Defense',alpha = 0.5,color = 'red')
plt.xlabel('Attack')              # label = name of label
plt.ylabel('Defense')
plt.title('Attack Defense Scatter Plot')            # title = title of plot
plt.show()
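
A scatter plot only suggests a relationship visually. To put a number on it, one option is the Pearson correlation coefficient. The sketch below uses just the Attack and Defense values of the first five Pokemon, so the result says nothing reliable about the full dataset; it only shows the call.

```python
import pandas as pd

# Attack and Defense of the first five Pokemon in the dataset.
sample = pd.DataFrame({
    "Attack": [49, 62, 82, 100, 52],
    "Defense": [49, 63, 83, 123, 43],
})

# Pearson correlation: +1 is a perfect rising line, -1 a perfect falling line,
# and 0 means no linear relationship.
r = sample["Attack"].corr(sample["Defense"])
print(round(r, 3))
```

On the full data frame, the same call would be `data['Attack'].corr(data['Defense'])`.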

# Histogram
# bins = number of bars in the figure
data.Speed.plot(kind = 'hist',bins = 50,figsize = (12,12),color='red')
plt.show()
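
To see what the `bins` argument does, it can help to compute the bar heights without drawing anything. This sketch uses NumPy's `np.histogram` on made-up speed values (not taken from the dataset) with 4 bins instead of 50 so the counts are easy to check by hand.

```python
import numpy as np

# Made-up speed values, for illustration only.
speeds = np.array([5, 30, 45, 45, 50, 60, 65, 80, 80, 100])

# Split the range [0, 100] into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(speeds, bins=4, range=(0, 100))
print(edges)   # bin boundaries: 0, 25, 50, 75, 100
print(counts)  # number of values per bin: 1, 3, 3, 3
```

Each count is the height of one bar. More bins give narrower bars and a more detailed, but noisier, picture of the distribution.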

More about matplotlib: https://matplotlib.org/index.html

B. DICTIONARY

Why do we need a dictionary?

•It stores data as 'key' and 'value' pairs.

•Looking up a value by its key is faster than searching through a list.

What are key and value? Example:

•dictionary = {'spain' : 'madrid'}

•The key is spain.

•The value is madrid.

It's that easy.
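
To make the comparison with lists concrete, here is a small sketch that stores the same data both ways. The extra countries and the helper name `find_in_list` are made up for illustration.

```python
# The same capital-city data stored two ways.
pairs = [("spain", "madrid"), ("france", "paris"), ("turkey", "ankara")]
capitals = dict(pairs)

def find_in_list(country):
    """Linear search: check each (key, value) pair until the key matches."""
    for key, value in pairs:
        if key == country:
            return value
    return None

# A dictionary lookup goes straight to the entry through the key's hash.
print(capitals["turkey"])
print(find_in_list("turkey"))

# .get() returns a default instead of raising KeyError for a missing key.
print(capitals.get("italy", "unknown"))
```

With three entries the difference is invisible, but the list search has to walk through more pairs as the data grows, while the dictionary lookup stays roughly constant on average.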

Let's practice some other operations: keys(), values(), updating, adding, checking and removing a key, removing all entries, and deleting the dictionary.

#create dictionary and look its keys and values
dictionary = {'spain' : 'madrid','usa' : 'vegas'}
print(dictionary.keys())
print(dictionary.values())

dict_keys(['spain', 'usa'])
dict_values(['madrid', 'vegas'])

# Keys have to be immutable (hashable) objects such as strings, booleans, floats, integers or tuples
# A list is mutable, so it cannot be used as a key
# Keys are unique
dictionary['spain'] = "barcelona"    # update existing entry
print(dictionary)
dictionary['france'] = "paris"       # Add new entry
print(dictionary)
del dictionary['spain']              # remove entry with key 'spain'
print(dictionary)
print('france' in dictionary)        # check whether the key exists
dictionary.clear()                   # remove all entries in dict
print(dictionary)

{'spain': 'barcelona', 'usa': 'vegas'}
{'spain': 'barcelona', 'usa': 'vegas', 'france': 'paris'}
{'usa': 'vegas', 'france': 'paris'}
True
{}
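
The rules about keys in the comments above can be checked directly. A minimal sketch, with made-up coordinate values:

```python
# A tuple is immutable, so it can be used as a key.
coordinates = {(40.4, -3.7): "madrid"}
print(coordinates[(40.4, -3.7)])

# A list is mutable, so Python refuses to use it as a key.
try:
    bad = {[40.4, -3.7]: "madrid"}
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# Keys are unique: repeating a key keeps only the last value.
duplicates = {"spain": "madrid", "spain": "barcelona"}
print(duplicates)
```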

# To delete the dictionary itself, uncomment the next line
# del dictionary         # delete the entire dictionary
print(dictionary)       # prints {} here; raises NameError if the line above is uncommented

{}

C. PANDAS

What do we need to know about pandas?

•CSV: comma-separated values
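
To show what a CSV file looks like inside, the sketch below reads a few rows from a string instead of a file. The rows are a shortened copy of entries from the Pokemon dataset with only four columns kept; `io.StringIO` is used only so that the example does not depend on a file on disk.

```python
import io
import pandas as pd

# A few rows in CSV form: the first line holds the column names,
# and each following line is one record with comma-separated fields.
csv_text = """Name,Type 1,Type 2,HP
Bulbasaur,Grass,Poison,45
Ivysaur,Grass,Poison,60
Charmander,Fire,,39
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
small = pd.read_csv(io.StringIO(csv_text))
print(small)
print(small["Type 2"].isna().sum())  # empty fields are read as NaN
```

The empty field in the Charmander row becomes NaN, which is why info() reports fewer non-null values for 'Type 2' than for the other columns.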

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

data = pd.read_csv('pokemon.csv')

series = data['Defense']        # data['Defense'] = series
print(type(series))
data_frame = data[['Defense']]  # data[['Defense']] = data frame
print(type(data_frame))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>

More about pandas: https://pandas.pydata.org/
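
The difference between single and double brackets is easiest to see in the shape of the result. A minimal sketch on a tiny stand-in frame (the values are the Attack and Defense of the first three Pokemon):

```python
import pandas as pd

# A tiny stand-in for the Pokemon data.
frame = pd.DataFrame({"Attack": [49, 62, 82], "Defense": [49, 63, 83]})

series = frame["Defense"]        # single brackets: one-dimensional Series
sub_frame = frame[["Defense"]]   # double brackets: two-dimensional DataFrame

print(series.shape)      # (3,)
print(sub_frame.shape)   # (3, 1)

# A list of several names selects several columns at once.
print(frame[["Attack", "Defense"]].shape)   # (3, 2)
```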

D. Logic, control flow and filtering

Before continuing with pandas, we need to learn logic, control flow and filtering.

•Comparison operators: ==, !=, <, >, <=, >=

•Boolean operators: and, or, not

•Filtering pandas data frames

# Comparison operator
print(3 > 2)
print(3!=2)
# Boolean operators
print(True and False)
print(True or False)

True
True
False
True
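
The remaining operators and a simple if/elif/else branch can be sketched the same way. The attack and defense values are Bulbasaur's stats; the variable name `verdict` is only illustrative.

```python
attack = 49
defense = 49

print(attack <= defense)       # True: less than or equal
print(attack == defense)       # True: equal
print(not attack > defense)    # True: 'not' flips False to True
print(40 < attack < 50)        # True: comparisons can be chained

# Control flow: run a different branch depending on the comparison.
if attack > defense:
    verdict = "attacker"
elif attack < defense:
    verdict = "defender"
else:
    verdict = "balanced"
print(verdict)                 # balanced
```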

# 1 - Filtering a pandas data frame
x = data['Defense']>200     # Only 3 Pokemon have a Defense value higher than 200
data[x]

# 2 - Filtering pandas with logical_and
# Only 2 Pokemon have a Defense value higher than 200 and an Attack value higher than 100
data[np.logical_and(data['Defense']>200, data['Attack']>100 )]

# This is equivalent to the previous line: we can also use '&' for filtering.
data[(data['Defense']>200) & (data['Attack']>100)]
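
Besides '&', pandas filters can be combined with '|' (or) and '~' (not). Python's plain `and`, `or` and `not` do not work element by element on a Series, and each comparison needs its own parentheses because '&' and '|' bind more tightly than '>'. The sketch below uses a made-up four-row frame, not the Pokemon data.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Attack": [125, 10, 140, 60],
    "Defense": [230, 230, 90, 50],
})

high_def = toy["Defense"] > 200   # boolean Series, one value per row
high_atk = toy["Attack"] > 100

both = toy[high_def & high_atk]          # and: rows where both are True
either = toy[high_def | high_atk]        # or: rows where at least one is True
neither = toy[~(high_def | high_atk)]    # not: rows where neither is True

print(both["Name"].tolist())      # ['A']
print(either["Name"].tolist())    # ['A', 'B', 'C']
print(neither["Name"].tolist())   # ['D']
```

Storing each condition in a named variable first, as with `x` earlier, avoids the parentheses problem and keeps long filters readable.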

E. Loop data structures

We will learn the most basic while and for loops.

# Stay in the loop while the condition (i is not equal to 5) is true
i = 0
while i != 5 :
    print('i is: ',i)
    i +=1 
print(i,' is equal to 5')

i is:  0
i is:  1
i is:  2
i is:  3
i is:  4
5  is equal to 5
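
A condition written with '!=' only stops if the counter lands exactly on the target value. The sketch below, with made-up step sizes, shows a variant using '<' that is safer, and how `break` leaves a loop early.

```python
# Counting up in steps of 2 never lands exactly on 5,
# so 'while i != 5' would never stop; 'while i < 5' does.
i = 0
visited = []
while i < 5:
    visited.append(i)
    i += 2
print(visited)   # [0, 2, 4]
print(i)         # 6

# 'break' leaves a loop early once a condition is met.
n = 0
while True:
    n += 1
    if n * n > 50:
        break
print(n)         # 8
```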

# Visit each element of the list in order
lis = [1,2,3,4,5]
for i in lis:
    print('i is: ',i)
print('')

# Enumerate index and value of list
# index : value = 0:1, 1:2, 2:3, 3:4, 4:5
for index, value in enumerate(lis):
    print(index," : ",value)
print('')   

# For dictionaries
# We can use a for loop to access the keys and values of a dictionary. We learned about keys and values in the dictionary part.
dictionary = {'spain':'madrid','france':'paris'}
for key,value in dictionary.items():
    print(key," : ",value)
print('')

# For pandas we can access the index and value of each row
for index,value in data[['Attack']][0:1].iterrows():
    print(index," : ",value)

i is:  1
i is:  2
i is:  3
i is:  4
i is:  5

0  :  1
1  :  2
2  :  3
3  :  4
4  :  5

spain  :  madrid
france  :  paris

0  :  Attack    49
Name: 0, dtype: int64
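
With more than one column, each `value` returned by iterrows() is a Series keyed by column name. A minimal sketch on a tiny stand-in frame (the first three Pokemon; the `totals` dictionary is just an illustrative way to collect results):

```python
import pandas as pd

# A tiny stand-in for the Pokemon data.
frame = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur"],
    "Attack": [49, 62, 82],
    "Defense": [49, 63, 83],
})

# iterrows() yields (index, row); row is a Series keyed by column name.
totals = {}
for index, row in frame.iterrows():
    totals[row["Name"]] = row["Attack"] + row["Defense"]
    print(index, ":", row["Name"], totals[row["Name"]])

# The same sums without an explicit loop, usually faster on large frames.
vectorized = (frame["Attack"] + frame["Defense"]).tolist()
print(vectorized)   # [98, 125, 165]
```

Looping with iterrows() is fine for learning and for small frames; for column arithmetic on the full dataset, the vectorized form is generally the better habit.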

Cheers!

/itsmecevi

Output of data.head(5):

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
3	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
4	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False

Output of data[x] (Defense higher than 200):

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
224	225	Mega Steelix	Steel	Ground	75	125	230	55	95	30	2	False
230	231	Shuckle	Bug	Rock	20	10	230	10	230	5	2	False
333	334	Mega Aggron	Steel	NaN	70	140	230	60	80	50	3	False

Output of the np.logical_and filter (Defense higher than 200 and Attack higher than 100):

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
224	225	Mega Steelix	Steel	Ground	75	125	230	55	95	30	2	False
333	334	Mega Aggron	Steel	NaN	70	140	230	60	80	50	3	False

Output of the equivalent '&' filter:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
224	225	Mega Steelix	Steel	Ground	75	125	230	55	95	30	2	False
333	334	Mega Aggron	Steel	NaN	70	140	230	60	80	50	3	False