import os
cwd = os.getcwd()
cwd
In this tutorial, I explain only what you need to know to become a data scientist, no more and no less.
A data scientist needs these skills:
1. Basic Tools: Python, R, or SQL. You do not need to know everything; learning how to use Python is enough to get started.
2. Basic Statistics: mean, median, and standard deviation. Once you know basic statistics, computing them in Python is easy.
3. Data Munging: working with messy and difficult data, such as inconsistent date and string formatting. As you might guess, Python helps here too.
4. Data Visualization: the title is self-explanatory. We will visualize data in Python with the matplotlib and seaborn libraries.
5. Machine Learning: you do not need to understand the math behind every machine learning technique. You only need to understand the basics and learn how to implement them in Python.
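To make the basic statistics item concrete, here is a small sketch using Python's standard statistics module. The speeds list is made-up illustration data, not values from the Pokemon dataset used later.

```python
import statistics

# Hypothetical sample values, invented for illustration only
speeds = [45, 60, 80, 80, 100]

mean_speed = statistics.mean(speeds)      # arithmetic average
median_speed = statistics.median(speeds)  # middle value of the sorted data
std_speed = statistics.stdev(speeds)      # sample standard deviation

print('mean:', mean_speed)
print('median:', median_speed)
print('standard deviation:', round(std_speed, 2))
```

The mean and the median differ here because the values are not spread symmetrically around the middle.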
In this part, you will learn:
•how to import a CSV file
•how to plot line, scatter, and histogram plots
•basic dictionary features
•basic pandas features such as filtering, which data scientists use all the time
•while and for loops
A. MATPLOTLIB
B. DICTIONARY
C. PANDAS
D. LOGIC, CONTROL FLOW AND FILTERING
E. LOOP DATA STRUCTURES
Matplotlib is a Python library that helps us plot data. The simplest and most basic plots are line, scatter, and histogram plots.
•A line plot is better when the x axis is time.
•A scatter plot is better when we want to see the correlation between two variables.
•A histogram is better when we need to see the distribution of numerical data.
•Customization: colors, labels, line thickness, title, opacity, grid, figsize, axis ticks, and linestyle
Data and Packages:
# Packages: matplotlib, seaborn, numpy, and pandas (for DataFrame manipulation)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns # visualization tool
data = pd.read_csv('pokemon.csv')
data.info()
data.head(5)
data.columns
# Line Plot
# color = color, label = label, linewidth = width of line, alpha = opacity, grid = grid, linestyle = style of line
data.Speed.plot(kind = 'line', color = 'g',label = 'Speed',linewidth=1,alpha = 0.5,grid = True,linestyle = ':')
data.Defense.plot(color = 'r',label = 'Defense',linewidth=1, alpha = 0.5,grid = True,linestyle = '-.')
plt.legend(loc='upper right') # legend = puts label into plot
plt.xlabel('x axis') # label = name of label
plt.ylabel('y axis')
plt.title('Line Plot') # title = title of plot
plt.show()
# Scatter Plot
# x = attack, y = defense
data.plot(kind='scatter', x='Attack', y='Defense',alpha = 0.5,color = 'red')
plt.xlabel('Attack') # label = name of label
plt.ylabel('Defense')
plt.title('Attack Defense Scatter Plot') # title = title of plot
plt.show() # show the scatter plot before the next plot is drawn
# Histogram
# bins = number of bars in the figure
data.Speed.plot(kind = 'hist',bins = 50,figsize = (12,12),color='red')
plt.show()
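To see what the bins argument controls, the following sketch counts values into equal-width bins by hand, using only the standard library. The values list and the count_bins helper are hypothetical names made up for this example, and the helper assumes the data contains at least two different values. Matplotlib does this kind of counting for you when it draws a histogram.

```python
def count_bins(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not to a new one
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical data, invented for illustration only
values = [10, 20, 20, 30, 40, 50, 50, 50, 90, 110]

print(count_bins(values, 2))  # few wide bins: coarse picture
print(count_bins(values, 5))  # more narrow bins: finer picture
```

The total count is the same in both cases; only the width of each bar changes.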
More about matplotlib: https://matplotlib.org/index.html
Why do we need dictionaries?
•A dictionary stores 'key' and 'value' pairs.
•Looking up a value by its key is faster than searching through a list.
What are keys and values? Example:
•dictionary = {'spain' : 'madrid'}
•The key is spain.
•The value is madrid.
It's that easy.
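The speed advantage over lists is about lookup: a dictionary goes straight to the value for a key, while a list of pairs has to be scanned item by item. Here is a minimal sketch, where capitals_list, capitals_dict, and find_in_list are hypothetical names made up for this example.

```python
# The same data stored two ways
capitals_list = [('spain', 'madrid'), ('france', 'paris'), ('italy', 'rome')]
capitals_dict = {'spain': 'madrid', 'france': 'paris', 'italy': 'rome'}

def find_in_list(pairs, wanted_key):
    """Scan the list from the start until the key is found."""
    for key, value in pairs:
        if key == wanted_key:
            return value
    return None

print(find_in_list(capitals_list, 'italy'))  # checks each pair in turn
print(capitals_dict['italy'])                # direct lookup by key
print(capitals_dict.get('germany'))          # get() returns None for a missing key
```

With three entries the difference is invisible, but the list scan gets slower as the data grows while the dictionary lookup stays roughly constant.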
Let's practice some other operations: keys(), values(), updating, adding, checking, and removing an entry, removing all entries, and deleting the dictionary.
# create a dictionary and look at its keys and values
dictionary = {'spain' : 'madrid','usa' : 'vegas'}
print(dictionary.keys())
print(dictionary.values())
# Keys have to be immutable objects like strings, booleans, floats, integers or tuples
# Lists are mutable, so they cannot be used as keys
# Keys are unique
dictionary['spain'] = "barcelona" # update existing entry
print(dictionary)
dictionary['france'] = "paris" # Add new entry
print(dictionary)
del dictionary['spain'] # remove entry with key 'spain'
print(dictionary)
print('france' in dictionary) # check whether a key is in the dictionary
dictionary.clear() # remove all entries in dict
print(dictionary)
# The next line is commented out so that all the code can run; uncomment it to delete the dictionary
# del dictionary # delete entire dictionary
print(dictionary) # raises NameError only if the dictionary has been deleted
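The comments above about immutable and unique keys can be checked directly. In this sketch a tuple works as a key while a list raises a TypeError; the locations dictionary and its values are hypothetical.

```python
# A tuple is immutable, so it can be used as a key
locations = {(40.4, -3.7): 'madrid'}
print(locations[(40.4, -3.7)])

# A list is mutable, so Python refuses to use it as a key
try:
    locations[[48.9, 2.4]] = 'paris'
    list_key_allowed = True
except TypeError as error:
    list_key_allowed = False
    print('TypeError:', error)

# Keys are unique: assigning to an existing key replaces its value
locations[(40.4, -3.7)] = 'barcelona'
print(len(locations))
```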
What do we need to know about pandas?
•CSV: comma-separated values
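To see what "comma-separated values" means in practice, here is a small sketch that parses CSV text with Python's standard csv module. The two rows are made-up sample data, not the contents of pokemon.csv. pd.read_csv does the same kind of parsing and also builds a data frame from the result.

```python
import csv
import io

# Hypothetical CSV text: the first line is the header, each later line is one row
csv_text = "Name,Attack,Defense\nAlpha,49,49\nBeta,62,63\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # The csv module reads every value as a string, so convert numbers explicitly
    print(row['Name'], int(row['Attack']), int(row['Defense']))
```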
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
data = pd.read_csv('pokemon.csv')
series = data['Defense'] # data['Defense'] = series
print(type(series))
data_frame = data[['Defense']] # data[['Defense']] = data frame
print(type(data_frame))
More about pandas: https://pandas.pydata.org/
Before continuing with pandas, we need to learn logic, control flow, and filtering.
•Comparison operators: ==, !=, <, >, <=, >=
•Boolean operators: and, or, not
•Filtering pandas data frames
# Comparison operator
print(3 > 2)
print(3!=2)
# Boolean operators
print(True and False)
print(True or False)
# 1 - Filtering Pandas data frame
x = data['Defense']>200 # Only 3 Pokemon have a Defense value higher than 200
data[x]
# 2 - Filtering pandas with logical_and
# Only 2 Pokemon have a Defense value higher than 200 and an Attack value higher than 100
data[np.logical_and(data['Defense']>200, data['Attack']>100 )]
# This gives the same result as the logical_and version, so we can also use '&' for filtering.
data[(data['Defense']>200) & (data['Attack']>100)]
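The same pattern extends to "or" (|) and "not" (~). The sketch below uses a tiny hand-made data frame so the result is easy to check by eye; the names and numbers are invented and are not from pokemon.csv. The parentheses around each comparison matter because & and | bind more tightly than > and <.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Attack': [50, 120, 130, 90],
    'Defense': [210, 230, 80, 60],
})

high_defense = df['Defense'] > 200
high_attack = df['Attack'] > 100

both = df[high_defense & high_attack]        # and: both conditions hold
either = df[high_defense | high_attack]      # or: at least one condition holds
neither = df[~(high_defense | high_attack)]  # not: no condition holds

print(both['Name'].tolist())
print(either['Name'].tolist())
print(neither['Name'].tolist())
```

Storing each condition in its own variable first keeps the filter line short and avoids forgetting the parentheses.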
We will learn the most basic while and for loops.
# Stay in the loop while the condition (i is not equal to 5) is true
i = 0
while i != 5:
    print('i is: ',i)
    i += 1
print(i,' is equal to 5')
# Loop over each item in the list
lis = [1,2,3,4,5]
for i in lis:
    print('i is: ',i)
print('')
# Enumerate the index and value of each item in the list
# index : value = 0:1, 1:2, 2:3, 3:4, 4:5
for index, value in enumerate(lis):
    print(index," : ",value)
print('')
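To show what enumerate saves us from writing, this sketch builds the same index and value pairs twice, once with range and once with enumerate. The list names by_range and by_enumerate are made up for this example.

```python
lis = [1, 2, 3, 4, 5]

# Without enumerate: loop over the positions and index into the list
by_range = []
for index in range(len(lis)):
    by_range.append((index, lis[index]))

# With enumerate: the index and the value arrive together
by_enumerate = []
for index, value in enumerate(lis):
    by_enumerate.append((index, value))

print(by_range == by_enumerate)

# enumerate can also start counting from a different number
for position, value in enumerate(lis, start=1):
    print(position, ' : ', value)
```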
# For dictionaries
# We can use a for loop to access the keys and values of a dictionary. We learned about keys and values in the dictionary part.
dictionary = {'spain':'madrid','france':'paris'}
for key,value in dictionary.items():
    print(key," : ",value)
print('')
# For pandas, we can access the index and the values of each row
for index,value in data[['Attack']][0:1].iterrows():
    print(index," : ",value)
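In the loop above, each value is a whole row returned as a pandas Series, which is why the printed output includes the column name. This sketch on a small hand-made data frame (invented values, not from pokemon.csv) shows how to pull single numbers out of each row.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame({'Attack': [49, 62, 82], 'Defense': [49, 63, 83]})

totals = []
for index, row in df.iterrows():
    # row is a Series; access one column's value by its name
    total = row['Attack'] + row['Defense']
    totals.append(int(total))
    print(index, ' : ', row['Attack'], '+', row['Defense'], '=', total)
```

For large data frames a column expression such as df['Attack'] + df['Defense'] is usually much faster than looping with iterrows, so treat the loop as a way to inspect rows rather than to compute on them.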
Cheers!
/itsmecevi