%reset
import os
cwd = os.getcwd()
cwd
In this tutorial, I explain only what you need to become a data scientist, neither more nor less.
A data scientist needs these skills:
1. Basic Tools: such as Python, R or SQL. You do not need to know everything; learning how to use Python is enough to start.
2. Basic Statistics: such as mean, median or standard deviation. If you know basic statistics, you can apply them easily in Python.
3. Data Munging: working with messy and difficult data, such as inconsistent date and string formatting. As you might guess, Python helps us here.
4. Data Visualization: the title is self-explanatory. We will visualize data in Python with libraries such as matplotlib and seaborn.
5. Machine Learning: you do not need to understand the math behind every machine learning technique. You only need to understand the basics of machine learning and learn how to implement it in Python.
Data structures & analysis
A. Review of pandas
B. Building data frames from scratch
C. Visual exploratory data analysis
D. Statistical exploratory data analysis
E. Indexing pandas time series
F. Resampling pandas time series
As you may have noticed, we have already learned some pandas basics; now we will go deeper into pandas.
•single column = series
•NaN = not a number
•dataframe.values = NumPy array
•We can build data frames from CSV files, as we did earlier.
•We can also build data frames from dictionaries.
◾zip() function: returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. Wrap it in list() to get a list of tuples.
•Adding a new column
•Broadcasting: create a new column and assign one value to the entire column
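The review points above (a single column is a Series, NaN marks a missing value, and .values exposes a NumPy array) can be checked on a small frame. This is only a sketch: the `toy` frame and its column names are made up for illustration and are not part of the pokemon data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({"attack": [49, 62, np.nan], "defense": [49, 63, 83]})

single_column = toy["attack"]                 # selecting one column gives a Series
as_array = toy.values                         # the underlying data as a NumPy array
missing_count = toy["attack"].isna().sum()    # NaN marks the missing entry

print(type(single_column))
print(type(as_array))
print(missing_count)
```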
Data & Package:
#Package: matplotlib, seaborn,numpy, and pandas (for dataframe data structure manipulation)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns # visualization tool
data = pd.read_csv('pokemon.csv')
# data frames from dictionary
country = ["Spain","France"]
population = ["11","12"]
list_label = ["country","population"]
list_col = [country,population]
zipped = list(zip(list_label,list_col))
data_dict = dict(zipped)
df = pd.DataFrame(data_dict)
df
# Add new columns
df["capital"] = ["madrid","paris"]
df
# Broadcasting
df["income"] = 0 #Broadcasting entire column
df
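As a side note on the zip() step, the sketch below shows that zip() is lazy in Python 3 and that, for a case this small, passing a dictionary literal to pd.DataFrame should give the same result. The variable names are illustrative, not taken from the original code.

```python
import pandas as pd

labels = ["country", "population"]
columns = [["Spain", "France"], ["11", "12"]]

pairs = zip(labels, columns)       # a lazy iterator in Python 3, not a list
pair_list = list(pairs)            # materialize it: [(label, column), ...]
frame_from_zip = pd.DataFrame(dict(pair_list))

# Assumed-equivalent shortcut: pass the dictionary directly
frame_direct = pd.DataFrame({"country": ["Spain", "France"],
                             "population": ["11", "12"]})

# Broadcasting: the scalar 0 is repeated for every row
frame_direct["income"] = 0
print(frame_direct)
```

The zip() route is mainly useful when the labels and columns already live in separate lists; otherwise the dictionary literal is shorter.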
•Plot
•Subplot
•Histogram:
◾bins: number of bins
◾range(tuple): min and max values of bins
◾density(boolean): normalize or not (called normed in older matplotlib versions)
◾cumulative(boolean): compute cumulative distribution
# Plotting all data
data1 = data.loc[:,["Attack","Defense","Speed"]]
data1.plot()
# a single plot with all columns is hard to read
# subplots
data1.plot(subplots = True)
plt.show()
# scatter plot
data1.plot(kind = "scatter",x="Attack",y = "Defense")
plt.show()
# hist plot (density replaces the normed argument, which newer matplotlib versions no longer accept)
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),density = True)
# histogram subplots: non-cumulative and cumulative
fig, axes = plt.subplots(nrows=2,ncols=1)
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),density = True,ax = axes[0])
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),density = True,ax = axes[1],cumulative = True)
plt.savefig('graph.png')
plt.show()
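To see what the normalization and cumulative options compute, here is a minimal sketch that uses np.histogram instead of the plotting call. The values are synthetic (drawn from a normal distribution as a stand-in, not the real Defense column), and the seed and parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=70, scale=30, size=1000)   # synthetic stand-in data

# Raw counts and normalized heights over the same 50 bins between 0 and 250
counts, edges = np.histogram(values, bins=50, range=(0, 250))
density, _ = np.histogram(values, bins=50, range=(0, 250), density=True)
widths = np.diff(edges)

area = np.sum(density * widths)             # normalized bars cover a total area of 1
cumulative = np.cumsum(density * widths)    # running share of the in-range values

print(area)
print(cumulative[-1])
```

Values that fall outside the range are ignored, so the normalization is relative to the in-range values only.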
Let's look at the summary statistics one more time.
•count: number of entries
•mean: average of entries
•std: standard deviation
•min: minimum entry
•25%: first quartile
•50%: median or second quartile
•75%: third quartile
•max: maximum entry
data.describe()
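The statistics listed above can be verified by hand on a tiny series. The five values below are an invented example, chosen so that the quartiles are easy to check.

```python
import pandas as pd

sample = pd.Series([1, 2, 3, 4, 5])   # invented values for illustration
summary = sample.describe()

# The same quartiles computed directly, for comparison with describe()
first_quartile = sample.quantile(0.25)
median = sample.quantile(0.50)
third_quartile = sample.quantile(0.75)

print(summary)
```

Note that std here is the sample standard deviation (dividing by n - 1), which is the pandas default.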
•datetime = object
•parse_dates(boolean): parse date columns into datetime objects, displayed in ISO 8601 format (yyyy-mm-dd hh:mm:ss)
time_list = ["1992-03-08","1992-04-12"]
print(type(time_list[1])) # as you can see, the date is a string
# however, we want it to be a datetime object
datetime_object = pd.to_datetime(time_list)
print(type(datetime_object))
# suppress warnings
import warnings
warnings.filterwarnings("ignore")
# To practice, let's take the head of the pokemon data and add a date list to it
data2 = data.head().copy()  # copy to avoid SettingWithCopyWarning when adding a column
date_list = ["1992-01-10","1992-02-10","1992-03-10","1993-03-15","1993-03-16"]
datetime_object = pd.to_datetime(date_list)
data2["date"] = datetime_object
# let's make date the index
data2= data2.set_index("date")
data2
# Now we can select according to our date index
print(data2.loc["1993-03-16"])
print(data2.loc["1992-03-10":"1993-03-16"])
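The parse_dates option mentioned earlier belongs to pd.read_csv. The sketch below reads a small in-memory CSV (the text and column names are invented for illustration) with and without it, then selects by month using a partial date string.

```python
import io
import pandas as pd

# Invented two-row CSV, read from memory instead of a file
csv_text = "date,value\n1992-03-08,10\n1992-04-12,20\n"

raw = pd.read_csv(io.StringIO(csv_text))                            # dates stay plain strings
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])   # dates become datetimes

indexed = parsed.set_index("date")
april_rows = indexed.loc["1992-04"]   # partial string: all rows in April 1992

print(raw["date"].dtype)
print(parsed["date"].dtype)
print(april_rows)
```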
•Resampling: applying a statistical method over different time intervals
◾Needs a string to specify the frequency, like "M" = month or "A" = year
•Downsampling: reduce datetime rows to a slower frequency, like from daily to weekly
•Upsampling: increase datetime rows to a faster frequency, like from daily to hourly
•Interpolate: fill in values according to different methods like ‘linear’, ‘time’ or ‘index’
◾https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
# We will use data2 that we created in the previous part
# numeric_only=True skips text columns, which recent pandas versions refuse to average
data2.resample("A").mean(numeric_only=True)
# Let's resample by month
data2.resample("M").mean(numeric_only=True)
# As you can see, there are a lot of NaN values because data2 does not include all months
# With real data (not created by us like data2) we can solve this problem with interpolate
# We can interpolate from the first value
data2.resample("M").first().interpolate("linear")
# Or we can interpolate the monthly means
data2.resample("M").mean(numeric_only=True).interpolate("linear")
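Downsampling and upsampling are easier to follow on a series whose values are known in advance. The sketch below uses invented numeric series (not the pokemon data) and the "D" (daily) and "W" (weekly) frequency strings.

```python
import pandas as pd

# Downsampling: 14 invented daily values reduced to weekly means
daily = pd.Series(
    range(14),
    index=pd.date_range("2020-01-01", periods=14, freq="D"),
    dtype="float64",
)
weekly_mean = daily.resample("W").mean()

# Upsampling: two observations four days apart expanded to daily rows
sparse = pd.Series(
    [0.0, 4.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-05"]),
)
upsampled = sparse.resample("D").asfreq()     # new rows appear as NaN
filled = upsampled.interpolate("linear")      # NaN replaced along a straight line

print(weekly_mean)
print(filled)
```

Linear interpolation is only one choice; it assumes the value changes evenly between the known points, which may not hold for real data.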
Cheers!
/itsmecevi