1. Introduction
2. Literature Review
3. Data Description
4. Model Analysis
5. R Code Analysis
* DataSet Details
* Describe DataSet
* Create a Contingency Table for each variable in dataset
* Create two-way contingency tables for the categorical variables
* Boxplot Creation
* Histogram for the variables
* Visualize your correlation matrix using corrgram
* Create a scatter plot matrix
* Run a suitable test to check your hypothesis for your suitable assumptions
* T-Test Hypothesis
6. Result
7. References
Quality of Work
Performance Pressure
Friends
Location
Money
This dataset comes from the Kaggle HR Analytics dataset. Its variables are:
Satisfaction Level : Employee Satisfaction (can be interpreted as a %)
Last evaluation : Employee Evaluation (can be interpreted as a %)
Projects : Number of Projects (per year)
Average monthly hours : Average monthly hours
Time spent at company : Time spent at company
Accident : Whether they have had a work accident
Promotion Last 5 yrs : Whether they have had a promotion in the last 5 years
Department : Type of Job Position
Salary : Salary level (1= low, 2= medium, 3= high)
Left : Whether the employee has left (0= remains employed, 1= left)
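The salary column is stored as text, so R sorts its levels alphabetically (high, low, medium) instead of following the 1/2/3 coding listed above. A minimal sketch of re-declaring the order, using a made-up toy vector in place of hr$salary:

```r
# Toy vector standing in for hr$salary (illustrative values, not from the dataset)
salary <- factor(c("low", "medium", "high", "low", "medium"))
levels(salary)   # default order is alphabetical

# Re-declare the levels in their natural order so "low" becomes the first level
salary_ordered <- factor(salary, levels = c("low", "medium", "high"))
levels(salary_ordered)

# Integer codes now follow the 1 = low, 2 = medium, 3 = high scheme
as.integer(salary_ordered)
```

With the default alphabetical order, a regression treats "high" as the baseline level, which is why coefficients named salarylow and salarymedium appear in model output.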
We fit a linear regression model of satisfaction level on the other variables to see which parameters are statistically significant.
head(hr)
fullmodel <- lm(satisfaction_level ~ salary + average_montly_hours + number_project + time_spend_company + promotion_last_5years + last_evaluation + Work_accident + left,data=hr )
summary(fullmodel)
Call:
lm(formula = satisfaction_level ~ salary + average_montly_hours +
number_project + time_spend_company + promotion_last_5years +
last_evaluation + Work_accident + left, data = hr)
Residuals:
Min 1Q Median 3Q Max
-0.64740 -0.13677 -0.01193 0.17004 0.52773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.148e-01 1.159e-02 53.065 < 2e-16 ***
salarylow 1.200e-02 6.961e-03 1.724 0.0847 .
salarymedium 1.306e-02 6.956e-03 1.878 0.0604 .
average_montly_hours 1.913e-04 4.127e-05 4.636 3.58e-06 ***
number_project -4.090e-02 1.691e-03 -24.183 < 2e-16 ***
time_spend_company -5.525e-03 1.295e-03 -4.267 2.00e-05 ***
promotion_last_5years 9.285e-03 1.272e-02 0.730 0.4655
last_evaluation 2.460e-01 1.167e-02 21.071 < 2e-16 ***
Work_accident -3.356e-05 5.238e-03 -0.006 0.9949
left -2.241e-01 4.449e-03 -50.360 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2227 on 14989 degrees of freedom
Multiple R-squared: 0.1982, Adjusted R-squared: 0.1977
F-statistic: 411.6 on 9 and 14989 DF, p-value: < 2.2e-16
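Average monthly hours, number of projects, time spent at the company, last evaluation, and whether or not the employee left all have significant p-values (below 0.001). The two salary indicators are significant only at the 0.1 level, and promotion in the last 5 years and work accident are not significant.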
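Reading significance off the printed summary can also be done programmatically. The sketch below uses the built-in mtcars data as a stand-in for hr, and the 0.05 cutoff is an assumption rather than something fixed by the analysis:

```r
# mtcars ships with R and stands in for the hr data here
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef(summary(fit)) is a matrix with columns:
# Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))
pvals <- coefs[, "Pr(>|t|)"]

# Keep predictors below the assumed 0.05 cutoff, ignoring the intercept
significant <- setdiff(names(pvals)[pvals < 0.05], "(Intercept)")
print(significant)
```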
R-squared and adjusted R-squared are both about 0.20. This is not typically a good value, but it is rather common in any data analysis of human behavior. For example, in psychology studies this R-squared level would not eliminate the model's validity, especially if p-values indicate significance.
More about the summary function for linear regression models: https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
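For reference, both figures can be recomputed by hand. This sketch uses the built-in mtcars data as a stand-in for hr, then applies the adjusted R-squared formula to the numbers reported in the summary output above (R-squared 0.1982, 14999 rows, 14989 residual degrees of freedom):

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg

rss <- sum(residuals(fit)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)       # total sum of squares
n <- length(y)
p <- length(coef(fit)) - 1        # number of predictors, excluding intercept

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same formula applied to the values reported for the full model
adj_r2_hr <- 1 - (1 - 0.1982) * (14999 - 1) / 14989
round(adj_r2_hr, 4)
```

With roughly 15,000 rows and only 9 coefficients, the adjustment barely moves the value, which is why the two figures in the output are nearly identical.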
revisedmodel<- lm(satisfaction_level ~ average_montly_hours + number_project + time_spend_company + last_evaluation,data=hr)
summary(revisedmodel)
Call:
lm(formula = satisfaction_level ~ average_montly_hours + number_project +
time_spend_company + last_evaluation, data = hr)
Residuals:
Min 1Q Median 3Q Max
-0.61923 -0.19061 0.02274 0.19617 0.59000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.152e-01 1.066e-02 57.70 <2e-16 ***
average_montly_hours 5.183e-05 4.469e-05 1.16 0.246
number_project -3.894e-02 1.835e-03 -21.23 <2e-16 ***
time_spend_company -1.498e-02 1.383e-03 -10.83 <2e-16 ***
last_evaluation 2.622e-01 1.266e-02 20.71 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2417 on 14994 degrees of freedom
Multiple R-squared: 0.05522, Adjusted R-squared: 0.05497
F-statistic: 219.1 on 4 and 14994 DF, p-value: < 2.2e-16
This model has even lower R-squared and adjusted R-squared values of about 0.055. Much of the drop comes from leaving out left, which had the largest t value in the full model; the rest again reflects the fact that human behavior is very difficult to predict. The variables remained significant, except for average monthly hours (p = 0.246). Based on these p-values, we conclude this model and remove the insignificant variable for the next model.
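When one model is nested inside another, as here, an F-test via anova() is one way to check whether the dropped variables mattered as a group. This is a sketch on the built-in mtcars data, not a rerun of the hr models:

```r
# Nested pair of models on stand-in data
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# F-test of the null hypothesis that the dropped terms add nothing
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared side by side
adj <- c(full = summary(full)$adj.r.squared,
         reduced = summary(reduced)$adj.r.squared)
print(adj)
```

A small p-value in the second row of the anova table would suggest that at least one of the removed predictors carries information the reduced model loses.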
dim(hr)
[1] 14999 10
#install.packages("psych")
library(psych)
describe(hr)
table_salary<-with(hr,table(salary))
table_salary
salary
high low medium
1237 7316 6446
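Counts are easier to compare as shares. A short sketch that re-enters the three salary counts printed above and converts them to percentages with prop.table():

```r
# Counts copied from the table_salary output
salary_counts <- as.table(c(high = 1237, low = 7316, medium = 6446))

# Proportion of employees in each salary band
salary_share <- prop.table(salary_counts)
round(100 * salary_share, 1)
```

On the real data the same result comes from prop.table(table_salary).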
table_satisfication<-with(hr,table(satisfaction_level))
table_satisfication
satisfaction_level
0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27
195 358 335 30 54 73 76 79 72 63 74 69 67 60 54 80 34 30 30
0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46
31 38 39 59 50 36 48 37 139 241 189 175 209 171 155 224 211 203 95
0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65
42 149 209 229 187 196 179 185 179 187 210 182 219 193 208 188 209 187 199
0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84
228 177 162 209 205 171 230 246 257 226 234 252 241 217 222 220 241 234 247
0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
207 200 225 187 237 220 224 198 169 167 181 203 176 183 172 111
table_lastevaluation<-with(hr,table(last_evaluation))
table_lastevaluation
last_evaluation
0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54
22 55 50 52 57 59 56 50 44 115 211 173 292 332 353 345 309 324 350
0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73
358 322 333 225 255 221 234 233 236 235 201 222 245 222 193 213 196 211 223
0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92
260 238 216 263 214 241 251 255 237 269 294 316 273 326 235 296 313 287 269
0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
269 263 258 249 276 263 258 283
table_numberproject<-with(hr,table(number_project))
table_numberproject
number_project
2 3 4 5 6 7
2388 4055 4365 2761 1174 256
table_avgmontlyhours<-with(hr,table(average_montly_hours))
table_avgmontlyhours
average_montly_hours
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
6 14 23 11 19 16 17 17 28 17 19 10 18 18 12 26 10 29 15 14 10 18 12 10
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
10 24 11 20 13 19 25 72 65 63 59 69 100 87 114 153 104 122 88 120 129 115 112 127
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
102 134 110 118 123 148 108 147 112 122 121 125 153 126 124 121 136 87 96 73 78 78 73 94
168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
92 86 76 83 70 96 78 76 81 81 85 73 88 78 75 84 80 93 76 68 73 85 75 80
192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
96 67 71 67 79 70 86 79 58 86 80 72 68 73 83 71 72 72 72 79 72 71 78 68
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
76 87 79 85 64 81 84 93 112 95 93 77 76 93 59 77 97 102 74 76 83 90 108 96
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263
93 85 98 112 98 124 102 108 86 93 100 98 86 101 113 115 87 126 110 98 124 102 86 110
264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287
111 91 105 88 93 102 93 104 86 88 94 82 30 21 35 32 29 34 36 25 24 33 50 30
288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310
6 19 15 17 15 13 16 12 21 7 13 6 11 24 8 6 17 18 18 14 20 16 18
table_timespend<-with(hr,table(time_spend_company))
table_timespend
time_spend_company
2 3 4 5 6 7 8 10
3244 6443 2557 1473 718 188 162 214
table_workaccident<-with(hr,table(Work_accident))
table_workaccident
Work_accident
0 1
12830 2169
table_left<-with(hr,table(left))
table_left
left
0 1
11428 3571
table_promotion<-with(hr,table(promotion_last_5years))
table_promotion
promotion_last_5years
0 1
14680 319
table_sales<-with(hr,table(Department))
table_sales
Department
accounting hr IT management marketing product_mng RandD sales
767 739 1227 630 858 902 787 4140
support technical
2229 2720
table_project_spend<-xtabs(~number_project+time_spend_company,data=hr)
head(table_project_spend,5)
time_spend_company
number_project 2 3 4 5 6 7 8 10
2 224 1854 136 83 53 16 12 10
3 1255 1782 530 135 139 58 62 94
4 1144 1798 577 445 215 64 46 76
5 554 866 431 592 224 38 34 22
6 66 136 673 180 87 12 8 12
table_satisfication_salary<-xtabs(~satisfaction_level+salary,data=hr)
head(table_satisfication_salary,5)
salary
satisfaction_level high low medium
0.09 4 113 78
0.1 9 205 144
0.11 2 210 123
0.12 1 11 18
0.13 4 27 23
table_Department_salary<-xtabs(~Department+salary,data=hr)
head(table_Department_salary,5)
salary
Department high low medium
accounting 74 358 335
hr 45 335 359
IT 83 609 535
management 225 180 225
marketing 80 402 376
table_avgmontlyhours_salary<-xtabs(~average_montly_hours+salary,data=hr)
head(table_avgmontlyhours_salary,5)
salary
average_montly_hours high low medium
96 1 3 2
97 2 7 5
98 0 15 8
99 2 5 4
100 0 8 11
table_accident_salary<-xtabs(~Work_accident+salary,data=hr)
table_accident_salary
salary
Work_accident high low medium
0 1045 6276 5509
1 192 1040 937
table_promotion_salary<-xtabs(~promotion_last_5years+salary,data=hr)
table_promotion_salary
salary
promotion_last_5years high low medium
0 1165 7250 6265
1 72 66 181
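A two-way table like this one can be tested for independence with a chi-squared test. The sketch below re-enters the promotion by salary counts printed above; the choice of chisq.test() is an assumption about what a suitable test would be, not something the original analysis ran:

```r
# Counts copied from the table_promotion_salary output
promo_salary <- matrix(
  c(1165, 7250, 6265,
      72,   66,  181),
  nrow = 2, byrow = TRUE,
  dimnames = list(promotion_last_5years = c("0", "1"),
                  salary = c("high", "low", "medium"))
)

# Pearson chi-squared test of independence
res <- chisq.test(promo_salary)
res$expected      # counts expected if promotion and salary were independent
res$p.value

# Share of promoted employees within each salary band
promo_rate <- prop.table(promo_salary, margin = 2)["1", ]
round(100 * promo_rate, 1)
```

On the real data, chisq.test(table_promotion_salary) would run the same test directly.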
table_project_timespend<-xtabs(~number_project+time_spend_company,data=hr)
table_project_timespend
time_spend_company
number_project 2 3 4 5 6 7 8 10
2 224 1854 136 83 53 16 12 10
3 1255 1782 530 135 139 58 62 94
4 1144 1798 577 445 215 64 46 76
5 554 866 431 592 224 38 34 22
6 66 136 673 180 87 12 8 12
7 1 7 210 38 0 0 0 0
boxplot(satisfaction_level ~salary,data=hr, horizontal=TRUE,
ylab="Salary Level", xlab="Satisfaction level", las=1,
main="Analysis of Salary of Employee on the basis of their satisfaction level",
col=c("red","blue","green")
)
boxplot(satisfaction_level ~left, data=hr, horizontal=TRUE,
ylab="Left", xlab="Satisfaction level", las=1,
main="Analysis of Employees Who Left on the basis of their Satisfaction Level",
col=c("Yellow","Orange")
)
boxplot(number_project~left,data=hr, horizontal=TRUE,
ylab="Left", xlab="No of Projects", las=1,
main="Analysis of Employees Who Left on the basis of their Number of Projects",
col=c("Red","Magenta")
)
boxplot(average_montly_hours ~left, data=hr,horizontal=TRUE,
ylab="Left", xlab="Average Monthly Hours", las=1,
main="Analysis of Employees Who Left on the basis of their Average Monthly Hours",
col=c("Yellow","Orange")
)
boxplot(Work_accident~left,data=hr, horizontal=TRUE,
ylab="Left", xlab="Work Accident", las=1,
main="Analysis of Employees Who Left on the basis of their Work Accident",
col=c("Yellow","Orange")
)
boxplot(last_evaluation ~left,data=hr, horizontal=TRUE,
ylab="Left", xlab="Last Evaluation", las=1,
main="Analysis of Employees Who Left on the basis of their Last Evaluation",
col=c("Yellow","Orange")
)
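The numbers behind a boxplot can be tabulated per group as a complement to the plots. This sketch uses the built-in mtcars data (mpg grouped by am) as a stand-in for a variable grouped by left:

```r
# Five-number summary (min, lower hinge, median, upper hinge, max) per group
five_by_group <- tapply(mtcars$mpg, mtcars$am, fivenum)

# Stack the per-group vectors into one table
summary_tbl <- do.call(rbind, five_by_group)
colnames(summary_tbl) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
summary_tbl
```

Note that boxplot whiskers stop at the most extreme points within 1.5 times the box length, so they can differ from the min and max columns when outliers are present.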
hist(hr$satisfaction_level, main=" Variation in Satisfaction Level ", xlab="Satisfaction Level",breaks=10,ylab="Frequency", col="green")
hist(hr$last_evaluation, main=" Variation in Last Evaluation ", xlab="Last Evaluation",breaks=10,ylab="Frequency", col="blue")
hist(hr$time_spend_company, main=" Variation in Time Spent in the Company ", xlab="Time Spent in the Company",breaks=10,ylab="Frequency", col="yellow")
hist(hr$average_montly_hours, main=" Variation in Average Monthly Hours ", xlab="Average Monthly Hours",breaks=10,ylab="Frequency", col="red")
plot(y=hr$salary, x=hr$Department,
col="red",
main="Relationship between Salary and Department",
ylab="Salary", xlab="Department")
plot(y=hr$average_montly_hours, x=hr$Department,
col="green",
main="Relationship between Average Monthly Hours and Department",
ylab="Average Monthly Hours", xlab="Department")
library(corrplot)
corrplot 0.84 loaded
correlationMatrix <- cor(hr[,c(1:8)])
corrplot(correlationMatrix, method="circle")
cor(hr[ ,c(1,2,3,4,5,6,7,8)])
satisfaction_level last_evaluation number_project average_montly_hours
satisfaction_level 1.00000000 0.105021214 -0.142969586 -0.020048113
last_evaluation 0.10502121 1.000000000 0.349332589 0.339741800
number_project -0.14296959 0.349332589 1.000000000 0.417210634
average_montly_hours -0.02004811 0.339741800 0.417210634 1.000000000
time_spend_company -0.10086607 0.131590722 0.196785891 0.127754910
Work_accident 0.05869724 -0.007104289 -0.004740548 -0.010142888
left -0.38837498 0.006567120 0.023787185 0.071287179
promotion_last_5years 0.02560519 -0.008683768 -0.006063958 -0.003544414
time_spend_company Work_accident left promotion_last_5years
satisfaction_level -0.100866073 0.058697241 -0.38837498 0.025605186
last_evaluation 0.131590722 -0.007104289 0.00656712 -0.008683768
number_project 0.196785891 -0.004740548 0.02378719 -0.006063958
average_montly_hours 0.127754910 -0.010142888 0.07128718 -0.003544414
time_spend_company 1.000000000 0.002120418 0.14482217 0.067432925
Work_accident 0.002120418 1.000000000 -0.15462163 0.039245435
left 0.144822175 -0.154621634 1.00000000 -0.061788107
promotion_last_5years 0.067432925 0.039245435 -0.06178811 1.000000000
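The calls above pick columns 1 to 8 by position, which works because the two text columns come last. A sketch of selecting numeric columns by type instead, shown on the built-in iris data as a stand-in for hr:

```r
# Pick numeric columns by type rather than by position
num_cols <- sapply(iris, is.numeric)
cm <- cor(iris[, num_cols])

# Find the strongest off-diagonal correlation
cm_off <- cm
diag(cm_off) <- 0
idx <- which(abs(cm_off) == max(abs(cm_off)), arr.ind = TRUE)[1, ]
strongest_pair <- c(rownames(cm)[idx["row"]], colnames(cm)[idx["col"]])
print(strongest_pair)
```

Selecting by type keeps the code working if the column order in the source file changes.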
library(corrgram)
corrgram(hr, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main = "Corrgram of all variables")
library(car)
#The following object is masked from 'package:psych':
scatterplotMatrix(formula = ~left + satisfaction_level + time_spend_company + Work_accident +average_montly_hours , data = hr,smooth= TRUE)
cor.test(hr$left,hr$satisfaction_level)
Pearson's product-moment correlation
data: hr$left and hr$satisfaction_level
t = -51.613, df = 14997, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4018809 -0.3747001
sample estimates:
cor
-0.388375
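The t statistic reported by cor.test follows directly from the correlation and the degrees of freedom. A short check using the values printed above for left and satisfaction_level:

```r
# Values copied from the cor.test output
r  <- -0.388375
df <- 14997

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 3)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)
```

Because the sample is large, even the small correlation of 0.024 between left and number_project yields a p-value below 0.05, so the size of the correlation matters as much as its significance.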
cor.test(hr$left,hr$time_spend_company)
Pearson's product-moment correlation
data: hr$left and hr$time_spend_company
t = 17.924, df = 14997, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1291176 0.1604541
sample estimates:
cor
0.1448222
cor.test(hr$left,hr$last_evaluation)
Pearson's product-moment correlation
data: hr$left and hr$last_evaluation
t = 0.80424, df = 14997, p-value = 0.4213
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.009437678 0.022568555
sample estimates:
cor
0.00656712
cor.test(hr$left,hr$number_project)
Pearson's product-moment correlation
data: hr$left and hr$number_project
t = 2.9139, df = 14997, p-value = 0.003575
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.007786343 0.039775850
sample estimates:
cor
0.02378719
t.test(hr$satisfaction_level~hr$left)
Welch Two Sample t-test
data: hr$satisfaction_level by hr$left
t = 46.636, df = 5167, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2171815 0.2362417
sample estimates:
mean in group 0 mean in group 1
0.6668096 0.4400980
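The confidence interval in the Welch output can be rebuilt from the other reported figures, which makes clear what it measures: the difference in mean satisfaction between employees who stayed and those who left. A sketch using only the values printed above:

```r
# Values copied from the t.test output
mean_stayed <- 0.6668096   # mean in group 0
mean_left   <- 0.4400980   # mean in group 1
t_stat <- 46.636
df     <- 5167

diff <- mean_stayed - mean_left   # estimated difference in means
se   <- diff / t_stat             # standard error implied by the t statistic

# 95 percent confidence interval for the difference
ci <- diff + c(-1, 1) * qt(0.975, df) * se
round(ci, 4)
```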
t.test(hr$time_spend_company~hr$left)
Welch Two Sample t-test
data: hr$time_spend_company by hr$left
t = -22.631, df = 9625.6, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5394767 -0.4534706
sample estimates:
mean in group 0 mean in group 1
3.380032 3.876505
t.test(hr$average_montly_hours~hr$left)
Welch Two Sample t-test
data: hr$average_montly_hours by hr$left
t = -7.5323, df = 4875.1, p-value = 5.907e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.534631 -6.183384
sample estimates:
mean in group 0 mean in group 1
199.0602 207.4192
t.test(hr$last_evaluation~hr$left)
Welch Two Sample t-test
data: hr$last_evaluation by hr$left
t = -0.72534, df = 5154.9, p-value = 0.4683
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.009772224 0.004493874
sample estimates:
mean in group 0 mean in group 1
0.7154734 0.7181126
An employee who has a low salary, a satisfaction level below 0.5, and average monthly hours above 200 has a 70% probability of leaving. This profile uses 3 of the 10 parameters in the dataset.
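One way such a figure could be computed is as the attrition rate within the subgroup that matches all three conditions. The sketch below uses a made-up eight-row toy data frame with the same column names as hr; the values are illustrative only and do not reproduce the 70% result:

```r
# Toy data with the hr column names (made-up values for illustration)
toy <- data.frame(
  salary = c("low", "low", "low", "low", "medium", "high", "low", "medium"),
  satisfaction_level = c(0.10, 0.40, 0.45, 0.38, 0.80, 0.90, 0.70, 0.30),
  average_montly_hours = c(250, 230, 210, 260, 160, 180, 240, 150),
  left = c(1, 1, 0, 1, 0, 0, 0, 1)
)

# Rows matching all three conditions
at_risk <- with(toy, salary == "low" &
                     satisfaction_level < 0.5 &
                     average_montly_hours > 200)

# Share of leavers inside the subgroup versus overall
rate_at_risk <- mean(toy$left[at_risk])
rate_overall <- mean(toy$left)
c(at_risk = rate_at_risk, overall = rate_overall)
```

Replacing toy with hr would give the observed rate for this profile in the actual dataset.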
https://smallbusiness.chron.com/meaning-attrition-used-hr-61183.html
https://www.linkedin.com/pulse/top-5-reasons-employee-attrition-how-deal-mahidhar-reddy/