<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta content="width=device-width, initial-scale=1.0" name="viewport">
<title>Regression models</title>
<meta content="" name="description">
<meta content="" name="keywords">
<!-- Favicons -->
<link href="assets/img/Favicon-1.png" rel="icon">
<link href="assets/img/Favicon-1.png" rel="apple-touch-icon">
<!-- Google Fonts -->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,600,600i,700,700i|Raleway:300,300i,400,400i,500,500i,600,600i,700,700i|Poppins:300,300i,400,400i,500,500i,600,600i,700,700i" rel="stylesheet">
<!-- Vendor CSS Files -->
<link href="assets/vendor/aos/aos.css" rel="stylesheet">
<link href="assets/vendor/bootstrap/css/bootstrap.min.css" rel="stylesheet">
<link href="assets/vendor/bootstrap-icons/bootstrap-icons.css" rel="stylesheet">
<link href="assets/vendor/boxicons/css/boxicons.min.css" rel="stylesheet">
<link href="assets/vendor/glightbox/css/glightbox.min.css" rel="stylesheet">
<link href="assets/vendor/swiper/swiper-bundle.min.css" rel="stylesheet">
<!-- Creating a python code section-->
<link rel="stylesheet" href="assets/css/prism.css">
<script src="assets/js/prism.js"></script>
<!-- Template Main CSS File -->
<link href="assets/css/style.css" rel="stylesheet">
<!-- To set the icon, visit https://fontawesome.com/account-->
<script src="https://kit.fontawesome.com/5d25c1efd3.js" crossorigin="anonymous"></script>
<!-- end of icon-->
<script type="text/javascript" async
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<style>
/* Add some basic styling for code */
pre {
background-color: #f4f4f4;
padding: 10px;
border: 1px solid #ddd;
border-radius: 5px;
font-family: monospace;
white-space: pre-wrap;
}
</style>
<!-- =======================================================
* Template Name: iPortfolio
* Updated: Sep 18 2023 with Bootstrap v5.3.2
* Template URL: https://bootstrapmade.com/iportfolio-bootstrap-portfolio-websites-template/
* Author: BootstrapMade.com
* License: https://bootstrapmade.com/license/
======================================================== -->
</head>
<body>
<!-- ======= Mobile nav toggle button ======= -->
<i class="bi bi-list mobile-nav-toggle d-xl-none"></i>
<!-- ======= Header ======= -->
<header id="header">
<div class="d-flex flex-column">
<div class="profile">
<img src="assets/img/myphoto.jpeg" alt="" class="img-fluid rounded-circle">
<h1 class="text-light"><a href="index.html">Arun</a></h1>
<div class="social-links mt-3 text-center">
<a href="https://www.linkedin.com/in/arunp77/" target="_blank" class="linkedin"><i class="bx bxl-linkedin"></i></a>
<a href="https://github.com/arunp77" target="_blank" class="github"><i class="bx bxl-github"></i></a>
<a href="https://twitter.com/arunp77_" target="_blank" class="twitter"><i class="bx bxl-twitter"></i></a>
<a href="https://www.instagram.com/arunp77/" target="_blank" class="instagram"><i class="bx bxl-instagram"></i></a>
<a href="https://arunp77.medium.com/" target="_blank" class="medium"><i class="bx bxl-medium"></i></a>
</div>
</div>
<nav id="navbar" class="nav-menu navbar">
<ul>
<li><a href="index.html#hero" class="nav-link scrollto active"><i class="bx bx-home"></i> <span>Home</span></a></li>
<li><a href="index.html#about" class="nav-link scrollto"><i class="bx bx-user"></i> <span>About</span></a></li>
<li><a href="index.html#resume" class="nav-link scrollto"><i class="bx bx-file-blank"></i> <span>Resume</span></a></li>
<li><a href="index.html#portfolio" class="nav-link scrollto"><i class="bx bx-book-content"></i> <span>Portfolio</span></a></li>
<li><a href="index.html#skills-and-tools" class="nav-link scrollto"><i class="bx bx-wrench"></i> <span>Skills and Tools</span></a></li>
<li><a href="index.html#language" class="nav-link scrollto"><i class="bi bi-menu-up"></i> <span>Languages</span></a></li>
<li><a href="index.html#awards" class="nav-link scrollto"><i class="bi bi-award-fill"></i> <span>Awards</span></a></li>
<li><a href="index.html#professionalcourses" class="nav-link scrollto"><i class="bx bx-book-alt"></i> <span>Professional Certification</span></a></li>
<li><a href="index.html#publications" class="nav-link scrollto"><i class="bx bx-news"></i> <span>Publications</span></a></li>
<li><a href="index.html#extra-curricular" class="nav-link scrollto"><i class="bx bx-rocket"></i> <span>Extra-Curricular Activities</span></a></li>
<!-- <li><a href="#contact" class="nav-link scrollto"><i class="bx bx-envelope"></i> <span>Contact</span></a></li> -->
</ul>
</nav><!-- .nav-menu -->
</div>
</header><!-- End Header -->
<main id="main">
<!-- ======= Breadcrumbs ======= -->
<section id="breadcrumbs" class="breadcrumbs">
<div class="container">
<div class="d-flex justify-content-between align-items-center">
<h2>Machine learning</h2>
<ol>
<li><a href="machine-learning.html" class="clickable-box">Content section</a></li>
<li><a href="index.html#portfolio" class="clickable-box">Portfolio section</a></li>
</ol>
</div>
</div>
</section><!-- End Breadcrumbs -->
<!------ right dropdown menue ------->
<div class="right-side-list">
<div class="dropdown">
<button class="dropbtn"><strong>Shortcuts:</strong></button>
<div class="dropdown-content">
<ul>
<li><a href="cloud-compute.html"><i class="fas fa-cloud"></i> Cloud</a></li>
<li><a href="AWS-GCP.html"><i class="fas fa-cloud"></i> AWS-GCP</a></li>
<li><a href="amazon-s3.html"><i class="fas fa-cloud"></i> AWS S3</a></li>
<li><a href="ec2-confi.html"><i class="fas fa-server"></i> EC2</a></li>
<li><a href="Docker-Container.html"><i class="fab fa-docker" style="color: rgb(29, 27, 27);"></i> Docker</a></li>
<li><a href="Jupyter-nifi.html"><i class="fab fa-python" style="color: rgb(34, 32, 32);"></i> Jupyter-nifi</a></li>
<li><a href="snowflake-task-stream.html"><i class="fas fa-snowflake"></i> Snowflake</a></li>
<li><a href="data-model.html"><i class="fas fa-database"></i> Data modeling</a></li>
<li><a href="sql-basics.html"><i class="fas fa-table"></i> QL</a></li>
<li><a href="sql-basic-details.html"><i class="fas fa-database"></i> SQL</a></li>
<li><a href="Bigquerry-sql.html"><i class="fas fa-database"></i> Bigquerry</a></li>
<li><a href="scd.html"><i class="fas fa-archive"></i> SCD</a></li>
<li><a href="sql-project.html"><i class="fas fa-database"></i> SQL project</a></li>
<!-- Add more subsections as needed -->
</ul>
</div>
</div>
</div>
<!-- ======= Portfolio Details Section ======= -->
<section id="portfolio-details" class="portfolio-details">
<div class="container">
<div class="row gy-4">
<h1>Linear Regression &amp; Gradient Descent Method</h1>
<div class="col-lg-8">
<div class="portfolio-details-slider swiper">
<div class="swiper-wrapper align-items-center">
<figure>
<img src="assets/img/data-engineering/Linear-reg1.png" alt="" style="max-width: 60%; max-height: auto;">
<figcaption></figcaption>
</figure>
</div>
<div class="swiper-pagination"></div>
</div>
</div>
<div class="col-lg-4 grey-box">
<div class="section-title">
<h3>Content</h3>
<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#Relationship-of-regression-lines">Relationship of regression lines</a></li>
<li><a href="#Types-of-Linear-Regression">Types of Linear Regression</a></li>
<li><a href="#Mathematical-1">Mathematical Explanation</a></li>
<li><a href="#evaluation-metrics-for-LR">Evaluation Metrics for Linear Regression</a></li>
<li><a href="#OLS">Ordinary least squares method (OLS)</a></li>
<li><a href="#gradient-decent">Gradient-decent method for linear regression</a></li>
<li><a href="#Maximum-likelihood-estimation">Maximum-likelihood-estimation</a></li>
<li><a href="Example-simple-linear">Example on simple linear regression in python</a></li>
<li><a href="#Example-multiple-regression">Example on multiple linear regression in python</a></li>
<li><a href="#reference">Reference</a></li>
</ol>
</div>
</div>
</div>
<section id="introduction">
<h2>Introduction</h2>
<p>Linear regression is a popular and widely used algorithm in machine learning for predicting continuous numeric values. It models the relationship between independent variables (input features) and a dependent variable (target variable) by fitting a linear equation to the observed data. In this section, we will provide a brief overview of linear regression, including the mathematical explanation and figures to aid understanding.</p>
<p>The linear regression algorithm aims to find the best-fit line that represents the relationship between the input features (<code>x</code>) and the target variable (<code>y</code>). The equation for a simple linear regression can be expressed as:</p>
<figure>
<img src="assets/img/data-engineering/Linear-reg0.png" alt="" style="max-width: 40%; max-height: 40%;">
<figcaption></figcaption>
</figure>
$$y = m x +c$$
<p>where</p>
<ul>
<li><code>y</code> represents the target variable or the dependent variable we want to predict.</li>
<li><code>x</code> represents the input feature or the independent variable.</li>
<li><code>m</code> represents the slope of the line, which represents the rate of change of <code>y</code> with respect to <code>x</code>.</li>
<li><code>c</code> represents the <code>y</code>-intercept, which is the value of <code>y</code> when <code>x</code> is equal to <code>0</code>.</li>
</ul>
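<p>As a quick, minimal sketch of this equation (on made-up data points), <code>numpy.polyfit</code> can recover the slope <code>m</code> and intercept <code>c</code> of the best-fit line:</p>
<pre class="language-python"><code>
import numpy as np

# Hypothetical data lying roughly on y = 2x + 1
x = np.array([1, 2, 3, 4, 5])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Fit a degree-1 polynomial; returns [slope m, intercept c]
m, c = np.polyfit(x, y, 1)
print(f"m = {m:.3f}, c = {c:.3f}")
</code></pre>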
<!------------------------>
<h4 id="Relationship-of-regression-lines">Relationship of regression lines</h4>
A line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:
<ol>
<li><strong>Positive Linear Relationship:</strong> If the dependent variable increases on the Y-axis and independent variable increases on X-axis, then such a relationship is termed as a Positive linear relationship.</li>
<li><strong>Negative Linear Relationship:</strong> If the dependent variable decreases on the Y-axis and independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.</li>
</ol>
<img src="assets/img/data-engineering/line-slope.png" alt="" style="max-width: 80%; max-height: auto">
<!------------------------>
<h4 id="Types-of-Linear-Regression">Types of Linear Regression</h4>
<p>Linear regression can be further divided into two types of the algorithm:</p>
<ol>
<li><strong>Simple Linear Regression:</strong> If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
$$y= \beta_0+ \beta_1 x$$
</li>
<li><strong>Multiple Linear regression:</strong> If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
$$y= \beta_0+ \beta_1 x_1 + \beta_2 x_2 + ... +\beta_n x_n $$
</li>
</ol>
<!------------------------>
<h4>Assumptions of Linear Regression</h4>
<ol>
<li><strong>Linearity of residuals: </strong>The relationship between the independent variables and the dependent variable is assumed to be linear. This means that the change in the dependent variable is directly proportional to the change in the independent variables.
<figure>
<img src="assets/img/data-engineering/Linearity.png" alt="" style="max-width: 80%; max-height: auto;">
<figcaption></figcaption>
</figure>
</li>
<li><strong>Independence: </strong>The observations in the dataset are assumed to be independent of each other. There should be no correlation or dependence between the residuals (the differences between the actual and predicted values) of the dependent variable for different observations.
<figure>
<img src="assets/img/data-engineering/independence.png" alt="" style="max-width: 80%; max-height: auto;">
<figcaption></figcaption>
</figure>
</li>
<li><strong>Normal distribution of residuals: </strong>The residuals should follow a normal distribution with a mean equal to zero or close to zero. This is checked in order to verify whether the selected line is actually the line of best fit. If the error terms are non-normally distributed, this suggests that there are a few unusual data points that must be studied closely to build a better model.
<figure>
<img src="assets/img/data-engineering/Disti-assumption.png" alt="" style="max-width: 80%; max-height: auto;">
<figcaption></figcaption>
</figure>
</li>
<li><strong>The equal variance of residuals: </strong>The error terms must have constant variance. This phenomenon is known as Homoscedasticity. The presence of non-constant variance in the error terms is referred to as Heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or extreme leverage values.
<figure>
<img src="assets/img/data-engineering/eual-variance-assumption.png" alt="" style="max-width: 80%; max-height: auto;">
<figcaption></figcaption>
</figure>
</li>
</ol>
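<p>In practice, these assumptions are usually checked by inspecting the residuals of a fitted model. The sketch below (on hypothetical data, not from this article) plots residuals against fitted values to eyeball linearity and homoscedasticity, and a histogram of residuals to eyeball normality:</p>
<pre class="language-python"><code>
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical data: linear trend plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + 2 + rng.normal(0, 1, 100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Residuals vs. fitted values: should show no pattern and constant spread
ax1.scatter(model.predict(X), residuals)
ax1.axhline(0, color='red')
ax1.set(xlabel='Fitted values', ylabel='Residuals')
# Histogram of residuals: should be roughly normal, centered at zero
ax2.hist(residuals, bins=20)
ax2.set(xlabel='Residual', ylabel='Count')
plt.show()
</code></pre>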
</section>
<section>
<h3 id="Mathematical-1">Mathematical Explanation:</h3>
<p>There are parameters <code>β<sub>0</sub></code>, <code>β<sub>1</sub></code>, and ϵ, such that for any fixed value of the independent variable \(x\), the dependent variable \(y\) is related to it through the model equation:</p>
$$y=\beta_0 + \beta_1 x +\epsilon$$
<p>where</p>
<ul>
<li><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>y</mi></math> = Dependent Variable (Target Variable)</li>
<li><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math> = Independent Variable (predictor Variable)</li>
<li><math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>β<!-- β --></mi> <mn>0</mn> </msub> </math> = intercept of the line (Gives an additional degree of freedom)</li>
<li><math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>β<!-- β --></mi> <mn>1</mn> </msub> </math> = Linear regression coefficient (scale factor to each input value).</li>
<li><math xmlns="http://www.w3.org/1998/Math/MathML"> <mi>ϵ<!-- ϵ --></mi> </math> = random error.</li>
</ul>
<p>The goal of linear regression is to estimate the values of the regression coefficients.</p>
<img src="assets/img/data-engineering/Multi-lin-reg.png" alt="" style="max-width: 60%; max-height: 60%;">
<p>This algorithm explains the linear relationship between the dependent(output) variable <math xmlns="http://www.w3.org/1998/Math/MathML"><mi>y</mi></math>
and the independent(predictor) variable <math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math> using a straight line
<math xmlns="http://www.w3.org/1998/Math/MathML"> <mi>y</mi> <mo>=</mo> <msub> <mi>β<!-- β --></mi> <mn>0</mn> </msub> <mo>+</mo> <msub> <mi>β<!-- β --></mi> <mn>1</mn> </msub> <mi>x</mi> </math></p>
<h5>Goal</h5>
<ul>
<li>The goal of the linear regression algorithm is to get the best values for <math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>β<!-- β --></mi> <mn>0</mn> </msub> </math>
and <math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>β<!-- β --></mi> <mn>1</mn> </msub> </math> to find the best fit line. </li>
<li>The best fit line is a line that has the least error which means the error between predicted values and actual values should be minimum.</li>
<li><p>For a dataset with <math xmlns="http://www.w3.org/1998/Math/MathML"> <mi>n</mi> </math> observations <math xmlns="http://www.w3.org/1998/Math/MathML"> <mo stretchy="false">(</mo> <msub> <mi>x</mi> <mi>i</mi> </msub> <mo>,</mo> <msub> <mi>y</mi> <mi>i</mi> </msub> <mo stretchy="false">)</mo> </math>,
where <math xmlns="http://www.w3.org/1998/Math/MathML"> <mi>i</mi> <mo>=</mo> <mn>1</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>3....</mn> <mo>,</mo> <mi>n</mi> </math>, the above function can be written as follows</p>
<p><math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>y</mi> <mi>i</mi> </msub> <mo>=</mo> <msub> <mi>β<!-- β --></mi> <mn>0</mn> </msub> <mo>+</mo> <msub> <mi>β<!-- β --></mi> <mn>1</mn> </msub> <msub> <mi>x</mi> <mi>i</mi> </msub> <mo>+</mo> <msub> <mi>ϵ<!-- ϵ --></mi> <mi>i</mi> </msub> </math></p>
<p>where <math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>y</mi> <mi>i</mi> </msub> </math> is the value of the i-th observation of the dependent variable (outcome variable) in the sample, <math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>x</mi> <mi>i</mi> </msub> </math> is the value of the i-th observation
of the independent variable or feature in the sample, <math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>ϵ</mi> <mi>i</mi> </msub> </math> is the random error (also known as the residual) in predicting the value of <math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>y</mi> <mi>i</mi> </msub> </math>, and
<math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>β</mi> <mn>0</mn> </msub> </math> and <math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>β</mi> <mn>1</mn> </msub> </math> are the regression parameters (also called regression coefficients or feature weights).</p>
</li>
</ul>
In simple linear regression, there is only one independent variable (<math xmlns="http://www.w3.org/1998/Math/MathML"> <mi>x</mi> </math>) and one dependent variable (<math xmlns="http://www.w3.org/1998/Math/MathML"> <mi>y</mi> </math>).
The parameters (coefficients) in simple linear regression can be calculated using the method of <strong>ordinary least squares (OLS)</strong> or <strong>gradient descent method</strong>.
<p>The estimated parameters provide the values of the intercept and slope that best fit the data according to the simple linear regression model.</p>
<hr>
<section id="evaluation-metrics-for-LR">
<h3>Model Evaluation</h3>
<p>To train an accurate linear regression model, we need a way to quantify how well (or poorly) our model performs. In machine learning, we call such performance-measuring functions loss functions. Several popular loss functions exist for regression problems.
To measure our model's performance, we'll use one of the most popular: mean squared error (MSE). Here are some commonly used evaluation metrics: </p>
<ol>
<li><strong>Mean Squared Error (MSE): </strong>MSE quantifies how close a predicted value is to the true value, so we'll use it to quantify how close a regression line is to a set of points.
The Mean Squared Error measures the average squared difference between the predicted values and the actual values of the dependent variable. It is calculated by taking the average of the squared residuals.
$$\boxed{\text{MSE} = \frac{1}{n} \sum \left(y^{(i)} - h_\theta(x^{(i)})\right)^2}$$
where:
<ul>
<li>\(n\) is the number of data points.</li>
<li>\(y^{(i)}\) is the actual value of the dependent variable for the i-th data point.</li>
<li>\(h_\theta (x^{(i)})\) is the predicted value of the dependent variable for the i-th data point.</li>
</ul>
<p>A lower MSE value indicates better model performance, with zero being the best possible value.</p>
</li>
<li><strong>Root Mean Squared Error (RMSE): </strong>The Root Mean Squared Error is the square root of the MSE and provides a more interpretable measure of the average prediction error.
$$\boxed{\text{RMSE} = \sqrt{\text{MSE}}}$$
<p>Like the MSE, a lower RMSE value indicates better model performance.</p>
</li>
<li><strong>Mean Absolute Error (MAE): </strong>
The Mean Absolute Error measures the average absolute difference between the predicted values and the actual values of the dependent variable. It is less sensitive to outliers compared to MSE.
$$\boxed{\text{MAE} = \frac{1}{n} \sum |y^{(i)} - h_\theta(x^{(i)})|}$$
<p>A lower MAE value indicates better model performance.</p>
</li>
<li><strong>R-squared (<math xmlns="http://www.w3.org/1998/Math/MathML"> <msup> <mi>R</mi> <mn>2</mn> </msup> </math>) Coefficient of Determination</strong>
The R-squared value represents the proportion of the variance in the dependent variable that is explained by the independent variables. It ranges from <code>0</code> to <code>1</code>,
where <code>1</code> indicates that the model perfectly predicts the dependent variable. A negative <math xmlns="http://www.w3.org/1998/Math/MathML"> <msup> <mi>R</mi> <mn>2</mn> </msup> </math> means that the model is doing worse than simply predicting the mean of the observed values.
$$\boxed{R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}}$$
where:
<ul>
<li><p>Residual sum of Squares (RSS) is defined as the sum of squares of the residual for each data point in the plot/data. It is the measure of the difference between the expected and the actual observed output.</p>
$$\text{RSS} = \sum_{i=1}^{n} \left(y^{(i)} - h_\theta(x^{(i)})\right)^2$$
</li>
<li><p>Total Sum of Squares (TSS) is defined as the sum of squared deviations of the data points from the mean of the response variable. Mathematically, TSS is</p>
$$\text{TSS} = \sum \left( y^{(i)}- \bar{y}\right)^2$$
<p>where:</p>
<ul>
<li>\(n\) is the number of data points.</li>
<li>\(\bar{y}\) is the mean of the observed values of the dependent variable.</li>
</ul>
</li>
</ul>
<br>
<p>A higher <math xmlns="http://www.w3.org/1998/Math/MathML"> <msup> <mi>R</mi> <mn>2</mn> </msup> </math>
value indicates a better fit of the model to the data. <math xmlns="http://www.w3.org/1998/Math/MathML"> <msup> <mi>R</mi> <mn>2</mn> </msup> </math>
is commonly interpreted as the percentage of the variation in the dependent variable that is explained by the independent variables.
However, it is important to note that <math xmlns="http://www.w3.org/1998/Math/MathML"> <msup> <mi>R</mi> <mn>2</mn> </msup> </math>
does not determine the causal relationship between the independent and dependent variables. It is solely a measure of how well the model fits the data.</p>
<div style="background-color: #f2f2f2; padding: 15px;">
<p>
<strong>Note: </strong> A higher R-squared value indicates a better fit of the model to the data. However, it's essential to consider other factors and use
R-squared in conjunction with other evaluation metrics to fully assess the model's performance. R-squared has limitations, especially in the case of overfitting,
where a model may fit the training data very well but perform poorly on new, unseen data.
</p>
</div>
</li>
<li><strong>Adjusted R-squared: </strong>
<p>The Adjusted R-squared accounts for the number of independent variables in the model. It penalizes the inclusion of irrelevant variables and rewards the inclusion of relevant variables.</p>
$$\boxed{\text{Adjusted}~ R^2 = 1-\left[\frac{(1 - R^2)(n - 1)}{n - p - 1}\right]}$$
<p>Where:</p>
<ul>
<li>n is the number of data points.</li>
<li>p is the number of independent variables.</li>
</ul>
<p>A higher Adjusted R-squared value indicates a better fit of the model while considering the complexity of the model.</p>
<p>These evaluation metrics help assess the performance of a linear regression model by quantifying the accuracy of the predictions and the extent to which the independent variables explain the
dependent variable. It is important to consider multiple metrics to gain a comprehensive understanding of the model's performance.</p>
</li>
</ol>
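<p>All of these metrics are a few lines of NumPy. The sketch below computes each one on a pair of hypothetical actual/predicted arrays; <code>sklearn.metrics</code> (<code>mean_squared_error</code>, <code>mean_absolute_error</code>, <code>r2_score</code>) should give the same results:</p>
<pre class="language-python"><code>
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])
n = len(y_true)

mse = np.mean((y_true - y_pred) ** 2)          # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))         # Mean Absolute Error
rss = np.sum((y_true - y_pred) ** 2)           # Residual Sum of Squares
tss = np.sum((y_true - y_true.mean()) ** 2)    # Total Sum of Squares
r2 = 1 - rss / tss                             # R-squared

p = 1  # assumed number of independent variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # Adjusted R-squared
print(mse, rmse, mae, r2, adj_r2)
</code></pre>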
<p></p>
<strong>Selecting An Evaluation Metric:</strong>
<p>Many methods exist for evaluating regression models, each with different concerns around interpretability, theory, and usability. The evaluation metric should reflect whatever it is you actually
care about when making predictions. For example, when we use MSE, we are implicitly saying that we think the cost of our prediction error should reflect the quadratic (squared) distance between
what we predicted and what is correct. This may work well if we want to punish outliers, or if our errors are best summarized by the mean, but it comes at the cost of interpretability: we output our error
in squared units (though this may be fixed with RMSE). If instead we wanted our error to reflect the linear distance between what we predicted and what is correct, or we wanted our errors summarized by
the median, we could try something like Mean Absolute Error (MAE). Whatever the case, you should think of your evaluation metric as part of your modeling process, and select the best metric based
on the specific concerns of your use-case.</p>
<strong>Are Our Coefficients Valid?: </strong>
<p>In research publications and statistical software, coefficients of regression models are often presented with associated p-values. These p-values come from traditional null hypothesis statistical tests: t-tests are used to measure whether a given coefficient is significantly different from zero (the null hypothesis being that a particular coefficient
β<sub>i</sub> equals zero), while F-tests are used to measure whether any of the terms in a regression model are significantly different from zero. Different opinions exist on the utility of such tests.</p>
</section>
<!--------------------->
<h3 id="OLS">Ordinary least squares method (OLS)</h3>
Consider a simple linear regression model:
$$y_i=\theta_0 +\theta_1 x_i +\epsilon_i$$
<p>Let's denote the predicted value of the dependent variable \(y_i\) after optimization by \(h_\theta(x_i)\).</p>
<p>Therefore, we can write the error (residual) as: \(\epsilon_i = y_i - h_\theta(x_i) \).</p>
<p><strong>Objective or cost function:</strong></p>
The objective is to minimize the sum of squared residuals:
$$\text{RSS}(\theta_0, \theta_1) = \sum_{i=1}^n \left(y_i - (\theta_0+\theta_1 x_i)\right)^2$$
<p>Where:</p>
<ul>
<li><math xmlns="http://www.w3.org/1998/Math/MathML"> <mi>n</mi> </math> is the number of data points.</li>
<li><math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>y</mi> <mi>i</mi> </msub> </math> is the actual value of the dependent variable for the i-th data point.</li>
<li>\(\theta_0, \theta_1\) are the coefficients.</li>
</ul>
Now taking partial derivatives with respect to \(\theta_0\) and \(\theta_1\) and setting them to zero gives the OLS estimates:
$$\hat{\theta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
and
$$\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$$
where
<ul>
<li>\(\bar{x}\) and \(\bar{y}\) are the means of the independent and dependent variables, respectively. </li>
</ul>
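<p>The two closed-form estimates translate directly into NumPy; here is a minimal sketch on hypothetical data:</p>
<pre class="language-python"><code>
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
theta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
theta_0 = y_bar - theta_1 * x_bar                                       # intercept
print(f"theta_0 = {theta_0:.3f}, theta_1 = {theta_1:.3f}")
</code></pre>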
<hr>
<!----------------------------------->
<h3 id ="gradient-decent">Gradient Descent method for Linear Regression:</h3>
<figure>
<img src="assets/img/machine-ln/gradient-discent.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption style="text-align: center;">Gradient descent is the process of taking steps to find the minimum of the loss surface. Image Credit: <a href="https://www.researchgate.net/figure/Non-convex-optimization-We-utilize-stochastic-gradient-descent-to-find-a-local-optimum_fig1_325142728" target="_blank">Alexander Amini</a></figcaption>
</figure>
<ul>
<li>A regression model can use the gradient descent algorithm to update the coefficients of the line: starting from random coefficient values, it iteratively updates them so as to reduce the cost function until it reaches its minimum.</li>
<li>Gradient Descent is an iterative optimization algorithm commonly used in machine learning to find the optimal parameters in a model. It can also be applied to linear regression to estimate the parameters (coefficients) that minimize the cost function.</li>
<li>The steps involved in using Gradient Descent for Linear Regression are as follows:
<ol>
<li><strong>Define the Cost Function: </strong>The cost function for linear regression is the Mean Squared Error (MSE), which measures the average squared difference between the predicted values \(h_\theta(x^{(i)})\) and the actual values \(y^{(i)}\) of the dependent variable (the extra factor of \(1/2\) is a convention that simplifies the derivatives below).
$$\text{MSE} = \frac{1}{2n}\sum_{i=1}^n \left(y^{(i)} - h_\theta(x^{(i)})\right)^2$$
<p>Where:</p>
<ul>
<li><math xmlns="http://www.w3.org/1998/Math/MathML"> <mi>n</mi> </math> is the number of data points.</li>
<li>\(y^{(i)}\) is the actual value of the dependent variable for the i-th data point.</li>
<li>\(h_\theta(x^{(i)})\) is the predicted value of the dependent variable for the i-th data point.</li>
</ul>
</li>
<li><strong>Initialize the Parameters: </strong>Start by initializing the parameters (coefficients) with random values. Typically, they are initialized as zero or small random values.</li>
<li><strong>Calculate the Gradient: </strong>Compute the gradient of the cost function with respect to each parameter. The gradient represents the direction of steepest ascent in the cost function space (for more details, see <a href="Linear-Parameter-estimation.html">Calculation of the equations</a>).
$$\frac{\partial (MSE)}{\partial \theta_0} = \frac{1}{n}\sum (h_\theta(x^{(i)}) - y^{(i)})$$
$$\frac{\partial (MSE)}{\partial \theta_i} = \frac{1}{n}\sum (h_\theta(x^{(i)}) - y^{(i)})\times x^{(i)}$$
<p>Where:</p>
<ul>
<li>\(\frac{\partial (MSE)}{\partial \theta_0}\) is the gradient with respect to the y-intercept parameter (\(\theta_0\)).</li>
<li>\(\frac{\partial (MSE)}{\partial \theta_1}\) is the gradient with respect to the slope parameter (\(\theta_1\)).</li>
<li>\(h_\theta(x^{(i)})\) is the predicted value of the dependent variable for the i-th data point.</li>
<li>\(y^{(i)}\) is the actual value of the dependent variable for the i-th data point.</li>
<li>\(x^{(i)}\) is the value of the independent variable for the i-th data point.</li>
</ul>
</li>
<li><strong>Update the Parameters: </strong> Update the parameters using the gradient and a learning rate (α), which determines the step size in each iteration.
$$\theta_0 = \theta_0 - \alpha \times \frac{\partial (MSE)}{\partial \theta_0}$$
$$\theta_i = \theta_i - \alpha \times \frac{\partial (MSE)}{\partial \theta_i}$$
<p>Repeat this update process for a specified number of iterations or until the change in the cost function becomes sufficiently small.</p>
</li>
<li><strong>Predict: </strong>Once the parameters have converged or reached the desired number of iterations, use the final parameter values to make predictions on new data.
$$h_\theta (x_i)= \theta_0 +\theta_1 x_1+ ... +\theta_n x_n$$
<p>Gradient Descent iteratively adjusts the parameters by updating them in the direction of the negative gradient until it reaches a minimum point in the cost function. This process allows for the estimation of optimal parameters in linear regression, enabling the model to make accurate predictions on unseen data.</p>
<figure>
<img src="assets/img/data-engineering/optimal-reg2.png" alt="" style="max-width: 90%; max-height: auto;">
<figcaption></figcaption>
</figure>
<p>Let's take an example to understand this. If we want to go from the top-left point of the surface to the bottom of the pit, a discrete number of steps can be taken to reach the bottom.</p>
<ul>
<li>If you decide to take larger steps each time, you may reach the bottom sooner, but there is a chance that you overshoot the bottom of the pit and end up not even near it.</li>
<li>In the gradient descent algorithm, the number of steps you’re taking can be considered as the learning rate, and this decides how fast the algorithm converges to the minima.</li>
</ul>
</li>
</ol>
</li>
</ul>
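<p>Putting the steps above together, here is a minimal sketch of batch gradient descent for simple linear regression on hypothetical data (the learning rate and iteration count are arbitrary choices, not prescriptions):</p>
<pre class="language-python"><code>
import numpy as np

# Hypothetical data (same as in the OLS sketch above)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

theta_0, theta_1 = 0.0, 0.0   # step 2: initialize parameters
alpha = 0.01                  # learning rate

for _ in range(10000):
    y_hat = theta_0 + theta_1 * x        # current predictions
    grad_0 = np.mean(y_hat - y)          # step 3: d(MSE)/d(theta_0)
    grad_1 = np.mean((y_hat - y) * x)    # step 3: d(MSE)/d(theta_1)
    theta_0 -= alpha * grad_0            # step 4: update parameters
    theta_1 -= alpha * grad_1

print(f"theta_0 = {theta_0:.3f}, theta_1 = {theta_1:.3f}")
</code></pre>
<p>With a suitable learning rate, the result converges to essentially the same coefficients as the OLS formulas above.</p>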
<br>
<h4>Comparison between the OLS and gradient descent methods:</h4>
<table>
<tr>
<th></th>
<th>Gradient descent</th>
<th>OLS</th>
</tr>
<tr>
<td><strong>Optimization Technique:</strong></td>
<td>Gradient descent is an iterative optimization algorithm. It starts with initial values for the parameters and updates them iteratively in the direction of the negative gradient of the objective function until convergence.</td>
<td>OLS is an analytical method that involves finding the values of the model parameters directly by minimizing a closed-form expression. It computes the derivatives of the objective function with respect to the parameters, sets them equal to zero, and solves for the parameters.</td>
</tr>
<tr>
<td><strong>Approach to Minimization:</strong></td>
<td>Gradient descent minimizes the objective/cost function by iteratively adjusting the parameters in the direction opposite to the gradient. It moves toward the minimum of the function in steps determined by the learning rate.</td>
<td>OLS minimizes the sum of squared residuals directly. It finds the values of the parameters that lead to the smallest possible sum of squared differences between observed and predicted values.</td>
</tr>
<tr>
<td><strong>Convergence:</strong></td>
<td>Gradient descent is an iterative process. It continues updating parameters until a stopping criterion is met, such as reaching a specified number of iterations or achieving a small change in the objective function.</td>
<td>OLS provides a closed-form solution, and once the solution is found, it is exact and does not require further iteration.</td>
</tr>
<tr>
<td><strong>Computational Complexity:</strong></td>
<td>The computational complexity of gradient descent depends on the number of iterations and the size of the dataset. It may be more computationally efficient for very large datasets or complex models.</td>
<td>The computational complexity of OLS depends on the number of features in the model. For simple linear regression, there is a closed-form solution. However, for multiple linear regression with many features, matrix operations are involved.</td>
</tr>
<tr>
<td><strong>Sensitivity to Initial Conditions:</strong></td>
<td> The convergence of gradient descent can be sensitive to the choice of the initial parameter values and the learning rate.</td>
<td>OLS directly computes parameter estimates, and the solution is not sensitive to initial conditions.</td>
</tr>
<tr>
<td><strong>Regularization:</strong></td>
<td>Gradient descent can be easily extended to include regularization terms like L1 or L2 regularization to prevent overfitting.</td>
<td>OLS does not inherently include regularization terms to prevent overfitting.</td>
</tr>
</table>
<br>
<!----------------->
<div class="important-boxx">
<h5 id="Maximum-likelihood-estimation"><strong>Maximum Likelihood Estimation (MLE)</strong></h5>
For this, check: <a href="mle.html" target="_blank">Maximum Likelihood Estimation (MLE)</a>.
</div>
<br>
<div class="important-box">
For more details on overfitting and underfitting, go to the following link: <a href="Ridge-lasso-elasticnet.html">Ridge, lasso and Elastic net algorithms</a>.
On that page, you'll find a comprehensive guide on selecting the most relevant variables when dealing with multiple factors. The detailed description outlines strategies to identify the best variables,
ensuring that a specific algorithm aids in constructing a model that guards against overfitting.
</div>
</section>
<section id="MLE-gradient-descent">
<h3>Difference between MLE and Gradient descent:</h3>
Maximum Likelihood Estimation (MLE) and gradient descent are distinct methods used in the context of parameter estimation, often in the realm of statistical modeling or machine learning. While they share the common objective of finding parameter values that optimize a given criterion, they differ in their approaches, use cases, and mathematical foundations.
<table>
<tr>
<th><strong></strong></th>
<th><strong>MLE</strong></th>
<th><strong>Gradient Descent</strong></th>
</tr>
<tr>
<td><strong>Optimization Approach:</strong></td>
<td>MLE is a statistical method that aims to find the parameter values that maximize the likelihood function or, equivalently, maximize the log-likelihood function. It often involves solving equations or using optimization techniques.</td>
<td>Gradient descent is an iterative optimization algorithm that updates parameter values in the direction of the negative gradient of an objective function (e.g., mean squared error in regression problems) until convergence.</td>
</tr>
<tr>
<td><strong>Model Assumptions:</strong></td>
<td> MLE is a more general framework applicable to a wide range of statistical models. It is often used when there is a probabilistic model describing the distribution of the observed data.</td>
<td>Gradient descent is commonly used in the context of machine learning, where the objective function is often related to the error or loss of a predictive model. It may not necessarily assume a specific probabilistic model.</td>
</tr>
<tr>
<td><strong>Convergence Criteria:</strong></td>
<td>MLE often involves solving equations or using optimization techniques that guarantee convergence to a global maximum (for well-behaved problems).</td>
<td>Convergence in gradient descent is achieved iteratively, and the stopping criteria may include reaching a certain number of iterations, achieving a small change in the objective function, or satisfying a threshold condition.</td>
</tr>
</table>
<br>
<h4>Why Use MLE Instead of Gradient Descent (or Vice Versa)?</h4>
The common objective of both MLE and gradient descent is to find parameter values that optimize a given criterion. In the context of machine learning, this criterion is typically associated with minimizing a loss function or maximizing a likelihood function.
<ol>
<li><strong>Nature of the Problem: </strong>
<ul>
<li>Use MLE when the problem is naturally formulated in a probabilistic framework, and you have a clear understanding of the likelihood function.</li>
<li>Use gradient descent when dealing with machine learning problems where the focus is on minimizing a loss function associated with prediction errors.</li>
</ul>
</li>
<li><strong>Analytical Solutions:</strong>
<ul>
<li>MLE often provides analytical solutions for parameter estimates if a closed-form solution exists for the likelihood function.</li>
<li>Gradient descent is particularly useful for problems where analytical solutions are difficult or impossible to obtain.</li>
</ul>
</li>
<li><strong>Statistical Inference:</strong>
<ul>
<li>MLE is widely used in statistical inference, where parameter estimates come with statistical properties such as standard errors and confidence intervals.</li>
<li>Gradient descent is commonly employed in machine learning for training predictive models, and its primary focus is on minimizing prediction errors.</li>
</ul>
</li>
<li><strong>Complexity and Robustness:</strong>
<ul>
<li>MLE can be more computationally intensive, especially when solving equations or performing optimizations involving complex likelihood functions.</li>
<li>Gradient descent is often preferred for its simplicity and efficiency, particularly in large-scale machine learning problems.</li>
</ul>
</li>
</ol>
</section>
<!-----------Example ------------->
<section id="Example-simple-linear">
<h3><a href="https://github.com/arunp77/Machine-Learning/tree/main/Projects-ML/Reg-models" target="_blank">Example on Simple linear method</a></h3>
Let's consider a dataset of people with their weight and height. The dataset can be found at:
<a href="https://github.com/arunp77/Machine-Learning/blob/main/Projects-ML/Reg-models/Weight_height.csv" target="_blank">Dataset</a>.
Now we want to fit a linear regression model and, based on the trained model, predict the height of a person with a given weight.
This can be done through the gradient descent method or the OLS method. We will first look at the gradient descent method.
<ul>
<li>Let's first import the required libraries:
<pre class="language-python"><code>
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
</code></pre>
</li>
<li>Next we can load the dataset using:
<pre class="language-python"><code>df = pd.read_csv('Weight_height.csv')</code></pre>
The table has two columns: <pre class="language-python"><code>df.head()</code></pre>
<table>
<thead>
<tr>
<th>Weight</th>
<th>Height</th>
</tr>
</thead>
<tbody>
<tr>
<td>45</td>
<td>120</td>
</tr>
<tr>
<td>58</td>
<td>125</td>
</tr>
<tr>
<td>48</td>
<td>123</td>
</tr>
<tr>
<td>60</td>
<td>145</td>
</tr>
<tr>
<td>70</td>
<td>160</td>
</tr>
</tbody>
</table>
</li>
<li>The scatter and pair plots can be obtained using the <code>seaborn</code> pairplot: <code>sns.pairplot(df)</code>
<figure>
<img src="assets/img/machine-ln/pair-plot.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption></figcaption>
</figure>
</li>
<li>Now create the independent and dependent features:
<pre class="language-python"><code>
# Independent and dependent features
X = df[['Weight']] ## Independent features should be a DataFrame or 2-dimensional array
y = df['Height']   ## This variable can be a Series or 1D array
</code></pre>
</li>
<li>Now divide the dataset into training and test sets.
<pre class="language-python"><code>
## Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,
test_size=0.25, random_state=42)
</code></pre>
</li>
<li>It is best to standardize the training and test datasets.
<pre class="language-python"><code>
## Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
</code></pre>
Note that only <code>transform</code> is used for <code>X_test</code>. The reason is the following:
<code>scaler = StandardScaler()</code> sets up a scaler that will compute the mean and standard deviation of a dataset.
When we pass <code>X_train</code> to <code>scaler.fit_transform</code>, it computes the mean and standard deviation of the training data
and then transforms <code>X_train</code>. When we then call <code>scaler.transform</code> on <code>X_test</code>, the same training statistics are reused, so they do not need to be recalculated.
</li>
<li>Now we fit the model to the training data:
<pre class="language-python"><code>
## Apply linear regression
from sklearn.linear_model import LinearRegression
regression = LinearRegression(n_jobs=-1)
regression.fit(X_train, y_train)
print(f"Coefficient or slop is: {regression.coef_}")
print(f"Intercept is: {regression.intercept_}")
</code></pre>
The output will be
<pre>
Coefficient or slope is: [17.81386924]
Intercept is: 155.88235294117646
</pre>
</li>
<li>We can see the fitted line:
<pre class="language-python"><code>
## plot Training data plot and best fit line
plt.scatter(X_train, y_train)
plt.plot(X_train, regression.predict(X_train))
</code></pre>
</li>
<figure>
<img src="assets/img/machine-ln/fit-line.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption></figcaption>
</figure>
<li>Prediction on the test data:
<ul>
<li>predicted height output = intercept + coef_(Weight)</li>
<li>y_pred_test = regression.intercept_ + regression.coef_ * X_test</li>
</ul>
<pre class="language-python"><code>
## Prediction for the test data
y_pred = regression.predict(X_test)
</code></pre>
</li>
<li><strong>Model testing:</strong>
<pre class="language-python"><code>
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"The Mean squared error is: {mse}")
print(f"The mean absolute error is: {mae}")
print(f"The root mean squared error is: {rmse}")
</code></pre>
The output is:
<pre>
The Mean squared error is: 119.78211426580515
The mean absolute error is: 9.751561944430335
The root mean squared error is: 10.944501554013556
</pre>
Similarly, R-squared can be calculated as:
<pre class="language-python"><code>
from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)
print(f"The r-squared value for the model is= {score}")
</code></pre>
and the output is: <code>The r-squared value for the model is= 0.724726708358188</code>
Similarly, the adjusted R-squared can be obtained as:
$$ \text{Adjusted R-squared} = 1- \frac{(1-R^2)(n-1)}{n-k-1}$$
where
<ul>
<li>\(R^2\) is the R-squared value of the model,</li>
<li>n is the number of observations,</li>
<li>k is the number of predictor variables.</li>
</ul>
<pre class="language-python"><code>
## displaying adjusted r squared value
1 - (1-score)*(len(y_test) - 1) / (len(y_test) - X_test.shape[1]-1)
</code></pre>
and the adjusted r-squared value is: 0.655908385447735
</li>
<li>Prediction of height for a person of weight 72 Kgs.
<pre class="language-python"><code>
### Prediction for new data
new_weight = [[72]]
scaled_new_weight = scaler.transform(new_weight) # since we standardized the training data, we must standardize the new weight too
# Make the prediction
prediction_new_height = regression.predict(scaled_new_weight)
print(f"Height for the weight 72 Kg is: {prediction_new_height}")
</code></pre>
</li>
The predicted height is 155.37 cms.
</ul>
<strong>OLS method:</strong>
We can also use the OLS method for the above example.
<pre class="language-python"><code>
import statsmodels.api as sm
model = sm.OLS(y_train, X_train).fit()
prediction = model.predict(X_test)
print(prediction)
</code></pre>
This will give the prediction points for the X_test dataset. The model summary:
<pre class="language-python"><code>print(model.summary())</code></pre>
shows that the two methods discussed give the same coefficient and intercept.
<figure>
<img src="assets/img/machine-ln/ols-method.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption></figcaption>
</figure>
</section>
<!-----------Example ------------->
<section id="Example-multiple-regression">
<h3>Example on multiple regression model</h3>
This example generates synthetic data for advertising spend, number of salespeople, and sales. It then performs multiple
linear regression analysis using <code>Advertising_Spend</code> and <code>Num_Salespeople</code> as independent variables to predict <code>Sales</code>.
The code also includes visualization to compare actual vs. predicted sales.
<ul>
<li><strong>Importing libraries</strong>
<pre class="language-python"><code>
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
</code></pre>
</li>
<li><strong>Generating random data and creating dataframe: </strong>
<pre class="language-python"><code>
# Set a random seed for reproducibility
np.random.seed(42)
# Generate synthetic data
num_samples = 200
advertising_spend = np.random.uniform(50, 200, num_samples)
num_salespeople = np.random.randint(3, 10, num_samples)
error_term = np.random.normal(0, 20, num_samples)
sales = 50 + 2 * advertising_spend + 5 * num_salespeople + error_term
# Create a DataFrame
df = pd.DataFrame({'Advertising_Spend': advertising_spend,
'Num_Salespeople': num_salespeople, 'Sales': sales})
df.head()
</code></pre>
<table>
<thead>
<tr>
<th></th>
<th>Advertising_Spend</th>
<th>Num_Salespeople</th>
<th>Sales</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>106.181018</td>
<td>6</td>
<td>282.863129</td>
</tr>
<tr>
<td>1</td>
<td>192.607146</td>
<td>5</td>
<td>447.147707</td>
</tr>
<tr>
<td>2</td>
<td>159.799091</td>
<td>3</td>
<td>419.907267</td>
</tr>
<tr>
<td>3</td>
<td>139.798773 </td>
<td>6</td>
<td>367.697179</td>
</tr>
<tr>
<td>4</td>
<td>73.402796</td>
<td>8</td>
<td>211.587913</td>
</tr>
</tbody>
</table>
The pair plot of this gives: <code>sns.pairplot(df)</code>
<figure>
<img src="assets/img/machine-ln/pair-plot.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption></figcaption>
</figure>
and the correlations are given by <code>df.corr()</code>, which produces the following output:
<figure>
<img src="assets/img/machine-ln/correlation-heatmap.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption></figcaption>
</figure>
<div class="box">
<h5>Cross validation:</h5>
Here we are going to use the cross-validation method to assess the performance and
generalizability of a predictive model. The primary goal of cross-validation is to ensure that a model trained on a particular dataset can generalize
well to new, unseen data. It helps in estimating how well the model performs on an independent dataset. For more details, you can see this <a href="https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right" target="_blank">document (well explained)</a>,
or you can look at <code>sklearn</code> website: <a href="https://scikit-learn.org/stable/modules/cross_validation.html" target="_blank">Cross-validation: evaluating estimator performance</a>.
<figure>
<img src="assets/img/machine-ln/grid_search_cross_validation.png" alt="" style="max-width: 80%; max-height: auto;">
<figcaption style="text-align: center;"><strong>k-fold</strong></figcaption>
</figure>
<p>In machine learning (ML), generalization usually refers to the ability of an algorithm to be effective across various inputs. It means that the ML model does not encounter performance degradation on the new inputs from the same distribution of the training data. </p>
<p><strong>Definition: </strong>Cross-validation is a technique for evaluating a machine learning model and testing its performance. CV is commonly used in applied ML tasks. It helps to compare and select an appropriate model for the specific predictive modeling problem.</p>
<p>There are a lot of different techniques that may be used to cross-validate a model. Still, all of them have a similar algorithm:</p>
<ul>
<li>Divide the dataset into two parts: one for training, the other for testing</li>
<li>Train the model on the training set</li>
<li>Validate the model on the test set</li>
<li>Repeat steps 1-3 a couple of times. This number depends on the CV method that you are using</li>
</ul>
<p>There are plenty of CV techniques of which some of them are:</p>
<ul>
<li>Hold-out</li>
<li>K-folds</li>
<li>Leave-one-out</li>
<li>Leave-p-out</li>
<li>Stratified K-folds</li>
<li>Repeated K-folds</li>
<li>Nested K-folds</li>
<li>Time series CV</li>
</ul>
For details, see: <a href="https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right" target="_blank">Cross-validation</a>.
</div>
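<p>As a concrete illustration, the sketch below shows the k-fold splitting that <code>cross_val_score</code> (used a few steps below) performs internally, on a stand-in array:</p>
<pre class="language-python"><code>
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # stand-in for a dataset of 10 observations
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(data)):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
</code></pre>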
</li>
<li><strong>Feature selection: </strong>Now let's select the features.
<pre class="language-python"><code>
X= df.iloc[:,:-1]
y = df.iloc[:,-1]
</code></pre>
</li>
<li><strong>Train-test split and standardization: </strong>We split the data into training and test sets and standardize it (note that, as in the previous example, only <code>transform</code> is applied to the test set):
<pre class="language-python"><code>
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
</code></pre>
</li>
<li><strong>Regression model:</strong>
<pre class="language-python"><code>
from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(X_train, y_train)
</code></pre>
We can calculate the coefficient as:
<pre class="language-python"><code>
print(regression.coef_)
</code></pre>
which gave: <code>[86.62120427 6.33750727]</code>.
</li>
<li><strong>Cross-validation method: </strong> Here we use the cross-validation method. It is not mandatory, but it is best practice.
<pre class="language-python"><code>
from sklearn.model_selection import cross_val_score
validation_score = cross_val_score(regression, X_train, y_train, scoring='neg_mean_squared_error', cv=3)
validation_score
</code></pre>
The output is <code>validation_score = array([-437.35690029, -363.86846439, -321.13612303])</code>, which in fact represents the negative MSE for each fold.
The mean of these can be found as: <code>np.mean(validation_score) = -374.1204959020976 </code>.
</li>
<li><strong>Prediction: </strong>Now we find the prediction as follows:
<pre class="language-python"><code>
## prediction
y_pred = regression.predict(X_test)
</code></pre>
</li>
<li><strong>Calculating the MSE, MAE:</strong>
<pre class="language-python"><code>
from sklearn.metrics import mean_absolute_error, mean_squared_error
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"The Mean squared error is: {mse}")
print(f"The mean absolute error is: {mae}")
print(f"The root mean squared error is: {rmse}")
</code></pre>
which in turn gives:
<pre>
The Mean squared error is: 508.99064011586074
The mean absolute error is: 17.628284263814095
The root mean squared error is: 22.560820909618087
</pre>
</li>
<li><strong>R-squared value: </strong>
<pre class="language-python"><code>
from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)
print(f"The r-squared value for the model is= {score}")
</code></pre>
The r-squared value for the model is= 0.9326105266469662.
</li>
<li><strong>Calculating the residuals and seeing their distribution:</strong>
<pre class="language-python"><code>
residuals = y_test-y_pred
# plot residuals
sns.displot(residuals, kind='kde')
</code></pre>
<figure>
<img src="assets/img/machine-ln/distribution-residuals.png" alt="" style="max-width: 80%; max-height: auto;">
<figcaption></figcaption>
</figure>
It is definitely not a perfect Gaussian distribution, but it still resembles one.
</li>
</ul>
<strong>OLS method:</strong>
We can also use the OLS method here.
<pre class="language-python"><code>
import statsmodels.api as sm
# As before, add a constant column so that statsmodels fits an intercept
model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
prediction = model.predict(sm.add_constant(X_test))
print(model.summary())
</code></pre>
<figure>
<img src="assets/img/machine-ln/multi-ols.png" alt="" style="max-width: 80%; max-height: auto;">
<figcaption></figcaption>
</figure>
So we can compare the two methods and see that the coefficients calculated by them are approximately the same.