Lab Assignment 2

Python Lab Assignment-2

Authors

Lab ID-1

1. Nagababu Chilukuri, Class Id-26

2. Nikita Goyal, Class Id-7

3. Ronnie Dean Antwiler II, Class Id-1

Introduction

This lab assignment covers data analysis, data exploration, data cleaning, handling missing values, and predicting results with several machine learning algorithms. Natural language processing (NLP) is also used to explore text analysis concepts.

The datasets used in this lab assignment were:

Titanic
Wine Quality
Startups Sales and Profit

Objective

The main objective is to understand and learn various machine learning algorithms and NLP concepts using scikit-learn:

KNN, SVM, Naive Bayes classifier
K-means clustering
Tokenization, lemmatization, trigrams
Linear and multiple regression

Workflow

Question-1

Pick any dataset from the dataset sheet in the class sheet or online that includes both numeric and non-numeric features.

Perform exploratory data analysis on the data set (like handling null values, removing the features not correlated to the target class, encoding the categorical features, ...)
Apply the three classification algorithms Naive Bayes, SVM and KNN on the chosen data set and report which classifier gives the best result.

Solution:

Loading the Data

Lab1_Q1_1

Analyzing the Data through Visualization

Lab2_Q1_2

Checking for Missing Values and Replacing Them

Lab2_Q1_3

Lab2_Q1_4

Lab2_Q1_5
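
The missing-value check and replacement shown in the screenshots above can be sketched roughly as follows. This is a minimal illustration, not the lab's actual code: the small DataFrame and its column names (`Age`, `Embarked`, `Fare`) are made up for the example, and filling numeric gaps with the median and categorical gaps with the mode is an assumed strategy.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the real dataset (hypothetical columns).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 26.55],
})

# Count the missing values in each column.
missing_before = df.isnull().sum()
print(missing_before)

# Assumed strategy: median for numeric gaps, most frequent value for categorical gaps.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Confirm that no missing values remain.
missing_after = df.isnull().sum()
print(missing_after)
```

Using the median rather than the mean keeps the fill value robust to outliers, which is often a reasonable default for skewed numeric columns.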

Exploring and Analyzing the Data

Lab2_Q1_6 Lab2_Q1_7 Lab1_Q1_8 Lab2_Q1_9 Lab2_Q1_10 Lab2_Q1_11 Lab2_Q1_12 Lab2_Q1_13 Lab2_Q1_14 Lab2_Q1_15 Lab2_Q1_16 Lab2_Q1_17

Splitting the Data for Validation

Lab2_Q1_18

Evaluating the Model Using Three Classifiers

We train and evaluate three classifiers: SVM, Naive Bayes, and KNN.

Lab1_Q1_19
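
A comparison of the three classifiers could look something like the sketch below. It is not the lab's code: the data is synthetic (generated with `make_classification`) rather than the cleaned dataset, and the 80/20 split, the feature scaling, and `n_neighbors=5` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the lab's real features are not reproduced here.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Hold out 20% of the rows for validation (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# SVM and KNN are distance based, so they are wrapped with a scaler.
models = {
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Fit each model on the training rows and score it on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the classifiers from highest to lowest accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Which classifier comes out on top depends on the data and the preprocessing, so the ranking printed here says nothing about the ranking on the lab's dataset.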

Results

lab2_Q1_20

The SVM classifier gives a higher accuracy score than the other two classifiers.

Question-2

Choose any dataset of your choice. Apply K-means on the dataset and visualize the clusters using matplotlib or seaborn.

Report which K is the best using the elbow method.
Evaluate with silhouette score or other scores relevant for unsupervised approaches (before applying clustering, clean the data set with the EDA learned in class).

Solution:

Loading the Data

Lab2_Q2_1

Lab2_Q2_2

Lab2_Q2_3

Data Exploration

Lab2_Q2_4 Lab2_Q2_19 Lab2_Q2_20 Lab2_Q2_5 Lab2_Q2_6

Applying the K-means Clustering Model

Lab2_Q2_7

Clustering with Different Values of K and Calculating the Silhouette Score

Lab2_Q2_8 Lab2_Q2_9 Lab2_Q2_10 Lab2_Q2_11

Lab2_Q2_12 Lab2_Q2_13 Lab2_Q2_14 Lab2_Q2_15 Lab2_Q2_16 Lab2_Q2_17

Elbow Method Analysis

Lab2_Q2_18
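
The elbow and silhouette analysis can be sketched along these lines. This is an illustration only: the points come from `make_blobs` instead of the lab's dataset, and the range of K values (2 to 7) and `n_init=10` are assumptions. The inertia values collected here are what one would plot against K (for example with matplotlib) to look for the elbow.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three blobs; not the lab's dataset.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

inertias = {}     # within-cluster sum of squares per K (elbow method)
silhouettes = {}  # mean silhouette coefficient per K

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, labels)

for k in inertias:
    print(f"K={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

# The K with the highest silhouette score is one candidate for the best K.
best_k = max(silhouettes, key=silhouettes.get)
print("Best K by silhouette score:", best_k)
```

The elbow is read off the inertia curve by eye, so pairing it with the silhouette score gives a second, numeric opinion on the choice of K.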

Question-3

Write a program that takes an input file and uses the simple approach below to summarize the text. Link to input file: https://umkc.box.com/s/7by0f4540cdbdp3pm60h5fxxffefsvrwa.

Read the data from a file
Tokenize the text into words and apply lemmatization technique on each word.
Find all the trigrams for the words.
Extract the top 10 of the most repeated trigrams based on their count.
Go through the text in the file
Find all the sentences with the most repeated trigrams.
Extract those sentences and concatenate them.
Print the concatenated result.

Solution

Reading the Data from the File

Lab2_Q3_1

Tokenizing into words and applying lemmatization

Lab2_Q3_2 Lab2_Q3_3

Finding Trigrams

Lab2_Q3_4 Lab2_Q3_5 Lab2_Q3_6 lab2_rema Lab2_remainin lab2_rem2
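
The summarization steps can be sketched in plain Python as below. This is a simplified stand-in, not the lab's code: it uses regular expressions instead of a library tokenizer, and the `lemmatize` parameter is a hypothetical hook that defaults to lowercasing only (a real lemmatizer, such as one from NLTK, could be passed in its place). Trigrams are built per sentence, which is an assumption that makes it easy to map a trigram back to the sentences containing it.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence, lemmatize=str.lower):
    # Alphabetic word tokens; `lemmatize` is a placeholder for a real lemmatizer.
    return [lemmatize(w) for w in re.findall(r"[A-Za-z']+", sentence)]


def trigrams(tokens):
    # Every run of three consecutive tokens.
    return list(zip(tokens, tokens[1:], tokens[2:]))


def summarize(text, top_n=10, lemmatize=str.lower):
    sentences = split_sentences(text)
    per_sentence = [trigrams(tokenize(s, lemmatize)) for s in sentences]

    # Count trigrams over the whole text and keep the most repeated ones.
    counts = Counter(t for grams in per_sentence for t in grams)
    top = counts.most_common(top_n)
    top_set = {t for t, _ in top}

    # Keep, in original order, every sentence containing a top trigram.
    chosen = [s for s, grams in zip(sentences, per_sentence)
              if top_set.intersection(grams)]
    return top, " ".join(chosen)


# Made-up sample text; the lab reads its text from the input file instead.
sample = ("The quick brown fox jumps. A dog sleeps here. "
          "The quick brown fox runs.")
top_trigrams, summary = summarize(sample, top_n=2)
print(top_trigrams)
print(summary)
```

To run this on a file, the text could be read with `open(path, encoding="utf-8").read()` and passed to `summarize`, where `path` is wherever the input file was saved.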

Question-4

Create Multiple Regression by choosing a dataset of your choice (again before evaluating, clean the data set with the EDA learned in the class). Evaluate the model using RMSE and R2 and also report if you saw any improvement before and after the EDA.

### Solution:
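
As an illustration of the evaluation step, the sketch below fits a multiple regression and computes RMSE and R2 by hand. It is not the lab's solution: the data is synthetic, the 80/20 split is an assumption, and NumPy's least-squares solver is swapped in for scikit-learn's `LinearRegression` so that the two metric formulas are visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: three numeric features and a noisy linear target.
X = rng.normal(size=(200, 3))
true_coef = np.array([3.0, -2.0, 0.5])
y = X @ true_coef + 4.0 + rng.normal(scale=0.1, size=200)

# Hold out the last 20% of the rows for evaluation (assumed split).
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


def add_intercept(features):
    # Prepend a column of ones so the first weight acts as the intercept.
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares: find weights minimising the squared residuals.
weights, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
y_pred = add_intercept(X_test) @ weights

# RMSE: square root of the mean squared prediction error.
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

# R2: one minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse:.3f}, R2 = {r2:.3f}")
```

To compare results before and after EDA, the same two metrics would be computed once on the raw features and once on the cleaned features, using the same held-out rows.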