K-Nearest Neighbor (KNN) Algorithm

By Yashodhan Omase

Introduction

  • K-Nearest Neighbor (KNN) is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
  • The KNN algorithm can be used for regression as well as for classification.
  • KNN makes no assumption about the underlying data distribution, hence KNN is called a non-parametric algorithm.
  • KNN does not learn from the training dataset; instead of training, KNN stores the dataset, and at the time of classification it performs its computation on the stored data. Hence KNN is called a lazy learner algorithm.
  • Suppose we have two classes, a red class and a green class, each having 5 points, and a black point is our new point which we want to classify.
  • First, we find the 3 nearest neighbors of the black point. Out of these 3 neighbors, the maximum (two) belong to the green class, hence our new point is classified into the green class.
  • The distance between two points A(X1, Y1) and B(X2, Y2) is calculated using the Euclidean distance or the Manhattan distance formula, which we have already studied in geometry.
  • Euclidean distance – D = √((X1 − X2)² + (Y1 − Y2)²)
  • Manhattan distance – D = |X1 − X2| + |Y1 − Y2|
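The two distance formulas above can be sketched directly in Python; the points A and B below are made-up examples, not taken from any dataset.

```python
import math

def euclidean(a, b):
    # D = sqrt((X1 - X2)^2 + (Y1 - Y2)^2)
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def manhattan(a, b):
    # D = |X1 - X2| + |Y1 - Y2|
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

A, B = (1, 2), (4, 6)
print(euclidean(A, B))  # 5.0 (a 3-4-5 right triangle)
print(manhattan(A, B))  # 7
```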
How to select the value of K?

  • There is no particular way to determine the best value of K, so we find the best one by trial and error.
  • If the K value is very low, outliers affect the classification; we will see the effect of outliers at the end of this blog.
  • Large values of K are good, but sometimes they may cause difficulties.
  • We can select the K value by plotting a graph of error rate vs. K value.
  • From the graph, select the K value where the error rate is minimum, i.e. the accuracy is maximum.
  • The graph below shows the relation between error rate and K value.
  • From the graph we can see that the mean error is minimum when the value of K is between 5 and 11. I would advise you to calculate the accuracy for K values between 5 and 11 and then select the appropriate one.
  • For the detailed Python code, please click here.
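The trial-and-error search for K can be sketched as a leave-one-out error rate computed for several K values; the two-cluster points below are made up for illustration, so substitute your own data.

```python
import math
from collections import Counter

# made-up two-class points: a red cluster and a green cluster
data = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"), ((2, 2), "red"),
        ((1.5, 1.5), "red"), ((6, 6), "green"), ((6, 7), "green"),
        ((7, 6), "green"), ((7, 7), "green"), ((6.5, 6.5), "green")]

def knn_predict(train, point, k):
    # majority vote among the k nearest neighbors (Euclidean distance)
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_error(k):
    # leave each point out, predict it from the rest, count mistakes
    wrong = sum(knn_predict(data[:i] + data[i + 1:], p, k) != label
                for i, (p, label) in enumerate(data))
    return wrong / len(data)

for k in (1, 3, 5, 7):
    print(k, loo_error(k))
```

Plot k against loo_error(k) (for example with matplotlib) and pick the K with the lowest error.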
KNN For Classification

  • We will understand the maths behind KNN for classification by using the Iris dataset.
  • For the detailed Python code, please click here.
  • The Iris dataset has a total of 150 rows containing 3 species: Iris-setosa, Iris-versicolor and Iris-virginica. For easy calculation we consider only 15 rows, 5 rows of each species.
  • Below the table above, a new point is given which we want to classify using KNN.
  • First, we calculate the distance between the new point and each existing point by using the Euclidean distance method.
  • Suppose K = 5; hence we select the 5 nearest neighbors of the new point. As we can see in the table above, out of the 5 nearest neighbors, 4 belong to the species Iris-versicolor, 1 belongs to the species Iris-virginica and 0 belong to the species Iris-setosa.
  • The maximum number of nearest neighbors belong to the species Iris-versicolor, hence our new point is classified into the species Iris-versicolor.
  • We can verify this classification against the original dataset.
  • For the detailed Python code, please click here.
  • For the detailed calculation in Excel, please click here.
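The linked code is not reproduced here, but a typical scikit-learn version of this classification (assuming scikit-learn is installed; the new flower's measurements are a hypothetical example) looks like:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=5)  # K = 5, as in the worked example
knn.fit(iris.data, iris.target)

# hypothetical new flower: sepal length, sepal width, petal length, petal width (cm)
new_point = [[6.1, 2.8, 4.7, 1.2]]
print(iris.target_names[knn.predict(new_point)[0]])
```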
KNN For Regression

  • We will understand the maths behind KNN for regression by using the Advertising dataset.
  • To download the Advertising dataset, please click here.
  • The Advertising dataset contains 200 rows and 4 columns. The dataset provides information about money spent on different media platforms and the total sales.
  • We can predict sales on the basis of money spent on each media platform.
  • Below the table above, a new point is given; we want to predict the total sales on the basis of money spent on each media platform.
  • First, we find the distance between the new point and all existing points by using the Euclidean distance method.
  • Suppose our K is 5; hence ID21, ID49, ID93, ID134 and ID169 are the nearest neighbors of the new point, and the predicted sales value is the average of all 5 nearest neighbors, which is 18.78. We can cross-check it against the original dataset.
  • For the detailed Python code, please click here.
  • For the detailed calculation in Excel, please click here.
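The averaging step above can be sketched in plain Python. The rows below are a handful of illustrative advertising-style records ((TV, radio, newspaper) spend → sales), not the full 200-row dataset, so K = 3 is used here instead of 5.

```python
import math

# illustrative (TV, radio, newspaper) spend -> sales records
rows = [((230.1, 37.8, 69.2), 22.1), ((44.5, 39.3, 45.1), 10.4),
        ((17.2, 45.9, 69.3), 9.3), ((151.5, 41.3, 58.5), 18.5),
        ((180.8, 10.8, 58.4), 12.9), ((8.7, 48.9, 75.0), 7.2),
        ((57.5, 32.8, 23.5), 11.8)]

def knn_regress(train, point, k):
    # prediction = mean of the k nearest neighbors' target values
    nearest = sorted(train, key=lambda r: math.dist(r[0], point))[:k]
    return sum(y for _, y in nearest) / k

print(knn_regress(rows, (150.0, 40.0, 60.0), 3))
```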
Industrial Applications of the KNN Algorithm

  • KNN is used in many sectors like medicine, entertainment, banking, etc.
  • The KNN algorithm is easy to understand and use, so many data scientists and machine-learning beginners use it consistently.
Diabetes Prediction

  • Nowadays many youngsters are facing diabetes. Diabetes depends on family history, health condition, age and food habits. If you have a dataset containing independent variables like age, pregnancies, glucose, blood pressure, skin thickness, insulin and body mass index, then we can easily predict whether a person is diabetic or not.
  • For detail python code please click here.
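The linked code is not shown here, but a minimal sketch with scikit-learn (assumed installed) might look like the following; the six Pima-style rows and their labels are illustrative stand-ins for a real diabetes dataset. Because the columns are on very different scales, the features are standardized before fitting KNN.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# illustrative rows: [pregnancies, glucose, blood pressure, skin thickness, insulin, BMI]
X = [[6, 148, 72, 35, 0, 33.6],
     [1, 85, 66, 29, 0, 26.6],
     [8, 183, 64, 0, 0, 23.3],
     [1, 89, 66, 23, 94, 28.1],
     [0, 137, 40, 35, 168, 43.1],
     [5, 116, 74, 0, 0, 25.6]]
y = [1, 0, 1, 0, 1, 0]  # 1 = diabetic, 0 = not diabetic (illustrative labels)

# scale the features, then vote among the 3 nearest neighbors
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[2, 150, 70, 30, 100, 32.0]]))  # 0 or 1 for the new person
```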
Lung Cancer Prediction

  • In the medical sector, the KNN algorithm is widely used; for example, it is used to predict lung cancer. In this kind of prediction the KNN algorithm is used as the classifier, and it is the easiest algorithm to apply here with high accuracy. Based on a person's previous history of locality, age and other conditions, we can predict whether the person has lung cancer or not.
  • For detail python code please click here.
Recommendation System

  • If we search for a product on an online shopping website, then the next time the system recommends products related to our last search; the KNN algorithm is used in such systems. Due to these recommendation systems, companies like Flipkart, Amazon, Netflix and YouTube increase their revenue.
  • For detail python code please click here.
Advantages of KNN Algorithm

  • The KNN algorithm is very simple and easy to understand and implement.
  • There is no need to make additional assumptions, tune several parameters or build a model.
  • The KNN algorithm is used for classification problems, regression problems and search, i.e. it is versatile.
Disadvantages of KNN Algorithm

  • The speed of the KNN algorithm decreases as the number of rows or independent variables in the dataset increases.
  • I applied the KNN algorithm to two datasets: the first dataset contains 150 rows and the second dataset contains 120,000 rows.
  • The time required to execute the program on the first dataset: 0.014963150024414062 seconds.
  • The time required to execute the program on the second dataset: 3.635221242904663 seconds.
  • For the detailed Python code, please click here.
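The timing comparison can be reproduced in spirit by wrapping a brute-force prediction in time.time(); the random datasets below are smaller stand-ins (150 vs. 15,000 rows) so the sketch runs quickly, and absolute times will differ by machine.

```python
import math
import random
import time
from collections import Counter

random.seed(0)
small = [((random.random(), random.random()), random.randint(0, 1)) for _ in range(150)]
large = [((random.random(), random.random()), random.randint(0, 1)) for _ in range(15000)]

def knn_predict(train, point, k=5):
    # brute force: measure the distance to every stored row
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

for name, data in (("small", small), ("large", large)):
    start = time.time()
    knn_predict(data, (0.5, 0.5))
    print(name, time.time() - start)  # time grows with the number of rows
```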
Effect of Imbalanced Data

  • A classification dataset with skewed class proportions is called imbalanced. Classes making up a large proportion of the dataset are called majority classes; those making up a smaller proportion are called minority classes.
  • In the figure above, the 20 red points are the majority class and the 4 green points are the minority class.
  • We can clearly see that the new point should have been classified into the green class, but due to the imbalanced data the new point is classified into the red class.
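This flip can be demonstrated with made-up coordinates mimicking the figure: 20 red points away from the query and 4 green points right around it. With K = 5 the greens still outvote, but once K exceeds twice the minority class size, the majority class must win.

```python
import math
from collections import Counter

# made-up imbalanced data: 20 red points in two rows, 4 green points near the origin
red = [((2 + 0.5 * i, 3), "red") for i in range(10)] + \
      [((2 + 0.5 * i, 4), "red") for i in range(10)]
green = [((0, 0), "green"), ((0, 1), "green"), ((1, 0), "green"), ((1, 1), "green")]
data = red + green

def knn_predict(train, point, k):
    # majority vote among the k nearest neighbors
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

query = (0.5, 0.5)  # sits in the middle of the green cluster
print(knn_predict(data, query, 5))  # green (4 green vs. 1 red among the 5 nearest)
print(knn_predict(data, query, 9))  # red (only 4 greens exist, so 5 reds outvote them)
```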
Effect of Outliers

  • Observations that lie an abnormal distance from the other values in a dataset are called outliers.
  • In the figure above, a small portion of the red class is placed away from the rest of the population; these points are outliers.
  • Suppose the K value is 5; then the new point has 3 red neighbors and 2 green neighbors. If the outliers were not present in the dataset, our new point would definitely have been classified into the green class.
  • The new point is wrongly classified due to the outliers.
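The scenario above (K = 5, 3 red outliers vs. 2 green neighbors) can be reconstructed with made-up coordinates: without the outliers the query is surrounded by greens, and adding three stray red points flips the vote.

```python
import math
from collections import Counter

green = [((0, 0), "green"), ((0, 1), "green"), ((1, 0), "green"),
         ((1, 1), "green"), ((0.5, 1.5), "green")]
red_main = [((6, 6), "red"), ((6, 7), "red"), ((7, 6), "red"), ((7, 7), "red")]
red_outliers = [((1.4, 0.4), "red"), ((1.5, 0.6), "red"), ((1.6, 0.5), "red")]

def knn_predict(train, point, k):
    # majority vote among the k nearest neighbors
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

query = (1.2, 0.5)
print(knn_predict(green + red_main, query, 5))                 # green (no outliers present)
print(knn_predict(green + red_main + red_outliers, query, 5))  # red (3 outliers outvote 2 greens)
```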