K-Nearest Neighbor (KNN) is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
The K-NN algorithm can be used for Regression as well as for Classification.
KNN does not make any assumptions about the underlying data distribution, hence it is called a non-parametric algorithm.
KNN does not learn a model from the training dataset; instead of training, KNN stores the dataset, and at the time of classification it performs its computation on the stored data. Hence KNN is called a lazy learner algorithm.
Suppose we have two classes, a red class and a green class, each having 5 points, and a black point is our new point which we want to classify.
First, we find the 3 nearest neighbors of the black point. Out of these 3 neighbors, the majority (two) belong to the green class, hence our new point is classified into the green class.
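To make this concrete, here is a minimal from-scratch sketch of that voting step; the red, green, and black coordinates below are made-up values for illustration.

```python
import math
from collections import Counter

# Made-up coordinates for illustration: 5 red points and 5 green points.
points = [
    ((1.0, 1.2), "red"), ((1.5, 0.8), "red"), ((2.6, 2.0), "red"),
    ((1.2, 1.8), "red"), ((0.8, 1.5), "red"),
    ((4.0, 4.2), "green"), ((4.5, 3.8), "green"), ((5.0, 4.0), "green"),
    ((4.2, 4.8), "green"), ((3.8, 4.5), "green"),
]
new_point = (2.8, 2.9)  # the black point we want to classify

def euclidean(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Sort all points by distance to the new point and keep the 3 nearest.
neighbors = sorted(points, key=lambda p: euclidean(p[0], new_point))[:3]

# Majority vote among the 3 nearest neighbors decides the class.
votes = Counter(label for _, label in neighbors)
print(votes.most_common(1)[0][0])  # -> green (2 of the 3 neighbors are green)
```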
The distance between two points A(X1, Y1) and B(X2, Y2) is calculated using the Euclidean distance or the Manhattan distance formula, which we have already studied in geometry: Euclidean distance = √((X2 − X1)² + (Y2 − Y1)²), and Manhattan distance = |X2 − X1| + |Y2 − Y1|.
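These formulas translate directly into code; here is a small sketch computing both distances for a pair of points.

```python
import math

def euclidean_distance(a, b):
    # sqrt((X2 - X1)^2 + (Y2 - Y1)^2)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan_distance(a, b):
    # |X2 - X1| + |Y2 - Y1|
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

A = (1, 2)
B = (4, 6)
print(euclidean_distance(A, B))  # 5.0
print(manhattan_distance(A, B))  # 7
```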
There is no particular way to determine the best value for K, so we find the best one by trial and error.
If the K value is very low, outliers affect the classification; we will see the effect of outliers at the end of this blog.
Large values of K are more robust to noise, but they can blur the boundary between classes and make each prediction slower.
We can select the K value by plotting a graph of error rate vs K value.
From the graph, select the K value where the error rate is minimum, i.e. accuracy is maximum.
The graph below shows the relation between error rate and K value.
From the above graph we can see that the mean error is minimum when the value of K is between 5 and 11. I would advise you to calculate the accuracy for each K value between 5 and 11 and then select the appropriate one.
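One way to produce such a graph is to loop over candidate K values and record the test error for each. Below is a sketch of that loop; the Iris dataset is used here only to keep the example runnable end to end.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

error_rate = []
k_values = range(1, 31)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Error rate = fraction of test points classified incorrectly.
    error_rate.append(1 - knn.score(X_test, y_test))

plt.plot(k_values, error_rate, marker="o")
plt.xlabel("K value")
plt.ylabel("Error rate")
plt.title("Error rate vs K value")
plt.show()
```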
The Iris dataset has 150 rows in total and contains 3 species: Iris-setosa, Iris-versicolor, and Iris-virginica. For easy calculation we consider only 15 rows, 5 rows of each species.
Below the above table, a new point is given which we want to classify using KNN.
First, we calculate the distance between the new point and each existing point using the Euclidean distance method.
Suppose K = 5; hence we select the 5 nearest neighbors of the new point. We can see in the above table that out of the 5 nearest neighbors, 4 belong to species Iris-versicolor, 1 belongs to species Iris-virginica, and 0 belong to species Iris-setosa.
The maximum number of nearest neighbors belong to species Iris-versicolor, hence our new point is classified into species Iris-versicolor.
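The same classification steps can be reproduced on the full Iris dataset with scikit-learn; here is a minimal sketch with K = 5, where the new point's measurements are made-up values for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# K = 5: each point is assigned the majority class of its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Classify a new point (sepal length, sepal width, petal length, petal width).
new_point = [[6.1, 2.9, 4.7, 1.4]]  # made-up measurements for illustration
print(iris.target_names[knn.predict(new_point)[0]])  # expected: versicolor
print("Test accuracy:", knn.score(X_test, y_test))
```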
The Advertising dataset contains 200 rows and 4 columns. The dataset provides information about the money spent on different media platforms and the total sales.
We can predict sales on the basis of the money spent on the different media platforms.
Below the above table, a new point is given; we want to predict the total sales on the basis of the money spent on each media platform.
First, we find the distance between the new point and all existing points using the Euclidean distance method.
Suppose our K is 5; hence ID21, ID49, ID93, ID134, and ID169 are the nearest neighbors of the new point, and the predicted sales value is the average of all 5 nearest neighbors, which is 18.78. We can cross-check it against the original dataset.
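For regression, scikit-learn's KNeighborsRegressor averages the targets of the K nearest neighbors in the same way. Here is a sketch assuming the Advertising data sits in a file named Advertising.csv with columns TV, radio, newspaper, and sales (the file and column names are assumptions).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Assumed file name and column names for the Advertising dataset.
df = pd.read_csv("Advertising.csv")
X = df[["TV", "radio", "newspaper"]]
y = df["sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# With K = 5, the prediction is the average sales of the 5 nearest neighbors.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict sales for a new advertising budget (made-up spends for illustration).
new_point = pd.DataFrame([[150.0, 25.0, 20.0]], columns=["TV", "radio", "newspaper"])
print("Predicted sales:", knn.predict(new_point)[0])
```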
KNN is used in many sectors like medical, entertainment, banking, etc.
The KNN algorithm is easy to understand and use, so many data scientists and beginners in Machine Learning use it regularly.
Diabetes Prediction
Nowadays many youngsters are facing diabetes. Diabetes is influenced by family history, health condition, age, and food habits. If you have a dataset containing independent variables like age, pregnancies, glucose, blood pressure, skin thickness, insulin, and body mass index, then we can easily predict whether a person is diabetic or not.
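Here is a sketch of such a classifier, assuming the data sits in a file named diabetes.csv with an Outcome column (the file and column names are assumptions, matching the common Pima Indians dataset layout). Feature scaling is included because KNN's distances are otherwise dominated by features with large ranges.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diabetes.csv")  # assumed file name
X = df.drop(columns="Outcome")    # age, pregnancies, glucose, blood pressure, etc.
y = df["Outcome"]                 # 1 = diabetic, 0 = not diabetic

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features so that each one contributes comparably to the distance.
scaler = StandardScaler().fit(X_train)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)
print("Test accuracy:", knn.score(scaler.transform(X_test), y_test))
```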
In the medical sector, the KNN algorithm is widely used. It is used to predict lung cancer. In this kind of prediction, the KNN algorithm is used as the classifier. The K-nearest neighbor is the easiest algorithm to apply here with high accuracy. Based on the previous history of the locality, age, and other conditions, we can predict whether a person has lung cancer or not.
If we search for a product on an online shopping website, the next time the system recommends products related to our last search; the KNN algorithm is used in such systems. Due to such recommendation systems, companies like Flipkart, Amazon, Netflix, and YouTube increase their revenue.
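A recommender of this kind can be sketched with scikit-learn's NearestNeighbors, which finds the items closest to the last-searched item in feature space; the product vectors below are made-up values for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up feature vectors for 6 products (e.g., price, rating, category scores).
items = np.array([
    [0.10, 0.90, 0.20],
    [0.20, 0.80, 0.10],
    [0.90, 0.10, 0.70],
    [0.80, 0.20, 0.90],
    [0.50, 0.50, 0.50],
    [0.15, 0.85, 0.25],
])

# Ask for 3 neighbors because the nearest one is the query item itself.
nn = NearestNeighbors(n_neighbors=3).fit(items)
distances, indices = nn.kneighbors(items[[0]])  # the last-searched item is index 0
print(indices[0][1:])  # recommend the two most similar items, excluding item 0
```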
Effect of Imbalanced Data
A classification dataset with skewed class proportions is called imbalanced. Classes that make up a large proportion of the dataset are called majority classes, and those that make up a smaller proportion are called minority classes.
In the above figure, the 20 red points are the majority class and the 4 green points are the minority class.
We can clearly see that the new point should have been classified into the green class, but only due to the imbalanced data is it classified into the red class.
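Here is a small sketch of this effect, using made-up points: with 20 red and only 4 green training points, plain majority voting with K = 9 must pick red, since at most 4 of the 9 neighbors can ever be green. Distance weighting is one common mitigation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Made-up imbalanced data: 20 red points near (0, 0), 4 green points near (3, 3).
red = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
green = rng.normal(loc=3.0, scale=0.5, size=(4, 2))
X = np.vstack([red, green])
y = ["red"] * 20 + ["green"] * 4

new_point = [[2.8, 2.9]]  # sits among the green points

# With K = 9, at least 5 neighbors must be red, so red always wins the vote.
plain = KNeighborsClassifier(n_neighbors=9).fit(X, y)
print(plain.predict(new_point))  # -> ['red']

# Distance weighting lets the few nearby green points outvote distant red ones.
weighted = KNeighborsClassifier(n_neighbors=9, weights="distance").fit(X, y)
print(weighted.predict(new_point))  # likely ['green']
```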
Effect of Outliers
Observations that lie an abnormal distance from the other values in a dataset are called outliers.
In the above figure, a small portion of the red class is placed away from the rest of the population; these points are outliers.
Suppose the K value is 5; then the new point has 3 red neighbors and 2 green neighbors. If the outliers were not present in the dataset, our new point would definitely have been classified into the green class.
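To see this numerically, here is a small sketch with made-up points in which 3 red outliers sit near the green cluster: with K = 5 the vote is 3 red to 2 green, while a larger K lets the rest of the green cluster outvote the outliers.

```python
import math
from collections import Counter

# Made-up data: a red cluster near (0, 0), a green cluster near (4.5, 4.5),
# plus 3 red outliers placed next to the green cluster.
red = [(0.0, 0.5), (0.5, 0.0), (0.3, 0.4), (0.6, 0.6), (0.2, 0.1),
       (3.9, 4.2), (4.1, 3.8), (4.0, 4.1)]        # the last three are outliers
green = [(4.5, 4.5), (4.6, 4.4), (4.4, 4.7), (4.8, 4.6), (4.7, 4.8)]
points = [(p, "red") for p in red] + [(p, "green") for p in green]

new_point = (4.1, 4.0)

def knn_vote(k):
    # Take the k points closest to the new point and return the majority label.
    neighbors = sorted(points, key=lambda p: math.dist(p[0], new_point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_vote(5))  # -> red: the 3 outliers dominate the 5 nearest neighbors
print(knn_vote(9))  # -> green: with a larger K the outliers are outvoted
```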