When we increase the amount of training data in the KNN algorithm, why does the error rate … by Mustafa Qamar-ud-Din
Answer by Mustafa Qamar-ud-Din:
This holds for KNN in particular, and more training data tends to help most other machine learning algorithms too. A common saying is that the algorithm with more data wins. Simply put, when you provide more data, the sample space is covered at a finer granularity. That is, the distance from a new point to its nearest neighbors, whether measured with the Euclidean (L2) or the Manhattan (L1) norm, becomes much smaller, so those neighbors are more likely to share the new point's label, and that reduces errors.
For example, let's say we have only two samples, labelled A and B, and the distance between them is 10 units. When we provide more samples, we get A, B, C, and D, with neighboring samples about 0.5 units apart. A new point now lands much closer to a labelled sample, so its prediction is much more accurate. Isn't it?
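To make the shrinking-distance argument concrete, here is a minimal sketch using only the Python standard library. The function name `mean_nearest_neighbor_distance`, the uniform data in the unit square, and the sample sizes are all assumptions made for illustration; they are not part of the original answer.

```python
import math
import random

def mean_nearest_neighbor_distance(n_train, n_query=200, seed=0):
    """Average Euclidean distance from random query points
    to their nearest training point (hypothetical helper)."""
    rng = random.Random(seed)
    # Training and query points are drawn uniformly from the unit square.
    train = [(rng.random(), rng.random()) for _ in range(n_train)]
    queries = [(rng.random(), rng.random()) for _ in range(n_query)]
    total = 0.0
    for q in queries:
        total += min(math.dist(q, p) for p in train)
    return total / n_query

sizes = [10, 100, 1000]
distances = [mean_nearest_neighbor_distance(n) for n in sizes]
for n, d in zip(sizes, distances):
    print(f"{n:>5} training points -> mean nearest-neighbor distance {d:.3f}")
```

On this toy data the average distance to the nearest training point drops as the training set grows, which is the effect described above. In high-dimensional data the drop is much slower, so the benefit of extra samples depends on the problem.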
If you think a little about the word "outlier", you will see why more samples are needed for training. An outlier is a sample that lies very far away from every other sample in the feature space.
If you add more samples, you are more likely to fill the gaps, so fewer samples end up isolated as outliers.
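However, you should be careful not to fall into the over-fitting trap, which often shows up as 100% accuracy on the training data. With k = 1, for instance, every training point is its own nearest neighbor, so training accuracy is trivially perfect and tells you nothing about unseen data.
Cross-validation, most commonly k-fold cross-validation, is the usual technique for handling this dilemma, because it measures accuracy on samples that were held out from training.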
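The gap between training accuracy and held-out accuracy can be sketched as follows, again with the standard library only. The helpers `knn_predict`, `accuracy`, and `k_fold_accuracy`, the two overlapping Gaussian classes, and the choice of 5 folds are assumptions for illustration, not the author's method. Note that the k in KNN (number of neighbors) is unrelated to the number of folds.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training points closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test points whose predicted label is correct."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def k_fold_accuracy(data, k, n_folds=5):
    """Average accuracy on each fold when it is held out of training."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        scores.append(accuracy(train, held_out, k))
    return sum(scores) / n_folds

# Toy data: two overlapping 2-D Gaussian classes (an assumption).
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "A") for _ in range(100)]
data += [((rng.gauss(1.5, 1), rng.gauss(1.5, 1)), "B") for _ in range(100)]
rng.shuffle(data)

train_acc_k1 = accuracy(data, data, k=1)   # evaluated on the training data itself
cv_acc_k1 = k_fold_accuracy(data, k=1)     # evaluated on held-out folds
cv_acc_k15 = k_fold_accuracy(data, k=15)

print(f"k=1  training accuracy:     {train_acc_k1:.2f}")
print(f"k=1  5-fold CV accuracy:    {cv_acc_k1:.2f}")
print(f"k=15 5-fold CV accuracy:    {cv_acc_k15:.2f}")
```

The training accuracy for k = 1 is perfect because each point finds itself as its nearest neighbor, while the cross-validated score is the more honest estimate. Comparing cross-validated scores across several values of k is one common way to pick k.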