
When we increase the amount of training data in the KNN algorithm, why does the error rate reduce?


Answer by Mustafa Qamar-ud-Din:

This is true not only for KNN, but for machine learning algorithms in general; a common saying is that the algorithm with more data wins. Simply put, when you provide more data, the granularity of the sample space becomes more fine-grained: the distances between neighboring samples, whether Euclidean (the L2 norm) or Manhattan (the L1 norm), shrink to a much smaller scale, which helps prevent errors.

For example, suppose we have two samples labelled A and B, and the distance between them is 10 units. When we provide more samples, we get A, B, C, and D with distances of about 0.5 units between neighbors, so a query point's nearest neighbor says much more about its label. That is far more accurate, isn't it?
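One way to check the original question empirically is to watch the test error fall as the training set grows. The snippet below is a minimal sketch, assuming scikit-learn and a synthetic dataset; the dataset, subset sizes, and the choice of k = 5 are illustrative assumptions, not part of the original answer:

```python
# Sketch: test error of KNN vs. training-set size (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on progressively larger training subsets; the held-out error
# typically decreases as more training data becomes available.
for n in (50, 200, 1000, 4000):
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train[:n], y_train[:n])
    error = 1 - knn.score(X_test, y_test)  # misclassification rate
    print(f"train size = {n:5d}  test error = {error:.3f}")
```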

Outliers:

If you contemplate the word "outlier" for a moment, you will see why more samples are needed for training. An outlier is a sample that lies very far from every other sample in the feature space (picture it on a Cartesian coordinate system).

If you add more samples, you are more likely to fill those gaps and reduce the occurrence of outliers.
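The gap-filling effect is easy to measure. This small sketch (pure NumPy; the uniform sampling over the unit square is an illustrative assumption) tracks how far a fixed query point sits from its nearest sample as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
query = np.array([0.5, 0.5])  # a fixed point in the unit square

# As n grows, the nearest sample to the query gets steadily closer,
# i.e. the "gaps" that leave points isolated fill in.
for n in (10, 100, 1000, 10000):
    samples = rng.uniform(0, 1, size=(n, 2))
    nearest = np.min(np.linalg.norm(samples - query, axis=1))
    print(f"n = {n:5d}  nearest-neighbor distance = {nearest:.4f}")
```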

Over-fitting:

However, you should be careful not to fall into the over-fitting trap, which shows up as getting (near) 100% accuracy on the training set while generalizing poorly to unseen data.
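For KNN specifically, k = 1 is the classic way to fall into that trap: every training point is its own nearest neighbor, so training accuracy is 100% by construction, even when the model generalizes poorly. A minimal sketch, again assuming scikit-learn and a synthetic dataset with deliberately noisy labels (flip_y is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# flip_y=0.2 injects label noise, so memorizing the training set is harmful.
X, y = make_classification(n_samples=1000, n_features=10,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", knn.score(X_train, y_train))  # 1.0: each point matches itself
print("test accuracy: ", knn.score(X_test, y_test))    # typically much lower
```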

Cross Validation:

Cross-validation, most commonly k-fold cross-validation, is the standard technique for handling this dilemma.
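In practice, that means choosing the number of neighbors by cross-validated accuracy rather than training accuracy. A minimal sketch, assuming scikit-learn and the same hypothetical noisy dataset as above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10,
                           flip_y=0.2, random_state=0)

# 5-fold cross-validation: k = 1 memorizes the noise, while a moderate k
# usually scores best on held-out folds.
for k in (1, 3, 5, 11, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k = {k:2d}  mean 5-fold CV accuracy = {scores.mean():.3f}")
```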

