Maximum Likelihood Estimation (MLE) is a parametric method, while K-Nearest-Neighbor (KNN) Estimation is a non-parametric method. That is, in the former we assume knowledge of the form of the underlying probability density function (PDF), e.g., Gaussian, exponential, or Poisson; no such assumption is available in the latter, non-parametric case.

Among the state-of-the-art techniques for parametric estimation are MLE and Bayesian Estimation. The main conceptual difference is that MLE treats the parameters as fixed but unknown values: they are obtained by solving a maximization problem, which amounts to placing a delta-function over the parameter space. Bayesian Estimation, in contrast, treats the parameters as random variables and applies *Bayesian Learning* to infer a distribution over them.
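To make the MLE side concrete, here is a minimal sketch in Python with NumPy, using synthetic data. For a univariate Gaussian the maximization has a well-known closed-form solution (the sample mean and the biased sample variance); the variable names and the chosen true parameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic training set drawn from a Gaussian with known parameters,
# so we can check the estimates against the truth.
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Maximizing the Gaussian log-likelihood in closed form gives:
mu_hat = samples.mean()                       # MLE of the mean
sigma2_hat = ((samples - mu_hat) ** 2).mean() # MLE of the variance (biased)

print(mu_hat, sigma2_hat)  # should land close to 5.0 and 4.0
```

Note that the MLE of the variance divides by n, not n - 1; it is the likelihood maximizer even though it is a biased estimator.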

On the other hand, we have the Histogram, Parzen Window, KNN, and Nearest-Neighbor estimators as examples of non-parametric methods. The feature space is divided into cells (bins), and each cell is assigned a probability based on the number of training samples falling inside it relative to the total number of training samples. In Histogram and Parzen Window estimation we fix the cell volume in advance; however, the resulting density estimate varies dramatically with the chosen volume. KNN avoids this by letting the cell volume vary: the window around a given test feature vector grows until it contains the k nearest neighbors. Hence our window grows or shrinks with the local density of the training set; we expect a large window where the density is low, and a small window with good resolution where it is high.
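A minimal 1-D sketch of the two ideas, assuming a standard normal training set (the bandwidth h and neighbor count k below are arbitrary illustrative choices): the fixed-volume estimate counts samples in a window of preset width, while the KNN estimate widens the window until it holds exactly k samples.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=5_000)  # standard normal training set
x = 0.0                                   # query point

# Fixed-volume (Parzen/histogram-style) estimate: p(x) ~ count / (n * h)
h = 0.2
p_fixed = np.sum(np.abs(train - x) < h / 2) / (len(train) * h)

# KNN estimate: grow the window until it contains k neighbors,
# then p(x) ~ k / (n * V), where V is the window length in 1-D.
k = 100
dists = np.sort(np.abs(train - x))
V = 2 * dists[k - 1]  # just wide enough to hold the k nearest samples
p_knn = k / (len(train) * V)

print(p_fixed, p_knn)  # both should be near 1/sqrt(2*pi) ~ 0.399
```

At a point of high density like x = 0 the KNN window stays small; at a point in the tail it would stretch wide, which is exactly the adaptive behavior described above.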

Since KNN has to compute the distance between a given feature vector and every sample in the training set for each classification, it suffers from high complexity in terms of time and space. A more reasonable approach in terms of complexity is Nearest-Neighbor estimation. It uses *Voronoi Tessellation* to divide the feature space into cells: each cell consists of the points closer to one particular training sample than to any other, and is labelled with that sample's category. This saves us from recomputing all the distances every time we are given a test sample, but sacrifices some accuracy in return.
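The brute-force classifier described above can be sketched as follows; with k = 1 it reduces to Nearest-Neighbor classification, where the test point simply takes the label of the Voronoi cell it falls in. The toy data and function name are my own for illustration.

```python
import numpy as np

def knn_classify(train_X, train_y, x, k=1):
    """Classify x by majority vote among its k nearest training samples.

    With k=1 this is Nearest-Neighbor classification: x receives the
    label of the training point whose Voronoi cell contains it.
    """
    dists = np.linalg.norm(train_X - x, axis=1)  # distance to every sample
    nearest = np.argsort(dists)[:k]              # indices of k closest
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Two well-separated 2-D classes (toy data)
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
train_y = np.array([0, 0, 1, 1])

print(knn_classify(train_X, train_y, np.array([0.3, 0.1])))        # -> 0
print(knn_classify(train_X, train_y, np.array([4.8, 5.1]), k=3))   # -> 1
```

The per-query cost here is O(n) distance computations; precomputing a spatial structure over the Voronoi cells is what trades that cost away, at the price of the accuracy loss mentioned above.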

References:

- Pattern Classification – Richard O. Duda, Peter E. Hart, David G. Stork – John Wiley & Sons, Nov 9, 2012.
- Pattern Recognition and Application by Prof. P.K. Biswas, IIT Kharagpur (lecture series, YouTube).
- Pattern Recognition by Prof. P.S. Sastry, IISc Bangalore (lecture series, YouTube).