## Non-parametric Density Estimation

In practice, the form of the underlying density distribution from which the training samples are drawn hardly fit a unimodal PDF. Parzen Window estimates the likelihood p(x|ω_{i}) which can be used in a Bayesian classifier. K-Nearest Neighbor (KNN) directly estimates the posterior P(ω_{i}|x) which can be used for a decision function.

## Density Estimation

An unknown density of a sample x can be estimated using the following formula

where is the number of samples falling in a region , n is the total number of samples, is the volume of this region, is an estimate of the probability p(x).

Regardless of the mathematical derivation of the above equation, it was found out both analytically and empirically that the above equation converges as the number of training samples approaches infinity and the volume of the region approaches zero.

Parzen Window starts with an initial region that shrinks as n grows, while KNN specifies k number of samples and the region V grows until it encloses k neighbors of x.

I’d recommend these two video lectures, if you’re interested in pursuing these ideas further;

## Parzen Windows

It uses a window / kernel ; such as a hypercube or a uni-variate normal Gaussian density

i.e. when x falls in a region ; a hypercube centered at

where is the edge of a hypercube whose volume is , and hence can be estimated using

If the window function is a Gaussian with zero mean,

In this case, the probability estimation is an average of normal densities centered at the samples

where is a parameter that controls the width of the Gaussian window, it could be set by the user or estimated using a Jackknife technique.

The window size affects our estimate, since a large window results in low resolution smoothed out-of-focus estimate, and a small window results in an erratic noisy spike-like estimate with high statistical variability. Hence, the Parzen window converges when the mean of all the estimates approaches the unknown density p(x) and the variance of these estimates approaches zero. This is as explained earlier under the constraints that n approaches infinity and V approaches zero. The following figure[2] demonstrates this effect

Parzen Windows provide a generic framework that is independent of the shape of the unknown density; uni-moal, or multi-modal. However, it suffers from computational complexity which is a function of the number of samples and the dimensionality of the feature space – *the curse of dimensionality*. The calculations have to be done all over again for each test sample. These disadvantages of Parzen Windows lead us to the next subject.

## Probabilistic Neural Networks (PNNs)

A solution to the above mentioned cons could be found by incorporating a neural network that trades time complexity with space complexity. Furthermore, PNNs allow us to parallelize our algorithm on multi cores. Surprisingly, it also augments new training samples online without having to restart the training phase from scratch.

We’ll use Octave and the Iris dataset to demonstrate the nuts and bolts of PNNs. We set off defining some constants. Since our net consists of three layers; input, pattern, and category. There are d nodes corresponding to the number of dimensions of our feature space. In the case of the Iris dataset, we have four dimensions; petal width, length, sepal width and length. The number of nodes in the middle layer correspond to the number of training samples n. There are finally three nodes in the category layer corresponding to the three classes; Iris Setosa, Iris Versicolor, and Iris Virginica. Since the PNN is based on Parzen Window of a Normal Density; a parameter sigma is used to control the window width.

## Constants num_of_samples_per_class = 50; num_of_training_samples = 5; num_of_testing_samples = 45; d = 4; ## number of features c = 3; ## number of classes ## number of patterns nodes = number of training samples per class * number of classes n = num_of_training_samples * c; sigma = 1;

Next, we connect the layers of our network. The layer between the input and the pattern layer is fully connected set of weights. That is from each input node, there are weighted links to each pattern node. Hence, the weights can be represented by an (n x d) matrix w. On the other hand, the connections between the pattern and the category layers are sparse, because each pattern node is connected only to the category node to which it belongs. The connections are represented with an (n x c) matrix denotes as a. We also allocate memory for the inner product or simply . It’s the non-linear output function emitted by each node in the category layer.

## Matrices w = zeros(n, d); ## weights between input layer and patterns layer a = zeros(n, c); ## connections between patterns layer and output categories layer net_k = zeros(n, 1); ## inner product of test sample x and weights matrix w w'x

This is how we read the Iris dataset from the file;

## Read Iris dataset ## read file into four columns to escape text` [x_1, x_2, x_3, x_4] = textread("Iris Data.txt", "%f %f %f %f %*s", "delimiter", ","); ## merge the columns into 150x4 Matrix Iris_Dataset = [x_1 x_2 x_3 x_4];

We also use a generic formula to allow us to easily change the number of training samples to experiment with the accuracy of our classifier;

## Distill training set ## separate each class into 20x4 Matrix for Training ## use constants for generic coding w1 = Iris_Dataset(1:num_of_training_samples, : ); from = (1+num_of_samples_per_class); to = from + num_of_training_samples - 1; w2 = Iris_Dataset(from:to, : ); from = (1 + num_of_samples_per_class * 2); to = from + num_of_training_samples - 1; w3 = Iris_Dataset(from:to, : ); training_set = [w1; w2; w3];

Our algorithm consists of two steps; a training step, and a testing step. In the training step, we place each sample at the input nodes, and set the weights of the connections between the features nodes and the corresponding pattern node. That is for each sample i, the weights of the links between all the input nodes and the i^{th} pattern node are set to the normalized value of the corresponding elements of the feature vector. The normalization is achieved by dividing each feature element by;

## calculate normalization factor for each row in training set ## Sqrt(Summation(x[j][i]^2)) i = 1 to d, j = 1 to n norms = ones(n, 1); for j = 1 : n norms(j, 1) = sqrt(sum(power(training_set(j, : ),2))); endfor;

Each link is assigned a weight and the link between each pattern and the corresponding category node is set to 1 if and only if sample j belongs to that category.

for j = 1 : n for k = 1 : d Xjk = training_set(j,k) / norms(j, 1); ## normalize w(j,k) = Xjk; ## train endfor; ## construct patterns matrix [ n x c ] ## a[j][i] = 1 iff pattern node j belongs to class i ## sparse matrix ## partitions training set to c segmens ## to help build the connections ## from patterns to categories category_index = floor((j-1) / num_of_training_samples) + 1; a(j, category_index) = 1; endfor;

We extract our testing set using a generic formula;

## extract testing set Matrix 30 x 4 ## use constants for generic coding w1 = Iris_Dataset(num_of_training_samples+1:num_of_samples_per_class, : ); ## (21:50, : ) from = (1+num_of_samples_per_class+num_of_training_samples); ## 71 to = from + num_of_testing_samples - 1; ## 100 w2 = Iris_Dataset(from:to, : ); from = (1+2*num_of_samples_per_class+num_of_training_samples); ## 121 to = from + num_of_testing_samples - 1; ## 150 w3 = Iris_Dataset(from:to, : );

The logic of the testing phase is encapsulated in .m file. The net activation is a non-linear function of [2]

where is the inner product calculated at each pattern node

Also, the net activation is added only if a link exists between a pattern and a category node. We assign the test sample x to the category node having the maximum output value. This very much resembles our discriminant function classifier in a previous post. The net activation here is interpreted as the Parzen Window estimation when the window is centered at the pattern node, and each pattern node contributes its probability to the category node where all signals are summed.

function [confusion_matrix] = Test ( confusion_matrix, test_matrix, num_patterns, weights, num_classes, a, net_k, variance, true_class ) for m = 1 : size(test_matrix, 1) %% loop all test cases output = zeros(num_classes, 1); ## discriminant function calculated at each output node x = transpose(test_matrix(m,:)); for j = 1 : num_patterns %% loop all patterns nodes net_k(j, 1) = weights(j,:)*x; %% calculate net activation for i = 1 : num_classes %% loop all classes if a(j, i) == 1 %% if pattern node is connected to category node output(i, 1) = output(i, 1) + exp( (net_k(j, 1) - 1) / variance); endif; endfor; %% i endfor; %% j %% assign test sample to category of max output max_val = -1; class = -1; for i = 1 : num_classes %% loop all category nodes if output(i, 1) > max_val class = i; max_val = output(i, 1); endif; endfor; %% i confusion_matrix(class, true_class) = confusion_matrix(class, true_class) + 1; endfor; %% m endfunction

Finally, we call the test function for each class, and calculate the confusion matrix. The diagonal of which corresponds to the correctly classified samples as per the ground truth state of nature. The diagonal is then used to calculate the overall accuracy of our classifier.

## calculate net_k for each pattern from j = 1 ... n variance = power(sigma, 2); confusion_matrix = zeros(c, c); [confusion_matrix] = Test (confusion_matrix, w1, n, w, c, a, net_k, variance, 1); [confusion_matrix] = Test (confusion_matrix, w2, n, w, c, a, net_k, variance, 2); [confusion_matrix] = Test (confusion_matrix, w3, n, w, c, a, net_k, variance, 3); confusion_matrix overall_accuracy = sum(diag(confusion_matrix)) / (num_of_testing_samples*c)

We have experimented different values for the number of samples in the training set, and different sigma values. We wrapped our testing section in a for loop for 20 different values of sigma , and stored the overall accuracy in each case;

overall_accuracy = zeros(10, 1); for r = 1: 20 ## the width of the effective gaussian window sigma = r / 10; variance = power(sigma, 2); confusion_matrix = zeros(c, c); [confusion_matrix] = Test (confusion_matrix, w1, n, w, c, a, net_k, variance, 1); [confusion_matrix] = Test (confusion_matrix, w2, n, w, c, a, net_k, variance, 2); [confusion_matrix] = Test (confusion_matrix, w3, n, w, c, a, net_k, variance, 3); confusion_matrix overall_accuracy(r, 1) = sum(diag(confusion_matrix)) / (num_of_testing_samples*c); endfor;

The confusion matrix is for example when n = 5 and ;

and the overall accuracy = 0.93333

From the figure, we can tell that the highest accuracy was achieved for values of = 0.2 , 0.3 and 0.4

The complete source code is available [1] for you to experiment different values, and compare the accuracy of the classifier.

**References:**

[1] Source Code https://github.com/thecortex/Octave/tree/master/Pattern%20Recognition/PNN

[2] Pattern Classification – Richard O. Duda, Peter E. Hart, David G. Stork – John Wiley & Sons, Nov 9, 2012.

[3] http://stackoverflow.com/questions/8025111/add-a-row-to-a-matrix-in-matlab

[4] http://www.mathworks.com/matlabcentral/answers/105090-how-to-calculate-the-sum-of-each-row-in-a-matrix

[5] http://www.wi.hs-wismar.de/~cleve/vorl/projects/dm/ss13/PNN/Quellen/CheungCannons_AnIntroductiontoPNNs.pdf

[6] https://www.projectrhea.org/rhea/index.php/Lecture_14_-_ANNs,_Non-parametric_Density_Estimation_(Parzen_Window)_OldKiwi