Machine Learning · Pattern Recognition

A Probabilistic Neural Network Classifier for the Iris Dataset using GNU Octave

Non-parametric Density Estimation

In practice, the underlying density from which the training samples are drawn rarely fits a unimodal PDF. The Parzen window method estimates the likelihood p(x|ωi), which can be used in a Bayesian classifier, while K-Nearest Neighbors (KNN) directly estimates the posterior P(ωi|x), which can be used as a decision function.

Density Estimation

The unknown density at a point x can be estimated using the following formula

p_{n}(x) = \frac{ k_{n} / n }{V_{n}}

where k_{n} is the number of samples falling in a region R_{n}, n is the total number of samples, V_{n} is the volume of the region, and p_{n}(x) is an estimate of the true density p(x).
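To make the formula concrete, here is a minimal sketch in plain Python (rather than Octave, to keep it self-contained); the sample data and region width are made up for illustration:

```python
# Density estimate p_n(x) = (k_n / n) / V_n for a 1-D region of width h
# centered at x. The sample data below is hypothetical.

def density_estimate(samples, x, h):
    n = len(samples)
    # k_n: number of samples falling inside the region [x - h/2, x + h/2]
    k_n = sum(1 for s in samples if abs(s - x) <= h / 2)
    V_n = h  # in 1-D, the "volume" of the region is just its width
    return (k_n / n) / V_n

samples = [0.1, 0.2, 0.25, 0.3, 0.9]
print(density_estimate(samples, 0.25, 0.2))  # ~3.0: 3 of 5 samples in [0.15, 0.35]
```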

Skipping the mathematical derivation of the above equation, it can be shown both analytically and empirically that the estimate converges to the true density as the number of training samples approaches infinity and the volume of the region approaches zero.

Parzen Windows start with an initial region that shrinks as n grows, while KNN fixes the number of samples k and grows the region V until it encloses the k nearest neighbors of x.
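To make the contrast concrete, here is a minimal 1-D sketch of the KNN-style estimate in plain Python (the sample data is made up): k is fixed, and the region grows until it reaches the k-th nearest neighbor:

```python
# k-NN density estimate in 1-D: fix k, grow the region around x until it
# encloses the k nearest samples, then apply p_n(x) = (k / n) / V_n.
# Sample data is hypothetical.

def knn_density_estimate(samples, x, k):
    n = len(samples)
    distances = sorted(abs(s - x) for s in samples)
    r = distances[k - 1]   # radius that just reaches the k-th nearest neighbor
    V_n = 2 * r            # 1-D "volume": an interval of half-width r
    return (k / n) / V_n

samples = [0.1, 0.2, 0.25, 0.3, 0.9]
print(knn_density_estimate(samples, 0.25, 3))  # dense near 0.25 -> small V_n
```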

[Figure: visual comparison between Parzen Windows and KNN]

If you’re interested in pursuing these ideas further, I’d recommend these two video lectures.

Parzen Windows

The method uses a window (kernel) function \phi(u), such as a hypercube or a univariate normal (Gaussian) density

\phi(u) = \begin{cases}  1 & |u_{j}| \leq \frac{1}{2} \quad \forall j = 1, \dots, d \\  0 & \text{otherwise}  \end{cases}

i.e. \phi(u) = 1 when x falls in the region R_{n}, a hypercube centered at x_{i}. The number of samples inside the region is then

k_{n} = \sum_{i=1}^{n} \phi(\frac{x-x_{i}}{h_{n}})

where h_{n} is the edge of a hypercube whose volume is V_{n} = h_{n}^{d}, and hence p_{n}(x) can be estimated using

p_{n}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_{n}} \phi(\frac{x-x_{i}}{h_{n}})
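To see the mechanics, here is a minimal sketch of this estimate with the hypercube kernel, written in plain Python for self-containment; the 2-D sample points and window edge are hypothetical:

```python
# Parzen window estimate with a hypercube kernel:
# p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i) / h_n),  with V_n = h_n^d,
# where phi(u) = 1 iff |u_j| <= 1/2 for every coordinate j.

def phi_hypercube(u):
    return 1.0 if all(abs(uj) <= 0.5 for uj in u) else 0.0

def parzen_estimate(samples, x, h):
    n = len(samples)
    d = len(x)
    V = h ** d
    total = sum(phi_hypercube([(xj - sj) / h for xj, sj in zip(x, s)])
                for s in samples)
    return total / (n * V)

# Hypothetical 2-D samples; hypercube of edge h = 1 centered at the query point
samples = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.0)]
print(parzen_estimate(samples, (0.0, 0.0), 1.0))  # 2 of 3 samples fall inside
```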

If the window function is a Gaussian with zero mean,

\phi(u) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( \frac{-u^{2}}{2\sigma^{2}} \right)

In this case, the density estimate p_{n}(x) is an average of normal densities centered at the samples

p_{n}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_{n}} \phi(\frac{x-x_{i}}{h_{n}})

where h_{n} is a parameter that controls the width of the Gaussian window; it can be set by the user or estimated using a jackknife technique.
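A corresponding 1-D sketch with the Gaussian window (again plain Python; the data points and bandwidths are illustrative) shows how the estimate averages kernels and how h controls the smoothing:

```python
import math

# Gaussian Parzen window estimate in 1-D:
# p_n(x) = (1/n) * sum_i (1/h) * phi((x - x_i) / h),
# with phi a standard normal density (zero mean, sigma = 1).

def phi_gaussian(u):
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def parzen_gaussian(samples, x, h):
    n = len(samples)
    return sum(phi_gaussian((x - xi) / h) for xi in samples) / (n * h)

samples = [0.1, 0.2, 0.25, 0.3, 0.9]
for h in (0.05, 0.2, 1.0):  # small h -> spiky estimate; large h -> over-smoothed
    print(h, parzen_gaussian(samples, 0.25, h))
```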

The window size affects the estimate: a large window produces a low-resolution, smoothed, out-of-focus estimate, while a small window produces an erratic, spiky estimate with high statistical variability. The Parzen window estimate converges when the mean of the estimates p_{n}(x) approaches the unknown density p(x) and their variance approaches zero which, as explained earlier, requires that n approach infinity and V approach zero. The following figure [2] demonstrates this effect

[Figure: Parzen window size and number of samples]

Parzen Windows provide a generic framework that is independent of the shape of the unknown density, unimodal or multimodal. However, the method suffers from a computational complexity that grows with the number of samples and the dimensionality of the feature space – the curse of dimensionality. The calculations must also be repeated from scratch for every test sample. These disadvantages of Parzen Windows lead us to the next subject.

Probabilistic Neural Networks (PNNs)

A solution to the above-mentioned drawbacks is to recast the computation as a neural network, trading time complexity for space complexity. PNNs also parallelize naturally across multiple cores, and new training samples can be incorporated online without restarting the training phase from scratch.

[Figure: structure of a probabilistic neural network (PNN)]

We’ll use Octave and the Iris dataset to demonstrate the nuts and bolts of PNNs. We start by defining some constants. Our net consists of three layers: input, pattern, and category. The input layer has d nodes corresponding to the number of dimensions of our feature space; in the case of the Iris dataset, we have four dimensions: petal width, petal length, sepal width, and sepal length. The number of nodes in the middle (pattern) layer corresponds to the number of training samples n. Finally, there are three nodes in the category layer corresponding to the three classes: Iris Setosa, Iris Versicolor, and Iris Virginica. Since the PNN is based on a Parzen window with a normal density, a parameter sigma \sigma is used to control the window width.

  ## Constants
  num_of_samples_per_class = 50;
  num_of_training_samples = 5;
  num_of_testing_samples = 45;

  d = 4; ## number of features
  c = 3; ## number of classes
  ## number of patterns nodes = number of training samples per class * number of classes
  n = num_of_training_samples * c;
  sigma = 1;

Next, we connect the layers of our network. The connections between the input and pattern layers form a fully connected set of weights: from each input node there are weighted links w_{ij} \forall i = 1 ... d, j = 1 ... n to each pattern node, so the weights can be represented by an (n x d) matrix w. On the other hand, the connections between the pattern and category layers are sparse, because each pattern node is connected only to the category node of the class it belongs to; these connections are represented by an (n x c) matrix denoted a. We also allocate memory for the inner products net_{k} computed at the pattern nodes; a non-linear function of net_{k} is what each pattern node emits toward the category layer.

  ## Matrices
  w = zeros(n, d); ## weights between input layer and patterns layer
  a = zeros(n, c); ## connections between patterns layer and output categories layer
  net_k = zeros(n, 1); ## inner product of test sample x and weights matrix w w'x

This is how we read the Iris dataset from the file;

  ## Read Iris dataset
  ## read the file into four numeric columns, skipping the class label text
  [x_1, x_2, x_3, x_4] = textread("Iris Data.txt", "%f %f %f %f %*s", "delimiter", ",");
  ## merge the columns into 150x4 Matrix
  Iris_Dataset = [x_1 x_2 x_3 x_4];

We also use a generic formula to allow us to easily change the number of training samples to experiment with the accuracy of our classifier;

  ## Distill training set
  ## separate the first num_of_training_samples rows of each class for training
  ## use constants for generic coding
  w1 = Iris_Dataset(1:num_of_training_samples, : );
  from = (1+num_of_samples_per_class);
  to = from + num_of_training_samples - 1;
  w2 = Iris_Dataset(from:to, : );
  from = (1 + num_of_samples_per_class * 2);
  to = from + num_of_training_samples - 1;
  w3 = Iris_Dataset(from:to, : );
  training_set = [w1; w2; w3];

Our algorithm consists of two steps: a training step and a testing step. In the training step, we place each sample at the input nodes and set the weights of the connections between the feature nodes and the corresponding pattern node. That is, for each sample j, the weights of the links between all the input nodes and the jth pattern node are set to the normalized values of the corresponding elements of the feature vector. The normalization is achieved by dividing each feature element by

\sqrt{ \sum_{i=1}^{d} x_{i}^{2} }

    ## calculate the normalization factor for each row in the training set
    ## sqrt(sum(x[j][i]^2)), i = 1 ... d, for each j = 1 ... n
    norms = ones(n, 1);
    for j = 1 : n
      norms(j, 1) = sqrt(sum(power(training_set(j, : ), 2)));
    endfor

Each link is assigned a weight w_{j,k} \leftarrow x_{j,k} \forall j = 1 ... n, k = 1 ... d, and the link between each pattern node and its corresponding category node is set to 1 if and only if sample j belongs to that category.

    for j = 1 : n
      for k = 1 : d
        Xjk = training_set(j, k) / norms(j, 1); ## normalize
        w(j, k) = Xjk; ## train
      endfor
      ## construct the sparse connections matrix [n x c]
      ## a[j][i] = 1 iff pattern node j belongs to class i
      ## the training set is partitioned into c segments,
      ## which determines the connections
      ## from patterns to categories
      category_index = floor((j - 1) / num_of_training_samples) + 1;
      a(j, category_index) = 1;
    endfor

We extract our testing set using a generic formula;

  ## extract each class's testing set (num_of_testing_samples x 4)
  ## use constants for generic coding
  w1 = Iris_Dataset(num_of_training_samples+1:num_of_samples_per_class, : ); ## (6:50, : )
  from = (1+num_of_samples_per_class+num_of_training_samples); ## 56
  to = from + num_of_testing_samples - 1; ## 100
  w2 = Iris_Dataset(from:to, : );
  from = (1+2*num_of_samples_per_class+num_of_training_samples); ## 106
  to = from + num_of_testing_samples - 1; ## 150
  w3 = Iris_Dataset(from:to, : );

The logic of the testing phase is encapsulated in a separate .m file. The net activation is a non-linear function of net_{k} [2]

\phi\left(\frac{x - w_{k}}{h_{n}}\right) = \exp\left( \frac{net_{k} - 1}{\sigma^{2}} \right)

where net_{k} is the inner product calculated at each pattern node

net_{k} = w_{k}^{t} x

The net activation is added to a category’s output only if a link exists between the pattern node and that category node. We assign the test sample x to the category node with the maximum output value, which very much resembles our discriminant function classifier in a previous post. The net activation here is interpreted as a Parzen window centered at the pattern node; each pattern node contributes its probability to its category node, where all the signals are summed.
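It is worth spelling out why the Gaussian window reduces to this exponential of the inner product. Since the weight vectors are normalized to unit length (and assuming the test sample x is normalized the same way), the squared distance between x and w_{k} collapses to a function of net_{k} alone:

```latex
\| x - w_{k} \|^{2} = \|x\|^{2} - 2\,w_{k}^{t}x + \|w_{k}\|^{2} = 2 - 2\,net_{k}
```

so a Gaussian window centered at w_{k} gives \exp\left(\frac{-\|x - w_{k}\|^{2}}{2\sigma^{2}}\right) = \exp\left(\frac{net_{k} - 1}{\sigma^{2}}\right), up to a constant factor that does not affect which category attains the maximum.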

  function [confusion_matrix] = Test (confusion_matrix, test_matrix, num_patterns, weights, num_classes, a, net_k, variance, true_class)
    for m = 1 : size(test_matrix, 1) %% loop all test cases
      output = zeros(num_classes, 1); ## discriminant function calculated at each output node
      x = transpose(test_matrix(m, :));
      x = x / sqrt(sum(power(x, 2))); %% normalize the test sample as in training
      for j = 1 : num_patterns %% loop all pattern nodes
        net_k(j, 1) = weights(j, :) * x; %% calculate net activation
        for i = 1 : num_classes %% loop all classes
          if a(j, i) == 1 %% if pattern node is connected to category node
            output(i, 1) = output(i, 1) + exp((net_k(j, 1) - 1) / variance);
          endif
        endfor; %% i
      endfor; %% j

      %% assign test sample to category of max output
      max_val = -1;
      class = -1;
      for i = 1 : num_classes %% loop all category nodes
        if output(i, 1) > max_val
          class = i;
          max_val = output(i, 1);
        endif
      endfor; %% i
      confusion_matrix(class, true_class) = confusion_matrix(class, true_class) + 1;
    endfor; %% m
  endfunction

Finally, we call the test function for each class and accumulate the confusion matrix, whose diagonal corresponds to the samples classified correctly according to the ground-truth state of nature. The diagonal is then used to calculate the overall accuracy of our classifier.

  ## calculate net_k for each pattern from j = 1 ... n
  variance = power(sigma, 2);
  confusion_matrix = zeros(c, c);
  [confusion_matrix] = Test (confusion_matrix, w1, n, w, c, a, net_k, variance, 1);
  [confusion_matrix] = Test (confusion_matrix, w2, n, w, c, a, net_k, variance, 2);
  [confusion_matrix] = Test (confusion_matrix, w3, n, w, c, a, net_k, variance, 3);
  overall_accuracy = sum(diag(confusion_matrix)) / (num_of_testing_samples*c)

We experimented with different numbers of training samples and different sigma \sigma values. We wrapped our testing section in a for loop over 20 values of sigma \sigma and stored the overall accuracy for each;

  overall_accuracy = zeros(20, 1);
  for r = 1 : 20
    ## the width of the effective Gaussian window
    sigma = r / 10;
    variance = power(sigma, 2);
    confusion_matrix = zeros(c, c);
    [confusion_matrix] = Test (confusion_matrix, w1, n, w, c, a, net_k, variance, 1);
    [confusion_matrix] = Test (confusion_matrix, w2, n, w, c, a, net_k, variance, 2);
    [confusion_matrix] = Test (confusion_matrix, w3, n, w, c, a, net_k, variance, 3);
    overall_accuracy(r, 1) = sum(diag(confusion_matrix)) / (num_of_testing_samples*c);
  endfor

For example, with num_of_training_samples = 5 and \sigma = 1, the confusion matrix is
Confusion Matrix = \left[ \begin{array}{ccc}  45 & 0 & 0 \\  0 & 42 & 6 \\  0 & 3 & 39  \end{array} \right]

and the overall accuracy = 0.93333
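As a quick sanity check in plain Python, the overall accuracy follows directly from the diagonal of the confusion matrix above:

```python
# Overall accuracy = correctly classified samples / total test samples,
# using the example confusion matrix (columns are the true classes).
confusion_matrix = [
    [45,  0,  0],
    [ 0, 42,  6],
    [ 0,  3, 39],
]
correct = sum(confusion_matrix[i][i] for i in range(3))  # 45 + 42 + 39 = 126
total = sum(sum(row) for row in confusion_matrix)        # 3 * 45 = 135
print(correct / total)  # 126 / 135 = 0.9333...
```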

From the figure, we can tell that the highest accuracy was achieved for \sigma = 0.2, 0.3, and 0.4.


The complete source code is available [1] for you to experiment with different values and compare the accuracy of the classifier.


[1] Source Code
[2] Pattern Classification – Richard O. Duda, Peter E. Hart, David G. Stork – John Wiley & Sons, Nov 9, 2012.

