Pattern Recognition

# A Multi dependent Feature Bayesian Classifier with Discriminant Functions using GNU Octave

In this blog post, we continue with a variation of Bayesian Classifier. Here, we use multiple features – four – and they are all dependent. We also introduce two new concepts; discriminant functions and maximum likelihood estimation. We’re using the Iris dataset [4] which has 150 samples of three different flowers;Iris setosa, versicolor, and virginica. Each class has 50 samples, and each sample has four featres; sepal length, sepal width, petal length, petal width. We’ve chosen GNU Octave for two reasons; it’s an open source and it’s easier to demonstrate the concepts without so much code compared to C# – as in the previous posts.

The Normal Density

Due to its analytical tractability, the Gaussian density has received more attention than any other density function. Hence, we assume that the Iris dataset follows a multivariate normal or Gaussian density. The general formula is

p(x) = $\frac{1}{(2\Pi)^{d/2} |\Sigma|^{1/2}}$ exp[$\frac{-1}{2}$ (x-μ)t-1(x-μ)]

where x is a d-component column vector, μ is the d-component mean vector, ∑ is the d-by-d covariance matrix, and |∑| and ∑-1 are its determinant and inverse, respectively. This formula can be expressed in a parametric form as

p(x) ~ N(μ, ∑)

That is the multivariate normal or Gaussian density is completely specified by two parameters; its mean μ and covariance ∑.

Maximum-likelihood Estimation

Now that we assumed our underlying density is multivariate normal or Gaussian, we’ll make use of the samples to train our classifier. The goal of the training step is to estimate the parameters vector Θ = [μ , ∑]t. Maximum-likelihood estimation is  a parameter estimation method widely used for its convergence criteria and simplicity. It finds the parameter vector $\hat\theta$ that maximizes the log-likelihood of the parameter vector Θ with respect to the set of samples p(D|Θ). Try to convince yourself or just believe me that the maximum-likelihood estimates for μ and ∑ are given by

$\hat\mu$ = $\frac{1}{n}$ $\sum_{k=0}^n x_{k}$

and

$\hat\Sigma$ = $\frac{1}{n}$ $\sum_{k=0}^n$ $( x_{k} - \hat\mu ) ( x_{k} - \hat\mu )^{t}$

where n is the number of samples belonging to the same class. In this example, we used 20 samples from each species to train our classifier. Also note that $( x_{k} - \hat\mu ) ( x_{k} - \hat\mu )^{t}$ is the inner product or dot product.

Discriminant Functions

A pattern classifier can be viewed as a network with the testing sample placed at its inputs. The classifier then calculates the discriminant function gi(x) for each class. Finally, it assigns the test sample to the class corresponding to the largest value of the discriminant function – this is equivalent to the minimum conditional risk R(αi|x) and the maximum posterior probability p(ωi|x) as per a previous blog post.

Hence, the discriminant functions divide the feature space into regions separated by decision boundaries. The test sample is assigned to the decision region which belongs to the class that gives the largest value of the discriminant function. It can also be stated that the discriminant function calculates the Mahalanobis distance between the input feature vector x and the mean of each class, then assigns the test sample to the closest class.

The general discriminant function gi(x) for multivariate normal or Gaussian density where each class has a different covariance matrix is a quadratic linear function. Remember, this is the case at hand of the Iris data set under the normality assumption. Since the discriminant function gi(x) is quadratic, the resulting decision surfaces take any of the general forms of hyperquadrics; hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, … etc.

gi(x) = xtWix + witx + ωio

where

Wi = $\frac{-1}{2} \Sigma_{i}^{-1}$

wi = $\Sigma_{i}^{-1}\mu_{i}$

and

ωio = $\frac{-1}{2} \mu_{i}^{t} \Sigma_{i}^{-1} \mu_{i} - \frac{1}{2} \ln\Sigma_{i} + \ln P(\omega_{i})$

where subscript i in gi(x) denotes the discriminant function of class i ∀ i = 0 to c, c is the number of classes = 3 for Iris dataset, P( $\omega_{i}$ ) = $\frac{1}{3}$ denotes the prior probability, μ is the mean, ∑ is the covariance matrix, and x is the d-dimensional vector of the test sample.

GNU Octave

Now that the concept was laid out, we can go through the source code for our example. First, we set some global constants:

  ## constants
num_of_training_samples = 20;
num_of_testing_samples = 30;
num_of_samples_per_class = 50;
num_of_classes = 3;
num_of_features = 4;
priori = 1 / num_of_classes;


It’s relatively easier than C# to read the Iris dataset from file and divide the samples into three matrices. Also, note that we used generic formulas for division in order to allow experimenting with the number of training samples.

  ## read file into four columns to escape text
[x_1, x_2, x_3, x_4] = textread(&quot;Iris Data.txt&quot;, &quot;%f %f %f %f %*s&quot;, &quot;delimiter&quot;, &quot;,&quot;);
## merge the columns into 150x4 Matrix
Iris_Dataset = [x_1 x_2 x_3 x_4];
## separate each class into 20x4 Matrix for Training
## use constants for generic coding
w1 = Iris_Dataset(1:num_of_training_samples, :);
from = (1+num_of_samples_per_class);
to = from + num_of_training_samples - 1;
w2 = Iris_Dataset(from:to, :);
from = (1 + num_of_samples_per_class * 2);
to = from + num_of_training_samples - 1;
w3 = Iris_Dataset(from:to, :);


We applied the formulas to calculate the mean and the variance using Maximum-likelihood estimation as explained above:

  ## estimate mean vector meu for each class (state of nature) 4x1 vector
meu_1 = vec(mean(w1));
meu_2 = vec(mean(w2));
meu_3 = vec(mean(w3));
## estimate covariance matrix
cov_1 = Estimate_Covariance_with_MLE (w1, meu_1, num_of_features, num_of_training_samples);
cov_2 = Estimate_Covariance_with_MLE (w2, meu_2, num_of_features, num_of_training_samples);
cov_3 = Estimate_Covariance_with_MLE (w3, meu_3, num_of_features, num_of_training_samples);
[\code]

For clarity and re-usability, the covariance matrix was calculated a separate function residing in its own .m file:

function [covariance_matrix] = Estimate_Covariance_with_MLE (training_set, meu, num_of_features, num_of_training_samples)
## estimate covariance matrix
covariance_matrix = zeros(num_of_features, num_of_features);
for k = 1:num_of_training_samples
for i = 1:num_of_features
for j = 1:num_of_features
covariance_matrix(i, j) += ( training_set(k, i) - meu(i) ) * ( training_set(k, j) - meu(j) )
endfor; ## j:= d:= #_of_features
endfor; ## i:= d:= #_of_features
endfor; ## k:= training samples
covariance_matrix = covariance_matrix / num_of_training_samples;
endfunction


The results were as follows;

$\mu_{1} = \begin{array}{lccr} 5.02800 & 3.48000 & 1.46000 & 0.24800 \\ \end{array}$

$\Sigma_{1} = \begin{array}{lccr} 0.1540160 & 0.1133600 & 0.0231200 & 0.0186560 \\ 0.1133600 & 0.1304000 & 0.0060000 & 0.0213600 \\ 0.0231200 & 0.0060000 & 0.0376000 & 0.0063200 \\ 0.0186560 & 0.0213600 & 0.0063200 & 0.0104960 \\ \end{array}$

$\mu_{2} = \begin{array}{lccr} 6.0120 & 2.7760 & 4.3120 & 1.3440 \end{array}$

$\Sigma_{2} = \begin{array}{lccr} 0.288256 & 0.105088 & 0.179056 & 0.049872 \\ 0.105088 & 0.119424 & 0.085088 & 0.044656 \\ 0.179056 & 0.085088 & 0.189056 & 0.061472 \\ 0.049872 & 0.044656 & 0.061472 & 0.040864 \\ \end{array}$

$\mu_{3} = \begin{array}{lccr} 6.5760 & 2.9280 & 5.6400 & 2.0440 \\ \end{array}$

$\Sigma_{3} = \begin{array}{lccr} 0.503424 & 0.116672 & 0.414960 & 0.059456 \\ 0.116672 & 0.125216 & 0.094880 & 0.057968 \\ 0.414960 & 0.094880 & 0.400800 & 0.064640 \\ 0.059456 & 0.057968 & 0.064640 & 0.062464 \\ \end{array}$

We could visually verify that our covariances are equal on both sides of the diagonal; i.e. $\sigma_{ij}^{2} = \sigma_{ji}^{2}$.

Note that we could've estimated the covariance matrix using an approach similar to [6], but we preferred this approach for readability and clarity. Also, note that a concept explained by Dr. Andrew Ng [15] called vectorization could have been used. Vectorization basically avoids using for loops to calculate inner products or dot products, but writes the formula in vector transpose form. Then, leaves it to the compiler to do the numerical optimization.

Having estimated the parameters of the underlying probability distributions for each class, we can start calculating the discriminant function for each class given the input test samples. In order to optimize our code a bit, we made use of the fact that the linear coefficients of the discriminant function gi(x) are independent of x, so we could calculate them only once

  ## calculate linear coefficients of discriminant functions for each class (state of nature)
## Wi
W_1 = (-1 / 2) * inv(cov_1);
W_2 = (-1 / 2) * inv(cov_2);
W_3 = (-1 / 2) * inv(cov_3);
## wi
w_1 = inv(cov_1) * meu_1;
w_2 = inv(cov_2) * meu_2;
w_3 = inv(cov_3) * meu_3;
## Omega_i_0
omega_0_1 = (-1/2) * transpose(meu_1) * inv(cov_1) * meu_1 - (1/2) * log(det(cov_1)) + log(priori);
omega_0_2 = (-1/2) * transpose(meu_2) * inv(cov_2) * meu_2 - (1/2) * log(det(cov_2)) + log(priori);
omega_0_3 = (-1/2) * transpose(meu_3) * inv(cov_3) * meu_3 - (1/2) * log(det(cov_3)) + log(priori);


Again using generic indexing, we extract the testing samples

  ## extract testing set Matrix 30 x 4
## use constants for generic coding
w1 = Iris_Dataset(num_of_training_samples+1:num_of_samples_per_class, :); ## (21:50, 🙂
from = (1+num_of_samples_per_class+num_of_training_samples); ## 71
to = from + num_of_testing_samples - 1; ## 100
w2 = Iris_Dataset(from:to, :);
from = (1+2*num_of_samples_per_class+num_of_training_samples); ## 121
to = from + num_of_testing_samples - 1; ## 150
w3 = Iris_Dataset(from:to, :);


In the following snippet, we compacted two steps together - thanks to Octave. We calculate the discriminant function using a .m file and simultaneously fill the confusion matrix. The confusion matrix is simple a c * c matrix where an element (i, j) means that the test sample belong to class j, but was classified by our classifier as class i. Note that if i equals j, then, our classifier was accurate. Hence, the overall accuracy is simply the sum of along the diagonal of the confusion matrix.

  ## Loop testing set for class 1
confusion_matrix = zeros(num_of_classes, num_of_classes);
confusion_matrix = Calculate_Linear_Discriminant_Function(confusion_matrix,w1, 1, W_1, W_2, W_3, w_1, w_2, w_3, omega_0_1, omega_0_2, omega_0_3);
confusion_matrix = Calculate_Linear_Discriminant_Function(confusion_matrix,w2, 2, W_1, W_2, W_3, w_1, w_2, w_3, omega_0_1, omega_0_2, omega_0_3);
confusion_matrix = Calculate_Linear_Discriminant_Function(confusion_matrix,w3, 3, W_1, W_2, W_3, w_1, w_2, w_3, omega_0_1, omega_0_2, omega_0_3);

overall_accuracy = sum(diag(confusion_matrix)) / (num_of_testing_samples*num_of_classes);

function [confusion_matrix] = Calculate_Linear_Discriminant_Function (confusion_matrix, class_state_of_nature, label, W_1, W_2, W_3, w_1, w_2, w_3, omega_0_1, omega_0_2, omega_0_3)
for i = 1:rows(class_state_of_nature);
x = vec(class_state_of_nature(i,:));

g_1 = transpose(x) * W_1 * x + transpose(w_1) * x + omega_0_1;
g_2 = transpose(x) * W_2 * x + transpose(w_2) * x + omega_0_2;
g_3 = transpose(x) * W_3 * x + transpose(w_3) * x + omega_0_3;

if(g_1 &gt; g_2 &amp;&amp; g_1 &gt; g_3)
confusion_matrix(1, label) = confusion_matrix(1, label) + 1;
elseif(g_2 &gt; g_3)
confusion_matrix(2, label) = confusion_matrix(2, label) + 1;
else
confusion_matrix(3, label) = confusion_matrix(3, label) + 1;
endif;
endfor; ## rows i
endfunction
`

We've experimented with different values for the number of training samples and testing samples to see how the overall accuracy is affected. An understanding of the concept of cross-validation is necessary to understand the results. Let's spare it for a future blog post.

References:
1. How do I read a delimited file with strings/numbers with Octave? http://stackoverflow.com/questions/5301419/how-do-i-read-a-delimited-file-with-strings-numbers-with-octave
2. Documentation of GNU Octave and the Octave-Forge packages http://octave.sourceforge.net/docs.html
3. Source code for the example https://github.com/thecortex/Octave/tree/master/Pattern%20Recognition/Discriminant%20Function%20and%20Maximum%20Likelihood%20Estimation
4. what is the clear screen function http://octave.1599824.n4.nabble.com/what-is-the-clear-screen-function-td1616732.html
5. MATLAB Documentation http://www.mathworks.com/help/matlab/index.html
6. Calculating Covariance Matrix in Matlab http://stackoverflow.com/questions/13703063/calculating-covariance-matrix-in-matlab
7. GNU Octave Manual http://www.gnu.org/software/octave/doc/interpreter
8. Does MATLAB pass parameters using "call by value" or "call by reference"? http://www.mathworks.com/matlabcentral/answers/96960-does-matlab-pass-parameters-using-call-by-value-or-call-by-reference
9. GNU Octave Call by Value https://www.gnu.org/software/octave/doc/interpreter/Call-by-Value.html
10. Summing diagonal elements of matrix in Octave http://stackoverflow.com/questions/8115101/summing-diagonal-elements-of-matrix-in-octave
11. Iterate over each row in matrix Octave http://stackoverflow.com/questions/26087392/iterate-over-each-row-in-matrix-octave
12. Edgar Anderson's Iris Data https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html
13. Pattern Classification – Richard O. Duda, Peter E. Hart, David G. Stork – John Wiley & Sons, Nov 9, 2012.
14. LATEX An unofficial reference manual http://texdoc.net/texmf-dist/doc/latex/latex2e-help-texinfo/latex2e.pdf
15. OctaveTutorial - Vectorization - Machine Learning - Professor Andrew Ng https://www.youtube.com/watch?v=955HphMCMnI