CA274 Programming for Data Analysis – Assignment 1

$30.00

Download Details:

  • Name: Assignment-1-yisosq.zip
  • Type: zip
  • Size: 433.37 KB

Category:

Description

5/5 - (1 vote)

In this assignment I will use the digits dataset given in CA274 and explore the different routes we took in
lectures to classify digits using the KNN algorithm.
Section 1 – KNN where k = 2:
Running our KNN with k = 2 means that we classify the digit off just the digit closest to it. This is because the
2 nearest digits to any digit always contains itself as the first digit because of our implementation.
Fig 1.1. Confusion matrix of the results vector from classification against the labels vector.
It can be seen in the above confusion matrix that the 8’s and 9’s are the digits most commonly misclassified,
so I have decided to look at their nearest neighbours to understand why exactly they are mistaken for other
numbers.
8’s mistaken for 3’s:
It can be seen here that the 8 is
identical to the 3 all along the right
hand side.
The only discrepancy is the thin
strokes to complete the 8 on the left
hand side that are absent on the 3.
This is why the 8 is so similar to the 3.
9’s mistaken for 3’s:
Here it is similar to the situation with
the 8 and the 3 above. Along the right
the digits are identical.
The 3 is missing a small arc to the left
of the centre in order to become a 9.
As this missing stroke is only small
these digits are still very close
together. Fig 1.3. 9 misclassified as 3 Screenshot.
8’s mistaken for 2’s:
It can be seen here that the 2 looks
like an 8 that is missing a diagonal
stoke from top left to bottom right.
The stroke thickness and shape is
identical and for this reason the 8 was
classified as a 2.
8’s mistaken for 5’s:
The digit 5 can be seen to follow a
similar shape to that of an 8.
The only difference is the absence
of a diagonal from bottom left to
top right in the 5. As they are so
close in shape and coverage it is no
wonder some 8’s are classified as
5’s.
9’s mistaken as 4’s:
In much the same way as the other
comparisons, the 9 and 4 have a very
similar shape. The only slight
difference between them is that the
4 is not rounded and closed at the
top.
This particular 9 I found was a very
exceptional case. The only indication
that it is a 9 is the curvature of the
digit.
Possible Improvement:
It is very clear that more complex digits are easier to misclassify because there is more variation between
different examples of that digit. If you are using distances between images alone you may need many
examples of 8’s and 9’s to make a difference to the ~95% accuracy as of current. This is because there are so
many different possibilities for these digits because they have more strokes and arcs than other classes of
digits.
Fig 1.6. 9 misclassified as 4 Screenshot.
Fig 1.5. 8 misclassified as 5 Screenshot.
Fig 1.4. 8 misclassified as 2 Screenshot.
One possible solution to this would be to check for the presence or absence of a stroke in the upper and
lower left regions of the digit in the case of the 8 being classified as a 3. Allow me to use a diagram to explain
my reasoning.
Below the original image I used can be seen. The premise of the idea is that images must have a similar
shape.
If an image such as the 8 on the left is being
compared to the 3 on the right. Looking at
distances shows they’re very similar.
If we were to check that there were at least
a similar number of bright pixels down the
bottom left we would find that there isn’t.
As a result we would know that there is a
stroke missing and these are not likely to be
the same digit. This could be applied to any
stroke in a digit to help in classification.
This in theory would be wasteful to do for every image being classified, especially considering that there are
only a handful where it would be necessary.
In addition to this, the fact that k = 2 means that an image is classified according to only the nearest
neighbour. If k is increased I believe that the results will be better as these misclassifications are due to there
being very similar shaped digits. In the general case where these digits didn’t have such a similar shape by
chance, we would see that the nearest neighbours would mainly consist of the same digit type.
Section 2 – 3D Point cloud (Displaying 8’s, 3’s and 8’s misclassified as 3’s):
I edited the code to show the 8’s, the 3’s and the 8’s misclassified as 3’s when k = 2 in the one point cloud.
This was with respect to the first 3 eigenvectors with the most information. The clouds can be seen below,
the 3’s are red dots, the 8’s are light-blue dots and the misclassified 8’s are black stars so that they stand out.
It can be seen in the above images that the misclassified 8’s lie a distance from the main clusters. They are
sat along an outer shell of the main cluster. This is clear as they are found among the stragglers at the top
and are also not as deep as the main cluster as they are placed at the front as a result of zorder. This helps to
explain why there is some misclassification, but there are some that seem like they are in areas with only 8’s
in near proximity.
Fig 1.7. Example to explain possible improvement
Fig 2.1. Point cloud highlighting position of misclassified 8’s (two opposite sides)
At first I was a little confused at why some of these points were being misclassified when they were buried
among other 8’s, how could their k-nearest neighbours be 3’s? Well, turning the point cloud around some
more and exploring the data revealed several 3’s (red points) lurking within the groups.
In some of the cases I could see that a red point did not always appear closest. Above, the line joining the
red dot to a black star in the right of the cluster looks like a longer distance than of that between the black
star and the light blue dot to the right of the red dot. What I must also consider though is that there is
information not shown in these 3 eigenvectors and among the other dimensions the red dot must be closer.
This shows to me that there is an obvious improvement to the algorithm. Should the value of k be increased
we would eliminate these unfortunate scenarios where the closest neighbour is a different digit. This would
definitely increase the accuracy and not be too expensive computationally for the increase in accuracy.
Running the code to get the accuracy with different values of k produces the following plot, I also used
Sys.time() before and after the process and stored the difference to time the process.
It can be seen in the first plot that either k = 3 or 5 produces the best accuracy. Looking at the times it took
to classify them it can be seen that there less than 0.3 of a second between the slowest and fastest times.
This I believe is just variation as there is no correlation between the value of k and the time taken.
Fig 2.2. Point cloud rotated to show hidden 3’s
Fig 2.3. Plot of accuracies for different values of k Fig 2.2. Plot of times to classify images with respect to k-nearest neighbours
For this reason I believe that a value of 3 or 5 for k would have no increase to the cost of the computations
but would make a difference to the overall accuracy of the classifications.
Section 3 – Accuracy using different no. of eigenvectors:
The main idea behind using eigenvectors is getting the original images in terms of the eigenvectors. When
you multiply the original matrix d by the eigenvectors to get p you are transforming the matrix d onto the
eigenvectors. This means that the first row corresponds to the amount the original images moves in the
direction of the first eigenvector and contains the most information. If we analyse the distances between
images with respect to only these rows we will get a good idea of the nearest neighbours with not nearly as
much computation.
Below is the curve representing the accuracy for different eigenvalues. I have every value up to 150 and then
in steps of 2 after that to get a nice curve.
It can be seen that at first there is a drastic and steady increase in accuracy as we introduce more
eigenvectors that still are rich in information. The curve follows what looks like a log functions typical shape.
With a small number of eigenvectors the code runs much faster than the nbrs256 code. At about 15
eigenvectors the increase in accuracy becomes much smaller. At 34 eigenvectors we reach our max of
98.18%, from here adding eigenvectors decreases the accuracy and causes it to fluctuate around the 97.8%
mark. So as a result of new information that is not so vital after 34 eigenvectors the accuracy changes slightly
but never getting better than at 34 eigenvectors. So going past 34 offers nothing but an increased amount of
computations.
Looking at the max value from using eigenvectors it seems larger than the max accuracy from using all 256
dimensions in the original set. This I believe is as a result of having only vital information that helps to
correctly classify digits. One possible issue is that it may be overfitting the dataset, but a test set would be
needed in order to check this.
Fig 3.1. Plot of accuracy for different numbers of eigenvectors.
As more eigenvectors are introduced the amount of time required to compute the nearest neighbours
increases rapidly. Below a graph of the time for different eigenvectors can be seen.
It can be seen above that the values follow a linear model. Using R I found that the equation to find the time
taken is mins = 0.04*eigenvectors + 1.746 for the above(excluding 150-160). In the above the values for
150,155 and 160 are outliers. This is because during the computations my laptop decided to sleep and this
took a toll on the computation time.
In order to find the ideal time for the best accuracy I decided to plot the times against accuracy.
It can be seen above that the plot looks very similar to the accuracy plot. It turns out that you could of
course use smaller values between 20-30 and still have a great accuracy with less computations, but
anything over 34 eigenvectors is a waste of computations.
This does show that using eigenvectors can provide as good if not a better accuracy with a fraction of the
computations.
Fig 3.2. Plot of times (in mins) for different numbers of eigenvectors.
Fig 3.3. Plot of accuracy for different times (times from eigenvalues 1-10, then from 10-150 in steps of 5)
Section 4 – Limiting computations using boxes:
In order to save on computations, only the distances for images in close proximity to the image we want to
classify will be computed.
I modified the code to loop through box dimensions from 200-900 in steps of 100. I timed the computation
and put the accuracy and times into 2 different matrices that correspond to one another. These can be seen
below.
It appears from looking at this data that the biggest box does not produce the best accuracy. I believe this is
because you are including images that are not very close to the image we are trying to classify with respect
to the first and second eigenvector and are so not close based on a lot of the information about the images.
Looking at the values of eigenvectors shows that using the first two eigenvectors definitely gets some of the
information, but this explains why the accuracy isn’t the highest when we have a big box in these two
dimensions. There are other dimensions that contain quite a lot of the information. When the box is too big
a lot of images are let through and those that are close in dimensions that aren’t very information rich seem
to be considered when classifying.
On the top of the following page a plot of all the times against the accuracies. The colours separate the rows.
So red is 200 along the first eigenvector, blue is 300, green is 400, blue violet is 500, chocolate is 600, cyan is
700, magenta is 800 and brown is 900.
I took some points using identify which are show below. They move along the column.
Point no. 19 27 35 36 41 60
x-dim 400 400 400 500 200 500
y-dim 400 500 600 600 700 900
Accuracy 97.19% 97.24% 97.25% 97.29% 97.16% 97.31%
Time 0.59 mins 0.699 mins 0.794 mins 0.975 mins 0.50 mins 1.25 mins
Fig 4.1a. accuracy of boxes of different dims Fig 4.1b. times of boxes of different dims
Fig 4.2. Values of eigenvectors
It can be seen that all of the points with good accuracy and fast speed lie in a lower range of both the
dimensions, typically < 600,600 should give less than a minute. Although if you do want that higher accuracy
you shouldn’t go as far as 900,900 as that is wasteful and has a lower accuracy, 500,900 at 1.25 mins is as
high as you should go dimension wise.
Section 5 – Using blurring and boxes:
The code supplied for blurring was pre-set to a width of 1 and so was quite blurred. I created a matrix for
both the accuracy and time taken. Below these two matrices can be seen, dimensions for the box range from
400-900 in steps of 100. This is because at smaller values some images did not have neighbours in this range.
The rows correspond to the value of the width in direction of the first eigenvector and the columns
correspond to the width of the box in the direction of the second eigenvector.
It can be seen from the above tables that changes in width of the accepted values for the first
eigenvector(changes along the rows of the table) cause an increase in the accuracy. This I believe is because
of the fact that the first eigenvector always contains the most information. This seems to apply to this case
more so than the usual as can be seen in the following plot of the eigenvectors for this blurring.
Fig 5.1a. blurred accuracy for width 1 Fig 5.1b. blurred time for width 1 (in mins)
Fig 4.3. Plot of times against accuracy (points from table above are highlighted
)
It can be seen that even though there is little to no change as you move along the columns there is still an
increase in the computation time. This is because along the second eigenvector there is less information
than the first (as is seen in the above plot of the eigenvectors) and the images within that range are not as
meaningful as the ones in the range of the first eigenvector. It is however a good thing to set a small width to
this vector as it did help to eliminate extra computations as can be seen when the width is higher.
Above each colour represents different value of the width for the second eigenvector, so the columns are
different colours. It can be seen that the accuracy for each is very similar in the distribution of the points. At
96.5% we have the first value in every column(the first row). This shows that although the time taken
increases the accuracy doesn’t as you increase the value of the width of the second eigenvector.
Fig 5.2. Eigenvector values for blur width of 1
Fig 5.3. Time vs Accuracy for blur width 1
It can also be seen that a similar thing happens with the other rows, so it seems that the best way to keep
the time taken down but have a good accuracy is to increase the width for the first eigenvector and keep the
width’s for the second eigenvector small. This can be seen as the biggest accuracy is achieved at 900,500.
This will not be the case when the level of blurring is not as intense as more information will get through and
there will be more in the other eigenvectors. So I would assume that the dimensions of the box in the
direction of the second eigenvector will be more important at higher widths in the blurring code.
To test this I used a list for blurred accuracy and blurred time and I had matrices identical to the ones in fig
5.1. populating these lists. For widths of values 3,5,7,9 and 11 I repeated what I did previous and made and
filled out a matrix for each level of blurring so that they could be compared for different dimensions and the
ideal combination for accuracy and efficiency could be found. For the less blurred details I used dimensions
that were smaller as we would not be at as coarse of a level of detail as if the blurring was at width 1. E.g. at
width = 9 and 11 I had dimensions of (300, 400, 500, 600, 700, 800).
Fig 5.4a. blurred accuracy for width 3 Fig 5.4b. blurred time for width 3 (in mins)
Fig 5.5a. blurred accuracy for width 5
Fig 5.5b. blurred time for width 5 (in mins)
Fig 5.6b. blurred time for width 7 (in mins) Fig 5.6a. blurred accuracy for width 7
Fig 5.7a. blurred accuracy for width 9
Fig 5.8a. blurred accuracy for width 11
Fig 5.7b. blurred time for width 9 (in mins)
Fig 5.8b. blurred time for width 11 (in mins)
In the above matrices for width 5,7,9 and 11 the colnames and rownames are incorrect, they should both be
the vector c(300,400,500,600,700,800) but in R an l and 1 look identical and part of my test code was used in
my final solution.
This is a lot of data, but what does it all mean?
It seems that blurring the data decreases the accuracy slightly as there are certain images that are not
checked as they do not lie in the box. The time also seems to be higher than that of the not blurred data
from the previous section for the first few. The upshot of blurring the images is that at a less intense level of
blurring at the middle levels there is nice accuracy for a smaller computational time.
In the plot below the colours represent the different levels of blurring with red being 1, blue being 3, green
being 5, magenta being 7, chocolate being 9 and black being 11.
It can be seen that the most accurate value is around the 97.3% mark and there are several times this value
is seen. As low as 0.6 mins is the first time it is seen. At a moderate level of blurring with dimensions around
the 600 or so This value can be found at a relatively low cost.
Overall, It must be concluded that at a higher width of blurring and a reasonably sized box, images can be
classified quite accurately and very efficiently. This can be seen as with just boxes we can get an accuracy of
97.31 in 1.25 mins with dimensions 500,900. Using a blurring width of 7 we can get an accuracy of 97.3% in
1.15 with dimensions 800,500. So blurring can be used to save on computation time.
Fig 5.9. Time against Accuracy for different levels of blurring.