Project 4 - Stitching Photo Mosaics

Part 1: Naive Approach

Shoot the Pictures

For this part, I chose scenic shots to create mosaics. The first pair of images was taken near Moffit Library, by the entrance to a building. The second pair was taken near Berkeley Way West. The third pair was of my living room. To take these images, I first took a shot in a fixed direction, then rotated my phone camera directly to the right or left, so that the two images would differ only by a rotation of the camera. To rectify images, I chose a picture of my roommate's Spongebob plushie, since its shape is roughly rectangular, along with a random textbook lying around. Below are the images I took:

Picture turned leftward near Moffit
Picture of building near Moffit
1st pic near Berkeley Way West
2nd pic near Berkeley Way West
1st image of room
2nd image of room
Original image of Spongebob
Original image of textbook

Recover Homographies

For this part, I used the same label selection tool provided in the previous project. For each image, I selected more than 4 points, usually 6-8, to create an overdetermined system of equations. I stacked the selected points into two lists, one for image 1 and one for image 2. For a single corresponding pair of (x, y) values, the equation is H * [x_1, y_1, 1]^T = [x_2, y_2, 1]^T (up to scale). To solve this, we apply SVD. We move all of the terms of the equation to one side and stack the coefficients of the entries of H into a matrix A; each correspondence contributes two rows, one from its x equation and one from its y equation. Because all terms are on one side, we want a solution to A*h = 0, where h is the vector of all entries of H stacked together. From SVD, we know that A = U * S * V^T, and the last column of V (the right singular vector with the smallest singular value) is the least-squares solution to A*h = 0. After getting this column, we reshape it into H and normalize it so that the bottom-right element is 1. And to reverse this homography, we simply take the inverse of H, which is what computeH() will return.
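For concreteness, here is a minimal sketch of this SVD-based solve. The function name and argument conventions are my own for illustration, not the project's exact computeH(); the sketch also returns H itself, and the inverse described above would be one np.linalg.inv call on top of it.

```python
import numpy as np

def compute_homography(pts1, pts2):
    """Estimate H such that H @ [x1, y1, 1]^T ~ [x2, y2, 1]^T (up to scale).

    pts1, pts2: (N, 2) arrays of corresponding (x, y) points, with N >= 4.
    """
    rows = []
    for (x1, y1), (x2, y2) in zip(pts1, pts2):
        # Each correspondence contributes two rows of A: one from the x
        # equation and one from the y equation, all terms moved to one side.
        rows.append([x1, y1, 1, 0, 0, 0, -x2 * x1, -x2 * y1, -x2])
        rows.append([0, 0, 0, x1, y1, 1, -y2 * x1, -y2 * y1, -y2])
    A = np.array(rows, dtype=np.float64)

    # The right singular vector with the smallest singular value (last row of
    # Vt, i.e. last column of V) is the least-squares solution to A h = 0.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)

    # Normalize so the bottom-right entry is 1.
    return H / H[2, 2]
```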

Warping Images and Rectification

Using computeH() from the previous part, I implemented warpImage() to warp the images. First, I computed the homography H of the image to be warped and passed it in as a parameter. Then, I warped the corners of the image, since these give the dimensions of the warped result. Using this information, I initialized a new canvas with large enough dimensions to hold the warped image. Then, I took the pixel coordinates inside the warped image's polygon and mapped them back to the original image using an inverse warp, i.e., by applying the inverse of H. Finally, I copied the sampled pixels onto this canvas, producing the final warped image. To test this part, I tried rectifying the images below. Because of Spongebob's shape, the basic rectangular area has been rectified, and the face now appears to point more towards the front. The book also appears to face straight up. However, because the book was a bit worn, its edges were not perfectly straight. The plushie's edges were also a bit rounded, since it is a plushie rather than a figurine with a fixed frame. Regardless, the results look good and are rectangular, with the faces pointing straight out of the screen. The before and after images are shown below, after a short sketch of the warp:
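Here is a minimal sketch of the inverse-warp idea described above, assuming nearest-neighbor sampling and an (x, y) point convention; the actual warpImage() may differ in how it samples pixels and tracks the canvas offset.

```python
import numpy as np

def warp_image(im, H):
    """Warp image `im` by homography H (source -> target) using inverse mapping."""
    h, w = im.shape[:2]

    # Warp the four corners to find the bounding box of the output canvas.
    corners = np.array([[0, 0, 1], [w - 1, 0, 1], [w - 1, h - 1, 1], [0, h - 1, 1]]).T
    warped = H @ corners
    warped = warped[:2] / warped[2]
    x_min, y_min = np.floor(warped.min(axis=1)).astype(int)
    x_max, y_max = np.ceil(warped.max(axis=1)).astype(int)

    out = np.zeros((y_max - y_min + 1, x_max - x_min + 1, 3), dtype=im.dtype)

    # Inverse-map every output pixel back into the source image.
    ys, xs = np.mgrid[y_min:y_max + 1, x_min:x_max + 1]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = np.linalg.inv(H) @ pts
    src = src[:2] / src[2]
    sx = np.round(src[0]).astype(int)
    sy = np.round(src[1]).astype(int)

    # Copy only the pixels whose source coordinates land inside the image.
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out.reshape(-1, 3)[valid] = im[sy[valid], sx[valid]]
    return out, (x_min, y_min)  # the offset records where the canvas sits
```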

Original Spongebob
Rectified Spongebob
Original Book
Rectified Book

Blend images into a mosaic

For this part, I used computeH and warpImage for the majority of the logic. First, I used warpImage to get the warped image. Next, I needed to create a mask for each image, from which I could build a mask of the region where the two images overlap. This is crucial, because I don't want to naively blend the two images together: neither image completely fills the canvas, so there is a lot of black space that would ruin a naive blend. By using the mask of the overlapping region, I blend only that part, which creates a seamless result. For the blend itself, I used alpha blending. For the photos near Moffit, I had a sharper turn, with an overlap of close to 40% between the images. Because of this, the warped image had to stretch its pixels much more, which yielded a lower-resolution result near the left side. The living room blend, whose photos were taken much closer together, shows less of a stretch. The results are shown below, after a short sketch of the blending step:
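As an illustration, here is a minimal sketch of the masked alpha blend, assuming both images have already been placed onto a shared canvas with empty pixels set to zero; the function name and the constant alpha of 0.5 are assumptions for this sketch, not necessarily the values used for the mosaics shown here.

```python
import numpy as np

def blend_on_canvas(canvas1, canvas2, alpha=0.5):
    """Alpha-blend two images that share a canvas, but only where they overlap."""
    mask1 = np.any(canvas1 > 0, axis=2)
    mask2 = np.any(canvas2 > 0, axis=2)
    overlap = mask1 & mask2

    out = np.zeros_like(canvas1, dtype=np.float64)
    # Pixels covered by only one image are copied directly...
    out[mask1 & ~overlap] = canvas1[mask1 & ~overlap]
    out[mask2 & ~overlap] = canvas2[mask2 & ~overlap]
    # ...and only the overlapping region is alpha-blended, so the black
    # borders of either canvas never bleed into the result.
    out[overlap] = alpha * canvas1[overlap] + (1 - alpha) * canvas2[overlap]
    return out.astype(canvas1.dtype)
```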

Left image near Moffit
Right image near Moffit
Mosaic of images
Left image near Way West
Right image near Way West
Mosaic of images
Left image of room
Right image of room
Mosaic of images

Part 2: Feature Matching for Autostitching

Overview

This part of the project is an extension of the previous part, where now we want to automatically stitch images together instead of having to manually choose the feature points. To do this, we follow the MOPS paper. Mainly, we split this into 5 steps:

  1. We detect Harris corners in an image. We can then filter out these corners based on their scores, and we then use ANMS to keep corners that are spread out across the image.
  2. We extract feature descriptors for each corner.
  3. We match these feature descriptors between the two images, establishing correspondences between them.
  4. We search for the best homography of these points by using RANSAC.
  5. Finally, reusing code from the previous part, we generate the mosaics.

To show the results of each step, we will use the same set of images, the ones taken near Berkeley Way West.

Step 1: Detecting corner features in an image

For this part, we start with the Harris Interest Point Detector; this part of the code was provided as part of the project. However, this naive approach to collecting feature points is not ideal. As shown below, far too many points are captured, some of which are due to noise. This might also be because my camera lens isn't great, so in shots of smooth regions such as the sky, the image quality introduces a slightly pixelated gradient, which creates incorrect "edges" in what should be a smooth section of the photo. As a result, the image is covered with feature points when using the naive Harris corner detector. Below are the Harris matrix H and the points detected for the sample image:

Original sample image
Harris matrix of image
Harris corners detected (naive)

Again, notice how far too many points are detected. Most are simply noise, such as all of the ones in the sky, which we don't want affecting our results. However, the Harris matrix H shows that certain areas are "important": their corner scores are higher, so those areas appear brighter in the matrix. To remove the noise, we can filter the detected points, keeping only those with a high enough corner strength, which is given by the values in H. Each value in H is the score of the corresponding pixel in the original image. I filtered with a threshold of 0.1, which worked well for keeping only the most important points. The improvement is shown below, after a short sketch of this filtering step:
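A minimal sketch of this thresholding, assuming the provided get_harris_corners() returns the strength matrix h along with a 2 x N array of (row, col) coordinates (that interface is my assumption):

```python
import numpy as np

def filter_harris(h, coords, threshold=0.1):
    """Keep only the Harris points whose corner strength exceeds the threshold."""
    scores = h[coords[0], coords[1]]   # look up each point's score in the H matrix
    keep = scores > threshold
    return coords[:, keep]
```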

Original sample image
Corners detected (naive)
Corners detected (filtered)

Notice how the images above show significantly fewer points captured. And, compared against the Harris matrix, the points do align with "important" features, such as certain areas in the trees or edges on the buildings. We now do this for both images so that we have two sets of points, one per image:

Original left image
Points for left image
Original right image
Points for right image

We can already see that some points could potentially match each other. However, there are still far too many points, currently well above 1,000. So, we use ANMS (Adaptive Non-Maximal Suppression) to retain only key points that are well spread out across the image.

  1. First, we get the corner score of each filtered Harris point from the Harris matrix H.
  2. Next, we compute each point's suppression threshold, c_robust times its score, with c_robust = 0.9. I store these values in a matrix so that numpy broadcasting can compare every point's score against every other point's threshold without explicit loops.
  3. Now, I create the square mask matrix from those comparisons. An entry (i, j) is true only if point i's score is lower than the suppression threshold of point j, i.e., point j is strong enough to suppress point i.
  4. Finally, using dist2(), the provided function for computing pairwise distances, together with the mask, I compute the distances between all points at once. The key is to set the distance between two points to infinity (np.inf) if the second point does not actually suppress the first, so that it cannot affect the first point's radius. Relying on numpy operations here instead of manual for() loops noticeably sped up my code.
  5. Each point's suppression radius is then the minimum of its remaining distances. According to the paper, we only care about the points with the greatest radii, so we rank the radii and keep the points at the corresponding indices. (A sketch of these steps follows this list.)
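Here is a minimal sketch of the ANMS steps above. I substitute an explicit numpy broadcast for the provided dist2() helper (same meaning: pairwise squared distances); the function name and the 2 x N (row, col) coordinate convention are assumptions.

```python
import numpy as np

def anms(coords, h, num_keep=500, c_robust=0.9):
    """Adaptive non-maximal suppression: keep strong, well-spread corners."""
    pts = coords.T.astype(np.float64)            # N x 2 points
    scores = h[coords[0], coords[1]]             # corner strength per point

    # Pairwise squared distances between all points (what dist2() computes).
    diff = pts[:, None, :] - pts[None, :, :]
    d2 = np.sum(diff ** 2, axis=2)

    # suppresses[i, j] is True when point j is strong enough to suppress point i.
    suppresses = scores[:, None] < c_robust * scores[None, :]
    d2 = np.where(suppresses, d2, np.inf)

    # A point's suppression radius is the distance to its nearest suppressor.
    radii = d2.min(axis=1)

    # Keep the points with the largest radii, i.e. the best spread across the image.
    keep = np.argsort(-radii)[:num_keep]
    return coords[:, keep]
```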
Using this ANMS method, we have cut down the number of points significantly, while still keeping the most important ones. For this run, I only kept 500 points for each image. Below are the results:

Original left image
Points after filtering only
Points after ANMS
Original right image
Points after filtering only
Points after ANMS

As shown above, the number of points has been significantly reduced, while the most important ones are kept. The points still sit at important locations, such as the windows or near other edges and features. A visual check is to compare these results against the Harris matrix; the final points all lie in its bright regions too. So, these points are good enough, and we move on to the next step.

Step 2: Extracting a Feature Descriptor for each feature point

Now, we use MOPS (Multiscale Oriented Patches) to extract a feature descriptor for each point. We take in the coordinates from the previous step and, for each point, first create a 40x40 pixel patch around it. Next, to save computation, I downsize each patch to 8x8 pixels and normalize the values in this small patch. Finally, I flatten each descriptor into a 1x64 vector and return the results. Below, after a short sketch of this extraction, are some of the 8x8 patches detected for the right image:
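A minimal sketch of this descriptor extraction, assuming a grayscale input, a 2 x N (row, col) coordinate array, and bias/gain normalization (subtract the mean, divide by the standard deviation); skipping points too close to the border is a choice of this sketch, not necessarily of my actual code.

```python
import numpy as np
from skimage.transform import resize

def extract_descriptors(im_gray, coords, patch=40, out=8):
    """Extract axis-aligned MOPS-style descriptors around each corner point."""
    half = patch // 2
    descriptors = []
    for r, c in coords.T:
        # Skip points whose 40x40 window would fall outside the image.
        if r < half or c < half or r + half > im_gray.shape[0] or c + half > im_gray.shape[1]:
            continue
        window = im_gray[r - half:r + half, c - half:c + half]
        # Downsample the 40x40 window to 8x8 to save computation.
        small = resize(window, (out, out), anti_aliasing=True)
        # Bias/gain normalization: zero mean, unit standard deviation.
        small = (small - small.mean()) / (small.std() + 1e-8)
        descriptors.append(small.ravel())        # flatten to a 1x64 vector
    return np.array(descriptors)
```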

Patch 1
Patch 2
Patch 3

Step 3: Matching feature descriptors between two images

After getting the two sets of feature descriptors, one for each image, we now need to match them.

  1. First, we compute the pairwise distance matrix between the two sets of descriptors. This also sets up the later matrix computations, allowing for a faster approach.
  2. Now, we loop through the descriptors of the first image, trying to match them to the second image. For each row of the distance matrix (one row per descriptor of image 1), we sort the distances. Because we only care about the indices of the sorted values rather than the values themselves, argsort() is used instead of sort().
  3. Next, we use Lowe's technique to find the best matches. We have the ranked distances from the last step. Lowe's technique states that if there is a true match, the nearest neighbor should be significantly closer than the second nearest neighbor. So, we compare the ratio of these two distances for each descriptor and, if the ratio is below a certain threshold, we consider the two descriptors a good match. Only when the threshold condition is met do we remember the matching indices of the descriptors, in the same order, to preserve the match connection. (A sketch of this matching follows the list.)
  4. Finally, we return the two lists of coordinates for the two images. For a given i, the ith point in the first list matches the ith point in the second list.
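A minimal sketch of this ratio-test matching, again substituting an explicit numpy computation for dist2(); the function name and return format are my own.

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio_threshold=0.7):
    """Match descriptors with Lowe's ratio test; returns (i, j) index pairs."""
    # Pairwise distances between every descriptor of image 1 and image 2.
    dists = np.sqrt(np.sum((desc1[:, None, :] - desc2[None, :, :]) ** 2, axis=2))

    matches = []
    for i in range(dists.shape[0]):
        order = np.argsort(dists[i])             # neighbor indices, nearest first
        nearest, second = dists[i, order[0]], dists[i, order[1]]
        # Accept only if the nearest neighbor is much closer than the runner-up.
        if nearest / (second + 1e-12) < ratio_threshold:
            matches.append((i, order[0]))
    return np.array(matches)
```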
Below are the results after matching and keeping those descriptor points. The threshold I used was 0.7, which was a good value based on the paper. Compared to the points kept after ANMS, there are significantly fewer points. And visually, we can see that most of these points match each other.

Original left image
Points after ANMS
Points after matching
Original right image
Points after ANMS
Points after matching

Step 4: Use RANSAC to compute a homography

Now, the points above still aren't perfect. On closer inspection, we can already spot some outliers. For example, the right image includes part of the biology building that is not present in the left image, yet the matching algorithm still kept some points in that region. So, we now use RANSAC to find the best homography between the two images, while also further filtering the points down to a better list.

  1. We first randomly select 4 points from the matched pairings to compute an exact homography. Because there are exactly 4 points, there is no need for least squares to give an approximate solution.
  2. Next, we warp all of the points from the first image using this homography, and we normalize them.
  3. After this, we create a mask recording which warped points are close enough to the corresponding points of the second image. For this step, I used an error threshold of 4. With this mask, numpy can quickly keep only the points whose error is below the threshold; these are the inlier points for this homography.
  4. Finally, I keep the mask and record the number of inliers it contains. (A sketch of one iteration and the selection loop follows this list.)
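A minimal sketch of this RANSAC loop under the same assumptions as before; compute_homography() refers to the SVD solver sketched in Part 1, and the names and return format here are my own.

```python
import numpy as np

def ransac(pts1, pts2, n_iters=1000, threshold=4.0):
    """Return the inlier correspondences found by the best sampled homography.

    pts1, pts2: N x 2 numpy arrays of matched (x, y) points.
    """
    n = len(pts1)
    hom1 = np.hstack([pts1, np.ones((n, 1))]).T    # 3 x N homogeneous points of image 1
    best_mask, best_count = None, 0

    for _ in range(n_iters):
        # 1) Random minimal sample: 4 correspondences define an exact homography.
        idx = np.random.choice(n, 4, replace=False)
        H = compute_homography(pts1[idx], pts2[idx])

        # 2) Warp all points from image 1 and normalize by the third coordinate.
        warped = H @ hom1
        warped = (warped[:2] / warped[2]).T        # N x 2

        # 3) Inliers are matches whose reprojection error is below the threshold.
        err = np.linalg.norm(warped - pts2, axis=1)
        mask = err < threshold

        # 4) Remember the iteration with the most inliers.
        if mask.sum() > best_count:
            best_count, best_mask = mask.sum(), mask

    # The final homography is later fit to all of these inliers,
    # not just to the 4 randomly sampled points.
    return pts1[best_mask], pts2[best_mask]
```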
We repeat the steps above, currently for around 1000 iterations. After all of these iterations of randomly sampled candidate homographies, we take the iteration with the largest number of inliers. Then, we simply use the mask from that iteration to get the indices of the inlier points, and we return the two lists of points, one for each image. The key to RANSAC, as discussed in lecture, is that the final homography is not the homography of the 4 randomly selected points. Instead, it is the homography fit to all of the inlier points. This is why we need to keep track of and eventually return those points, which means RANSAC should hopefully return more than 4 points per list. Below are the results, comparing the points found after matching features to the points found after using RANSAC:

Original left image
Points after matching
Points after RANSAC
Original right image
Points after matching
Points after RANSAC

After this step, we can see that the number of points has shrunk even further, and the outliers from the matching step have been removed. Each point from one image has an exact match in the other, and only these points are retained.

Step 5: Generate the mosaic

For this step, we simply reuse code from the first part of the project. The results are basically the same, since for the manually selected points I was careful to click only certain prominent corners in each image. Below are the results of the three mosaics, compared with those from the first part, where the points were manually selected:

Left image near Moffit
Right image near Moffit
Manual mosaic
Auto mosaic
Left image near Way West
Right image near Way West
Manual mosaic
Auto mosaic
Left image of room
Right image of room
Manual mosaic
Auto mosaic

Conclusion

This project was really interesting and fun. The first part was a good introduction to homographies and warping images; the theory was not as challenging, but seeing the results was really cool. However, the coolest thing was the second part of the project. I hadn't had the chance to replicate a paper before, and implementing such an automated algorithm from scratch was especially rewarding. Specifically, I thought implementing RANSAC was the most interesting part. I played around with the parameters a lot, but past a certain point the resulting mosaics would converge to the same good result. Overall, this project deepened my understanding of homographies and taught me how to implement a paper from scratch.