This is the logical follow-up to Structure from Motion - Case of two views/images when the number of views/images/cameras is greater than two. Here, I am focusing on incremental (sequential) Structure from Motion, which is the most popular method for multi-view Structure from Motion. This is basically what Bundler: Structure from Motion (SfM) for Unordered Image Collections and VisualSFM: A Visual Structure from Motion System use to reconstruct a sparse 3D scene from a set of images.

## Sunday, February 21, 2016

## Sunday, February 14, 2016

### Structure from Motion - Case of two views/images

Structure from Motion (SfM) means recovery of the (sparse) 3D scene, that is the structure, and the camera poses, that is the motion. What follows is the Structure from Motion pipeline in the case of two (calibrated) images. Because the images are calibrated, the 3D scene can be reconstructed up to a scale factor. There are lots of pieces but each piece is rather easy to implement when you have at your disposal the book "Multiple View Geometry in computer vision" by Richard Hartley and Andrew Zisserman. This is fundamental reading material when doing anything that remotely relates to 3D scene reconstruction. Structure from Motion in the case of more than two views builds largely upon the two view case.

Once the outliers have been removed from the set of matches, the fundamental matrix F should be re-estimated by minimizing the reprojection error using the Levenberg-Marquardt algorithm.

The 3D structure and camera matrices may be improved by using what is called Bundle Adjustment. Please note that Bundle Adjustment needs good initial guesses; in other words, 3D scene reconstruction cannot rely solely on Bundle Adjustment.

It is possible to extract the camera matrices directly from the fundamental matrix F but, in this case, the 3D reconstruction is only projective, that is, it is known up to a 3D projective transformation (aka homography or collineation). Knowing the calibration matrices makes the 3D reconstruction more useful (correct), as it is then known up to an overall scale factor only.
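To make the calibrated route a bit more concrete, here is a minimal NumPy sketch of the textbook construction from Hartley and Zisserman: form the essential matrix from F and the calibration matrices, then decompose it into the four candidate camera poses. The function names are mine, and I am assuming the convention x2^T F x1 = 0 with the first camera at [I | 0]. Picking the one valid pose among the four (the one that puts triangulated points in front of both cameras) is left out.

```python
import numpy as np

def essential_from_fundamental(F, K1, K2):
    """E = K2^T F K1, assuming the convention x2^T F x1 = 0."""
    E = K2.T @ F @ K1
    # Enforce the essential-matrix structure: two equal singular values, one zero.
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

def candidate_poses(E):
    """Return the four (R, t) candidates for the second camera."""
    U, _, Vt = np.linalg.svd(E)
    # Flip signs so that both rotation candidates have determinant +1.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]  # unit-norm translation direction; the true scale is not recoverable
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

The translation comes out with unit norm, which is exactly the "up to scale" ambiguity mentioned above.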

Clearly, the 3D reconstruction obtained this way can only be sparse as the image matches are themselves sparse, usually obtained with SIFT and filtered by RANSAC or variants of those.

It should be noted that Epipolar Rectification 9 (ER9) and Epipolar Rectification 9b (ER9b) compute the fundamental matrix in order to get rid of outliers (bad matches). As a bonus (in the process of rectification), they also compute the focal length (the images are assumed to be taken by the same camera) from which one can easily get the (simplified) calibration matrix.
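For reference, here is what I mean by a (simplified) calibration matrix, as a small sketch. The assumptions are mine and should be read as such: square pixels, zero skew, and a principal point at the image center. The function name is only for illustration and the focal length is expressed in pixels.

```python
import numpy as np

def simple_calibration_matrix(focal_px, width, height):
    """Calibration matrix assuming square pixels, zero skew, centered principal point."""
    return np.array([[focal_px, 0.0, width / 2.0],
                     [0.0, focal_px, height / 2.0],
                     [0.0, 0.0, 1.0]])
```

With the same camera for both images, the same matrix would be used for both views.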

## Sunday, February 7, 2016

### Depth Map Automatic Generator 5c (DMAG5c)

DMAG5c is a variant of Depth Map Automatic Generator 5 (DMAG5). The core of the method is still based upon Fast Cost-Volume Filtering for Visual Correspondence and Beyond by Christoph Rhemann, Asmaa Hosni, Michael Bleyer, Carsten Rother, and Margrit Gelautz. Where it differs from DMAG5 is in the choice of the raw matching cost. DMAG5c uses a SIFT-like descriptor to determine the raw matching cost. SIFT is a robust method to detect features in images and match them. It is explained in detail in Distinctive Image Features from Scale-Invariant Keypoints by David G. Lowe. The advantage of using a SIFT-like descriptor in determining the raw matching cost is that it's fairly invariant to illumination changes because it focuses on gradient orientations rather than actual colors.

The SIFT-descriptor used here is vastly simplified since: (i) it has a fixed radius (equal to 2), (ii) it uses a single gradient histogram, in other words, there is only one bin in image space, (iii) the gradient magnitudes are not weighted, and (iv) it is assumed the stereo pair has been rectified and there is therefore no need to rotate the descriptor window. Because occlusion handling happens when the raw matching cost is smoothed (for each possible disparity), it is a good idea to keep the radius of the SIFT-descriptor small. Because the radius is small, there is really no need to use more than one bin in image space and consider weights for the gradient magnitudes.
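Here is a rough NumPy sketch of what such a simplified descriptor could look like. This is not the DMAG5c code: the number of orientation bins (8), the normalization, the Euclidean distance used as cost, and the convention that left pixel x matches right pixel x - d are all assumptions of mine. What it does keep from the description above is the fixed radius of 2, the single histogram per pixel, the absence of spatial weighting, and no rotation of the window.

```python
import numpy as np

def sift_like_descriptors(gray, radius=2, n_bins=8):
    """Per-pixel gradient-orientation histogram over a (2*radius+1)^2 window."""
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2.0 * np.pi)
    bins = np.minimum((ang / (2.0 * np.pi) * n_bins).astype(int), n_bins - 1)
    h, w = gray.shape
    # Each pixel votes its gradient magnitude into its orientation bin.
    votes = np.zeros((h, w, n_bins))
    votes[np.arange(h)[:, None], np.arange(w)[None, :], bins] = mag
    # Sum the votes over the window (no spatial weighting, edges replicated).
    padded = np.pad(votes, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    desc = np.zeros_like(votes)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            desc += padded[dy:dy + h, dx:dx + w]
    # Normalize so that a change in contrast does not change the descriptor.
    norm = np.linalg.norm(desc, axis=2, keepdims=True)
    return desc / np.maximum(norm, 1e-12)

def raw_matching_cost(desc_left, desc_right, d):
    """Cost of disparity d, assuming left pixel x matches right pixel x - d."""
    h, w, _ = desc_left.shape
    xs = np.arange(w)
    valid = (xs - d >= 0) & (xs - d < w)
    cost = np.full((h, w), np.inf)  # pixels whose match falls outside stay at inf
    diff = desc_left[:, valid] - desc_right[:, xs[valid] - d]
    cost[:, valid] = np.linalg.norm(diff, axis=2)
    return cost
```

Because only gradients enter the descriptor and the histogram is normalized, a uniform brightness offset or contrast change between the two images leaves the cost unchanged in this sketch.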

A quick word about the parameters:

- min and max disparity. Those can be obtained with zero effort by using Epipolar Rectification 9b (ER9b).

- radius. This is the radius of the guided image filter. The larger the better but up to a certain point. Note that the speed of DMAG5c doesn't depend on the size of the filter, which is kind of a good thing.

- epsilon. This controls the smoothness of the depth map. The lower the epsilon, the smoother the depth map. I think that 4 is a pretty good value but you can certainly try 3, 2, 1, and even 0.

- disparity tolerance. Controls how tight you want the consistency between left and right depth maps to be. Pixels that have inconsistent disparities (those are shown in black in the occlusion maps) have their disparities recomputed using some sort of averaging between neighboring pixels. That averaging is controlled by the radius to smooth occlusions, sigma space and sigma color. The default values should be more than ok in most cases.
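The left-right consistency test behind the disparity tolerance can be sketched as follows. This is a generic version under my own assumptions (left pixel x matches right pixel x - d, both maps store disparities with the same sign, and pixels whose match falls outside the image are flagged), not the exact DMAG5c logic. The function name is hypothetical, and the re-filling of the flagged pixels by averaging is not shown.

```python
import numpy as np

def occlusion_mask(disp_left, disp_right, tolerance=0):
    """True where the left disparity disagrees with the right map at the matched pixel."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :]
    ys = np.arange(h)[:, None]
    # Assumed convention: left pixel x matches right pixel x - d.
    x_right = xs - np.rint(disp_left).astype(int)
    inside = (x_right >= 0) & (x_right < w)
    x_clamped = np.clip(x_right, 0, w - 1)
    diff = np.abs(disp_left - disp_right[ys, x_clamped])
    # Flag pixels that match outside the image or differ by more than the tolerance.
    return ~inside | (diff > tolerance)
```

A tolerance of 0 demands an exact match between the two maps; larger values let more pixels through as consistent.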

Here's how it behaves on tsukuba which, by the way, doesn't suffer at all from illumination changes:

In practice, DMAG5 should be tried first. If the depth map is not satisfactory no matter the choice of parameters, then it's probably a good idea to switch over to DMAG5c. This is all assuming that the stereo pair has been properly rectified by, let's say, ER9b.

The Windows executable (guaranteed to be virus free) is available for free via the 3D Software Page.

## Saturday, February 6, 2016

### Depth Map Automatic Generator 8b (DMAG8b)

DMAG8b is a multi-view stereo automatic depth map generator based on the plane-sweep approach originally developed by Robert T. Collins in A Space-Sweep Approach to True Multi-Image Matching.

It is quite similar in principle to Depth Map Automatic Generator 8 (DMAG8). The difference with DMAG8 is that the matching is performed using the methodology of Depth Map Automatic Generator 5 (DMAG5).
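To give an idea of what a plane sweep involves, here is a small sketch of the homography induced by a plane, which is what maps the reference image onto another view for each candidate depth. This illustrates the general technique, not the DMAG8b code. The conventions are assumptions of mine: the plane is n.X = depth in the reference camera frame, a point moves from the reference frame to the other camera as X_other = R X_ref + t, and the planes are fronto-parallel and sampled uniformly in inverse depth.

```python
import numpy as np

def plane_homography(K_ref, K_other, R, t, depth, normal=(0.0, 0.0, 1.0)):
    """Homography taking reference pixels to the other view via the plane n.X = depth."""
    n = np.asarray(normal, dtype=float).reshape(1, 3)
    t = np.asarray(t, dtype=float).reshape(3, 1)
    return K_other @ (R + t @ n / depth) @ np.linalg.inv(K_ref)

def sweep_homographies(K_ref, K_other, R, t, near, far, n_planes):
    """One homography per fronto-parallel plane, sampled uniformly in inverse depth."""
    inv_depths = np.linspace(1.0 / near, 1.0 / far, n_planes)
    return [plane_homography(K_ref, K_other, R, t, 1.0 / inv_d) for inv_d in inv_depths]
```

For each plane, the other images are warped with the corresponding homography and compared against the reference image; the depth whose plane gives the best match at a pixel wins.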

How to run DMAG8b (on Windows 64 bit) is explained in the DMAG8b manual that's inside the ugosoft3d-8-x64 archive.

Update: Instead of using VisualSFM to get the camera positions and orientations (and the sparse reconstruction), use Structure from Motion 10 (SfM10). It's, in my opinion, much better and simpler to use.

Here's an example:

This is a set of three images taken with a regular non-stereo camera. Those are of course not aligned in any way.

The Windows executable (guaranteed to be virus free) is available for free via the 3D Software Page.

### Epipolar Rectification 9b (ER9b)

ER9b is an implementation of Quasi-Euclidean Uncalibrated Epipolar Rectification by A. Fusiello and L. Irsara. ER9b also contains an implementation of Distinctive Image Features from Scale-Invariant Keypoints by David G. Lowe and an implementation of Automatic Homographic Registration of a Pair of Images, with A Contrario Elimination of Outliers by Lionel Moisan, Pierre Moulon, and Pascal Monasse. Unlike Epipolar Rectification 9, this is my own implementation (of the whole enchilada including SIFT and ORSA).

Note that the A Contrario elimination of outliers is referred to as ORSA (Optimized Random SAmple) or AC-RANSAC (A Contrario RANSAC). RANSAC (RANdom SAmple Consensus) is the reference algorithm when you need to get rid of outliers in a set of matches between two images. The advantage of A Contrario RANSAC over plain RANSAC is that it eliminates the always delicate thresholding that's needed to separate the inliers from the outliers.
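To show where that threshold comes in, here is a toy version of plain RANSAC. To keep it short, I have swapped the fundamental matrix for a much simpler model, a pure 2D translation between matched points, so the minimal sample is a single match. This is not what ER9b does, and the function name and parameters are made up for the illustration.

```python
import numpy as np

def ransac_translation(pts1, pts2, threshold, n_iters=200, seed=0):
    """Plain RANSAC for a 2D translation model (toy stand-in for a fundamental matrix)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(pts1), dtype=bool)
    for _ in range(n_iters):
        i = rng.integers(len(pts1))        # minimal sample: one match
        shift = pts2[i] - pts1[i]          # model hypothesis from that sample
        residuals = np.linalg.norm(pts1 + shift - pts2, axis=1)
        inliers = residuals < threshold    # the hand-tuned threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit the model on all the inliers of the best hypothesis.
    shift = (pts2[best_inliers] - pts1[best_inliers]).mean(axis=0)
    return shift, best_inliers
```

Set the threshold too low and good matches get thrown away; set it too high and bad matches sneak in. The A Contrario approach picks that cutoff automatically instead of asking you for it.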

Epipolar rectification (of the uncalibrated kind) takes 2 images and transforms them such that stereo matches are all along horizontal lines. This is crucial to get the best possible results in automatic depth map generation. The 2 input images can really be anything as long as they represent the same scene.

Here's an example:

These two images were taken with a regular non-stereo camera without paying too much attention to horizontal alignment.

Clearly, those need to be rectified prior to generating a depth map.

Even though pieces of the image get lost due to camera rotation, the movement is now on the horizontal.

ER9b also outputs (in an image format) the features detected by SIFT, the matches found by SIFT, and the (good) matches found by ORSA.

Matches found by SIFT in the 2 images. A match is represented by two rectangles of the same color. These matches must be processed by a RANSAC type of algorithm in order to reject the outliers.

Matches remaining after ORSA has removed the outliers.

It should be noted that ER9b gives the minimum and maximum disparity of the rectified stereo pair in the console window printout. These disparities can be used as input to the automatic depth map generators that are available here for download. Unlike ER9, there's no need to manipulate those values.
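If you have the rectified inlier matches at hand, a disparity range can be derived along these lines. This is only a sketch of the idea, not necessarily how ER9b computes the values it prints: the sign convention (disparity = x_left - x_right), the rounding outward to integers, and the function name are assumptions of mine.

```python
import numpy as np

def disparity_range(pts_left, pts_right):
    """Min and max disparity from matched points of a rectified pair.

    Assumed sign convention: disparity = x_left - x_right.
    """
    d = pts_left[:, 0] - pts_right[:, 0]
    # Round outward so that every match falls inside the integer range.
    return int(np.floor(d.min())), int(np.ceil(d.max()))
```

After a good rectification, the y coordinates of matched points should be nearly equal, so only the x coordinates matter here.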

Here is a video tutorial for ER9b:

The Windows executable (guaranteed to be virus free) is available for free via the 3D Software Page.

Source code: ER9b on github.
