Monday, July 25, 2016

3D scene reconstruction from a stereo pair using Structure from Motion and Multi-View Stereo

Even though Structure from Motion (SfM) and Multi-View Stereo (MVS) work best with a fair number of views (the more views, the more accurate the 3D scene reconstruction can be), they can certainly be used to reconstruct a 3D scene from a single stereo pair. With only two views, you don't have the luxury of cross-checking the matches derived from the single depth map, but if the depth map is accurate, the 3D scene reconstruction can be pretty decent.

So, let's try to do just that: extract a 3D reconstruction (point cloud) from a single stereo pair. Note that the stereo pair doesn't have to come from a stereo camera or stereo rig. It may come from a mono camera that's pointed at the same (static) scene, taking two shots from slightly different spots. Here, the stereo pair was taken with my Fuji W3.

Input: unrectified stereo pair taken by a stereo camera.
Output: 3D dense reconstruction in the form of a dense point cloud.
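At the heart of both steps is triangulation: a matched pixel in each image defines a ray from that camera's center, and the 3D point is where the two rays (nearly) meet. Here is a minimal, illustrative sketch of the midpoint method in plain Python. This is not the actual SfM/MVS code; in practice the ray directions come from the matched pixels and the camera intrinsics/extrinsics.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Closest point between two rays c1 + s*d1 and c2 + t*d2.

    c1, c2: camera centers; d1, d2: ray directions (need not be unit length).
    Returns the midpoint of the shortest segment joining the two rays,
    i.e. the triangulated 3D point (exact if the rays intersect).
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b  # tends toward 0 as the rays become parallel
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2 for u, v in zip(p1, p2)]
```

With a short baseline, the two rays are nearly parallel (denom close to zero), which is exactly why small separation angles make triangulation fragile, a point that comes back in the parameter choices below.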

Video that shows the two main steps of the process:


Step 1:

Compute the camera positions and orientations, i.e. the extrinsic camera parameters, using Structure from Motion 10 (SfM10).

Input to SfM10 (sfm10_input.txt):

Number of images = 2
File name for image 1 (and focal length) = 1200_DSCF3097_l.JPG 4429
File name for image 2 (and focal length) = 1200_DSCF3097_r.JPG 4429
Number of trials (to determine the good matches) = 10000
Max number of iterations (Bundle Adjustment) = 1000
Min separation angle (low-confidence 3D points) = 0.0
Max reprojection error (low-confidence 3D points) = 10000.0
Pixel radius (animated gif frames) = 10
Amplitude angle (animated gif frames) = 5.0
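The "number of trials" drives a RANSAC-style loop: repeatedly hypothesize a model from a random minimal sample of matches and keep the hypothesis that the most matches agree with. For two views the model is typically a fundamental or essential matrix; the toy sketch below uses a plain 2D translation instead, just to make the role of the trial count concrete (the model and names are my own, not SfM10's).

```python
import random

def ransac_translation(matches, trials=10000, tol=2.0):
    """Toy RANSAC: fit a pure 2D translation to matched points.

    matches: list of ((x1, y1), (x2, y2)) correspondences, some of them wrong.
    Each trial hypothesizes a translation from one random match (the minimal
    sample here is a single match) and counts how many matches agree with it.
    Returns the best translation and its inlier set.
    """
    best_t, best_inliers = (0, 0), []
    for _ in range(trials):
        (x1, y1), (x2, y2) = random.choice(matches)
        tx, ty = x2 - x1, y2 - y1  # hypothesized translation
        inliers = [m for m in matches
                   if abs(m[1][0] - m[0][0] - tx) <= tol
                   and abs(m[1][1] - m[0][1] - ty) <= tol]
        if len(inliers) > len(best_inliers):
            best_t, best_inliers = (tx, ty), inliers
    return best_t, best_inliers
```

More trials raise the probability that at least one sample is outlier-free, which is why the count is set so high (10000) when the matches are noisy.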

To get the focal length (4429), I simply used the output given by Epipolar Rectification 9b (ER9b). The images are 1200 pixels wide but were taken with quite a bit of zoom, which explains the larger-than-usual focal length.
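As a sanity check on that value (assuming, as is conventional for SfM inputs, that the focal length is expressed in pixels), the focal length f, the image width W, and the horizontal field of view are tied together by f = (W/2)/tan(fov/2). Inverting it for the numbers above:

```python
import math

# Focal length in pixels vs. horizontal field of view:
#   f = (W / 2) / tan(fov / 2)   =>   fov = 2 * atan((W / 2) / f)
W, f = 1200, 4429
fov_deg = math.degrees(2 * math.atan((W / 2) / f))
print(round(fov_deg, 1))  # about 15.4 degrees
```

A horizontal field of view of roughly 15 degrees is quite narrow, consistent with a heavily zoomed-in shot.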

Step 2:

Reconstruct the 3D scene using Multi View Stereo 10 (MVS10).

Input to MVS10 (mvs10_input.txt):

Filename for nvm file = duh.nvm
Minimum number of matches (camera pair selection) = 100
Minimum average separation angle (camera pair selection) = 0.0
Radius used to smooth the cost = 32
Alpha = 0.9
Truncation value for color cost = 20.0
Truncation value for gradient cost = 10.0
Epsilon = 4
Disparity tolerance used to detect occlusions = 0
Downsampling factor = 1
Sampling step = 1
Minimum separation angle (removal of low-confidence 3D points) = 0.0
Minimum number of image points per 3D point (removal of low-confidence 3D points) = 2
Maximum reprojection error (removal of low-confidence 3D points) = 10000.0
Pixel radius (animated gif frames) = 1
Amplitude angle (animated gif frames) = 1.0
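The Alpha and the two truncation values suggest a per-pixel matching cost that blends truncated color and gradient differences, a common construction in local stereo matching. Below is a hypothetical version; the exact formula inside MVS10 isn't documented here, and the convention of weighting the gradient term by alpha is an assumption on my part.

```python
def matching_cost(c1, c2, g1, g2, alpha=0.9, tau_c=20.0, tau_g=10.0):
    """Hypothetical per-pixel dissimilarity between two views.

    c1, c2: pixel colors; g1, g2: pixel gradients.
    Truncating each term caps the influence of outliers (occlusions,
    specularities); alpha balances the color and gradient terms.
    NOTE: assumed form, not necessarily what MVS10 computes.
    """
    color = min(abs(c1 - c2), tau_c)
    grad = min(abs(g1 - g2), tau_g)
    return (1.0 - alpha) * color + alpha * grad
```

With alpha at 0.9, the gradient term dominates, which makes the cost robust to brightness differences between the two shots.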

The nvm file is of course the output of SfM10.

It should be noted that, usually, Structure from Motion and Multi-View Stereo are not too keen on cameras (here, we have two) that are too close to each other (a stereo camera has two lenses that sit very close together), because triangulating over such a short baseline may lead to inaccuracies in the 3D reconstruction. This explains why the minimum separation angle is set to 0.0 degrees instead of something larger like 1.5 degrees, for instance. The maximum reprojection error was set to infinity (10000.0) so that no match coming from the depth map gets rejected. Usually, it is set to 2.0 (up to 16.0) in order to remove any image point that does not coincide (within the allowed reprojection error) with the projection of its corresponding 3D point.
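To see why the separation angle matters with a stereo camera, here is a small helper that computes the angle subtended at a 3D point by the two camera centers. Using an illustrative baseline of 7.5 cm and a scene point 5 m in front of the cameras (numbers chosen for the example, not measured from the W3), the angle comes out well under 1 degree, which is why a 1.5-degree cutoff would throw away nearly every point:

```python
import math

def separation_angle_deg(p, c1, c2):
    """Angle at 3D point p between the viewing rays to camera centers c1, c2.

    Small angles (near-parallel rays) mean poorly conditioned triangulation.
    """
    v1 = [a - b for a, b in zip(c1, p)]
    v2 = [a - b for a, b in zip(c2, p)]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(a * a for a in v2))
    return math.degrees(math.acos(dot / (n1 * n2)))

# Illustrative numbers: 7.5 cm baseline, scene point 5 m away.
angle = separation_angle_deg([0.0, 0.0, 5.0], [0.0, 0.0, 0.0], [0.075, 0.0, 0.0])
print(round(angle, 2))  # roughly 0.86 degrees
```

Setting the minimum separation angle to 0.0 therefore keeps these narrow-baseline points, accepting some depth uncertainty in exchange for a dense cloud.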