Wednesday, January 13, 2021

Getting depth maps from single images using Artificial Intelligence (AI)

Is it possible in this day and age to get good, usable depth maps from monocular images using Artificial Intelligence (AI)?

Well, I think it's getting there today. And as the number of data sets used to train neural networks increase, results can only get better.

First, there was Google's "Mannequin Challenge" where the data set was taken from thousands of youtube videos featuring the now forgotten "mannequin challenge" where people freeze while the camera turns around them. Google must have spent a lot of time creating the data set since they had to recreate the full 3d scene for each video (to get the depth map).

Here's a video I made on how to use Google's "Mannequin Challenge" 2d to 3d conversion software on google colab:



I think the results are ok in terms of segmentation, that is, objects are properly detected. But the results are not that great in terms of depth, meaning that they are often at incorrect depths. Good effort though.

Well, that was then and this is now. My good friend waniah alerted me to the fact that there was a new kid in town called MiDaS v2.1. The originating paper is: "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer" by René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, Vladlen Koltun. You can find the code on github: MiDaS on github and it's also available on google colab ready to run: MiDaS on google colab.

Here's a video where I put MiDaS to the test on google colab:



In my opinion, the results are very impressive, much better than google's "Mannequin Challenge". Probably because the data set used to train their neural network was much larger.

Let's dig in a little bit more into that MiDaS v2.1. The paper is available here: "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer" by Ranftl et al. So, they are saying that they have used 3d movies to train their model. Cool but how did they get the depth maps for those 3d movie dual screenshots? The answer is: "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume" by Sun et al. Now, I have been a little bit out of the loop regarding the latest trends in automatic depth map generation from a stereo pair but this is quite interesting. The main idea behind this particular depth map generation technique is optical flow, which is exactly what is used in the first automatic depth map generator I wrote, the original DMAG (Depth Map Automatic Generator). How cool is that! Of course, DMAG doesn't have any CNN (Convolutional Neural Network) involved but that is interesting nonetheless. Back in the day, I (and other early adopters of my software) quickly dismissed DMAG as a viable depth map generator because it could not handle small and/or fast moving objects. The good thing about DMAG though is that you don't need to input the disparity range and the produced depth maps are very smooth. I think I should probably spend some time figuring out how the CNN plays a role in that new implementation of optical flow but that's gonna be for another post, I think.

Now, I came across this idea of coupling optical flow and CNNs somewhere else. Here, on this website: KeystoneDepth. What they did is take thousands of old stereocards and compute their depth maps. Now, guess what was used to compute the depth maps? Yes, an optical flow method coupled with a CNN. The paper that explains the process is here: KeystoneDepth : History in 3D. We learn that they use FlowNet2 to compute the depth maps. FlowNet2 is described in this paper: "Flownet 2.0: Evolution of optical flow estimation with deep networks" by Ilg et al. It's cool to see that automatic depth map generation using optical flow is not dead, actually far from it. Makes me wanna go back into the original DMAG, the one that started it all. We'll see.

To save the obtained depth map when you run MiDaS v2.1, you need to put the following:

fig = plt.imshow(output)
plt.axis('off')
fig.axes.get_xaxis().set_visible(False)
fig.axes.get_yaxis().set_visible(False)
plt.savefig('depthmap.png', bbox_inches='tight', pad_inches = 0)

right after:

plt.imshow(output)
# plt.show()

Use Ctrl-C Ctrl-V to copy paste inside the cell. When you click on the "Files" icon on the left, the saved depthmap should be there. If it's not there, click on the "Refresh" tab just above. To download, simply click on the dots to the right of the name. Be aware that the depth map that MiDaS generates is small (but at the right aspect ratio). This is because MiDaS downsamples the input image to match the sizes of the images in the training sets. So, you will need to resize it yourself so that it matches the size of the input image. You can do that in Gimp or Photoshop quite easily. To make a "Facebook 3d" post, assuming your photo is called "photo.jpg", you need to rename the depth map as "photo_depth.jpg" but you probably know that. You can feed the depth map to Facebook as is, that is, as a rgb heat map. If you prefer grayscale depth maps, you can use gimp or photoshop to switch mode from rgb to grayscale with no ill effect (I think).