This paper, "3D Photography using Context-aware Layered Depth Inpainting" by Meng-Li Shih et al., promises that inpainting can be done realistically with AI. There's a Google Colab for it, which means we can check it out right there in the browser thanks to Google, without installing anything and without the need for a GPU card. In the Google Colab implementation, they use MiDaS to get a depth map from a given reference image and then do extreme inpainting using AI. The output of 3D photo inpainting is the MiDaS depth map, a point cloud of the 3D scene, and four videos that kinda show off the inpainting (two of the zoom type, à la Ken Burns, and two of the wiggle/wobble type). To visualize the point cloud, which is in the PLY format, you can use MeshLab or CloudCompare (preferred). Note that the depth map doesn't have to come from MiDaS; you can certainly use your own depth map (although you may have to blur it).
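If you'd rather stay in Python than fire up MeshLab or CloudCompare, you can also peek at the .ply with Open3D. This is just my own sketch, not part of the Colab; the path and filename below are placeholders (the notebook drops the .ply in a mesh folder, if memory serves), and the viewer window needs to be run locally, not inside Colab.

```python
import open3d as o3d  # pip install open3d -- my own pick, not something the Colab installs

# Placeholder path: the notebook saves the scene as a .ply (in a "mesh" folder, if I recall).
ply_path = "mesh/moon.ply"

geom = o3d.io.read_triangle_mesh(ply_path)
if not geom.has_triangles():
    # Fall back to reading it as a plain point cloud if the file holds no faces.
    geom = o3d.io.read_point_cloud(ply_path)

o3d.visualization.draw_geometries([geom])  # opens an interactive 3D viewer
```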
Here's a video that explains how to run the Google Colab Python notebook. First, I let the software use MiDaS to create the depth map. Then, I bypass MiDaS and use my own depth map, which I created with SPM:
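For the bypass step, here's a minimal sketch of how I drop my own depth map next to the photo before running the notebook's main script. The folder names (image/ and depth/) and the rule that the two files must share a basename are how I remember the repo being laid out; double-check them against the project's README, and the filenames below are of course just placeholders.

```python
import shutil
from pathlib import Path

repo = Path("3d-photo-inpainting")          # assumption: the cloned repo folder in the Colab session
my_image = Path("moon.jpg")                 # placeholder photo
my_depth = Path("moon_depth_from_spm.png")  # placeholder depth map exported from SPM

# The photo goes in the image folder...
shutil.copy(my_image, repo / "image" / my_image.name)
# ...and the depth map goes in the depth folder with the SAME basename, so the
# script can pair it with the photo. The config file then needs to be told to
# skip MiDaS, which is what I walk through in the video.
shutil.copy(my_depth, repo / "depth" / (my_image.stem + my_depth.suffix))
```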
If you use your own depth map, make sure it is grayscale and smooth enough. If your depth map is not smooth, processing is going to take forever and Google Colab might disconnect you before the videos are created. I explain all of that in the video.
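To cover both requirements in one go, here's the kind of preprocessing I'd do with OpenCV (my choice of tool; the filenames and the blur kernel size are just placeholders to experiment with):

```python
import cv2

# Load the depth map as single-channel grayscale (placeholder filename).
depth = cv2.imread("moon_depth_from_spm.png", cv2.IMREAD_GRAYSCALE)

# A generous Gaussian blur smooths out harsh steps in the depth values; a rough
# depth map creates tons of depth edges to inpaint, which is what makes the
# Colab run forever. Tune the kernel size to taste.
smooth = cv2.GaussianBlur(depth, (15, 15), 0)

cv2.imwrite("moon_depth_smooth.png", smooth)
```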
We all know that MiDaS can create great depth maps from single images; check this post if you are not yet convinced: Getting depth maps from single images using Artificial Intelligence (AI). It's the inpainting we were not too sure about... until now. I've gotta say that the filling of occlusions looks quite realistic, even when the point of view changes drastically. That AI is really doing wonders, and it will only get better as the datasets used to train the neural networks grow.