How to fuse rendered depth maps to produce a 3D shape in Blender

I recently published a paper on 3D shape generation at a computer vision conference (CVPR). My co-author wrote the code (in C++, using OpenCV) for fusing the depth maps and producing the final 3D shapes from the multi-view outputs. The inputs to his code are 20 depth maps, the ground-truth camera angles (posted below), and the distance from the center of the shapes (0, 0, 0) to the camera (a constant 1.5).

Unfortunately, my co-author is not available to help me with this, and I cannot use OpenCV and C++ for the new project I'm starting to work on, so any help would be appreciated. My goal is to write Python code using Blender, instead of my co-author's C++ code, to do the same thing. Before I move on, I wonder: does Blender have built-in functions that can generate the final 3D shape given rendered depth maps of that shape, the camera angles, and the distance to the camera? If not, can anyone give me some ideas on how to do that, along with some code samples? I have uploaded a set of rendered depth maps of a headphone's 3D shape that you can use for backward projection (reconstructing the 3D shape). If you prefer to start from a 3D shape directly, I have also uploaded a 3D shape of another headphone we used in our earlier work; you can render it using the camera angles below.

FYI, here is my co-author's high-level description of how his approach works:

In the final step, all depth maps are projected back to the 3D space to create the final rendering. We reconstruct 3D shapes from multi-view silhouettes and depth maps by first generating a 3D point cloud from each depth image with its corresponding camera setting (x, y, z coordinates). The union of these point clouds from all views can be seen as an estimation of the shape.
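To make the description above concrete, here is a minimal NumPy sketch of the back-projection step for a single view: place a pinhole camera at distance 1.5 along one of the direction vectors below, looking at the origin, and unproject every foreground depth pixel into world space. The up vector, the pinhole focal length, and the assumption that depth is stored as z-depth in world units (with zeros for background) are all my guesses, not something confirmed by the original renderer, so they may need adjusting to match your depth maps.

```python
import numpy as np

def look_at(cam_dir, dist=1.5):
    """Camera-to-world rotation and position for a camera placed at
    dist * cam_dir, looking at the origin. cam_dir is one row of the
    camera-angle list. The up-vector choice is an assumption."""
    eye = dist * np.asarray(cam_dir, dtype=float)
    forward = -eye / np.linalg.norm(eye)           # camera looks at the origin
    up = np.array([0.0, 0.0, 1.0])
    if abs(np.dot(forward, up)) > 0.999:           # avoid a degenerate up vector
        up = np.array([0.0, 1.0, 0.0])
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    # Columns are the camera's x, y, z axes in world space
    # (camera looks down its own -z axis).
    R = np.stack([right, true_up, -forward], axis=1)
    return R, eye

def backproject(depth, cam_dir, dist=1.5, focal_px=None):
    """Unproject one depth map (H x W, z-depth in world units,
    zeros = background) into a world-space point cloud (N x 3).
    focal_px is an assumed pinhole focal length in pixels."""
    h, w = depth.shape
    if focal_px is None:
        focal_px = float(max(h, w))                # assumption: ~53 degree FOV
    R, eye = look_at(cam_dir, dist)
    v, u = np.nonzero(depth)                       # foreground pixel coordinates
    z = depth[v, u]
    # Pixel -> camera-space ray through the principal point at the image center.
    x = (u - w / 2.0) / focal_px
    y = -(v - h / 2.0) / focal_px
    rays = np.stack([x, y, -np.ones_like(x)], axis=1)
    pts_cam = rays * z[:, None]                    # scale rays by z-depth
    return eye + pts_cam @ R.T                     # rotate/translate to world space
```

As a sanity check, a pixel at the image center with depth exactly 1.5 should land at (approximately) the origin, since that is the distance from every camera to the shape center.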

For your reference:
I posted the same question on blender.stackexchange.com, and someone recommended an approach for the backward projection that seems promising, but it is not very flexible and requires a lot of hand-tuning, which makes the final 3D shape fairly noisy.

And here are the camera angles we used for doing the rendering in the first place:


-0.57735  -0.57735  0.57735
0.934172  0.356822  0
0.934172  -0.356822  0
-0.934172  0.356822  0
-0.934172  -0.356822  0
0  0.934172  0.356822
0  0.934172  -0.356822
0.356822  0  -0.934172
-0.356822  0  -0.934172
0  -0.934172  -0.356822
0  -0.934172  0.356822
0.356822  0  0.934172
-0.356822  0  0.934172
0.57735  0.57735  -0.57735
0.57735  0.57735  0.57735
-0.57735  0.57735  -0.57735
-0.57735  0.57735  0.57735
0.57735  -0.57735  -0.57735
0.57735  -0.57735  0.57735
-0.57735  -0.57735  -0.57735