Including the words "anyone" and "fast" throws the custom hardware option out of the equation pretty quickly. A laser scanner (Figure 1) can get the job done, but scanning takes a lot of time and the hardware is expensive. Solving the challenge on the software side was the answer, as software is much easier to distribute. But getting geometry information from a plain video requires some magic, right?
Figure 1. Point cloud of a room obtained by a laser scanner.
To understand how to get from video to a point cloud, we need to learn about detecting geometry. An ordinary photograph is a projection of the scene in front of the camera. The projection flattens the three-dimensional (3D) world into a two-dimensional (2D) image, losing the distance information between the camera and the surfaces of the scene (Figure 2). Restoring that information without any additional hardware has been, and still is, one of the key challenges in the field of computer vision and image-based modeling.
Figure 2. Camera obscura, pinhole projection of a person.
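To make the loss of distance information concrete, here is a minimal sketch of an ideal pinhole projection. The function name `project` and the `focal_length` parameter are our own illustrative choices, and the model ignores lens distortion and pixel coordinates. Two points on the same viewing ray, one twice as far away as the other, land on exactly the same image position.

```python
def project(point, focal_length=1.0):
    """Project a 3D point (x, y, z) in camera coordinates onto the
    image plane of an ideal pinhole camera (hypothetical helper)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Dividing by z is the step that discards the distance information.
    return (focal_length * x / z, focal_length * y / z)


near = (1.0, 0.5, 2.0)
far = (2.0, 1.0, 4.0)  # same viewing ray, twice as far from the camera

print(project(near))  # (0.5, 0.25)
print(project(far))   # (0.5, 0.25)
```

Since both points produce the same image coordinates, a single image alone cannot tell them apart without some additional cue.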
To restore the depth, conventional approaches try to mimic the human visual system. The eyes of such a stereo system (Figure 3) are two images representing the same scene from slightly different angles. To measure the depth of a certain point (P) in one image, one has to find the corresponding point in the other image, and that is the major challenge in the conventional methods.
Figure 3. Stereo Camera Model.
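Once a correspondence is found, the depth itself is simple to compute. The sketch below assumes an idealized, rectified stereo pair (two identical cameras side by side, looking in the same direction), where depth equals focal length times baseline divided by disparity. The function name and the example numbers are hypothetical and only serve as an illustration.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Depth of a point seen at column x_left in the left image and
    x_right in the right image of a rectified stereo pair (assumption)."""
    disparity = x_left - x_right  # horizontal shift in pixels
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity


# Assumed setup: 800 px focal length, cameras 12.5 cm apart.
far_point = depth_from_disparity(420, 400, focal_px=800, baseline_m=0.125)
near_point = depth_from_disparity(440, 400, focal_px=800, baseline_m=0.125)

print(far_point)   # 5.0 (meters)
print(near_point)  # 2.5 (meters)
```

The closer the point, the larger the shift between the two images. The formula is the easy part; finding `x_right` for a given `x_left` is where the conventional methods struggle.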
Structure from motion
The technique is called structure from motion (SfM). It relies on feature detectors to find correspondences between images and, from there, calculates the camera trajectories and constructs the 3D geometry (Figure 4). These reconstructions from images work under optimal conditions (rich textures, well-lit scenes, sharp images), but uniformly colored surfaces cause huge issues, since feature points are not found in the first place. And if you think about indoor spaces, white walls are pretty common.
Figure 4. Indoor space geometry created from video with structure from motion (Top projection).
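Why white walls are such a problem can be shown with a toy example. The sketch below is not a real feature detector or an SfM pipeline; it is a simplified stand-in that compares small patches along a single image row using the sum of squared differences. All names and pixel values are made up for illustration.

```python
def best_matches(left_row, right_row, x, half_window=1):
    """Return every position in right_row whose patch matches the patch
    centered at x in left_row with the lowest sum of squared differences."""
    patch = left_row[x - half_window : x + half_window + 1]
    scores = {}
    for cx in range(half_window, len(right_row) - half_window):
        candidate = right_row[cx - half_window : cx + half_window + 1]
        scores[cx] = sum((a - b) ** 2 for a, b in zip(patch, candidate))
    best = min(scores.values())
    return [cx for cx, score in scores.items() if score == best]


# A textured surface: the right row is the left row shifted by one pixel.
textured_left = [10, 50, 90, 30, 70, 20, 60, 40]
textured_right = [50, 90, 30, 70, 20, 60, 40, 80]

# A white wall: every pixel looks the same in both images.
wall_left = [200] * 8
wall_right = [200] * 8

print(best_matches(textured_left, textured_right, x=3))  # [2]
print(best_matches(wall_left, wall_right, x=3))          # [1, 2, 3, 4, 5, 6]
```

On the textured row there is exactly one best match, so the correspondence is unambiguous. On the uniform row every candidate fits equally well, so no reliable correspondence exists.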
The recent advancements in artificial intelligence (AI), especially in the field of machine learning (ML), made us think: what if the information a laser scanner obtains from an indoor space could be obtained directly and purely from an image of that space? This would also give us a more robust and faster architecture for geometry reconstruction from video compared to the conventional methods. Figure 5 illustrates floor detection from a single image using AI.
Figure 5. Floor Detection from an image with AI.
Constructing 3D models
Detecting depth from a single photograph reconstructs only a small part of a complete 3D model. To reconstruct a whole apartment, we need hundreds or even several thousand images, depending on the size of the target location. Fusing this data, we can build the final model, which looks like the example presented in Figure 6. So that is how the magic process from video to point cloud happens.
Figure 6. Top projection of AI-based geometry reconstruction.
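The fusion step can be sketched roughly as follows. This is a simplified illustration and not the actual pipeline: it assumes that the depth of each pixel and the pose of each camera (position plus a rotation around the vertical axis) are already known, and it merges points that fall into the same small voxel. All function names, parameters, and numbers are hypothetical.

```python
import math


def backproject(u, v, depth, focal_px, cx, cy):
    """Turn a pixel (u, v) with a known depth into a 3D point in camera coordinates."""
    return ((u - cx) * depth / focal_px, (v - cy) * depth / focal_px, depth)


def to_world(point, yaw_rad, position):
    """Rotate a camera-space point around the vertical (y) axis and move it
    to the camera position (simplified pose model, an assumption)."""
    x, y, z = point
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    px, py, pz = position
    return (c * x + s * z + px, y + py, -s * x + c * z + pz)


def fuse(frames, focal_px, cx, cy, voxel_m=0.05):
    """Merge the points of all frames, keeping one point per voxel."""
    cloud = {}
    for frame in frames:
        for u, v, depth in frame["pixels"]:
            point = to_world(
                backproject(u, v, depth, focal_px, cx, cy),
                frame["yaw"],
                frame["position"],
            )
            key = tuple(round(coord / voxel_m) for coord in point)
            cloud.setdefault(key, point)  # the first observation wins
    return list(cloud.values())


# Two frames looking at the same wall; the second camera is one meter closer.
frames = [
    {"yaw": 0.0, "position": (0.0, 0.0, 0.0), "pixels": [(320, 240, 3.0)]},
    {"yaw": 0.0, "position": (0.0, 0.0, 1.0), "pixels": [(320, 240, 2.0), (420, 240, 2.0)]},
]
cloud = fuse(frames, focal_px=500, cx=320, cy=240)
print(len(cloud))  # 2
```

Three depth measurements go in, but only two points come out, because both frames observed the same spot on the wall. With thousands of frames, this kind of merging keeps the final model compact.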
Article written by Markus Ylimäki & Markus Häikiö from CubiCasa