Aerial Reconstructions via Probabilistic Data Fusion
by
Randi Cabezas
B.S., Electrical Engineering
Florida International University, 2010
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science
in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
September 2013
c© 2013 Massachusetts Institute of Technology
All Rights Reserved.
Signature of Author:
Department of Electrical Engineering and Computer Science
August 26, 2013
Certified by:
John W. Fisher III
Senior Research Scientist
Thesis Supervisor
Accepted by:
Professor Leslie A. Kolodziejski
Chair, Department Committee on Graduate Students
Aerial Reconstructions via Probabilistic Data Fusion
by Randi Cabezas
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science
Abstract
In this thesis we propose a probabilistic model that incorporates multi-modal noisy
measurements: aerial images and Light Detection and Ranging (LiDAR) to recover
scene geometry and appearance in order to build a 3D photo-realistic model of a given
scene. In urban environments, these reconstructions have many applications, such as
surveillance, and urban planning. The proposed probabilistic model can be viewed as a
data fusion model, in which the two data sources complement each other and allow for
better results than when only a single one is present. Moreover, this modeling approach
has the advantages that it can capture uncertainty in reconstructions, and the ability to
incorporate additional scene measurements easily when the sensor models are available.
Furthermore, the results obtained with the proposed method are qualitatively com-
parable to those obtained with traditional structure from motion, despite differences in
modeling approach and reconstruction goals. The appearance and geometry trade-off
present in the model between the different data sources can be used to obtain a similar
(and sometime superior) reconstruction of complex urban scenes with fewer image ob-
servations over traditional reconstruction methods. Extending beyond reconstructions,
the proposed model has two alluring features: first we are able to determine absolute
scale and orientation, and secondly, we are able to detect moving objects.
From an implementation standpoint, this thesis has shown how to leverage the
power of graphic processing units (GPUs) and parallel programming to allow fast infer-
ence. Achieving real time rendering of scenes with hundreds of thousands of geometric
primitives and inferring latent appearance, camera pose and geometry in the order of
seconds each.
Thesis Supervisor: John W. Fisher III
Title: Senior Research Scientist
i
ii
Acknowledgments
I would like to start by thanking John Fisher for providing me with a stimulating
research project. More than that, for his patience, encouragements and overall ability
to run with whatever topic I happen to walk in with any given week. Without his
support this thesis would not have been possible.
I would also like to thank my fellow students in vision and SLI; thanks for all the
invaluable help. In particular, my office mates Sue Zheng, Archana Venkataraman, and
Jason Chang have always been there to answer any type of questions from the silly and
bizarre to the challenging and unexplainable (sorry for that). Sue your patience and
ability to distill a problem to its root amazes me. You have taught me to look beyond
the obvious and focus on the more important things. Archana your deep understanding
of probability and inference inspire me to be better. Jason your sampling and coding
skill astound me; thanks to you my coding skills have improved vastly. Many thanks
to Duke for the distractions and fun while at the office.
Outside lab, I would like to thank the inhabitants of the Pleasant House: Anna
Lessios, Guillaume Lestoquoy, Sam Nicaise, Andy Sweet, for a great atmosphere and
plenty of food-centered good times. Anna, your deserts are amazing! Please keep on
indulging us with your wonderful creations. Gui, your passion and skills are inspiring.
To the vegetarian zombies, I have not giving up in converting you. Sam, you are a lean
and mean machine, I don’t know how you find time for everything in your plate. Andy,
one day I’ll match your style and bold color scheme, but until then, shine on! To the
rest of my friends, both near and afar, thank you for your support and the memories
we have shared.
I’ve had the pleasure of being part of an extraordinary group of people in the
Graduate Student Council. I would like to thank all the other excomm members who
helped me play a small role in the shaping of the present and future of this great place
we all love dearly.
Most importantly, I thank my parents and sister for their love and encouragements.
iii
iv
Contents
Abstract i
Acknowledgments iii
Table of Contents v
List of Figures ix
List of Algorithms xv
List of Tables xvii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Homogeneous Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Image Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Two-View Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Multi-View Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Structure from Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Piecewise Planar reconstructions . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Light Detection and Ranging . . . . . . . . . . . . . . . . . . . . . . . . 14
2.9 Gaussian Process Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.9.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.9.3 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.10 OpenGL and CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
v
vi CONTENTS
3 Model 21
3.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Observation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 LiDAR observation model . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Image observation model . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 GPS observation model . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Prior Probability Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Appearance Model . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Geometry Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.3 Extrinsic Parameter . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.4 Intrinsic Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.1 Appearance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.2 Camera Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.3 Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Implementation 33
4.1 Appearance Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Image Likelihood Computation . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 LiDAR Likelihood Computation . . . . . . . . . . . . . . . . . . . . . . 40
4.4 Geometry Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4.1 Image Likelihood evaluation under multiple primitives . . . . . . 43
4.5 Texture Atlas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5.1 Static Texture Coordinates . . . . . . . . . . . . . . . . . . . . . 47
4.5.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.3 Speed up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5 Results 55
5.1 Experiment Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.1 Appearance Estimation . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.2 Geometry Estimation . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.3 Camera Parameter Estimation . . . . . . . . . . . . . . . . . . . 76
5.2.4 Appearance and Geometry trade-off . . . . . . . . . . . . . . . . 81
5.2.5 Geometry Initialization . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Structure from Motion Comparison . . . . . . . . . . . . . . . . . . . . . 89
5.3.1 Reconstruction Comparison . . . . . . . . . . . . . . . . . . . . . 92
5.3.2 Computation Time . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4 Additional Reconstructions . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5 Beyond Reconstructions . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5.1 Absolute Scale and Orientation . . . . . . . . . . . . . . . . . . . 103
5.5.2 Identifying Scene Movers . . . . . . . . . . . . . . . . . . . . . . 106
CONTENTS vii
6 Conclusion 109
6.1 Possible Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Appendix A Data Overview 113
A.1 Toy Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.2 Lubbock Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.3 CLIF 2007 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
A.3.1 CLIF Stadium Image Stack . . . . . . . . . . . . . . . . . . . . . 116
A.3.2 CLIF Stadium Only . . . . . . . . . . . . . . . . . . . . . . . . . 119
A.3.3 CLIF Intersection . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Appendix B Appearance Computation 121
B.1 Observation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
B.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
B.3 Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
B.3.1 Multiple Observations - different noise . . . . . . . . . . . . . . . 123
B.4 Special Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Appendix C Gaussian Marginalization 127
C.1 General Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Appendix D OpenGL Implementation Details 129
D.1 OpenGL projective texture transformation . . . . . . . . . . . . . . . . . 129
D.1.1 Projective Transformation . . . . . . . . . . . . . . . . . . . . . . 129
D.1.2 Projective Texture Mapping . . . . . . . . . . . . . . . . . . . . . 129
D.1.3 Pixels to Textures . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Bibliography 133
viii CONTENTS
List of Figures
2.1 Pinhole Camera Diagram. The image is formed by keeping all rays (a
few denoted by dotted lines) that intersect plane I and pass through
camera center Oc (dashed line denotes principal axis, point p denotes
the principal point, where the principal axis intersects the image plane). 7
2.2 Two View geometry with cameras centered at O1 and O2, epipoles e1
and e2 and epipolar plane shown in green. . . . . . . . . . . . . . . . . . 9
2.3 3D world point X has projection u˜ in image I and u˜′ in image I ′. The
diagram shows point the location of X as a search of pixel correspondence
u˜′ over the epipolar line. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Typical Structure from Motion pipeline . . . . . . . . . . . . . . . . . . 13
2.5 Bundler pipeline [50] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Bundler’s Bundle Adjustment Details [50] . . . . . . . . . . . . . . . . . 14
2.7 Typical piecewise planer reconstruction pipeline . . . . . . . . . . . . . . 14
2.8 Typical LiDAR scan pattern, with common parameters . . . . . . . . . 15
2.9 LiDAR representation of MIT Dome and Killian Court (color represent
height above ground). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 Random Samples from a Gaussian process with different hyperparame-
ters. Left column has a large lengthscale, producing very smooth sam-
ples; Right column has smaller lengthscale resulting in rapidly varying
samples. Top row and bottom row have different signal variances, hence
their amplitude ranges vary according. . . . . . . . . . . . . . . . . . . 17
3.1 Graphical Model Representation of the proposed model. . . . . . . . . . 21
3.2 LiDAR Observation Models, depicting data association using (a) scan
line direction and, (b) and (c) closest plane. . . . . . . . . . . . . . . . . 24
3.3 Notional representation of the world model. Top row : Textured rep-
resentation and triangles (uniquely color-coded and slightly separated),
Middle and Bottom rows: 3D triangles and texture representation per
triangle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Notional representation of the world model including the background
(aqua colored) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
ix
x LIST OF FIGURES
4.1 Appearance computation procedure. First determine which primitive
generated the observation using a triangle mapping, secondly using the
texture mapping determine which pixel generated the observation. . . . 34
4.2 LiDAR and world primitive association. (Color coding: planes in ques-
tion color-coded, points associated with planes color-coded according
to distance, other planes shown in gray, other points shown in orange-
magenta) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Relative image likelihood computation. Suppose changing a primitive’s
parameter produces the images shown in in blue and red (with corre-
sponding mask) in (a) and (b). (c) shows how to combine the renders to
produce the desired likelihood computation. . . . . . . . . . . . . . . . . 45
4.4 Sample Atlas, where the numbers indicate which triangle the texture
belongs to, the coordinates are expressed in terms of atlas UV. . . . . . 51
4.5 Sample Atlas drawing direction (1,2,3) for both the kth and kth+1 triangles. 51
5.1 Appearance as a function of viewing angles. Top row: Texture for a given
plane obtained from the rendered image shown in the bottom row.. As
viewing angle increases, texture quality rapidly degrades. This artifact
is a result of the reverse projection scheme, c.f. §4.1. . . . . . . . . . . 57
5.2 Weighted Appearance for a given triangle (the brighter the image the
higher the weight). Bottom-right tile is sum of the weighted sources.
Sources of Fig. 5.1 seen here in the middle of the top row, left in the
center row, and left and center in the bottom row. . . . . . . . . . . . . 58
5.3 Un-weighted Appearance for a given triangle (ordered in increasing angle
between primitive normal and source image viewing direction). Notice
how the quality of the texture degrades with increasing viewing angle. . 59
5.4 Weighted Appearance for a given triangle (the brighter the image the
higher the weight). Bottom-right tile is sum of the weighted sources.
Source images can be seen in Figure 5.3 . . . . . . . . . . . . . . . . . . 59
5.5 Weighted Appearance for a given triangle. (images ordered in increasing
viewing direction, the brighter a pixel, the higher the weight). Bottom
right image is the weighted sum of the sources. . . . . . . . . . . . . . . 60
5.6 Texture generated by forward projection method (source images same as
Figure 5.1). Forward projection associates an observation pixel with a
single texture pixel, and possibly its neighbors. As a result interpolation
is no longer needed, the black pixels in the lower triangular region of the
image represent missing data. . . . . . . . . . . . . . . . . . . . . . . . . 61
5.7 Learned appearances using a constant noise level and forward projection.
Reverse projection results can be seen on the last image of Figures 5.2,
5.4, and 5.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.8 Toy problem texture atlas, appearance size 256× 256, atlas size 1024×
1024 (1 of 1). Notice the effect of fixed texture coordinate. . . . . . . . . 62
LIST OF FIGURES xi
5.9 Lubbock texture atlas, appearance size 8× 8, atlas size 512× 512. (1 of
10) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.10 Individual and Joint optimization for a given plane (left and right col-
umn respectively), divided into LiDAR (top row) and Image likelihood
(bottom row). The scale factor changes on the multi-plane image likeli-
hood, this is due to ignoring all primitives not being optimized. Plane
35147, CLIF Image Stack . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.11 Individual and Joint Likelihoods side-by-side for planes of Fig. 5.10.
Multi-plane likelihood scaled with constant offset of pixels not associated
with current planes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.12 Computation time comparison for single and multi-plane optimizations
for the CLIF Intersection scene. Top plot: Cost for each method as a
function of number of planes being optimized. Middle plot: per iteration
additional cost as a function additional planes added (base is 1). Bottom
plot: speed up factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.13 Scene overview, regions that will be examined in this section are shown
in different colors. Subsequent region details are color-matched. . . . . . 68
5.14 Planar Reconstruction Example - Stadium Parking Lot . . . . . . . . . 68
5.15 Planar Reconstruction Example - River . . . . . . . . . . . . . . . . . . 69
5.16 Planar Reconstruction Example - Stadium Field . . . . . . . . . . . . . 69
5.17 Planar Reconstruction Example - Roof (two distinct heights). Heights
level smoothed due to incorrect LiDAR association. . . . . . . . . . . . . 70
5.18 Building Reconstruction Example - Tower 1 (northern-most) . . . . . . 71
5.19 Building Reconstruction Example - Part of Stadium . . . . . . . . . . . 72
5.20 Building Reconstruction Example - Slanted Building . . . . . . . . . . . 73
5.21 Building Reconstruction Example - Building Intersection . . . . . . . . . 74
5.22 Trees Example 1 (near stadium parking lot). Reconstructed trees as
viewed from different orientations. . . . . . . . . . . . . . . . . . . . . . 75
5.23 Trees Example 2 (near river). Reconstructed trees as viewed from differ-
ent orientations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.24 Failure Case Example - Tower 2 (southern-most). Incorrect LiDAR as-
sociation causes primitive to extend to explain measurement. . . . . . . 77
5.25 Likelihood Comparison with and without current image contribution
(green and cyan respectively). Intersection image 0. Both methods have
same local optima. Basin of attraction varies for each parameter. . . . . 78
5.26 Likelihood Comparison with and without current image contribution
(green and cyan respectively, ground truth denoted by red vertical line,
parameters were translated so that the ground truth appears at zero).
Lubbock image 2. Both methods have similar local optima. Capture
range varies for each parameter. . . . . . . . . . . . . . . . . . . . . . . . 79
xii LIST OF FIGURES
5.27 Parameter capture range after optimizing all images: once, five, ten, ...,
and thirty times. Notice the basin of attraction remains constant across
batches. (CLIF Stadium Image 0) . . . . . . . . . . . . . . . . . . . . . 81
5.28 Parameter Change for Uniform and GP prior as a function of number of
optimization runs. Approximately the same local optima is found after
about 30 batches independent of which prior probability model is used.
(CLIF Stadium) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.29 Top row: Initial Parameter configuration for appearance reconstructions
using all 49 images. Other rows: Reconstruction as a function of num-
ber of input images (fixed world geometry and LiDAR, same number of
optimizations run for all). . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.30 Reconstruction as a function of number of input images, two views. (a)
original noise level; (b), noise level reduced by a factor of two. (fixed
world geometry and LiDAR, same number of optimizations run for all) . 85
5.31 LiDAR density, 700k, 500k, 200k, 100k, 50k respectively. . . . . . . . . . 86
5.32 Reconstruction as a function of LiDAR measurements (fixed number of
images) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.33 Unique planes per LiDAR measurement, each plane centered at its cor-
responding measurement. . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.34 One plane per LiDAR measurement initialization (planes with no texture
appear black, background is white). . . . . . . . . . . . . . . . . . . . . 90
5.35 LiDAR initializing after updating all planes once (planes with no texture
appear black, background is white). . . . . . . . . . . . . . . . . . . . . 91
5.36 Bundler reconstruction of CLIF Intersection - Ortho view (1,715 points) 92
5.37 PMVS reconstruction of CLIF Intersection, 3 views. (12,208 points) . . 94
5.38 Reconstruction of CLIF Intersection using proposed algorithm, 3 views.
(3,490 visible planes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.39 PMVS reconstruction of CLIF Image Stack, 3 views. (77,552 points) . 96
5.40 Reconstruction of CLIF Image Stack using proposed algorithm, 3 views.
(81,573 visible planes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.41 Bundler reconstruction of CLIF Stadium Only. Oblique view, note that
the reconstruction is inverted (25,134 points) . . . . . . . . . . . . . . . 98
5.42 PMVS reconstruction of CLIF Stadium Only, 3 views. Note that the
reconstruction is inverted (156,290 points) . . . . . . . . . . . . . . . . 99
5.43 Reconstruction of CLIF Stadium using proposed algorithm, 3 views.
(12,912 visible planes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.44 Stadium Only Reconstruction, left: reconstruction, right: original image,
top to bottom cameras: 0,16,32,48. . . . . . . . . . . . . . . . . . . . . . 102
5.45 Intersection Reconstruction, left: reconstruction, right: original image,
top to bottom cameras: 0,14,29,44. . . . . . . . . . . . . . . . . . . . . . 104
5.46 Two distance annotation examples of OSU Stadium. . . . . . . . . . . . 105
LIST OF FIGURES xiii
5.47 Low likelihood regions colored in red. Regions correspond to moving
objects and sharp edges (sequence left to right CLIF Intersection images
14-28) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A.1 Overview of Toy dataset, 15 images (outline location shown in red). . . 113
A.2 Four Images of the toy dataset (left to right noiseless images 0,4,8,12) . 114
A.3 Overview of Lubbock dataset, 3 images (outline location shown in red). 115
A.4 The three images of the Lubbock dataset . . . . . . . . . . . . . . . . . 115
A.5 Overview of Lubbock LiDAR, (color coded height above ground). . . . . 116
A.6 Overview of CLIF Cameras, frame 0, six cameras. Stadium Image stack,
top center; intersection seen in the bottom center image. . . . . . . . . . 117
A.7 Overview of CLIF LiDAR, (color coded height above ground). . . . . . . 117
A.8 Overview of CLIF Image Stack dataset, 50 images (outline location
shown in red). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
A.9 Four Images of the CLIF Image Stack dataset (left to right images
0,16,32,48) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
A.10 Four Images of the CLIF Stadium Only dataset (left to right images
0,16,32,48) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
A.11 Four Images of the CLIF Intersection dataset (left to right images 0,14,29,44)120
A.12 Overview of CLIF Intersection LiDAR, (color coded height above ground).120
B.1 Our Graphical Model Representation. . . . . . . . . . . . . . . . . . . . 121
xiv LIST OF FIGURES
List of Algorithms
4.1 Appearance Estimation pseudo-code - CPU . . . . . . . . . . . . . . . . 36
4.2 Appearance Estimation pseudo-code - GPU . . . . . . . . . . . . . . . . 37
4.3 Image Likelihood Evaluation pseudo-code - CPU/GPU . . . . . . . . . . 39
4.4 Multi-Triangle Image Likelihood evaluation - Initialization - CPU/GPU 46
4.5 Multi-Triangle Image Likelihood evaluation - Subsequent computations
- CPU/GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.6 Texture Atlas Initialization . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.7 Texture update for the kth primitive . . . . . . . . . . . . . . . . . . . . 52
4.8 Draw routine using texture atlases . . . . . . . . . . . . . . . . . . . . . 52
5.1 Geometry Primitive Initialization . . . . . . . . . . . . . . . . . . . . . . 87
xv
xvi LIST OF ALGORITHMS
List of Tables
3.1 List of Variables used in Equation (3.1) and Figure 3.1 . . . . . . . . . 23
4.1 CPU/GPU speedup (times are the average of 5 estimations), all test done
on an NVIDIA GTX-580 graphics card . . . . . . . . . . . . . . . . . . . 37
4.2 CPU/GPU Image Likelihood evaluation speed-up (times are the average
of 5 estimations), all test done on an NVIDIA GTX-580 graphics card . 40
4.3 Render time for different texture methods on 80k triangles. (tested: on
NVIDIA GTX-580) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1 Bundler and PMVS time breakdown (all times in seconds) . . . . . . . . 101
5.2 Our time breakdown. All times in seconds, iterations in camera pose
refers to updating each camera once; iterations in geometry refers to
updating 250 planes once (each update runs until specified tolerance
is met or specified number of likelihood evaluations reached, whichever
first). Total time for camera pose is obtained by T = Nimages∗Niter∗Titer,
time for geometry is obtained by T = dNplanes/250e ∗Niter ∗ Titer. . . . 101
xvii
xviii LIST OF TABLES
Chapter 1
Introduction
In this thesis we consider the problem of 3D scene analysis inferred from multi-modal
data sources. Wide area motion imagery (WAMI) and Light Detection and Ranging
(LiDAR) provide rich information about large geographic areas at relatively low-cost.
While a great deal of attention has been paid to the idea of 3D scene reconstruction
from image sequences, the existing literature on multi-modal scene reconstruction is
somewhat lacking. One challenge to principled multi-modal approaches is that one
must consider the joint statistical properties across modalities in order to combine
measurements within a consistent mathematical framework. Here we investigate a
generative Bayesian formulation in which joint properties are encoded implicitly via a
latent representation of the scene.
At some level this introduces greater model complexity since in order to obtain
the desired geometry we must not only infer latent camera parameters but also the
latent parameters associated with each of the additional sensor modalities. Further-
more, the application of interest represents a greater challenge to standard image-based
approaches owing to the extreme geometry disparity and wide baseline of the data
collection.
In formulating the problem as one of statistical inference within the Bayesian frame-
work, we address several key modeling and computational issues. In particular, we
incorporate the use of graphics processing units (GPUs) for fast computation. Further-
more, we demonstrate a variety of advantages of the proposed approach as compared
to image-based methods, such as the ability to obtain dense reconstructions, in both
geometry and appearance, that can help identify small but crucial scene details.
Empirical comparisons demonstrate that comparable reconstructions (from a qual-
itative perspective) are achievable using fewer image measurements when exploiting
the geometric information provided by LiDAR. In contrast to image-based methods
we also demonstrate the ability to measure the physical dimensions of a scene at an
absolute versus relative scale, and the ability to detect moving objects against a static
background. These last aspects follow directly from the proposed approach.
In the rest of this chapter we discuss the motivation for this work and briefly review
the proposed approach; we culminate with an outline for the rest of this thesis.
1
2 CHAPTER 1. INTRODUCTION
 1.1 Motivation
Aerial imagery provides the means for obtaining a large amount of information for
a large scene. In dense urban environments this data has many applications, such as
surveillance, object tracking and urban planning among others. The common factor
in all these applications is the knowledge of the underlying 3D structure of the scene.
Despite recent and major advances in 3D reconstructions, the reconstructions of 3D
environment from a collection of aerial views remains a challenging problem. The
problem of estimating the 3D geometry of a scene from a collection of images without
any prior knowledge is known as Structure from Motion (SfM). That is, SfM attempts
to recover the 3D world state (structure) of a scene and the camera parameters (motion)
of an ensemble of images.
In order to accomplish this, several techniques have been developed, based on
the principles of image formation and multi-view geometry. These techniques can be
broadly classified into sequential methods, in which structure and motion are estimated
iteratively while adding additional images; or factorization methods in which structure
and motion are computed simultaneously for all images. Usually, these two methods
are treated as initial estimates and are further refined using a non-linear optimization
to minimize a suitable cost function; this step is referred to as bundle adjustment.
In practice, SfM can take advantage of additional information about the camera
pose parameters –position (x,y,z), viewing directions (yaw, pitch, roll), and focal length.
Digital camera routinely embed camera information within the image representation,
consequently it is common to information regarding the intrinsic camera parameters
(e.g. focal length) and, on occasion, information regarding the extrinsic information
(e.g. GPS measurements). Despite these advantages several challenges still make this
problem difficult. Specifically, a large number of images are needed to accurately char-
acterize a scene, this in turn increases computation complexity and running time since
the amount of coupling between images drastically increases with each additional im-
age that observes a common 3D point. As a result, large and complex scenes require
hundreds or thousands of images and weeks of computation time.
While SfM methods are widely used, the high number of images required for suitable
reconstructions reduces their appeal. As an alternative, one can reduce the number of
images used, but this typically results in a decrease in accuracy. The issue is primarily
one of obtaining a sufficient number of accurate correspondences within wide baseline
image set. However, these two components typically tend to work against each other.
That is, wide baseline give better estimates, but make it more difficult to find accurate
correspondences, since reliably finding correspondences depends on how distinct the
appearance is locally within a scene.
A possible alternative to decreasing the number of input images would be to intro-
duce additional geometric information about the scene. As an example, let us consider
adding the information that a given scene is composed of a single plane, i.e. that all
points lie in the same plane. In this case the planar structural information can reduce
computation cost (search a 2D plane instead of 3D space). Additionally, the number of
Sec. 1.2. Approach 3
images needed can be reduced, reducing image coupling and further reducing compu-
tation time. Moreover, since we are trading off images for geometric information, the
accuracy of the reconstruction should not decrease.
As we will see, the idea of using geometric information in SfM has been examined
in previous work. The conclusion is similar: under certain assumptions, introducing
geometric constraints into SFM generally leads to more accurate solutions at lower
computational cost. While consensus exist that adding geometric constraints is of great
use, the conditions under which these constraints are best utilized, and even the form
of the constraints is not generally agreed upon. Common constraints include a variety
of geometric primitives, such as lines and planes, but higher order primitives, such as
cuboids and spheres, can also be used.
In practice, the form of the constraint is limited by its source; for example knowing
that images are from an urban scene will probably introduce cuboid constraints due to
the high regularity in a man-made environments. However, since no general information
is available it is up to the end-user to input these constraints, usually in the form of
points or lines for simplicity. However, if additional geometric data was available,
automatic methods could be develop to incorporate this information. In this thesis, we
propose such a method.
 1.2 Approach
We propose a probabilistic model that incorporates multi-modal measurements;
optical and geometric; to recover scene geometry and appearance in order to build a
3D photo-realistic model of a given scene. Our model can be viewed as a data fusion
model, in which multiple observations are being combined to provided a cohesive and
unified representation of a scene.
This framework will allow us to incorporate a noise model in each measurement,
which can be used to capture uncertainty in the reconstructions. Furthermore, this
approach has the ability to incorporate additional scene measurements easily, allowing
future expansions and extra sources of information when available.
We focus on two measurement modalities: aerial images and Light Detection and
Ranging (LiDAR). LiDAR is helpful since it provides geometric information. This
information can be used to reduce the number of images and the required computation
time while maintaining comparable results.
There are several differences between the proposed model and traditional SfM, both
in approach and end result. In terms of approach, traditional SfM takes a purely
geometric route in that all image measurements are resolved to correspondences across
images, once the correspondences have been established a suitable geometric function is
minimized and image information is no longer used. In contrast the proposed approach
explicitly uses appearance information, and relies heavily on it as a way of computing
probabilistic likelihoods.
In terms of results, the most noticeable difference is the choice of geometry primitive,
4 CHAPTER 1. INTRODUCTION
while SfM reconstructions are comprised of point clouds, the approach in this thesis uses
triangular elements as the basis of the representation. Furthermore, the model recovers
an appearance image for each triangle while SfM recovers a color for each point.
 1.3 Outline
The thesis is structured as follows: chapter 2, discusses necessary background, in-
cluding the basics of image formations and the details of Structure from Motion (SfM),
as well as the extensions to piecewise planar reconstructions; useful mathematical con-
cepts such as the Gaussian Process prior are also discussed.
Chapter 3 presents the reconstruction problem as a probabilistic data fusion model.
That chapter discusses the observation models, and prior assumptions use in this work.
It also presents inference procedures and update equations for the latent variables in the
proposed model. The implementation details and inference algorithms are presented in
chapter 4. The chapter details the computation of the likelihoods and how to leverage
graphics hardware to obtain substantial computational efficiency.
Experimental results are presented in chapter 5. These results include model vali-
dation, comparison to well known structure from motion algorithms, as well as aspects
of the model that extend beyond reconstructions. Concluding remarks and future work
are discussed in chapter 6.
Appendices provide useful information including an overview of the data used through-
out this work (appendix A); derivations of the appearance updates and useful Gaussian
marginalization as discussed in appendices B and C respectively. OpenGL implemen-
tation details are provided in appendix D.
Chapter 2
Background
In this chapter we develop key background concepts related to Structure from Mo-
tion and LiDAR. We begin with a brief review of previous works. We follow with basic
concepts such as homogeneous coordinates and image formation. Concepts relating to
single-, two-, and multi-view geometry are discussed next. Once the basics have been
discussed we circle back to structure from motion and piecewise planar reconstructions.
After the recap of 3D scene reconstructions we introduce a set of useful concepts
beginning with a description of LiDAR and Gaussian Process priors. We culminate
with a concise introduction to OpenGL and CUDA.
 2.1 Related Work
As we will demonstrate this work differs from other aerial reconstruction works in
both goal and approach. This section will look at related works in three main categories,
general reconstruction works, higher order primitives and aerial reconstruction works.
Traditional reconstruction works fall on the Structure and Motion (SfM) category.
These works [19, 49, 50, 52, 56], relay solely on a collection of images and require ex-
tracting and matching correspondence points across the images. Recent advancements
of SfM by [1, 42, 49, 50] have made reconstructions of urban scenes possible from a
large collection of street views. This work differs from classical SfM in that we in-
corporate scene geometry; we model higher order primitives rather than single points;
and most importantly we substitute explicit point correspondence, with dense implicit
correspondences.
Previous works that include geometry information include [2, 4–6, 12, 16, 55, 63]. As
stated earlier, these works all point out that the introduction of geometric information
can help find a better solution faster for the SfM problem. However, most of these
methods require the user to input the geometric information, typically in the form of
lines and points. This undesirable step can lead to errors and reconstruction quality
that vary from user to user and day to day.
There are many methods that recover higher order primitives such as planes, or
parallelograms [24, 48] – also known as piecewise planar reconstructions. All of these
have SfM as a required initial step. The results of these methods are highly coupled
with the results of traditional SfM since most of these works focus on refining and fitting
5
6 CHAPTER 2. BACKGROUND
the SfM output. These methods produced very accurate geometric results when the
primitives reconstructed match the underlying geometry.
Aerial reconstruction works solely based on images has not been of much interest
[27, 35, 60]. On the other hand Aerial reconstruction with geometry information has
achieved much attention in the photometry field over the last decade [9, 10, 17, 22, 26,
31, 33, 51, 53, 58, 61, 62, 64]. These work rely on LiDAR to reconstruct underlying
scene geometry and images to produce appearance for that geometry. To our knowledge
no work attempts to refine LiDAR geometry with image information. Previous works
that include SfM in a probabilistic framework include [11, 20, 43].
To our knowledge no work attempts to combine geometry information and appear-
ance information in a probabilistic model to reconstruct 3D urban scenes.
 2.2 Homogeneous Coordinates
Before beginning we should clarify some of the notation that will be used. The
distinction between euclidean and homogeneous coordinates is an important one; we
denote 3D Euclidean points by X =
[
X Y Z
]>
, similarly a 2D Euclidean point is
represented by x =
[
x y
]>
. Homogeneous coordinates are denoted by tildes, such that
a 3D homogeneous point is X˜ ∼ [X˜ Y˜ Z˜ W˜ ]> and a 2D point x˜ ∼ [x˜ y˜ w˜]>. Im-
age pixel location, have similar notation to 2D homogeneous coordinates u˜ =
[
u v
]>
.
Notation aside, let us get started.
Homogeneous vectors refer to the equivalent class, under which two vectors are
equivalent up to a non-zero scaling. That is, 3D Euclidean vector X =
[
X Y Z
]>
and kX =
[
kX kY kZ
]>
are equivalent ∀k 6= 0. Thus any vector X is representative
of the equivalence class. The space of such vectors is P3. It becomes natural to think of
an augmented vector to include the scaling factor, such that X˜ ∼ [kX kY kZ k]> =[
X˜ Y˜ Z˜ k
]>
(for clarity we continue to use k as a scaling parameter but a more
general notation W˜ can also be used). It becomes clear that X =
[
X˜/k Y˜ /k Z˜/k
]>
,
and X˜ ∼ [X Y Z 1]>.
Similar reasoning applies to the 2D Euclidean and homogeneous coordinates (in
space P2). A 2D Euclidean point x =
[
x y
]>
has homogeneous coordinates given by
x˜ ∼ [x˜ y˜ w˜]>. Furthermore, x˜ ∼ [x y 1]> and x = [x˜/w˜ y˜/w˜]>
For more detailed treatment of homogeneous coordinates refer to [28].
 2.3 Image Formation
In this section we describe the basics of image formation. This section is based
on the projective pinhole camera model. This model is an accurate description of real
cameras when non-linear effects can be neglected (or compensated).
The pinhole model assumes that ray traicing can form a 2D image. That is, the
Sec. 2.3. Image Formation 7
Oc
I
X
x
O
p
Figure 2.1: Pinhole Camera Diagram. The image is formed by keeping all rays (a
few denoted by dotted lines) that intersect plane I and pass through camera center
Oc (dashed line denotes principal axis, point p denotes the principal point, where the
principal axis intersects the image plane).
image is formed by keeping every ray that intersects an image plane and passes through
the camera center1, see figure 2.1.
Mathematically, we can express the transformation of a 3D world point to a 2D
image pixel by three components, a 3D-to-3D transformation of world coordinates to
camera coordinate, a 3D-to-2D transformation of camera coordinates to image plane
and a 2D-to-2D image plane to pixel transformation.
The 3D-to-3D transformation can be written in homogeneous coordinates as:
X˜c ∼
[
R T
0 1
]
X˜ (2.1)
where R is 3 × 3 rotation matrix, T is a 3 × 1 translation vector, X˜ is the homoge-
neous coordinate of point X with respect to world coordinate frame, and X˜c is the
homogeneous coordinate with respect to camera coordinate frame.
The 3D-to-2D transformation of camera coordinate frame to image plane can be
written as:
x˜ ∼ diag(f, f, 1) [I | 0] X˜c (2.2)
1For simplicity we are adopting the convention that rays pass through the camera center and the
image plane is a distance of f (focal length) from the camera center in the direction of the scene, rather
than having rays pass through the focal point and the image plane located behind camera center and
inversed. These interpretations are equivalent
8 CHAPTER 2. BACKGROUND
where diag(f, f, 1) is a 3x3 diagonal matrix with diagonal entries {f , f , 1}, where f is
the camera focal length. I is the identity matrix and x˜ is the projection of X onto the
image plane.
The remaining transformation is the 2D-to-2D transformation from image plane to
pixel coordinates. This transformation can be written as:
u˜ ∼ Kx˜ (2.3)
where
K =
 fu s u00 fv v0
0 0 1
 (2.4)
and it is known as the calibration matrix. Generally, K is upper triangular; entries
(fu, fv) denote the focal length in the horizontal and vertical direction; (u0, v0) denote
the principal point; and s denotes the skew of pixels (only used when dealing with non-
rectangular pixels), collectively these parameters are referred to as intrinsic parameters.
The three transformations (eq: 2.1, 2.2 and 2.3 ) can be combine to write the
projective transformation between world point X to pixel coordinate u˜
u˜ ∼ PX˜ (2.5)
where
P ∼ K [R T] (2.6)
is a 3× 4 projection matrix.
Thus, equation 2.5 allows us to calculate the pixel location of a 3D world point
provided we know, the parameters for the rotation matrix R usually as a function of
three angles: yaw, pitch and roll, the translation vector T, and the focal length f . A
total of 7 parameters2. The parameters R, and T are referred to as extrinsic parameters.
We can consider the opposite situation in which we know the location of the points
X and u˜, then we can stack the equations in the form of equation 2.5 and solve for the
unknown parameters. This calibration problem can be solved by knowing 6 3D world
points and their corresponding pixel coordinates. This will allows to calculate P, we
can then obtain K and [R T] by RQ decomposition.
For a more detailed discussion of Pinhole cameras and projective transformations
refer to [21, 28, 54].
 2.4 Two-View Geometry
In this section we cover the basics of two-view geometry. In particular we relate pix-
els in one image to pixels in another image, provided the pixels correspond to the same
2Under the assumption that fu = fv = f and s = 0, which is quite reasonable in practice, since
non-rectangular pixels are not common and focal length tends to be symmetric.
Sec. 2.4. Two-View Geometry 9
I1
O
1
I2
O
2
e2
e1
x
x 2
x 1
Figure 2.2: Two View geometry with cameras centered at O1 and O2, epipoles e1 and
e2 and epipolar plane shown in green.
3D world point; we recover camera parameters and 3D world points for corresponding
pixels.
Let us begin by considering the problem depicted in figure 2.2. We can generalize
that camera 1 with center O1 has projection matrix P1 (as given by eq 2.6), similarly
camera 2 with center O2 has projection matrix P2. Furthermore, we can generalize
that the relation between both cameras is an unknown rotation R and translation T.
Given a point X1 in camera 1, its position in camera 2, X2, can be written as:
X2 = RX1 + T (2.7)
pre-multiplying both sides by X>2 [T]× where for T = [tx ty tz]>
[T]× =
 0 −tx tytz 0 −tx
−ty tx 0
 (2.8)
we obtain
X>2 [T]×RX1 = X
>
2 EX1 = 0 (2.9)
where E ∼ [T]×R and it is known as the essential matrix. We can use equation 2.2 to
relate X1 to x˜1 and X2 to x˜2, from which we see that equation 2.9 also holds for image
plane points, yielding
x˜>2 Ex˜1 = 0 (2.10)
10 CHAPTER 2. BACKGROUND
which is known as the epipolar constraint.
This constraint can be visualized as the projection of a point in one image to another
must lie in a line. In particular, lie in the epipolar line of the second image which is
the intersection of the image plane and the epipolar plane, a plane that contains the
camera centers and the point x˜1 in camera 1 (figure 2.2 shows the epipolar plane in
green and epipolar lines as dot dashed). Note that the projection of camera center into
the opposite image is known as the epipoles (shown in figure 2.2 as e1 and e2).
We can apply equation 2.3 to equation 2.10 to generate the relationship between
pixel of image 1 and image 2:
u˜>2 Fu˜1 = 0 (2.11)
where
F ∼ K−1>2 EK−11 (2.12)
is a 3 × 3 matrix of rank 2 called the fundamental matrix. Composed of the intrinsic
parameters of the cameras and the extrinsic parameters (via the essential matrix).
Similar to the approach taken in section §2.3 to estimate the camera matrix, the
constraints of the fundamental matrix can be stacked up and solved as a system in the
form Af = 0 where f is stacked up coefficients of F and A is a matrix composed of pixel
coordinates. Since there are 9 unknowns in F we need at least 9 pixel correspondences
between the images, methods described in [32] can obtain F from 7 correspondences.
As discussed above, the fundamental matrix can be obtained from pixel correspon-
dences across two images. In the special case where we know K1 and K2, then the only
unknown in equation 2.12 is E, which is given by E ∼ [T]×R. Since R and [T]× are
orthogonal we can decompose E by an SVD decomposition, with form E = UΛV>,
where U and V are orthogonal and Λ is diagonal with last entry 0. Then we can
compute:
[T]× = U
 0 1 0−1 0 0
0 0 0
U> (2.13)
and
R = U
 0 −1 01 0 0
0 0 1
V> (2.14)
Once we recover the translation vector and rotation matrix we can compute the
individual camera projection matrices. Without any loss of generality we can let one
camera be the reference camera and the second camera have translation T and rotation
R, that is:
P1 = K1[I | 0] (2.15)
P2 = K2[R | T] (2.16)
Sec. 2.4. Two-View Geometry 11
Figure 2.3: 3D world point X has projection u˜ in image I and u˜′ in image I ′. The
diagram shows point the location of X as a search of pixel correspondence u˜′ over the
epipolar line.
The method introduced above assumes that the calibration matrices are known,
which is not always the case. However, calibration matrices can be estimated using the
Kruppa equations [15] then proceed as above.
To recap, we are now able to describe two views, with their projective matrices
based only on pixel correspondences. What we would like to do next is estimate the
3D location of pixel correspondences, this problem is known as Triangulation [29].
The triangulation problem can be visualized in figure 2.3 for two images. We can
readily observe, that in a noiseless scenario this problem is trivial. In the noisy case,
the search for an optimal 3D point X can be seen as a correspondence point search
along the epipolar line of the second image, which has a non-iterative solution for two
views [29].
In a multi-image case for which we know the projection matrix Pi of each camera i
we can cast the problem as a non-linear optimization problem, in the form:
Xˆ = argmin
X
∑
i
||ui − u˜i(Pi,X)||2 (2.17)
where ui is the known correspondence pixel in image i, u˜i is the predicted pixel in image
i as a function of X and Pi and Xˆ is the solution found. Equation 2.17, minimizes the
square error between the desired pixel and the re-projected pixel based on the estimated
value of X and the projection matrix of image i.
12 CHAPTER 2. BACKGROUND
 2.5 Multi-View Geometry
In this section we extend the concepts develop in two-view geometry to multiple
images. This will be done from the perspective of Structure from Motion. This section
will tie together the material from earlier sections. As mentioned in the introduction,
SfM methods can be broken down into two main categories: sequential, and factorization
methods; we will now briefly explain these.
The most intuitive methods for bundle adjustment are sequential method. They
follow from the results develop in section §2.4 and as the name suggest are developed
in a sequential manner. That is they start with two images for which a reconstruction
based on epipolar constraints is formed, each additional image is reconstructed in a
similar manner. Smarter methods take advantage of the 3D points that have already
been reconstructed when adding new images as in [7]. As an alternative, merging partial
reconstructions of 3D points is also possible [19].
Factorization methods compute camera pose and scene geometry using all images
simultaneously. The key assumption behind this method is that the scene does not
change between images, and that the number of images is relatively large, so that
multiple features are visible across a large number of images [56]. This allows us to
decouple the problem of structure and motion; unfortunately this assumption is not
always valid.
Bundle adjustment refers to the process of refining the initial estimates of section
§2.4 by minimizing an appropriate cost function via an iterative non-linear optimization
[8, 57]. Mathematically, the problem can be neatly summarized in a one-line equation:
min
Pˆi,Xˆj
∑
ij
d(PˆiXj , x˜ij)
2 (2.18)
where x˜ij = Pˆ
iXj is the projection of world point Xj to image point x˜
i
j of camera i
with projection matrix Pˆi where the image measurement are noisy.
Then bundle adjustment is just finding a projection matrix of camera i and point
j that minimize the square of some distance function d(·, ·) between the noisy image
measurement and the predicted image measurement across all cameras and points. In
the case of the noise being modeled as Gaussian, this optimization is equivalent to a
Maximum Likelihood (ML) estimate of projection matrix and world point.
The particular distance function d(·, ·) must be chosen with some care, since it must
be invariant to the transformation of the camera, typically a prediction error is used,
x˜ij − PˆiXj .
Bundle adjustment is powerful tool that in practice can produce satisfactory results.
Its main disadvantage is the expensive computation time and its propensity to get stuck
in local minima.
Sec. 2.6. Structure from Motion 13
Incremental Structure from Motion
Iteratively Remove 
outliers via BA
Chose Initial Pair
(image pair with largest # of 
matches not explained by 
homography)
Bundle Adjustment 
Optimization
Add new cameras 
(with at least 75% of 
total number of 
matches to already 
reconstructed points)
Images
Tracks
If more 
images 
Initialize Camera
Via RANSAC
Direct Linear Transform
Non-linear Refinement
Focal Length 
(if known)
Detect/Remove Outlier Track
(track with high re-projection error)
Camera and Points
Scale Invariant Feature Transform (SIFT)
Difference of Gaussians
(σ=1.6,s=3 scales/octave, 
downsampling=2)
Find Local Extrema
(compare each sample to its 26 
neightbors)
Calculate Smoothed Image 
Gradient 
(magnitude and direction at the 
proper scale)
Histogram Patch Gradient 
(for each sample, histogram oriented 
gradients in a 16x16 set of samples 
around a point, weight by Gaussian σ=8)
Histogram Peak Determines 
Orientation
(other orientations within 80% of peak 
used to create keypoints in the same 
location with different orientation)
Form Descriptor
(combine gradients into a 4x4 window 
with 8 orientations, descriptor is all the 
elements in the window)
Get sub-pixel location 
(fit patch with quadratic 
function)
Eliminate Edge Response
(threshold eigenvalues of 
Hessian, r=10) 
Location and ScaleImage
Histogram of orientations
(for each sample, 36 bins, 10°/bin, 
samples weighted by grad 
magnitude and Gaussian with 
σ=1.6*scale)
Orientation
Normalize and Threshold
(threshold each element to a 
maximum of 0.2)
Descriptor
Approximate nearest neighbors (ANN)
Compute KD Tree 
Matching Structure of 
search keypoint
(efficient search structure)
Find 2-nearest neighbors to each 
given keypoint in KD Tree
(Distance function L2 of query-search 
keypoint descriptor)
Query Keypoint
Keep closest keypoint if 
dist_closest/dist_2nd_closest 
≤ 0.36
Keypoint Matches
Search Keypoint
Scene Reconstruction
Camera Estimates
Structure from MotionFeature Detection Pairwise feature matching Correspondence estimation
SIFT
Lowe et al 2004
Approximate nearest 
neighbors 
Arya et al 1998
Remove Outliers 
Not consistent 
with F
Group Matches into Tracks 
(remove images if 
matches<20,remove tracks with 
multiple keypoints per image)
Fundamental Matrix 
via RANSAC
8-pt Fundamental Matrix
Non-linear Refinement
Images Incremental Structure 
from Motion
Snavely et al 2006
Non-linear Refinement of 
estimates
Feature Detection Pairwise feature matching
Geometry and Camera estimation
(Geometry)
Figure 2.4: Typical Structure from Motion pipeline
Incremental Structure from Motion
Iteratively Remove 
outliers via BA
Chose Initial Pair
(image pair with largest # of 
matches not explained by 
homography)
Bundle Adjustment 
Optimization
Add new cameras 
(with at least 75% of 
total number of 
matches to already 
reconstructed points)
Images
Tracks
If more 
images 
Initialize Camera
Via RANSAC
Direct Linear Transform
Non-linear Refinement
Focal Length 
(if known)
Detect/Remove Outlier Track
(track with high re-projection error)
Camera and Points
Scale Invariant Feature Transform (SIFT)
Difference of Gaussians
(σ=1.6,s=3 scales/octave, 
downsampling=2)
Find Local Extrema
(compare each sample to its 26 
neightbors)
Calculate Smoothed Image 
Gradient 
(magnitude and direction at the 
proper scale)
Histogram Patch Gradient 
(for each sample, histogram oriented 
gradients in a 16x16 set of samples 
around a point, weight by Gaussian σ=8)
Histogram Peak Determines 
Orientation
(other orientations within 80% of peak 
used to create keypoints in the same 
location with different orientation)
Form Descriptor
(combine gradients into a 4x4 window 
with 8 orientations, descriptor is all the 
elements in the window)
Get sub-pixel location 
(fit patch with quadratic 
function)
Eliminate Edge Response
(threshold eigenvalues of 
Hessian, r=10) 
Location and ScaleImage
Histogram of orientations
(for each sample, 36 bins, 10°/bin, 
samples weighted by grad 
magnitude and Gaussian with 
σ=1.6*scale)
Orientation
Normalize and Threshold
(threshold each element to a 
maximum of 0.2)
Descriptor
Approximate nearest neighbors (ANN)
Compute KD Tree 
Matching Structure of 
search keypoint
(efficient search structure)
Find 2-nearest neighbors to each 
given keypoint in KD Tree
(Distance function L2 of query-search 
keypoint descriptor)
Query Keypoint
Keep closest keypoint if 
dist_closest/dist_2nd_closest 
≤ 0.36
Keypoint Matches
Search Keypoint
Scene Reconstruction
Camera Estimates
Structure from MotionFeature Detection Pairwise feature matching Correspondence estimation
SIFT
Lowe et al 2004
Approximate nearest 
neighbors 
Arya et al 1998
Remove Outliers 
Not consistent 
with F
Group Matches into Tracks 
(remove images if 
matches<20,remove tracks with 
multiple keypoints per image)
Fundamental Matrix 
via RANSAC
8-pt Fundamental Matrix
Non-linear Refinement
Images Incremental Structure 
from Motion
Snavely et al 2006
Non-linear Refinement of 
estimates
Feature Detection Pairwise feature matching
Geometry and Camera estimation
(Geometry)
Figure 2.5: Bundler pipeline [50]
 2.6 Structure from Motion
The pipeline for sequential tructu e from motion can be seen in Figure 2.4. As the
figure ind c tes SfM can be thought of as four distinct steps, in the first step interesting
image features are extracted for each image, these features are typically chosen to have
a set of desired properties such as scale and orientation invariant, and illumination
invariant. The next ste is to match the features found across the input images, this
step relays on the uniqueness of the features and the properties discussed earlier.
Step three of the pipeline involves estimating scene geometry and camera pose from
the matched correspondences, using t e meth ds described in §2.4. The main assump-
tion at this step is that matched features correspond to the same 3D scene location. We
note that when errors in the matching occur, this step introduces incorrect information.
The scene points and camera pose obtained in step three are refined in step four using
an iterative refinement, typically bundle adjustment as in §2.5.
Recently, Snavely et al. [50], combined the s eps of the pipeline in Figure 2.4 with
great success into their SfM software package Bundler. Their particular choice of algo-
rithms and m thods for each step made the SfM pipeline highly effective. As can be
seen from Figure 2.5 their use of SIFT features [37] as well as the approximate nearest
neighbor feature matching [3] made the first two steps fast as compared to previous
works. The computation of fundamental matrix via the normalized 8-pt algorithm
of [29] and RANSAC [18] provided robustness and reduced the chances of including
incorrect feature matches.
Their bundle adjust step (cf. Figure 2.6), a modified version of [36] again improved
robustness by having a better initialization for the starting match –important since
bundle adjustment is known to be initialization dependent and prone to getting stuck
in local optima due to initialization. Furthermore, their incremental scheme for adding
new images to the solution allows for more stable solutions.
14 CHAPTER 2. BACKGROUND
Incremental Structure from Motion
Iteratively Remove 
outliers via BA
Chose Initial Pair
(image pair with largest # of 
matches not explained by 
homography)
Bundle Adjustment 
Optimization
Add new cameras 
(with at least 75% of 
total number of 
matches to already 
reconstructed points)
Images
Tracks
If more 
images 
Initialize Camera
Via RANSAC
Direct Linear Transform
Non-linear Refinement
Focal Length 
(if known)
Detect/Remove Outlier Track
(track with high re-projection error)
Camera and Points
Scale Invariant Feature Transform (SIFT)
Difference of Gaussians
(σ=1.6,s=3 scales/octave, 
downsampling=2)
Find Local Extrema
(compare each sample to its 26 
neightbors)
Calculate Smoothed Image 
Gradient 
(magnitude and direction at the 
proper scale)
Histogram Patch Gradient 
(for each sample, histogram oriented 
gradients in a 16x16 set of samples 
around a point, weight by Gaussian σ=8)
Histogram Peak Determines 
Orientation
(other orientations within 80% of peak 
used to create keypoints in the same 
location with different orientation)
Form Descriptor
(combine gradients into a 4x4 window 
with 8 orientations, descriptor is all the 
elements in the window)
Get sub-pixel location 
(fit patch with quadratic 
function)
Eliminate Edge Response
(threshold eigenvalues of 
Hessian, r=10) 
Location and ScaleImage
Histogram of orientations
(for each sample, 36 bins, 10°/bin, 
samples weighted by grad 
magnitude and Gaussian with 
σ=1.6*scale)
Orientation
Normalize and Threshold
(threshold each element to a 
maximum of 0.2)
Descriptor
Approximate nearest neighbors (ANN)
Compute KD Tree 
Matching Structure of 
search keypoint
(efficient search structure)
Find 2-nearest neighbors to each 
given keypoint in KD Tree
(Distance function L2 of query-search 
keypoint descriptor)
Query Keypoint
Keep closest keypoint if 
dist_closest/dist_2nd_closest 
≤ 0.36
Keypoint Matches
Search Keypoint
Scene Reconstruction
Camera Estimates
Structure from MotionFeature Detection Pairwise feature matching Correspondence estimation
SIFT
Lowe et al 2004
Approximate nearest 
neighbors 
Arya et al 1998
Remove Outliers 
Not consistent 
with F
Group Matches into Tracks 
(remove images if 
matches<20,remove tracks with 
multiple keypoints per image)
Fundamental Matrix 
via RANSAC
8-pt Fundamental Matrix
Non-linear Refinement
Images Incremental Structure 
from Motion
Snavely et al 2006
Non-linear Refinement of 
estimates
Feature Detection Pairwise feature matching
Geometry and Camera estimation
(Geometry)
Figure 2.6: Bundler’s Bundle Adjustment Details [50]
Structure from Motion Generation of Candidate 
Planes
SfM Algorithm RANSAC or similar 
World Geometry 
and Camera Pose
Images
Plane Post processing 
Merging/Discarding, 
etc. 
Candidate
 Planes
Image/DepthMath Pixel 
Labeling
Multi-label MRF 
solved by GraphCut 
Candidate 
Planes
Labeled 
Image/Depthmaps
Figure 2.7: Typical piecewise planer reconstruction pipeline
 2.7 Piecewise Planar reconstructions
A typical pipeline for piecewise planar reconstructions can be seen in Figure 2.7.
Typical piecewise planar algorithms begin by first running structure from motion to
obtain candidate camera pose estimates and dense world deapthmaps. The depthmaps
are searched to produce a candidate set of planes, the search usually involves RANSAC
or other algorithms that are robust to outliers. The next step prunes the list of planes,
by merging or discarding planes that are the redundant or incorrect. The final step
views the deapthmaps as a labeling problem in a Markov random field and solves for
the optimal boundary between planes, typically via GraphCut.
As previously discussed the results of planar reconstructions are highly tided to SfM
results, since that is the first step in the pipeline. Furthermore piecewise planar work
can be seen as refinement to the results traditional structure from motion.
Variations from this pipeline are common, for example Gallup et al. [24], introduce
a label to indicate non-planar region, and thus allow for objects, such as trees, to not
be reconstructed as planes.
 2.8 Light Detection and Ranging
Light Detection and Ranging (LiDAR) refers to the class of optical remote sensing
technology that measures distance and/or material properties. In airborne LiDAR, a
ranging system is mounted in an aircraft along with a position tracking system such as
GPS. The system continuously scans the ground plane with light pulses, usually in the
ultraviolet or near infrared spectrum, figure 2.8. By controlling the aircraft’s altitude
and velocity, and measuring the return delay of a pulse a relative distance to ground
Sec. 2.9. Gaussian Process Prior 15
adad adad
ad
Scan Width
~500 m
Point Separation 
~1m
Scan Angle
~40°
z
x
y
Figure 2.8: Typical LiDAR scan pattern, with common parameters
can be obtained which when combined with horizontal position can give a very accurate
3D point, accuracy in the order of centimeters. By combining a great number of scan
lines and pulse returns we can obtain a large point cloud of the area in question, as
shown in figure 2.9 for the MIT Dome.
LiDAR is of particular interest because it provides us with a set of points for which
we know the geometry with a high degree of accuracy. This makes LiDAR ideal to
be introduced as constraints to any model. Furthermore, since the set of points are
relatively dense, we can readily extract surface contours from them. As a simple example
consider projecting all points to the ground plane, performing a Delaunay triangulation
and re-projecting points to their original elevations along with their connections; this
produces a triangular mesh that while simple, is still an effective manner of creating a
3D structural model from LiDAR returns.
 2.9 Gaussian Process Prior
For most of our applications the data we obtain is from city fly-overs. This type
of data has certain characteristics that can be incorporated as prior knowledge. For
example, we can expect that that plane collecting the measurements follows a smooth
trajectory. This implies that the camera collecting the images will also have a smooth
position as a function of time. For similar reasons, we can expect the orientation to
vary smoothly. We can incorporate this information as a prior in the form of a Gaussian
Process Prior.
16 CHAPTER 2. BACKGROUND
Figure 2.9: LiDAR representation of MIT Dome and Killian Court (color represent
height above ground).
 2.9.1 Definition
Simply stated a Gaussian Process is a collection of infinite random variables, any
finite number of which have a Gaussian distributions. A Gaussian process is a general-
ization of a multivariate Gaussian distribution to infinitely many variables.
We can fully specify a Gaussian process by a mean function m(x) and a covariance
kernel k(x, x′):
f(x) ∼ GP (m(x), k(x, x′)) (2.19)
where x is one-dimensional variable (chosen for simplicity of exposition, Gaussian
process can easily be extended to multidimensional case).
In practice we do not need to instantiate the infinite collection of random variables,
instead we can focus on a finite collection of them f = [f(x1), f(x2), . . . , f(xn)]
>, if we
let m(x) = 0 for all x, then
f ∼ N (0,Σ)
where Σij = k(xi, xj), ∀i, j ∈ [1, n].
For many applications m(x) can be made zero and k(x, x′) can take the squared
exponential function:
k(x, x′) = υ2 exp
(
−(x− x
′)2
2`2
)
(2.20)
Sec. 2.9. Gaussian Process Prior 17
The hyper-parameters (υ, `) in the square exponential function (equation 2.20) have
specific interpretation. The lengthscale, `, controls the level of variability expected in
the input. Larger lengthscales indicate that the function is expected to vary slowly,
while shorter lengthscales indicate rapid changes in the function values. The Signal
variance, υ, defines the vertical scale of variations of a typical function (see Figure
2.10).
x
y
Gaussian-Process-Random-Samples-(v=10,-l=6)
0 5 10 15 20 25 30
-8
-6
-4
-2
0
2
4
6
8
10
x
y
Gaussian-Process-Random-Samples-(v=10,-l=3)
0 5 10 15 20 25 30
-8
-6
-4
-2
0
2
4
6
x
y
Gaussian-Process-Random-Samples-(v=5,-l=6)
0 5 10 15 20 25 30
-5
-4
-3
-2
-1
0
1
2
3
4
5
x
y
Gaussian-Process-Random-Samples-(v=5,-l=3)
0 5 10 15 20 25 30
-8
-6
-4
-2
0
2
4
6
8
Figure 2.10: Random Samples from a Gaussian process with different hyperparameters.
Left column has a large lengthscale, producing very smooth samples; Right column has
smaller lengthscale resulting in rapidly varying samples. Top row and bottom row have
different signal variances, hence their amplitude ranges vary according.
 2.9.2 Inference
Typically, a Gaussian Process is used as a prior. In the case where the likelihood is
Gaussian, this results in a close form solution for the posterior (also Gaussian). We will
present the prototypical predictive case here. Let us assume we have noisy observations
of a smooth function tN = {xn, yn}Nn=1, the task is to infer the function f(x).
Assume you have a prior over functions that is a Gaussian Process, such that f(x) ∼
GP(0, k(x, x′)), and that the observation model is p(y|x, f(x)) = N (f , σ2I), where I is
the identity matrix and σ2 represent some suitable noise.
18 CHAPTER 2. BACKGROUND
The posterior over f(x) becomes:
p(f(x)|x,y) = p(y|x, f(x))p(f(x))
p(x,y)
The numerator of the right hand side is the multiplication of a Gaussian and a
Gaussian Process, which results in another Gaussian process:
f(x)|x,y ∼ GP (µpost(x), kpost(x, x′))
where
µpos(x) = k(x,x)
[
K(x,x) + σ2I
]−1
y
and
kpost(x, x
′) = k(x, x′)− k(x,x) [K(x,x) + σ2I]−1 k(x, x′)
 2.9.3 Hyperparameters
The covariance kernel typically has a few hyper-parameters. For example the square
exponential covariance (equation 2.20) has two hyper-parameters, the lengthscale and
the scale variance. These parameters can be manually tuned or learned from the data.
An effective way of learning parameters involves looking at the posterior probability
of the hyper-parameters θ.
p(θ|x,y, f) ∝ p(y|x, f ,θ)p(θ)
where the first term on the right hand side is data likelihood, and the second is a prior
over the parameters. Taking the log of the marginal likelihood, we obtain
log p(y|x, f ,θ) = −1
2
log |K| − 1
2
y>K−1y − n
2
log (2pi)
(where K is shorthand for the covariance matrix) which we can optimize over to get the
hyper-parameters. Typically the gradient useful in the optimization, we can compute
the gradient of the log likelihood with respect to the ith hyper-parameter:
∂
∂θi
log p(y|x, f ,θ) = −1
2
Trace
(
K−1
∂K
∂θi
)
+
1
2
y>K−1
∂K
∂θi
K−1y
Taking the partial of the matrix K with respect to θi is just taking the partials of
each entry. Alternatively we can take the partial of the covariance kernel, and build the
new ∂K using the modified kernel. For the square exponential covariance, the partial
kernels kυ(x, x
′) and k`(x, x′) are derived below:
∂k(x, x′)
∂υ
=
∂
∂υ
[
υ2 exp
(
−(x− x
′)2
2`2
)]
= 2υ exp
(
−(x− x
′)2
2`2
)
︸ ︷︷ ︸
kυ(x,x′)
Sec. 2.10. OpenGL and CUDA 19
and
∂k(x, x′)
∂`
=
∂
∂`
[
υ2 exp
(
−(x− x
′)2
2`2
)]
= υ2
(x− x′)2
`3
exp
(
−(x− x
′)2
2`2
)
︸ ︷︷ ︸
k`(x,x′)
For more on Gaussian Processes see [38, 44].
 2.10 OpenGL and CUDA
Our practical implementation rely heavily on OpenGL [34], CUDA [39] (as well as
libQGLViewer and Qt [13, 25]).
OpenGL is a low level software interface to graphics hardware, it is specifically built
to handle vertex data, typically geometry information, and pixel data, typically image
information. These two basic data types allow the construction of sophisticated and
complicated objects. These data types (and hence objects they create) are combined in
a rendering pipeline to produce 2D representation. These representations allow us to
visualize the 3D models with high realistic details such as lighting, occlusion reasoning
and perspective effects. For more on OpenGL see [47].
CUDA architecture on the other hand allows us to interact directly with graphic
hardware, not for rendering purposes but for general purpose computing. This is desired
since using a Graphics Processing Unit (GPU) can lead to high performance implemen-
tations due to their parallel nature. Furthermore, combining OpenGL and CUDA is
optimal since we can process OpenGL objects while they are in graphics hardware. This
leads to a reduction in communication between the Central Processing Unit (CPU) and
GPU, which is where most of the performance bottleneck occurs. For more on CUDA
see [40, 46].
20 CHAPTER 2. BACKGROUND
Chapter 3
Model
Here we describe a probabilistic generative model of LIDAR and motion imagery.
Our goal is to construct a latent variable model that is sufficiently expressive to explain
multiple measurement modalities while lending itself to efficient inference procedures.
The use of graphical models facilitates our goals.
 3.1 Model
The implementation presented here follows the directed graphical model shown in
Figure 3.1. The implication of the directed graphical model is that the observations are
statistically independent conditioned on the latent variables. In those terms the model
is composed of multiple latent or hidden variables (G, A, K, T , and B), which explain
observable variables (L, I, and Z). The variables G and A collectively explain the
geometry and appearance of the underlying scene, while K and T explain intrinsic and
extrinsic camera parameters. The remaining latent variable, B (denoting background),
is incorporated to explain additional variability not captured by the other variables.
We explain its role later in this section. Observed variables are shaded gray in Figure
3.1, and denote measured quantities such as a LiDAR point (L), an image (I), or a
GPS position (Z).
The model can be conceptually divided into two parts, the first portion encodes the
underlying scene structure in the form of geometry and appearance. As seen in Figure
3.1 the world representation consists of Np independent geometry primitives Gm, with
L
K1 T1
I1 Z1
M
K2 T2
I2 Z2
Kn Tn
In Zn
GA
L
K1T1
I1Z1
M
K2T2
I2Z2
KnTn
InZn
GA
L
K1 T1
I1 Z1
M
K2 T2
I2 Z2
Kn Tn
In Zn
G A
L
G
A K T
I Z
NM
L
G A K T
I Z
N
M
L
G
A K
TI Z
NM
L
G
A K T
I Z
N
M
L
G A
K
T
I Z
N
M
L
G A K T
I Z
N
M
L
G A K T
I Z
N
M
L
G A K
TI Z
N
M
C
L
G A K T
I Z
NI
Np NC
NL
L
G A K T
I Z
NI
Np NC
NL
B
Figure 3.1: Graphical Model Representation of the proposed model.
21
22 CHAPTER 3. MODEL
a canonical appearance Am, where m ∈ [1, Np]. The primitives used throughout this
work are triangles, parameterized by three 3D points. Appearance variables, Am, are
described over a canonical image coordinate system constrained to the support of a
right triangle of specified size. This is then combined with a transformation defined
by primitive’s parameters to describe the appearance of the primitive in 3-dimensional
space.
The second part of the model describes two noisy multi-modal observations: images
and LiDAR and their relationship to the underlying world. The LiDAR measurements
are expressed in the bottom plate of Figure 3.1 with NL independent points Lk where
k ∈ [1, NL]. The plate on the right explains image measurements, it consists of Nc
independent cameras, each with their own intrinsic parameters Kc and extrinsic camera
trajectory Tc where c ∈ [1, Nc]. The intrinsic and extrinsic parameters generate Ni
independent images Icn and GPS measurements Z
c
n, where n ∈ [1, Ni].
The joint probability distribution consistent with the graphical model described can
be expressed as:
p(L, I,Z,G,A,T,K,B) =
Nc∏
c=1
NI∏
i=1
p(Ici |G,A,B,Kc, Tc)p(Zci |Tc)
×
Nl∏
l=1
p(Ll|G)
Np∏
k=1
p(Gk)p(Ak)
Nc∏
c=1
p(Tc)p(Kc)p(B) (3.1)
The individual terms of equation (3.1) will be discussed in the next sections; for easy
reference, Table 3.1 lists the variables used in the model as well as a short description.
 3.2 Observation Models
The directed graphical model of Figure 3.1 illustrates independence properties of
the variables in the model. However, it is still the case that the parameterization of
the model must be provided. In the model presented here, directed edges from latent
variable to measurements are derived from a physical sensor model (or approximation
there of). Here, we describe the sensor models used in this work and the likelihood
equations that follow from those sensors. We emphasize that additional sensors can
easily be incorporated to this model by following the approach presented in this section.
 3.2.1 LiDAR observation model
We model LiDAR as a point measurement of a 3D world surface plus additive
Gaussian noise. That is, the nth LiDAR measurement, Ln, can be expressed as
Ln = L
k(n)
n +W (3.2)
where L
k(n)
n is the 3D point on primitive Gk(n) that generated the measurement, and
W ∼ N (w; 0, σ2). This results in Ln|Lk(n)n ∼ N (Ln;Lk(n)n ;σ2) = N (d2(Ln, Lk(n)n ); 0, σ2),
where d2(Ln, L
k(n)
n ) is the square distance between points Ln and L
k(n)
n .
Sec. 3.2. Observation Models 23
Variable Description
Np Number of world primitives.
NL Number of LiDAR points.
Nc Number of individual cameras.
NI Number of images (for a given camera).
B Background Model.
Lk k
th LiDAR measurement, 3D world point, k ∈ [1, NL].
Gm m
th world primitive, triangle parameterized by three 3D points, m ∈
[1, Np].
Am m
th world primitive appearance, image of specified size, m ∈ [1, Np].
Kc Intrinsic parameters for the c
th camera, consisting of focal length, c ∈
[1, Nc].
Tc Extrinsic camera trajectory for the c
th camera, 3D position and orienta-
tion, c ∈ [1, Nc].
Icn n
th image taken with camera c, n ∈ [1, NI ].
Zcn n
th GPS measurement of camera c, n ∈ [1, NI ].
Table 3.1: List of Variables used in Equation (3.1) and Figure 3.1
Equation (3.2) and the likelihood model associated with it uses the distance between
the point that generates the measurement and the measurement to assign a probability
to the observation. Generally we do not know a priori which point, (or even world
primitive) generates which observation, this leads to a data association problem. When
thinking about sensor model, the most natural way of associating a LiDAR measure-
ment with a surface is to follow the direction of the ray formed by connecting the
collection device (mounted on the airplane as shown in figure 2.8) and the measure-
ment; this yields uncertainty in the direction of the ray as shown in Figure 3.2a. This
computation requires knowledge of the aircraft’s location at all points during the flight,
knowledge of the location of the LiDAR collection device on the aircraft and the scan
parameters. Modeling these parameters drastically increases computational complexity
and is outside the scope of this thesis.
A simple yet powerful approach would associate the LiDAR measurement to the
closest point on a world primitive. (Figure 3.2b, and 3.2c). That is, we can let Ln|Ln ∼
N (d2(Ln, Lˆk(n)n ); 0, σ2), where Ln is the collection of all points that Ln can be projected
to in all primitives and
Lˆk(n)n = arg min
L∈Ln
d2(Ln, L). (3.3)
Equation (3.3) provides the means to solve the association problem between a
measurement and a collection of points that could generate that measurement. The
problem is solved using a greedy solution –selecting the closest point in the set to
24 CHAPTER 3. MODEL
Îk
mlkGm
Scan Direction
lk
m
lkGm
Îk
m
Scan Direction
Normal Direction
lk
m
lk Gm
Normal Direction
(a) Scan Direction (b) Closest Plane (case 1) (c) Closest Plane (case 2)
Îk
m
lk
Gm
Gn
Îk
n
lk GmÎk
m
Gn
Îk
n
Îk
m
lk
Gm
Gn
(a) Scan Direction (b) Closest Plane (case 1) (c) Closest Plane (case 2)
Îk
m
lk
Gm
Gn Îk
mlk
Gm
Gn
Îk
n
lk
Gm
Gn
Figure 3.2: LiDAR Observation Models, depicting data association using (a) scan line
direction and, (b) and (c) closest plane.
the measurement as the generating point1. To obtain the likelihood of a measure-
ment conditioned on all world primitives we can generalize the above procedure as
Ln|G ∼ N (d2(Ln, Gˆk(n)); 0, σ2), where d2(Ln, Gˆk(n)) is the distance between point Ln
and primitive Gˆk(n) and
Gˆk(n) = arg min
G∈G
d2(Ln, G). (3.4)
Again we note that the computation of equation (3.4) is computationally intensive since
it requires a search over all world primitives. We will discuss implementation details
that mitigate this in section 4.3.
 3.2.2 Image observation model
The image observation model depends not only on the camera parameters, but also
on the world geometry. Pixel (u, v) of the nth image of camera c can be expressed as:
Icn(u, v) = Am∗(u
′, v′) +Qn (3.5)
where Am∗ is the appearance of the m
∗ primitive, at some coordinate (u′, v′), and Qn
is a zero mean Gaussian distribution with variance r2m∗ , i.e. Qn ∼ N (q; 0, r2m∗).
Note that while not shown explicitly in equation (3.5), m∗, u′ and v′ are functions
of u, v,Kc, Tc, and G. We can see this if we interpret equation (3.5) as a mapping of
color values from image pixel (u, v) to the appearance of the primitive m∗, this depends
on the camera parameter, i.e. where the camera looking, and the visibility of pixels u′
and v′ of plane m∗, i.e. occlusion reasoning.
We further note that we choose r2m∗ , the variance of Qn to depend on the angle
between the triangle in question and the camera’s viewing direction, (inverse of the
1Data association greatly increases the computational complexity of many inference procedures. For
the purpose of this thesis we will utilize a simple greedy method (closest fit). More complex data
association methods could be used, but they are beyond the scope of this work.
Sec. 3.2. Observation Models 25
Figure 3.3: Notional representation of the world model. Top row : Textured represen-
tation and triangles (uniquely color-coded and slightly separated), Middle and Bottom
rows: 3D triangles and texture representation per triangle.
absolute value of the dot product between viewing direction and triangle normal will
be used throughout this work2). With these choices, we can write,
p(Icn|G,A,B,Kc, Tc) =
∏
k∈Sn
N (ik; am∗(k), r2m∗) (3.6)
where Sn is the set of pixels in image In.
Background Model
The image observation model described earlier is not complete. Pictorially, this can
be seen in Figure 3.3, where all the white pixels in the textured world image are not
currently included in the model representation. Mathematically, equation (3.6) does not
explain all observed pixels, only those which project onto existing geometric primitives,
all other pixels are undefined.
As such, we include an additional background variable so that all pixel observations
are included in the likelihood calculation. That is, allow image pixels to be explained
by either a world primitive or a background model. There are a variety of methods
by which this could be accomplished. We choose to model background as a ground
plane since the camera is typically pointed downwards. We can see the introduction of
the background model for the notional example in Figure 3.4, where the background is
shown as a stripped cyan plane at ground level that extends across the entire image.
To obtain the new observation model, we need to add the terms that explain the
2It is important to point out that this noise parameterization heavily penalizes low grazing angles.
Such situations arise when the viewing direction and primitives are almost orthogonal. The dot product
should never be zero, since that implies the primitive is not visible
26 CHAPTER 3. MODEL
Figure 3.4: Notional representation of the world model including the background (aqua
colored)
background:
In(u, v) =
{
Am∗(u
′, v′) +Qn (u, v) ∈ V
Bs +W (u, v) ∈ B
(3.7)
where Bs is one component of the background B, V is the set of pixels explained by
world triangles and B is the set of background pixels. For simplicity let us assume that
the set B has R unique pixels in it.
With this change the likelihood for the nth image in camera c takes the form:
p(I|G,A,B,Kc, Tc) =
∏
k∈V
p(In,ck |G,A,Kc, Tc)
∏
k∈B
p(In,ck |G,B,Kc, Tc), (3.8)
where we have moved the image index, n, and the camera index, c, to the superscript
of I to allow for pixel index k. The background model is not of primary interest and
furthermore, owing to the formulation, the model and associated inference procedure
can marginalize over this quantity as follows:
p(I|G,A,K,T) =
∫ Nc∏
c=1
NI∏
n=1
p(In,c|G,A,B,Kc, Tc)p(B)dB (3.9)
=
Nc∏
c=1
NI∏
n=1
∏
k∈Vnc
p(In,ck |G,A,Kc, Tc)
∫ Nc∏
c=1
NI∏
n=1
∏
k∈Bn
p(In,ck |G,B,Kc, Tc)p(B)dB (3.10)
Where equation (3.8) was substituted in equation (3.9) to get (3.10). Note that the
image pixels described by existing primitives can be moved outside of the integral. We
can perform the integration in (3.10) component-wise for each of the R pixels in the
background. If we let
p(In,ck |G, B,Kc, Tc) = N (xk; bs, σ2w) (3.11)
and
p(bs) = N (bs;µb, σ2b ) (3.12)
Sec. 3.3. Prior Probability Models 27
correspond to a single background pixel, then the integral for that component, say bs,
can be evaluated to be:
ϕ
(
x;µb, σ
2
b , σ
2
w
)
=
∫ N∏
k=1
p(Ik|G, B,Kc, Tc)p(B)dB
=
σw
(
√
2piσw)n
√
Nσ2b + σ
2
w
exp
{
−1
2
(
1
σ2w
N∑
k=1
x2k +
µ2b
σ2b
)}
× exp

(
σ2b
∑N
k=1 xk + σ
2
wµb
)2
2σ2wσ
2
b
(
Nσ2b + σ
2
w
)
 (3.13)
For proof of this see §C.1.
The background marginalizing of equation (3.13) has the expected consequence of
coupling pixels across different images that observe the same background location. This
can be easily seen by noticing that the product inside the integral does not distinguish
which image the observations come from, simply that they belong to the same back-
ground pixel.
Following equation (3.13), the image likelihood can be expressed as:
p(I|G,A,K,T) =
∏
r∈B
ϕ
(
xr;µb, σ
2
b , σ
2
w
) Nc∏
c=1
NI∏
n=1
∏
k∈Vnc
p(In,ck |G,A,Kc, Tc) (3.14)
where xr are the image values that correspond to background pixel r across all the
images.
Equation (3.14) has two components, the first couples the background across the
images and computes the marginalization. The second, computes the likelihood of a
given image pixel under the current camera and world parameters.
 3.2.3 GPS observation model
The GPS observation model is a simple additive Gaussian model, Zcn = Tc + W
where W ∼ N (w; 0, σ2w). This leads a Gaussian likelihood,
Zcn|Tc ∼ N (z;Tc, σ2w). (3.15)
 3.3 Prior Probability Models
In Bayesian modeling, information that is known a priori regarding a parameter of
interest is encoded via what is referred to as the prior probabilistic model. These models
are typically tailor-made for applications and highlight some desired characteristics by
placing higher or lower probability on certain regions of the parameter space. In this
section we discuss the prior models used in this work. We further note that in our case,
28 CHAPTER 3. MODEL
the posterior model is dominated by the likelihood probability in that the influence of
the prior model quickly diminishes.
 3.3.1 Appearance Model
As we will discuss later, a natural prior probability model to use in the appearance
model is a Normal distribution. The use of this model will lead to a close form inference
computation, greatly reducing computational cost (c.f. §3.4) .
 3.3.2 Geometry Model
The geometry prior model is not as easily determined as the appearance prior model.
In order to select a model to use we must carefully consider the type of behavior that we
want to encourage (or discourage) in the model. Furthermore, we should choose prior
models that reflect the geometrical properties present in man-made urban environments.
We suggest two such prior geometric models. The first is a prior model on the normal
of a triangle. The motivation behind this model is that in man-made environments,
surfaces tend to be either horizontal or vertical. In aerial images, the majority of
surfaces encountered are either ground or rooftops, or building sides. Thus placing a
prior model that encourages horizontal or vertical directions seems to be in-line with
both our modeling goals and the underlying world geometry. A third order symmetric
Dirichlet distribution with small concentration parameter can achieve this purpose.
The second model is related to the size of each triangle rather than the orientation.
We would like world primitives to have roughly the same size, this has two key advan-
tages, first: it discourages triangles from shrinking or growing with out bounds; second,
it allows the user to pick texture sizes independent of triangle size variability. Mathe-
matically, we can place a normal distribution on the triangle area or perimeter, where
the mean is the desired property value and the variance controls how much deviation
we would allow.
 3.3.3 Extrinsic Parameter
As we mentioned in the model discussion, Tc is the trajectory of the c
th camera. That
is, we are not only modeling the camera position and orientation for a given image, but
the joint position and orientation for a given sequence of images. To this end we use a
Gaussian Process Prior (GP prior) over the camera extrinsic parameters. The Gaussian
Process prior incorporates the knowledge that aerial images are typically obtained in
a regularly spaced temporal sequence, and that the camera extrinsic parameters vary
smoothly across the images. This is typically the case when a single moving platform
collects images.
Mathematically, we can motivate the use of the prior model as follows in one di-
mension. Let Tn be the camera parameters of the n
th image, and C ∼ GP(0,Σ) be
Gaussian Process from which Tn is an observation. That is, Tn = C + W , where
W ∼ N (w; 0, σ2I). It follows that, Tn|C ∼ N (C, σ2I), where I is the identity matrix.
Sec. 3.4. Inference 29
If we marginalize over C, this leads to Tn ∼ N (0,Σ + σ2I). This can be interpreted as
Tn following a Gaussian Process with covariance kernel that is the sum of the Gaussian
Process prior kernel, Σ, and a constant term, σ2I. Letting the kernel be the squared
exponential (eq. 2.20), the kernel of Tn becomes:
k(x, x′) = υ2 exp
(
−(x− x
′)2
2`2
)
+ σ2δx=x′ (3.16)
this kernel is a small deviation from the one presented in §2.9.
As we will see shortly, it will be useful to consider looking at the distribution of
Tn|T\n, this can be interpreted as predicting a new camera parameter given N−1 other
parameters from a Gaussian process, or as obtaining the conditional of a multivariate
Gaussian with N dimensions. Let us take the second point of view for simplicity. Then
if T ∼ N (0,Σ), we can obtain Tn|T\n ∼ N (µn,Σn) via Shur’s complement of T , such
that
µn = x
>C−1x (3.17)
and
Σn = k − x>C−1x (3.18)
where x is the nth column of Σ without the nth entry, C is the submatrix obtained by
removing the nth column and row from Σ, and k is the (n, n) entry of Σ.
We note that while we found the GP prior useful in our implementation it can be
easily replaced with a more general (non-informative) prior model, such as uniform.
This would be needed if the image order is not known apriori or if a small UAV was
used as an imaging device (the flight path of small UAVs are highly susceptible to
changes in wind direction and hence not smooth).
Throughout this work we assume that images collected by the same camera are
obtained at a regular time interval, i.e. xj − xi ∈ Z,∀(i, j) ∈ Ni (for use in equation
(3.16)). This constant frame rate assumption is enforced to simplify the bookkeeping,
and work with the image position in the sequence rather than the image collection time.
 3.3.4 Intrinsic Parameter
We opted to use a uniform prior model over the camera focal length for simplicity of
computation. While one could exploit the knowledge that aerial images generally have
longer focal lengths, the complications of incorporating such a model are unlikely to
impact algorithmic performance in our case. An example of such a prior model might
be a Gamma distribution with a suitable shape and scale parameter.
 3.4 Inference
This section describes the inference algorithms used. For simplicity of exposition
we discuss the case where only one camera is used, (Nc = 1), and drop the subscript.
As we will see, computational efficiency is the main guiding factor in what follows.
30 CHAPTER 3. MODEL
 3.4.1 Appearance
In the model presented inference over appearance would take the form
p(A|I,G,K, T ) ∝
NI∏
i=1
p(Ii|G,A,K, T )
Np∏
k=1
p(Ak) (3.19)
Using the observation model described in section 3.2.2 implies that the first term of
the rhs of equation (3.19) is normally distributed. If the prior probability model on Ak
were also normal, we could solve equation (3.19) in closed form for observed pixels. We
note that ultimately Ak are pixels on an image with support [0, 255] for 32 bit images,
a normal distribution has infinite support; this implies we are placing probability mass
on non-realizable values. We emphasize that this amount is negligible for a suitable set
of means and variances, µ = 128 and σ = 10 for example, furthermore we assume that
the underlying surfaces are continuous, allowing any resolution and we only observe
integer values due to physical limitations in the imaging device.
Since both terms in the rhs of equation (3.19) are Gaussian, the lhs would also be
a Gaussian, that is
p(A|I,G,K, T ) = N (a; µˆ, σˆ2) (3.20)
where
µˆ =
µr˙2 + σ2
∑n−1
i=0 r˙
2
\izi
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
(3.21)
σˆ =
r˙σ√
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
, (3.22)
r˙ =
∏n−1
j=0 rj , and r˙\i =
∏n−1
j=0,j 6=i rj . For proof see appendix B.
Equations (3.21) and (3.22) provide the update for each pixel of the appearance
images. Note that while the form of eq. (3.21) and (3.22) presented here are general,
they can easily be simplified to the case where image noise is constant by letting r˙ = rn,
and r˙2\i = r
n−1, and to the single observation case, by removing the summations (for
proof of this see B.4).
 3.4.2 Camera Parameters
Inference over camera parameters can be decomposed as inference over intrinsic and
extrinsic parameters respectively.
Intrinsic Parameters
The joint model of equation (3.1) leads to the following intrinsic inference equation:
p(K|I,G,A, T ) ∝ p(K)
NI∏
i=1
p(Ii|G,A,K, T ) (3.23)
Sec. 3.4. Inference 31
Unlike the appearance estimation we cannot solve (3.23) in closed form due to the
intricate dependence of the image likelihood in the intrinsic parameters. Recall that the
likelihood dependence on K of equation (3.14) is hidden into the computation of Am∗
and (u′, v′), as well as the assignment of pixels to the background. This computation
requires a nonlinear projective transformation from 3D world coordinates to 2D image
coordinates. Instead we can either optimize equation (3.23) to obtain a MAP estimate
or use an MCMC methods to sample from the full distribution as an approximation.
This choice is highly dependent on our modeling goals. For computational reason, we
focus on optimization methods (e.g. MAP estimates).
Regardless of optimization or sampling, one still needs to evaluate equation (3.23).
In our implementation we rely on the OpenGL pipeline to generate virtual images with
parameters K and T and scene G and A to evaluate the likelihood as discussed in
section 3.2.2, implementation details will be discussed in chapter 4.
Extrinsic Parameters
The joint model leads to the following inference expression over extrinsic parameters
for the nth image:
p(Tn|G,A, In,K,T\n, Zn) ∝ p(In|G,A, Tn,K)p(Zn|Tn)p(Tn|T\n) (3.24)
Equation (3.24) contains not only the image likelihood, and the GPS likelihood,
but also the likelihood of the current camera parameters conditioned on all the other
configurations. This term, is a result of the Gaussian Process prior placed on T , and
can be evaluate as discussed in section 3.3.3.
We note that p(Tn|T\n) encourages extrinsic parameters configuration that fit well
with the sequence of camera parameters not just the current observation. Furthermore,
the hyperparameters of the GP prior enforce a set of characteristics that can be learned
from the data as discussed in section 2.9. These characteristics are data specific; in our
application it typically translates to spatial smoothness over the image sequence. As
with the intrinsic parameters, extrinsic parameters can be optimized or sampled. For
computational reasons we will focus on optimization methods.
 3.4.3 Geometry
Inference over geometric variables in the model presented takes the form:
p(G|L, I,A,T,K,B) ∝
NI∏
i=1
p(Ii|G,A,B,K, T )
Nl∏
l=1
p(Ll|G)
Np∏
k=1
p(Gk) (3.25)
Similar to camera pose inference, the complicated form of the image likelihood
precludes exact inference procedures for equation (3.25) in close form. Furthermore,
the LiDAR likelihood is computationally intensive since the data association problem
discussed in section 3.2.1 requires doing a distance minimization for each LiDAR ob-
servation over the entire collection of world primitives. As before we will optimize for
32 CHAPTER 3. MODEL
configuration of world geometry that maximizes equation (3.25). As a consequence of
this, the results of geometry estimation will be initialization dependent. Implementa-
tion details that attempt to reduce the computational complexity of this problem will
be discussed in chapter 4.
Chapter 4
Implementation
The idea of computational efficiency was presented in chapter 3 when discussing
inference techniques for the proposed model. And, while efficient algorithms were pre-
sented, the crucial implementation details needed to achieved the desired level of effi-
ciency were not discussed. As such, the goal of this chapter is to present and examine
those implementation details.
This chapter will first analyze the appearance computation; once this key component
is well understood, we will examine the image and LiDAR likelihood equations and
discuss efficient procedures to evaluate them. Following the likelihood computation
we will discuss a parallel implementation of the geometry updates. The chapter will
culminate with the introduction of texture atlases to speed-up the rendering of large
and complex scenes. Throughout this chapter we will provide pseudo-code algorithms
for CPU and GPU implementations.
 4.1 Appearance Computation
In the model presented in chapter 3, c.f. 3.1, Am the appearance of the m
th primi-
tive is an image of a user specified size, T ×T . We are interested in learning the values
for each pixel according to the update equations (3.21) and (3.22). For a set of cam-
era parameters and world structure, the appearance equations require collecting all of
the observation pixels (image pixels) that correspond to a given pixel in the primitive
appearance and weighting them appropriately to produce the desired texture. We note
that only visible triangles need to go through this procedure, since we do not have any
information for occluded or out of view primitives.
The key piece of information needed for the computation above, is knowing the
forward mapping from image pixels to triangle appearance pixel. This mapping can
be decomposed into two parts; first, we need to associate an observed image pixel to a
triangle, and secondly, we need to determine which pixel in the triangle’s appearance
map generated the observation. One can certainly think of many ways of obtaining the
information above, from explicitly searching over all appearance and observation pixels
(reverse mapping), to projecting each image onto the world and search for appearance
matches (hybrid forward and reverse mapping).
For our purposes we use a forward mapping from image pixels to world appearance as
33
34 CHAPTER 4. IMPLEMENTATION
Triangle Mapping
Unique TextureTexture Mapping
Gk Appearance
Image
Triangle Mapping
Unique TextureTexture Mapping
Gk Appearance
Image
Figure 4.1: Appearance computation procedure. First determine which primitive gen-
erated the observation using a triangle mapping, secondly using the texture mapping
determine which pixel generated the observation.
described earlier. The forward mapping has several advantages. First, the complexity
is bounded by the number of images and the size of each image; we point out that
the reverse mapping depends on the number of world triangles and the size of the
appearance image, and thus scales poorly to scenes with a large number of primitives1.
Secondly, using the OpenGL rendering pipeline the data association can be easily and
quickly determined as demonstrated next.
The mapping of image observation to triangle appearance can been seen in Figure
4.1. The mapping can be explained in two parts. The first part, shown in the top row of
the figure, maps every observation pixel to a triangle (or background) via a unique color
mapping of each primitive. This data association problem can be performed efficiently
by using the GPU rendering pipeline to encode the index of the primitive as an RGB
triple. As a consequence, observing a particular RGB triplet correspond to observing a
particular primitive. In this work, we used the mapping:
i =
r + 256g + 2562b
δ
(4.1)
1Forward mapping has the same dependency. However, the update computation is only performed
over primitives which have observations associated to them.
Sec. 4.1. Appearance Computation 35
and
r = δ ∗ i & 0x0000FF,
g = (δ ∗ i & 0x00FF00) >> 8,
b = (δ ∗ i & 0xFF0000) >> 16, (4.2)
where δ = 256
3
Np
, & is the logical and operator, and >> is the right shift operator.
Furthermore, we use the alpha channel to differentiate between background and
world triangle, (background is given an alpha value of 0, while primitives are set to the
max allowed, 255), this allows for up to 2563 unique primitives. We note that other
encoding schemes are possible, for example background can always be encoded as black
and allow the use of the alpha channel for additional primitives (allowing 2564 − 1
unique primitives). We found our simple scheme to be sufficient.
The second part of the mapping involves determining which pixel in the appearance
map generated the observation. This mapping can be seen in the bottom row of Figure
4.1. For this mapping we create a unique texture, the same size as the appearance map
where each RGB triple encode the pixel location, similar to the mapping for primitive
number but using δ = 1. For textures smaller than 256 × 256, this mapping can be
achieved with a single render that encodes both x and y location (red channel would
be x and blue channel would be y). If the texture is larger, separate renders is required
for each dimension.
We can put these two pieces together and obtain the appearance estimation method
shown in Algorithm 4.1 for a CPU computation, or Algorithm 4.2 for a GPU computa-
tion. The main distinction between CPU and GPU implementation is that on CPU all
the rendering is done ahead of time and stored, then the textures are computed; in the
GPU implementation due to memory constraints, renders are done one at a time and
aggregated to the texture, once all images are accounted for the texture is normalized.
Details for the GPU kernels were not provided since their function is the same as the
pixel loop in the CPU algorithm.
Intuitively the forward mapping makes sense since it only consider observation pix-
els, i.e. it assigns observation pixels to appearance pixels that are visible and does not
try to fill in unobserved pixels. Stated differently, it produces holes in the appearance
maps. These holes are not visible when the textured world is seen from an observation
view, but when new views are considered, pixels that were not previously seen, now
become visible.
A possible solution for minimizing this rendering artifact is to allow each observation
to influence its neighbors as well. We can think of the neighbor mapping as letting
an observation pixel map to a region in the appearance image rather than a single
pixel. Empirically we have found that using neighbor-hole-filling on the four connected
neighbors works well, and does not over-smooth appearances. We further note, that
selecting a smaller appearance image per primitive can also reduce the number of empty
pixels.
36 CHAPTER 4. IMPLEMENTATION
Algorithm 4.1 Appearance Estimation pseudo-code - CPU
1: Compute unique world projection for all images Imaptri (associates each pixel with a
triangle or background)
2: Compute unique texture projection into all images Imaptex (associates each pixel with
a texture coordinate)
3: for k = 1:1:Np do . compute image weights
4: wk = 1
5: Compute plane normal nk
6: for i = 0:1:numImages do
7: Compute image view direction vi
8: Compute triangle-image weight wik =
1
|v>i nk|
9: Compute triangle weight wk = wk ∗ wik
10: end for
11: end for
12: Allocate memory for Atexk and A
weight
k (the texture values and weights), set to zero
13: for i = 1:1:Ni do . Loop over images
14: for p=0:1:size(Ii) do . Loop over pixels for i
th image
15: if Ii(p) ==background then
16: Continue pixel loop
17: end if
18: Compute triangle source of Ii(p), by looking at I
map
tri (p), call it T
19: Compute texture location of Ii(p), by looking at I
map
tex (p), call it P
20: Compute triangle-image weight wiT =
1
|v>i nT |
21: Compute iteration weight w = wT
wiT
22: Set AtexT (p) = A
tex
T (p) + wIi(p)
23: Set AweightT (p) = A
weight
T (p) + w
24: Optional Fill in neighbors of p, as in lines 22 and 23.
25: end for
26: end for
27: Compute and save weighted texture for Ak by A
tex
k /A
weight
k
28: Optional Fill in empty pixels in Gk, with four neighbor average.
29: De-allocate memory for Atexk and A
weight
k
In terms of computational complexity both CPU and GPU algorithms are bounded
by the number of images and the size of each image. The main limitation of the CPU
algorithm is the data transfer between main memory and device memory, since it re-
quires transferring two images from device to main memory for each observation, and
then transferring all the appearance textures from main memory to device memory.
The GPU implementation does not suffer from this problem since the texture maps
are never transferred, i.e. they remain in device memory and are accessed via the tex-
Sec. 4.2. Image Likelihood Computation 37
Algorithm 4.2 Appearance Estimation pseudo-code - GPU
1: Compute Weights as before.
2: Create Triangle Map and Texture Map framebuffers (assign 2 textures pointers,
depth and color to each)
3: Allocate device memory for Atex,Aweight, Imaptri , and I
map
tex .
4: for i = 1:1:Ni do . Loop over images
5: Bind Triangle Map framebuffer, render triangle map as viewed from image i.
6: Bind Texture Map framebuffer, render texture map as viewed from image i.
7: Call texture Accumulation Kernel. . Same as pixel loop in algorithm 4.1
8: end for
9: Call texture normalize kernel. . Computation of line 27 in algorithm 4.1
10: De-allocate memory for Atexk and A
weight
k
ture accumulation kernel; similarly, the texture data always remains in device memory.
Hence the main limitation of the GPU computation is the render itself.
Table 4.1 shows the computation time and speed up factors for several datasets.
We can see from the table that at worse the GPU implementation is three times faster
than CPU.
Data Toy Problem Lubbock CLIF (subset)
# Triangles 24 80,000 80,000 227,000 227,000
# Images 15 3 3 24 24
Texture /Atlas Size 256/1024 8/512 16/1024 8/1024 16/1024
CPU Time (s) 1.747 1.361 2.435 8.145 11.304
GPU Time (s) 0.564 0.201 0.193 1.826 1.9404
Speed up 3.1 6.7 12.6 4.5 5.8
Table 4.1: CPU/GPU speedup (times are the average of 5 estimations), all test done
on an NVIDIA GTX-580 graphics card
 4.2 Image Likelihood Computation
Estimation of camera pose and world geometry rely heavily on the evaluation of the
image likelihood. Thus fast and efficient computation of this likelihood is crucial. In
this section we describe how we implement the image likelihood and how we leverage
the power of graphics hardware to obtain high performance.
The image likelihood as given in section 3.2.2 is:
p(I|G,A,K,T) =
∏
r∈B
ϕ
(
xr;µb, σ
2
b , σ
2
w
) Nc∏
c=1
NI∏
n=1
∏
k∈Vnc
p(In,ck |G,A,Kc, Tc), (4.3)
38 CHAPTER 4. IMPLEMENTATION
reproduced here for convenience, where the first term in the rhs performs the back-
ground marginalization and the second term computes the observation likelihood. As
discussed previously, the background marginalization couples all the observation pixels
that belong to a given background pixel as given by equation (3.13). This compu-
tation involves identifying which observation pixels should be grouped together, and
computing some statistics over them.
Recall, that the background in the model is a plane at ground level. For computa-
tional simplicity of the background data association we can attach an image of some
user specified size, say Bs×Bs to this plane. With this image in place, we can perform
the observation data association in a similar manner as the texture location was done
in the previous section. That is, we can let the background image have an alpha value
of zeros as previously mentioned, but also encode each background pixel’s RGB triple
with its location using equations (4.1) and (4.2), with δ = 1.
The evaluation of the data term in the rhs of equation (4.3) requires identifying the
association between an observation pixel and a world appearance pixel, then evaluating
a Gaussian distribution. This computation can be done efficiently in the OpenGL
pipeline with two renders, the first would map observation pixels to triangles, and
the second would map appearance pixels to observation pixels. With the information
provided in these two renders we can turn equation (4.3) into:
p(In|G,A,Kc, Tc) =
∏
k∈Sn
N (ik − iˆk; 0, r2m∗). (4.4)
where iˆ is the textured mapped world as viewed from image n.
The joint evaluation of the background and image can be seen in algorithm 4.3.
Several observations can be made about algorithm 4.3. First, we point out that the
background pixel mapping can be done in either the texture mapping or the triangle
mapping render, since it will not interfere with either evaluation. Secondly, notice that
the purpose of the triangle mapping render is to identify the noise level that should be
associated with the Gaussian evaluation, since we are not interested in knowing which
primitive generated the observation, only its appearance value which we obtain from
the texture mapping.
The above discussion suggests that we can improve performance by rendering only
once, for the texture mapping, and not for the triangle mapping. Mathematically, this
change implies changing the noise characteristics for the image observation model from
image and primitive dependent to a fixed value. As a preview of results to come, we
highlight that this change is possible.
The pseudo-code version of the image likelihood computation in GPU is identical
that of CPU. We note that while the pseudo-code is identical some implementation
details do change, for example we must be careful about memory usage since it is limited
in graphics hardware. Furthermore, some operations such as variable increments must
be dealt with great care, since in a multi-thread environment, multiple threads trying
to write to the same variable at the same time will yield undefined results. For this
Sec. 4.2. Image Likelihood Computation 39
Algorithm 4.3 Image Likelihood Evaluation pseudo-code - CPU/GPU
1: Compute image weights for all triangles in all images, call it wmn , where n indexes
image, and m indexes planes
2: Set ll = 0;
3: Create array of structure bgTrack, size B2s , with elements: count, x, xsquared
(initialize all elements to zero).
4: for i = 0:1:Ni do
5: Render texture map world as viewed from image i, transfer framebuffer image
to main memory, call it Iˆ.
6: Render Triangle map as viewed from image i, transfer framebuffer image to main
memory, call it Itri. . note that Iˆ , Itri, and Ii have the same size
7: for p=0:1:size(Ii) do
8: if Itri(p) 6= background then
9: Map Itri to plane number using eq. (4.1), call it m.
10: Set σ = wmi
11: Increment log likelihood ll = ll +− (Ii(p)−Iˆ(p))2
2σ2
12: else
13: Map Itri to background pixel, call it b
14: Increment counts: bgTrack[b].count +=1
15: Increment observations: bgTrack[b].x += Ii(p).
16: Increment squared observations: bgTrack[b].xsquared += I2i (p)
17: end if
18: end for
19: end for
20: for i = 0:1:B2s do . computes background contribution
21: if bgTrac[i].count>0 then
22: Compute background marginalization of equation (3.13) using the elements
of bgTrack[i], call it ϕ
23: Increment log likelihood ll = ll + ϕ.
24: end if
25: end for
particular issue, shared memory can be instantiated and allow thread block elements
to sync their values, then add the local value to the global one, typically using atomic
operators.
Table 4.2 shows the computation time and speed up factors for the image likelihood
computation for several datasets. We can see from the table that at worse the GPU
implementation is three times faster than CPU.
40 CHAPTER 4. IMPLEMENTATION
Data Toy Problem Lubbock CLIF (subset) CLIF (all)
# Triangles 24 80,000 227,000 227,000
# Images 15 3 24 50
Image Size 1200x900 1336x891 822x1326 822x1326
Texture/Atlas
Size
256/1024 8/512 16/1024 16/1024
CPU Time (s) 2.252 0.295 2.865 5.858
GPU Time (s) 0.754 0.066 0.424 0.873
Speed up 2.99 4.47 6.76 6.71
Table 4.2: CPU/GPU Image Likelihood evaluation speed-up (times are the average of
5 estimations), all test done on an NVIDIA GTX-580 graphics card
 4.3 LiDAR Likelihood Computation
As discussed in §3.2.1 the LiDAR observation model contains a data association
problem2, for which the model selects the closest world primitive as the one that gen-
erated the LiDAR measurement. Mathematically, the association can be solved by:
Gˆk(n) = arg min
G∈G
d2(Ln, G). (4.5)
as presented earlier. This simple equation has some computational issues associated
with it since it involves for every LiDAR point searching for the closest triangle from
the collection of all world triangles, i.e. O
(
NiNp
)
, where Ni is typically in the orders
of high hundreds of thousands and Np in the low-mid hundreds of thousands.
Intuitively, we can reduced the number of primitives considered by equation (4.5)
since it is very unlikely that a primitive very far from the measurement generated it.
Again, this computation would require computing distances between the LiDAR point
in question and all the primitives, the same computation we want to avoid.
One approach that we can take is to use data structure that facilitates this search,
in particular space partitioning data structures [45]. By using such data structures we
can quickly remove a great quantity of search primitives. Out of the data structures
considered, point-based k-d trees was chosen due to its simplicity and expressibility.
A K-D tree is a binary tree where the underlying k dimensional space is partitioned
on just one attribute at each level (and thus attributes are being cycled through as a
function of levels); at each level a test is made that determines which branch of the tree
to take.
Ideally, we would like to build the tree over the set of primitives and for each input
LiDAR measurement identify the closest primitive. This idea, however, is prohibitive
since the set of primitives is changing (i.e. we are learning the geometry parameters)
2Unlike the appearance likelihood this association problem cannot be solved via renders since the
LiDAR point visibility depends on viewing direction.
Sec. 4.4. Geometry Computation 41
which would require re-learning the tree every time a primitive changes. Instead we
can learn the tree over the static LiDAR observations. Once we learned the tree, we
can input a primitive and obtain a list of observation close to the plane. We note that
since the k-d tree implementation only works with points, and query term consist of
the midpoint of the triangle and a maximum search radius. The kd-tree search returns
all tree points that are within the specified radius of the query. We select this radius
to be about twice the distance from midpoint to endpoint of the triangles. While this
might seem very broad, we found that this works well in practice.
This search allows multiple LiDAR measurements to be associated with a given
plane and multiple planes to be associated with a given measurement. Furthermore,
there could be planes with no LiDAR associated to them, however, the converse is not
possible; every LiDAR measurement must be associated with a plane. To ensure this is
the case, we must identify non-associated points and attribute a plane via exhaustive
search. For observations with multiple planes, the closest plane is selected.
Figure 4.2 shows four association examples, where we have color-coded the LiDAR
observations according to the distance from observation to plane. Figures 4.2a and
4.2b clearly shows the search radius allowed when using the k-d tree. We can see the
association interaction between the selected plane and the other world planes in Figures
4.2c and 4.2d, where the cyan and yellow points, can be associated with the selected
plane however this is unlikely since other planes are closer.
In terms of computation, building a tree for the CLIF Stadium dataset with 705,798
LiDAR points and 227,000 triangles takes 900 milliseconds. The k-d search takes on
average 0.98 milliseconds, with a minimum of 0.5ms and maximum of 3ms. The k-
d search is able to associate 99.99% of all LiDAR points to primitives; 37 points were
associated via exhaustive search, each taking on average 90ms. The above times suggest
close to two order of magnitude speedup by introducing the k-d tree search.
 4.4 Geometry Computation
In this section we discuss the implementation details for inferring world geometry
parameters. Namely, we are concerned with evaluating equation (3.25), replicated here
for primitive Gk for convenience,
p(Gk|L, I,A,T,K,B,G\k) ∝ p(Gk)
NI∏
i=1
p(Ii|G,A,B,K, T )
∏
l∈Ak
p(Ll|Gk) (4.6)
where Ak is the set of LiDAR measurements associated with Gk. As we saw in the pre-
vious sections, the evaluation of the likelihoods in this equation can be computationally
intensive; while the savings established in sections 4.2 and 4.3 help reduce the over-
all compute time of one primitive, the sheer number of primitives for which equation
(4.6) needs to be evaluated quickly increases the computation time. This is particularly
true for sampling or optimization where equation (4.6) becomes the target distribution
42 CHAPTER 4. IMPLEMENTATION
(a) Association (points colored by
distance)
(b) Association (points colored
distance)
(c) Association (other planes
shown in gray)
(d) Association (other planes
shown in gray)
Figure 4.2: LiDAR and world primitive association. (Color coding: planes in ques-
tion color-coded, points associated with planes color-coded according to distance, other
planes shown in gray, other points shown in orange-magenta)
Sec. 4.4. Geometry Computation 43
or the objective function, and thus requires being evaluated hundreds or thousands of
times per primitives.
Several observations can be made from equation (4.6). First, we can decouple
primitive Gk from all other primitives in the LiDAR likelihood if we do not change the
associations while evaluating the expression, i.e. fix Ak for all evaluations of primitive
Gk. The association would be updated after the current primitive or all primitives have
been updated. Second, when evaluating the image likelihood as a function of Gk only
a very small subset of pixels change. These two observations along with the desired to
reduce computation time motivate evaluations of equation (4.6) for multiple primitives
at a time.
The evaluation of equation (4.6) for multiple primitives at a time has the advantage
of reducing the number of draws required to compute the image likelihood, which is the
computational bottleneck as discussed previously. However, in order to take advantage
of this savings the sampling or optimization routines used must propose independent
perturbations for triangles in question; these perturbations must be combined to pro-
duce a single render and the likelihood for each primitive must be computed separately,
then the sampling/optimization scheme must make independent decisions based on each
likelihood evaluation. Assuming that this is the case, let us investigate the necessary
changes to compute image likelihoods in the multi-triangle case.
 4.4.1 Image Likelihood evaluation under multiple primitives
In order to extend the approach to multiple primitives we must determine a way
of computing the image likelihoods for a given set of primitives. As stated in equation
(4.6) the image likelihood is computed over all pixels despite not being associated with
primitive Gk. However, the only change in the likelihood should come from the pixels
affected by the change of Gk, since it is the only element varying. This however would
not be the case when multiple primitives are considered.
To see how we can properly account for pixels associated with multiple planes let us
take a closer look at the image likelihood computation. As described in section 3.2.2,
the image likelihood is
p(I|G,A,K,T) =
∏
r∈B
ϕ
(
xr;µb, σ
2
b , σ
2
w
) NI∏
n=1
∏
k∈Vn
p(Ink |G,A,K, T ), (4.7)
We can split set of visible pixels in the nth image, Vn, into those generated by primitives
which we are optimizing over, P, and those that are not, then the image likelihood takes
the form:
p(I|G,A,K,T) =
∏
r∈B
ϕ
(
xr;µb, σ
2
b , σ
2
w
) NI∏
n=1
∏
k∈P
p(Ink |G,A,K, T )
∏
k∈Vn\P
p(Ink |G,A,K, T )

(4.8)
44 CHAPTER 4. IMPLEMENTATION
We are interested in the set of pixels generated by the planes we are optimizing
over, P, and the interactions between this set and the background and other pixels. To
see this last part, consider the effect of shrinking a plane as part of an optimization
or sampling algorithm, if we were to only compute the likelihood over pixels generated
by that plane for this step only, the number of pixels over which this calculation takes
place would have strictly decreased from the previous evaluation, yielding an artificially
lower likelihood. Expressed differently, a degenerate solution would be to shrink planes
so that they are not visible. In order to avoid degeneracies as the one pointed above,
we must carefully account for pixel changes when evaluating likelihoods.
To better understand the bookkeeping necessary to achieve the correct calculation,
let us concentrate on calculating the partial likelihood for one triangle. We begin by
noticing that between any two likelihood evaluations, there are two possible changes:
first, the set of pixels over which the evaluation takes place can change; and second,
the value of the underlying rendered image can change. These two changes can be
visualized for a cartoon region and values in Figures 4.3a and 4.3b as the change in
the evaluation mask, and the color change in the rendered image. Let us denote the
value of the initial rendered image by f0(·), and the value of the new image by f1(·);
furthermore, lets denote the evaluation masks by M0 and M1 respectively.
With this notation we can calculate the change between the image likelihood as:
δ01(M0,M1) = f1(M0 ∪M1)− f0(M0 ∪M1). (4.9)
We note that δ01(M0,M1), properly accounts for changes in the evaluation mask and
the function values. If we wanted to obtain the new absolute likelihood, as apposed to
the change, we simply add the initial total likelihood and the change,
f1(M) = f0(M) + δ01(M0,M1)
= f0(M) + f1(M0 ∪M1)− f0(M0 ∪M1) (4.10)
To better understand, equations (4.9) and (4.10), let us analyze the notional com-
putation depicted in figure 4.3. Parts (a) and (b) of Figure 4.3 show the initial and
subsequent computation, where we denote changes in the image values by the change
in color from blue to red, and the change in pixels by the changes in the evaluation
masks (white denotes pixels to include, and black pixels to ignore). Figure 4.3c shows
the necessary steps in terms of masks and functions to evaluate to compute the relative
image likelihood. We can easily see special cases of this general computation from both
figure 4.3 and equation (4.9). For example, if the evaluation masks remain the same,
M0 = M1, then we are simply taking the difference between the image values, that
presumably have not changed, leading to a delta of zero and no change in the likelihood
function.
Now that we understand how to compute the incremental likelihood for a single
triangle, the extension to the multi-triangle case follows trivially, we simply let each
triangle have its own evaluation mask. With this minor adjustment we can use all of
the previous derivations.
Sec. 4.4. Geometry Computation 45
(a) Initial render values and evaluation
mask (M0)
(b) New rendered values and evaluation
mask (M1)
+= - -
(c) Relative Image Likelihood computation
Figure 4.3: Relative image likelihood computation. Suppose changing a primitive’s
parameter produces the images shown in in blue and red (with corresponding mask) in
(a) and (b). (c) shows how to combine the renders to produce the desired likelihood
computation.
Computationally we can evaluate the multi-plane image likelihood as shown in al-
gorithms 4.4 and 4.5. The computation is divided in an initialization step and a subse-
quent computation since the initial step is much simpler; it only requires being able to
evaluate current values. On the other hand, subsequent evaluations are slightly more
complicated since for every pixel that we encounter we have to not only compute its
current triangle association but its past association as well.
Several observations can be made from the algorithms, first we notice that memory
usage has dramatically increased, (more than doubled if using 4 byte floats for ll),
since we must now maintain all the previous pixel likelihood values for each image, as
well as the evaluation mask. This large memory consumption is prohibitive for graphic
hardware, where memory is limited. Furthermore, the use of structures in graphics
hardware is cumbersome, since memory management is difficult. As a result, graphics
implementation will require unrolling the structures into one continuous array, creating
extra bookkeeping details or creating extra kernels to handle memory operations.
46 CHAPTER 4. IMPLEMENTATION
Algorithm 4.4 Multi-Triangle Image Likelihood evaluation - Initialization -
CPU/GPU
1: Select N planes compute the likelihood over. (N < 256)
2: Initialize structure bgTrack (see algorithm 4.3), set all elements to zeros.
3: Initialize an array of N structure planeLL, with members imgLL, imgPxCount,
bgPixelsToAdd (set all to zeros) . bgPixelsToAdd variable in size, use containers
of variable size (e.g. std::vector)
4: Initialize array of Ni structure imEvalMask, with members mask, ll, and llpixel-
ToAdd, backgroundIndex. Set mask and ll to the size of the corresponding ith
image. . as above: llpixelToAdd, backgroundIndex are variable in size, use
appropriate container.
5: for i = 1:1:Ni do . image loop
6: for p ∈ Si do . pixel loop
7: if p visible then
8: Compute pixel log likelihood, call it ll.
9: Set imEvalMask[i].ll[p]=ll
10: Identify world triangle associated with current pixel, call it T.
11: if optimizing over T then
12: imEvalMask[i].mask[p]=T
13: planeLL[k].imgLL += ll; . the kth entry is associated with triangle T
14: end if
15: else
16: Add p to bgTrack . see algorithm 4.3.
17: end if
18: end for
19: end for
20: Compute background values. . see algorithm 4.3.
21: Fill background values in imEvalMask using llpixelToAdd, and backgroundIndex.
22: maintain imEvalMask and planeLL in memory.
Sec. 4.5. Texture Atlas 47
Both algorithms call for identifying the primitives associated with each pixel. This
however is computationally expensive and not necessary, since we are only interested
in knowing if we are optimizing over this triangle, and its index in the planeLL stack.
We can achieve this mapping by using the alpha channel of the triangle’s texture, then
visibility would be given by determining if alpha channel is greater than zero (previously
established as background) and different than ordinary triangles, 255. The position in
the planeLL structure can be determine by looking at the actual alpha value, once we
ensure we are not in one of the two former cases.
The results for the multi-plane geometry evaluation will be discussed in §5.2.2; as
a preview: no difference exists between single and joint evaluations (provided that
primitives do not overlap). Computationally, joint geometry estimation is over two
order of magnitude faster than its single counterpart.
 4.5 Texture Atlas
As we saw in §4.2, the ability to render a textured mapped scene is crucial for the
image likelihood computation. Furthermore as hinted above the limiting factor of the
GPU implementation is the speed of the rendering itself, rather than the computation.
As a result, we are interested in minimizing the drawing time for the scene.
A naive rendering routine would require a texture binding for every world primitive,
since each primitive has its own texture. Furthermore, the primitive’s parameter (vertex
coordinates, texture coordinates, normal, etc.) would be transferred from main memory
to device on every draw. Both of these components are very inefficient.
In order to improve performance, we can first push all world primitive parameters
to device memory, so that the communication penalty is suffered only when parameters
are changed, i.e. once, rather than at every draw. Secondly, we can group multiple
appearances into a texture atlas, and minimize the number of bindings required at
drawing time. With these modifications, at drawing time, we simply need to bind to
the texture atlas and primitive parameters, and specify how many elements to draw
(corresponding to the number of primitives per atlas).
In this section we discuss the implementation details for achieving this goal.
 4.5.1 Static Texture Coordinates
According to the proposed model, each triangle has a canonical texture, corre-
sponding to a right triangle of user-specified size (Figure 3.3). This modeling choice,
has several consequences:
• Texture Coordinates are static.
• Computed texture is subject to different scaling for each direction.
• Rendered texture pixel need not be square.
48 CHAPTER 4. IMPLEMENTATION
Algorithm 4.5 Multi-Triangle Image Likelihood evaluation - Subsequent computations
- CPU/GPU
1: Set all members of bgTrack and planeLL to zero, except imgLL.
2: Set llpixelToAdd and backgroundIndex to zero for all elements of imEvalMask.
3: for i = 1:1:Ni do . image loop
4: for p ∈ Si do . pixel loop
5: if p visible then . likelihood computation or background addition
6: Compute pixel log likelihood, call it ll.
7: else
8: Add p to bgTrack . see algorithm 4.3.
9: end if
10: Identify world triangle associated with pixel p, call it T .
11: Tˆ = imEval[i].mask[p]. . Identify previous triangle associated with pixel p.
12: if optimizing over T or Tˆ then . multi-plane bookkeeping
13: if optimizing over T then
14: Set index = k; . where k is the position of triangle T
15: Set planeLL[index].imgLL += ll
16: Increment planeLL[index].imgPxCount by one.
17: else if optimizing over Tˆ then
18: if p visible then
19: Set index = k;
20: Set planeLL[index].imgLL += ll
21: Increment planeLL[index].imgPxCount by one.
22: else
23: Set index = l; . where l is the position of triangle Tˆ
24: Add p to planeLL[index].backgroundPxToAdd
25: end if
26: end if
27: Subtract previous ll, planeLL[index].imgLL -=imEvalMask[i].ll[p]
28: end if
29: if p visible then . update visibility and ll for next iteration
30: Set imEvaMask[i].ll[p] = ll;
31: Set imEvaMask[i].mask[p] = T;
32: else
33: Set imEvaMask[i].mask[p] = -1;
34: Add p to imEvalMask[i].llPixelToadd
35: Add appropriate background pixel to imEvalMask[i].backgroundIndex
36: end if
37: end for
38: end for
39: Compute background values. . see algorithm 4.3.
40: Fill background values in imEvalMask using llpixelToAdd, and backgroundIndex.
41: Add background values in planeLL using bgPixelsToAdd.
Sec. 4.5. Texture Atlas 49
The first of these points is essential in practice, since the texture coordinates do not
need to be re-calculated when the geometry of a triangle is changed, yielding substantial
computational savings; however, these savings comes at a cost - the last two points
above.
A fixed texture coordinate defines a transformation of pixels from appearance map
(texture image), and the world primitive (triangle being textured mapped). This trans-
formation is implicitly done via OpenGL’s texture mapping, where the forward trans-
formation is achieved using the unique texture mapping (Figure 4.1) and the reverse
transformation is achieved using the static coordinates.
We note that having appearances be right triangles has the added benefit of allowing
two canonical textures to be stored using a single square image (the only type OpenGL
allows). Thus providing us the ability of drawing twice as many triangles per atlas
binding.
 4.5.2 Implementation
In this section we go through the necessary bookkeeping for creating, updating and
drawing the texture atlas.
Note that there are two user parameters:
• T is the texture size (texture will be T × T )
• A is the atlas size (atlas will be A×A)
where A ≥ T , and both are powers of two (A needs to be power of two to satisfy
OpenGL standard for textures, T need not be a power of two, but in order to use the
fast texture calculation developed earlier we need it to be a power of two as well).
The general idea it to have a texture atlas that looks like Figure 4.4, where even
textures are on the top portion of the texture rectangle, and odd textures are on the
bottom portion of the texture rectangle, below we describe the initialization, update
and drawing procedure for the texture atlas.
Initializing
Assuming that the number of triangles is known (call it M , and triangles are index
by k, where k ∈ [0,M − 1]). Then the initial set up of the texture atlas can be seen in
Algorithm 4.6. The algorithm can be broken down into three main parts, first allocating
main memory to contain the list of triangle properties, including the atlas associated
with each primitive as well as its texture neighbor. A texture neighbor, is the triangle
that shares the texture square in the texture atlas, i.e. 0 and 1, or 14 and 15 in figure
4.4, this information will be needed for the texture updates later.
The second part of the algorithm computes the texture coordinate in the atlas, this
computation is dependent on the triangle’s ordering. The last step of the algorithm is
to transfer all parameters to the graphics card.
50 CHAPTER 4. IMPLEMENTATION
Algorithm 4.6 Texture Atlas Initialization
1: Set the values of A and T .
2: Compute N = bAT c, the number of textures per dimension.
3: Compute Na = d M2N2 e, the number of atlases needed to represent all the triangles.
4: Create an array of Na atlases in device memory, each of size A×A, initialize them
to zero, call it pAtlas.
5: Create pTex an array that contains the atlas index for each triangle.
6: Create texNeigh an array that contains the texture neighbor (either up or down
for each triangle.
7: Compute δ = 1A , the one pixel offset in the diagonal between the top and bottom
texture.
8: Compute Texture Coordinates using the counter-clockwise order shown in Figure
4.5.
9: for k = 0:1:M-1 do
10: pTex[k] = pAtlas[k mod 2N2]. . assigns atlas to triangle
11: if k mod 2 == 0 then
12: Compute: x =
k
2
mod N2
N ; and y =
k
2
mod N2−x
N
13: T k1 = (x, y + δ)
14: T k2 = (x+
1
N − δ, y + 1N )
15: T k3 = (x, y +
1
N )
16: Set texNeigh = k + 1;
17: else
18: Compute: x =
k−1
2
mod N2
N ; and y =
k−1
2
mod N2−x
N
19: T k1 = (x+ δ, y)
20: T k2 = (x+
1
N , y)
21: T k3 = (x+
1
N , y +
1
N − δ)
22: Set texNeigh = k − 1;
23: end if
24: end for
25: Create device vertex array, normal array, texture coordinate array, and index array
( 0 to M − 1) and transfer primitive information.
Sec. 4.5. Texture Atlas 51(0,0)
(3t,3t)
(3t,0)
(0,3t)
0
1
2 3
4
5
6
7
8
9
10
11
12
13
14
15 16 17
(t,0) (2t,0)
(0,t)
(0,2t)
T3k
T1k
T2k
T1k+1 T2
k+1
T3k+1
(0,0)
(1,1)
(1,0)
(0,1)
0
1
2 3
4
5
6
7
8
9
10
11
12
13
14
15 16 17
U
v
Figure 4.4: Sample Atlas, where the numbers indicate which triangle the texture belongs
to, the coordinates are expressed in terms of atlas UV.
(0,0)
(3t,3t)
(3t,0)
(0,3t)
0
1
2 3
4
5
6
7
8
9
10
11
12
13
14
15 16 17
(t,0) (2t,0)
(0,t)
(0,2t)
T3k
T1k
T2k
T1k+1 T2
k+1
T3k+1
(0,0)
(1,1)
(1,0)
(0,1)
0
1
2 3
4
5
6
7
8
9
10
11
12
13
14
15 16 17
U
v Figure 4.5: Sample Atlas drawing direction (1,2,3) for both the kth and kth+1 triangles.
Update
Let us say we want to update the texture of the kth triangle, with some texture
image, data. The update procedure can be seen in Algorithm 4.7. The update procedure
consists of first identifying the orientation of the kth triangle and its neighbor; then
building the replacement image texData accordingly. Once the replacement image has
been created, it can be uploaded to the texture atlas using glTexSubImage2D with the
proper starting and ending coordinates.
Draw
The draw routine for texture atlases consist of Algorithm 4.8. It is important to
point out that the point draw order is counter-clockwise as shown in Figure 4.5.
52 CHAPTER 4. IMPLEMENTATION
Algorithm 4.7 Texture update for the kth primitive
1: if texNeigh is valid then . check status of neighbor triangle
2: if k mod 2 == 0 then . Orient the texture data
3: topTex = data, bottomTex = texNeigh.
4: else
5: topTex = texNeigh, bottomTex = data.
6: end if
7: Construct texData using the correct portions of topTex and bottomTex.
8: else
9: texData = data.
10: end if
11: if k mod 2 == 0 then
12: x = (k2 mod
N2
A2
) mod NA
13: y =
[( k
2
mod N
2
A2
)−x]A
N
14: else
15: x = (k−12 mod
N2
A2
) mod NA
16: y =
[( k−1
2
mod N
2
A2
)−x]A
N
17: end if
18: Replace part of the atlas image (easiest way is to use glTexSubImage2D, with x and
y as offsets, and texData as texture data).
Algorithm 4.8 Draw routine using texture atlases
1: Bind vertex, normal, texture coordinate, and index arrays.
2: Compute lastDrawCount = M mod 2N2. . this is the number of triangles to
draw on the last atlas (it might not be full)
3: for i = 0 : 2N2 : M − lastDrawCount do
4: Bind pTex[i] Texture.
5: Draw 2N2 primitives in order. (easiest way is to use glDrawRangeElements
6: end for
7: if lastDrawCount 6= 0 then
8: Bind pTex[i+ 1] Texture.
9: Draw lastDrawCount primitives in order.
10: end if
Sec. 4.5. Texture Atlas 53
 4.5.3 Speed up
Render times using individual textures and texture atlases are shown in Table 4.3.
From the render times we can see that texture atlases provide five orders of magnitude
speed up. Furthermore, the marginal gains of having larger and larger atlases can be
seen in the table below; up to the point where all triangles can be drawn on a single
atlas. Further increase of the atlas size can only slow down the rendering time since we
risk the texture being larger than what graphics hardware can cache.
It should be noted that for a scene with a moderate number of triangles, greater than
a hundred thousand, the texture atlas implementations achieves real-time rendering
(60Hz) while the individual textures do not (approximately 40Hz).
Method Texture Size Atlas Size Number of Render Time
Atlases (micro seconds)
Single Textures 8 - - 110,000 - 115,000
Texture Atlas 8 256 40 9.4 - 11.5
Texture Atlas 8 512 10 6.9 - 8.5
Texture Atlas 8 1024 3 5.5 - 7.3
Texture Atlas 8 2048 1 5.1 - 6.9
Texture Atlas 8 4096 1 5.7 - 7.2
Texture Atlas 8 8192 1 5.7 - 7.2
Table 4.3: Render time for different texture methods on 80k triangles. (tested: on
NVIDIA GTX-580)
54 CHAPTER 4. IMPLEMENTATION
Chapter 5
Results
In this chapter we discuss experimental results. The goal of these experiments is first:
to validate the model; second, to compare it to traditional reconstruction approach;
and third to demonstrate advantages of the model discussed in Chapter 3. We begin
by discussing some of the parameter s used throughout this work. We then analyze
the estimation of each latent variable of the proposed model, namely the geometry and
appearance of the primitives and the intrinsic and extrinsic camera parameters. We
continue by comparing reconstruction results with a publicly available implementation
of structure from motion (SfM). We conclude by highlighting some key aspects of the
model not shared by traditional approaches. The results presented here are based on
three main datasets, for a description of the data see appendix A.
 5.1 Experiment Parameters
As mentioned in chapter 3 we obtain MAP estimates for all random variables in
the model. For all results presented here the image noise model is assumed to be
independent and identically distributed (iid) Normal with zero mean and a 10 pixel
standard deviation, the appearance prior model was iid Normal with mean 128 and
standard deviation 15. LiDAR noise measurement has been chosen according to values
found in the literature [30] and consistent with empirical observations, iid Normal with
zero mean and 12 centimeters standard deviation.
Unless otherwise noted the canonical appearance for each primitive is assumed to
be 16 × 16 image. The size of atlas is 1024 × 1024. Throughout this work we use a
four neighbor appearance fill, as described in §4.1 , and the background appearance
consisted of 256 × 256 pixels. The GPS position noise parameters are modeled as iid
Normal with zero mean and with a standard deviation of 20 feet; the orientation is
again modeled as iid Normal with zero mean and 5◦ standard deviation.
 5.2 Model Validation
This section present results pertaining to the estimation of latent variables in the
model presented in chapter 3. The goal of the experiments is to provide some insight
into different aspects of the latent parameter estimation within the described model. In
55
56 CHAPTER 5. RESULTS
order to better understand the behavior of the model we will first discuss the appear-
ance, geometry and camera parameter estimation with an emphasis on computational
efficiency. Following that we will analyze the trade off between information sources
present in the model, particularly the trade off between geometry and appearance in-
formation. We will conclude the section with a brief look at geometry initialization and
its effect on reconstruction output.
 5.2.1 Appearance Estimation
As presented in §3.2.2, the image observation model contains a noise term that
is dependent on the camera’s viewing angle and the surface’s orientation; while this
model is intuitive it is computationally expensive. This section analyzes the appearance
estimation in the context of this noise model and provides a less computationally intense
model while maintaining the desirable results. This section also briefly examines texture
atlases.
Image Noise Model: Noise function of viewing angle
We begin by validating our choice to parameterize image noise by the angle of inci-
dence between the world plane normal and the image viewing direction, c.f. Figure 5.1.
From the figure we can see that the quality of the texture obtained from a given image
degrades as the normal of the triangle and the viewing direction become increasingly
orthogonal (left to right in the figure). We note that the poor quality is a result of
the reverse projection implementation (c.f. §4.1) and not of the observation. The high
variability in the quality of these textures prohibits choosing a single noise level for
which a suitable reconstruction would be possible. In order to maintain the simplicity
of an additive noise level, while modeling these artifacts we use a noise level that varies
as the viewing conditions between the image and the world geometry change.
Consequently, we choose the following standard deviation for the additive noise
term:
ri =
1
|n>mv|
where nm is the normal of the m
th triangle, v is the viewing direction of the current
image and i indexes pixels in the current image1. We can substitute this standard
deviation in the appearance estimation equations of sections 3.4.1.
One interpretation of the appearance update equations, (3.21) and (3.22), is that of
averaging the observed pixels and inversely weighting each observation by the associated
noise term. The weighted appearance contribution for the triangle in Figure 5.1, can
be seen in Figure 5.2. From the figure we can see that the main contribution comes
1We not that this standard deviation becomes infinite when the viewing angle and the primitive
normal are orthogonal. In our implementation when set the noise to a large value, 104, when this
happens
Sec. 5.2. Model Validation 57
planes_unweighted_4_12.PNG
planes_unweighted_4_14.PNG
planes_unweighted_4_4.PNG
planes_unweighted_4_0.PNG
Figure 5.1: Appearance as a function of viewing angles. Top row: Texture for a given
plane obtained from the rendered image shown in the bottom row.. As viewing angle
increases, texture quality rapidly degrades. This artifact is a result of the reverse
projection scheme, c.f. §4.1.
from the images that view the triangle almost head on; images 4 and 12, those in the
right side of Figure 5.1, receive almost no weight due to their oblique viewing angles.
Figures 5.3 and 5.4 show a similar behavior for a different triangle.
We emphasize that the weights are evaluated on a per-pixel basis. This is due to the
fact that over the set of observations, not all pixels for a given primitive are observed
an equal number of times (eg. due to occlusions). This has the expected consequences
of having variable weights for each pixel (rather than each image) as can be seen in
Figure 5.5. From the figure we can see that despite having a poor viewing angle some
pixels, e.g. the strips shown in the bottom center, get a high weight since they are only
visible in a small set of images, possibly only one. However, this has the disadvantage
of having to compute the weights for each image and world triangle combination; which
can be computationally expensive for scenes with a large collection of triangles, such as
the ones we are interested in.
Image Noise Model: Constant Noise
While the variable noise model presented in the prior section achieves the desired re-
construction quality, its computational complexity makes it unappealing. As previously
mentioned the main drawback of the model is the computation of pixel contributions
(weights) for each image and latent primitive pair. Alternatively it would require main-
taining a variable stack of weights for each pixel, which can be equally cumbersome
and undesirable in GPUs. The motivation for introducing an angle-dependent noise
58 CHAPTER 5. RESULTS
plane 4 : im
g 1 (approx weight= 0.26)
plane 4 : im
g 0 (approx weight= 0.25)
plane 4 : im
g 2 (approx weight= 0.18)
plane 4 : im
g 14 (approx weight= 0.17)
plane 4 : im
g 3 (approx weight= 0.07)
plane 4 : im
g 13 (approx weight= 0.06)
plane 4 : im
g 4 (approx weight= 0.00)
plane 4 : im
g 12 (approx weight= 0.00)
Sum
 of W
eighted im
ages
Figure 5.2: Weighted Appearance for a given triangle (the brighter the image the higher
the weight). Bottom-right tile is sum of the weighted sources. Sources of Fig. 5.1 seen
here in the middle of the top row, left in the center row, and left and center in the
bottom row.
Sec. 5.2. Model Validation 59
plane 0 : im
g 12 (var= 0.083852)
plane 0 : im
g 11 (var= 0.093792)
plane 0 : im
g 13 (var= 0.107897)
plane 0 : im
g 10 (var= 0.158405)
plane 0 : im
g 14 (var= 0.224837)
plane 0 : im
g 9 (var= 0.569248)
plane 0 : im
g 0 (var= 1.528360)
Figure 5.3: Un-weighted Appearance for a given triangle (ordered in increasing angle
between primitive normal and source image viewing direction). Notice how the quality
of the texture degrades with increasing viewing angle.
plane 0 : im
g 12 (approx weight= 0.27)
plane 0 : im
g 11 (approx weight= 0.24)
plane 0 : im
g 13 (approx weight= 0.21)
plane 0 : im
g 10 (approx weight= 0.14)
plane 0 : im
g 14 (approx weight= 0.10)
plane 0 : im
g 9 (approx weight= 0.04)
plane 0 : im
g 0 (approx weight= 0.01)
Sum
 of W
eighted im
ages
Figure 5.4: Weighted Appearance for a given triangle (the brighter the image the higher
the weight). Bottom-right tile is sum of the weighted sources. Source images can be
seen in Figure 5.3
60 CHAPTER 5. RESULTS
plane 6 : im
g 8 (approx weight= 0.33)
plane 6 : im
g 9 (approx weight= 0.28)
plane 6 : im
g 7 (approx weight= 0.27)
plane 6 : im
g 10 (approx weight= 0.16)
plane 6 : im
g 6 (approx weight= 0.14)
plane 6 : im
g 11 (approx weight= 0.04)
plane 6 : im
g 5 (approx weight= 0.03)
Sum
 of W
eighted im
ages
Figure 5.5: Weighted Appearance for a given triangle. (images ordered in increasing
viewing direction, the brighter a pixel, the higher the weight). Bottom right image is
the weighted sum of the sources.
model was based on a reverse projection implementation where the source image was
projected into the primitive and the result rendered from a head-on view to generate
the canonical texture. This projection techniques requires that the source image ex-
plains the entire primitive, despite the size of the texture source. As a result, the source
image is typically interpolated; this interpolation can result in some of the striations
seen on Figure 5.1 where the small component of the source image is stretched to fill
the primitive.
However if we only look at a forward projection method, as described in algorithm
4.2, then the projections from observations to textures would be at the pixel level,
meaning observation pixels would only influence a single texture pixel, and no interpo-
lation would be required. Figure 5.6 shows the textures obtained from the same source
images as Figure 5.1 using forward projection. From the figure we can see that the
forward projection method does not attempt to fill the entire appearance image, it fills
in the visible pixels. We can also see that the number of filled pixels decreases as a
function of viewing angle, which should be expected since this is the same quantity as
present in the source image.
Figure 5.6 leads us to believe that if we use the forward projection method we use
a constant noise level since the artifacts we are trying to deter with the variable-noise
model are no longer present. The results of this approach with a standard deviation
of 10 is shown in Figure 5.7 for each of the latent primitives discussed in the previous
section. Qualitatively the results of both methods are identical, c.f. Figures 5.2, 5.4,
Sec. 5.2. Model Validation 61
Figure 5.6: Texture generated by forward projection method (source images same as
Figure 5.1). Forward projection associates an observation pixel with a single texture
pixel, and possibly its neighbors. As a result interpolation is no longer needed, the
black pixels in the lower triangular region of the image represent missing data.
(a) plane 0 (b) plane 4 (c) plane 6
Figure 5.7: Learned appearances using a constant noise level and forward projection.
Reverse projection results can be seen on the last image of Figures 5.2, 5.4, and 5.5.
5.5 and Figure 5.7. Computationally we no longer need to compute or store weights for
all the image and geometric primitives, all we need are pixel counts, resulting in large
computation savings, see section 4.1.
Texture Atlases
As discussed previously, the use of texture atlases greatly improves performance.
Typical texture atlases can be seen in Figures 5.8 and 5.9. Several observations can
be made about the texture atlas, first we can clearly see the effects of fixed texture
coordinates, where some textures are scaled unevenly. This occurs when the texture
coordinates map the hypotenuse of a primitive to one of the texture edges rather than
the texture image hypotenuse; this results in stretching in one image direction and
compression in the other (see the second block of Figure 5.8, where the windows on
the bottom portion of the texture image are scaled differently than the ones on the top
portion). The affine transformation needed to undo this skew is performed by openGL
62 CHAPTER 5. RESULTS
Figure 5.8: Toy problem texture atlas, appearance size 256×256, atlas size 1024×1024
(1 of 1). Notice the effect of fixed texture coordinate.
(in the current implementation) and is transparent to the end-user (unless one looks at
the texture atlas directly as the case here).
The appearances of latent primitives that are not visible in any image appear as
black (0,0,0) in the texture atlas. Close inspection of Figure 5.8 shows a single pixels
along the diagonal for some non-visible triangles. These pixels are a result of openGL
rendering artifacts where occasionally due to object culling and inaccuracy of the depth
buffer, a single pixel of a non-visible triangle will be visible. We note that this typically
happens when primitives with vastly different normal intersect.
Sec. 5.2. Model Validation 63
Figure 5.9: Lubbock texture atlas, appearance size 8× 8, atlas size 512× 512. (1 of 10)
64 CHAPTER 5. RESULTS
Figure 5.9 shows a texture atlas for the Lubbock dataset. As compared with the
toy example the main differences are that texture sizes can be made much smaller since
triangle sizes are typically smaller and hence have a smaller number of observation
pixels, and that texture atlases might not be entirely populated, as reflected by the
entire black rows in the figure. The order of appearance in the texture atlas depends
on the order of the world triangles in the stack of primitives; if we wanted to pack the
atlases, so that all visible textures came first, we would need to order the primitives
according to visibility. This has the advantages of reducing the number of atlases
needed, and thus minimizing number of texture bindings at rendering time, and also
reducing the memory usage in graphics hardware. However, it has the disadvantage of
having to change the atlases whenever visibility changes. We opted not to implement
these changes since our renderings already achieve real time performance and we have
enough texture memory to hold all atlases for the large scenes that we are interested in
analyzing.
 5.2.2 Geometry Estimation
In this section we analyze and discuss the estimation of latent geometry variables
in the proposed model. Due to the large number of geometric primitives, emphasis is
placed on computational efficiency. As a result we begin by comparing the single prim-
itive estimation with the joint estimation over multiple primitives, c.f. §4.4, in terms of
accuracy of the results and computational complexity. Once the performance criteria
have been met, we present reconstruction examples of typical scene building blocks such
as horizontal planes, vertical planes, and trees. These examples are characteristic of
the model performance on each building block.
Individual vs. Joint Geometric Optimization
As demonstrated in chapter 4 there are several advantages of looking at a multi-
primitive update scheme, the primary being computational complexity. However, before
looking at computational complexity we must ensure that the results obtained via both
methods are comparable. To this end, Figure 5.10 shows the negative log likelihood of
LiDAR as a function of optimization run iteration, image and total for both individual
and multi-plane optimization scheme for a given plane. Several observations can be
made about the figure.
First we note that the LiDAR likelihood dominates the estimation procedure for
most of the initial iterations since the change in image likelihood is not as significant
for this portion; at around iteration seventy, the LiDAR likelihood levels off and the
image likelihood dominates for the remaining iterations. This is the typical behavior we
see in the model, where both terms are competing to explain their respective data. The
dominant term2 depends on a variety of factors such as the number of LiDAR points
2We note that the negative log-likelihood terms are additive across modalities, so that the relative
scale matters to determine dominance.
Sec. 5.2. Model Validation 65
0 20 40 60 80 100 120 140 160 180 200
0.5
1
1.5
2
x 105
∝
 
LL
 
 
Single LiDAR LL
0 20 40 60 80 100 120 140 160 180 200
1.295
1.2955
1.296
1.2965
1.297
x 108
Number of Iterations
∝
 
LL
 
 
Single Image LL
Single Total LL
0 20 40 60 80 100 120 140 160 180 200
0.5
1
1.5
2
x 105
 
 
Multi LiDAR LL
0 20 40 60 80 100 120 140 160 180 200
0.5
1
1.5
2
2.5
x 105
Number of Iterations
 
 
Multi Image LL
Multi Total LL
Figure 5.10: Individual and Joint optimization for a given plane (left and right column
respectively), divided into LiDAR (top row) and Image likelihood (bottom row). The
scale factor changes on the multi-plane image likelihood, this is due to ignoring all
primitives not being optimized. Plane 35147, CLIF Image Stack
0 20 40 60 80 100 120 140 160 180 200
1.295
1.2955
1.296
1.2965
1.297
x 108
Iteration Number
∝
 
LL
Likelihood comparison
 
 
Single Image LL
Single Total LL
Multi Image LL (offseted)
Multi Total LL (offseted)
Figure 5.11: Individual and Joint Likelihoods side-by-side for planes of Fig. 5.10.
Multi-plane likelihood scaled with constant offset of pixels not associated with current
planes
66 CHAPTER 5. RESULTS
50 100 150 200 250
101
102
103
104
Ti
m
e 
(s)
 
 
Single Plane Optimization
Multi Plane Optimization
50 100 150 200 250
0
0.2
0.4
0.6
0.8
1
M
ar
gi
na
l T
im
e 
Co
st
/it
er
at
io
n 
(s)
 
 
Cost (per iteration) for additional planes
50 100 150 200 250
10−1
100
101
102
103
Number of Planes Optimized over
Sp
ee
d 
Up
 F
ac
to
r
 
 
Single/Multi Speed Up factor
Figure 5.12: Computation time comparison for single and multi-plane optimizations for
the CLIF Intersection scene. Top plot: Cost for each method as a function of number
of planes being optimized. Middle plot: per iteration additional cost as a function
additional planes added (base is 1). Bottom plot: speed up factor
associated with the plane, the number of pixels of the plane that are visible across the
image collection, and the noise parameters.
Secondly, the difference between the individual and multi-plane estimation can be
seen in the figure if we look at the scale of the ordinate for the bottom row, where we see
a difference of three orders of magnitude. This difference corresponds to the likelihood
contribution of all the pixels in the image that are not associated with the primitive in
question. This quantity is constant since the other pixels are not being changed. We
note that there is no change in LiDAR likelihood since the multi-plane scheme only
affects image likelihoods.
Figure 5.11 shows the individual and multi-plane likelihood side-by side, (after ac-
counting for the constant offset). From the figure we can see that the likelihoods are
identical for most of the iterations shown. We can also see that around iteration 170
the evaluations differ. This difference is negligible and comes as a result of numerical
approximation in the computation, which impacts the optimization routines. We point
out that, as desired, both individual and multi-plane schemes produced the same result.
In terms of computational complexity, the multi-plane is substantially faster than
the single plane. This can be seen in the top plot of figure 5.12, where the computation
time for the multiple plane estimation scheme is always lower than single plane. The
exception to this is the case when a single plane is being considered; for this case the
Sec. 5.2. Model Validation 67
overhead associated with the bookkeeping for the multi-plane is more computationally
intensive than performing the full likelihood computation.
This result is not surprising since as discussed earlier the main cost of the likelihood
computation is rendering the primitives, which for the individual plane scales linearly
with the number of planes considered; and is constant for the multiple plane case.
As before, we point out that our implementation is limited to a maximum number of
253 planes in parallel since we are encoding plane identity in the alpha channel of the
texture maps, which are limited to 255 labels and two are already in use (background
and regular planes).
A closer look at the top plot of Figure 5.12 reveals that the computation time of
multi-plane is not constant. Which is confirmed when we look at the marginal cost for
including additional planes in the optimization (middle plot of figure 5.12). We note
that the marginal cost increases with each additional plane included, but this increase
is minor when compared with the single optimization. Possible reasons for this cost
included the expansion in the required bookkeeping and the graphics hardware memory
management (memory usage increases linearly with the number of planes analyzed in
the multiple plane case). The overall speed up factor can be seen in the bottom plot of
Figure 5.12. From the figure we can see that the gains are anywhere between slightly
less than one and over one hundred.
Overall, this section has shown that the multi-primitive optimization scheme is much
faster than the single-primitive counterpart. Furthermore, the experiments performed
showed that this performance gain does not come at the cost of accuracy since both
single- and multi-plane optimization yield similar local optima.
Scene Reconstruction Analysis
In this section we discuss the ability of the model to reconstruct a variety of scene el-
ements. For the purpose of this discussion we will focus on the following scene elements:
horizontal planes (such as ground plane or rooftops), buildings, and trees. This small
subset of categories were chosen because they have varying geometric and appearance
properties as a means of examining different aspects of the proposed model. Through-
out this section we will be referring to multiple sites of the CLIF dataset, as shown in
the reconstruction overview of Figure 5.13. Along with explicit reference where appro-
priate, we will color-code the border of each scene detail with the color shown in the
overview figure for easy reference.
Horizontal Planes
We begin the scene analysis with the most common type of surface orientation we
encounter, horizontal planes, that is extended horizontal regions that are largely planar
and typically composed of multiple primitives. Horizontal planes are ubiquitous in aerial
imagery, since they account for most of the surfaces seen. The common examples include
the rooftops and ground (while not necessarily planar, the ground can be approximated
as planar).
Examples of horizontal reconstructions can be seen in Figures 5.14 - 5.17. Figures
68 CHAPTER 5. RESULTS
Figure 5.13: Scene overview, regions that will be examined in this section are shown in
different colors. Subsequent region details are color-matched.
(a) Oblique (b) Profile
(c) Side View
Figure 5.14: Planar Reconstruction Example - Stadium Parking Lot
5.14 and 5.15 show reconstructions of strictly planar regions from three different views,
an oblique view (roughly aligned with the observations viewing direction), a profile
view (typically counter to the viewing direction) and a side-view. We can see from both
images that the model estimates the geometry fairly well. This is not surprising since
most of the LiDAR returns stem from horizontal surfaces, with the correct association,
this would drive ground primitives to explain a large number LiDAR observations. The
combination of a large collection of measurements with low noise usually leads to higher
quality reconstructions.
Sec. 5.2. Model Validation 69
(a) Oblique (b) Profile
(c) Side View
Figure 5.15: Planar Reconstruction Example - River
(a) Oblique (b) Profile
Figure 5.16: Planar Reconstruction Example - Stadium Field
70 CHAPTER 5. RESULTS
(a) Oblique
(b) Side View
Figure 5.17: Planar Reconstruction Example - Roof (two distinct heights). Heights
level smoothed due to incorrect LiDAR association.
Sec. 5.2. Model Validation 71
(a) Profile (north) (b) Profile (south)
Figure 5.18: Building Reconstruction Example - Tower 1 (northern-most)
Figures 5.16 and 5.17 are interesting cases since they are not strictly planar. For
example the goal post in the stadium field generates a small number of LiDAR observa-
tions, however, these posts are not captured3 in the latent representation and as a result
the corresponding LiDAR observations are associated with the closest primitive, which
is the ground plane. This causes ground primitives to rise to explain these observations
as seen in front and back of the stadium as shown in Figure 5.16b.
As another example Figure 5.17 contains two different height levels, which can be
seen from the side view, however, the reconstructed transition between both levels is
smoothed over by latent primitives that cover both levels. This again, can be explained
by the LiDAR association, in this case the returns from both levels are being associated
with the same plane, since there is no alignment that would satisfy both constraints, the
inference algorithms chooses to average the heights and grow the primitives, resulting
in the larger triangles seen in figure 5.17b.
Buildings
Buildings are the next set of structures we are interested in. When compared to the
ground plane or rooftops, buildings are more complex since they couple the horizontal
planes (such as ground and rooftops) with the vertical planes that make up the building
walls. The interactions between different planes needed to model buildings makes them
particularly challenging to model.
Examples of building reconstructions can be seen in Figures 5.18-5.21. These ex-
amples are all interesting and demonstrate the variety of geometry the model can re-
construct. Figure 5.18 shows the coupling between the ground plane and the roof via
3A variety of factors contribute to the goal post not being properly reconstructed, some of which
are the relative size of them with respect to the image resolutions, and the limited viewing.
72 CHAPTER 5. RESULTS
(a) Front (b) Profile
Figure 5.19: Building Reconstruction Example - Part of Stadium
long and narrow primitives that comprise the side of the tower. From the figure we
can see that the camera view (a) respects the multi-faceted geometry and appearance
of the building. While this reconstruction is pleasing, it is by no means perfect, as can
be seen from the incorrect geometry in the roof (should not be planar) and the back
portion of the building (not faceted), see appendix A for images of the scene.
Figures 5.19 and 5.20, shows part of the stadium and a slanted building respectively.
These reconstructions are particularly interesting since their structure is not that of a
traditional building. For example the scoreboard on the stadium is long and thin, and
similarly to the goal post of the field, yield a small number of LiDAR returns, which
causes primitives to elongate to explain them and produce the shape seen in figure
5.19a. Similarly the incline of the bleachers in the stadium and the roof structure of the
slanted building produce diagonal planes that are reconstructed well, while simultane-
ously keeping the sides of the buildings vertically and shrinking in size to accommodate
for the change in height, Figures 5.19b and 5.20c.
Having considered isolated structures, we now consider multiple structures. This
case can be best examined in the context of an intersection, cf. Figure 5.21 (where we
have intentionally limited the extend of buildings to better show the details between
buildings). Figure 5.21 contains two profile views, and two front views; one of each
approximately corresponds to an image perspective (bottom row). As before we can see
from the figure that the coupling between the ground plane and the roof are achieved
by narrow vertical triangles. These triangles as seen in the front view represent the
building walls. Furthermore when the intersection is visualized from views different
than observation we can see that some of the vertical triangles do not necessary end at
the boundary of the roof (or ground), and thus produce a sharp, peaked appearance.
In terms of the interaction between buildings, we can see that the geometry of the
Sec. 5.2. Model Validation 73
(a) Side 1
(b) Side 2
(c) Front
Figure 5.20: Building Reconstruction Example - Slanted Building
74 CHAPTER 5. RESULTS
(a) Profile 1 (b) Front (street level)
(c) Profile 2 (d) Front
Figure 5.21: Building Reconstruction Example - Building Intersection
Sec. 5.2. Model Validation 75
(a) Trees as seen from image 0 (b) Trees from arbitrary location
Figure 5.22: Trees Example 1 (near stadium parking lot). Reconstructed trees as viewed
from different orientations.
intersection is fairly smooth with no major gaps or holes in it; this is expected since
the alternative is to associate image pixels with background, which is discouraged in
the model. As a result, it is possible that near primitives would change their position
and orientation to fill in some gaps.
Trees
Trees do not exactly fit within the planar assumptions that the model imposes. Fur-
thermore, LiDAR returns vary significantly near trees, with some returns corresponding
to the tree itself (e.g. trunk, branches and leaves), and some others penetrating through
leaves and reaching the ground below it. Consequently, association of LiDAR returns
in the vicinity of trees is challenging. These two aspects combined yield poor quality
reconstruction, cf. Figures 5.22 and 5.23.
The figures demonstrate that when a tree is viewed from positions close to ob-
servation views, they appear like trees, since the appearance matches the projection
associated with that camera view. However, when the viewing direction is expanded
we can see the chaotic nature of the reconstruction. As alluded earlier, the association
between latent primitive and LiDAR points is largely responsible for this since the Li-
DAR likelihood encourages primitives to expand and explain the observations, yielding
large primitives which are not suitable for explaining the fine details of foliage.
Failure Cases
Most of the failure cases seen by the model arise from the LiDAR data association,
as seen by the stadium field and scoreboard examples, c.f. §4.3 for the challenges of
this association problem. Another (extreme) example is presented in Figure 5.24, where
we can see that the a primitive on the side of the tower is associated with the LiDAR
returns obtained from the light poles in the field (stream of two points seen in the
76 CHAPTER 5. RESULTS
(a) Trees as viewed from image 0 (b) Trees as viewed from image 16
(c) Trees viewed from arbitrary location (d) Trees viewed from arbitrary location
Figure 5.23: Trees Example 2 (near river). Reconstructed trees as viewed from different
orientations.
bottom row of the figure). This causes the primitive to grow in the direction of the
LiDAR return since this large deviation will incur a very high penalty under the Normal
noise model, generally overpowering image likelihood.
We note that while the noise model is the enforcer of the penalty, the real culprit
is the approximation of the data association with a static kd tree. The kd tree search
for a given primitive associates a larger number of LiDAR measurement with larger
primitives. This is the case for this particular triangle. As discussed earlier this choice
was made to alleviate computational cost, but it should be re-examined. Despite these
failure cases, the model reconstruction quality is adequate.
 5.2.3 Camera Parameter Estimation
Camera pose estimation as performed by the proposed model is reviewed in this
section. We begin by comparing the model when we include the current image as a
texture source for the camera parameter being learned and for the case where we rely
on the appearance provided by other observations (cf. §3.4.2). We then proceed to
analyze the ability of the model to converge in the form of likelihood capture ranges.
We conclude this section with a brief look at the influence of the camera pose prior
model.
Sec. 5.2. Model Validation 77
(a) Tower 2 (b) Tower 2 profile
(c) Tower 2 profile with LiDAR (d) Tower 2 profile with LiDAR
Figure 5.24: Failure Case Example - Tower 2 (southern-most). Incorrect LiDAR asso-
ciation causes primitive to extend to explain measurement.
78 CHAPTER 5. RESULTS
350 400 450 500
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
x (ft)
Sc
al
ed
 C
os
t F
un
ct
io
n 
Va
lu
e
60 80 100 120 140 160
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
y (ft)
500 1000 1500
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
z (ft)
−3.25 −3.2 −3.15 −3.1 −3.05 −3 −2.95
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
a (rad)
0.1 0.2 0.3 0.4 0.5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
b (rad)
Sc
al
ed
 C
os
t F
un
ct
io
n 
Va
lu
e
−3 −2.5 −2 −1.5 −1 −0.5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
g (rad)
0.1 0.15 0.2 0.25 0.3 0.35 0.4
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
fov (rad)
 
 
All parameters correct (Unif prior, no tex)
All parameters correct (Unif prior, with tex)
Figure 5.25: Likelihood Comparison with and without current image contribution (green
and cyan respectively). Intersection image 0. Both methods have same local optima.
Basin of attraction varies for each parameter.
Texture Source Inclusion
As discussed previously, the camera pose computation allows us to either condition
on all other observations to generate the latent appearance and use this appearance
to learned the current camera pose or simultaneously learn camera pose and latent
appearance. This last element has a significant computational cost associated with
it, since it requires one to re-compute the appearance for each new camera parameter
combination (cf. section 4.1). Due to this cost, we would like to bypass this computation
if possible and only condition on the appearance as given by other observations.
To ascertain weather this was possible, we probed the negative log-likelihood equa-
tions4 one parameter at a time for each case (holding the other parameters at their
current levels). The results of this parameter sweep can be seen in Figures 5.25 for the
Intersection scene. The figure demonstrates that the capture range for all parameters
is increased when we do not consider the current image contribution to the appearance.
Intuitively this makes sense since by considering the current contribution we are pro-
jecting image pixels into world coordinates, mixing them with other image pixels (not
necessarily from the same underlying world location) and re-projecting them back to
the image frame. This mixing of different pixels cannot improve the capture range and
4Conceptually, the likelihood function can be thought of as a cost function that we are trying to
optimize, minimize for the results shown in this section.
Sec. 5.2. Model Validation 79
−150 −100 −50 0 50 100 150
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
x (ft)
Sc
al
ed
 C
os
t F
un
ct
io
n 
Va
lu
e
−150 −100 −50 0 50 100 150
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
y (ft)
−150 −100 −50 0 50 100 150
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
z (ft)
−1 −0.5 0 0.5 1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
a (rad)
−0.2 −0.1 0 0.1 0.2
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
b (rad)
Sc
al
ed
 C
os
t F
un
ct
io
n 
Va
lu
e
−0.5 0 0.5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
g (rad)
−0.05 0 0.05
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
fov (rad)
 
 
Ground Truth
All parameters correct (Unif prior, no tex)
All parameters correct (Unif prior, with tex)
Figure 5.26: Likelihood Comparison with and without current image contribution (green
and cyan respectively, ground truth denoted by red vertical line, parameters were trans-
lated so that the ground truth appears at zero). Lubbock image 2. Both methods have
similar local optima. Capture range varies for each parameter.
causes additional local optima in the parameter space.
On the other hand, the case where we rely on image evidence from the rest of the
observations seems to have all the desired characteristics. From the figure we can see
that the function is fairly smooth and has the same desired optima as its counterpart.
This method seems to work well when there is other image evidence, such as in the
intersection case, where 44 other images are imaging the same region. In order to
catalog the behavior of the two cases for a limited number of images, we repeated the
experiment on the Lubbock scene, which contains only 3 images; the results can be seen
in Figure 5.26.
From the figure we see that both methods have local minima at the corresponding
ground-truth location. However, we no longer have the expansion of the capture range;
in fact, the opposite takes place, the capture range decreases. This somewhat surprising
result can be attributed to the wide baseline between images (approximately 90◦ apart).
These large differences coupled with the small number of images available provide little
useful information to help align the current image by only considering other images.
Based on the experiments above we can conclude that we can estimate the current
camera parameters using appearance contributions from the other images. However,
we must be careful in the limiting case where the image baseline is very wide and the
number of images is low.
80 CHAPTER 5. RESULTS
Camera Pose Capture Range
The camera pose capture range, or basin of attraction, refers to the maximum
separation between the starting point and the optimum that allows the optimum to
be recovered for any given parameter. Informally, the basin of attraction specifies how
far away from a true solution one can initialize the model and still reach the desired
optima.
We have already seen examples for the basin of attraction in Figures 5.25 and 5.26.
By looking at those two figures we can see the significant difference in the capture range
between the two scenes, which varies depending on the number of images, the baseline
between the images and the scene geometry. As an example, for the intersection in the
CLIF data set, the inference procedures converges to the correct altitude (or close to
it) if the initialization is within 500 feet as compared to the Lubbock data set for which
the initialization must typically be within 50 feet.
Another interesting question that arises, is how does the capture range change as
we get closer to the optimum? In order to address this, we looked at the basin of
attraction after a given number of updates to all other camera parameters, cf. Figure
5.27. The figure shows the capture range after updating all other observation (referred
to as a batch) once, five, and in increments of five until thirty batches was reached.
One key observation can be made from the figure: the capture range does not change
but the location of the local optimum does. This observation can be decoupled into
two parts, first the capture range is static, meaning given the current state of the latent
parameters at the time of a batch the local optimum can be reached starting from the
same distance each time. This is important and desired since it provides stability to
the solution, that is, starting within a given distance is likely to provide the correct
minimum.
Furthermore, as seen from the figure, the location of the local minimum changes in
systematic manner as the batches increase –e.g. always increasing for the x dimensions,
or decreasing the y dimension. This change is the parameter change produced by the
last batch (for both the current camera and the other cameras). The systematic changes,
along with the decrease in magnitude (as seen from the figure from the curves getting
progressively closer), further indicate the presence of a local optima and the ability
to reliably converge to it. In addition, we can see that the use of coordinate descent
techniques will provide the desired local optima after a sufficient number of batches.
Prior Model Comparison
As mentioned earlier the model is likelihood (or data) dominated. This is primarily
the case for camera extrinsic parameters, where the influence of the Gaussian Process
Prior quickly diminishes. We compare the effect of the GP prior with an improper
uniform prior model in Figure 5.28 for four images of the CLIF Stadium sequence. The
Figure shows that the inference algorithms achieve similar parameter values indepen-
dent of which prior distribution is used after about thirty batches. The exception to
Sec. 5.2. Model Validation 81
7000 8000 9000 100000
5
10
15 x 10
7
x (ft)
∝
 
LL
−1000 0 1000 2000 30000
0.5
1
1.5
2
2.5
3
3.5
4 x 10
8
y (ft)
0.8 0.9 1 1.1 1.2
x 104
0
2
4
6
8
10
12
14 x 10
7
z (ft)
−3.2 −3.1 −3 −2.9 −2.80
0.5
1
1.5
2
2.5
3
3.5 x 10
8
pitch (rad)
0.55 0.6 0.65 0.7 0.750
5
10
15 x 10
7
yaw (rad)
∝
 
LL
−3.5 −3 −2.5 −2 −1.5 −1 −0.50
5
10
15 x 10
7
roll (rad)
0.1 0.2 0.3 0.40
2
4
6
8
10
12
14
16 x 10
7
fov (rad)
 
 
Batch 1
Batch 5
Batch 10
Batch 15
Batch 20
Batch 25
Batch 30
Figure 5.27: Parameter capture range after optimizing all images: once, five, ten, ...,
and thirty times. Notice the basin of attraction remains constant across batches. (CLIF
Stadium Image 0)
this is the altitude for which after thirty batches the optimization has not converged;
convergence occurred after forty-five batches. We note that the smoothness of the Gaus-
sian Process prior cannot be determined from figure 5.28 since the GP prior encourages
smoothness across multiple images, not across different batches for the same image.
This experiment shows that the local optima found are insensitive to the choice of prior
model.
 5.2.4 Appearance and Geometry trade-off
In this section we discuss the inherent trade-off between appearance and geometry
information present in the proposed model. This trade-off occurs when considering
the information that each type of observations, images and LiDAR, provide about
the world. In these terms, LiDAR provides information primarily about geometry,
while images provide information primarily about appearance. While each type of
observation has a dominant information type, the other type is still present, for example
material properties, and hence appearance information can be inferred from LiDAR
measurements; and geometry can be estimated from image correspondence.
In this section we will qualitatively characterize the model performance as we vary
the number of observations of each type of measurement. By varying the number of
measurements we will be able to witness the interaction between the two information
types. The main goal is to characterize the reconstruction quality as the number of
82 CHAPTER 5. RESULTS
X
d(f
t)
0 5 10 15 20 25 30
7500
7600
7700
7800
7900
8000
8100
8200
8300
8400
Y
d(f
t)
0 5 10 15 20 25 30
500
1000
1500
2000
2500
3000
3500
Zd
(ft
)
0 5 10 15 20 25 30
1
1.005
1.01
1.015
1.02
1.025
1.03
1.035
1.04
×d10
4
Numberdofdbatches
ad
(r
ad
)
0 5 10 15 20 25 30
-3.1
-3.05
-3
-2.95
-2.9
-2.85
-2.8
-2.75
Numberdofdbatches
bd
(r
ad
)
0 5 10 15 20 25 30
0.58
0.6
0.62
0.64
0.66
0.68
Numberdofdbatches
gd
(r
ad
)
0 5 10 15 20 25 30
-2.25
-2.2
-2.15
-2.1
-2.05
-2
-1.95
-1.9
-1.85
GPdimgd1
Udimgd1
GPdimgd4
Udimgd4
GPdimgd7
Udimgd7
GPdimgd10
Udimgd10
Figure 5.28: Parameter Change for Uniform and GP prior as a function of number
of optimization runs. Approximately the same local optima is found after about 30
batches independent of which prior probability model is used. (CLIF Stadium)
LiDAR/image measurements vary. In the remaining of this section we discuss the
result of two experiment that try to answer these questions: in the first, we varied the
number of images while maintaining the number of LiDAR measurements constant; in
the second, we varied the number of LiDAR measurements while maintaining a constant
number of images.
Reconstruction as a function of Images
In this experiment, we varied the number of input images from 49, 20, 10, 5 to 2
images for the CLIF Image stack, while maintaining all other parameters the same. In
other words, camera initialization (via GPS) and latent geometry was the same for all
image configuration (initial configuration, i.e. GPS initialization, can be seen in figure
5.29a for the 49 image case). The images selected for each configuration were chosen
to maximize separation between images. As an example, the first and last images in
the sequence were chosen for the two-image experiment, while one image was selected
every other five in the ten-image experiment.
We optimized over the camera pose for each camera in each set. Each set of images
was optimized with the same procedures consisting of four runs:
• Run 1: Standard Update, GP prior, image noise σ = 10, ten batches.
• Run 2: Standard Update, GP prior, image noise σ = 5, five batches.
Sec. 5.2. Model Validation 83
• Run 3: Standard Update, GP prior, image noise σ = 2, five batches.
• Run 4: Reduced optimization step, GP prior, image noise σ = 5, five batches.
The parameters for each run indicate the number of times we cycled through the im-
ages, i.e. ten times in run 1 and the image noise standard deviation used. We note that
varying the image noise level does not have any noticeable difference in the reconstruc-
tion output. On the other hand decreasing the number of batches could have significant
impact, mainly not enough iterations to converge to the optima.
The results of this experiment for the first three sets can be seen in Figure 5.29. The
Figure shows two views of the reconstructed world for each set. From the figure we can
see that the quality of the reconstruction only marginally degrades when reducing the
number of images from 49 to 20. This degradation appears in the form of spotty black
pixels in the reconstruction. These black pixels occur as a result of missing information
in the latent appearance model, i.e. no observation pixel mapped to that particular
appearance location. If we further reduce the number of images from 20 to 10, the
degradation further increases; this is expected since we are reducing the number of
observation pixels. To see why this is the case consider the top right image section of
the first image on the bottom row of figure 5.29, where the change in inferred appearance
can clearly be seen as the number of image that observe the region changes from one
to three.
We note that while distracting this deterioration is minor and can be easily reduced
by either increasing the number of appearance pixels each observation pixel is allowed
to influence (four for these experiments) as discussed in §4.1 or by post-processing the
appearance maps to fill in the holes, either with simple neighbor fill or via smoothing
filtering operations.
Despite the artifacts mentioned earlier the three sets of images converge to similar
configuration of camera parameters, and are aligned well with each other, (as seen by
the sharp features in Figure 5.29). If we continue to decrease the number of images to 5-
and 2-images, the same does not hold as seen by 5.30a for the 5-image case. The figure
shows inconsistent appearance, caused by incorrect camera parameters. The limited
image evidence with wide baseline coupled with the high uncertainty in camera pose
starting parameter causes the optimization to converge to a local optima. In order to
combat this we increased the number of runs performed, from four to eight, but this
did not help. We note that this example follows under the category described in §5.2.3,
where the capture range for the correct optima is fairly small and is thus hard to find.
As an interesting result, since we had already obtained the camera pose for the prior
sets, which included all the image used in the 5- and 2-image set, we can compute the
empirical difference between starting and ending camera poses. This allows us to run
experiments in which we can vary the starting position, i.e. obtain a fictitious initial
GPS measurement and adjust the GPS noise accordingly. The results of running one of
such experiments on the five-image case can be seen in Figure 5.30b. We can estimate
that if the noise in the starting position (GPS sensor noise) is cut in half, we are able
84 CHAPTER 5. RESULTS
(a) Initial Configuration
(b) 49 images
(c) 20 images
(d) 10 images
Figure 5.29: Top row: Initial Parameter configuration for appearance reconstructions
using all 49 images. Other rows: Reconstruction as a function of number of input
images (fixed world geometry and LiDAR, same number of optimizations run for all).
Sec. 5.2. Model Validation 85
(a) 5 images
(b) 5 images (half noise)
Figure 5.30: Reconstruction as a function of number of input images, two views. (a)
original noise level; (b), noise level reduced by a factor of two. (fixed world geometry
and LiDAR, same number of optimizations run for all)
86 CHAPTER 5. RESULTS
Figure 5.31: LiDAR density, 700k, 500k, 200k, 100k, 50k respectively.
to recover a good set of camera parameters, for better pose estimates we should further
reduce the noise level.
Reconstruction as a function of LiDAR measurements
In order to further characterize the appearance geometry trade-off we analyzed
the reconstruction obtained by the proposed model as we varied the number of LiDAR
measurements (while maintaining a constant number of images). The number of LiDAR
measurements in the sets varied from 700,000, 500,000, 200,000, 100,000 to 50,000 by
randomly removing measurements from the previous set of measurements; the LiDAR
measurements clouds can be seen in Figure 5.31. As before, we maintained all other
parameters fixed, including the initialization and the number of images. The latent
geometry for each of the sets was inferred by running a single iteration of the multi-
plane geometry estimate.
The results of this experiment can be seen in Figure 5.32. From the figure we
can see that quality of the results is fairly constant for the first three sets (700k-200k
measurements), and differs for the remaining cases. We note that as the number of
measurements decreases, the number of measurements associated with each primitive
decreases, as a consequence, LiDAR has a smaller influence on the parameter estimate.
In addition, some primitives will have no LiDAR measurements associated with them
and their parameter estimates would be based on image evidence only.
Furthermore, as the number of observation decreases, the chance of incorrect data
associations also decrease, for example the incorrect plane on the side of one of the
towers (cf. Figure 5.24) is no longer present in the 300k and lower experiments since the
lamp measurements are not in the LiDAR set. Typically these type of miss-association
Sec. 5.2. Model Validation 87
Algorithm 5.1 Geometry Primitive Initialization
1: Divide LiDAR point cloud into a grid of M ×M elements.
2: Allocate memory for new sub-sampled point cloud, consisting of M2, 3D points.
3: for m = 1 : M2 do
4: Obtain the mean height of each LiDAR measurement inside grid element m.
5: Save the 3D point in the sub-sampled point cloud using the grid element’s center
and mean height.
6: end for
7: Obtain initial primitive estimate by triangulating (e.g. Delaunay Triangulation)
the grid element center and projecting them to their mean height.
causes the biggest change in parameter estimates due to the model’s need to explain
every measurement, including these outliers.
 5.2.5 Geometry Initialization
All results disused so far have had the same world geometry initialization. This
initialization consists of quantizing the 2D projection of LiDAR measurement cloud into
a grid of specified size, see algorithm 5.1. This initialization has several advantages,
first it allows us to control the number of primitives used; secondly it creates mostly
right triangles of the same size since the sub-sampled point cloud is in grid shape –only
triangles on the boundary of significant height changes deviate from this.
As we have seen so far, this initialization performs well but it has some disadvan-
tages, namely there is no overlap between triangles, so a slight shift in their parameters
produces background. Furthermore, because of the averaging occurring in grid elements
it is unlikely that the plane produced for the grid fits exactly the LiDAR measurements
(the exception to this is when there is a single measurement in the element), this fur-
ther reinforces the earlier observation that most of the likelihood improvement in the
geometry estimation comes from the LiDAR likelihood.
In order to avoid these disadvantages, and test the ability of the model to reconstruct
a variety scenes we attempted a different initialization. This alternative initialization
consisted of assigning a geometric primitive to each LiDAR observation as seen in Figure
5.33. As can be seen from the figure each plane is centered on a LiDAR observation,
and the plane orientation was chosen by fitting a local plane to the set of neighboring
measurements, 5 neighbors throughout this work. Furthermore the size of the primitive
was chosen to produce some overlap between neighboring primitives. In contrast, to
the Delaunay initialization, the resulting primitives do not share boundaries. While
this property is not preserved during the inference procedure, it does reduce initial
ambiguity as to which primitive explains which observation pixel, c.f. Figure 5.34.
The latent primitives obtained through this initialization were updated as before.
The results of the updates can be seen in Figure 5.35. At first glance it appears that
the geometry update was not very successful, but looking closer, especially from image
88 CHAPTER 5. RESULTS
(a) 700k measurements
(b) 500k measurements
(c) 200k measurements
(d) 100k measurements
(e) 50k measurements
Figure 5.32: Reconstruction as a function of LiDAR measurements (fixed number of
images)
Sec. 5.3. Structure from Motion Comparison 89
Figure 5.33: Unique planes per LiDAR measurement, each plane centered at its corre-
sponding measurement.
perspective (as the top image in the figure), we can see that the previously empty space
has now been filled, i.e. planes have expanded and/or moved to fill in the background
regions. This is not surprising for a few reasons: first, by construction the initial LiDAR
likelihood is as good as it can be, with the distance between each point and a plane
being zero. This means that initially only the image likelihood can improve, and since
the model slightly penalizes background the main gain stems from reducing the number
of background pixels.
This reduction of background pixels can come at a cost as seen by the oversize
primitive in the top left corner of the intersection in Figure 5.35 . In this case, the
inference procedure covered the background in the region since the image likelihood
dominated the size penalty and the LiDAR likelihood. This is the typical trade off
behavior of the model.
To summarize, this section shows that the results obtained with the proposed model
are initialization dependent.
 5.3 Structure from Motion Comparison
The goal of this section is to compare the performance of the proposed model to
well established structure from motion techniques. Results of the proposed model will
be compared to results obtained using the implementation of Snavely et al. [50] ( i.e.,
Bundler) and the post-processing method of Furukawa et al. [23] (i.e., PMVS2). We
will be comparing three scenes on reconstruction quality and computation time.
90 CHAPTER 5. RESULTS
Figure 5.34: One plane per LiDAR measurement initialization (planes with no texture
appear black, background is white).
Sec. 5.3. Structure from Motion Comparison 91
Figure 5.35: LiDAR initializing after updating all planes once (planes with no texture
appear black, background is white).
92 CHAPTER 5. RESULTS
Figure 5.36: Bundler reconstruction of CLIF Intersection - Ortho view (1,715 points)
 5.3.1 Reconstruction Comparison
Reconstruction comparison will be based on the CLIF scenes, namely the Inter-
section, the Stadium Image Stack and the Stadium Only datasets. For each of these
datasets we ran Bundler followed by PMVS with their default parameters. The fea-
tures used for Bundler were SIFT features [37]. We note that these reconstructions are
fundamentally different, both in intent and approach, for example we are interested in
recovering higher order primitives, triangles, while Bundler reconstructs a sparse set of
points. Due to such variations we are only considering a qualitative comparison.
CLIF Intersection
We begin the comparisons with the intersection scene. The results after running
Bundler can be seen in Figure 5.36. We note that Bundler produces a sparse scene,
with only over 1500 points. However, despite the high sparsity we are able to see the
basic outline of the intersection in the ortho view, where three out of the four corners
are recovered and the sidewalk markings can be clearly seen.
The results after running PMVS can be seen in Figure 5.37. From the three views we
can see that PMVS expands the output of Bundler, increasing the reconstructed points
to over 12,000, covering the entire visible area. Furthermore we can see from the views
that the flat horizontal surfaces such as the street and rooftops are well reconstructed,
and maintain their flat nature. The sides of buildings on the other hand do not always
maintain their vertical orientation and are at times slanted.
In terms of appearance each point is colored with the corresponding image pixel
values, which leads to a reasonable distinction between road and sidewalk for example.
From the bottom image of Figure 5.37 we can see that there are a number of outliers
Sec. 5.3. Structure from Motion Comparison 93
in the figure, easily seen for points with heights above building levels. Despite these
outliers the reconstruction is fair.
Reconstructions for this scene using the proposed method can be seen in Figure 5.38.
From the figure we can see that the reconstruction is dense, and that the appearance
of the scene is easier to distinguished since appearances are based on image patches,
rather than single pixel colors. In terms of geometry we can see that the reconstruction
of horizontal surfaces are comparable to PMVS. Similarly to Bundler the reconstruction
of vertical features can be slanted and not purely vertical, c.f. the top right intersection
corner (same in all images). Furthermore the sharp planar discontinuities on trees and
sides of buildings can be visually displeasing.
CLIF Stadium Image Stack
The Bundler reconstruction for the Stadium image stack was fairly sparse for the
size of this scene, over 3,000 points, which makes it difficult to distinguish any features.
As a result we will begin the discussions with the PMVS results, cf. Figure 5.39.
The PMVS results are dense with over 77,000 points covering the entire scene. The
collection of points let us clearly see the underlying scene structure, from the Stadium
to the buildings surrounding the stadium. Again the horizontal surfaces are very well
reconstructed, leading to excellent ground coverage.
Building sides are fairly dense and vary from being highly vertical to more slanted.
As examples of these we can use the front wall of the stadium (as seen in the top image),
and the towers. As before there are a few reconstruction errors, the most noticeable
are the severe slant in the towers and the outliers in the back wall of the stadium.
Reconstruction of the scene using the proposed method can be seen in Figure 5.40.
From the images we can see that as before the use of patches allows a clear scene
interpretation, and allows rapid and easy identification of scene parts. Taking a closer
look at the images we note that horizontal surfaces are well reconstructed and fine
details such as parking lines and cars can be easily distinguished. The performance of
our algorithm on side of buildings varies, with some sides being highly vertical, such as
the buildings in front of the stadium), and some being vertical but oversize, such as the
buildings behind the stadium in the middle image.
As pointed out earlier, the performance of the algorithm on trees is less than ideal,
leading to the highly peaky areas around the river. There are other interesting artifacts
visible such as the incorrect data association example of Figure 5.24 in the side of the
tower.
CLIF Stadium Only
The Stadium Only dataset was run through Bundler and PMVS, however, the re-
constructions obtained are mirror inverses of the underlying scene. The Bundler recon-
struction can be seen in Figure 5.41, containing over 25,000 points. This reconstruction
appears to be mirrored, so that the head of the stadium (where the scoreboard is)
appears in the bottom of the figure, where it should be at the top.
94 CHAPTER 5. RESULTS
Figure 5.37: PMVS reconstruction of CLIF Intersection, 3 views. (12,208 points)
Sec. 5.3. Structure from Motion Comparison 95
Figure 5.38: Reconstruction of CLIF Intersection using proposed algorithm, 3 views.
(3,490 visible planes)
96 CHAPTER 5. RESULTS
Figure 5.39: PMVS reconstruction of CLIF Image Stack, 3 views. (77,552 points)
Sec. 5.3. Structure from Motion Comparison 97
Figure 5.40: Reconstruction of CLIF Image Stack using proposed algorithm, 3 views.
(81,573 visible planes)
98 CHAPTER 5. RESULTS
Figure 5.41: Bundler reconstruction of CLIF Stadium Only. Oblique view, note that
the reconstruction is inverted (25,134 points)
Running PMVS, further confirmed that the scene is mirrored, cf. Figure 5.42. Closer
look at the images reveals that the scene is inside out, meaning what we are perceiving
to be the up direction of the reconstruction is actually the bottom. The PVMS results
are very dense with over 156,000 points and allow very detailed account of ground plane,
including roads and other interesting details.
Our reconstruction can be seen in Figure 5.43. As in the previous two cases, we can
use the triangular reconstruction to model very accurately the scene details in planar
regions. For example the street markings and parking lot lines can easily be identified
from the reconstruction. Furthermore, all sides of the stadium are well reconstructed,
including those in the shadow of the front wall. Other than the trees next to the river,
this reconstruction is commendable.
 5.3.2 Computation Time
In this section we discuss the computational complexity in terms of running time for
Bundler+PMVS; and the model presented in this thesis. The run time of Bundler+PMVS
can be seen in Table 5.1. From the table we can see that anywhere between 26% to 72%
of the time Bundler+PMVS is either finding or matching keypoints. The scene param-
eter optimization takes the remaining portion of the time. Overall, the computation
time for each of the scenes is quite low.
The run time for the proposed model can be seen in Table 5.2. From the table we can
see that the algorithm spends a significant amount of time computing scene geometry.
These high run-times are mostly due to the large number of visible primitives in the
scene, ranging from 3,000 to 80,000. We further note that on average the time per
Sec. 5.3. Structure from Motion Comparison 99
Figure 5.42: PMVS reconstruction of CLIF Stadium Only, 3 views. Note that the
reconstruction is inverted (156,290 points)
100 CHAPTER 5. RESULTS
Figure 5.43: Reconstruction of CLIF Stadium using proposed algorithm, 3 views.
(12,912 visible planes)
Sec. 5.4. Additional Reconstructions 101
KeyPoint
Scene
Finding Matching
Bundler PMVS Total
CLIF Intersection 47 80 294 68 489
CLIF Stadium Images 461 1,414 909 330 3,114
CLIF Stadium Only 293 725 286 90 1,401
Table 5.1: Bundler and PMVS time breakdown (all times in seconds)
iteration for a single plane is under a second with the multi-plane optimization method.
We further note that computing appearances is extremely fast, mostly due to the use
of CUDA.
Camera Pose Geometry Appearance
Scene
Time/Iter # Iter Time Time/Iter # Iter Time Time
CLIF
Inter-
section
7.26 20 6,534 170 2 5,100 1.52
CLIF
Stadium
Images
9.75 20 4,680 110 2 74,580 2.15
CLIF
Stadium
Only
6.86 20 7,260 197 2 20,882 2.01
Table 5.2: Our time breakdown. All times in seconds, iterations in camera pose refers
to updating each camera once; iterations in geometry refers to updating 250 planes
once (each update runs until specified tolerance is met or specified number of likelihood
evaluations reached, whichever first). Total time for camera pose is obtained by T =
Nimages ∗Niter ∗Titer, time for geometry is obtained by T = dNplanes/250e∗Niter ∗Titer.
 5.4 Additional Reconstructions
This section provides additional reconstructions of the scenes analyzed earlier. The
views presented here are obtained from observation viewpoints.
Figure 5.44 shows the reconstruction of the Stadium Only dataset. Each of the
images in the figure corresponds to the reconstructed world as viewed from a particular
observation –cameras: 0, 16, 32, and 48 . From the figure we can see that the recon-
structions are sharp enough to mimic actual observations. It only by looking at certain
features such as trees that we can ascertain that we are seeing a reconstruction.
Figure 5.45 shows a reconstruction of the Intersection dataset. As with the prior
figure, each of these images corresponds to the reconstruction being seen from an ob-
servation perspective, cameras 0, 14, 29, and 44. Again from the figures we can see
102 CHAPTER 5. RESULTS
Figure 5.44: Stadium Only Reconstruction, left: reconstruction, right: original image,
top to bottom cameras: 0,16,32,48.
Sec. 5.5. Beyond Reconstructions 103
the subtle details of the scene such has the double lines of the roads, or the left turn
marking on the pavement. Furthermore, if one looks carefully one can see the shadow of
what appears to be a pole with traffic lights on it in the center of the intersection. It is
only by looking at building walls that we are able to deduce that this is a reconstruction
and not an actual image.
 5.5 Beyond Reconstructions
In this section we discuss two aspects of the model that go beyond simply recon-
structing the underlying scene. These added features are: having absolute scale and
orientation; and being able to identify moving objects in the scenes. As we will see
these features follow from our modeling choices and are key distinctions between the
proposed model and traditional SfM.
Furthermore, these features are primary motivators for the proposed model. We note
that while scene reconstruction is the principal focus of many algorithms (SfM included)
it is generally an intermediate step to more complex reasoning about a scene. However,
traditional reconstruction methods don’t readily address or support this possibility.
Our goal in this section is to highlight the potential of the proposed model in order to
address those higher level queries.
 5.5.1 Absolute Scale and Orientation
As pointed out earlier, we can use LiDAR measurements to identify the absolute
scale and orientation for our reconstructions since LiDAR measurements are geocoded.
This allows us to recover the scale factor that is typically unknown in traditional SfM.
One benefit of incorporating scale and orientation into the model is that it allows
for inference and analysis beyond simple reconstruction. One can reason about such
quantities as the absolute distance between two points.
Knowing the scale and orientation is crucial for applications such as route plan-
ning, where being able to measure distances in world coordinates is fundamental. Fur-
thermore, being able to reason about physical units via remote sensing as opposed to
traditional surveying techniques can result in both financial and time savings.
As an example of how the scale and orientation knowledge can be used within the
proposed mode we demonstrate the ability to measure distances in the football field,
Figure 5.46a. From the figure we can see that we are able to measure the distance of the
field to 360.445 × 163.235 feet, which corresponds to the actual distance of 360 × 160
feet for a football fields. This estimate is consistent with the uncertainty in both
the measurements and user interaction with the marking tools. Additional distance
measurements between the ten yard (30 feet) marks in the field can also be seen. We
can use the same distance measuring tools to directly measure unknown quantities such
as the size of the “O” in the field or stands, or the height of the stadium (cf. Figure
5.46b), which are approximately 40× 28ft , 48× 67ft and 126ft respectively.
104 CHAPTER 5. RESULTS
Figure 5.45: Intersection Reconstruction, left: reconstruction, right: original image,
top to bottom cameras: 0,14,29,44.
Sec. 5.5. Beyond Reconstructions 105
(a) Field View
(b) OSU Stadium
Figure 5.46: Two distance annotation examples of OSU Stadium.
106 CHAPTER 5. RESULTS
 5.5.2 Identifying Scene Movers
An implicit assumption of the proposed model is that the appearance and geometry
of the primitives is static. Due to movement in the scene, this assumption is violated in
some places. When this assumption is combined with two of our modeling choices we are
able to detect moving objects in reconstructions. The modeling choices of interest are
modeling scene geometry as higher order primitives and maintaining the temporal order
of images. With these components in place detection can be achieved by examining
image locations with low-likelihood under the inferred appearance model.
To see how these modeling choices provide a mechanism to detect movers in a scene,
let us analyze the individual choices. The dense primitives allow us to have support
over the entire image, this ensures that all image values get projected to a reconstructed
primitive and thus allow for likelihood computations for all pixels in the image. These
likelihood computation would only be possible in SfM if keypoint matches of the object
were made across multiple frames, which is unlikely for moving objects. Even if the
matches are established SfM triangulation methods assume that the matches correspond
to the same 3D location, and thus reconstruct the moving object to a single location.
On the other hand, the temporal order ensures that the object motion is consistent
from frame to frame. This is not strictly necessary since likelihood computations do
not depend on image order.
To demonstrate the model’s ability to detect moving objects, we computed image
likelihood of the CLIF Intersection dataset after estimating world geometry and ap-
pearance, and camera parameters. The results can be seen in Figure 5.47, where low
likelihood regions have been color coded in red on the original images and the images
have been ordered according to their temporal order. From the figure we can see that
the model captures scene movers fairly well. In this example the movers consist of cars
passing (or turning) across the intersection.
It is interesting to point out that there are some low likelihood regions that do
not correspond to movers, such as building edges (see bottom right portion of the
first image). These regions are highlighted since the model is not reconstructing them
correctly. As pointed out earlier, the model’s ability to reconstruct sharp edges depends
on viewing directions, particularly wide baseline to determine correct geometry. Despite
these spurious miss assignments it is clear that movers are reliably detected.
In summary, this chapter has shown the abilities of the proposed model. We have
shown that the model is able to infer geometry, appearance and camera parameters
reliably, obtaining reconstructions that are qualitatively similar to traditional structure
from motion techniques. Furthermore, we have seen the trade off present between the
multi-modal data sources used, and how this trade-off can lead to improvements in
reconstructions. We have also seen how the proposed model can be used to extend
traditional reconstruction results, and allow more complex reasoning about scenes.
Sec. 5.5. Beyond Reconstructions 107
Figure 5.47: Low likelihood regions colored in red. Regions correspond to moving
objects and sharp edges (sequence left to right CLIF Intersection images 14-28)
108 CHAPTER 5. RESULTS
Chapter 6
Conclusion
In this thesis we considered the problem of scene reconstruction from multi-modal
data, specifically LiDAR and aerial imagery. In so doing, we adopted a Bayesian view
of data fusion and formulated an approximate generative model for combining the
aforementioned data sources. The purpose of doing so is manifold. The content of the
thesis focused primarily on inference of geometric and appearance parameters across a
wide and diverse geographic region. We also demonstrated the capacity to utilize the
model for higher level reasoning, specifically estimating physical dimensions of objects
in the scene and detecting moving objects against a static background.
Our modeling approach has several advantages: first, the use of noisy observations
and probabilistic model allows us to capture uncertainty in reconstructions, second, the
model has the ability of incorporating additional scene measurements easily when avail-
able; third the model seamlessly handles missing data, and fourth the overall modeling
approach simplifies the process of incorporating new measurement modalities provided
a physical sensor model is available.
From an implementation stand point, this thesis has shown how to leverage the
power of graphics hardware, e.g. GPUs, to allow fast renderings and inference. Using
OpenGL we have been able to render scenes with hundreds of thousands of triangles
in real time. This in turn has allowed us to infer the appearance for each triangle in
the order of a few seconds. The combination of OpenGL and CUDA has allowed us to
render a textured mapped scene and compute image likelihoods at the microsecond level.
When combined the fast rendering and computation allow for camera pose inference in
the order of ten or twenty seconds. Similarly the geometry estimate of a single triangle
can be achieved on average in under a second.
From a reconstruction standpoint, we have shown that the model is able to recover
the correct camera pose if initialized within a broad capture range. Our experiments
demonstrate that the model is suitable for describing extended, relatively smooth sur-
faces (e.g. roads, buildings, and similar structures), but suffers when representing more
complicated surfaces such as foliage or ”thin” (relative to the measurement resolution)
structures (e.g. goal posts on a football field, and street lamps). Moreover, the dense
reconstructions and the use of images to represent scene appearances allow us to identify
small but crucial scene details such as pavement markings.
Furthermore, we have shown that the appearance and geometry trade-off present
109
110 CHAPTER 6. CONCLUSION
in the model between data sources can be used to obtain a comparable (and sometime
superior) reconstruction of complex urban scenes with fewer image observations over
traditional reconstruction method. When compared to classical structure from motion
the model performs well. The resulting reconstructions are comparable. Horizontal
surfaces appear to be qualitatively better than reconstructions of vertical surfaces. We
hypothesize that this is a consequence of a greater number of LiDAR measurements
being associated with the former as compared to the latter. Precisely quantifying this
is the subject of future experiments.
Extending beyond simple reconstructions the model presented here has two desirable
features: first we are able to determine orientation and absolute scale as demonstrated
by making distance measurements. Secondly, the higher order primitives in this model
and the temporal order of the image sequence allow us to detect moving objects in
scene. These features come from the modeling choices made when building the model.
 6.1 Possible Changes
Some very explicit choices were made in this work. While these decisions typically
simplified implementations or derivations they are not the only possibilities. In this
section we speculate on aspects of our work that could be changed or improved
The first of such improvements is the latent geometry parameterization. As formu-
lated, the geometry of primitives are independent in the generative model, the inde-
pendent triangle parameterization works well within the proposed model but has the
major disadvantage of requiring a very large number of triangles to represent the scene.
This large number of triangles increases the total geometry update time and renders
time. Moreover, since triangles are generally not overlapping changing triangle geome-
try leads to regions where background is visible. A possible improvement to alleviate
these problems could be to incorporate statistical dependency between as a function of
proximity.
Another geometry change could be to allow the model to change the number of
triangles. Right now, the number of primitives must be chosen apriori and remains
constant, only update the parameters of each triangle. By letting the model modify
the number of primitives, we would have the ability to increase expressibility in areas
where it is needed, and reduce the number of primitives in benign area to help mitigate
the computation time problems.
To further ease the computation burden, we can change the multi-plane implemen-
tation to remove the constraint that only a max of 253 planes can be optimized in
parallel. This constraints follows from the observation that we can save the association
render between image pixels and world primitives if we encode the primitive number
in the alpha channel of the appearance map. Since the appearance map is an 8-bit
number, we only have 255 possible labels and two are occupied, which leaves only 253
possible labels. This constraint can be removed if we perform the association render.
In theory the proposed change would allow all primitives to be optimized in parallel;
Sec. 6.2. Future Work 111
however, care must be taken since the memory requirements far exceed the available
memory in graphics hardware. This stems from the implementation that each optimized
plane must have its own copy of background appearance. As an alternative, each
primitive could maintain a dynamic list of background indices it influences.
 6.2 Future Work
There remain a great number of unanswered questions in the topics covered in this
thesis. In this section we describe future research topics related to the work presented.
The data association problem between LiDAR measurement and latent primitives
should be further studied. The main challenge in this problem stems from the com-
binatorics, that is, which LiDAR measurement corresponds to which latent primitive.
We have simplified the association by having a fixed kd tree over the measurements,
translating primitives into points and performing nearest neighbor queries on those.
This simplification allows for fast reasoning over the large space of measurements and
primitives. However, as we have seen this simplification can lead to data association er-
rors, where large planes are associated with more measurements. A remaining challenge
is that of formulating exact data association in a computationally feasible manner.
Throughout this work we have optimized to obtain point estimate of the probability
distributions. The optimization techniques used do not require gradient computation.
While this is ultimately simple, it is also inefficient and requires many more function
evaluations. An interesting and potentially very useful future research topic would be
to obtain expressions for the gradients for the parameters of interest.
An important extension to the work presented here is to perform quantitative model
evaluation and comparisons. These evaluations were not carried out due the lack of
ground truth data or general performance metrics for reconstruction algorithms. More-
over the difference in reconstruction types, points vs. triangles, between SfM and the
model presented here adds another level of complexity since creating a triangulation
from SfM can introduce errors, and generating points from our triangular model can be
misleading and lead to sampling issues. Future work should either obtain ground truth
data to validate the reconstructions or outline a comprehensive set of criteria in which
to quantify reconstructions.
Last but certainly not least, future work should investigating other types of noisy
observations to include in the model. The two observation we currently have com-
plement each other, but more realistic reconstructions could be achieved if we model
material properties, such as reflectivity and luminosity, or texture (as in smoothness
or roughness). The introduction of lighting for the purpose of shadow modeling could
also be interesting. Observations that attempt to capture these properties should be
considered and further investigated.
112 CHAPTER 6. CONCLUSION
Appendix A
Data Overview
In this appendix we briefly introduce and discuss the data sources used in this thesis:
toy dataset, Lubbock dataset, and CLIF dataset.
 A.1 Toy Dataset
The toy Dataset is a synthetic dataset consisting of 15 images, each image is 1200×
900, cf. Figure A.1. This dataset is primarily used to demonstrate key concepts since all
the parameters are known. It is helpful when computing appearances since the camera
pose is simply a rotation around two cubes that have been texture mapped with book
covers and other texture images. Four sample images of the sequence can be seen in
Figure A.2.
 A.2 Lubbock Dataset
The Lubbock Dataset, cf. Figure A.3, consist of three image of Campus Road,
Lubbock Texas. The three images are approximately ninety degrees apart, and can be
seen in Figure A.4, each image is 1336× 891.
Figure A.1: Overview of Toy dataset, 15 images (outline location shown in red).
113
114 APPENDIX A. DATA OVERVIEW
Figure A.2: Four Images of the toy dataset (left to right noiseless images 0,4,8,12)
Sec. A.2. Lubbock Dataset 115
Figure A.3: Overview of Lubbock dataset, 3 images (outline location shown in red).
Figure A.4: The three images of the Lubbock dataset
The Lubbock dataset also contains over five million LiDAR returns in a single tile
(c.f. Fig. A.5). This tile is 1000 × 1000 meters and encompasses all of the Campus
Road Section that we are interested in. Density of LiDAR returns is about one per
meter, with vertical resolution of 10 centimeters.
This dataset has been annotated, that is 2D-3D correspondences have been hand
marked between LiDAR and images, and approximate ground truth camera pose has
been computed via the gold standard method of [28]. As a result this dataset is par-
ticularly useful when comparing learned camera parameters. In addition, it can also
be used to demonstrate the performance of algorithms when a small number of images
with a wide baseline are considered.
116 APPENDIX A. DATA OVERVIEW
Figure A.5: Overview of Lubbock LiDAR, (color coded height above ground).
 A.3 CLIF 2007 Dataset
The Columbus Large Image Format (CLIF) 2007 Sample dataset [59] is used through-
out this work. It consist of 50 frames, each frame has 6 cameras, cf. Figure A.6. Each
frame is originally 2672×4016 pixels, and this work has been downsampled to 822×1326.
The area covered in one frame is approximately 1700× 2200 feet, the approximate area
of the sample set is 3500× 3800 feet (i.e. visible across the entire sequence).
LiDAR for the Ohio State area was obtained from [41]. The tile containing the OSU
stadium and surrounding area has over 727,000 returns. The density is approximately
one return in 3× 3 feet. An overview of the LiDAR tile can be seen in figure A.7.
In this work we focus on specific cites in the CLIF dataset, the stadium Image stack
(top center in Figure A.6), a crop of the stadium and a crop of the intersection (in the
bottom center in Figure A.6). We’ll discuss these three subsets next.
 A.3.1 CLIF Stadium Image Stack
The CLIF Stadium Image Stack consist of 49 images, one for each frame of camera
one (top center in figure A.6) from the CLIF dataset. One image was not included
since it was corrupted. An overview of the cameras can be seen in figure A.8, and four
sample images can be seen in figure A.9.
This dataset uses the LiDAR tile described earlier.
Sec. A.3. CLIF 2007 Dataset 117
Figure A.6: Overview of CLIF Cameras, frame 0, six cameras. Stadium Image stack,
top center; intersection seen in the bottom center image.
Figure A.7: Overview of CLIF LiDAR, (color coded height above ground).
118 APPENDIX A. DATA OVERVIEW
Figure A.8: Overview of CLIF Image Stack dataset, 50 images (outline location shown
in red).
Figure A.9: Four Images of the CLIF Image Stack dataset (left to right images
0,16,32,48)
Sec. A.3. CLIF 2007 Dataset 119
Figure A.10: Four Images of the CLIF Stadium Only dataset (left to right images
0,16,32,48)
 A.3.2 CLIF Stadium Only
The CLIF Stadium Only Dataset consist of a cropped version of the original CLIF
frames (not downsampled). Each image in this dataset is 1024×768, four sample images
can be seen in Figure A.10.
This dataset uses the LiDAR tile described earlier.
 A.3.3 CLIF Intersection
The CLIF Intersection dataset consist of 45 cropped portions of images from camera
two and zero (bottom left and center in Figure A.6) of the CLIF dataset. Only one
image for each frame was used, i.e. the sequence was maintained. Instances where the
scene was split between two images where discarded, four sample images can be seen
in Figure A.11.
The Intersection dataset falls outside the LiDAR tile described earlier. As a result,
a small cropped version of another tile is used for this dataset, cf. Figure A.12. This
mini-tile has the same properties as before, i.e. same density; it is simply smaller with
just over 20,000 returns.
120 APPENDIX A. DATA OVERVIEW
Figure A.11: Four Images of the CLIF Intersection dataset (left to right images
0,14,29,44)
Figure A.12: Overview of CLIF Intersection LiDAR, (color coded height above ground).
Appendix B
Appearance Computation
In this chapter we derive the update equations for the appearance in our model
conditioned on all other parameters. We begin by replicating our model in Figure B.1
and the joint probability distribution, equation (B.1).
L
K1 T1
I1 Z1
M
K2 T2
I2 Z2
Kn Tn
In Zn
GA
L
K1T1
I1Z1
M
K2T2
I2Z2
KnTn
InZn
GA
L
K1 T1
I1 Z1
M
K2 T2
I2 Z2
Kn Tn
In Zn
G A
L
G
A K T
I Z
NM
L
G A K T
I Z
N
M
L
G
A K
TI Z
NM
L
G
A K T
I Z
N
M
L
G A
K
T
I Z
N
M
L
G A K T
I Z
N
M
L
G A K T
I Z
N
M
L
G A K
TI Z
N
M
C
L
G A K T
I Z
NI
Np NC
NL
L
G A K T
I Z
NI
Np NC
NL
B
Figure B.1: Our Graphical Model Representation.
Joint Probability:
p(L, I,Z,G,A,T,K,B) =
Nc∏
c=1
NI∏
i=1
p(Ici |G,A,B,Kc, Tc)p(Zi|Tc)
Nl∏
l=1
p(Ll|G)︸ ︷︷ ︸
Likelihoods
×
Np∏
k=1
p(Gk)p(Ak)
Nc∏
c=1
p(Tc)p(Kc)p(B)︸ ︷︷ ︸
priors
(B.1)
 B.1 Observation Model
As discussed in chapter 3, the image observation model is:
Icn(u, v) = Am∗(u
′, v′) +Qn (B.2)
where
Qn ∼ N (q; 0, R(m∗)), Icn is the nth image of camera c, Am∗ is the m∗ triangle appear-
121
122 APPENDIX B. APPEARANCE COMPUTATION
ance, and (u, v) is the projected image coordinates of the appearance at coordinate
(u′, v′).
We note that (u, v) and (u′, v′), are related implicitly via the 3D point (x, y, z), or
explicitly via the homography between image Ak and In.
For any pixels (u, v), and (u′, v′) that are in correspondence, we can simply the
notation by letting:
z = In(u, v)
a = Am∗(u
′, v′),
where z denotes the observation and a denotes the underlying latent appearance. So
then, equation (B.2) can be written as:
z = a+ q
It follows that:
z|a ∼ N (z; a,R),
and
z ∼ N (z;µ,Σ +R),
if we let A have a Normal prior, such that A ∼ N (a;µ,Σ).
So then our task is to estimate A. We are interested in several cases, first what would
the estimate of A be from a single observation; for multiple observations with the same
noise characteristics; and from multiple observations with different noise characteristics.
 B.2 Summary
Before diving into the derivations we present the results. As we will demonstrated:
p(a|z) ∼ N (a; µˆ, σˆ), where µˆ and σˆ vary for the different cases considered:
Single Observation, z = a+ q:
µˆ =
σ2z + r2µ
σ2 + r2
(B.3)
σˆ =
σr√
σ2 + r2
(B.4)
Multiple Observation (same noise), zn = a+ q:
µˆ =
r2µ+ σ2
∑n−1
i=0 zi
r2 + nσ2
(B.5)
σˆ =
σr√
r2 + nσ2
(B.6)
Sec. B.3. Derivations 123
Multiple Observations (different noise), zn = a+ qn (where qn ∼ N (q; 0, rn)):
µˆ =
µr˙2 + σ2
∑n−1
i=0 r˙
2
\izi
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
=
µ
∏n−1
i=0 r
2
i + σ
2
∑n−1
i=0
(∏n−1
j=0,j 6=i r
2
j
)
zi∏n−1
i=0 r
2
i + σ
2
∑n−1
i=0
(∏n−1
j=0,j 6=i r
2
j
) (B.7)
σˆ =
r˙σ√
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
=
σ
∏n−1
i=0 ri√∏n−1
i=0 r
2
i + σ
2
∑n−1
i=0
(∏n−1
j=0,j 6=i r
2
j
) (B.8)
 B.3 Derivations
In this section we derive the forms of the posterior p(a|z), for the multiple observa-
tions with different noise, next section shows how to obtain the special cases.
 B.3.1 Multiple Observations - different noise
Assume that observation is the same as before In(u, v) = Ak(u
′, v′) +Qn, but with
Qn ∼ N (q; 0, rn). i.e. the noise variance depends on the image.
As before z = a+ q, but let’s denote the different variance explicitly as z = a+ qn.
Then,
p(a|z) =
∏n−1
i=0 p(zi|a)p(a)∏n
j=1 p(zj)
∝ p(a)
n−1∏
i=0
p(zi|a)
∝ N (a;µ;σ2)
n−1∏
i=0
N (zi; a, r2i )
∝ 1√
2piσ2
exp
{
−(a− µ)
2
2σ2
} n−1∏
i=0
1√
2pir2i
exp
{
−(zi − a)
2
2r2i
}
Let us define
r˙ =
n−1∏
j=0
rj
r˙\i =
n−1∏
j=0,j 6=i
rj
124 APPENDIX B. APPEARANCE COMPUTATION
then,
p(a|z) ∝ 1√
2piσ2
exp
{
−(a− µ)
2
2σ2
} n−1∏
i=0
1√
2pir2i
exp
{
−(zi − a)
2
2r2i
}
∝ 1√
2piσ2
exp
{
−(a− µ)
2
2σ2
}
1
(2pi)
n
2 r˙
n−1∏
i=0
exp
{
−(zi − a)
2
2r2i
}
∝ 1
(2pi)
n+1
2 r˙σ
exp
{
−(a− µ)
2
2σ2
−
n−1∑
i=0
(zi − a)2
2r2i
}
︸ ︷︷ ︸
∗
Aside:
∗ = −(a− µ)
2
2σ2
−
n−1∑
i=0
(zi − a)2
2r2i
= −1
2
[
(a− µ)2
σ2
+
n−1∑
i=0
r˙2\i(zi − a)2
r˙2
]
= −1
2
[
(a− µ)2
σ2
+
1
r˙2
n−1∑
i=0
r˙2\i(zi − a)2
]
= − 1
2r˙2σ2
[
r˙2(a− µ)2 + σ2
n−1∑
i=0
r˙2\i(zi − a)2
]
= − 1
2r˙2σ2
[
r˙2a2 − 2µr˙2a+ µ2r˙2 + σ2
n−1∑
i=0
r˙2\i
(
z2i − 2zia+ a2
)]
= − 1
2r˙2σ2
[
r˙2a2 − 2µr˙2a+ µ2r˙2 + σ2
n−1∑
i=0
r˙2\iz
2
i − 2σ2
n−1∑
i=0
r˙2\izia+ σ
2
n−1∑
i=0
r˙2\ia
2
]
= − 1
2r˙2σ2
[(
r˙2 + σ2
n−1∑
i=0
r˙2\i
)
a2 − 2
(
µr˙2 + σ2
n−1∑
i=0
r˙2\izi
)
a+ µ2r˙2 + σ2
n−1∑
i=0
r˙2\iz
2
i
]
= −
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
2r˙2σ2
a2 − 2
(
µr˙2 + σ2
∑n−1
i=0 r˙
2
\izi
)
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
a+
µ2r˙2 + σ2
∑n−1
i=0 r˙
2
\iz
2
i
r˙2 + σ2
∑n−1
i=0 r˙
2
\i

= −
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
2r˙2σ2
[
(a− µˆ)2 + c′
]
Sec. B.3. Derivations 125
where
µˆ =
µr˙2 + σ2
∑n−1
i=0 r˙
2
\izi
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
c′ =
µ2r˙2 + σ2
∑n−1
i=0 r˙
2
\iz
2
i
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
− µˆ2
then
p(a|z) ∝ 1
(2pi)
n+1
2 r˙σ
exp
{
−
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
2r˙2σ2
[
(a− µˆ)2 + c′
]}
∝ 1
(2pi)
n+1
2 r˙σ
√
r˙2+σ2
∑n−1
i=0 r˙
2
\i√
r˙2+σ2
∑n−1
i=0 r˙
2
\i︸ ︷︷ ︸
roll part to proportionality
exp
{
−
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
2r˙2σ2
(a− µˆ)2
}
exp
{
−
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
2r˙2σ2
c′
}
︸ ︷︷ ︸
roll to proportionality
∝ 1
(2pi)
1
2 r˙σ√
r˙2+σ2
∑n−1
i=0 r˙
2
\i
exp
{
−
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
2r˙2σ2
(a− µˆ)2
}
∝ 1
(2pi)
1
2 r˙σ√
r˙2+σ2
∑n−1
i=0 r˙
2
\i
exp
− (a− µˆ)
2
2 r˙
2σ2
r˙2+σ2
∑n−1
i=0 r˙
2
\i

∝ 1
(2pi)
1
2 r˙σ√
r˙2+σ2
∑n−1
i=0 r˙
2
\i
exp

− (a− µˆ)
2
2
(
r˙σ√
r˙2+σ2
∑n−1
i=0 r˙
2
\i
)2

∝ N (a; µˆ, σˆ2)
where
µˆ =
µr˙2 + σ2
∑n−1
i=0 r˙
2
\izi
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
=
µ
∏n−1
i=0 r
2
i + σ
2
∑n−1
i=0
(∏n−1
j=0,j 6=i r
2
j
)
zi∏n−1
i=0 r
2
i + σ
2
∑n−1
i=0
(∏n−1
j=0,j 6=i r
2
j
) (B.9)
σˆ =
r˙σ√
r˙2 + σ2
∑n−1
i=0 r˙
2
\i
=
σ
∏n−1
i ri√∏n−1
i=0 r
2
i + σ
2
∑n−1
i=0
(∏n−1
j=0,j 6=i r
2
j
) (B.10)
126 APPENDIX B. APPEARANCE COMPUTATION
 B.4 Special Cases
As a sanity check, we double check that the multiple observation with same noise
and the single observation are special cases of the updates derived in §B.3.1.
To obtain the multiple observation case, we can let rj = r, ∀j ∈ [0, n− 1], with this
definition, r˙ = rn, and r˙\i = rn−1.
Plugging the new values for r˙ and r˙\i into the general updates of equations (B.9)
and (B.10), we obtain
µˆ =
µr2n + σ2r2n−2
∑n−1
i=0 zi
r2n + nσ2r2n−2
σˆ =
rnσ√
r2n + nσ2r2n−2
from which we can factor out r2n−2 from numerator and denominator to obtain:
µˆ =
µr2 + σ2
∑n−1
i=0 zi
r2 + nσ2
(B.11)
σˆ =
rσ√
r2 + nσ2
(B.12)
Thus equations (B.11) and (B.12) refer to the multiple observation with same noise
model. Furthermore, if we let n = 1 in the equations above we obtain the single
observation case.
Appendix C
Gaussian Marginalization
In this appendix we derive the marginalization of a multiplication of Gaussian dis-
tributions.
 C.1 General Derivation
∫ N∏
n=1
N (xn;µ, σ2)N (µ;µ0, σ20) dµ (C.1)
=
∫ N∏
n=1
1√
2piσ2
exp
{
−(xn − µ)
2
2σ2
}
1√
2piσ20
exp
{
−(µ− µ0)
2
2σ20
}
dµ (C.2)
=
1(√
2piσ2
)n 1√
2piσ20︸ ︷︷ ︸
c
∫ N∏
n=1
exp
{
−(xn − µ)
2
2σ2
− (µ− µ0)
2
2σ20
}
dµ (C.3)
= c
∫
exp
{
− 1
2σ2
N∑
n=1
(xn − µ)2 − (µ− µ0)
2
2σ20
}
dµ (C.4)
= c
∫
exp
{
− 1
2σ2
N∑
n=1
(x2n − 2µxn + µ2)−
1
2σ20
(µ2 − 2µ0µ+ µ20)
}
dµ (C.5)
= c
∫
exp
{
− 1
2σ2
[
N∑
n=1
x2n − 2µ
N∑
n=1
xn + nµ
2
]
− 1
2σ20
(µ2 − 2µ0µ+ µ20)
}
dµ (C.6)
= c
∫
exp
−
1
2

(
n
σ2
+
1
σ20
)
︸ ︷︷ ︸
a
µ2 − 2
(
1
σ2
N∑
n=1
xn +
µ0
σ20
)
︸ ︷︷ ︸
b
µ+
(
1
σ2
N∑
n=1
x2n +
1
σ20
µ20
)
︸ ︷︷ ︸
d

 dµ
(C.7)
127
128 APPENDIX C. GAUSSIAN MARGINALIZATION
= c exp
{
−d
2
}∫
exp
{
−1
2
[
aµ2 − 2bµ]} dµ (C.8)
= c exp
{
−d
2
}∫
exp
{
−1
2
[
a
(
µ− b
a
)2
− b
2
a
]}
dµ (C.9)
= c exp
{
−d
2
}∫
exp
−
(
µ− ba
)2
2
(
1√
a
)2
 exp
{
b2
2a
}
dµ (C.10)
= c exp
{
−d
2
}
exp
{
b2
2a
}∫
exp
−
(
µ− ba
)2
2
(
1√
a
)2
 dµ (C.11)
= c exp
{
−d
2
}
exp
{
b2
2a
}∫ √
a
2pi
√
2pi
a
exp
−
(
µ− ba
)2
2
(
1√
a
)2
 dµ (C.12)
= c exp
{
−d
2
}
exp
{
b2
2a
}√
2pi
a





:1∫ √
a
2pi
exp
−
(
µ− ba
)2
2
(
1√
a
)2
 dµ (C.13)
= c exp
{
−d
2
}
exp
{
b2
2a
}√
2pi
a
(C.14)
(C.15)
Back-substituting∫ N∏
n=1
N (xn;µ, σ2)N (µ;µ0, σ20) dµ (C.16)
=
1(√
2piσ2
)n 1√
2piσ20
exp
{
−d
2
}
exp
{
b2
2a
}√
2pi
a
(C.17)
=
σ
(
√
2piσ)n
√
nσ20 + σ
2
exp
{
−1
2
(
1
σ2
N∑
n=1
x2n +
µ20
σ20
)}
exp

(
1
σ2
∑N
n=1 xn +
µ0
σ20
)2
2
nσ20+σ
2
σ2σ20

(C.18)
=
σ
(
√
2piσ)n
√
nσ20 + σ
2
exp
{
−1
2
(
1
σ2
N∑
n=1
x2n +
µ20
σ20
)}
exp

(
σ20
∑N
n=1 xn + σ
2µ0
)2
2σ2σ20
(
nσ20 + σ
2
)

(C.19)
Appendix D
OpenGL Implementation Details
 D.1 OpenGL projective texture transformation
In this section we discuss the implementation of projective texture transformations
in OpenGL, and how to relate image pixels to texture texels.
 D.1.1 Projective Transformation
We begin by noting that we can write the transformation from 3D world point to
image space as: 
x′
y′
z′
w′
 = PcamVcam

x
y
z
w
 (D.1)
where
Vcam is the projective camera view matrix,
Pcam is the projective camera projection matrix.
We complete the transformation from image projective space to pixels by x˜y˜
w′
 =
 12w 0 12w0 −12h h− 12h
0 0 1
 x′y′
w′
 =
 12(x′ + w′)whw′ − 12(y′ + w′)h
w′
 (D.2)
Note that in equation (D.2) we have ignored the z′ term since our images are 2
dimensional (we have also written the term h− 12h explicitly since it will be convenient
later. Also note, that the matrix diag(12) used above is needed due to OpenGL internal
storage of pixels.
We can obtain the final pixel location by:[
u
v
]
=
[
x′
w′
y′
w′
]
(D.3)
 D.1.2 Projective Texture Mapping
We begin by noting that we can write the texture transformations as:
129
130 APPENDIX D. OPENGL IMPLEMENTATION DETAILS

s
t
r
q
 = Tobject

x
y
z
w
 (D.4)
where
Tobject =

1
2 0 0
1
2
0 12 0
1
2
0 0 12
1
2
0 0 0 1
PpVpM (D.5)
where
M is an object to world transformation (identity in our cases, i.e. the object is
expressed in world coordinates).
Vp is the projective camera view matrix,
Pp is the projective camera projection matrix.
This is obtained from [14]
Let 
x′
y′
z′
w′
 = PpVp

x
y
z
w
 (D.6)
Note that equation (D.6) is simply a projective transformation of a 3D point.
Then, using, equation (D.6) in equations (D.4) and (D.5), we can write:
s
t
r
q
 =

1
2 0 0
1
2
0 12 0
1
2
0 0 12
1
2
0 0 0 1


x′
y′
z′
w′
 =

1
2x
′ + 12w
′
1
2y
′ + 12w
′
1
2z
′ + 12w
′
w′
 (D.7)
Equation (D.7) is the projective texture mapping for a point in 3D world to a 3D
texture. In our case, the textures are 2D, so we can disregard the t component of the
texture. These are un-normalized texture coordinates. The next step is to do a range
mapping to [0, 1] for each dimension.
The easiest way to do so, is to equate it to the orthographic projection we had
working before: 1−(h−(1/2y′/w′+1/2)h+0)/hTex. So we want to map this expression,
to something like aw′ . Where a is the term we are looking for
In what follows, h is the height of the original image, htex is the height of the texture
image (power of 2), by creation htex > h. Similarly for w and wtex, the width of image
and texture respectively.
We can achieve the range mapping as follows:
Sec. D.1. OpenGL projective texture transformation 131
1−
[
h− (12 y
′
w′ +
1
2)h+ 0
]
htex
=
htex −
[
h− (12 y
′
w′ +
1
2
w′
w′ )h
]
htex
=
htex − h+ (12 y
′
w′ +
1
2
w′
w′ )h
htex
=
htex − h+ (12y′ + 12w′) hw′
htex
=
(htex − h)w′ + (12y′ + 12w′)h
htexw′
=
(htex−h)w′+( 12y′+ 12w′)h
htex
w′
From above we can see that:
(htex − h)w′ + 12(y′ + w′)h
htex
(D.8)
is the term we are looking for. Note that the mirroring introduces an extra offset of
(1− hhtex )w′ (the first term in the equation above).
Similar procedure can be used to obtain s (without having to do the mirroring).
The final projective texture coordinates can be written as:
s
t
0
q
 =

1
2(x
′ + w′) wwtex[
(htex − h)w′ + 12(y′ + w′)h
]
h−1tex
0
w′
 (D.9)
 D.1.3 Pixels to Textures
Here we try to connect the pixel locations to texture locations. We can com-
pare equations (D.2) and (D.9). We can see that the un-normalized pixel coordinates
[x˜ y˜ w˜] are very similar to the [s t r q] coordinates.
A substitution of common terms yields:

s
t
0
q
 =

1
2
(x′ + w′)w︸ ︷︷ ︸
x˜
wtex
−1
(htex − h)w′ + 12(y′ + w′)h︸ ︷︷ ︸
hw′−y˜
h−1tex
0
w′

=

x˜ wtex
−1
(htexw
′ − y˜)h−1tex
0
w′
 (D.10)
132 APPENDIX D. OPENGL IMPLEMENTATION DETAILS
Bibliography
[1] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in
a day, 2009.
[2] J. Alon and S. Sclaroff. Recursive Estimation of Motion and Planar Struc-
ture. IEEE Conference on Computer Vision and Pattern Recognition (2008),
2(Cvpr):2550, 2000.
[3] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal
algorithm for approximate nearest neighbor searching fixed dimensions. Journal
of the ACM, 45(6):891–923, 1998.
[4] C. Baillard and A. Zisserman. Automatic reconstruction of piecewise planar models
from multiple views. Proceedings 1999 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Cat No PR00149, 2:559–565, 1999.
[5] A. Bartoli and P. Sturm. Constrained Structure and Motion From Multiple Un-
calibrated Views of a Piecewise Planar Scene. International Journal of Computer
Vision, 52(1):45–64, 2003.
[6] A. Bartoli, P. Sturm, and R. Haraud. Projective structure and motion from two
views of a piecewise planar scene. Proceedings Eighth IEEE International Confer-
ence on Computer Vision ICCV 2001, 1(C):593–598, 2001.
[7] P. A. Beardsley, A. Zisserman, and D. W. Murray. Sequential updating of projec-
tive and affine structure from motion. International Journal of Computer Vision,
23(3):235–259, 1997.
[8] D. C. Brown. The bundle adjustment - progress and prospects. In Int Archives
Photogrammetry, volume 21, pages 1–33, 1976.
[9] L.-c. Chen, T.-a. Teo, Y.-c. Shao, Y.-c. Lai, and J.-y. Rau. Fusion of LiDAR Data
and Optical Imagery for Building Modeling. Building, 35(B4):2–7, 2003.
[10] L.-C. C. L.-C. Chen, T.-A. T. T.-A. Teo, J. Y. Rau, J.-K. L. J.-K. Liu, and W.-C.
H. W.-C. Hsu. Building reconstruction from LIDAR data and aerial imagery, 2005.
[11] F. Dellaert, S. M. Seitz, C. E. Thorpe, and S. Thrun. Structure from motion
without correspondence. IEEE Transactions on Medical Imaging, 26(2):557–564,
2000.
[12] A. R. Dick, P. H. S. Torr, S. J. Ruffle, and R. Cipolla. Combining Single View
Recognition and Multiple View Stereo For Architectural Scenes. In IEEE Interna-
tional Conference on Computer Vision, volume 00, pages 268–280. IEEE Computer
133
134 BIBLIOGRAPHY
Society, 2001.
[13] Digia. Qt Project, 2013.
[14] C. Everitt. Projective texture mapping.
[15] O. Faugeras, Q. Luong, S. Maybank, and G. Sandini. Camera self-calibration:
Theory and experiments. In European Conference on Computer Vision, volume
588, pages 321–334. Springer Berlin / Heidelberg, 1992.
[16] O. Faugeras and F. Lustman. Motion and Structure from Motion in a Piecewise
Planar Environment, 1988.
[17] A. Fischer. Extracting Buildings from Aerial Images Using Hierarchical Aggrega-
tion in 2D and 3D. Computer Vision and Image Understanding, 72(2):185–203,
1998.
[18] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model
fitting with applications to image analysis and automated cartography. Commu-
nications of the ACM, 24(6):381–395, 1981.
[19] A. W. Fitzgibbon and A. Zisserman. Automatic camera recovery for closed or open
image sequences. Computer VisionECCV98, 1:311–326, 1998.
[20] D. A. Forsyth, S. Ioffe, and J. Haddon. Bayesian structure from motion. Proceedings
of the Seventh IEEE International Conference on Computer Vision, 00(648):660–
665 vol.1, 1999.
[21] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice
Hall, 2002.
[22] K. Fujii and T. Arikawa. Reconstruction of 3D urban model using range image
and aerial image, 2001.
[23] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–76, 2010.
[24] D. Gallup, J.-M. Frahm, and M. Pollefeys. Piecewise planar and non-planar stereo
for urban scene reconstruction. In Computer Vision and Pattern Recognition
(CVPR), 2010.
[25] Gilles Debunne. libQGLViewer, 2013.
[26] P. Gurram, H. Rhody, J. Kerekes, S. Lach, and E. Saber. 3D Scene Reconstruction
through a Fusion of Passive Video and Lidar Imagery, 2007.
[27] K. Hammoudi and F. Dornaika. A Featureless Approach to 3D Polyhedral Building
Modeling from Aerial Images. Sensors (Peterboroug, 11(1):228–259, 2010.
[28] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision, vol-
ume 16. Cambridge University Press, 2004.
[29] R. I. Hartley and P. Sturm. Triangulation. Computer Vision and Image Under-
standing, 68(2):146–157, Nov. 1997.
[30] M. E. Hodgson and P. Bresnahan. Accuracy of Airborne Lidar-Derived Elevation
: Empirical Assessment and Error Budget. Photogrammetric Engineering Remote
Sensing, 70(3):331–339, 2004.
[31] J. H. J. Hu, S. Y. S. You, and U. Neumann. Integrating LiDAR, Aerial Image and
Ground Images for Complete Urban Building Modeling, 2006.
[32] T. S. Huang and A. N. Netravali. Motion and structure from feature correspon-
dences: a review. Proceedings of the IEEE, 82(2):252–268, 1994.
BIBLIOGRAPHY 135
[33] M. Huber, W. Schickler, S. Hinz, and A. Baumgartner. Fusion of LIDAR data and
aerial imagery for automatic reconstruction of building surfaces, 2003.
[34] Khronos Group. Open Graphics Library (OpenGL), 2013.
[35] H.-H. Liao, Y. Lin, and G. Medioni. Aerial 3D reconstruction with line-constrained
dynamic programming, 2011.
[36] M. A. Lourakis and A. Argyros. SBA: A Software Package for Generic Sparse
Bundle Adjustment. ACM Trans. Math. Software, 36(1):1–30, 2009.
[37] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Interna-
tional Journal of Computer Vision, 60(2):91–110, 2004.
[38] D. J. C. Mackay. Information Theory , Inference , and Learning Algorithms, vol-
ume 22. Cambridge University Press, 2003.
[39] NVIDIA. Compute Unified Device Architecture (CUDA), 2013.
[40] C. Nvidia. NVIDIA CUDA C Programming Guide. Changes, (350):173, 2011.
[41] Ohio Statewide Imagery Program Data and Ohio Geographically Referenced In-
formation Program. Ohio Statewide Imagery Program Data.
[42] M. Pollefeys, D. Niste´r, J. M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp,
C. Engels, D. Gallup, S. J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang,
Q. Yang, H. Stewe´nius, R. Yang, G. Welch, and H. Towles. Detailed Real-Time
Urban 3D Reconstruction from Video. International Journal of Computer Vision,
78(2-3):143–167, 2007.
[43] G. Qian and R. Chellappa. Structure from Motion Using Sequential Monte Carlo
Methods. International Journal of Computer Vision, 59(1):5–31, 2004.
[44] C. Rasmussen and C. Williams. Gaussian processes for machine learning. The
MIT Press, 2006.
[45] H. Samet. Foundations of multidimensional and metric data structures. Morgan
Kaufmann Publishers, San Francisco, 2006.
[46] J. Sanders and E. Kandrot. CUDA by Example, volume 54. Addison-Wesley, 2010.
[47] D. Shreiner. OpenGL Programming Guide: The Official Guide to Learning
OpenGL, versions 3.0 and 3.1. Addison Wesley, 7 edition, 2010.
[48] S. N. Sinha, D. Steedly, and R. Szeliski. Piecewise planar stereo for image-based
rendering, 2009.
[49] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections
in 3D. Image Rochester NY, 25(3):835–846, 2006.
[50] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the World from Internet Photo
Collections. International Journal of Computer Vision, 80(2):189–210, 2007.
[51] H. Steuer. Height snakes: 3D building reconstruction from aerial image and laser
scanner data, 2011.
[52] P. Sturm and B. Triggs. A factorization based algorithm for multi-image projective
structure and motion. Lect Notes Comput Sci, 1065(April):709–720, 1996.
[53] I. Suveg and G. Vosselman. Reconstruction of 3D building models from aerial
images and maps. ISPRS Journal of Photogrammetry and Remote Sensing, 58(3-
4):202–224, Jan. 2004.
[54] R. Szeliski. Computer Vision : Algorithms and Applications. Computer, 5:832,
2010.
136 BIBLIOGRAPHY
[55] R. Szeliski and P. H. S. Torr. Geometrically Constrained Structure from Motion :
Points on Planes 1 Introduction. Lecture Notes in Computer Science, 1506:171–193,
1998.
[56] C. Tomasi and T. Kanade. Shape and motion from image streams under or-
thography: a factorization method. International Journal of Computer Vision,
9(2):137–154, 1992.
[57] B. Triggs, P. F. Mclauchlan, R. I. Hartley, and A. W. Fitzgibbon. Bundle Adjust-
ment A Modern Synthesis. Synthesis, 34099:298–372, 2000.
[58] F. T. F. Tsai, T.-A. T. T.-A. Teo, L.-C. C. L.-C. Chen, and S.-J. C. S.-J. Chen. Con-
struction and visualization of photo-realistic three-dimensional digital city, 2009.
[59] US Air Force. Columbus Large Image Format (CLIF) 2007 Dataset .
[60] J. Vandekerckhove, D. Frere, T. Moons, and L. V. Gool. Semi-automatic modelling
of urban buildings from high resolution aerial imagery, 1998.
[61] S. W. S. Weng, G. Z. G. Zhao, and B. H. B. He. Rapid reconstruction of 3D
building models from aerial images and LiDAR data, 2010.
[62] M. X. M. Xie, K. F. K. Fu, and Y. W. Y. Wu. Building Recognition and Recon-
struction from Aerial Imagery and LIDAR Data, 2006.
[63] G. X. G. Xu, J. Terai, and H.-Y. S. H.-Y. Shum. A linear algorithm for camera
self-calibration, motion and structure recovery for multi-planar scenes from two
perspective images, 2000.
[64] Y. Zhang, Z. Zhang, J. Zhang, and J. Wu. 3D Building Modelling with Digi-
tal Map, Lidar Data and Video Image Sequences. The Photogrammetric Record,
20(111):285–302, 2005.