thesis/results.tex at master · bensapp/thesis · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
\chapter{Results}\label{sec:results}

In this chapter we go through quantitative results on the datasets Buffy,
Pascal and VideoPose (\secref{datasets}), analyzing behavior and design choices
of our methods.  We also compare our models---CPS, ESM and LLPS---against
competing models (\secref{competition}).  For much of the analysis we focus on
upper and lower arms only---in particular, elbow and wrist localization
accuracy.  The reasons for this are that (1) torso and head localization are
near-perfect given a detected person \citep{deva2011}, (2) arms are the most
interesting parts, involved in actions, hand-held objects and object-person
interactions.


\section{Coarse-to-fine cascade evaluation}\label{sec:cascade-eval}
 \begin{table}[tb]
\begin{center}
\input{cascade_results_table}
\caption[Coarse-to-fine cascade progression analysis.]{Coarse-to-fine cascade
progression analysis. We show the progression of state spaces in the cascade,
as well as reduction in the state space at each level (measuring efficiency),
and in the last column, how many arm hypotheses remain closely matched,
considering the closest match to groundtruth remaining from the unpruned
hypotheses (measuring accuracy). }
\label{tab:c2f}
\end{center}
\end{table}

In \tabref{c2f} we show the progress of our CPS model's coarse-to-fine cascade,
in the Buffy dataset.
As explained in \secref{impl-details}, we start with a small state space and
continue pruning and refining until we reach a somewhat fine $80 \times 80
\times 24$ grid.  At the end of the cascade we are left with on average $492$
states for each part, $99.67\%$ fewer states than the original $80 \times 80
\times 24$ state space. We see that after one level of the cascade, we have
already pruned more than half of the full state space away.  This
is intuitive because there are many easy decisions of states to reject based on
even geometry alone, \eg the left elbow does not ever appear in the upper right
corner of the person's bounding box.

As the cascade progresses, we do lose arm hypotheses close to the groundtruth
arms, seen in the last column of \tabref{c2f}.  However, the percent of hypotheses close
to groundtruth after the cascade process is still higher than any current
system's accuracy on lower arms (see \tabref{res-table}).  Thus this number
($68.4 \%$) is an upper bound on how well we could do with our small, pruned
set of states.

To verify that our pruning is better than heuristic pruning, we compare to
heuristic pruning in \tabref{c2f}, last row.  The heuristic is to sample states
proportional to their unary potential scores (\ie, HoG limb detectors), with
non-max suppression.  At the same number of states sampled as we obtain from
the cascade, the heuristic pruning misses $10\%$ more lower arm hypotheses.

Finally, it could be the case that the benefits of our rich features in the
last stage of CPS make discrepencies in accuracy of pruning strategies
neglible.  In other words, even the pruning heuristic retains $58.6\%$ of lower
arms in its hypothesis set, and it is possible that it could perform equally
well at final-level prediction when using the same features as CPS.  We see in
\figref{cascade-vs-pruning} that this is not the case.  CPS performs $5 \text{
to } 10\%$ better than simple detector pruning coupled with the rich features
we use in CPS.  This makes a strong case that the CPS state filtering strategy
is important.

\begin{figure}[tb]
\begin{center}
\includegraphics[width=0.99\textwidth]{figs/cascade-vs-pruning.pdf}
\caption[Cascade versus heuristic pruning.]{Cascade versus heuristic pruning.
Here we see that our cascade progression, coupled with a rich set of features
does better than heuristic pruning with the same rich features used afterwards.
This indicates that not only do the rich features matter, but also the quality
of the states retained from CPS.}
\label{fig:cascade-vs-pruning}
\end{center}
\end{figure}


\section{Feature analysis}
\begin{figure}[tb]
\begin{center}
\includegraphics[width=0.99\textwidth]{figs/ablative-bars}
\caption[Feature analysis.]{Feature analysis.  On the left, we observe how
adding features contributes to final system performance of CPS on Buffy,
measuring the Area Under the Curve of pixel error distance.  On the right, we
observe how removing single feature modalities from our Ensemble of Stretchable
Models affects performance.}
\label{fig:ablative}
\end{center}
\end{figure}

One of the main contributions of this thesis is technical innovations that
allow us to include ``everything and the kitchen sink'' (in the house of vision
features).  At this point we wish to verify that the work done implementing,
computing and learning parameters for features makes a difference in
performance.  This is actually quite difficult to do in general, because it is
computationally infeasible to explore the combinatorially many possibilities of
feature sets. Non-additive interactions between feature types may occur. We
analyze the importance of features by grouping them by modality, and adding
or removing them from our full systems in turn, measuring the change in
performance.

In \figref{ablative} we do a feature analysis for CPS (left) and ESM (right),
grouping features by coarse modalities.  The first important thing to note is
that the rich features we include over standard edge template and geometry
information lead to better performance.  For CPS, including rich feature types
individually in conjunction with geometry features helps performance, in
particular pairwise cues such as color and contours that were infeasible to
compute without our cascade approach.  In the bar marked ``baseline PS'' we
evaluate a classical unimodal PS, whereas the geometric parameters in our
cascade models (``geom only'') are learned bin weights that can achieve
non-linearity.  All features together do significantly better than any one
feature modality in isolation.

On the right side of \figref{ablative}, we do an ablative analysis of our ESM
model.  The most important individual feature modality is optical flow, which
gives us a fairly good estimate of foreground/background separation in many
video frames.  Importantly, many of these feature modalities are not used in
pose estimation models because they require joint interactions which lead to
loopy, cyclic models.


\section{System results}\label{sec:system-results}
In this section we analyze end-to-end system results, using the publicly
available code for our systems and competitors, for reproducibility's sake.
\begin{table}[tb]
\begin{center}
\input{res-table}
\caption[PCP evaluation.]{PCP Evaluation of single frame pose estimation. PCP
is a fairly loose measure of accuracy and only reveals one precision operating
point.  We include the measure for historical reasons; for a more detailed
picture see \figref{results-mpose-buffy-pascal}.}
\label{tab:res-table}
\end{center}
\end{table}


\subsection{Single frame pose estimation}
\begin{figure}[tb]
\begin{center}
\includegraphics[width=1.00\textwidth]{figs/results-mpose-buffy-pascal.pdf}
\caption[Single frame pose estimation results.]{Single frame pose estimation
results.  Shown are our single frame models CPS and LPPS against
state-of-the-art competitor models (\secref{competition})}.
\label{fig:results-mpose-buffy-pascal}
\end{center}
\end{figure}
The performance of all single-frame models are shown on the MoviePose, Buffy
and Pascal datasets in \figref{results-mpose-buffy-pascal}.  The LLPS model
outperforms the rest across the three datasets.  \citet{deva2011} is the
closest competitor overall, but CPS outperforms the others slightly on the most
difficult Pascal dataset. \citet{eichner-tr} (and \citet{andriluka09}, whose
code we were unable to run; see \tabref{res-table}) is uniformly worse than the
other models, most likely due to the lack of discriminative training and the
unimodal modeling.   Localization accuracy is not the only way to measure the
quality of a model (\eg speed)---see \secref{discussion} for more discussion.

Surprisingly, the two simple prior pose baselines perform comparatively well.
The ``mean pose'' baseline is a lower bound on performance, but is competitive
with~\cite{eichner-tr} in some cases. The ``mean cluster prediction'' baseline
actually outperforms or is close to CPS and \citet{eichner-tr} on the three
datasets, at very low computational cost---$0.0145$ seconds per image.  This
surprising result indicates the importance of multimodal modeling in even the
simplest form.  The decent performance of these mean pose baseline is also an
indication of either the difficulty or lack of pose variation in these
datasets, or a combination of both.  In fact, the scatterplots in
\figref{dataset-scatterplots} do show that most elbows are very tightly grouped
in these single frame datasets.

We also report PCP scores in~\tabref{res-table}.  Comparing the rankings of
methods by \tabref{res-table} and \figref{results-mpose-buffy-pascal}, we see
that PCP does not give a complete story of performance.

\subsection{Video pose estimation}
\begin{figure}[tb]
\begin{center}
\includegraphics[width=1.00\textwidth]{figs/results-vpose.pdf}
\caption[Video pose estimation results.]{Video pose estimation results.  Shown
are our single frame model CPS, various forms of agreement for Ensembles of
Stretchable Models, and other single frame competitors.}
\label{fig:results-vpose}
\end{center}
\end{figure}

Quantitative results for pose estimation in video are shown
in~\figref{results-vpose}.  All three methods we explore for inference in our
pose estimation video model (Ensemble of Stretchable Models) outperform the
state-of-the-art single-frame methods by a significant margin.  Using just a
single one of our Stretchable Model trees already does significantly better
than single frame models.  This shows the usefulness of our stretchable model
(joint-centric) representation of pose, as well as some of the rich pairwise
interactions we use that other models do not.  It is important to note that
previous work has found that incorporating time persistence into models
actually {\em hurt } performance~\citep{posesearch,weisssapp10}---hence single
frame models are the most competitive models for which to compare.

Furthermore, we explore the different agreement methods discussed
in~\secref{stretchable-inference}.  From \figref{results-vpose}, we see that
the very fast, approximate decoding schemes (A single tree and $SV$ require
only a single inference pass) were comparably accurate to more expensive
methods ($SF$ takes about $2\times$ as long while we ran $DD$ for $500\times$
iterations, each requiring inference in each submodel)\footnote{Specifically,
running inference in all 6 models sequentially takes about 16 seconds per 30
frame clip (trivial to parallelize) on a Intel Xeon CPU, E5450 @ 3.00GHz.  $SF$
inference takes an additional 19 seconds.}. We found two important trends: (1)
On average, completely decoupled inference ($Independent$) was consistently
about 0.75-1.5\% worse than the inference methods that aggregated information
across models, and (2) solving {\em partial} agreement problems {\em exactly}
($SV, SF$) performed better than solving {\em complete} agreement {\em
approximately} with Dual Decomposition.

The absolute accuracy values are also a testament to the difficulty of the
dataset: At the tightest matching criterion, we only correctly localize half
the elbows.  This suggests that there is plenty of future progress to be made
on this dataset, and in this domain in general.