# Generative Models {#sec-generative_models}
## Introduction
Generative models perform (or describe) the synthesis of data. Recall
the image classifier from @sec-intro_to_learning, shown again below
(@fig-gen_models_image_classification):
{width="70%" #fig-gen_models_image_classification}
A generative model does the opposite (@fig-gen_models_image_generation):
{width="70%" #fig-gen_models_image_generation}
Whereas an image classifier is a function
$f: \mathcal{X} \rightarrow \mathcal{Y}$, a generative model is a
function in the opposite direction
$g: \mathcal{Y} \rightarrow \mathcal{X}$. Things are a bit different in
this direction. The function $f$ is **many-to-one**: there are many
images that all should be given the same label "bird." The function $g$,
conversely, is **one-to-many**: there are many possible outputs for any
given input. Generative models handle this ambiguity by making $g$ a
**stochastic function**.
:::{.column-margin}
Although we describe $\mathbf{y}$ as a label here, generative models can in fact take other kinds of instructions as inputs, such as text descriptions of what we want to generate or hand-drawn sketches that we wish the model to fill in.
:::
One way to make a function stochastic is to make it a deterministic
function of a stochastic input, and this is how most of the generative
models in this chapter will work. We define $g$ as a function, called a
**generator**, that takes a randomized vector $\mathbf{z}$ as input
along with the label/description $\mathbf{y}$, and produces an image
$\mathbf{x}$ as output, that is,
$g: \mathcal{Z} \times \mathcal{Y} \rightarrow \mathcal{X}$. Then this
function can output a different image $\mathbf{x}$ for each different
setting of $\mathbf{z}$. The generative process is as follows. First
sample $\mathbf{z}$ from a prior $p(Z)$, then deterministically generate
an image based on $\mathbf{z}$ and $\mathbf{y}$:
$$
\begin{aligned} \mathbf{z} \sim p(Z)\\
\mathbf{x} = g(\mathbf{z},\mathbf{y})
\end{aligned}
$$
:::{.column-margin}
Graphical model for generating $X \mid Y, Z$.
{width="30%"}
:::
This procedure is shown in
@fig-generative_models-image_generation_with_z:
{width="70%" #fig-generative_models-image_generation_with_z}
Usually we draw $\mathbf{z}$ from a simple distribution such as Gaussian noise, that is, $p(Z) = \mathcal{N}(\mathbf{0},\mathbf{I})$. The way to
think of $\mathbf{z}$ is that it is a vector of **latent variables**, which specify all the attributes of the images other than those determined by the label. These variables are called latent because they
are not directly observed in the training data (it just consists of images $\mathbf{x}$ and their labels $\mathbf{y}$). In our example,
$\mathbf{y}$ tells $g$ to make a bird but the $\mathbf{z}$-vector is
what specifies exactly which bird. Different dimensions of $\mathbf{z}$
specify different attributes, such as the size, color, pose, background,
and so on; everything necessary to fully determine the output
image.
## Unconditional Generative Models
Sometimes we wish to simply make up data from scratch; in fact, this is
the canonical setting in which generative models are often studied. To
do so, we can simply drop the dependency on the input $\mathbf{y}$. This
yields a procedure for making data from scratch:
$$\begin{aligned}
\mathbf{z} \sim p(Z)\\
\mathbf{x} = g(\mathbf{z})
\end{aligned}$$
:::{.column-margin}
Graphical model for an unconditional generative model.
{width="10%"}
:::
We call this an **unconditional generative model** because it is a model of the unconditional distribution $p(X)$. Generally we will refer to unconditional generative models simply as
generative models and use the term **conditional generative model** for
a model of any conditional distribution $p(X \mid Y)$. Conditional
generative models will be the focus of @sec-conditional_generative_models;
in the present chapter we will restrict our attention, from this point on, to
unconditional models.
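To make this concrete, here is a toy version of the two-line sampling procedure in code. The generator below is hand-made rather than learned, and its form (a linear map followed by a tanh) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    # hypothetical generator: maps a 2D latent vector to a 3-"pixel" image
    W = np.array([[1.0, 0.0],
                  [0.5, 0.5],
                  [0.0, 1.0]])
    return np.tanh(W @ z)

z = rng.normal(size=2)   # z ~ p(Z) = N(0, I)
x = g(z)                 # x = g(z): a deterministic function of a random input
```

Each fresh draw of $\mathbf{z}$ produces a different $\mathbf{x}$; the generator itself is deterministic.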
Why bother with (unconditional) generative models, which make up random
synthetic data? At first this may seem a silly goal. Why should we care
to make up images from scratch? One reason is *content creation*; we
will see other reasons later, but content creation is a good starting
point. Suppose we are making a video game and we want to automatically
generate a bunch of exciting levels for the player to explore. We would
like a procedure for making up new levels from scratch. Such procedures
have been used successfully to generate random landscapes and cities for
virtual environments @parish2001procedural. Suppose we want to add a river to a
landscape. We need to decide what path the river should take. A simple
program for generating the path could be "walk an increment forward,
flip a coin to decide whether to turn left or right, repeat."
Here is that program in pseudocode:
{width="100%" #alg-generative_models-simple_rivers_script}
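The same procedure can be sketched in Python (a minimal sketch; the function name and step conventions are our own):

```python
import random

def generate_river(n_steps=100, seed=None):
    """Random-walk river: walk one increment forward, flip a coin to turn left or right."""
    rng = random.Random(seed)
    x, y = 0, 0
    path = [(x, y)]
    for _ in range(n_steps):
        y += 1                                    # walk one increment forward
        x += 1 if rng.random() < 0.5 else -1      # coin flip: turn right or left
        path.append((x, y))
    return path
```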
Here are a few rivers this program draws
(@fig-generative_models-rivers1):
{width="80%" #fig-generative_models-rivers1}
This program relies on a sequence of coin flips to generate the path of
the river. In other words, the program took a randomized vector (noise)
as input, and converted this vector into an image of the path of the
river. It's exactly the same idea as we described previously for
generating images of birds, just this time the generator is a program
that makes rivers (@fig-generative_models-gen_model_of_rivers_diagram):
{width="75%" #fig-generative_models-gen_model_of_rivers_diagram}
This generator was written by hand. Next we will see generative models
that *learn* the program that synthesizes data.
## Learning Generative Models
How can we learn to synthesize images that look realistic? The machine
learning way to do this is to start with a training set of *examples* of
real images, $\{\mathbf{x}^{(i)}\}_{i=1}^N$. Recall that in supervised
learning, an *example* was defined as an {`input`, `output`} pair; here
things are actually no different, except that the `input` happens to be
empty. We feed these examples to a learner, which spits out a
generator function. Later, we may query the generator with new,
randomized $\mathbf{z}$-vectors to produce novel outputs, a process
called **sampling** from the model. This two-stage procedure is shown in
@fig-generative_models-gen_model_training_vs_sampling:
{width="90%" #fig-generative_models-gen_model_training_vs_sampling}
### What's the Objective of Generative Modeling?
The objective of the learner is to create a generator that produces
*synthetic* data, $\{\hat{\mathbf{x}}^{(i)}\}_{i=1}^N$, that looks like
the *real* data, $\{\mathbf{x}^{(i)}\}_{i=1}^N$. There are many ways
to define "looks like," and each results in a different kind of
generative model. Two examples are:
1. Synthetic data looks like real data if it matches the real data in
terms of certain marginal statistics, for example, it has the same
mean color as real photos, or the same color variance, or the same
edge statistics.
2. Synthetic data looks like real data if it has high probability under
a density model fit to the real data.
The first approach is the one we saw in @sec-stat_image_models on statistical image models,
where synthetic textures were made that had the same filter response
statistics as a real training example. This approach works well when we
only want to match certain properties of the training data. The second
approach, which is the main focus of this chapter, is better when we
want to produce synthetic data that matches *all* statistics of the
training examples.
:::{.column-margin}
In practice, we may not be able to match all statistics, due to
limits of the model's capacity.
:::
To be precise, the goal of the deep generative models we consider in
this chapter is to produce synthetic data that is *identically
distributed* as the training data, that is, we want
$\hat{\mathbf{x}} \sim p_{\texttt{data}}$ where $p_{\texttt{data}}$ is
the true process that produced the training data.
What if our model just memorizes all the training examples and generates
random draws from this memory? Indeed, memorized training samples
perfectly satisfy the goal of making synthetic data that looks real. A
second, sometimes overlooked, property of a good generative model is
that it be a simple, or smooth, function, so that it generalizes to
producing synthetic samples that look like the training data but are not
identical to it. A generator that only regurgitates the training data is
overfit in the same way a classifier that memorizes the training data is
overfit. In both cases, the true goal is to fit the training data and to
do so with a function that generalizes.
### The Direct Approach and the Indirect Approach
There are two general approaches to forming data generators:
1. Direct approach: learn the function
$G: \mathcal{Z} \rightarrow \mathcal{X}$.
2. Indirect approach: learn a function
$E: \mathcal{X} \rightarrow \mathbb{R}$ and generate samples by
finding values for $\mathbf{x}$ that score highly under this
function.
:::{.column-margin}
In the generative modeling literature, the direct approach is sometimes called an *implicit* model. This is because the probability density is never explicitly represented. However, note that this "implicit" model explicitly describes the way data is generated. To avoid confusion, we will not use this terminology.
:::
So far in this chapter we have focused on the direct approach,
and it is schematized in @fig-generative_models-gen_model_training_vs_sampling. Interestingly,
the direct approach only became popular recently, with models like
**generative adversarial networks** (**GANs**) and **diffusion models**,
which we will see later in this chapter. The indirect approach is more
classical, and describes many of the methods we saw in @sec-stat_image_models. Indirect approaches come in two
general flavors, density models and energy models, which we will
describe next. Both follow the schematic given in
@fig-generative_models-gen_model_training_vs_sampling_indirect:
{width="100%" #fig-generative_models-gen_model_training_vs_sampling_indirect}
## Density Models
Some generative models not only produce generators but also yield a
probability density function $p_{\theta}$, fit to the training data.
This density function may play a role in training the generator (e.g.,
we want the generator to produce samples from $p_{\theta}$), or the
density function may be the goal itself (e.g., we want to be able to
estimate the probability of datapoints in order to detect anomalies).
In fact, some generative models *only* produce a density $p_{\theta}$
and do not learn any explicit generator function. Instead, samples can
be drawn from $p_{\theta}$ using a sampling algorithm, such as **Markov
Chain Monte Carlo** (**MCMC**), that takes a density as input and
produces samples from it as output. This is a form of the indirect
approach to synthesizing data that we mentioned above
(@fig-generative_models-gen_model_training_vs_sampling_indirect).
### Learning Density Models
The objective of the learner for a density model is to output a density
function $p_{\theta}$ that is as close as possible to
$p_{\texttt{data}}$. How should we measure closeness? We can define a
**divergence** between the two distributions, $D$, and then solve the
following optimization problem:
$$\mathop{\mathrm{arg\,min}}_{p_\theta} D(p_{\theta}, p_{\texttt{data}})$$
The problem is that we do not actually have the function
$p_{\texttt{data}}$, we only have samples from this function,
$\mathbf{x} \sim p_{\texttt{data}}$. Therefore, we need a divergence $D$
that measures the distance between $p_{\theta}$ and
$p_{\texttt{data}}$ while only accessing samples from
$p_{\texttt{data}}$.
:::{.column-margin}
The divergence need not be a proper distance *metric*, and often
is not; it can be nonsymmetric, where $D(p,q) \neq D(q,p)$, and need
not satisfy the triangle inequality. In fact, a divergence is
defined by just two properties: nonnegativity and
$D(p,q) = 0 \iff p=q$.
:::
A common choice is to use the **Kullback-Leibler (KL) divergence**, which is defined as
follows:
$$\begin{aligned}
p_{\theta}^* &= \mathop{\mathrm{arg\,min}}_{p_\theta} \mathrm{KL}\left(p_{\text {data }} \| p_\theta\right)\\
&= \mathop{\mathrm{arg\,min}}_{p_\theta} \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}\Big[-\log \frac{p_{\theta}(\mathbf{x})}{p_{\texttt{data}}(\mathbf{x})}\Big]\\
&= \mathop{\mathrm{arg\,max}}_{p_\theta} \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}\big[\log p_{\theta}(\mathbf{x})\big] - \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}\big[\log p_{\texttt{data}}(\mathbf{x})\big]\\
&= \mathop{\mathrm{arg\,max}}_{p_\theta} \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}\big[\log p_{\theta}(\mathbf{x})\big] \quad \triangleleft \quad\text{dropped second term since no dependence on $p_{\theta}$}
\\
&\approx \mathop{\mathrm{arg\,max}}_{p_\theta} \frac{1}{N} \sum_{i=1}^N \log p_{\theta}(\mathbf{x}^{(i)})
\end{aligned}$${#eq-generative_models-max_likelihood} where the final line is an empirical estimate of the
expectation by sampling over the training dataset
$\{\mathbf{x}^{(i)}\}_{i=1}^N$. @eq-generative_models-max_likelihood is the *expected log
likelihood of the training data under the model's density function*.
:::{.column-margin}
$\mathrm{KL}\left(p_{\text {data }} \| p_\theta\right)$ is sometimes called the forward KL divergence, and it measures the probability of the data under the model distribution. The reverse KL divergence, $\mathrm{KL}\left(p_\theta \| p_{\text {data }}\right)$, does the converse, measuring the probability of the model's samples under the data distribution; because we do not have access to the true data distribution, the reverse KL divergence usually cannot be computed.
:::
Maximizing this objective is therefore a form of **max likelihood learning**. Pictorially we can
visualize the max likelihood objective as trying to push up the density over each observed datapoint:
{#fig-generative_models-max_likelihood_density width="100%"}
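The empirical objective in the final line of @eq-generative_models-max_likelihood is easy to compute in code. Here is a minimal sketch for a Gaussian model family, using synthetic stand-in data (all names and numbers are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=5000)   # stand-in samples from p_data = N(0, 1)

def avg_log_lik(data, mean, var):
    # empirical estimate of E_{x ~ p_data}[log p_theta(x)] for a Gaussian p_theta
    return np.mean(-0.5 * np.log(2 * np.pi * var) - (data - mean) ** 2 / (2 * var))
```

A candidate whose mean and variance match $p_{\texttt{data}}$ attains a higher average log-likelihood than a mismatched candidate.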
Remember that a probability density function (pdf) is a function
$p_{\theta}: \mathcal{X} \rightarrow [0,\infty)$ with
$\int_{\mathbf{x}} p_{\theta}(\mathbf{x})d\mathbf{x} = 1$ (i.e. it's
normalized). To learn a pdf, we will typically learn the parameters of a
family of pdfs. For example, in @sec-generative_models-gaussian_density_models, we will
cover learning the parameters of the Gaussian family of pdfs. All
members of such a family are normalized, nonnegative functions; this
way we do not need to add an explicit constraint that the learned
function have these properties: we are simply searching over a space of
functions *all of which* have these properties. This means that whenever
we push up density over datapoints, we are forced to sacrifice density
over other regions, so implicitly we are removing density from places
where there is no data, as indicated by the red regions in
@fig-generative_models-max_likelihood_density.
In the next section we will see an alternative approach where the
parametric family we search over need not be normalized.
## Energy-Based Models
Density models constrain the learned function to be normalized, that is,
$\int_{\mathbf{x}} p_{\theta}(\mathbf{x})d\mathbf{x} = 1$. This
constraint is often hard to realize. One approach is to learn an
unnormalized function $E_{\theta}$, then convert it to the normalized
density $p_{\theta} = \frac{e^{-E_{\theta}}}{Z(\theta)}$, where
$Z(\theta) = \int_{\mathbf{x}} e^{-E_{\theta}(\mathbf{x})}d\mathbf{x}$
is the normalizing constant. $Z(\theta)$ can be very expensive to
compute and often can only be approximated.
:::{.column-margin}
The parametric form $\frac{e^{-E_{\theta}}}{Z(\theta)}$ is sometimes referred to as a **Boltzmann** or **Gibbs distribution**.
:::
:::{.column-margin}
Notice that $Z$ is a function of model parameters $\theta$ but not of datapoint $\mathbf{x}$, since we integrate over all possible data values.
:::
:::{.column-margin}
Note that low energy $\Rightarrow$ high probability.
So we minimize energy to maximize probability.
:::
**Energy-based models** (**EBM**s) address this by simply skipping the
step where we normalize the density, and letting the output of the
learner just be $E_\theta$. Even though it is not a true probability
density, $E_{\theta}$ can still be used for many of the applications we
would want a density for. This is because we can compare *relative
probabilities* with $E_\theta$:
$$\begin{aligned} \frac{p_\theta(\mathbf{x}_1)}{p_\theta(\mathbf{x}_2)} = \frac{e^{-E_\theta(\mathbf{x}_1)}/Z(\theta)}{e^{-E_\theta(\mathbf{x}_2)}/Z(\theta)}
= \frac{e^{-E_\theta(\mathbf{x}_1)}}{e^{-E_\theta(\mathbf{x}_2)}}
\end{aligned}$$ Knowing relative probabilities is all that is required
for sampling (via MCMC), for outlier detection (the lowest-probability
datapoint in a set of datapoints is the outlier), and even
for optimizing over a space of data to find the datapoint that has maximum
probability (because
$\mathop{\mathrm{arg\,max}}_{\mathbf{x} \in \mathcal{X}} p_{\theta}(\mathbf{x}) = \mathop{\mathrm{arg\,max}}_{\mathbf{x} \in \mathcal{X}} -E_{\theta}(\mathbf{x})$).
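As a quick numerical illustration with a hypothetical energy of our own choosing, the ratio above can be evaluated without ever computing $Z(\theta)$:

```python
import numpy as np

def energy(x):
    return 0.5 * x ** 2        # hypothetical energy; lower energy = higher probability

x1, x2 = 0.0, 2.0
ratio = np.exp(-energy(x1)) / np.exp(-energy(x2))   # p(x1)/p(x2): Z cancels
```

Here the ratio is $e^2 \approx 7.39$: $\mathbf{x}_1$ is about seven times more probable than $\mathbf{x}_2$.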
To solve such a maximization problem, we might want to find the gradient
of the log probability density with respect to $\mathbf{x}$; it turns
out this gradient is identical to the gradient of $-E_{\theta}$ with
respect to $\mathbf{x}$! $$\begin{aligned}
\nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x}) &= \nabla_{\mathbf{x}} \log \frac{e^{-E_{\theta}(\mathbf{x})}}{Z(\theta)}\\
&= -\nabla_{\mathbf{x}} E_{\theta}(\mathbf{x}) - \nabla_{\mathbf{x}} \log Z(\theta)\\
&= -\nabla_{\mathbf{x}} E_{\theta}(\mathbf{x}) \quad \triangleleft \quad \text{$Z(\theta)$ does not depend on $\mathbf{x}$}
\end{aligned}$$ In sum, energy functions can do most of what probability
densities can do, except that they do not give normalized probabilities.
Therefore, they are insufficient for applications where communicating
probabilities is important for either interpretability or for
interfacing with downstream systems that require knowing true
probabilities.
:::{.column-margin}
A case where a density model might be
preferable over an energy model is a medical imaging system that needs
to communicate to doctors the likelihood that a patient has
cancer.
:::
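The gradient identity above is what makes MCMC sampling from an energy model practical. For example, (unadjusted) Langevin dynamics repeatedly steps along $-\nabla_{\mathbf{x}} E_{\theta}$ and adds noise. Here is a toy sketch with a hand-picked quadratic energy (all names and settings are our own; a real sampler would need careful tuning):

```python
import numpy as np

def grad_energy(x):
    # hypothetical quadratic energy E(x) = (x - 2)^2 / 2, whose density is N(2, 1)
    return x - 2.0

def langevin_sample(n_steps=300, step=0.1, seed=0):
    """Unadjusted Langevin dynamics: x <- x - step * grad E + sqrt(2 * step) * noise."""
    rng = np.random.default_rng(seed)
    x = 0.0
    for _ in range(n_steps):
        x = x - step * grad_energy(x) + np.sqrt(2 * step) * rng.normal()
    return x
```

For this energy the corresponding density is $\mathcal{N}(2,1)$, so long chains produce samples centered near $2$, even though we never computed $Z$.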
### Learning Energy-Based Models
Learning the parameters of an energy-based model is a bit different from
learning the parameters of a density model. In a density model, we
simply increase the probability mass over observed datapoints, and
because the density is a normalized function, this implicitly pushes
down the density assigned to regions where there is no observed data.
Since energy functions are not normalized, we need to add an explicit
negative term to push up energy where there are no datapoints, in
addition to the positive term of pushing down energy where the data is
observed. One way to do so is called **contrastive
divergence** @hinton2002training. On each iteration of optimization,
this method modifies the energy function to decrease the energy assigned
to samples from the data (positive term) and to increase the energy
assigned to samples from the model (i.e., samples from the energy
function itself; negative term), as shown in
@fig-generative_models-contrastive_divergence:
{width="100%" #fig-generative_models-contrastive_divergence}
Once the energy function perfectly adheres to the data, samples from the
model are identical to samples from the data and the positive and
negative terms cancel out. This should make intuitive sense because we
don't want the energy function to change once we have perfectly fit the
data. It turns out that mathematically this procedure is an
approximation to the gradient of the log likelihood function. Defining
$p_{\theta} = \frac{e^{-E_{\theta}}}{Z(\theta)}$, start by decomposing
the gradient of the log likelihood into two terms, which, as we will
see, will end up playing the role of positive and negative terms:
$$\begin{aligned}
\nabla_{\theta} \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}[\log p_{\theta}(\mathbf{x})] &= \nabla_{\theta} \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}[\log \frac{e^{-E_{\theta}(\mathbf{x})}}{Z(\theta)}]\\
&= -\mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}[\nabla_{\theta} E_{\theta}(\mathbf{x})] - \nabla_{\theta} \log Z(\theta)
\end{aligned}$${#eq-generative_models-cd}
The first term is the positive term gradient, which
tries to modify parameters to decrease the energy placed on data
samples. The second term is the negative term gradient; here it appears
as an intractable integral, so our strategy is to rewrite it as an
expectation, which can be approximated via sampling:
$$\begin{aligned}
\nabla_{\theta} \log Z(\theta) &= \frac{1}{Z(\theta)}\nabla_{\theta}Z(\theta) \\
\end{aligned}$${#eq-generative_models-grad_log_identity}
$$\begin{aligned}
\nabla_{\theta} \log Z(\theta) &= \frac{1}{Z(\theta)} \nabla_{\theta} \int_{\mathbf{x}} e^{-E_{\theta}(\mathbf{x})}d\mathbf{x} &&\quad\quad \triangleleft \quad \text{definition of $Z$}\\
&= \frac{1}{Z(\theta)} \int_{\mathbf{x}} \nabla_{\theta} e^{-E_{\theta}(\mathbf{x})}d\mathbf{x} &&\quad\quad \triangleleft \quad \text{exchange integral and gradient}\\
&= -\frac{1}{Z(\theta)} \int_{\mathbf{x}} e^{-E_{\theta}(\mathbf{x})} \nabla_{\theta} E_{\theta}(\mathbf{x})d\mathbf{x} &&\quad\quad \triangleleft \quad \text{chain rule}\\
&= -\int_{\mathbf{x}} \frac{e^{-E_{\theta}(\mathbf{x})}}{Z(\theta)} \nabla_{\theta} E_{\theta}(\mathbf{x})d\mathbf{x}\\
&= -\int_{\mathbf{x}} p_{\theta}(\mathbf{x}) \nabla_{\theta} E_{\theta}(\mathbf{x})d\mathbf{x} &&\quad\quad \triangleleft \quad \text{definition of $p_{\theta}$}\\
&= -\mathbb{E}_{\mathbf{x} \sim p_{\theta}}[\nabla_{\theta} E_{\theta}(\mathbf{x})] &&\quad\quad \triangleleft \quad \text{definition of expectation}
\end{aligned}$${#eq-generative_models-cd_neg}
:::{.column-margin}
In @eq-generative_models-grad_log_identity we use a very useful identity from the chain rule of calculus, which appears often in machine learning: $\nabla_{x} \log f(x) = \frac{1}{f(x)} \nabla_{x} f(x)$.
:::
Plugging @eq-generative_models-cd_neg back into
@eq-generative_models-cd, we arrive at:
$$\begin{aligned} \nabla_{\theta} \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}[\log p_{\theta}(\mathbf{x})] &= -\mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}[\nabla_{\theta}E_{\theta}(\mathbf{x})] + \mathbb{E}_{\mathbf{x} \sim p_{\theta}}[\nabla_{\theta} E_{\theta}(\mathbf{x})]
\end{aligned}$$
Both expectations can be approximated via sampling: defining $\mathbf{x}^{(i)} \sim p_{\texttt{data}}$ and
$\hat{\mathbf{x}}^{(i)} \sim p_{\theta}$, and taking $N$ such samples, we have
$$\begin{aligned}
-\mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}[\nabla_{\theta}E_{\theta}(\mathbf{x})] + \mathbb{E}_{\mathbf{x} \sim p_{\theta}}[\nabla_{\theta} E_{\theta}(\mathbf{x})] &\approx -\frac{1}{N} \sum_{i=1}^N \nabla_{\theta}E_{\theta}(\mathbf{x}^{(i)}) + \frac{1}{N} \sum_{i=1}^N \nabla_{\theta}E_{\theta}(\hat{\mathbf{x}}^{(i)})\\
&= \frac{1}{N} \nabla_{\theta} \sum_{i=1}^N (-E_{\theta}(\mathbf{x}^{(i)}) + E_{\theta}(\hat{\mathbf{x}}^{(i)}))
\end{aligned}$$
The last line should make clear the intuition: we establish a contrast between data samples and current model samples,
then update the model to decrease this contrast, bringing the model
closer to the data. Once $\mathbf{x}^{(i)}$ and $\hat{\mathbf{x}}^{(i)}$
are identically distributed, this gradient will be zero (in
expectation); we have perfectly fit the data and no further updates
should be taken.
Under our formalization of a learning problem as an objective,
hypothesis space, and optimizer, contrastive divergence is an
*optimizer*; it's an optimizer specifically built for max likelihood
objectives over energy functions. Contrastive divergence tells us how to
approximate the gradient of the likelihood function, which can then be
plugged into any gradient descent method.
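Here is a toy sketch of this gradient estimate for a one-parameter energy $E_\theta(x) = (x-\theta)^2/2$, so that $p_\theta = \mathcal{N}(\theta, 1)$ and we can replace the MCMC sampling step with exact sampling (all names and settings are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(5.0, 1.0, size=1000)           # "real" training data

# Toy energy E_theta(x) = (x - theta)^2 / 2, i.e., p_theta = N(theta, 1)
def grad_E(x, theta):
    return -(x - theta)                          # dE_theta / dtheta

theta, lr = 0.0, 0.5
for _ in range(100):
    # stand-in for MCMC: for this energy we can sample p_theta exactly
    model_samples = rng.normal(theta, 1.0, size=1000)
    # contrastive estimate of the log-likelihood gradient: -(positive term) + (negative term)
    grad = -grad_E(data, theta).mean() + grad_E(model_samples, theta).mean()
    theta += lr * grad                           # gradient ascent on the likelihood
```

$\theta$ converges to the data mean, at which point model samples are distributed like data samples and the two gradient terms cancel (in expectation).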
### Comparing Different Kinds of Generative Models
Some generative models give density or energy functions, others give
generator functions, and still others give both. We can summarize all
these kinds of models with the following learning diagram:
:::{.imagify}
\begin{tikzpicture}
\draw (0,1.2) rectangle (4.2,2.4); % outer box
\fill[black!20] (0.1,1.3) rectangle (4.1,2.3); % gray box
\node[] at (2.0,1.8) {{\bf Generative modeling}};
\node[] at (-1.5,2.4) {Data};
\node[] at (-1.5,1.8) {$\{x^{(i)}\}_{i=1}^N$};
\node[] at (-0.5,1.8) {{\Large $ \rightarrow$}};
\node[] at (6.4,2.9) {Density function};
\node[] at (6.4,2.3) {$p_{\theta}: \mathcal{X} \rightarrow [0,\infty)$};
\node[] at (9.4,2.9) {Energy function};
\node[] at (9.4,2.3) {$E_{\theta}: \mathcal{X} \rightarrow \mathbb{R}$};
\node[] at (6.4,1.5) {Generator};
\node[] at (6.4,0.9) {$g_{\theta}: \mathcal{Z} \rightarrow \mathcal{X}$};
\node[] at (4.7,1.7) {{\Large $ \rightarrow$}};
\end{tikzpicture}
:::
Each of these families of models has its own advantages and
disadvantages. Generators have latent variables $\mathbf{z}$ that
control the properties of the generated image. Changing these variables
can change the generated image in meaningful ways; we will explore this
idea in greater detail in the next chapter. In contrast, density and
energy models do not have latent variables that directly control
generated samples. Conversely, an advantage of density/energy
functions is that they provide scores related to the likelihood of data.
These scores (densities or energies) can be used to detect anomalies and
unusual events, or can be optimized to synthesize higher quality data.
Some of the generative models we will describe in this chapter and the
next are categorized along these dimensions in @tbl-generative_models-types_of_gen_model.
Note that this is a rough categorization, meant to reflect the most straightforward usage
of these models. With additional effort some of the ❌s can be converted
to ✅s; for example, one can do additional inference to sample from a
density model, or one can extract a lower-bound estimate of density from
a variational autoencoder (VAE) or diffusion model.
| Method | Latents? | Density/Energy? | Generator? |
| :---: | :---: | :---: | :---: |
| Energy-based models | ❌ | ✅ (energy only) | ❌|
| Gaussian | ❌ | ✅ | ❌|
| Autoregressive models | ❌ | ✅ | ✅ (slow) |
| Diffusion models | ✅ (high-dimensional) | ❌ | ✅ (slow) |
| GANs | ✅ | ❌ | ✅ |
| VAEs | ✅ | ❌ | ✅ |
: Three desirable properties in a generative model. No method achieves all three (without caveats). VAEs (see @sec-generative_modeling_and_representation_learning) and GANs are good at representation learning (i.e., at finding a low-dimensional latent space). Autoregressive models are great if you want to estimate the likelihood of your data points. Energy-based models can be an especially efficient way to model an unnormalized density. {#tbl-generative_models-types_of_gen_model .bordered}
Generative models can also be distinguished according to their
objectives, hypothesis spaces, and optimization algorithms. Some classes
of model, such as autoregressive models, refer to just a particular kind
of hypothesis space, whereas other classes of model, such as VAEs, are
much more specific in referring to the conjunction of an objective, a
general family of hypothesis spaces, and a particular optimization
algorithm.
In the next sections, and in the next chapter, we will dive into the
details of exactly how each of these models works.
## Gaussian Density Models {#sec-generative_models-gaussian_density_models}
One of the simplest and most useful density models is the Gaussian
distribution, which in one dimension (1D) is: $$\begin{aligned}
p_{\theta}(x) = \frac{1}{Z}e^{-(x-\theta_1)^2/(2\theta_2)}
\end{aligned}$$ This density has two parameters $\theta_1$
and $\theta_2$, which are the mean and variance of the distribution. The
normalization constant $Z$ ensures that the function is normalized. This
is the typical strategy in defining density models: create a
parameterized family of functions such that any function in the family
is normalized. Given such a family, we search over the parameters to
optimize a generative modeling objective.
For density models, the most common objective is maximum likelihood:
$$\begin{aligned}
\mathbb{E}_{x \sim p_{\texttt{data}}}[\log p_{\theta}(x)] \approx \frac{1}{N} \sum_{i=1}^N \log p_{\theta}(x^{(i)})
\end{aligned}$$ For a 1D Gaussian, this has a simple form:
$$\begin{aligned}
\frac{1}{N} \sum_{i=1}^N \log \frac{1}{Z}e^{-(x^{(i)}-\theta_1)^2/(2\theta_2)} = -\log(Z) - \frac{1}{N}\sum_{i=1}^N (x^{(i)}-\theta_1)^2/(2\theta_2)
\end{aligned}$$ Optimizing with respect to $\theta_1$ and $\theta_2$
could be done via gradient descent or random search, but in this case
there is also an analytical solution, which we can find by setting the gradient
to zero. For $\theta_1^*$ we have: $$\begin{aligned}
\frac{\partial \big(-\log Z - \frac{1}{N}\sum_{i=1}^N (x^{(i)}-\theta_1^*)^2/(2\theta_2^*)\big)}{\partial \theta_1^*} = 0\\
\frac{1}{N}\sum_{i=1}^N 2(x^{(i)}-\theta_1^*)/(2\theta_2^*) = 0\\
\sum_{i=1}^N x^{(i)} - \sum_{i=1}^N \theta_1^* = 0\\
\theta_1^* = \frac{1}{N}\sum_{i=1}^N x^{(i)}
\end{aligned}$$ For $\theta_2^*$ we need to note that $Z$ depends on
$\theta_2^*$; in particular, since $Z = \sqrt{2\pi\theta_2^*}$, we have
$\frac{\partial (-\log Z)}{\partial \theta_2^*} = -\frac{1}{2\theta_2^*}$:
$$\begin{aligned}
\frac{\partial \big(-\log Z - \frac{1}{N}\sum_{i=1}^N (x^{(i)}-\theta_1^*)^2/(2\theta_2^*)\big)}{\partial \theta_2^*} = 0\\
-\frac{1}{2\theta_2^*} + \frac{1}{N}\sum_{i=1}^N 2(x^{(i)}-\theta_1^*)^2/(2\theta_2^*)^2 = 0\\
-\theta_2^* + \frac{1}{N}\sum_{i=1}^N (x^{(i)}-\theta_1^*)^2 = 0\\
\theta_2^* = \frac{1}{N}\sum_{i=1}^N (x^{(i)}-\theta_1^*)^2
\end{aligned}$$ You might recognize the solutions for $\theta_1^*$ and
$\theta_2^*$ as the empirical mean and variance of the data,
respectively. This makes sense: we have just shown that to maximize the
probability of the data under a Gaussian, we should set the mean and
variance of the Gaussian to be the empirical mean and variance of the
data.
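As a quick sanity check, we can compute these closed-form estimates directly. Here is a minimal sketch in Python; the dataset is made up for illustration:

```python
import math

# Toy dataset (made up for illustration).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(data)

# Closed-form maximum likelihood estimates derived above:
theta_1 = sum(data) / N                              # empirical mean
theta_2 = sum((x - theta_1) ** 2 for x in data) / N  # empirical variance

def avg_log_lik(mean, var, xs):
    # Average log-density of xs under a Gaussian with this mean and variance.
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (x - mean) ** 2 / (2 * var) for x in xs) / len(xs)

print(theta_1, theta_2)  # prints: 5.0 4.0

# The MLE should score at least as well as nearby parameter settings.
assert avg_log_lik(theta_1, theta_2, data) > avg_log_lik(theta_1 + 0.5, theta_2, data)
assert avg_log_lik(theta_1, theta_2, data) > avg_log_lik(theta_1, theta_2 + 1.0, data)
```

The two assertions probe the optimum numerically: perturbing either parameter away from the empirical mean and variance lowers the average log-likelihood.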
This fully describes the learning problem, and solution, for a 1D
Gaussian density model. We can put it all together in the learning
diagram below:
{width="80%"}
Gaussian density models are just about the simplest density models one
can come up with. You may be wondering, do we actually use them for
anything in computer vision, or are they just a toy example? The answer
is that *yes* we do use them---in fact, we use them all the time. For
example, in least-squares regression, we are simply fitting a Gaussian
density to the conditional probability $p(Y \bigm | X)$. If we want a
more complicated density, we may use a mixture of multiple Gaussian
distributions, called a **Gaussian mixture model** (**GMM**), which we
will encounter in the next chapter. It's useful to get comfortable with
Gaussian fits because (1) they are a subcomponent of many more
sophisticated models, and (2) they showcase all the key components of
density modeling, with a clear objective, a parameterized hypothesis
space, and an optimizer that finds the parameters that maximize the
objective.
## Autoregressive Density Models {#sec-generative_models-autoregressive}
A single Gaussian is a very limited model, and the real utility of
Gaussians only shows up when they are part of a larger modeling
framework. Next we will consider a recipe for building highly expressive
models out of simple ones. There are many such recipes and the one we
focus on here is called an **autoregressive model**.
The idea of an autoregressive model is to synthesize an image pixel by
pixel. Each new pixel is decided on based on the sequence already
generated. You can think of this as a simple sequence prediction
problem: given a sequence of observed pixels, predict the color of the
next one. We use the same learned function $f_{\theta}$ to make each
subsequent prediction.
@fig-generative_models-autoregressive_prediction_schematic shows this
setup. We predict each next pixel from the partial image already
completed. The red bordered region is the **context** the prediction is
based on. This context could be the entire image synthesized so far or
it could be a smaller region, like shown here. The green bordered pixel
is the one we are predicting given this context. In this example, we
always predict the bottom-center pixel in the context window. After
making our prediction, we decide what color pixel to add to the image
based on that prediction; we may add the color predicted as most likely,
or we might sample from the predicted distribution so that we get some
randomness in our completion. Then, to predict the next missing pixel we
slide the context over by one and repeat.
{width="100%" #fig-generative_models-autoregressive_prediction_schematic}
You might have noticed that this setup looks very similar to the
Efros-Leung texture synthesis algorithm in @sec-Efros-Leung_texture, and indeed that was also an
autoregressive model. The Efros-Leung algorithm was a nonparametric
method that worked by stealing pixels from the matching regions of an
exemplar texture. Now we will see how to do autoregressive modeling with
a learned, parametric prediction function $f_{\theta}$, such as a deep
net.
These models can be easily understood by first considering the problem
of synthesizing one pixel, then two, and so on. The first observation to
make is that it's pretty easy to synthesize a single grayscale pixel.
Such a pixel can take on 256 possible values (for a standard 8-bit
grayscale image). So it suffices to use a categorical distribution to
represent the probability that the pixel takes on each of these possible
values. The categorical distribution is fully expressive: any possible
probability mass function (pmf) over the 256 values can be represented
with the categorical distribution. Fitting this categorical distribution
to training data just amounts to counting how often we observe each of
the 256 values in the training set pixels, and normalizing by the total
number of training set pixels. So, we know how to model one grayscale
pixel. We can sample from this distribution to synthesize a random
one-pixel image.
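Fitting this one-pixel model really is just counting. A toy sketch, using 8 gray levels instead of 256 and a made-up set of training pixels to keep it readable:

```python
from collections import Counter

# Made-up "training set" of single pixels, each in {0, ..., 7}
# (8 gray levels instead of 256 to keep the example small).
train_pixels = [0, 1, 1, 3, 3, 3, 3, 7]

# Fit the categorical distribution: count each value, then normalize
# by the total number of training pixels.
counts = Counter(train_pixels)
pmf = [counts[v] / len(train_pixels) for v in range(8)]

print(pmf)  # prints: [0.125, 0.25, 0.0, 0.5, 0.0, 0.0, 0.0, 0.125]
assert abs(sum(pmf) - 1.0) < 1e-9  # a valid pmf sums to one
```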
How do we model the distribution over a second grayscale pixel given the
first? In fact, we already know how to model this; mathematically, we
are just trying to model $p(\mathbf{x_2} \bigm | \mathbf{x_1})$ where
$\mathbf{x_1}$ is the first pixel and $\mathbf{x_2}$ is the second.
Treating $\mathbf{x}_2$ as a categorical variable (just like
$\mathbf{x_1}$), we can simply use a softmax regression, which we saw in
@sec-intro_to_learning-image_classification. In that section
we were modeling a $K$-way categorical distribution over $K$ object
classes, conditioned on an input image. Now we can use exactly the same
tools to model a $256$-way distribution over the second pixel in a
sequence conditioned on the first.
:::{.column-margin}
In this section we index the first pixel as $\mathbf{x}_1$ rather than $\mathbf{x}_0$.
:::
What about the third pixel, conditioned on the first two? Well, this is
again a problem of the same form: a 256-way softmax regression
conditioned on some observations. Now you can see the induction:
modeling each next pixel in the sequence is a softmax regression problem
that models
$p(\mathbf{x}_n \bigm | \mathbf{x}_1, \ldots, \mathbf{x}_{n-1})$. We
show how the cross-entropy loss can be computed in
@fig-generative_models-autoregressive_softmax_regression. Notice that
it looks almost identical to @fig-softmax_regression_diagram from
@sec-intro_to_learning.
{width="100%" #fig-generative_models-autoregressive_softmax_regression}
If we have color images, we need to predict three values per pixel, one
for each of the red, green, and blue channels. One way to do this is to
predict and sample the values for these three channels in sequence:
first predict the red value as a 256-way classification problem, then
sample the red value you will use, then predict the green value as a
256-way classification problem, and so on. You might be wondering, how
do we turn an image into a sequence of pixels? Good question! There are
innumerable ways, but the simplest is often good enough: just vectorize
the two-dimensional (2D) grid of pixels by first listing the first row
of pixels, then the second row, and so forth. In general, any fixed
ordering of the pixels into a sequence is actually valid, but this
simple method is perhaps the most common.
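In code, this raster-order vectorization is a one-liner; here it is on a made-up $2 \times 3$ image:

```python
# Raster-order (row-major) vectorization of a toy 2x3 image:
# first list the first row of pixels, then the second, and so on.
image = [[1, 2, 3],
         [4, 5, 6]]

sequence = [pixel for row in image for pixel in row]
print(sequence)  # prints: [1, 2, 3, 4, 5, 6]
```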
So we can model the probability of each subsequent pixel given the
preceding pixels. To generate an image we can sample a value for the
first pixel, then sample the second given the first, then the third
given the first and second, and so forth. But is this a valid model of
$p(\mathbf{X}) = p(\mathbf{x}_1, \ldots, \mathbf{x}_n)$, the probability
distribution of the full set of pixels? Does this way of sequential
sampling draw a valid sample from $p(\mathbf{X})$? It turns out it does,
according to the **chain rule of probability**. This rule allows us to
factorize *any* joint distribution into a product of conditionals as
follows:
$$\begin{aligned}
p(\mathbf{X}) &= p(\mathbf{x_n} \bigm | \mathbf{x_1}, \ldots, \mathbf{x}_{n-1})p(\mathbf{x}_{n-1} \bigm | \mathbf{x}_1, \ldots, \mathbf{x}_{n-2}) \quad \ldots \quad p(\mathbf{x}_2 \bigm | \mathbf{x}_1)p(\mathbf{x}_1)\\
p(\mathbf{X}) &= \prod_{i=1}^n p(\mathbf{x}_i \bigm | \mathbf{x}_1, \ldots, \mathbf{x}_{i-1})
\end{aligned}$${#eq-generative_models-autoregressive_likelihood}
:::{.column-margin}
As a notational convenience, we define here that $p(\mathbf{x}_i \bigm | \mathbf{x}_1, \ldots, \mathbf{x}_{i-1}) = p(\mathbf{x}_1)$ when $i=1$.
:::
This factorization demonstrates that sampling from such
a model can indeed be done in sequence, because each conditional
distribution depends only on the pixels sampled before it.
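The chain rule factorization can be verified numerically on a tiny case. Below is a sketch with a made-up joint distribution over two binary pixels:

```python
# A made-up joint pmf over two binary pixels (x1, x2).
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# Marginal p(x1) and conditional p(x2 | x1), both derived from the joint.
p_x1 = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}
cond = {(a, b): joint[(a, b)] / p_x1[a] for a in (0, 1) for b in (0, 1)}

# Chain rule: p(x1, x2) = p(x1) * p(x2 | x1) for every configuration,
# so sampling x1 first and then x2 given x1 draws from the joint.
for (a, b), p in joint.items():
    assert abs(p - p_x1[a] * cond[(a, b)]) < 1e-12
```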
### Training an Autoregressive Model
To train an autoregressive model, you just need to extract supervised
pairs of desired input-output behavior, as usual. For an autoregressive
model of pixels, that means extracting sequences of pixels
$\mathbf{x}_1, \ldots, \mathbf{x}_{n-1}$ and corresponding observed next
pixel $\mathbf{x}_n$. These can be extracted by traversing training
images in raster order. The full training and testing setup looks like
this (@fig-generative_models-autoregressive_train_predict):
{width="100%" #fig-generative_models-autoregressive_train_predict}
It's worth noting here two different ways of setting up the training
batches, one of which is much more efficient than the other. The first
way is to create a training batch that looks like this (we will call this a
*type 1 batch*): $$\begin{aligned} &\texttt{input}: \mathbf{x}_k^{(i)}, \dots, \mathbf{x}_{k+n-1}^{(i)} \quad\quad \texttt{target output}: \mathbf{x}_{k+n}^{(i)}\\
&\texttt{input}: \mathbf{x}_l^{(j)}, \dots, \mathbf{x}_{l+n-1}^{(j)} \quad\quad \texttt{target output}: \mathbf{x}_{l+n}^{(j)}\\
&\ldots\nonumber
\end{aligned}$$ that is, we sample supervised examples (\[sequence,
completion\] pairs) that each come from a different random starting
location (indexed by $k$ and $l$) in a different random image (indexed
by $i$ and $j$).
The other way to set up the batches is like this (we will call this a
*type 2 batch*): $$\begin{aligned}
&\texttt{input}: \mathbf{x}_1^{(i)}, \dots, \mathbf{x}_{n-1}^{(i)} \quad\quad \texttt{target output}: \mathbf{x}_n^{(i)}\\
&\texttt{input}: \mathbf{x}_2^{(i)}, \dots, \mathbf{x}_{n}^{(i)} \quad\quad \texttt{target output}: \mathbf{x}_{n+1}^{(i)}\\
&\ldots\nonumber
\end{aligned}$$ that is, the example sequences overlap. This second way
can be much more efficient. The reason is that in order to predict
$\mathbf{x}_n$ from $\mathbf{x}_1, \dots, \mathbf{x}_{n-1}$, we
typically have to compute representations of
$\mathbf{x}_1, \dots, \mathbf{x}_{n-1}$, and these same representations
can be reused for predicting the next item over, that is,
$\mathbf{x}_{n+1}$ from $\mathbf{x}_2, \dots, \mathbf{x}_n$. As a
concrete example, this is the case in transformers with causal attention
(@sec-transformers-masked_attention), which have the property
that the representation of item $\mathbf{x}_{n}$ only depends on the
items that preceded it in the sequence. What this allows us to do is
*share computation between all our overlapping predictions*, and this is
why it makes sense to use type 2 training batches. Notice these are the
same kind of batches we described in the causal attention section (see
@eq-transformers-causal_training_batches).
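A sketch of how a type 2 batch can be built from one raster-ordered training image; the pixel values and context length here are made-up assumptions:

```python
# Build overlapping (context, target) pairs -- a "type 2" batch -- from a
# single raster-ordered pixel sequence, using a context of n - 1 = 3 pixels.
pixels = [10, 20, 30, 40, 50, 60]  # toy sequence from one training image
context_len = 3

batch = [(pixels[k:k + context_len], pixels[k + context_len])
         for k in range(len(pixels) - context_len)]

for context, target in batch:
    print(context, "->", target)
# prints:
# [10, 20, 30] -> 40
# [20, 30, 40] -> 50
# [30, 40, 50] -> 60
```

Every element of the sequence except the first `context_len` serves as a target exactly once, which is what lets a causal model share computation across all these predictions.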
### Sampling from Autoregressive Models
Autoregressive models give us an explicit density function,
@eq-generative_models-autoregressive_likelihood. To sample from this
density we use **ancestral sampling**, which is the process described previously, of sampling
the first pixel from $p(\mathbf{x}_1)$, then, conditioned on this pixel,
sampling the second from $p(\mathbf{x}_2 \bigm | \mathbf{x}_1)$, and so
forth. Since each of these densities is a categorical distribution,
sampling is easy: one option is to partition a unit line segment into
regions of length equal to the categorical probabilities and see where a
uniform random draw along this line falls. Autoregressive models do not
have latent variables $\mathbf{z}$, which makes them incompatible with
applications that involve extracting or manipulating latent variables.
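That unit-interval scheme is just inverse-CDF sampling for a categorical distribution; a minimal sketch:

```python
import itertools
import random

def sample_categorical(pmf, u=None):
    # Partition the unit interval into segments of length pmf[v]; return
    # the index of the segment that a uniform draw u falls into.
    if u is None:
        u = random.random()
    cdf = itertools.accumulate(pmf)  # running sums of the probabilities
    for value, threshold in enumerate(cdf):
        if u < threshold:
            return value
    return len(pmf) - 1  # guard against floating-point round-off

pmf = [0.1, 0.6, 0.3]
assert sample_categorical(pmf, u=0.05) == 0  # falls in [0, 0.1)
assert sample_categorical(pmf, u=0.50) == 1  # falls in [0.1, 0.7)
assert sample_categorical(pmf, u=0.95) == 2  # falls in [0.7, 1.0)
```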
## Diffusion Models {#sec-generative_models-diffusion_models}
The strategy of autoregressive models is to break a hard problem into
lots of simple pieces. **Diffusion models** are another class of model
that uses this same strategy @sohl2015deep. They can be easy to
understand if we start by looking at what autoregressive models do in
the *reverse* direction: starting with a complete image, they remove one
pixel, then the next, and the next
(@fig-generative_models-reverse_autoregressive_sequence):
{width="100%" #fig-generative_models-reverse_autoregressive_sequence}
This is a *signal corruption* process. The idea of diffusion models is
that this is not the only corruption process we could have used, and
maybe not the best. Diffusion models instead use the following
corruption process: starting with an uncorrupted image, $\mathbf{x}_0$,
they add noise $\mathbf{\epsilon}_0$ to this image, resulting in a noisy
version of the image, $\mathbf{x}_1$. Then they repeat this process,
adding noise $\mathbf{\epsilon}_1$ to produce an even noisier signal
$\mathbf{x}_2$, and so on. Most commonly, the noise is isotropic
Gaussian noise. @fig-generative_models-forward_diffusion_process shows
what that looks like:
{width="100%" #fig-generative_models-forward_diffusion_process}
After $T$ steps of this process, the image $\mathbf{x}_T$ looks like
pure noise, if $T$ is large enough. In fact, if we use the following
noise process, then
$\mathbf{x}_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ as
$T \rightarrow \infty$ (which follows from equation 4 in
@ho2020denoising). $$\begin{aligned}
\mathbf{x}_t &= \sqrt{(1-\beta_t)}\mathbf{x}_{t-1} + \sqrt{\beta_t}\mathbf{\epsilon}_t\\
\epsilon_t &\sim \mathcal{N}(\mathbf{0},\mathbf{I})
\end{aligned}$$ The $\beta_t$ values should lie in $(0, 1]$ and can be fixed
for all $t$ (like in the previous equations) or can be set according to
some schedule that changes the amount of noise added over time.
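The forward noising process can be sketched on a toy constant "image"; the step count and fixed $\beta$ used here are assumptions for illustration. After many steps, the samples become statistically indistinguishable from $\mathcal{N}(0, 1)$ noise:

```python
import math
import random

def forward_diffusion(x0, T, beta):
    # x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps_t,  eps_t ~ N(0, 1),
    # applied elementwise for T steps.
    x = list(x0)
    for _ in range(T):
        x = [math.sqrt(1 - beta) * v + math.sqrt(beta) * random.gauss(0, 1)
             for v in x]
    return x

random.seed(0)
x0 = [1.0] * 5000  # a constant toy "image" of 5,000 pixels
xT = forward_diffusion(x0, T=200, beta=0.05)

# The signal is destroyed: mean ~ 0 and variance ~ 1, as for pure noise.
mean = sum(xT) / len(xT)
var = sum((v - mean) ** 2 for v in xT) / len(xT)
assert abs(mean) < 0.1 and abs(var - 1.0) < 0.1
```

Note how the $\sqrt{1-\beta_t}$ factor shrinks the signal while the injected noise keeps the total variance near one, which is what drives $\mathbf{x}_T$ toward a standard Gaussian.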
Now, just like autoregressive models, diffusion models train a neural
net, $f_{\theta}$, to *reverse* this process. It can be trained via
supervised learning on example sequences of different images getting
noisier and noisier. For example, as shown below in
@fig-generative_models-reverse_diffusion_process, the sequence of
images of the bird getting noisier and noisier can be flipped in time
and thereby provide a sequence of *training examples* of the mapping
$\mathbf{x}_t \rightarrow \mathbf{x}_{t-1}$, and these examples can be
fit by a predictor $f_{\theta}$.
{width="100%" #fig-generative_models-reverse_diffusion_process}
We call $f_{\theta}$ a **denoiser**; it learns to remove a little bit of
noise from an image. If we apply this denoiser over and over, starting
with pure noise, the process should coalesce on a noise-less image that
looks like one of our training examples. The steps to train a diffusion
model are therefore as follows:
1. Generate training data by corrupting a bunch of images (forward
process; noising).
2. Train a neural net to invert each step of corruption (reverse
process; denoising).
As an additional trick, it can help to let the denoising function
observe time step $t$ as input, so that we have, $$\begin{aligned}
\hat{\mathbf{x}}_{t-1} = f_{\theta}(\mathbf{x}_t, t)
\end{aligned}$$ This can help the denoiser to make better predictions,
because knowing $t$ helps us know if we are looking at a normal scene
that has been corrupted by our noise (if $t$ is large this is likely) or
a scene where the actual physical structure is full of chaotic, rough
textures that happen to look like noise (if $t$ is small this is more
likely, because for small $t$ we haven't added much noise to the scene
yet). @alg-generative_models-diffusion_model presents these steps in more formal detail.
{width="100%" #alg-generative_models-diffusion_model}
In @alg-generative_models-diffusion_model we train the denoiser
using a loss $\mathcal{L}$, which could be the $L_2$ distance. In many
diffusion models, $f_{\theta}$ is instead trained to output the
parameters (mean and variance) of a Gaussian density model fit to the
data. This formulation yields useful probabilistic interpretations (in
fact, such a diffusion model can be framed as a variational autoencoder,
which we will see in @sec-generative_modeling_and_representation_learning).
However, diffusion models can still work well without these nice
interpretations, instead using a wide variety of prediction models for
$f_{\theta}$ @bansal2022cold.
One useful trick for training diffusion models is to reparameterize the
learning problem as predicting the noise rather than the
signal @ho2020denoising: $$\begin{aligned}
f_{\theta}(\mathbf{x}_t, t) &= \hat{\mathbf{\epsilon}}_t &\quad\quad \triangleleft \quad\text{first predict the noise}\\
\hat{\mathbf{x}}_{t-1} &= \mathbf{x}_t - \hat{\mathbf{\epsilon}}_t &\quad\quad \triangleleft \quad\text{then remove this noise}
\end{aligned}$$
One way to look at diffusion models is that we want to learn a mapping
from pure noise (e.g.,
$\mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$) to data. It is
very hard to figure out how to create structured data out of noise, but
it is easy to do the reverse, turning data into noise. So diffusion
models turn images into noise in order to *create the training
sequences* for a process that turns noise into data.
## Generative Adversarial Networks {#sec-generative_models-GANs}
Autoregressive models and diffusion models sample simple distributions
step by step to build up a complex distribution. Could we instead create
a system that directly, in one step, outputs samples from the complex
distribution? It turns out we can, and one model that does this is the
**generative adversarial network** (**GAN**), which was introduced by
@goodfellow2014generative.
Recall that the goal of generative modeling is to make synthetic data
that looks like real data. We stated previously that there are many
different ways to measure "looks like" and each leads to a different
kind of generative model. GANs take a very direct and intuitive
approach: synthetic data looks like real data if a classifier cannot
distinguish between synthetic images and real images.
GANs consist of two neural networks, the **generator**,
$g_{\theta}: \mathcal{Z} \rightarrow \mathcal{X}$, which synthesizes
data from noise, and a **discriminator**,
$d_{\phi}: \mathcal{X} \rightarrow \Delta^1$, which tries to classify
between synthesized data and real data.
The networks $g_{\theta}$ and $d_{\phi}$ play an adversarial game in which
$g_{\theta}$ tries to become better and better at generating synthetic
images while $d_{\phi}$ tries to become better and better at detecting
any errors $g_{\theta}$ is making. The learning problem can be written
as a minimax game: $$\begin{aligned}
\mathop{\mathrm{arg\,min}}_\theta\max_\phi \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}[\log d_{\phi}(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log (1-d_{\phi}(g_{\theta}(\mathbf{z})))]
\end{aligned}$${#eq-generative_models-GAN_learning_problem}
Schematically, the generator synthesizes images that are then fed as
input to the discriminator. The discriminator tries to assign a high
score to real images from the training set (classifying them as real) and a low
score to generated images (classifying them as synthetic),
as shown in @fig-generative_models-generative_models-gan_schematic.
{width="100%" #fig-generative_models-generative_models-gan_schematic}
Although we call this an adversarial game, the discriminator is in fact
helping the generator to perform better and better, by pointing out its
current errors (we call it an adversary because it tries to point out
errors). You can think of the generator as a student taking a painting
class and the discriminator as the teacher. The student is trying to
produce new paintings that match the quality and style of the teacher.
At first the student paints flat landscapes that lack shading and the
illusion of depth; the teacher gives feedback: "This mountain is not
well shaded; it looks 2D." So the student improves and corrects the
error, adding haze and shadows to the mountain. The teacher is pleased
but now points out a different error: "The trees all look identical;
there is not enough variety." The teacher and student continue on in
this fashion until the student has succeeded at satisfying the teacher.
Eventually, in theory, the student---the generator---produces paintings
that are just as good as the teacher's paintings.
This objective may be easier to understand if we think of the objectives
for $g_{\theta}$ and $d_{\phi}$ separately. Given a particular generator
$g_{\theta}$, $d_{\phi}$ tries to maximize its ability to discriminate
between real and synthetic images (synthetic images are anything output
by $g_{\theta}$). The objective of $d_{\phi}$ is logistic regression
between a set of real data $\{\mathbf{x}^{(i)}\}_{i=1}^N$ and synthetic
data $\{\hat{\mathbf{x}}^{(i)}\}_{i=1}^N$, where
$\hat{\mathbf{x}}^{(i)} = g_{\theta}(\mathbf{z}^{(i)})$.
Let the optimal discriminator be labeled $d_{\phi}^*$. Then we have,
$$\begin{aligned} d_{\phi}^* = \mathop{\mathrm{arg\,max}}_{\phi} \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}}[\log d_{\phi}(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log (1-d_{\phi}(g_{\theta}(\mathbf{z})))]
\end{aligned}$${#eq-generative_models-GAN_optimal_D}
Now we turn to $g_{\theta}$'s perspective. Given $d_{\phi}^*$,
$g_{\theta}$ tries to solve the following problem: $$\begin{aligned} \mathop{\mathrm{arg\,min}}_{\theta} \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log (1-d_{\phi}^*(g_{\theta}(\mathbf{z})))]
\end{aligned}$${#eq-generative_models-GAN_optimal_G}
Now, because the optimal discriminator $d_{\phi}^*$ depends on the
current behavior of $g_{\theta}$, as soon as we *change* $g_{\theta}$,
updating it to better fool $d_{\phi}^*$, then $d_{\phi}^*$ no longer is
the optimal discriminator and we need to again solve problem in
@eq-generative_models-GAN_optimal_D. To optimize a GAN, we simply
alternate between taking one gradient step on the objective in
@eq-generative_models-GAN_optimal_G and then $K$ gradient steps on the
objective in @eq-generative_models-GAN_optimal_D, where the larger the
$K$, the closer we are to approximating the true $d_{\phi}^*$. In
practice, setting $K=1$ is often sufficient.
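Here is a toy sketch of this alternating optimization on 1D data. Everything in it, from the data distribution to the linear generator and logistic discriminator, is an illustrative assumption; also, the generator uses the commonly used non-saturating update (ascending $\log d_{\phi}(g_{\theta}(\mathbf{z}))$) rather than the minimax form above:

```python
import math
import random

random.seed(0)

def sigmoid(a):
    return 1 / (1 + math.exp(-a))

# Real data ~ N(3, 1). Generator g(z) = z + theta shifts unit Gaussian noise;
# discriminator d(x) = sigmoid(phi1 * x + phi0) is a logistic regressor.
theta, phi0, phi1 = 0.0, 0.0, 0.0
lr, batch, K = 0.05, 64, 1

for step in range(2000):
    # --- K discriminator ascent steps on the GAN objective ---
    for _ in range(K):
        reals = [random.gauss(3, 1) for _ in range(batch)]
        fakes = [random.gauss(0, 1) + theta for _ in range(batch)]
        g0 = g1 = 0.0
        for x in reals:  # d/da log sigmoid(a) = 1 - sigmoid(a)
            err = 1 - sigmoid(phi1 * x + phi0)
            g0 += err
            g1 += err * x
        for x in fakes:  # d/da log(1 - sigmoid(a)) = -sigmoid(a)
            err = -sigmoid(phi1 * x + phi0)
            g0 += err
            g1 += err * x
        phi0 += lr * g0 / (2 * batch)
        phi1 += lr * g1 / (2 * batch)

    # --- one generator step: ascend log d(g(z)) with respect to theta ---
    zs = [random.gauss(0, 1) for _ in range(batch)]
    grad = sum((1 - sigmoid(phi1 * (z + theta) + phi0)) * phi1 for z in zs)
    theta += lr * grad / batch

print(round(theta, 2))  # theta should drift toward the data mean, 3
```

The dynamics match the painting-class analogy: the discriminator's logit points from the fake distribution toward the real one, and the generator follows that gradient until the two distributions roughly coincide, at which point the discriminator's signal fades.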
### GANs are Statistical Image Models
GANs are related to the image and texture models we saw in @sec-stat_image_models.
For example, in the Heeger-Bergen model
(@sec-Heeger_Bergen; @RG-Heeger-Bergen95), we synthesize
images with the same statistics as a source texture. This can be phrased
as an optimization problem in which we optimize image pixels until
certain statistics of the images match those same statistics measured on
a source (training) image. In the Heeger-Bergen model the objective is
to match subband histograms. In the language of GANs, the loss that
checks for such a match is a kind of discriminator; it outputs a score
related to the difference between a generated image's statistics and the
real image's statistics. However, unlike a GAN, this discriminator is
hand-defined in terms of certain statistics of interest rather than
learned. Additionally, GANs amortize the optimization via learning. That
is, GANs learn a mapping $g_{\theta}$ from latent noise to samples
rather than arriving at samples via an optimization process that starts
from scratch each time we want to make a new sample.
## Concluding Remarks
Generative models are models that synthesize data. They can be useful
for content creation---artistic images, video game assets, and so
on---but also are useful for much more. In the next chapters we will see
that generative models also learn good *image representations*, and
conditional generative models can be viewed as a general framework for
answering questions and making predictions. As Richard Feynman famously
wrote, "What I cannot create, I do not understand." By modeling
creation, generative models also help us understand the visual world
around us.
:::{.column-margin}
https://en.wikiquote.org/wiki/Richard_Feynman
:::