Sohl-Dickstein used the principles of diffusion to develop an algorithm for generative modeling. The idea is simple: The algorithm first turns complex images in the training data set into simple noise — akin to going from a blob of ink to diffuse light blue water — and then teaches the system how to reverse the process, turning noise into images.

Here’s how it works. First, the algorithm takes an image from the training set. As before, let’s say that each of the million pixels has some value, and we can plot the image as a dot in million-dimensional space. The algorithm adds some noise to each pixel at every time step, equivalent to the diffusion of ink after one small time step. As this process continues, the values of the pixels bear less of a relationship to their values in the original image, and the pixels look more like a simple noise distribution. (The algorithm also nudges each pixel value a smidgen toward the origin, the zero value on all those axes, at each time step. This nudge prevents pixel values from growing too large for computers to easily work with.)
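
To make the noising step concrete, here is a minimal sketch in Python. It is not Sohl-Dickstein's code: the four-pixel "image", the constant noise level `beta`, and the function names are all assumptions chosen for illustration. Each step multiplies every pixel by a factor slightly below 1 (the nudge toward the origin) and then adds a little Gaussian noise.

```python
import math
import random

def forward_step(pixels, beta, rng):
    """One noising step: shrink each value toward 0, then add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)   # the small nudge toward the origin
    spread = math.sqrt(beta)       # standard deviation of the added noise
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in pixels]

def forward_process(pixels, betas, rng):
    """Apply the noising step once per entry in betas, keeping every stage."""
    trajectory = [list(pixels)]
    for beta in betas:
        trajectory.append(forward_step(trajectory[-1], beta, rng))
    return trajectory

rng = random.Random(0)
image = [0.9, -0.4, 0.1, 0.7]      # hypothetical 4-pixel "image"
betas = [0.02] * 500               # assumed constant noise schedule
trajectory = forward_process(image, betas, rng)

print("original:", trajectory[0])
print("after 500 steps:", trajectory[-1])
```

With these assumed settings, the final values carry essentially no trace of the starting pixels, which is the point of the forward process.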

Do this for all images in the data set, and an initial complex distribution of dots in million-dimensional space (which cannot be described and sampled from easily) turns into a simple, normal distribution of dots around the origin.
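
The same effect can be checked on a toy data set. The sketch below assumes a one-dimensional stand-in for the "complex distribution" (two tight clusters of points) and an assumed constant noise level. After a few hundred noising steps, the cloud of points should have a mean near 0 and a variance near 1, the signature of a standard normal distribution.

```python
import math
import random
import statistics

rng = random.Random(1)

# Hypothetical "complex" data: two tight clusters, nothing like a bell curve.
data = [rng.choice([-3.0, 3.0]) + rng.gauss(0.0, 0.1) for _ in range(2000)]

beta = 0.02                        # assumed per-step noise variance
steps = 400
keep, spread = math.sqrt(1.0 - beta), math.sqrt(beta)

# Run the same noising step on every point in the data set.
points = list(data)
for _ in range(steps):
    points = [keep * x + spread * rng.gauss(0.0, 1.0) for x in points]

start_var = statistics.pvariance(data)
end_mean = statistics.fmean(points)
end_var = statistics.pvariance(points)
print(f"variance before: {start_var:.2f}")
print(f"mean after: {end_mean:.2f}, variance after: {end_var:.2f}")
```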

“The sequence of transformations very slowly turns your data distribution into just a big noise ball,” said Sohl-Dickstein. This “forward process” leaves you with a distribution you can sample from with ease.

Next is the machine learning part: Give a neural network the noisy images obtained from a forward pass and train it to predict the less noisy images that came one step earlier. It’ll make mistakes at first, so you tweak the parameters of the network so it does better. Eventually, the neural network can reliably turn a noisy image, which is representative of a sample from the simple distribution, all the way into an image representative of a sample from the complex distribution.
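
The training loop can be sketched in miniature. In the Python below, a two-parameter linear model stands in for the neural network, and single numbers stand in for images. Both are simplifying assumptions, as are the data distribution, the noise level and the learning rate. The loop shows the core idea only: predict the less noisy value from the noisier one, measure the mistake, and tweak the parameters to shrink it.

```python
import math
import random

rng = random.Random(2)
beta = 0.1                                   # assumed noise level for one step
keep, spread = math.sqrt(1.0 - beta), math.sqrt(beta)

def make_pair():
    """Return (noisier value, less noisy value from one step earlier)."""
    less_noisy = rng.gauss(2.0, 0.5)         # hypothetical 1-D "image"
    noisier = keep * less_noisy + spread * rng.gauss(0.0, 1.0)
    return noisier, less_noisy

def mean_squared_error(w, b, pairs):
    return sum((w * x + b - y) ** 2 for x, y in pairs) / len(pairs)

pairs = [make_pair() for _ in range(2000)]
w, b = 0.0, 0.0                              # untrained parameters
learning_rate = 0.01
loss_before = mean_squared_error(w, b, pairs)

for _ in range(20):                          # passes over the training pairs
    for noisier, less_noisy in pairs:
        error = w * noisier + b - less_noisy # the model's mistake on this pair
        w -= learning_rate * error * noisier # tweak parameters to reduce it
        b -= learning_rate * error

loss_after = mean_squared_error(w, b, pairs)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

A real diffusion model trains one network across all noise levels at once and works on full images, but the predict-measure-adjust cycle is the same.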

The trained network is a full-blown generative model. Now you don’t even need an original image on which to do a forward pass: You have a full mathematical description of the simple distribution, so you can sample from it directly. The neural network can turn this sample — essentially just static — into a final image that resembles an image in the training data set.
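
Generation can be sketched the same way. The Python below assumes a one-dimensional data distribution simple enough that the ideal noise prediction can be written as a formula, so `predicted_noise` stands in for a trained network. The update rule follows the sampling procedure from the DDPM paper discussed later in this article, with an assumed constant noise schedule. The sketch starts from pure static and walks backward, step by step, to a sample that resembles the data.

```python
import math
import random
import statistics

DATA_MEAN, DATA_STD = 2.0, 0.5     # hypothetical 1-D "training set" distribution
BETA, STEPS = 0.02, 300            # assumed constant noise schedule
ALPHA = 1.0 - BETA

def predicted_noise(x, t):
    """Stand-in for the trained network: exact noise estimate for Gaussian data."""
    alpha_bar = ALPHA ** t
    mean_t = math.sqrt(alpha_bar) * DATA_MEAN
    var_t = alpha_bar * DATA_STD ** 2 + (1.0 - alpha_bar)
    return math.sqrt(1.0 - alpha_bar) * (x - mean_t) / var_t

def generate(rng):
    """Turn one sample of static into one sample resembling the data."""
    x = rng.gauss(0.0, 1.0)                       # sample the simple distribution
    for t in range(STEPS, 0, -1):
        alpha_bar = ALPHA ** t
        noise = predicted_noise(x, t)
        x = (x - BETA / math.sqrt(1.0 - alpha_bar) * noise) / math.sqrt(ALPHA)
        if t > 1:                                 # no fresh noise on the last step
            x += math.sqrt(BETA) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(3)
samples = [generate(rng) for _ in range(2000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
print(f"generated mean: {sample_mean:.2f}, generated std: {sample_std:.2f}")
```

No training example is touched during generation; the only inputs are random numbers and the (here, hand-written) noise predictor.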

Sohl-Dickstein recalls the first outputs of his diffusion model. “You’d squint and be like, ‘I think that colored blob looks like a truck,’” he said. “I’d spent so many months of my life staring at different patterns of pixels and trying to see structure that I was like, ‘This is way more structured than I’d ever gotten before.’ I was very excited.”

**Envisioning the Future**

Sohl-Dickstein published his diffusion model algorithm in 2015, but it was still far behind what GANs could do. While diffusion models could sample over the entire distribution and never get stuck spitting out only a subset of images, the images looked worse, and the process was much too slow. “I don’t think at the time this was seen as exciting,” said Sohl-Dickstein.

It would take two students, neither of whom knew Sohl-Dickstein or each other, to connect the dots from this initial work to modern-day diffusion models like DALL·E 2. The first was Song, a doctoral student at Stanford at the time. In 2019, he and his adviser published a novel method for building generative models that didn’t estimate the probability distribution of the data (the high-dimensional surface). Instead, it estimated the gradient of the distribution (think of it as the slope of the high-dimensional surface).

Song found his technique worked best if he first perturbed each image in the training data set with increasing levels of noise, then asked his neural network to predict the original image using gradients of the distribution, effectively denoising it. Once trained, his neural network could take a noisy image sampled from a simple distribution and progressively turn that back into an image representative of the training data set. The image quality was great, but his machine learning model was painfully slow to sample. And he did this with no knowledge of Sohl-Dickstein’s work. “I was not aware of diffusion models at all,” said Song. “After our 2019 paper was published, I received an email from Jascha. He pointed out to me that [our models] have very strong connections.”
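
A rough sketch of this style of sampling follows. In the research literature the "gradient of the distribution" is usually the gradient of the logarithm of the probability density, known as the score. The Python below assumes a one-dimensional data distribution whose score can be written exactly, so `score` stands in for Song's neural network; the noise levels, step sizes and step counts are illustrative assumptions. The sampler starts from noise and repeatedly steps uphill along the score while adding a little randomness, working through noise levels from largest to smallest.

```python
import math
import random
import statistics

DATA_MEAN, DATA_STD = 2.0, 0.5     # hypothetical 1-D data distribution

def score(x, sigma):
    """Gradient of the log-density of the data after adding noise of scale sigma."""
    return -(x - DATA_MEAN) / (DATA_STD ** 2 + sigma ** 2)

def noise_levels(largest, smallest, count):
    """Geometrically spaced noise scales, from largest to smallest."""
    ratio = (smallest / largest) ** (1.0 / (count - 1))
    return [largest * ratio ** i for i in range(count)]

def annealed_langevin(rng, sigmas, steps_per_level=100, base_step=0.0005):
    x = rng.gauss(0.0, sigmas[0])                 # start from broad noise
    for sigma in sigmas:
        step = base_step * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            # Move along the score, plus a random kick.
            x += 0.5 * step * score(x, sigma) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(4)
sigmas = noise_levels(3.0, 0.05, 10)
samples = [annealed_langevin(rng, sigmas) for _ in range(1000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
print(f"sample mean: {sample_mean:.2f}, sample std: {sample_std:.2f}")
```

Each sample here takes 1,000 small updates, which hints at why sampling was slow.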

In 2020, the second student saw those connections and realized that Song’s work could improve Sohl-Dickstein’s diffusion models. Jonathan Ho had recently finished his doctoral work on generative modeling at the University of California, Berkeley, but he continued working on it. “I thought it was the most mathematically beautiful subdiscipline of machine learning,” he said.

Ho redesigned and updated Sohl-Dickstein’s diffusion model with some of Song’s ideas and other advances from the world of neural networks. “I knew that in order to get the community’s attention, I needed to make the model generate great-looking samples,” he said. “I was convinced that this was the most important thing I could do at the time.”

His intuition was spot on. Ho and his colleagues announced this new and improved diffusion model in 2020, in a paper titled “Denoising Diffusion Probabilistic Models.” It quickly became such a landmark that researchers now refer to it simply as DDPM. According to one benchmark of image quality — which compares the distribution of generated images to the distribution of training images — these models matched or surpassed all competing generative models, including GANs. It wasn’t long before the big players took notice. Now, DALL·E 2, Stable Diffusion, Imagen and other commercial models all use some variation of DDPM.
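
The article does not name the benchmark. One widely used score that fits the description is the Fréchet inception distance, and treating it as the one meant here is an assumption. The real score compares summary statistics of image features taken from a pretrained network; the Python sketch below keeps only the core idea, in one dimension, by comparing the mean and spread of two lists of numbers. Lower is better, and identical distributions score 0.

```python
import statistics

def frechet_distance_1d(generated, training):
    """Fréchet distance between bell curves fitted to two lists of numbers."""
    mean_gap = statistics.fmean(generated) - statistics.fmean(training)
    std_gap = statistics.pstdev(generated) - statistics.pstdev(training)
    return mean_gap ** 2 + std_gap ** 2

training_features = [0.0, 2.0]      # hypothetical one-number "features"
generated_features = [1.0, 3.0]     # same spread, shifted mean
print(frechet_distance_1d(generated_features, training_features))
```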

Modern diffusion models have one more key ingredient: large language models (LLMs), such as GPT-3. These are generative models trained on text from the internet to learn probability distributions over words instead of images. In 2021, Ho — now a research scientist at a stealth company — and his colleague Tim Salimans at Google Research, along with other teams elsewhere, showed how to combine information from an LLM and an image-generating diffusion model to use text (say, “goldfish slurping Coca-Cola on a beach”) to guide the process of diffusion and hence image generation. This process of “guided diffusion” is behind the success of text-to-image models, such as DALL·E 2.
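
One published form of guidance, from Ho and Salimans, is known as classifier-free guidance; whether it matches every system named above is an assumption. The model makes two noise predictions at each step, one that sees the text prompt and one that does not, and then exaggerates the difference between them. The Python sketch below shows only that combination rule. The two short lists are hypothetical stand-ins for a network's predictions, and `weight` is the guidance strength.

```python
def guided_noise(with_prompt, without_prompt, weight):
    """Push the prediction further in the direction the prompt suggests."""
    return [
        (1.0 + weight) * c - weight * u
        for c, u in zip(with_prompt, without_prompt)
    ]

with_prompt = [1.0, 2.0]       # hypothetical prediction given the text prompt
without_prompt = [1.0, 0.0]    # hypothetical prediction with no prompt

print(guided_noise(with_prompt, without_prompt, 0.0))   # no extra guidance
print(guided_noise(with_prompt, without_prompt, 1.0))   # stronger guidance
```

With a weight of 0 the result is just the prompt-aware prediction. Larger weights follow the prompt more closely, typically at some cost in variety.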

“They are way beyond my wildest expectations,” said Ho. “I’m not going to pretend I saw all this coming.”

**Generating Problems**

As successful as these models have been, images from DALL·E 2 and its ilk are still far from perfect. Large language models can reflect cultural and societal biases, such as racism and sexism, in the text they generate. That’s because they are trained on text taken off the internet, and often such texts contain racist and sexist language. LLMs that learn a probability distribution over such text become imbued with the same biases. Diffusion models are also trained on uncurated images taken off the internet, which can contain similarly biased data. It’s no wonder that combining LLMs with today’s diffusion models can sometimes result in images reflective of society’s ills.

Anandkumar has firsthand experience. When she tried to generate stylized avatars of herself using a diffusion model–based app, she was shocked. “So [many] of the images were highly sexualized,” she said, “whereas the things that it was presenting to men weren’t.” She’s not alone.

These biases can be lessened by curating and filtering the data (an extremely difficult task, given the immensity of the data set), or by putting checks on both the input prompts and the outputs of these models. “Of course, nothing is a substitute for carefully and extensively safety-testing” a model, Ho said. “This is an important challenge for the field.”

Despite such concerns, Anandkumar believes in the power of generative modeling. “I really like Richard Feynman’s quote: ‘What I cannot create, I do not understand,’” she said. An increased understanding has enabled her team to develop generative models to produce, for example, synthetic training data of under-represented classes for predictive tasks, such as darker skin tones for facial recognition, helping improve fairness. Generative models may also give us insights into how our brains deal with noisy inputs, or how they conjure up mental imagery and contemplate future action. And building more sophisticated models could endow AIs with similar capabilities.

“I think we are just at the beginning of the possibilities of what we can do with generative AI,” said Anandkumar.