Text-to-image synthesis is rapidly gaining popularity. Diffusion probabilistic models, a class of latent variable models, have proven to be state of the art for this task. Several such models have been proposed recently, including DALL-E 2, Imagen, and Stable Diffusion, which are surprisingly good at generating highly realistic images from a text prompt. But how do they do it? Why are some models better than others in terms of image quality, speed, and reliability? How can these models be improved further? These are the questions the authors of the paper discussed here try to answer.
Diffusion probabilistic models (DPMs) have achieved impressive results in high-resolution image synthesis. A technique called guided sampling is used to improve the sample quality of DPMs. The Denoising Diffusion Implicit Model (DDIM) is a commonly used fast sampler for guided sampling.
DDIM is a first-order diffusion ordinary differential equation (ODE) solver that requires approximately 100-250 steps to generate high-quality samples. Some higher-order solvers sample faster without guidance, but they become unstable under guidance and may even be slower than DDIM at large guidance scales.
So now the question arises: why is guidance so important?
The answer is simple: guidance improves the quality of model-generated samples by imposing a condition, for example by aligning a generated image more closely with the text prompt. This comes at the cost of some diversity in the generated samples, but the guidance scale can be adjusted to strike a good compromise between diversity and fidelity.
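To make the role of the guidance scale concrete, here is a minimal sketch of one common form of guidance, classifier-free guidance, in which the guided noise prediction extrapolates from the unconditional prediction toward the text-conditioned one. The article does not say which guidance method is meant, so the choice of classifier-free guidance, the function name `guided_noise`, and the toy numbers are all assumptions made for illustration.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance (assumed form):
    eps_uncond + scale * (eps_cond - eps_uncond), element by element."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (hypothetical values, not taken from any real model).
eps_uncond = [0.10, -0.20, 0.05]   # prediction without the text prompt
eps_cond = [0.30, -0.10, 0.00]     # prediction with the text prompt

for scale in (1.0, 3.0, 8.0):
    print(scale, guided_noise(eps_uncond, eps_cond, scale))
```

With `scale = 1.0` the result is simply the conditional prediction. Larger scales move further along the difference between the two predictions, which strengthens the conditioning and reduces diversity.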
Let's start by understanding how DPMs work.
DPM sampling gradually removes noise from pure Gaussian random variables to obtain clean data. Sampling can be done by discretizing either diffusion SDEs or diffusion ODEs, and both can be parameterized in two ways: with a noise prediction model or with a data prediction model.
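The two parameterizations carry the same information. Under the usual forward relation x_t = alpha_t * x_0 + sigma_t * eps, a prediction of the noise can be converted into a prediction of the clean data and back. The sketch below illustrates that conversion. The symbols `alpha_t` and `sigma_t`, the function names, and the toy values are assumptions, since the article does not introduce any notation.

```python
def data_from_noise(x_t, eps_pred, alpha_t, sigma_t):
    """Turn a noise prediction into a data prediction: (x_t - sigma*eps) / alpha."""
    return [(x - sigma_t * e) / alpha_t for x, e in zip(x_t, eps_pred)]


def noise_from_data(x_t, x0_pred, alpha_t, sigma_t):
    """Turn a data prediction into a noise prediction: (x_t - alpha*x0) / sigma."""
    return [(x - alpha_t * d) / sigma_t for x, d in zip(x_t, x0_pred)]


# Toy example (hypothetical values): noise a clean sample, then recover both parts.
alpha_t, sigma_t = 0.8, 0.6
x0 = [1.0, -0.5]
eps = [0.2, -1.0]
x_t = [alpha_t * d + sigma_t * e for d, e in zip(x0, eps)]

x0_rec = data_from_noise(x_t, eps, alpha_t, sigma_t)
eps_rec = noise_from_data(x_t, x0, alpha_t, sigma_t)
print(x0_rec, eps_rec)
```

Given a perfect noise prediction, the recovered data matches the clean sample up to floating-point rounding, and vice versa.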
First-order ODE solvers such as DDIM typically take 100-250 steps to converge, while high-order diffusion ODE solvers can generate high-quality samples in just 10-20 steps when sampling without guidance. So why not always use high-order solvers for guided sampling as well?
We face two challenges when applying high-order solvers:
- A large guidance scale reduces the convergence radius of high-order solvers, making them unstable.
- The converged solution falls in a different range than the original data. This is also known as train-test mismatch.
The training data is bounded, but a large guidance scale can push the conditional noise prediction model away from the true noise, so that samples leave the range of the training data and look unrealistic (some examples are shown in Figure 1). A large guidance scale also amplifies both the model output and its derivatives, and the higher-order derivatives are the most sensitive to this amplification. Because the derivatives determine the convergence radius of ODE solvers, it is intuitive that high-order solvers may need much smaller step sizes to converge once the derivatives are amplified.
Now, how do the authors address these two challenges?
To address the first challenge, the authors proposed DPM-Solver++, a high-order, training-free diffusion ODE solver for faster guided sampling. DPM-Solver++ is designed for data prediction models, unlike previous high-order solvers, which were designed for noise prediction models. One reason for choosing the data prediction model is that thresholding methods can then be used to keep the bounded sample within the same range as the original data, which addresses the second challenge.
Two versions of DPM-Solver++ are offered: DPM-Solver++(2S), a single-step second-order solver, and DPM-Solver++(2M), a multi-step second-order solver. The latter deals with the instability of high-order solvers.
What is the difference between the two, and what makes a multi-step solver better than a single-step one? Suppose we can evaluate the data prediction model only N times. A single-step method of order k can then take only M = N/k steps. For the same N evaluations, a multi-step method can take M = N steps, because it reuses the previously computed values x̃_{t1} and x̃_{t2}, where t1 = t_{i-1} and t2 = t_{i-2}, to approximate the higher-order derivatives instead of discarding them as a single-step solver does. Information from previous steps is therefore not lost, which helps keep the high-order solver stable.
Based on the results, multi-step methods are slightly better overall than single-step methods. For large guidance scales, the multi-step DPM-Solver++(2M) performs better than DPM-Solver++(2S), while for slightly smaller guidance scales the single-step solver performs better than the multi-step one.
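The idea of reusing earlier model outputs can be sketched in a few lines. The code below follows the general shape of a second-order multi-step update on data predictions as described in the DPM-Solver++ paper, but it is only a sketch under stated assumptions, not the authors' reference implementation: the function name `dpm_solver_pp_2m`, the scalar state, the toy noise schedule, and the constant toy model are all hypothetical.

```python
import math


def dpm_solver_pp_2m(model, x, alphas, sigmas):
    """Second-order multi-step sampling loop in the style of DPM-Solver++(2M).

    model(x, i) returns a data prediction at grid index i. alphas and sigmas
    hold the noise schedule on the time grid, ordered from noisy to clean.
    """
    lambdas = [math.log(a / s) for a, s in zip(alphas, sigmas)]
    prev_pred = None   # data prediction cached from the previous step
    prev_h = None      # previous step size in lambda
    for i in range(1, len(alphas)):
        h = lambdas[i] - lambdas[i - 1]
        pred = model(x, i - 1)          # the only model call in this step
        if prev_pred is None:
            d = pred                    # first step falls back to first order
        else:
            r = prev_h / h
            # Combine the current and the cached prediction.
            d = (1 + 1 / (2 * r)) * pred - (1 / (2 * r)) * prev_pred
        # expm1(-h) equals exp(-h) - 1.
        x = (sigmas[i] / sigmas[i - 1]) * x - alphas[i] * math.expm1(-h) * d
        prev_pred, prev_h = pred, h
    return x


# Toy setup (hypothetical): the data distribution is a single point c, so the
# ideal data prediction is c regardless of x or the time index.
c, eps = 2.0, 0.5
sigmas = [0.9, 0.7, 0.5, 0.3, 0.1]
alphas = [math.sqrt(1.0 - s * s) for s in sigmas]
calls = []


def toy_model(x, i):
    calls.append(i)   # record every model evaluation
    return c


x_start = alphas[0] * c + sigmas[0] * eps
x_end = dpm_solver_pp_2m(toy_model, x_start, alphas, sigmas)
print(len(calls), x_end)
```

Each loop iteration calls the model exactly once, so the four steps on this grid cost four evaluations. A second-order single-step solver spends two evaluations per step, so the same budget would buy it only two steps, which is the M = N/k versus M = N trade-off described above.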
Compared with previous high-order samplers (DEIS, PNDM, and DPM-Solver), the authors found that DPM-Solver++ achieves the best convergence speed and stability for both large and small guidance scales.
DPM-Solver++ is able to converge in 15 function evaluations. It can be used with pixel-space DPMs as well as latent-space DPMs. For latent-space DPMs, however, the thresholding method is not used because the latents are not bounded.
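As an illustration of what thresholding a data prediction can look like, here is a minimal sketch of dynamic thresholding, one such method: choose a magnitude s from a high percentile of the predicted values, clip the prediction to [-s, s], and rescale by s so that it lands back in [-1, 1]. The article does not spell out which thresholding variant is used, so the method chosen here, the function name `dynamic_threshold`, the default percentile, and the assumption that pixel data lives in [-1, 1] are all illustrative.

```python
def dynamic_threshold(x0_pred, percentile=0.995):
    """Clip a data prediction to [-s, s] and rescale it into [-1, 1].

    s is a high percentile of the absolute values, but never below 1, so
    predictions that are already in range are left unchanged.
    """
    mags = sorted(abs(v) for v in x0_pred)
    idx = min(len(mags) - 1, int(percentile * len(mags)))
    s = max(mags[idx], 1.0)
    return [max(-s, min(s, v)) / s for v in x0_pred]


in_range = dynamic_threshold([0.5, -0.2, 0.9])
out_of_range = dynamic_threshold([3.0, -1.5, 0.5], percentile=0.5)
print(in_range, out_of_range)
```

This only makes sense when the data has a known bounded range, as pixel values do, which is consistent with skipping the step for unbounded latents.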
This article was written as a research summary by Marktechpost Staff based on the research paper 'DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models'. All credit for this research goes to the researchers on this project. Check out the preprint paper and code.