Self-Guidance: Boosting Flow and Diffusion Generation on Their Own

Bibliographic details

Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - PP(2025), 18 Sept.
Main author: Li, Tiancheng (Author)
Other authors: Luo, Weijian, Chen, Zhiyang, Ma, Liyuan, Qi, Guo-Jun
Format: Online article
Language: English
Published: 2025
Collection: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000naa a22002652c 4500
001 NLM392755823
003 DE-627
005 20250920232342.0
007 cr uuu---uuuuu
008 250920s2025 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2025.3611831  |2 doi 
028 5 2 |a pubmed25n1574.xml 
035 |a (DE-627)NLM392755823 
035 |a (NLM)40966151 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Li, Tiancheng  |e verfasserin  |4 aut 
245 1 0 |a Self-Guidance  |b Boosting Flow and Diffusion Generation on Their Own 
264 1 |c 2025 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 18.09.2025 
500 |a published: Print-Electronic 
500 |a Citation Status Publisher 
520 |a Proper guidance strategies are essential for achieving high-quality generation results without retraining diffusion and flow-based text-to-image models. Existing guidance methods either require guidance-specific training or strong inductive biases about the diffusion model's network, which potentially limits their ability and application scope. Motivated by the observation that artifact outliers can be detected by a significant decline in density from a noisier to a cleaner noise level, we propose Self-Guidance (SG), which significantly improves the quality of generated images by suppressing the generation of low-quality samples. The biggest difference from existing guidance methods is that SG relies only on the sampling score function of the original diffusion or flow model at different noise levels, with no need for any tricky and expensive guidance-specific training. This makes SG highly flexible and usable in a plug-and-play manner with any diffusion or flow model. We also introduce an efficient variant of SG, named SG-prev, which reuses the output from the immediately previous diffusion step to avoid additional forward passes of the diffusion network. We conduct extensive experiments on text-to-image and text-to-video generation with different architectures, including UNet and transformer models. With open-sourced diffusion models such as Stable Diffusion 3.5 and FLUX, SG exceeds existing algorithms on multiple metrics, including both FID and Human Preference Score. SG-prev also achieves strong results over both the baseline and SG, with 50 percent more efficiency. Moreover, we find that both SG and SG-prev have a surprisingly positive effect on the generation of physiologically correct human body structures such as hands, faces, and arms, showing their ability to eliminate human body artifacts with minimal effort. We have released our code at https://github.com/maple-research-lab/Self-Guidance 
650 4 |a Journal Article 
700 1 |a Luo, Weijian  |e verfasserin  |4 aut 
700 1 |a Chen, Zhiyang  |e verfasserin  |4 aut 
700 1 |a Ma, Liyuan  |e verfasserin  |4 aut 
700 1 |a Qi, Guo-Jun  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g PP(2025) vom: 18. Sept.  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnas 
773 1 8 |g volume:PP  |g year:2025  |g day:18  |g month:09 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2025.3611831  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d PP  |j 2025  |b 18  |c 09
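
Note: the abstract in field 520 above describes the mechanism only at a high level, namely that Self-Guidance combines the pretrained model's own score estimates at different noise levels to suppress samples whose density drops sharply from a noisier to a cleaner level. The toy sampler below is a minimal illustrative sketch of that general idea, not the paper's formulation; the combination rule, the guidance weight w, the factor 1.5 for the noisier level, and the score_model stand-in are all assumptions made purely for illustration.

# Illustrative sketch only (assumptions noted above): guide an Euler sampler with
# a blend of the model's score at the current noise level and at a noisier level.
import numpy as np

def score_model(x, sigma):
    """Stand-in for a pretrained score/flow network: exact score of N(0, 1 + sigma^2),
    i.e. the marginal of unit-Gaussian data corrupted with noise of scale sigma."""
    return -x / (1.0 + sigma**2)

def self_guided_score(x, sigma, sigma_noisier, w=0.5):
    """Assumed combination rule: amplify the part of the score that changes when
    moving from the noisier to the cleaner level (w and the rule are illustrative)."""
    s_clean = score_model(x, sigma)
    s_noisy = score_model(x, sigma_noisier)
    return s_clean + w * (s_clean - s_noisy)

def sample(n_steps=50, dim=2, seed=0):
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(10.0, 0.01, n_steps + 1)   # simple linear noise schedule
    x = rng.normal(scale=sigmas[0], size=dim)       # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Probability-flow ODE direction dx/dsigma = -sigma * score, Euler step.
        d = -sigma * self_guided_score(x, sigma, sigma_noisier=1.5 * sigma)
        x = x + (sigma_next - sigma) * d
    return x

if __name__ == "__main__":
    print(sample())

In a real setting, score_model would be the pretrained diffusion or flow network itself, and the more efficient SG-prev variant mentioned in the abstract would reuse the output of the immediately previous step rather than making the extra evaluation at the noisier level.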