Slow NSFW AI generation almost always comes down to four things: the wrong attention backend, too many steps at a slow sampler, oversized resolution, and a GPU that is offloading to system RAM. Enable xformers or SDPA, drop to DPM++ 2M Karras at 25 to 30 steps, generate at native resolution, and keep the model in VRAM. Those changes usually cut render time in half.
Waiting two or three minutes for a single 1024×1024 image kills your creative flow. The good news is that slow generation is one of the most fixable problems in the whole Stable Diffusion stack, because it is almost never caused by your prompt. It is caused by settings and hardware behavior. This guide walks through every lever that affects speed, in roughly the order that gives you the biggest wins first, with concrete settings tables and copy-paste flags.
Throughout, the example prompts stay tag-style and tasteful. Every subject is an adult (18+), fictional, and AI-generated, and every example carries baseline safety negatives. We are optimizing render speed here, not writing explicit content. If you want a place to test changes without a local install, you can run quick experiments on our generator and compare.
First, measure where the time goes
Before changing anything, look at your console. Stable Diffusion prints iterations per second (it/s) during sampling. That single number is your benchmark. Write down your current it/s at your usual settings, then re-test after each change. If you skip this step you will never know which fix actually helped.
A healthy modern GPU running SDXL at 1024×1024 should land in these rough ranges:
| GPU | Expected SDXL it/s | 30-step render |
|---|---|---|
| RTX 4090 | 8 to 12 it/s | about 3 to 4 sec |
| RTX 4070 Ti | 4 to 6 it/s | about 6 to 8 sec |
| RTX 3060 12GB | 2 to 3 it/s | about 12 to 16 sec |
| RTX 2060 6GB | 1 to 1.5 it/s | about 25 to 35 sec |
If you are far below these numbers, you have a configuration problem, not a hardware problem. That is the most common situation, and it is great news because configuration is free to fix.
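To make the before-and-after comparison concrete, here is a minimal sketch that turns an it/s reading into sampling time and reports the change between two measurements. The function names and the sample readings are illustrative assumptions, not anything the UI provides. If your console shows s/it instead of it/s (it switches when you drop below 1 it/s), take the reciprocal first.

```python
def sampling_seconds(steps: int, it_per_s: float) -> float:
    """Time spent in the sampling loop only (no model load, VAE decode, or saving)."""
    if it_per_s <= 0:
        raise ValueError("it_per_s must be positive")
    return steps / it_per_s


def speedup_percent(before_it_s: float, after_it_s: float) -> float:
    """Percent change in it/s between two measurements (positive means faster)."""
    return (after_it_s - before_it_s) / before_it_s * 100.0


# Made-up readings: 2.5 it/s before a change, 4.0 it/s after, both at 30 steps.
before = sampling_seconds(30, 2.5)
after = sampling_seconds(30, 4.0)
gain = speedup_percent(2.5, 4.0)
print(f"{before:.1f}s -> {after:.1f}s ({gain:+.0f}% it/s)")
```

Keep steps, resolution, and sampler identical between the two readings, otherwise the comparison tells you nothing.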

Fix 1: enable a fast attention backend (the biggest single win)
The attention mechanism is the most compute-heavy part of the model. By default some setups fall back to a slow naive implementation. Switching to an optimized backend often delivers a 30 to 60 percent speedup with zero quality loss.
Automatic1111 and Forge: add a launch flag. Forge ships with this on by default, but verify it.
rem webui-user.bat (Windows)
set COMMANDLINE_ARGS=--xformers
rem Modern PyTorch alternative, no extra install needed (use instead of --xformers):
set COMMANDLINE_ARGS=--opt-sdp-attention
# webui-user.sh (Linux): same flags, shell syntax
export COMMANDLINE_ARGS="--xformers"
Use --xformers if you have it installed cleanly. If xformers refuses to build or you are on a newer PyTorch, --opt-sdp-attention (scaled dot product attention) is built into PyTorch 2.x and is nearly as fast. Never run both at once.
ComfyUI: SDPA is the default and the defaults are already good. If you are on ComfyUI and slow, the problem is usually elsewhere. Our ComfyUI for NSFW AI guide covers the node-graph specifics, and the Forge setup guide covers the Forge side.
After enabling, re-check your it/s. For most people, this one change is what fixes slow generation.
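If you are unsure which flag applies to your install, the choice above can be sketched as a small decision function. This is one reading of the advice (prefer xformers when it is installed, otherwise SDP attention on PyTorch 2.x), and both function names are hypothetical helpers, not part of any web UI. Run the detection inside the same Python environment the web UI uses, or it will report on the wrong install.

```python
import importlib.util


def pick_attention_flag(xformers_installed: bool, torch_major: int) -> str:
    """Return one launch flag, never both, following the rule described above."""
    if xformers_installed:
        return "--xformers"
    if torch_major >= 2:
        return "--opt-sdp-attention"
    # Neither option is available; upgrading PyTorch is the likely next step.
    return ""


def detect_environment() -> tuple[bool, int]:
    """Best-effort check of what is installed in the current Python environment."""
    has_xformers = importlib.util.find_spec("xformers") is not None
    torch_major = 0
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily because it is slow to load

        torch_major = int(torch.__version__.split(".")[0])
    return has_xformers, torch_major


# Usage (uncomment to run inside the web UI's environment):
# print(pick_attention_flag(*detect_environment()))
```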
Fix 2: pick a fast sampler and sane step count
Samplers differ wildly in speed. Some need 50 steps to converge, others look great at 20. The combination of sampler and steps is your second-biggest lever.
| Sampler | Good step range | Speed | Notes |
|---|---|---|---|
| DPM++ 2M Karras | 25 to 30 | Fast | Best all-round default |
| Euler a | 20 to 30 | Fast | Creative, slightly soft |
| DPM++ SDE Karras | 25 to 35 | Slower | High detail, about 2x cost |
| DPM++ 2M SDE | 25 to 30 | Medium | Good detail balance |
| LCM (with LCM LoRA) | 4 to 8 | Very fast | Needs LCM model or LoRA |
The practical default for almost every realistic or anime NSFW checkpoint is DPM++ 2M Karras at 25 to 30 steps. Going above 35 steps with any sampler buys you almost nothing visible while costing real time.
If you want a dramatic speedup and can accept a small quality trade, an LCM LoRA lets you render usable images in 4 to 8 steps. It is excellent for fast iteration, then you switch back to DPM++ for the final render.
Fast-iteration prompt (tag style, adult fictional AI subject):
(masterpiece, best quality), 1woman, adult, 25 years old, confident pose,
studio lighting, detailed face, sharp focus
Negative: child, minor, underage, loli, shota, lowres, bad anatomy,
bad hands, watermark, deformed
Sampler: DPM++ 2M Karras | Steps: 26 | CFG: 5.5 | Size: 1024x1024
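To see why sampler and step count matter together, here is a rough estimate of sampling time per image. The 2.0 cost factor for DPM++ SDE Karras comes from the "about 2x cost" note in the table; the other factors and the 2.5 it/s baseline are placeholder assumptions, so treat the output as arithmetic, not a benchmark.

```python
# Relative per-step cost compared with DPM++ 2M Karras.
# Only the 2.0 value is taken from the table above; the 1.0 values are assumptions.
RELATIVE_STEP_COST = {
    "DPM++ 2M Karras": 1.0,
    "Euler a": 1.0,
    "DPM++ SDE Karras": 2.0,
}


def estimated_seconds(sampler: str, steps: int, base_it_per_s: float) -> float:
    """Estimate sampling time; base_it_per_s is measured with a 1.0-cost sampler."""
    return steps * RELATIVE_STEP_COST[sampler] / base_it_per_s


for sampler, steps in [
    ("DPM++ 2M Karras", 26),
    ("DPM++ SDE Karras", 30),
    ("DPM++ 2M Karras", 50),
]:
    seconds = estimated_seconds(sampler, steps, base_it_per_s=2.5)
    print(f"{sampler:18s} {steps:2d} steps: {seconds:5.1f}s")
```

With these assumed numbers, 50 steps of the default sampler costs almost twice as much as 26 steps, which is the time you save by staying in the recommended range.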
Fix 3: generate at native resolution, upscale later
Diffusion compute scales with the number of pixels, and it scales badly. Doubling both width and height quadruples the pixel count, and the attention layers grow faster than that because their cost rises with the square of the token count. Generating directly at 1536×1536 on SDXL is brutally slow and also tends to produce duplicated bodies because you are far from the model's native size.
The fix is to generate at the model's native resolution, then upscale. SDXL and Pony are trained around 1024×1024 (roughly one megapixel). SD 1.5 models are trained at 512×512.
| Model family | Native size | Slow mistake |
|---|---|---|
| SD 1.5 | 512×512 | Generating at 1024 direct |
| SDXL / Pony / Illustrious | 1024×1024 | Generating at 1536 direct |
| Flux | 1024×1024 | Generating at 1536 direct |
Generate at native size, then use Hires Fix or a dedicated upscaler for the final pass. Our AI upscaler guide covers the fast options. This is faster AND higher quality than brute-forcing a large native render.
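The scaling argument can be worked through with two ratios. This sketch assumes the work in convolution layers grows linearly with pixel count and the work in self-attention grows with the square of it; real models mix both, so the true slowdown should land somewhere between the two numbers.

```python
def pixel_ratio(width: int, height: int, native: int = 1024) -> float:
    """How many times more pixels than a native square render."""
    return (width * height) / (native * native)


def attention_ratio(width: int, height: int, native: int = 1024) -> float:
    """Upper-bound estimate: attention cost grows with the square of the token count."""
    return pixel_ratio(width, height, native) ** 2


for side in (1024, 1536, 2048):
    print(
        f"{side}x{side}: {pixel_ratio(side, side):.2f}x pixels, "
        f"up to {attention_ratio(side, side):.2f}x attention work"
    )
```

For 1536×1536 that is 2.25x the pixels and up to about 5x the attention work, before any VRAM spill makes it worse.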
Fix 4: batch settings done right
There are two batch controls and they behave very differently for speed:
- Batch size processes multiple images in parallel in a single pass. Each batch takes longer, but you get more images per minute, and VRAM use rises with every extra image. If you have headroom, raising batch size from 1 to 2 or 4 improves images per minute.
- Batch count runs the generation sequentially N times. It does not speed anything up per image, it just queues more jobs.
# More images per minute IF you have spare VRAM:
Batch size: 2 to 4 (parallel, uses more VRAM)
Batch count: 1 (sequential, no speed benefit)
The trap: pushing batch size too high triggers a CUDA out of memory error or, worse, silent offloading to system RAM that tanks your speed. If raising batch size makes things slower instead of faster, you have crossed your VRAM limit. See our CUDA out of memory fix for the memory side of this.
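A quick way to find your limit is to time one batch at each size and compare images per minute. The sketch below uses made-up timings in which batch size 8 has crossed the VRAM limit; the function names are illustrative.

```python
def images_per_minute(batch_size: int, seconds_per_batch: float) -> float:
    """Throughput for one batch size, given how long a whole batch takes."""
    return batch_size * 60.0 / seconds_per_batch


def best_batch_size(timings: dict[int, float]) -> int:
    """Pick the batch size with the highest measured throughput."""
    return max(timings, key=lambda b: images_per_minute(b, timings[b]))


# Hypothetical measurements: batch size -> seconds for the whole batch.
timings = {1: 12.0, 2: 20.0, 4: 36.0, 8: 140.0}

for size, seconds in timings.items():
    print(f"batch {size}: {images_per_minute(size, seconds):.1f} images/min")
print("best batch size:", best_batch_size(timings))
```

In this example throughput climbs up to batch size 4, then collapses at 8, which is the signature of offloading described above.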
Fix 5: token merging (ToMe) for a free speed bump
Token merging speeds up sampling by merging redundant tokens in the image, with a small and often invisible quality cost. In Automatic1111 it lives in Settings under Optimizations as Token merging ratio.
Token merging ratio: 0.3 (good speed/quality balance)
Token merging ratio: 0.5 (faster, slight detail softening)
A ratio of 0.2 to 0.3 typically gives a 15 to 30 percent speedup that you will struggle to see in the output. Push to 0.5 only when speed matters more than fine detail. Set it back to 0 for your final hero renders if you are picky.
Fix 6: stop the model from offloading
The single most common cause of mysteriously slow generation is the model spilling out of VRAM into system RAM. When that happens, every step has to shuttle data across the PCIe bus and your it/s collapses.
Signs of offloading:
- Render time that suddenly jumps when you add a LoRA or raise resolution.
- Steady GPU usage but very low it/s.
- Task Manager showing shared GPU memory climbing.
Fixes, in order:
- Use the right medvram or lowvram flag for your card, but no lower than you need. --medvram on Automatic1111 is fine for 8GB cards, --lowvram only for 4GB and below. Using a lower mode than necessary slows you down.
- Pick checkpoints sized for your GPU. Our best low VRAM NSFW checkpoints list keeps you in VRAM.
- Close other GPU apps, especially browsers with hardware acceleration and video players.
rem webui-user.bat for an 8GB card
set COMMANDLINE_ARGS=--xformers --medvram --no-half-vae
The --no-half-vae flag also prevents the black-image VAE bug some cards hit, covered in our black image fix.
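The flag rule above can be written down as a small helper. The thresholds follow the text (--lowvram for 4GB and below, --medvram up to 8GB); treating 6GB cards as --medvram and leaving larger cards without a flag are assumptions, and the helper names are hypothetical. The parser expects the output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which prints total memory in MiB, one line per GPU.

```python
def parse_nvidia_smi_mib(output: str) -> float:
    """Convert the first GPU's total memory from MiB (nvidia-smi output) to GiB."""
    first_gpu = output.strip().splitlines()[0]
    return int(first_gpu.strip()) / 1024


def vram_flag(vram_gb: float) -> str:
    """Pick the lightest memory mode that matches the card, per the rule above."""
    if vram_gb <= 4:
        return "--lowvram"
    if vram_gb <= 8:
        return "--medvram"
    return ""  # assumption: larger cards need no memory flag


def launch_args(attention_flag: str, vram_gb: float) -> str:
    """Join the flags into one COMMANDLINE_ARGS value, skipping empty ones."""
    flags = [attention_flag, vram_flag(vram_gb), "--no-half-vae"]
    return " ".join(flag for flag in flags if flag)


print(launch_args("--xformers", parse_nvidia_smi_mib("8192\n")))
```

For an 8GB card this reproduces the launch line shown above.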
If you want to skip the local tuning entirely while you plan an upgrade, our hosted generator runs on cloud GPUs so you can keep creating.

Fix 7: hardware and drivers
Software tuning has limits. If you have done everything above and you are still slow, hardware is the bottleneck.
- Update GPU drivers. Use an NVIDIA Studio or Game Ready driver from the last few months. Old drivers leave performance on the table.
- Avoid the shared-memory fallback. Newer NVIDIA drivers on Windows let CUDA spill to system RAM instead of erroring. That keeps you running but at a crawl. In the NVIDIA Control Panel, under Manage 3D Settings, you can set CUDA - Sysmem Fallback Policy to Prefer No Sysmem Fallback so you get a clean error instead of silent slowdown.
- AMD users have a different path entirely. Our AMD GPU guide covers ROCm and DirectML, which have their own speed characteristics.
- VRAM is king. For local NSFW work, more VRAM beats raw clock speed. Our GPU hardware requirements guide explains the tiers.
When cloud GPU is worth it
Sometimes the honest answer is that your hardware is not built for this. Renting a cloud GPU makes sense when:
- Your local card is 6GB or below and you are constantly fighting offloading.
- You need to train a LoRA, which is far more demanding than inference.
- You are doing big batch jobs or large upscales and your render times are measured in minutes.
| Situation | Local makes sense | Cloud makes sense |
|---|---|---|
| Occasional single images | Yes | No |
| 8GB+ card, SDXL | Yes | Optional |
| 4 to 6GB card | Painful | Yes |
| LoRA training | Only on 12GB+ | Yes |
| Large batch production | If 16GB+ | Yes |
Cloud L40S, A100, or rented 4090 instances eliminate offloading and VRAM ceilings. The cost is hourly, so the math favors cloud for bursts of heavy work and local for steady light work. Our hosted generator is the zero-setup version of that idea.
A clean fast baseline to copy
If you want one configuration that is fast and good on a mid-range GPU, start here and only deviate with a reason:
Launch: --xformers --medvram --no-half-vae
Model: an SDXL or Pony checkpoint sized to your VRAM
Sampler: DPM++ 2M Karras
Steps: 28
CFG: 5.5
Size: 1024x1024 (native), upscale after
Token merging: 0.3
Batch size: as high as VRAM allows without offloading
Prompt (tag style, adult fictional AI subject):
(masterpiece, best quality), 1woman, adult, 27 years old, elegant pose,
soft window light, detailed skin, sharp focus
Negative: child, minor, underage, loli, shota, lowres, blurry,
bad anatomy, bad hands, extra limbs, watermark, jpeg artifacts
Work the fixes in order, re-measuring it/s after each. Attention backend and sampler choice give you the bulk of the win, resolution discipline removes the worst slowdowns, and offloading control is what separates a smooth setup from a frustrating one.
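If you drive Automatic1111 through its HTTP API rather than the browser, the baseline above could be expressed as a request body like this. It is a sketch under several assumptions: the web UI was started with --api, it listens on the default local address, and your version accepts the combined sampler name (newer releases split the sampler and the scheduler into separate fields). The token merging override key is also an assumption about the settings name, so verify it against your install.

```python
import json
import urllib.request

# Assumed default local address of a web UI launched with --api.
API_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"


def baseline_payload(prompt: str, negative_prompt: str, batch_size: int = 1) -> dict:
    """Build a txt2img request body that mirrors the baseline settings above."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "sampler_name": "DPM++ 2M Karras",  # may need splitting on newer versions
        "steps": 28,
        "cfg_scale": 5.5,
        "width": 1024,
        "height": 1024,
        "batch_size": batch_size,  # parallel images, costs VRAM
        "n_iter": 1,               # batch count, sequential
        "override_settings": {"token_merging_ratio": 0.3},  # assumed settings key
    }


def send(payload: dict, url: str = API_URL) -> dict:
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = baseline_payload("your prompt here", "your negative prompt here")
print(json.dumps(payload, indent=2))
```

Nothing is sent until you call send(payload), so you can inspect the body first.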
Fix 8: cut wasted re-rolls
Raw render speed is only half of throughput. The other half is how many images you throw away. If you generate ten and keep one, your effective speed is ten times worse than your it/s suggests. A lot of slow workflows are really high-waste workflows in disguise. The fastest thing you can do is generate fewer bad images.
The biggest sources of wasted renders:
- Bad prompts that the model ignores. If half your output does not match the prompt, you re-roll constantly. Fixing adherence directly improves throughput. Our prompt ignoring fix covers this in depth.
- Anatomy and hand failures. Deformed hands are the classic re-roll trigger. Adding strong hand negatives and using ADetailer once, rather than re-rolling ten times, is far faster. See our fix hands guide and ADetailer faces guide.
- Color blowouts. A too-high CFG burns color and forces re-rolls. Keeping CFG in the 4 to 7 band saves renders, as covered in our oversaturated color fix.
# A waste-cutting workflow:
1. Lock a known-good prompt + negative with strong safety tokens.
2. Render a batch of 4 at native resolution, fast sampler.
3. Pick the best 1 or 2 composition winners.
4. Re-render ONLY those at higher steps + hires fix.
5. ADetailer pass for faces/hands instead of re-rolling.
This staged approach means you spend your expensive compute only on images that are already good, instead of paying full price for throwaways. It is the single biggest real-world speedup most people never think of, because they only watch it/s and ignore yield.
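The yield argument is easy to check with arithmetic. The sketch below compares the cost per kept image for a brute-force workflow and for the staged workflow above; every timing in it is a made-up example, and the function names are illustrative.

```python
def seconds_per_keeper(seconds_per_image: float, keep_rate: float) -> float:
    """Effective cost of one kept image when only a fraction of renders survive."""
    if not 0 < keep_rate <= 1:
        raise ValueError("keep_rate must be in (0, 1]")
    return seconds_per_image / keep_rate


def staged_seconds_per_keeper(
    draft_seconds: float, drafts: int, final_seconds: float, finals: int
) -> float:
    """Cost per kept image when cheap drafts are rendered first, then only winners."""
    return (draft_seconds * drafts + final_seconds * finals) / finals


# Brute force: every render at full quality (25 s), keep 2 of 4.
brute = seconds_per_keeper(25.0, keep_rate=2 / 4)
# Staged: 4 fast drafts (6 s each), then re-render the 2 winners at full quality.
staged = staged_seconds_per_keeper(6.0, drafts=4, final_seconds=25.0, finals=2)
print(f"brute force: {brute:.0f}s per keeper, staged: {staged:.0f}s per keeper")
```

The gap widens as your keep rate falls, which is why fixing prompt adherence and anatomy failures pays off twice.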

Fix 9: keep your install lean
A bloated install slows startup and sometimes slows generation. Over time people accumulate dozens of extensions, hundreds of LoRAs, and multiple multi-gigabyte checkpoints, all of which the UI has to scan and index.
- Disable extensions you do not use. Some extensions hook into the generation loop and add overhead on every render even when idle. Turn off the ones you are not actively using.
- Prune your model folder. A huge models directory slows the dropdown and the initial scan. Archive checkpoints you rarely touch.
- Watch VAE and embedding loads. Loading a fresh VAE or many textual inversions each run adds time. Keep your active set small.
- Restart periodically. Long-running sessions can fragment VRAM and leak memory through some extensions, slowly degrading speed. A restart clears it.
These are small individually but they add up, especially on modest hardware where every second of overhead is felt. A lean, well-maintained install simply runs faster than a cluttered one.
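To decide what to archive, it helps to see which files take the most space. This is a minimal sketch that lists the largest model files under a folder; the folder path in the usage comment and the set of file extensions are assumptions, so adjust both to your install.

```python
from pathlib import Path

# Extensions treated as model files (an assumption; extend as needed).
MODEL_SUFFIXES = {".safetensors", ".ckpt", ".pt"}


def largest_models(models_dir: str, top_n: int = 10) -> list[tuple[str, float]]:
    """Return (relative path, size in GiB) for the biggest model files, largest first."""
    root = Path(models_dir)
    files = [
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in MODEL_SUFFIXES
    ]
    files.sort(key=lambda path: path.stat().st_size, reverse=True)
    return [
        (str(path.relative_to(root)), path.stat().st_size / 1024**3)
        for path in files[:top_n]
    ]


# Usage (hypothetical path):
# for name, size_gib in largest_models("stable-diffusion-webui/models"):
#     print(f"{size_gib:6.2f} GiB  {name}")
```

The script only reports; moving or deleting files stays a manual decision.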
Fix 10: profile a single slow render
If you are still stuck, isolate the bottleneck with a controlled test. Strip everything back and add one variable at a time.
Profiling sequence (note it/s at each step):
1. Base model only, no LoRA, 512 or 1024 native, 20 steps, DPM++ 2M Karras.
2. Add your usual LoRAs. Did it/s drop? -> offloading or LoRA overhead.
3. Raise to your usual resolution. Big drop? -> resolution/VRAM.
4. Enable hires fix. Slow there? -> upscaler/denoise cost.
5. Add ADetailer/ControlNet. Slow there? -> extension cost.
Whatever step causes the it/s to fall off a cliff is your real bottleneck. This beats guessing, and it usually points straight at offloading, an oversized resolution, or a heavy extension. Once you know the exact culprit, the right fix from the sections above is obvious.
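If you log the it/s from each stage of the profiling sequence, finding the cliff can be automated. The sketch below picks the stage with the largest relative drop from the stage before it; the readings are invented for illustration. Some drop when you raise resolution is expected, so judge that stage against the pixel increase rather than in isolation.

```python
def biggest_drop(readings: list[tuple[str, float]]) -> tuple[str, float]:
    """Return (stage label, fractional it/s drop) for the worst step-to-step fall."""
    worst_label, worst_drop = readings[0][0], 0.0
    for (_, previous), (label, current) in zip(readings, readings[1:]):
        drop = (previous - current) / previous
        if drop > worst_drop:
            worst_label, worst_drop = label, drop
    return worst_label, worst_drop


# Hypothetical log, in the order of the profiling sequence above.
readings = [
    ("base model only", 3.0),
    ("+ usual LoRAs", 2.8),
    ("+ usual resolution", 0.4),
    ("+ hires fix", 0.35),
]

label, drop = biggest_drop(readings)
print(f"largest drop at '{label}': {drop:.0%} slower than the previous stage")
```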
Once your render loop is fast and your yield is high, the rest of NSFW image work, from prompt structure to anatomy cleanup, gets a lot more enjoyable. If you hit anatomy problems while iterating quickly, our troubleshooting hub links every fix in this cluster, the best checkpoints list helps you pick a fast, quality base model, and the prompt formula guide helps you write prompts that hit on the first try so you re-roll less.
Frequently asked questions
Why is my Stable Diffusion suddenly slow when it used to be fast?
The most common cause is VRAM offloading. Something pushed the model out of GPU memory into system RAM, usually a new LoRA, a higher resolution, or another app grabbing VRAM. Every step now crosses the slow PCIe bus. Close background GPU apps, lower resolution to native, and check shared GPU memory in Task Manager to confirm the spill.
Is xformers or SDP attention faster?
They are very close. A clean xformers install is usually a touch faster on older cards, while opt-sdp-attention is built into PyTorch 2.x, needs no extra install, and avoids the xformers build headaches. If xformers will not compile, switch to sdp attention and you lose almost nothing. Never enable both at the same time.
How many steps do I really need?
For DPM++ 2M Karras, 25 to 30 steps is plenty for most NSFW checkpoints. Beyond 35 steps the image barely changes while render time keeps climbing. If you use an LCM LoRA you can drop to 4 to 8 steps for fast iteration, then switch back to a normal sampler for the final high-quality render.
Does a higher CFG slow generation down?
CFG itself adds little render cost, but very high CFG forces you toward more steps and can burn colors, which wastes time on re-rolls. Keeping CFG in the 4 to 7 range gives clean output and lets you stay at a low step count. So indirectly, a sane CFG keeps your whole pipeline faster and cuts wasted renders.
Will token merging hurt my image quality?
At a ratio of 0.2 to 0.3 the quality loss is usually invisible while you gain real speed. At 0.5 you may notice slight softening in fine detail like hair and skin texture. Use a low ratio for everyday work and set it to 0 for final hero renders where every detail matters. It is a safe, reversible setting.
Is cloud GPU worth it for NSFW generation?
It depends on your hardware and workload. If your card is 6GB or less, or you want to train LoRAs or run heavy batches, cloud GPUs remove the VRAM ceiling and offloading slowdowns. For occasional single images on an 8GB or larger card, local is cheaper. Cloud shines for bursts of heavy work, local for steady light work.
Why does adding a LoRA make everything slower?
LoRAs add weight to the model in VRAM. If you were already near your limit, the LoRA tips you over and the model starts offloading to system RAM, which crushes your speed. Use fewer or lighter LoRAs, a smaller base checkpoint, or a medvram flag. Watch your it/s before and after adding a LoRA to confirm the cause.
Do drivers actually affect Stable Diffusion speed?
Yes, noticeably. Recent NVIDIA drivers improve CUDA performance, but on Windows they also introduced a system memory fallback that silently slows you down instead of erroring when VRAM runs out. Update to a current driver, then in the NVIDIA Control Panel set CUDA - Sysmem Fallback Policy to Prefer No Sysmem Fallback so you get a clean out-of-memory error you can act on rather than a quiet crawl.



