Fix CUDA Out of Memory Errors in NSFW AI (2026)

CUDA out of memory means your GPU ran short of VRAM for the resolution, batch size, and model you requested. Fix it with the --medvram or --lowvram flags, lower resolution, tiled VAE, batch size 1, xformers, and closing other GPU apps. On 6 GB and 8 GB cards these steps clear most errors. When hardware is the real limit, rent a cloud GPU. Adult, fictional, AI generated subjects only.

The error reads something like “torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate…” and it stops your generation cold. It is the single most common hard failure for people running NSFW models locally, especially on consumer cards. The cause is simple: the job you asked for needs more video memory than your GPU has free. The fixes are equally concrete. This guide gives you the full flag reference, a VRAM tier table, and the exact order to try things.

If your hardware simply cannot run a job, you can still generate adult fictional images right now in our browser tool, which runs on remote GPUs so your local VRAM is irrelevant.

What the error actually means

GPUs have a fixed amount of VRAM, separate from system RAM. SDXL, Pony, Illustrious, and Flux-class models load their weights into VRAM, then need more for the working latents, the VAE decode, and any ControlNet or LoRA. When the total exceeds free VRAM, CUDA throws the out of memory error. Hires fix and large batches multiply the working memory on top of the fixed weights, which is why a job that ran at 1024 by 1024 crashes the moment you enable a 2x upscale.

| What is using VRAM | Rough impact | Lever to reduce it |
| --- | --- | --- |
| Model weights | Large, fixed per model | --medvram, --lowvram |
| Resolution | Scales with pixel count | Lower width and height |
| Batch size | Multiplies per image | Set batch size to 1 |
| VAE decode | Spikes at the end | Tiled VAE |
| ControlNet and LoRA | Adds on top | Use fewer at once |
| Hires fix | Roughly doubles demand | Lower upscale or denoise |
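
To make the budget idea concrete, here is a minimal sketch of the arithmetic. Every figure in it is a made-up placeholder, not a measurement of any real model, and the function name is hypothetical. It only shows how the rows in the table add up against what your card has free.

```python
# Illustrative VRAM budget check. Every figure below is a made-up
# placeholder, not a measurement of any real model.

def vram_budget(free_gib, components):
    """Return (total needed, headroom) in GiB for a dict of component costs."""
    total = sum(components.values())
    return total, free_gib - total

# Hypothetical job on a card with 8 GiB free.
job = {
    "model_weights": 5.0,
    "latents_and_activations": 1.5,
    "vae_decode_spike": 1.0,
    "controlnet_and_loras": 1.0,
}

total, headroom = vram_budget(8.0, job)
fits = headroom >= 0
print(f"needs {total:.1f} GiB, headroom {headroom:+.1f} GiB, fits={fits}")
```

Removing any one entry from the hypothetical job, for example the add-ons, is enough to bring this particular total back under the limit, which is the logic behind every fix that follows.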

Fix 1: the medvram and lowvram flags

The fastest fix is a launch flag that changes how the model is held in memory. --medvram moves parts of the model between VRAM and system RAM as needed, cutting peak VRAM use at a small speed cost. --lowvram is more aggressive for very small cards, with a bigger speed penalty. Add the appropriate flag to your launch arguments.

rem Automatic1111 or Forge: edit webui-user.bat and keep ONE of the set lines active
rem 8 GB to 12 GB cards:
set COMMANDLINE_ARGS=--medvram --xformers
rem 4 GB to 6 GB cards (remove the leading "rem" to use this line instead):
rem set COMMANDLINE_ARGS=--lowvram --xformers
rem SDXL specific lighter option on mid cards:
rem set COMMANDLINE_ARGS=--medvram-sdxl --xformers

The --medvram-sdxl flag applies medvram behavior only to SDXL models, so SD 1.5 runs at full speed while SDXL gets the memory help. On many 8 GB to 12 GB cards this is the best balance.

Fix 2: lower your resolution

VRAM use scales with the number of pixels, so resolution is a powerful lever. A 1536 by 1536 image has 2.25 times the pixels of a 1024 by 1024 image and needs correspondingly more working memory. If you are crashing, drop to a native-friendly size like 832 by 1216 for portraits and generate the composition there, then enlarge with a modest hires upscale or a separate upscaler pass.

| Resolution | Relative VRAM | Notes |
| --- | --- | --- |
| 512 x 512 | Lowest | Too small for SDXL, anatomy suffers |
| 832 x 1216 | Moderate | Good native portrait |
| 1024 x 1024 | Moderate | Standard SDXL square |
| 1536 x 1536 | High | Often OOM on 8 GB |
| 2048 plus | Very high | Use tiled upscale instead |

Do not go below native by much, because tiny base images break anatomy. The right move is a sane base resolution plus a memory friendly upscale, not a tiny image.
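
If you want to see why the table ranks the sizes the way it does, the pixel arithmetic is easy to check yourself. This is a small sketch with a hypothetical helper name. It compares pixel counts only and says nothing about exact VRAM figures, since model weights stay fixed regardless of resolution.

```python
def relative_pixels(width, height, base=(1024, 1024)):
    """Pixel count of width x height relative to a base resolution."""
    return (width * height) / (base[0] * base[1])

sizes = [(512, 512), (832, 1216), (1024, 1024), (1536, 1536), (2048, 2048)]
for w, h in sizes:
    print(f"{w} x {h}: {relative_pixels(w, h):.2f}x the pixels of 1024 x 1024")
```

Note that 832 by 1216 lands almost exactly on the pixel count of 1024 by 1024, which is why both sit in the same moderate tier.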

Fix 3: enable tiled VAE

The VAE decode at the end of generation causes a sharp VRAM spike, and it is a frequent OOM point, especially at high resolution. Tiled VAE decodes the image in small tiles instead of all at once, slashing peak memory at that stage. In Forge and Automatic1111 it is often built in or available as an extension, and ComfyUI has tiled VAE decode nodes. If you crash right at the end of a run or only when upscaling, tiled VAE is usually the fix.
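
To show the idea behind tiling, here is a rough sketch that splits an image into overlapping tiles. It is not the code of any particular tiled VAE extension. Real implementations work on latents and blend the overlaps, and the tile size of 512 and overlap of 64 are assumptions picked for illustration. The point is that the decoder only ever has to hold one tile at a time.

```python
def tile_boxes(width, height, tile=512, overlap=64):
    """Split an image into overlapping (left, top, right, bottom) boxes."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    stride = tile - overlap
    boxes = []
    top = 0
    while True:
        bottom = min(top + tile, height)
        left = 0
        while True:
            right = min(left + tile, width)
            boxes.append((left, top, right, bottom))
            if right >= width:
                break
            left += stride
        if bottom >= height:
            break
        top += stride
    return boxes

# Hypothetical 2x upscale of an 832 x 1216 portrait.
boxes = tile_boxes(1664, 2432)
largest = max((r - l) * (b - t) for l, t, r, b in boxes)
print(len(boxes), "tiles; largest tile has", largest, "of", 1664 * 2432, "pixels")
```

Smaller tiles lower the peak further but add more passes, which is where the slight time cost of tiled VAE comes from.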

Fix 4: set batch size to 1

Batch size multiplies VRAM use because the GPU holds multiple images at once. If you are generating four at a time and hitting OOM, drop batch size to 1. Use batch count instead, which generates images sequentially rather than simultaneously, so you still get multiple outputs without the memory multiplier.

# In the UI
Batch size: 1     # images held in VRAM at once - keep low
Batch count: 4    # images generated in sequence - safe to raise
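
The difference between the two settings comes down to simple arithmetic, sketched below with a hypothetical helper. It is only a model of the behavior described above, not code from any UI.

```python
def batch_plan(batch_size, batch_count):
    """Images held in VRAM at once versus images produced in total."""
    return {
        "in_vram_at_once": batch_size,
        "total_images": batch_size * batch_count,
    }

print(batch_plan(4, 1))  # four images, all resident together
print(batch_plan(1, 4))  # four images, one resident at a time
```

Both plans deliver four images. Only the second keeps the per-image memory cost from multiplying.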

Fix 5: install and enable xformers

xformers provides memory efficient attention, reducing VRAM use during generation, often substantially. It is one of the highest value additions for low VRAM cards and usually speeds things up too. Add the --xformers flag and make sure it is installed against your torch build.

set COMMANDLINE_ARGS=--medvram --xformers

On some newer setups, PyTorch's native scaled dot product attention is competitive, but xformers remains a reliable VRAM saver on consumer cards. If xformers fails to install, your torch and CUDA versions may be mismatched, and that mismatch has to be resolved first.
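
A quick way to see whether xformers is even present is to ask Python directly. This is a minimal sketch using only the standard library, with a hypothetical helper name. It assumes you run it with the same Python environment your front end uses, and it only tells you the package can be found, not that the UI actually enabled it.

```python
import importlib.util

def module_available(name):
    """True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(name) is not None

for name in ("torch", "xformers"):
    print(name, "found" if module_available(name) else "missing")
```

If xformers shows as missing here while the flag is set, the startup log of your front end is the next place to look.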

Fix 6: close other apps using the GPU

This sounds trivial but it catches everyone. Your browser with many tabs, a game launcher, a video call, or another model session can hold gigabytes of VRAM. Before blaming the model, check GPU memory use. On Windows, Task Manager shows dedicated GPU memory per process, and nvidia-smi gives a precise breakdown. Free that memory and your OOM may vanish without any setting change.

# Check what is using your GPU
nvidia-smi
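
If you would rather get plain numbers than read the full nvidia-smi table, the tool also has a CSV query mode. The sketch below calls it from Python. The --query-gpu and --format options are part of nvidia-smi, while the function names are hypothetical, and the script simply returns nothing when nvidia-smi is not installed.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]

def parse_gpus(csv_text):
    """Parse rows of 'name, used MiB, free MiB' into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.rsplit(",", 2)]
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue  # skip rows that do not report numeric memory values
        gpus.append(
            {"name": parts[0], "used_mib": int(parts[1]), "free_mib": int(parts[2])}
        )
    return gpus

def query_gpus():
    """Run nvidia-smi and return parsed rows, or [] if it is unavailable."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpus(result.stdout)

for gpu in query_gpus():
    print(f"{gpu['name']}: {gpu['used_mib']} MiB in use, {gpu['free_mib']} MiB free")
```

Run it once before launching your front end. Whatever shows as in use at that point is memory other apps are holding.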

Fix 7: tame hires fix

Hires fix is a top OOM trigger because it regenerates at a higher resolution, roughly doubling demand. If your base generates fine but enabling hires crashes you, lower the upscale factor from 2x to 1.5x, reduce hires steps, and make sure tiled VAE is on. Alternatively, skip the in-generation hires fix and upscale the finished image with a standalone upscaler, which can be more memory friendly. Our upscaler guide covers low VRAM upscaling approaches.
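
The gap between 2x and 1.5x is bigger than it sounds, because pixel count grows with the square of the factor. Here is a small sketch of that arithmetic. The helper name is hypothetical, and rounding to multiples of 8 is an assumption based on the usual latent downscale factor. Total VRAM grows less than the pixel multiplier suggests, since the model weights do not change.

```python
def hires_target(width, height, factor):
    """Return upscaled width, height (multiples of 8) and the pixel multiplier."""
    new_w = int(width * factor) // 8 * 8
    new_h = int(height * factor) // 8 * 8
    return new_w, new_h, (new_w * new_h) / (width * height)

for factor in (1.5, 2.0):
    w, h, mult = hires_target(832, 1216, factor)
    print(f"{factor}x -> {w} x {h}, {mult:.2f}x the pixels of the base image")
```

Dropping from 2x to 1.5x cuts the hires pass from four times the base pixels to 2.25 times, which is often enough to get under the limit.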

The full OOM fix order

Work the fixes in this order, since the early ones cost the least and clear most cases.

| Step | Action | Cost |
| --- | --- | --- |
| 1 | Close other GPU apps | Free, instant |
| 2 | Add --medvram or --medvram-sdxl | Small speed hit |
| 3 | Add --xformers | Usually faster |
| 4 | Lower resolution to native | None, often better |
| 5 | Set batch size to 1 | Use batch count instead |
| 6 | Enable tiled VAE | Slight time cost |
| 7 | Tame or skip hires fix | Adjust workflow |
| 8 | Use --lowvram on tiny cards | Bigger speed hit |

If you reach step 8 and still crash at native resolution with a single image, your hardware is genuinely below what the model needs, and it is time to consider lighter models or cloud GPUs.
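
If it helps to keep track of where you are, the ladder can be written down as a simple ordered list. This is just the table above expressed as data, with a hypothetical helper that returns the next step you have not tried yet.

```python
FIX_LADDER = [
    ("Close other GPU apps", "free, instant"),
    ("Add --medvram or --medvram-sdxl", "small speed hit"),
    ("Add --xformers", "usually faster"),
    ("Lower resolution to native", "none, often better"),
    ("Set batch size to 1", "use batch count instead"),
    ("Enable tiled VAE", "slight time cost"),
    ("Tame or skip hires fix", "adjust workflow"),
    ("Use --lowvram on tiny cards", "bigger speed hit"),
]

def next_fix(steps_tried):
    """Return (step number, action, cost) for the next step, or None when done."""
    if steps_tried >= len(FIX_LADDER):
        return None  # ladder exhausted: consider a lighter model or a cloud GPU
    action, cost = FIX_LADDER[steps_tried]
    return steps_tried + 1, action, cost

print(next_fix(0))
print(next_fix(8))
```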

VRAM tiers: what runs on what

Knowing your card’s realistic ceiling saves a lot of frustration. These are practical expectations for local NSFW generation in 2026.

| VRAM | What runs | Recommended flags |
| --- | --- | --- |
| 4 GB | SD 1.5 only, slow, SDXL very hard | --lowvram --xformers |
| 6 GB | SD 1.5 comfortable, SDXL with help | --lowvram --xformers, tiled VAE |
| 8 GB | SDXL and Pony workable | --medvram-sdxl --xformers |
| 12 GB | SDXL comfortable, hires fine | --medvram-sdxl --xformers |
| 16 GB | SDXL and Flux comfortable | --xformers, medvram optional |
| 24 GB plus | Everything, large batches | --xformers, no medvram needed |
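
The tier table maps neatly onto a lookup, sketched here with hypothetical function names. It assumes a card between two tiers, say 10 GB, should use the flags of the tier just below it, which is a cautious reading of the table rather than something the table states.

```python
# Tier table from this guide: (minimum VRAM in GB, recommended flags).
TIERS = [
    (16, ["--xformers"]),
    (8, ["--medvram-sdxl", "--xformers"]),
    (4, ["--lowvram", "--xformers"]),
]

def recommended_flags(vram_gb):
    """Return the flag list for the highest tier the card reaches, or None."""
    for minimum, flags in TIERS:
        if vram_gb >= minimum:
            return flags
    return None  # below 4 GB the table has no recommendation

def launch_line(vram_gb):
    """Build the webui-user.bat line for a given card size, or None."""
    flags = recommended_flags(vram_gb)
    if flags is None:
        return None
    return "set COMMANDLINE_ARGS=" + " ".join(flags)

for gb in (4, 6, 8, 12, 16, 24):
    print(f"{gb} GB: {launch_line(gb)}")
```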

For checkpoint choices that fit smaller cards, see the best low VRAM checkpoints guide. For a deeper hardware breakdown, the GPU requirements guide tells you what each tier really delivers.

AMD and Apple GPUs

NVIDIA GPUs use CUDA, which is what the out of memory error refers to. AMD cards use ROCm or DirectML and have their own memory behavior and quirks, and Apple Silicon uses unified memory. If you are on AMD, the flags differ and the setup is more involved, as covered in our AMD GPU guide. The core principles still apply: lower resolution, batch size 1, tiled VAE, and fewer add-ons reduce memory pressure on any GPU.

When to rent a cloud GPU

Sometimes the honest answer is that your card cannot do what you want, and no flag will change that. If you need large batches, very high resolution, Flux class models, or LoRA training, and you are on an 8 GB card, renting a cloud GPU is far cheaper than buying a new one for occasional use. Services offer hourly access to 24 GB and 48 GB cards, so you pay only for the time you generate.

Cloud GPUs make sense when you have a heavy job occasionally, when you want to train a LoRA, or when you want to batch produce at high resolution. For light daily use, a local 8 GB or 12 GB card with the flags above is usually enough. And for quick generations with zero setup, our browser generator runs remotely so your VRAM never matters.

Reading the error message for clues

The full error often tells you how close you were. A line like “Tried to allocate 2.00 GiB” with a small shortfall means you are just over the edge, and a single change like batch size 1 or tiled VAE will fix it. A huge shortfall means you are far over budget and need bigger changes, like a lower resolution or the --lowvram flag. The error also reports reserved versus allocated memory, which can reveal fragmentation. If you see fragmentation language, a full restart of your front end clears the VRAM and often resolves a crash that flags alone did not.

# Common message shape
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.00 GiB. GPU has a total capacity of 8.00 GiB
of which 1.20 GiB is free.
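
You can pull the numbers out of that message with a short script. The sketch below assumes the message follows the shape shown above. The exact wording varies between PyTorch versions, so treat the pattern and the function names as illustrative. The shortfall it reports is for the single failed allocation only, which makes it a rough signal rather than the full overrun.

```python
import re

PATTERN = re.compile(
    r"Tried to allocate (?P<want>[\d.]+) (?P<want_unit>[MG])iB"
    r".*?total capacity of (?P<total>[\d.]+) (?P<total_unit>[MG])iB"
    r".*?of which (?P<free>[\d.]+) (?P<free_unit>[MG])iB is free",
    re.DOTALL,
)

def to_gib(value, unit):
    """Convert a MiB or GiB figure from the message to GiB."""
    return float(value) / 1024 if unit == "M" else float(value)

def oom_shortfall(message):
    """Return requested, free, and shortfall in GiB, or None if no match."""
    m = PATTERN.search(message)
    if m is None:
        return None
    want = to_gib(m["want"], m["want_unit"])
    free = to_gib(m["free"], m["free_unit"])
    return {
        "requested_gib": want,
        "free_gib": free,
        "shortfall_gib": round(want - free, 2),
    }

example = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory.\n"
    "Tried to allocate 2.00 GiB. GPU has a total capacity of 8.00 GiB\n"
    "of which 1.20 GiB is free."
)
print(oom_shortfall(example))
```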

Fragmentation and the restart fix

VRAM can fragment over a long session, especially after switching models, loading and unloading LoRAs, or running many varied jobs. Fragmentation means there is technically enough free VRAM, but not in one contiguous block large enough for the allocation, so CUDA still errors. The cure is simple: fully restart your front end to reset the GPU memory pool. If you crash after hours of work on settings that worked earlier, suspect fragmentation and restart before changing anything else. Some users also set the PYTORCH_CUDA_ALLOC_CONF environment variable to reduce fragmentation, but a restart is the quick, reliable first move.
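
For reference, this is roughly what setting that variable looks like from Python. PYTORCH_CUDA_ALLOC_CONF and its max_split_size_mb option are real PyTorch allocator settings, but the value 128 is only a commonly quoted starting point, not a tested recommendation, and the helper name is hypothetical. The variable has to be set before PyTorch makes its first CUDA allocation, so it belongs ahead of any torch import.

```python
import os

def alloc_conf(**options):
    """Join allocator options into the comma-separated key:value format."""
    return ",".join(f"{key}:{value}" for key, value in options.items())

# Assumed starting value; tune or remove it if it does not help on your card.
value = alloc_conf(max_split_size_mb=128)

# setdefault leaves any value you already exported untouched.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", value)
print("PYTORCH_CUDA_ALLOC_CONF =", os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

For Automatic1111 on Windows, the same setting can go in webui-user.bat as its own set line, using the same name and value.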

Avoiding OOM in the first place

A few habits keep you out of trouble. Default to native resolution rather than oversized canvases. Keep batch size at 1 and use batch count for multiples. Leave tiled VAE on if you ever upscale. Run only the LoRAs and ControlNets you actually need, since each adds memory. And close heavy apps before a session. Build these into your defaults and OOM becomes rare rather than routine.

rem A solid low VRAM default for an 8 GB card (webui-user.bat)
set COMMANDLINE_ARGS=--medvram-sdxl --xformers
rem In the UI:
rem base resolution 832x1216, batch size 1, tiled VAE on,
rem hires fix 1.5x at 0.45 denoise only when needed

LoRAs, ControlNet, and the memory they add

Every add-on you load consumes VRAM on top of the base model. A stack of four LoRAs, a ControlNet model, and an embedding can quietly push you over the edge on an 8 GB card even at a normal resolution. If you started crashing right after adding extensions, that is your suspect. Load only the LoRAs you actually need for the current image, and prefer one ControlNet at a time. ControlNet in particular is memory hungry because it loads a second control model alongside your checkpoint. On tight hardware, generate the controlled base, then disable ControlNet for the refinement pass to free that memory. Trimming your add-on stack is often the difference between a crash and a clean run, and it usually improves the image too, since over-stacked LoRAs fight each other and degrade quality.

Model-specific VRAM notes

Not every model class behaves the same on a tight card, and knowing the differences saves a lot of failed runs. SD 1.5 is the lightest option and will run on almost anything, including 4 GB cards, which is why it is still the fallback for very low VRAM rigs. SDXL and Pony roughly double the memory footprint of SD 1.5, so 8 GB is the realistic entry point with --medvram-sdxl. Illustrious behaves like SDXL since it shares the architecture. Flux is the heaviest of the common families and genuinely wants 12 GB or more, though quantized GGUF builds of Flux exist specifically to fit it onto 8 GB cards at a quality and speed cost.

| Model | Comfortable VRAM | Tight but workable | Key flag |
| --- | --- | --- | --- |
| SD 1.5 | 6 GB | 4 GB | --lowvram --xformers |
| SDXL / Illustrious | 12 GB | 8 GB | --medvram-sdxl --xformers |
| Pony | 12 GB | 8 GB | --medvram-sdxl --xformers |
| Flux (full) | 16 GB plus | 12 GB | --medvram --xformers |
| Flux (GGUF quantized) | 8 GB | 6 GB | --lowvram --xformers |
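
To check a specific card against the table above, you can encode the two VRAM columns and compare. This is a sketch with hypothetical names. The thresholds are copied from the table, and the 8 GB card size is just an example input.

```python
# (comfortable GB, tight-but-workable GB) per model family, from the table above.
MODEL_VRAM = {
    "SD 1.5": (6, 4),
    "SDXL / Illustrious": (12, 8),
    "Pony": (12, 8),
    "Flux (full)": (16, 12),
    "Flux (GGUF quantized)": (8, 6),
}

def fit(model, vram_gb):
    """Classify how a model family fits on a card with vram_gb of VRAM."""
    comfortable, tight = MODEL_VRAM[model]
    if vram_gb >= comfortable:
        return "comfortable"
    if vram_gb >= tight:
        return "tight but workable"
    return "below the workable range"

card_gb = 8  # example card size
for model in MODEL_VRAM:
    print(f"{model}: {fit(model, card_gb)}")
```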

If you are on an 8 GB card and want Flux, do not fight the full model. Download a quantized GGUF version sized for your VRAM and you will get usable results where the full model only ever throws out of memory. For checkpoint picks that fit each tier, the low VRAM checkpoints guide is the practical shortlist.

A diagnostic checklist before you change settings

Before you start flipping flags, run this quick checklist. It often reveals that the fix is simpler than you feared.

  1. Run nvidia-smi and note how much VRAM is already used before you generate. If a browser or game is holding 2 GB, close it first.
  2. Check whether the crash happens at the start (model load), during sampling (resolution or batch), or at the very end (VAE decode). The stage points straight at the fix.
  3. Note the exact “Tried to allocate” number in the error. A small shortfall is a one-setting fix, a large one needs a resolution or model change.
  4. Confirm you actually restarted after your last big session. Fragmentation from a long session masquerades as a hardware limit.
  5. Verify xformers is really active. The startup log prints whether it loaded. A silent failure means you lost a major VRAM saver without knowing.

Working this list top to bottom takes under two minutes and frequently turns a frustrating crash into a single obvious cause. Only after these five checks should you move into the flag ladder above, since changing settings blindly while another app is eating your VRAM just hides the real problem.
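
Item 2 of the checklist, matching the crash stage to a fix, can be sketched as a small lookup. The mapping is one reading of the fixes described in this guide, not an exhaustive rule, and the stage labels and function name are made up for the example.

```python
# Crash stage -> fixes to try first (an interpretation of this guide's advice).
STAGE_FIXES = {
    "model load": ["--medvram or --lowvram", "a lighter or quantized model"],
    "sampling": ["lower resolution", "batch size 1", "fewer LoRAs and ControlNets"],
    "vae decode": ["tiled VAE", "lower hires upscale factor"],
}

FALLBACK = ["restart the front end, then re-check VRAM use with nvidia-smi"]

def fixes_for(stage):
    """Return suggested fixes for the stage at which the crash occurred."""
    return STAGE_FIXES.get(stage.strip().lower(), FALLBACK)

for stage in ("Model load", "Sampling", "VAE decode", "unknown"):
    print(f"{stage}: {', '.join(fixes_for(stage))}")
```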

Bringing it together

CUDA out of memory is a VRAM budget problem with a clear fix ladder. Close other GPU apps, add --medvram or --medvram-sdxl and --xformers, lower resolution to native, set batch size to 1, enable tiled VAE, and tame hires fix. Those steps clear the vast majority of errors on 6 GB to 12 GB cards. If you still crash at native resolution with a single image, your hardware is the limit, and a cloud GPU or a lighter model is the answer.

Ready to keep generating? Use the low VRAM defaults above on your local rig, or skip the hardware question entirely with our generator. For any other failure mode, return to the troubleshooting pillar and match your symptom in the table.

Frequently asked questions

What does CUDA out of memory mean exactly?

It means your GPU ran short of video memory, or VRAM, for the job you requested. The model weights, working latents, VAE decode, and any LoRA or ControlNet all consume VRAM, and when the total exceeds what is free, CUDA throws the error and stops the run. It is a memory budget problem, not a corruption, and it is fixable with flags and lower settings.

Should I use --medvram or --lowvram?

Use --medvram for 8 GB to 12 GB cards, since it cuts peak VRAM with only a small speed cost. Use --lowvram for 4 GB to 6 GB cards, accepting a bigger speed penalty in exchange for fitting the model at all. On mid range cards running SDXL, --medvram-sdxl is often the best balance because it only applies the memory saving to SDXL models.

Why does hires fix cause out of memory errors?

Hires fix regenerates the image at a higher resolution, which roughly doubles VRAM demand, so a job that fits at the base size crashes the moment you enable a large upscale. Lower the upscale factor to 1.5x, reduce hires steps, enable tiled VAE, or skip the in-generation hires fix and upscale the finished image with a standalone upscaler that is more memory friendly.

Does xformers help with VRAM?

Yes, xformers provides memory efficient attention that reduces VRAM use during generation, often substantially, and it usually speeds things up too. It is one of the highest value additions for low VRAM cards. Add the --xformers flag and confirm it installed against your torch build. If installation fails, your torch and CUDA versions may be mismatched, and torch may need to be reinstalled to match.

What is tiled VAE and when do I need it?

Tiled VAE decodes the final image in small tiles rather than all at once, which slashes the sharp VRAM spike that happens at the decode stage. You need it if you crash right at the end of a run or only when upscaling to high resolution. It adds a slight time cost but reliably prevents decode stage out of memory errors on limited hardware.

Can an 8 GB GPU run SDXL NSFW models?

Yes, an 8 GB card can run SDXL and Pony based NSFW models with help. Use the --medvram-sdxl and --xformers flags, generate at native resolution like 832 by 1216, keep batch size at 1, and enable tiled VAE for upscaling. Heavy jobs like large batches or Flux class models may still be out of reach, but standard generation is comfortable with these settings.

When should I rent a cloud GPU instead?

Rent a cloud GPU when you need large batches, very high resolution, Flux class models, or LoRA training and your local card cannot keep up. Hourly access to 24 GB or 48 GB cards is far cheaper than buying new hardware for occasional heavy use. For light daily generation, a local 8 GB or 12 GB card with the right flags is usually sufficient.

How do I know if another app is stealing my VRAM?

Run nvidia-smi from a terminal, or open Task Manager on Windows and check dedicated GPU memory per process. Browsers with many tabs, game launchers, video calls, and other model sessions can hold gigabytes of VRAM. Close them before generating. Freeing that memory sometimes clears an out of memory error entirely without changing any model setting.