How to Train an NSFW LoRA on Low VRAM (2026)

You can train an NSFW LoRA on 6GB or 8GB of VRAM with gradient checkpointing, batch size 1, AdamW8bit or Adafactor, bf16, cached latents, and a capped resolution. On 6GB use SD 1.5 or low-res SDXL; 8GB handles SDXL at 768 to 1024; Flux needs 12GB or the cloud. Keep all subjects adult, fictional, and AI-generated.

Training a LoRA used to mean a 24GB card. In 2026 the memory-saving stack has matured enough that a 6GB laptop GPU can train a usable LoRA overnight, and an 8GB card trains SDXL LoRAs comfortably. The trick is knowing which knobs trade speed and quality for memory, and being honest about what each VRAM tier can actually do. This guide gives you a working low-VRAM config and a realistic per-tier breakdown, plus the cloud fallback for when your card simply is not big enough.

If you are new to the process, read the complete LoRA training guide first; this post assumes you know the basic pipeline and just need it to fit in limited memory.

The memory-saving stack, knob by knob

Every technique below cuts VRAM. You stack them. The more memory-starved your card, the more you turn on, accepting slower training in exchange for it fitting at all.

  • Gradient checkpointing. The single biggest saver. Instead of keeping every intermediate activation in memory for the backward pass, it recomputes them on the fly. Costs roughly 20 to 30 percent training speed, saves a huge chunk of VRAM. Always on for low-VRAM training.
  • Batch size 1. Larger batches are smoother but each image in the batch lives in memory simultaneously. On constrained cards, train one image at a time and use gradient accumulation if you want an effective larger batch without the memory cost.
  • 8-bit optimizers (AdamW8bit) or Adafactor. A normal AdamW optimizer stores two full-precision state tensors per trainable weight. AdamW8bit quantizes that state to 8-bit and cuts optimizer memory dramatically. Adafactor goes further by keeping only a factored (row and column) approximation of the second-moment state instead of the full tensor, which is why it is the go-to for the tightest cards and for Flux.
  • fp16 / bf16 mixed precision. Half precision halves activation and weight-copy memory versus fp32. Use bf16 on RTX 30-series and newer (more stable); fp16 on older cards. Never train full fp32 on a small card.
  • Cached latents. Pre-encode every training image to its VAE latent once and store it, so the VAE does not need to sit in VRAM during training. This also speeds up epochs. Pair with cached text-encoder outputs to free even more.
  • Capped resolution and bucket limits. Activation memory scales roughly with pixel count. Dropping from 1024 to 768 cuts the pixel count to about 56 percent, and 512 cuts it to 25 percent. Cap your bucket resolution so a few wide or tall images do not blow past your limit.
  • xformers / SDPA attention. Memory-efficient attention kernels cut the attention memory spike. Enable xformers (or PyTorch SDPA) on every low-VRAM run.
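
To put rough numbers on the optimizer comparison above, here is a back-of-envelope sketch in plain Python. It assumes fp32 AdamW keeps two 4-byte values per trainable weight, an 8-bit optimizer keeps two 1-byte values (ignoring the small per-block scaling constants), and Adafactor without first-moment state keeps one row vector and one column vector per 2-D weight. The function name and the example shapes are illustrative and not taken from any trainer.

```python
# Rough optimizer-state estimate for a list of trainable 2-D weight shapes.
# Simplifying assumptions: AdamW = 2 fp32 values per weight; 8-bit Adam =
# 2 one-byte values per weight (block scaling constants ignored);
# Adafactor = factored second moment only (rows + cols fp32 values per weight matrix).

def optimizer_state_bytes(shapes, optimizer):
    total = 0
    for rows, cols in shapes:
        n = rows * cols
        if optimizer == "adamw":
            total += n * 2 * 4
        elif optimizer == "adamw8bit":
            total += n * 2 * 1
        elif optimizer == "adafactor":
            total += (rows + cols) * 4
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
    return total

# One hypothetical LoRA pair at dim 16 on a 1280-wide layer:
# a down matrix (16 x 1280) and an up matrix (1280 x 16).
lora_pair = [(16, 1280), (1280, 16)]
for name in ("adamw", "adamw8bit", "adafactor"):
    print(name, optimizer_state_bytes(lora_pair, name), "bytes")
```

Multiply the result across every adapted layer to see how the gap grows with network dim. Treat it as an order-of-magnitude guide only; real optimizers add bookkeeping that this sketch leaves out.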

What is realistic per VRAM tier

Be honest with yourself here. Forcing SDXL onto 6GB at 1024 will run out of memory no matter how many flags you set. This table reflects what actually trains in 2026 with the full memory stack enabled.

VRAM  | What you can train                                                   | Notes
6GB   | SD 1.5 LoRA at 512; SDXL LoRA at 512 to 768 (slow, tight)            | Adafactor, dim 8 to 16, latents cached, expect long runs
8GB   | SDXL LoRA at 768 to 1024; SD 1.5 easily                              | AdamW8bit fine, dim 16 to 32, comfortable sweet spot
12GB  | SDXL LoRA at 1024 comfortably; Flux LoRA via FluxGym (low-VRAM path) | Headroom for batch 2, dim up to 32 to 64
16GB+ | SDXL with batch 2 to 4; Flux LoRA more comfortably                   | Faster, can experiment with higher dim and full Adam
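
If you want the tier table in a script, for example to pick a preset automatically, a minimal lookup could look like the sketch below. The tier text is copied from the table; the function name and the "below 6GB" fallback message are this sketch's own choices.

```python
# Tier table from this guide as (minimum VRAM in GB, suggestion), largest first.
TIERS = [
    (16, "SDXL with batch 2 to 4; Flux LoRA more comfortably"),
    (12, "SDXL LoRA at 1024 comfortably; Flux LoRA via FluxGym (low-VRAM path)"),
    (8, "SDXL LoRA at 768 to 1024; SD 1.5 easily"),
    (6, "SD 1.5 LoRA at 512; SDXL LoRA at 512 to 768 (slow, tight)"),
]

def suggest_target(vram_gb):
    """Return the suggestion for the largest tier that fits the given VRAM."""
    for min_gb, suggestion in TIERS:
        if vram_gb >= min_gb:
            return suggestion
    # Not a row in the table: the guide's cloud fallback for cards that are too small.
    return "Below 6GB: rent a cloud GPU"

for gb in (4, 6, 8, 12, 24):
    print(f"{gb}GB -> {suggest_target(gb)}")
```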

For a deeper look at the cards themselves and where the value sits, see the GPU hardware requirements guide. If your card cannot even run generation comfortably, the low-VRAM checkpoints guide helps you pick a base model that fits.

A working low-VRAM config (8GB SDXL)

This is a known-good Kohya config for an 8GB card training an SDXL LoRA at 768. It trains overnight on most 8GB cards. Drop resolution to 512 and switch to Adafactor for 6GB.

# Kohya low-VRAM SDXL LoRA config (8GB target)
network_module              = "networks.lora"
network_dim                 = 16
network_alpha               = 8
train_batch_size            = 1
gradient_accumulation_steps = 2            # effective batch 2 without the memory
gradient_checkpointing      = true         # essential on low VRAM
optimizer_type              = "AdamW8bit"  # use "Adafactor" for 6GB
mixed_precision             = "bf16"       # "fp16" on pre-30-series cards
resolution                  = "768,768"    # "512,512" for 6GB
max_bucket_reso             = 1024
cache_latents               = true
cache_latents_to_disk       = true
cache_text_encoder_outputs  = true
network_train_unet_only     = true         # U-Net only; needed with cached text-encoder outputs
xformers                    = true
learning_rate               = 1e-4
lr_scheduler                = "cosine"
max_train_epochs            = 10
save_every_n_epochs         = 2
clip_skip                   = 2            # for Pony / Illustrious bases

A few notes. cache_text_encoder_outputs frees the text encoder from VRAM during training, which is a meaningful saving, but it means you cannot also train the text encoder in that run; for most NSFW character and concept LoRAs the U-Net-only result is fine. Setting gradient accumulation to 2 (the option is named gradient_accumulation_steps in Kohya's sd-scripts) gives you the smoothing benefit of a batch of 2 while only ever holding one image in memory. If you still hit out-of-memory, drop resolution first, then switch to Adafactor, then lower dim. For the reasoning behind each value across all card sizes, see best NSFW LoRA training settings.
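
Before starting a long run, it can help to check a config for the mistakes these notes warn about. The sketch below works on a plain Python dict rather than on any real Kohya file format. check_config is a hypothetical helper, and train_text_encoder is an illustrative key invented for this example, not a Kohya option.

```python
# Minimal sanity check for a low-VRAM config held as a plain dict.
# check_config and the train_text_encoder key are hypothetical, for illustration only.

def check_config(cfg):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if not cfg.get("gradient_checkpointing", False):
        problems.append("gradient_checkpointing is off")
    if cfg.get("cache_text_encoder_outputs") and cfg.get("train_text_encoder"):
        problems.append("cannot train the text encoder while its outputs are cached")
    if cfg.get("train_batch_size", 1) > 1:
        problems.append("batch size above 1; prefer gradient accumulation")
    if cfg.get("mixed_precision") not in ("bf16", "fp16"):
        problems.append("mixed_precision should be bf16 or fp16")
    return problems

good = {
    "gradient_checkpointing": True,
    "cache_text_encoder_outputs": True,
    "train_text_encoder": False,
    "train_batch_size": 1,
    "mixed_precision": "bf16",
}
bad = dict(good, gradient_checkpointing=False, train_text_encoder=True, train_batch_size=2)

print(check_config(good))
for problem in check_config(bad):
    print("-", problem)
```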

Captions and dataset still matter on small cards

Low VRAM changes how you train, not what you feed the trainer. A tight 20-to-40-image dataset with clean captions beats a sprawling one, and on a slow card a smaller dataset is also kinder to your training time. Keep subjects varied for a style LoRA and consistent for a character LoRA, and caption accordingly. The dataset guide and captioning guide apply unchanged. If you want a lean dataset that fits a small-card budget, generate it yourself with our free NSFW AI image generator and curate down to your best 25 images.

Safety and consent. Subjects must be adult (18+), fictional, AI-generated, or fully owned and consented. Never train on a real identifiable person without explicit consent, and never on minors or minor-appearing subjects. The TAKE IT DOWN Act treats non-consensual intimate imagery as a serious matter; use synthetic or consented datasets only. This is not legal advice.

Testing without blowing your memory budget

You do not need to reload a fat trainer to test. Generate with your saved LoRA epochs in your normal inference tool, which uses far less VRAM than training. Sweep epochs at a fixed seed to find the one that is trained enough but not fried.

# Test prompt across epochs (low-VRAM inference)
<lora:mychar-000006:0.85> ohwx woman, full body, standing,
soft lighting, detailed skin, bedroom

Negative: child, minor, underage, loli, shota, deformed, bad anatomy,
extra limbs, lowres, blurry, watermark, text

If epoch 6 looks under-baked and epoch 10 looks over-baked, the right answer is usually a saved checkpoint in between, which is exactly why save_every_n_epochs = 2 is in the config. For output problems unrelated to memory, the troubleshooting guide covers the common artifacts.
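
To keep the sweep fair, every test image should differ only in which epoch file is loaded. A small sketch that builds one prompt line per saved epoch is shown below. It assumes the six-digit epoch suffix and the <lora:name:weight> syntax used in the example above; your trainer or inference tool may name the final epoch differently, so adjust the stem to match your files.

```python
# Build one test prompt per saved epoch so only the LoRA file changes between images.
# The stem, weight, and prompt text follow the example above; adapt them to your own run.

def epoch_prompts(stem, epochs, weight, prompt):
    return [f"<lora:{stem}-{epoch:06d}:{weight}> {prompt}" for epoch in epochs]

prompts = epoch_prompts(
    stem="mychar",
    epochs=range(2, 11, 2),  # save_every_n_epochs = 2 over 10 epochs
    weight=0.85,
    prompt="ohwx woman, full body, standing, soft lighting, detailed skin, bedroom",
)
for line in prompts:
    print(line)
```

Paste each line into your inference tool with the same seed and the same negative prompt, then compare the results side by side.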

The cloud fallback: rent a GPU

Sometimes the card is just too small. A 4GB GPU will not train SDXL no matter what, and Flux training is painful below 12GB. When that is your situation, renting a cloud GPU for a couple of hours is cheaper and faster than fighting your hardware.

  • RunPod, Vast.ai, and similar rent GPUs by the hour. An RTX 4090 (24GB) or A40 (48GB) rents for a low hourly rate, and a LoRA trains in well under an hour, so a full training run often costs less than a coffee.
  • You keep full content freedom. Renting raw compute and running your own Kohya or FluxGym instance means no platform content filter touches your NSFW dataset, unlike many hosted “train your model” web services that restrict adult content.
  • Workflow. Spin up a GPU pod with a Kohya or FluxGym template, upload your dataset, run the same config you would run locally (with the memory flags relaxed since you now have headroom), download the .safetensors, shut the pod down. You only pay while it runs.

The full economics, provider comparison, and step-by-step rental flow are in the cloud GPU rental guide, and if you are weighing the total spend, how much NSFW AI image generation costs puts training costs in context against running things locally. For comparing trainers themselves, the best NSFW LoRA training tools roundup covers which ones run well in a rented pod.

A step-by-step 8GB SDXL walkthrough

Here is the whole run on an 8GB card, start to finish, so you can follow it without guessing. This assumes Kohya SS and an adult, fictional, AI-generated character dataset.

  1. Prep the dataset. Curate 25 of your cleanest images, deduped, no watermarks, cropped sensibly. Put them in one folder named so Kohya reads the repeat count, for example 15_ohwxwoman, which means 15 repeats per image per epoch.
  2. Caption them. Write consistent captions that describe the variable scene (pose, light, setting) so identity binds to the trigger. Keep the format consistent across all 25.
  3. Set the base model. Point Kohya at your SDXL or Pony base checkpoint. For Pony or Illustrious set clip_skip = 2.
  4. Load the low-VRAM config. Use the config block above: dim 16, alpha 8, batch size 1, gradient accumulation 2, gradient checkpointing on, AdamW8bit, bf16, resolution 768, latents and text-encoder outputs cached, xformers on.
  5. Cache first. Let Kohya pre-cache latents and text-encoder outputs before the run proper. This is where the big VRAM saving comes from and it only happens once.
  6. Start training and watch the first few steps. Peak memory is allocated early; if it survives step five without an out-of-memory error, it will almost certainly finish. On an 8GB card at 768 you should sit comfortably under the limit.
  7. Let it run. Ten epochs of 25 images at 15 repeats is 3,750 single-image passes (25 x 15 x 10), or roughly 1,900 weight updates with gradient accumulation 2, which on a typical 8GB card runs overnight. Saving every two epochs gives you five candidate files.
  8. Test the candidates. In your inference tool, generate the same seed and prompt against epochs 6, 8, and 10 at weight 0.85. Pick the one that holds identity without looking fried.
  9. Lock it in. Rename your chosen .safetensors, archive the dataset and config together so you can retrain later, and run a final scene check in our free generator.

If step six runs out of memory, do not change everything at once. Drop resolution to 640 or 512 first, re-cache, and try again. That single change clears most 8GB failures.
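
The step arithmetic in the walkthrough is easy to check with a few lines of Python. This sketch reads the repeat count from a folder name in the repeats_name form described in step 1, then counts single passes and weight updates. The helper names are made up for this example, and whether a progress bar shows passes or updates depends on the trainer.

```python
import math

def parse_repeats(folder_name):
    """Split a '<repeats>_<name>' dataset folder name into (repeats, name)."""
    repeats, _, name = folder_name.partition("_")
    return int(repeats), name

def count_steps(images, repeats, epochs, batch_size=1, accumulation=1):
    """Return (forward/backward passes, weight updates) for the whole run."""
    batches_per_epoch = math.ceil(images * repeats / batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation)
    return batches_per_epoch * epochs, updates_per_epoch * epochs

repeats, name = parse_repeats("15_ohwxwoman")
passes, updates = count_steps(images=25, repeats=repeats, epochs=10, accumulation=2)
print(name, passes, updates)  # ohwxwoman 3750 1880
```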

Gradient accumulation: a bigger batch without the memory

One technique deserves its own section because it solves a real low-VRAM dilemma. Larger batch sizes make training smoother and more stable, but every image in a batch sits in VRAM at once, which is exactly what you cannot afford. Gradient accumulation gives you the smoothing benefit of a large batch while only ever holding one image in memory. Instead of updating the weights after every single image, the trainer accumulates gradients across several forward and backward passes, then applies one combined update. Set train_batch_size = 1 and gradient_accumulation_steps = 4, and you get the statistical effect of a batch of 4 at the memory cost of a batch of 1. The tradeoff is time: each weight update now takes four forward and backward passes instead of one, so the same number of updates takes about four times as long. But on a memory-starved card, slower-but-fits beats faster-but-crashes every time. For most low-VRAM NSFW LoRAs, an effective batch of 2 to 4 via accumulation is a good balance between stability and run time.
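
The equivalence is easy to verify on a toy problem. The sketch below uses a one-weight model with a squared-error loss in plain Python (no trainer involved) and shows that accumulating scaled per-sample gradients gives the same update as a true batch of four. The model, data, and function names are invented for the illustration.

```python
# Toy model: prediction = w * x, loss = mean squared error over the batch.

def grad_single(w, x, y):
    """Gradient of (w * x - y) ** 2 with respect to w for one sample."""
    return 2.0 * (w * x - y) * x

def full_batch_grad(w, batch):
    """Gradient of the mean loss when the whole batch is processed at once."""
    return sum(grad_single(w, x, y) for x, y in batch) / len(batch)

def accumulated_grad(w, batch):
    """Process one sample per pass, scaling each gradient by 1 / steps."""
    steps = len(batch)
    total = 0.0
    for x, y in batch:  # only one sample is held per pass
        total += grad_single(w, x, y) / steps
    return total  # applied as a single weight update after the last pass

batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
w = 0.5
print(full_batch_grad(w, batch), accumulated_grad(w, batch))
```

Both numbers match, which is the whole point: the update is the same, only the peak memory and the number of passes differ.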

Reading the numbers: how to tell you are out of memory

Low-VRAM training has a distinctive failure: a CUDA out-of-memory error, usually in the first few steps when the trainer tries to allocate its peak memory. When you see it, do not panic and do not start over blindly. Work down this list in order, retesting after each change, because each step costs you something and you want the smallest sacrifice that fits:

  1. Lower resolution first. Going from 1024 to 768, or 768 to 512, is the biggest single saving and usually the least painful. This is almost always the right first move.
  2. Confirm gradient checkpointing is on. If you somehow left it off, turning it on is a huge saving for a modest speed cost.
  3. Switch to a lighter optimizer. Move from AdamW8bit to Adafactor to shed optimizer-state memory.
  4. Drop network dim. Going from dim 32 to 16, or from 16 to 8, frees memory and, for many subjects, barely changes quality.
  5. Cache text encoder outputs. If you were training the text encoder, stop and cache it instead.

If you have done all five and a 768 SDXL LoRA still will not fit, your card is genuinely too small for that target and it is time to drop to SD 1.5, drop to 512, or rent a cloud GPU. There is no flag that conjures VRAM you do not have.
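
The five-step order above can be written down as a small helper that proposes exactly one change per failed attempt. This is a sketch over a plain dict with illustrative keys; next_fallback is a hypothetical function, and min_resolution is an assumed parameter that marks the lowest resolution you are willing to accept for the current target.

```python
# Propose the next single fallback, in the order of the list above.
# Keys and helper name are illustrative; they do not belong to any trainer.

RESOLUTION_STEPS = {1024: 768, 768: 512}
DIM_STEPS = {32: 16, 16: 8}

def next_fallback(cfg, min_resolution=512):
    """Return (description, new_cfg) for the next change to try, or None."""
    new = dict(cfg)
    lower = RESOLUTION_STEPS.get(cfg["resolution"])
    if lower is not None and lower >= min_resolution:
        new["resolution"] = lower
        return f"lower resolution to {lower}", new
    if not cfg["gradient_checkpointing"]:
        new["gradient_checkpointing"] = True
        return "turn on gradient checkpointing", new
    if cfg["optimizer"] != "Adafactor":
        new["optimizer"] = "Adafactor"
        return "switch optimizer to Adafactor", new
    if cfg["dim"] in DIM_STEPS:
        new["dim"] = DIM_STEPS[cfg["dim"]]
        return f"drop network dim to {new['dim']}", new
    if not cfg["cache_text_encoder_outputs"]:
        new["cache_text_encoder_outputs"] = True
        return "cache text encoder outputs", new
    return None  # nothing left: smaller model, lower target, or a cloud GPU

cfg = {"resolution": 1024, "gradient_checkpointing": True, "optimizer": "AdamW8bit",
       "dim": 16, "cache_text_encoder_outputs": True}
tried = []
step = next_fallback(cfg, min_resolution=768)
while step is not None:  # in a real run you would retest after each change
    description, cfg = step
    tried.append(description)
    step = next_fallback(cfg, min_resolution=768)
print(tried)
```

In practice you would stop at the first change that lets the run survive its opening steps instead of walking the whole list.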

Why low-VRAM training takes longer (and why that is fine)

It helps to know where the time goes so the wait does not surprise you. Gradient checkpointing adds recomputation, which slows each step. Batch size 1 with accumulation means more passes per weight update. Lower resolution is actually faster per step, but you may compensate with more steps to reach quality. The net effect is that a LoRA that trains in 20 minutes on a 4090 might take a couple of hours on an 8GB card. That is completely fine for a one-time training job. You are not generating in real time; you start the run, walk away, and come back to a finished .safetensors. The output file is identical in kind to one trained on a big card. The only genuine quality lever you sacrificed is training resolution, so cap that as high as your VRAM allows and accept the longer clock.
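
For a rough wall-clock estimate, time a handful of passes on your own card and scale up. The sketch below assumes gradient checkpointing reduces throughput by a fixed fraction (the 20 to 30 percent range quoted earlier in this guide); the function name is illustrative and the 2.0 seconds per pass is a made-up placeholder, not a measurement.

```python
def estimated_hours(passes, seconds_per_pass, checkpoint_slowdown=0.25):
    """Wall-clock hours when checkpointing cuts throughput by the given fraction.

    seconds_per_pass is the time per pass without gradient checkpointing.
    """
    if not 0 <= checkpoint_slowdown < 1:
        raise ValueError("checkpoint_slowdown must be in [0, 1)")
    seconds = passes * seconds_per_pass / (1.0 - checkpoint_slowdown)
    return seconds / 3600.0

# 3,750 passes = 25 images x 15 repeats x 10 epochs.
# 2.0 s per pass is a placeholder; measure your own card.
print(round(estimated_hours(3750, 2.0), 2))  # 2.78
```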

Putting it together

Start by enabling the full memory stack: gradient checkpointing, batch size 1 with accumulation, AdamW8bit or Adafactor, bf16, cached latents, and xformers. Match your resolution and model choice to your VRAM tier from the table. On 6GB, expect SD 1.5 or low-res SDXL and long runs. On 8GB, SDXL LoRA at 768 to 1024 is the comfortable sweet spot. At 12GB you can reach for Flux via FluxGym’s low-VRAM path; below that, or for faster runs, rent a cloud GPU and keep full control of your content. Either way you end up with the same .safetensors file and the same NSFW capability, just on a budget that fits your hardware. Spin up a quick test with our free generator once it is done to confirm the LoRA fires the way you intended.

Frequently asked questions

Can I really train an NSFW LoRA on a 6GB GPU?

Yes, with the full memory stack on. Use gradient checkpointing, batch size 1, the Adafactor optimizer, fp16 or bf16, cached latents, and a 512 resolution cap. On 6GB you are realistically training SD 1.5 LoRAs comfortably and SDXL LoRAs at low resolution slowly. It works, it is just slower than a bigger card.

What is the single most important setting for low-VRAM training?

Gradient checkpointing. It recomputes activations during the backward pass instead of storing them, which saves the largest single chunk of VRAM. It costs roughly 20 to 30 percent in training speed, which is a fair trade when the alternative is running out of memory entirely. Turn it on for any constrained card.

Should I use AdamW8bit or Adafactor on a small card?

AdamW8bit is the default for 8GB and up; it quantizes optimizer state to 8-bit and works well. Adafactor saves even more memory because it does not store full second-moment state, so it is the better choice for 6GB cards and for Flux training. Try AdamW8bit first and switch to Adafactor only if you still run out of memory.

Why cache latents and text encoder outputs?

Caching latents pre-encodes your images to VAE latents once, so the VAE does not occupy VRAM during training and epochs run faster. Caching text encoder outputs frees the text encoder from memory too. The tradeoff is that you cannot train the text encoder in that run, which is fine for most character and concept LoRAs that train the U-Net only.

What resolution should I train at on low VRAM?

Memory scales with pixel count, so resolution is a powerful lever. On 8GB, 768 is a good balance for SDXL. On 6GB, drop to 512. Use a bucket resolution cap so a few unusually wide or tall images do not spike past your limit. Lowering resolution is the first thing to try when you hit out-of-memory.
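
A quick calculation shows how strong the lever is. This sketch compares square training resolutions against 1024 and prints the matching latent size, assuming the usual 8x VAE downscale used by SD 1.5 and SDXL. The function names are illustrative.

```python
def relative_pixels(resolution, baseline=1024):
    """Pixel count of a square image relative to the baseline resolution."""
    return (resolution * resolution) / (baseline * baseline)

def latent_side(resolution, vae_downscale=8):
    """Side length of the latent, assuming an 8x VAE downscale."""
    return resolution // vae_downscale

for res in (1024, 768, 512):
    side = latent_side(res)
    print(res, f"{relative_pixels(res):.0%}", f"{side}x{side}")
```

Activation memory does not track pixel count exactly, but the ratios explain why one step down in resolution often clears an out-of-memory error.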

Can I train a Flux LoRA on low VRAM?

Flux is heavier than SDXL. The realistic floor is about 12GB using FluxGym’s low-VRAM path with Adafactor and aggressive memory flags. Below 12GB, Flux training is impractical and you should rent a cloud GPU instead. SDXL or Pony remain the better local choices for cards smaller than 12GB.

When should I rent a cloud GPU instead of training locally?

Rent when your card is too small for your target model, when you want Flux but have under 12GB, or when local runs are simply too slow to iterate. An RTX 4090 or A40 pod on RunPod or Vast.ai trains a LoRA in under an hour for a small hourly fee, and renting raw compute means no platform content filter touches your NSFW dataset.

Does low-VRAM training hurt the quality of the LoRA?

Not meaningfully if you do it right. Gradient checkpointing, 8-bit optimizers, and cached latents do not degrade the final weights; they only trade speed for memory. The real quality factors are still dataset, captions, and learning rate. The only genuine compromise on tiny cards is lower training resolution, which can soften fine detail, so cap resolution as high as your VRAM allows.