Caption Images For NSFW LoRA 2026

Caption NSFW LoRA images to match your base model: use Danbooru-style booru tags for Pony and Illustrious, natural-language sentences for Flux, and either for SD1.5. Add one unique trigger word, describe everything you want to stay variable, and prune tags for the constant concept so it absorbs into the trigger. Keep all subjects adult, fictional, and AI-generated.

Captioning is where most of the actual control in LoRA training happens. The dataset gives the network the visual material; the captions tell it what each part of that material means and which parts to bind to your trigger word. Get captioning wrong and even a perfect dataset produces a LoRA that ignores your trigger, bleeds unwanted features, or refuses to flex. This guide covers booru tags versus natural language, when to use which by base model, trigger words, tag pruning, auto-taggers, and the exact caption file format.

The two captioning styles

There are two broad approaches, and the right one depends almost entirely on what your base model was trained on.

Booru tags are comma-separated keywords, the same vocabulary you see on Danbooru and similar boards: 1girl, long hair, standing, looking at viewer, indoors. They are precise, compact, and map directly to how anime and Pony-family models think.

Natural-language captions are full sentences: a photo of a woman with long hair standing indoors, looking at the camera. They carry richer relational meaning and match how newer text encoders parse prompts.

Neither is universally better. They are tools matched to different base models. Mixing them randomly within one dataset is the mistake to avoid.
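
One way to catch accidental style mixing is a rough heuristic that sorts captions into "mostly short comma-separated tags" versus "prose". The sketch below is only an illustration: the function names, the three-word limit per tag, and the 0.8 ratio are assumptions chosen for the example, not values from any trainer.

```python
def looks_like_booru(caption: str, max_words_per_tag: int = 3,
                     min_short_ratio: float = 0.8) -> bool:
    """Heuristic: booru captions are several short comma-separated tags."""
    parts = [p.strip() for p in caption.split(",") if p.strip()]
    if len(parts) < 3:
        # One or two long segments read as a sentence, not a tag list.
        return False
    short = sum(1 for p in parts if len(p.split()) <= max_words_per_tag)
    return short / len(parts) >= min_short_ratio


def split_by_style(captions: dict[str, str]) -> dict[str, list[str]]:
    """Group caption file names by detected style (hypothetical helper)."""
    groups: dict[str, list[str]] = {"booru": [], "prose": []}
    for name in sorted(captions):
        style = "booru" if looks_like_booru(captions[name]) else "prose"
        groups[style].append(name)
    return groups
```

If both groups come back non-empty for a single dataset, those files are worth a manual look. The heuristic will misjudge edge cases, so treat it as a pointer rather than a verdict.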

Matching caption style to base model

This is the single most important decision in captioning. Train the way the base model expects to be prompted.

Base model	Caption style	Trigger placement	Notes
SD1.5	Either, booru or short natural	First tag/phrase	Tolerant; short tags work well.
SDXL / Pony	Booru tags (Danbooru)	First, before quality tags	Pony expects booru vocabulary; score tags help.
Illustrious	Booru tags (Danbooru)	First tag	Strongly tag-driven; clean tags matter a lot.
Flux	Natural language	In a sentence near the start	Flux parses prose; tag soup underperforms.

If you are training on Pony, lean on proper Danbooru tags, because that is the language the model already knows. The Pony Diffusion guide and the Illustrious model guide both cover the tag conventions those checkpoints expect, and your captions should mirror them. For Flux, write the way you would describe the image to a person.

Choosing a trigger word

The trigger word (also called an activation tag) is the unique token you put in every caption so that, at inference time, typing it summons your concept. Rules for a good trigger:

Make it unique. Use something the base model has never seen as a meaningful token. m1lf_aria or xyzcharacter is far safer than aria, which already carries baggage.
Keep it short and typeable. You will type it constantly.
Put it first. The leading token gets the most weight.
Use exactly one per concept. For a character, one trigger. For an outfit, one trigger.

A caption for a character with the trigger aria_nsfwchar might start: aria_nsfwchar, 1girl, ... Every image in the set carries that same first token. That repetition is what makes the network associate the trigger with the consistent visual features.
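
The "put it first, in every caption" rule can be enforced mechanically for booru-style captions. This is a minimal sketch; ensure_trigger_first is a hypothetical helper name, and it assumes tags are separated by commas.

```python
def ensure_trigger_first(caption: str, trigger: str) -> str:
    """Return the caption with the trigger as the first tag, exactly once."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    # Drop any existing copies so the trigger never appears twice.
    tags = [t for t in tags if t != trigger]
    return ", ".join([trigger] + tags)


if __name__ == "__main__":
    print(ensure_trigger_first("1girl, solo, standing", "aria_nsfwchar"))
```

Running it twice on the same caption gives the same result, so it is safe to apply to a folder that is already partly tagged.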

Tag pruning: the absorption principle

This is the concept that separates good LoRAs from frustrating ones. It sounds backwards at first, so read slowly.

Describe what you want to remain variable. Omit (prune) what you want baked into the trigger.

If you tag a feature, you give the model a handle to change it. If you omit a feature, the model cannot separate it from the trigger, so it absorbs into the trigger.

Example: you are training a character who always has green eyes. If you tag green eyes in every caption, you are telling the model green eyes are a separate, controllable attribute. The model may then drop them when you do not prompt for them. If instead you omit green eyes from every caption, the green eyes have nowhere to attach except the trigger, so the trigger learns to carry them. The eyes become part of the character automatically.

The practical rule:

For a character, prune tags for the constant identity features (face shape, eye color, signature hairstyle, body marks). Keep tags for things that change (pose, expression, outfit, background, lighting).
For a style, prune tags that describe the style itself. Keep tags for the varied subjects.

This is exactly how a character LoRA achieves identity that holds across poses and scenes. Over-tagging the identity is the number one cause of a character LoRA that “forgets” its own face.
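
As an illustration of the pruning rule, the sketch below removes a caller-supplied set of constant identity tags from a booru caption and leaves everything else untouched. The helper name and the underscore/space normalization are assumptions for the example; which tags count as "constant" is still your decision.

```python
def prune_tags(caption: str, constant_tags: set[str]) -> str:
    """Remove constant-identity tags so they absorb into the trigger."""
    def norm(tag: str) -> str:
        # Compare case-insensitively and treat "green_eyes" like "green eyes".
        return tag.lower().replace("_", " ")

    to_remove = {norm(t) for t in constant_tags}
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    kept = [t for t in tags if norm(t) not in to_remove]
    return ", ".join(kept)


if __name__ == "__main__":
    caption = "aria_nsfwchar, 1girl, green eyes, long hair, standing, smile"
    print(prune_tags(caption, {"green eyes", "long hair"}))
```

The normalization is used only for comparison, so a trigger that contains an underscore is written back unchanged.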

What to keep versus remove

A working tag-pruning checklist for a character:

Keep: 1girl, pose (standing, sitting, lying), camera (from above, close-up), expression (smile, open mouth), outfit when it varies, background, lighting, NSFW action tags.
Remove (let absorb): the character’s permanent eye color, permanent hair color and style, face shape descriptors, signature body features, any tag that is identical across the whole set and is part of who they are.
Always keep one thing: the trigger word, first, in every caption.

For a style LoRA, invert: remove the style descriptors, keep the per-image subject and composition tags.

Before committing to a tagging plan, it helps to generate a few test images so you know what the base model already produces. You can rough that out in our free NSFW AI image generator and compare its default rendering to what you want the LoRA to change.

Auto-taggers: WD14 and BLIP

You will not hand-write every caption from scratch. Auto-taggers give you a first pass that you then clean up.

WD14 tagger (wd-v1-4 and successors). The standard for booru-style tagging. It outputs Danbooru tags with confidence scores. Use it for Pony, Illustrious, and any tag-driven base. Set a confidence threshold (0.35 is a common default) so you get useful tags without noise. Kohya SS ships a WD14 tagging utility, and the Kohya setup guide walks through running it.
BLIP / BLIP-2. Generates natural-language captions. Use it for Flux or SD1.5 natural-language workflows. Output is a sentence per image. It is decent but generic, so you will edit for accuracy.

A realistic workflow:

1. Run WD14 (booru base) or BLIP (Flux) across the whole folder to generate .txt files.
2. Open each caption and fix errors: remove wrong tags, add missed NSFW action tags.
3. Prepend your trigger word to every file.
4. Prune the constant-identity tags per the absorption principle.

Auto-taggers save hours but they do not understand your intent. The pruning step is manual and is where the real quality comes from.
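
To make the confidence threshold concrete, here is a small sketch of the filtering step on its own. It assumes the tagger's output has already been loaded into a plain dictionary of tag to score; that input shape and the function name are assumptions for illustration, not the WD14 tagger's actual interface.

```python
def filter_by_confidence(scores: dict[str, float],
                         threshold: float = 0.35) -> list[str]:
    """Keep tags at or above the threshold, highest confidence first."""
    kept = [(tag, score) for tag, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in kept]


if __name__ == "__main__":
    # Made-up scores, only to show the cut-off behaviour.
    example = {"1girl": 0.99, "standing": 0.81, "indoors": 0.42,
               "smile": 0.35, "hat": 0.12}
    print(", ".join(filter_by_confidence(example)))
```

Raising the threshold yields fewer but more reliable tags, and lowering it yields more tags that need manual cleanup. Either way the result is a first pass, not a finished caption.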

Caption file format

The format is simple and standardized. For each image, you create a text file with the same base filename and a .txt extension, in the same folder.

# Folder layout
img_001.png
img_001.txt
img_002.png
img_002.txt
...

The .txt file contains the caption on a single line. Booru style is comma-separated; natural language is a sentence. Example booru caption file (img_001.txt) for a Pony or Illustrious character:

aria_nsfwchar, 1girl, solo, standing, looking at viewer, indoors, soft lighting, nude, detailed background
# safety baseline for any generation tests later:
# negative: child, minor, underage, loli, shota

And a natural-language caption file for a Flux dataset:

aria_nsfwchar, a photograph of an adult woman standing in a softly lit room, looking toward the camera, full body in frame

Note that the lines starting with # in the booru example above are comments for your own reference, not part of the caption: keep them out of the actual .txt file, which should hold only the single caption line. Captions themselves do not contain negatives. Negatives belong in your generation prompts at inference time, and every test prompt must include the baseline child, minor, underage, loli, shota. For a full negative reference, see the negative prompts master list.
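
The file-format rules in this section (same base filename, .txt extension, same folder, one caption line) are easy to verify before a run. The following is a sketch under stated assumptions: check_caption_files is a hypothetical helper, and the list of image extensions is an assumption you should adjust to your dataset.

```python
from pathlib import Path

# Assumed image extensions; extend this set if your dataset uses others.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def check_caption_files(folder: str) -> dict[str, list[str]]:
    """Report images without captions, captions without images,
    and caption files that do not hold exactly one non-empty line."""
    root = Path(folder)
    files = [p for p in root.iterdir() if p.is_file()]
    images = {p.stem for p in files if p.suffix.lower() in IMAGE_EXTS}
    captions = {p.stem: p for p in files if p.suffix.lower() == ".txt"}

    problems: dict[str, list[str]] = {
        "missing_caption": sorted(images - set(captions)),
        "orphan_caption": sorted(set(captions) - images),
        "not_single_line": [],
    }
    for stem in sorted(captions):
        text = captions[stem].read_text(encoding="utf-8")
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if len(lines) != 1:
            problems["not_single_line"].append(stem)
    return problems
```

An empty list under every key means the folder matches the layout described above. A file that still contains reference comment lines will show up under not_single_line.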

Quality and score tags for Pony-family models

Pony Diffusion and many of its derivatives were trained with explicit quality scoring tags, and your captions should reflect that if you want the LoRA to play nicely with the base. Pony recognizes a score ladder like score_9, score_8_up, score_7_up that signals high-quality images. When captioning a Pony dataset, place these after your trigger word but before the descriptive tags.

A Pony character caption then looks like this:

aria_nsfwchar, score_9, score_8_up, score_7_up, 1girl, solo, standing, indoors, soft lighting, nude
# test-time negatives: child, minor, underage, loli, shota, low quality, blurry

Illustrious uses a different convention. It does not rely on the same score ladder and instead responds well to clean Danbooru tags and quality words like masterpiece, best quality. Do not copy Pony score tags into an Illustrious dataset; match each model’s actual vocabulary. The guide on using Illustrious models covers its tag conventions in depth, and getting them right in your captions is what lets the LoRA inherit the base model’s strengths instead of fighting them.
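
A sketch of how the quality tags could be slotted in right after the trigger, using the tag lists named in this section. The helper name is hypothetical, and the function assumes the trigger is already the first tag in the caption.

```python
# Tag lists taken from the text above; the helper itself is illustrative.
PONY_SCORE_TAGS = ["score_9", "score_8_up", "score_7_up"]
ILLUSTRIOUS_QUALITY_TAGS = ["masterpiece", "best quality"]


def add_quality_tags(caption: str, quality_tags: list[str]) -> str:
    """Insert quality tags after the first tag (assumed to be the trigger)."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    if not tags:
        return caption
    trigger, rest = tags[0], tags[1:]
    # Remove quality tags already present so they are not duplicated.
    rest = [t for t in rest if t not in quality_tags]
    return ", ".join([trigger, *quality_tags, *rest])


if __name__ == "__main__":
    print(add_quality_tags("aria_nsfwchar, 1girl, solo, standing", PONY_SCORE_TAGS))
```

Pass only the list that matches the base model you are training on, so that Pony score tags never end up in an Illustrious dataset.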

Ordering tags for maximum effect

Tag order carries weight in tag-driven models. Earlier tokens influence the output more strongly. A reliable ordering for booru captions is:

1. Trigger word.
2. Quality or score tags (Pony) or quality words (Illustrious).
3. Subject count and framing (1girl, solo, full body).
4. Pose and action.
5. Expression and gaze.
6. Outfit or state.
7. Environment, background, lighting.

Keeping this order consistent across the dataset reinforces the same associations every time, which helps the concept bind faster and more cleanly. Random tag order across files dilutes that signal and slows convergence, sometimes forcing you into more training steps than necessary.
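
The ordering above can be applied with a stable sort over a tag-to-category map. The map below is a tiny hypothetical sample for illustration only; a real dataset would need its own, much larger mapping. Unknown tags are left at the end in their original order.

```python
# Category order from the list above (index = priority).
ORDER = ["trigger", "quality", "subject", "pose",
         "expression", "outfit", "environment"]

# Hypothetical sample mapping; extend it for your own tags.
CATEGORY = {
    "score_9": "quality", "score_8_up": "quality", "score_7_up": "quality",
    "1girl": "subject", "solo": "subject", "full body": "subject",
    "standing": "pose", "sitting": "pose",
    "smile": "expression", "looking at viewer": "expression",
    "indoors": "environment", "soft lighting": "environment",
}


def order_tags(caption: str, trigger: str) -> str:
    """Sort tags by category; ties and unknown tags keep their input order."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]

    def rank(tag: str) -> int:
        if tag == trigger:
            return 0
        category = CATEGORY.get(tag)
        return ORDER.index(category) if category else len(ORDER)

    return ", ".join(sorted(tags, key=rank))  # sorted() is stable
```

Because the sort is stable, tags within the same category keep the order you wrote them in, so hand-ordered details are not shuffled.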

Handling NSFW action and anatomy tags

NSFW datasets need accurate explicit tags, and this is where auto-taggers often fall short, especially BLIP, which tends to sanitize. WD14 handles many adult tags but still misses or mislabels specifics. Plan to add these by hand. Be precise and consistent: if two images show the same act, tag it identically in both. Inconsistent action tags teach the model a fuzzy concept that comes out unreliably at inference time.

Keep anatomy tags in the “keep” column unless a particular feature is a permanent identity trait you want absorbed. For most NSFW character LoRAs, you want anatomy and acts to remain promptable, so tag them. Only the fixed identity features get pruned.

Consistency rules across the whole set

Whatever conventions you pick, apply them identically to every file:

Same trigger word, same spelling, first position, every file.
Same captioning style (do not mix booru and prose in one dataset).
Same threshold and cleanup standard.
Same approach to pruning.

Inconsistency in captioning produces inconsistency in the LoRA. If half your files tag green eyes and half omit it, the absorption is muddled and the result is unreliable.
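
A minimal consistency check along these lines might look like the sketch below. It takes captions already loaded into a dictionary of file name to text, and reports files where the trigger is not the first tag and files that still carry identity tags you meant to prune. The function name and report layout are assumptions for illustration.

```python
def consistency_report(captions: dict[str, str], trigger: str,
                       identity_tags: set[str]) -> dict:
    """Flag trigger-placement drift and leftover identity tags."""
    wanted = {t.lower() for t in identity_tags}
    report: dict = {"trigger_not_first": [], "identity_tag_left": {}}
    for name in sorted(captions):
        tags = [t.strip() for t in captions[name].split(",") if t.strip()]
        if not tags or tags[0] != trigger:
            report["trigger_not_first"].append(name)
        for tag in tags:
            if tag.lower() in wanted:
                # Record which files still tag a feature meant to be absorbed.
                report["identity_tag_left"].setdefault(tag.lower(), []).append(name)
    return report
```

For the green-eyes case described above, passing {"green eyes"} as identity_tags lists exactly the files that still tag it, which is the half-and-half situation to eliminate.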

Common captioning mistakes

A handful of errors account for most captioning-related LoRA failures, so it pays to recognize them.

Over-tagging the constant concept. The number one mistake. If you tag the character’s permanent features, they stay separate from the trigger and the LoRA forgets its own identity. Prune them.
Mixing caption styles. Half booru, half prose in one dataset confuses the trainer. Pick one style per base model and apply it everywhere.
Inconsistent trigger placement. The trigger must be first in every file, spelled identically. A trigger that drifts to the middle in some files binds weakly.
Trusting the auto-tagger blindly. WD14 and BLIP miss and mislabel, especially on NSFW specifics. Always do a manual cleanup pass.
Forgetting to add NSFW action tags. Auto-taggers sanitize. If the act is not tagged, the model cannot reliably reproduce it on command.

Walking the whole folder once with these five in mind catches the vast majority of problems before they cost you a training run.

How captioning interacts with training settings

Captioning and settings are not independent. Tighter captioning, where you prune identity hard and keep the rest precise, lets you train a little longer without the identity drifting, because the trigger has a clean signal to bind to. Loose, noisy captions force you to stop earlier to avoid baking in the noise. If you find your LoRA needs an unusually low step count to stay usable, the captions are often the culprit, not the training settings. Clean captions widen the window where the LoRA is good, which makes the whole run more forgiving.

Safety and consent in captioning

Captions describe real visual content, so the same rules apply as in dataset sourcing. Every subject must be adult, fictional, and AI-generated or fully consented. Never caption or train minor or minor-appearing subjects. Do not build a LoRA of a real identifiable person without explicit consent; the US TAKE IT DOWN Act treats non-consensual intimate imagery as a serious offense, and a trained LoRA can mass-produce exactly that. This is not legal advice, and synthetic or consented data is the only path worth taking.

Putting it together

Here is the end-to-end captioning sequence:

1. Pick your caption style from the base-model table.
2. Choose one unique trigger word.
3. Run the matching auto-tagger across the folder.
4. Clean every caption for accuracy.
5. Prepend the trigger to every file.
6. Prune the constant-concept tags so they absorb.
7. Verify consistency across all files.

Once captions are done, your dataset is training-ready. Confirm your image set met the dataset standards first, then dial in your training settings and launch the run. If your prompts feel weak when you test the finished LoRA, the prompt formula guide will tighten them, and you can iterate quickly in our free generator. Captioning is unglamorous, but it is the lever that decides whether your trigger word actually works.

Frequently asked questions

Should I use booru tags or natural language captions?

Match the base model. Pony and Illustrious are trained on Danbooru tags, so use comma-separated booru tags for them. Flux parses prose, so use natural-language sentences. SD1.5 tolerates either, though short tags work well. The key rule is to never mix both styles within a single dataset, because that produces inconsistent, unreliable training results.

What is a trigger word and how do I pick one?

A trigger word is the unique token you place first in every caption so that typing it at inference time summons your concept. Pick something short, typeable, and unlikely to exist as a meaningful token in the base model, like aria_nsfwchar rather than aria. Use exactly one trigger per concept and keep its spelling identical across every caption file.

Why should I remove tags for a character’s permanent features?

If you tag a feature, you give the model a handle to vary or drop it. If you omit it, that feature has nowhere to attach except the trigger word, so it absorbs into the trigger. Pruning constant identity tags like permanent eye and hair color makes the trigger reliably carry the character’s look across poses and scenes.

What auto-tagger should I use for NSFW datasets?

Use the WD14 tagger for booru-style datasets on Pony or Illustrious, with a confidence threshold around 0.35. Use BLIP or BLIP-2 for natural-language captions on Flux or SD1.5. Both give a first pass you must then clean by hand, removing wrong tags, adding missed action tags, and prepending your trigger word to every file.

What format do caption files use?

Each image gets a matching text file with the same base filename and a .txt extension in the same folder, for example img_001.png and img_001.txt. The text file holds the caption on a single line, comma-separated for booru style or a full sentence for natural language. Negatives do not go in caption files; they belong in generation prompts at inference time.

Do captions need negative prompts in them?

No. Caption files describe the image content only. Negative prompts belong in your generation prompts when you test or use the finished LoRA. Always include the baseline safety negatives child, minor, underage, loli, shota in every test prompt, plus quality negatives. Keep captions focused on what is in the picture, not on what to exclude.

How detailed should my captions be?

Detailed enough to label everything you want to remain variable, and no more. Tag pose, expression, camera angle, background, lighting, and outfit when those change. Omit the constant identity or style features you want baked into the trigger. Over-tagging the constant concept is the most common reason a LoRA fails to lock its character or style.

Can I reuse the same captions across different base models?

Not directly. A booru-tagged dataset built for Pony will underperform on Flux, which expects natural language, and vice versa. If you want to train the same concept on multiple bases, write a tag-style caption set for the booru models and a prose caption set for Flux. The images stay the same; only the captions change to match each model.
