OppaiOracle - Hugging Face Release Draft
Draft release notes / model card for the first public OppaiOracle checkpoint. Intended audience: people considering using this model for anime/illustration tagging. Tone: direct about what works, direct about what doesn't.
TL;DR
A multi-label anime tagger trained from scratch on a ~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly 1.3M tags: large in absolute terms, but only on the order of ~3% of all tags in the corpus, so this is best described as a targeted cleaning rather than a heavy one. The pass was deliberately weighted toward low-frequency tags, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. All predictions should be human-reviewed before they are trusted.
This release ships two checkpoints: V1 (the from-scratch 320×320 model) and V1.1 (a 448×448 fine-tune of V1). Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see Variants below).
Variants: which checkpoint should I use?
| Checkpoint | Native resolution | How it was produced | When to use |
|---|---|---|---|
| V1 | 320×320 | Trained from scratch at 320×320. This is the model's native resolution and the one it performs best at. | Default choice. Use when you are running inference at 320×320, when throughput matters, or when you want the checkpoint that has seen the most training. |
| V1.1 | 448×448 | A fine-tune of V1 at 448×448. Position embeddings were interpolated from the 20×20 grid to 28×28, optimizer state was reset, and training continued at the new resolution following the FixRes / DeiT III progressive-resolution recipe. | Use when you specifically want 448×448 inference for finer spatial detail (small accessories, eye details). It will not magically be better than V1 across the board; it is a resolution upgrade, not a model quality upgrade. |
Two practical notes:
- Match input resolution to the checkpoint. Feeding 448×448 images to V1, or 320×320 images to V1.1, will give worse results than matching them. The position-embedding grid is fixed at load time.
- V1 is not deprecated by V1.1. They are siblings with different operating points, not generations of the same model.
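For readers who want to adapt a checkpoint to another grid themselves, the 20×20 → 28×28 position-embedding interpolation used to produce V1.1 is, in essence, bilinear resampling of the patch grid. The sketch below is an illustrative NumPy version of that general technique, not the project's actual loading code; the function name is mine, and any class-token embedding is assumed to be handled separately.

```python
import numpy as np

def resize_pos_embed(pos, old_grid=20, new_grid=28):
    """Bilinearly resample ViT patch position embeddings.

    pos: (old_grid * old_grid, dim) array of per-patch embeddings.
    Returns a (new_grid * new_grid, dim) array.
    """
    dim = pos.shape[1]
    grid = pos.reshape(old_grid, old_grid, dim)
    # Sample positions in the old grid (align-corners style, so the four
    # corner embeddings are preserved exactly).
    xs = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(xs).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    w = (xs - lo)[:, None, None]
    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - w) + grid[hi] * w           # (new, old, dim)
    w2 = (xs - lo)[None, :, None]
    out = rows[:, lo] * (1 - w2) + rows[:, hi] * w2    # (new, new, dim)
    return out.reshape(new_grid * new_grid, dim)
```

After resampling, the FixRes / DeiT III recipe continues training at the new resolution (with optimizer state reset, as the table describes).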
How this model came to be
I started with a corpus of roughly 5.9 million images with publicly-sourced tags. Before training anything of my own, I used SmilingWolf's ViT v3 tagger to help clean the dataset. With that pipeline I:
- Removed ~300k incorrect tags from images where the public labels disagreed with the AI tagger and a human spot-check confirmed the public labels were wrong.
- Added ~1,000,000 missing tags in the same fashion: places where the AI tagger surfaced a label the public tag set had simply omitted, and human review agreed.
That is ~1.3M corrections in total, which is only on the order of ~3% of the tags in the corpus. This was a targeted pass, not a top-to-bottom relabel. Effort was deliberately concentrated on low-frequency tags, on the assumption that mislabels and missing labels do disproportionate damage in the long tail: a missing label on a tag with 800 positives in the entire dataset matters far more than a missing label on a tag with 800k positives.
I then trained a small "light" model on this cleaned dataset, primarily as a vehicle to expand the tag vocabulary by ~20,000 additional low-frequency tags that the original tag set under-represented. That expanded vocabulary is what the released model was trained against.
The released checkpoint is the main training run on the cleaned dataset with the expanded vocabulary.
What "cleaned" actually means (and what it does not)
This is the most important section of this release. The cleaning was real work, but it was not omniscient, and the dataset still has structured, category-level label noise that you will see in the model's outputs. Most of these issues are inherited directly from the publicly-sourced source datasets: they are not new noise introduced during cleaning, but pre-existing patterns that the cleaning pass touched without resolving at the category level.
The categories below are illustrative, not exhaustive. Many other tag families show similarly deep-rooted issues. Two failure modes show up across most of them, but they are not equal in size:
- Missing tags (by far the dominant problem): concepts that are clearly present in an image but were never tagged at the source. This is the single biggest source of noise in the entire dataset. See the dedicated subsection below for the empirical scale.
- Wrong tags (not uncommon, but secondary): visually similar concepts confused with each other in the source data (the bow / bowtie / ribbon / ascot / necktie cluster, color buckets, length and size buckets). These are real and plentiful, just not the dominant failure mode.
Missing tags (the dominant noise mode)
If you only remember one thing from this section, remember this: the biggest single problem in the source data is not wrong tags, it is missing tags. Wrong tags are not uncommon either, but they are dwarfed in volume by labels that should be present and simply aren't.
A rough empirical sense of the gap, from manual review:
- A typical image in this dataset arrives with roughly 28 tags from the source.
- A reasonably-tagged image, judged by what is actually visible, sticking to common in-vocabulary concepts and not reaching for rare tags, should have 50+ tags, often more.
- During spot-checks I have routinely taken images that arrived with ~40 tags up past 60 tags just by adding common, obviously-present concepts. That is without making any effort to surface rare tags; including those would push the number higher still.
So the source tag count is on the order of half of what a careful tagger would emit on the same image, and the gap is concentrated in concepts that are not subjective; they are simply omissions. The cleaning pass added ~1M missing tags back, but with the gap this large there are many millions still missing across the corpus.
The training-time consequence is that for every missing-but-present tag, the model receives no positive gradient at all for that concept on that image, only an implicit negative through the loss. This systematically biases the model toward under-predicting any tag with a high source-data omission rate, and the effect is uneven across tags: some tag families are well-tagged at the source and some are very sparsely tagged. Practically, this means low predicted scores are less informative than they look: a tag scoring below threshold may be genuinely absent, or it may be a concept the model has learned is "usually unlabeled even when present."
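To make the under-prediction mechanism concrete, here is a minimal plain-Python sketch of what a sigmoid + binary cross-entropy setup does with a missing label (BCE is shown for simplicity; the same sign structure holds for ASL-style losses). The gradient of BCE with respect to a logit is `sigmoid(z) - y`, so a present-but-unlabeled tag arrives as y = 0 and pushes that tag's logit down on every such image.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_grad_wrt_logit(z, y):
    # d/dz [ -y*log(p) - (1-y)*log(1-p) ] with p = sigmoid(z)  =>  p - y
    return sigmoid(z) - y

# The tag is clearly visible and the model is fairly confident
# (logit 2.0, i.e. p ~ 0.88), but the label is missing.
g_unlabeled = bce_grad_wrt_logit(2.0, 0.0)  # missing label: treated as a negative
g_labeled = bce_grad_wrt_logit(2.0, 1.0)    # correctly labeled positive

# Gradient descent subtracts the gradient, so a positive gradient LOWERS
# the logit: every unlabeled occurrence suppresses the tag.
assert g_unlabeled > 0   # ~0.88, logit pushed down
assert g_labeled < 0     # ~-0.12, logit pushed up
```

Since omission rates differ per tag family, this suppression is uneven, which is exactly why low scores are hard to interpret.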
Color tags
Color-named tags (eye color, hair color, general color tags) are poorly tagged at the source, and the noise that survived cleaning is dataset-wide. Every color tag in the vocabulary has some version of this problem; some are worse than others.
- Obvious failures were cleanable. A bright, unambiguous yellow mislabeled as `blue_eyes` is exactly the kind of disagreement the AI-assisted pass catches, and those got fixed. The residual noise is not the obvious-failure kind.
- The deep-rooted issue is perceptual, not technical. The category boundaries between color tags are drawn by human viewers, not by RGB codes. Different taggers carve up the spectrum differently, and any single color tag in this dataset covers a fairly wide perceptual band of that color. There is no clean RGB threshold I could have used to mechanically separate the categories, which is exactly why manual cleaning at the category level is intractable.
- Adjacent / overlapping colors leak into each other in predictable patterns. Some examples I have observed:
  - `aqua_*` tags heavily pollute both blue- and green-based tags: aqua sits perceptually between them and gets sorted into all three buckets across the corpus.
  - `yellow_*` tags overlap meaningfully with red and orange tags: warm-spectrum boundaries are inconsistent in the source data.
  - Similar patterns exist for purple/blue/pink, brown/orange/red, and black/very-dark-anything.
- Color tags are also high frequency, so the noise is spread across millions of images rather than concentrated where it could be hand-fixed.
- When I sampled live in-the-wild images and compared the model's predictions to a careful human reading, the same source-data confusion patterns were still present in the predictions. The model is faithfully reproducing the source-data label distribution, which is itself noisy along the color axis.
Hair length
The hair length tags (`very_short_hair`, `short_hair`, `medium_hair`, `long_hair`, `very_long_hair`) all have major boundary issues. `long_hair` and `very_long_hair` are the worst offenders; the source labels routinely disagree with each other across visually similar images. The model inherits this confusion.
Other "objective size" body-part tags
The same problem applies to tags that sound objective but are really continuous and judgement-dependent: `flat_chest`, `small_breasts`, `medium_breasts`, `large_breasts`, `huge_breasts`. These are inherently noisy supervision targets for a classifier: adjacent buckets are not crisply separable in the source data, and the model cannot do better than the labels it was given.
Neckwear and small accessories (bows, bowties, ribbons, ascots, neckties)
This cluster of tags has systemic issues at the source. `bow`, `bowtie`, `ribbon`, `ascot`, and `necktie` are visually similar but distinct accessories, and the public source data routinely confuses them: the same physical object will be tagged differently across images, and adjacent categories leak into each other in both directions. The cleaning pass touched obvious mistakes here but did not normalize the category boundaries; the model learns the same fuzzy boundaries the source data has.
These five are the cluster I happened to look at closely. Many other small-accessory and clothing-detail tags show the same pattern β visually similar items, fuzzy source-data boundaries, residual confusion in the model. Treat any prediction in this category as a suggestion to inspect, not a final answer.
Character-vs-concept leakage
For some tags, the data is dominated by a small number of characters. When that happens, the model tends to learn the character rather than the concept the tag was meant to represent. Without a curated golden-standard set that deliberately decouples the concept from those characters, this is very hard to fix at training time.
My estimate of cleaning quality
The 300k removals and ~1M additions were AI-assisted and then human-reviewed by me. My honest estimate is that the corrections themselves are <5% error. That is a statement about the changes I made, not about the underlying dataset; the underlying dataset still contains the structured noise described above, because cleaning was driven by AI-flagged disagreements and the AI shares the same color/length/size confusion as the source data does.
How to use this model responsibly
- Human review every output. This applies most strongly to color, hair length, and size-bucket tags. The model is a fast first pass, not an authoritative labeler.
- Treat sibling tags as a group, not a hard pick. If the model emits
blue_eyeswith high confidence, also check thepurple_eyes/aqua_eyes/black_eyesscores before you commit. - Do not use the raw output as ground-truth for downstream training without manual review. The very confusion patterns that this model can't resolve will get baked into your downstream model.
- For thresholding, prefer per-tag thresholds over a single global threshold. Different tag families have very different precision/recall behavior on this dataset.
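One cheap way to derive per-tag thresholds on a labeled validation slice is to use the same operating point the headline metrics report: precision equals recall exactly when the number of predicted positives equals the number of labeled positives, so the P=R threshold for a tag is simply its k-th highest score, where k is the tag's positive count. A hedged NumPy sketch (the function name and array layout are my own, not the release's tooling):

```python
import numpy as np

def p_eq_r_thresholds(scores, labels):
    """Per-tag precision == recall thresholds.

    scores: (N, T) predicted probabilities; labels: (N, T) binary ground truth.
    Returns a (T,) array of thresholds. Tags with no positives get 1.0
    (i.e. never fire).
    """
    _, num_tags = scores.shape
    thresholds = np.ones(num_tags)
    for j in range(num_tags):
        k = int(labels[:, j].sum())
        if k > 0:
            # Firing on the top-k scores makes (#predicted == #actual),
            # hence precision == recall for this tag.
            thresholds[j] = np.sort(scores[:, j])[-k]
    return thresholds
```

Keep in mind these thresholds inherit the missing-positive bias of whatever validation labels you compute them against.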
Performance notes
On my evaluation set this model achieves:
- The best precision-equals-recall point I have measured among comparable open anime taggers.
- A solid mAP relative to the same comparison set.
V1 headline numbers (e27/40, Phase 1, 320×320, 19,292 tags)
| Metric | Value |
|---|---|
| Macro F1 | 0.588 |
| Micro F1 | 0.659 |
| P=R threshold (macro / micro) | 0.614 / 0.670 |
| Overall val/mAP | 0.614 |
mAP broken out by tag frequency bucket:
| Frequency bucket | mAP |
|---|---|
| 500–999 (rare) | 0.589 |
| 1K–5K (mid) | 0.598 |
| 5K–10K (head) | 0.535 |
| 10K+ (very common) | 0.542 |
Note the inversion: rare/mid tags out-score head/very-common tags on mAP. This is consistent with the missing-tag bias described above: high-frequency concepts are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
I want to be honest about why I think it performs well: it is almost certainly not because of a special training regimen. The training recipe is grounded in standard ViT-from-scratch literature (DeiT / DeiT III / FixRes / ASL / AugReg) without exotic tricks. The most likely explanation is simply that the input dataset is cleaner than what most comparable taggers were trained on. If you are trying to reproduce or beat this result, I would put your effort into data curation before you put it into training-recipe tuning.
Image augmentation settings (V1 and V1.1)
For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a reduced version of V1's: narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
| Augmentation | V1 (320×320, from scratch, 40 epochs planned) | V1.1 (448×448, fine-tune of V1, 15 epochs) |
|---|---|---|
| Horizontal flip | p = 0.5 | p = 0.5 |
| Color jitter β brightness | 0.30 (p = 0.5) | 0.22 (p = 0.5) |
| Color jitter β contrast | 0.20 (p = 0.5) | 0.15 (p = 0.5) |
| Color jitter β saturation | 0.08 (p = 0.5) | 0.06 (p = 0.5) |
| Random rotation | p = 0.50, ±[2°, 8°], bicubic | p = 0.30, ±[2°, 5°], bicubic |
| Gaussian blur | p = 0.30, kernel = 3, σ ∈ [0.1, 1.5] | p = 0.15, kernel = 3, σ ∈ [0.1, 1.0] |
| Random erasing | disabled | disabled |
| Normalization (mean / std) | [0.5, 0.5, 0.5] / [0.5, 0.5, 0.5] | [0.5, 0.5, 0.5] / [0.5, 0.5, 0.5] |
| Letterbox pad color | [114, 114, 114] | [114, 114, 114] |
Notes on a few of these choices:
- Saturation is held well below brightness/contrast in both phases. Saturation is the only color-jitter axis that directly attacks color-named tag identity (`blue_eyes`, `pink_skin`, etc.); brightness and contrast are luminance-driven and largely chroma-safe. The ratio (~¼ of brightness) is taken from BYOL's asymmetric augmentation.
- Rotation is kept on at V1.1, against the plain FixRes recommendation. The original plan was to disable it at 448 for spatial precision, but with V1 under-converged it was safer to keep a residual rotational-invariance signal. The compromise was a tighter angle band (±5° vs. ±8°) and a lower fire rate (0.30 vs. 0.50).
- Gaussian blur is also kept on at V1.1 for the same reason: with an under-converged base and already-reduced color/rotation augmentation, dropping blur entirely would strip too much input variability. Frequency was halved and the σ ceiling pulled in from 1.5 to 1.0.
- No mixup, no cutmix, no RandAugment, no random erasing in either phase. The recipe is intentionally close to DeiT III's "3-Augment" regime (flip + color jitter + blur) plus a small rotation, not a heavy AugReg/RandAugment stack.
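For convenience, the V1 column of the table maps onto a torchvision pipeline roughly as follows. This is an illustrative reconstruction from the table, not the project's actual training code; in particular, torchvision's `RandomRotation` cannot express a ±[2°, 8°] minimum-magnitude band, so the sketch approximates it with a plain ±8° range, and the 320×320 letterbox with pad color 114 is assumed to happen before this pipeline runs.

```python
# Illustrative reconstruction of the V1 augmentation column (assumes torchvision).
from torchvision import transforms
from torchvision.transforms import InterpolationMode

v1_train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply(
        [transforms.ColorJitter(brightness=0.30, contrast=0.20, saturation=0.08)],
        p=0.5,
    ),
    transforms.RandomApply(
        # Approximation: the table specifies a +/-[2, 8] degree magnitude band.
        [transforms.RandomRotation(degrees=8, interpolation=InterpolationMode.BICUBIC)],
        p=0.5,
    ),
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.5))],
        p=0.3,
    ),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```

The V1.1 column follows the same shape with the reduced ranges and probabilities from the table.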
Limitations summary
| Area | Severity | Notes |
|---|---|---|
| Color tags (eye/hair/general) | High | Source-data noise survives; sibling colors leak into each other |
| Hair length (especially `long_hair`, `very_long_hair`) | High | Boundary tags inherently noisy in source |
| Size-bucket body-part tags | High | Continuous quantity discretized into noisy buckets |
| Neckwear (`bow`, `bowtie`, `ribbon`, `ascot`, `necktie`) | High | Visually similar accessories routinely confused at source; representative of a broader small-accessory pattern |
| Missing tags (concept present, no label) | Dominant | The single biggest source of noise in the corpus. Typical ~28 tags/image vs. 50+ that should be present. ~1M added back during cleaning; many millions remain. Hurts performance broadly and biases the model toward under-prediction. |
| Character-overwhelmed tags | Medium | Some tags are learned as proxies for specific characters |
| Rare / low-frequency tags | Medium | The +20k vocabulary expansion helps, but tail tags still see fewer examples |
| Anything not on the above list | Use with normal caution | The above are illustrative, not exhaustive; many tag families show similar source-data issues |
What's next (V2)
Once a refreshed 2026-vintage source dataset becomes available, I plan to start work on V2. The biggest single change between V1 and V2 will not be the model: it will be substantially more time spent on data cleaning before training begins, with a particular focus on:
- Building a curated golden-standard slice for color tags, hair-length tags, and size-bucket tags so those categories can be supervised against deliberately disambiguated examples.
- Deeper character/concept decoupling so character-overwhelmed tags learn the actual concept.
- Better measurement of "true" performance on a hand-relabeled validation slice, so the headline metrics are not silently inflated by the same missing-positive bias that affects the training data.
V1 ships with the noise it ships with. V2 is where I plan to do something about it.
Acknowledgments
- SmilingWolf for the ViT v3 tagger, which made the initial cleaning pass tractable. None of this would have been feasible without an existing strong tagger to use as a second opinion.
- The broader anime-tagger open-source community for the public tag corpora and prior model checkpoints I compared against.
License / usage
TODO: fill in license, intended-use statement, and out-of-scope use list before publishing.