How I contributed a new model to the Transformers library using Codex

Community Article Published March 30, 2026

VidEoMT inference
Inference with the VidEoMT model available in the Transformers library.

Hi, my name is Niels. I've been contributing new model architectures to the Transformers library for several years, with about 50 models in total, as my GitHub profile shows.

I started with a relatively niche model called TAPAS in 2020, as a means to better understand how the Transformers library works. Doing so also helped me learn git, PyTorch, writing tests, using code-quality tools like Ruff, and much more. After having a lot of fun with that, I started helping out on other models such as LayoutLM, a popular document AI model at the time.

As I was working on various pull requests, the Hugging Face team eventually asked me to join them, which was super cool, and something I did not expect. This is definitely unique to open source: anyone can just contribute and show what they've got, without needing a CV or formal application process. I still went through the formal application process before joining Hugging Face though ;)

After that, I ported and implemented all kinds of models, mostly computer vision and multimodal models, but also text-only and audio-only models. Some of the ones I'm most proud of include the Vision Transformer and SigLIP by Google, DETR by Meta, SegFormer by NVIDIA, as well as LayoutLMv2 and LayoutLMv3 by Microsoft. Tutorials for all of those were made in the Transformers-Tutorials repository.

More generally, I'm proud of lending a hand in turning Hugging Face from an NLP-only company into one that supports all domains of machine learning, including computer vision and multimodal AI. Today I'm no longer part of the core Transformers team, but I still contribute (or help people contribute) once in a while, just for the fun of it.

Adding models by hand

As I started doing this around 2020, every model contribution was done by hand. Tools like Cursor, Claude Code and Codex did not exist yet. The learning curve to contribute a new model was pretty steep: you not only needed to (roughly) understand the original Github implementation, but also know how to translate it to the Transformers API, write tests and make sure they pass, etc.

As this task is not the easiest, especially the first time, I created a YouTube tutorial series in which I show my entire process of contributing a new model, DINOv2 by Meta, to the Transformers library. The series has 6 videos in total, with about 4 hours of content. Some people have used it to contribute new models, such as Eduardo Pacheco, who added Grounding DINO, a popular open-vocabulary object detection model.

Youtube tutorial
My YouTube tutorial series on contributing a new model.

To give you a sense of how hard this could be: for TAPAS, my first model contribution, I had to translate a model implemented in TensorFlow 1 to PyTorch (😅). As I was working on a Windows laptop and had some issues with properly installing all the dependencies required to run the original implementation, I made heavy use of the free environment of Google Colab.

This allowed me to run scripts that performed a forward pass of both the original implementation and my Transformers implementation on the same dummy inputs. This enabled me to debug things layer by layer, eventually reaching parity in terms of the hidden states and output logits. This was a process of several weeks to months, as I was doing it in my free time.
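The layer-by-layer debugging boils down to comparing intermediate activations of both implementations on identical inputs. A minimal sketch of that idea in plain Python (using lists in place of real tensors; in practice you would compare torch tensors with something like `torch.allclose`):

```python
import math

def outputs_match(original, ported, atol=1e-5):
    """Element-wise comparison of hidden states from two implementations."""
    if len(original) != len(ported):
        return False
    return all(math.isclose(a, b, abs_tol=atol) for a, b in zip(original, ported))

# Dummy "hidden states" produced by both implementations on the same input
hidden_original = [0.1234567, -0.9876543, 0.5555555]
hidden_ported = [0.1234568, -0.9876542, 0.5555554]

print(outputs_match(hidden_original, hidden_ported))  # True: parity reached
```

You run this check per layer, starting at the embeddings, so the first layer that diverges immediately points at the bug.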

I then also ran into an issue where the original implementation used so-called scatter operations on tensors, which PyTorch did not support natively. Luckily, someone had built the torch-scatter package, which I could use to fully implement the model in PyTorch. Learning about these scatter operations and translating them from TensorFlow to PyTorch definitely wasn't the easiest thing either.
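To give a feel for what a scatter operation does, here is a pure-Python sketch of scatter-add: each value is summed into an output slot selected by its index (torch-scatter provides optimized versions of this on real tensors):

```python
def scatter_add(values, index, num_slots):
    """Sum each value into the output slot given by its index."""
    out = [0.0] * num_slots
    for v, i in zip(values, index):
        out[i] += v
    return out

# E.g. aggregating token representations per table cell:
# values at positions 0 and 2 share slot 0, so they get summed
print(scatter_add([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 2], 3))  # [4.0, 2.0, 4.0]
```

In TAPAS, operations like this aggregate token-level representations into cell-level ones, which is why they were essential to port correctly.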

Besides, working with git for the first time can also be confusing. You need to learn commands like git rebase and how it differs from git merge, etc. People often struggle with this initially; I also learned it the hard way.
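If you want to see the rebase-vs-merge difference hands-on, here is a small throwaway-repo experiment (runs entirely in a temp directory; assumes git ≥ 2.28 for `init -b`):

```shell
set -e
cd "$(mktemp -d)"
git init -q -b main repo && cd repo
git config user.email demo@example.com && git config user.name demo

echo base > file.txt && git add file.txt && git commit -qm "base"
git switch -q -c feature
echo feature >> file.txt && git commit -aqm "feature work"
git switch -q main
echo other > other.txt && git add other.txt && git commit -qm "main moves on"

# rebase replays the feature commit on top of main's new tip, yielding
# a linear history; `git merge main` would add a merge commit instead
git switch -q feature
git rebase -q main
git log --oneline
```

After the rebase, the log shows three commits in a straight line, with "feature work" sitting on top of main's latest commit.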

2026: just ask a coding agent

However, it's 2026 now, and things have changed. We have coding agents that can implement impressive things in a matter of minutes. As Karpathy also noted, this shift was fairly recent: around December 2025, coding agents crossed some threshold of coherence and really started to become reliable for writing code. Hence, I wanted to see how far I could use coding agents to automate model contributions, as my earlier attempts had failed.

I saw that the Eindhoven University of Technology had just released VidEoMT, a simple and elegant model that makes vision transformers perform video semantic, instance and panoptic segmentation. I decided to use this model as a test case.

Most of the time I use Cursor, as I prefer working within an IDE where I can see the code. I don't like working from the terminal, as it's too low-level and feels like a black box. Hence, I don't use Claude Code that much.

However, as the Codex desktop app had just come out and I saw many positive reviews on X (see e.g. the one by Hamel Husain), I decided to give it a try. As it turns out, Codex (with the GPT-5.3 Codex model at the time) is great for async work: you can ask it pretty hard things, like debugging something (or, in our case, porting a model from scratch), and half an hour or an hour later it comes back having actually done the thing.

Codex review
One of the many positive takes on the Codex desktop app.

The initial prompt

I opened the Codex desktop app and pointed it at my local clone of the Transformers library. In a separate terminal window, I first manually created a branch called add_videomt, where I ran the transformers-cli add-new-model-like command. This is a CookieCutter-style template that generates about 15 files for you to complete when adding a new model based on an existing model architecture. In this case, as VidEoMT is based on the EoMT model, I provided EoMT as the model to copy from.

Next, I started off with the following prompt:

Contribute the VidEoMT model to the Transformers library based on the original implementation at /Users/nielsrogge/Documents/python_projecten/videomt. The model definition is at /Users/nielsrogge/Documents/python_projecten/videomt/videomt/modeling.

Make sure to use uv and the existing virtual environment. Make sure to use modular, which will bootstrap modeling_videomt.py.

Only focus on implementing the modular, modeling file, and the conversion script, which can be used to convert the original pre-trained checkpoints to the HF implementation. Just focus on converting one checkpoint successfully (found at /Users/nielsrogge/Downloads/yt_2019_vit_small_52.8.pth) by converting weights + checking whether the outputs are exactly the same on the same (dummy) inputs. Usually this is done by using print statement in both the original implementation (Github repo) and your HF implementation. Do not implement tests yet, only the modeling files + conversion script.

Come back to me when you have made progress on this. Write down this task and keep track of your progress in a progress.md file at the root of the repository.  Use uv and a virtual environment.

A few notes here:

  • As can be seen, I'm heavily filesystem-pilled here. This is mostly because of Anthropic's great slides on how to build effective agents. Basically, they say: bash and the filesystem are all they need. Giving agents access to a filesystem so they can read and write files, alongside the terminal (bash) so they can run low-level Unix commands and CLIs, is often the best way to make them work efficiently. Hence, I provide the filepaths of the original implementation (which I cloned locally) and one original checkpoint (which I downloaded from Hugging Face; luckily the authors shared the original checkpoints on the Hub). Before this, I tried the cloud version of Codex, but it made it harder to provide local filepaths. Using Codex locally made things a bit easier.
  • I ask it to focus on one thing: converting only one checkpoint successfully. Without prompting it this way, it might easily get lost trying to do everything at once (converting all checkpoints in one go). Letting it focus on a single checkpoint makes the task a lot more manageable. This is also based on my prior experience: once one checkpoint converts successfully, converting the remaining ones is usually little work, as they often differ only in model size and other hyperparameters, which are simple tweaks to the model configuration.
  • I ask it to write its progress to a Markdown file called progress.md. This is simply used as a scratchpad/memory for the coding agent, and it allows the agent to continue working by referencing this file.
  • The Transformers library itself already provides a pretty good AGENTS.md file, which guides the agent in using modular as well as lint commands like make style and make check-repo. These are super helpful and meant I no longer had to specify those things manually.
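To illustrate what "converting a checkpoint" means in practice: a conversion script mostly renames the keys of the original state dict to the names the Transformers implementation expects, then loads the result and verifies the outputs. A pure-Python sketch of that renaming step (all key names here are made up for illustration, not the real VidEoMT ones):

```python
# Hypothetical mapping from original parameter-name prefixes to HF-style names
KEY_MAPPING = {
    "backbone.patch_embed.proj": "model.embeddings.patch_embeddings.projection",
    "backbone.blocks": "model.encoder.layer",
}

def rename_key(key):
    """Rewrite one original state-dict key to its HF equivalent."""
    for old, new in KEY_MAPPING.items():
        if key.startswith(old):
            return new + key[len(old):]
    return key  # keys without a mapping are kept as-is

def convert_state_dict(state_dict):
    return {rename_key(k): v for k, v in state_dict.items()}

original = {"backbone.blocks.0.attn.qkv.weight": "tensor", "head.weight": "tensor"}
print(convert_state_dict(original))
# {'model.encoder.layer.0.attn.qkv.weight': 'tensor', 'head.weight': 'tensor'}
```

Since the remaining checkpoints typically share the same key layout and differ only in configuration values (hidden size, number of layers, and so on), the same mapping converts all of them.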

After that, Codex went off, and took about 10 minutes to come back to me with a good initial draft:

Codex answer
The initial work done by Codex.

Continuing the work and avoiding context rot

To continue working on contributing a new model to Transformers, I simply used the following prompt:

Please continue working on porting VidEoMT to the Transformers library by working bottom-up. Read and write down your progress in the progress.md file.

Sometimes I used a variant like this one:

Great. Work further on successfully converting the original VidEoMT to the Transformers library based on the original implementation at /Users/nielsrogge/Documents/python_projecten/videomt. Run `make style` and `make check-repo` in the existing virtual environment and fix what they report. Report your progress in the progress.md file at the root of the repository. List the remaining to do's when you come back to me.

As noted above, the progress.md file allows state to be passed from one chat session to the next (as LLMs, and coding agents in general, are stateless). This Markdown file kept getting longer and longer, but it remained manageable over multiple chat sessions; I never had to prune it. If I had, I would simply have asked the coding agent to make it more concise before continuing, similar to how context compaction works.

I also want to highlight that the Codex desktop app handles context compaction pretty nicely, better than tools like Cursor or Claude Code in my experience. Usually, I need to restart a chat every time the coding agent hits about 60% of its context window, as the model simply becomes dumber after that, a problem known as context rot. Someone did a deep dive on how Codex's context compaction works; you can find it here. I haven't read it in detail, but at a high level, it simply uses an LLM to summarize the conversation so far, which then gets injected into the system prompt. Thanks to Codex's context compaction, I could simply continue in the same chat, as if the model had an infinite context window. Pretty nice!
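From that high-level description, the compaction idea can be sketched in a few lines (this is my own toy illustration, not Codex's actual mechanism; `summarize` stands in for the real LLM call, and `max_messages` is assumed to be at least 2):

```python
def compact(messages, max_messages, summarize):
    """Fold the oldest messages into one summary once the transcript gets long."""
    if len(messages) <= max_messages:
        return messages
    cut = max_messages - 1  # leave room for the summary message
    old, recent = messages[:-cut], messages[-cut:]
    return [{"role": "system", "content": summarize(old)}] + recent

# Toy summarizer: in reality an LLM condenses the conversation so far
toy_summarize = lambda msgs: f"Summary of {len(msgs)} earlier messages"

chat = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
compacted = compact(chat, 4, toy_summarize)
print(len(compacted), compacted[0]["content"])  # 4 Summary of 7 earlier messages
```

The progress.md file plays a similar role across sessions: it is a durable, human-readable summary that survives even when the chat context does not.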

In total, I asked Codex to continue working on the same task about 10 times.

Submitting a pull request

When I was happy with my implementation, I opened a pull request on GitHub: https://github.com/huggingface/transformers/pull/44285. I informed the Hugging Face team that it was entirely AI-generated (with me just steering it in the right direction). This was important, as the core team was getting overwhelmed by AI-slop PRs, which led them to make some drastic changes.

After that, I simply let the maintainers of the Transformers library review my pull request, and asked Codex to address their comments like so:

Can you address the PR review comments at https://github.com/huggingface/transformers/pull/44285? Make sure to use the virtual environment at the Transformers root.

Another round of reviews was addressed by simply prompting Codex like so:

Another round of reviews has taken place. Please address all comments made after "Another round, this time it's more about details. The core is good to go 🤗".

After about 4 review rounds, the PR got approved by one of the core maintainers!

Fixing Github checks

Another thing I often need to do manually is fix failing checks on GitHub. These are automated scripts that run on every pull request on CircleCI's infrastructure.

Of course, I wanted to automate this too. Hence, I created a circle-ci-diagnose-failures skill in my local .agents/skills folder. I used Anthropic's skill-creator skill to speed up its creation. Now I can simply ask my coding agent "please fix CircleCI", and off it goes! A couple of minutes later, all checks are green :)
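For reference, a skill is essentially a folder containing a SKILL.md file that tells the agent when and how to use it. A hypothetical sketch of what such a circle-ci-diagnose-failures skill could look like (the fields and steps below are illustrative, not my actual file):

```markdown
---
name: circle-ci-diagnose-failures
description: Diagnose and fix failing CircleCI checks on a Transformers pull request.
---

1. Fetch the list of failing workflows/jobs for the current PR.
2. Download the logs of each failing job and extract the first real error.
3. Reproduce the failure locally (e.g. run the failing pytest command or `make style`).
4. Fix the code, re-run the local check, and push the fix.
```

The frontmatter description is what lets the agent decide on its own that the skill applies when I say "please fix CircleCI".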

This was inspired by OpenAI's nice blog post called Using skills to accelerate OSS maintenance. The Transformers library will definitely be extended with more skills.

Converting all checkpoints

After approval by the core Transformers maintainer team, the PR got merged. However, only one Transformers-compatible checkpoint had been pushed to the Hub so far. As there were 8 checkpoints in total, I simply asked Codex to convert and push the remaining ones using the following prompt:

We recently merged the VidEoMT model (with DINOv2 as backbone) in the Transformers library (see https://github.com/huggingface/transformers/pull/44285). However, so far only one checkpoint has been converted using [convert_videomt_to_hf.py](src/transformers/models/videomt/convert_videomt_to_hf.py). You can find it on the hub here: https://huggingface.co/tue-mps/videomt-dinov2-small-ytvis2019.

Can you convert all remaining checkpoints which use a DINOv2 backbone from here:  https://github.com/tue-mps/videomt/blob/master/model_zoo/dinov2.md.

You can find a HF_token in the venv with write access to the tue-mps org in the Transformers root. Use uv. Also feel free to update the model cards for each of those.

Write down your progress in a progress.md file at the root of the repository.

It came back to me after around 20 minutes:

Codex
Codex coming back to me after converting and pushing all checkpoints to the hub, model cards included.

This worked great, nice model cards included! You can find all 8 models here: https://huggingface.co/papers/2602.17807. They are the state of the art for Transformer-based video segmentation.

I informed the authors about this new addition. The VidEoMT model was also highlighted at the top of the new Transformers v5.4 release. Pretty cool!

I have already applied the same approach to other models, including SAM-3 LiteText, Roboflow's RF-DETR and DEIMv2. Basically, software is automated now.

Conclusion

The bottom line is that coding has changed significantly: coding agents are now capable of porting entire model architectures from scratch, based on an existing GitHub implementation, to the Transformers library. This only became possible after December 2025; I remember trying this before then, and the agents failed.

Of course, we still need to steer them in the right direction, but the prompts have become smaller and smaller. The heavy lifting can now be done via a concise AGENTS.md and corresponding skills. Contributing a model is a nice example of an asynchronous task, for which a tool like the Codex desktop app shines.

Some people even claim that deep learning libraries (like Transformers) might not be needed anymore (see comments by Ross Wightman, creator of timm, and Lukas Beyer, author of ViT). I won't go that far, but it's safe to say that software has changed!
