This is a sentence-transformers model trained specifically for job title matching and similarity. It is fine-tuned from sentence-transformers/all-mpnet-base-v2 on a large dataset of job titles and their associated skills/requirements. The model maps job titles and descriptions to a 1024-dimensional dense vector space and can be used for semantic job title matching, job similarity search, and related HR/recruitment tasks.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 64, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Asym(
    (anchor-0): Dense({'in_features': 768, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
    (positive-0): Dense({'in_features': 768, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  )
)
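To make the printed architecture more concrete, here is a minimal NumPy sketch of what the Pooling and Dense stages compute: mean pooling over non-padding tokens, then a 768-to-1024 linear projection with Tanh. The function names (`mean_pool`, `dense_tanh`) and the random inputs and weights are placeholders for illustration only; the real model uses the learned MPNet token embeddings and the learned anchor/positive heads.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

def dense_tanh(pooled, weight, bias):
    """Linear projection followed by Tanh, mirroring the Dense config above."""
    return np.tanh(pooled @ weight + bias)

rng = np.random.default_rng(0)
# Placeholder inputs: 2 sequences, 5 token positions, 768-dim token vectors.
token_embeddings = rng.normal(size=(2, 5, 768))
attention_mask = np.array([[1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1]])
# Placeholder weights; the actual heads are learned during fine-tuning.
weight = rng.normal(scale=0.02, size=(768, 1024))
bias = np.zeros(1024)

sentence_embeddings = dense_tanh(mean_pool(token_embeddings, attention_mask), weight, bias)
print(sentence_embeddings.shape)  # (2, 1024)
```

Because of the Tanh activation, every component of the resulting embedding lies in the range [-1, 1].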
First install the required packages:
pip install -U sentence-transformers
Then you can load and use the model with the following code:
import torch
import numpy as np
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import batch_to_device, cos_sim
# Load the model
model = SentenceTransformer("TechWolf/JobBERT-v2")

def encode_batch(jobbert_model, texts):
    features = jobbert_model.tokenize(texts)
    features = batch_to_device(features, jobbert_model.device)
    features["text_keys"] = ["anchor"]
    with torch.no_grad():
        out_features = jobbert_model.forward(features)
    return out_features["sentence_embedding"].cpu().numpy()

def encode(jobbert_model, texts, batch_size: int = 8):
    # Sort texts by length and keep track of original indices
    sorted_indices = np.argsort([len(text) for text in texts])
    sorted_texts = [texts[i] for i in sorted_indices]
    embeddings = []
    # Encode in batches
    for i in tqdm(range(0, len(sorted_texts), batch_size)):
        batch = sorted_texts[i:i+batch_size]
        embeddings.append(encode_batch(jobbert_model, batch))
    # Concatenate embeddings and reorder to original indices
    sorted_embeddings = np.concatenate(embeddings)
    original_order = np.argsort(sorted_indices)
    return sorted_embeddings[original_order]

# Example usage
job_titles = [
    'Software Engineer',
    'Senior Software Developer',
    'Product Manager',
    'Data Scientist'
]
# Get embeddings
embeddings = encode(model, job_titles)
# Calculate cosine similarity matrix
similarities = cos_sim(embeddings, embeddings)
print(similarities)
The output will be a similarity matrix where each value represents the cosine similarity between two job titles:
tensor([[1.0000, 0.8723, 0.4821, 0.5447],
        [0.8723, 1.0000, 0.4822, 0.5019],
        [0.4821, 0.4822, 1.0000, 0.4328],
        [0.5447, 0.5019, 0.4328, 1.0000]])
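In this example, 'Software Engineer' and 'Senior Software Developer' are the closest pair (0.8723), while 'Product Manager' and 'Data Scientist' score between roughly 0.43 and 0.55 against every other title. The diagonal is 1.0000 because each title is compared with itself.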
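As a small illustration of how such a matrix can be used for matching, the sketch below looks up the closest other title for each row. It reuses the values printed above as a plain NumPy array, so it runs without downloading the model; `nearest_titles` is a hypothetical helper, not part of sentence-transformers.

```python
import numpy as np

job_titles = ["Software Engineer", "Senior Software Developer", "Product Manager", "Data Scientist"]
# Values copied from the similarity matrix printed above.
similarities = np.array([
    [1.0000, 0.8723, 0.4821, 0.5447],
    [0.8723, 1.0000, 0.4822, 0.5019],
    [0.4821, 0.4822, 1.0000, 0.4328],
    [0.5447, 0.5019, 0.4328, 1.0000],
])

def nearest_titles(sim, titles):
    """Return {title: (closest other title, score)} from a square similarity matrix."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # exclude each title's similarity with itself
    best = masked.argmax(axis=1)       # index of the highest remaining score per row
    return {titles[i]: (titles[j], float(sim[i, j])) for i, j in enumerate(best)}

for title, (match, score) in nearest_titles(similarities, job_titles).items():
    print(f"{title} -> {match} ({score:.4f})")
```

With these values, the two software titles select each other, which matches the high 0.8723 score in the matrix.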
Please cite this paper when using JobBERT-v2:
@article{01K47W55SG7ZRKFG431ESRXC35,
abstract = {{Labor market analysis relies on extracting insights from job advertisements, which provide valuable yet unstructured information on job titles and corresponding skill requirements. While state-of-the-art methods for skill extraction achieve strong performance, they depend on large language models (LLMs), which are computationally expensive and slow. In this paper, we propose ConTeXT-match, a novel contrastive learning approach with token-level attention that is well-suited for the extreme multi-label classification task of skill classification. ConTeXT-match significantly improves skill extraction efficiency and performance, achieving state-of-the-art results with a lightweight bi-encoder model. To support robust evaluation, we introduce Skill-XL a new benchmark with exhaustive, sentence-level skill annotations that explicitly address the redundancy in the large label space. Finally, we present JobBERT V2, an improved job title normalization model that leverages extracted skills to produce high-quality job title representations. Experiments demonstrate that our models are efficient, accurate, and scalable, making them ideal for large-scale, real-time labor market analysis.}},
author = {{Decorte, Jens-Joris and Van Hautte, Jeroen and Develder, Chris and Demeester, Thomas}},
issn = {{2169-3536}},
journal = {{IEEE ACCESS}},
keywords = {{Taxonomy,Contrastive learning,Training,Annotations,Benchmark testing,Training data,Large language models,Computational efficiency,Accuracy,Terminology,Labor market analysis,text encoders,skill extraction,job title normalization}},
language = {{eng}},
pages = {{133596--133608}},
title = {{Efficient text encoders for labor market analysis}},
url = {{http://doi.org/10.1109/ACCESS.2025.3589147}},
volume = {{13}},
year = {{2025}},
}
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021},
eprint={2101.06983},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Base model: sentence-transformers/all-mpnet-base-v2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("TechWolf/JobBERT-v2")

sentences = [
    "Program Coordinator RN",
    "discuss the medical history of the healthcare user, evidence-based approach in general practice, apply various lifting techniques, establish daily priorities, manage time, demonstrate disciplinary expertise, tolerate sitting for long periods, think critically, provide professional care in nursing, attend meetings, represent union members, nursing science, manage a multidisciplinary team involved in patient care, implement nursing care, customer service, work under supervision in care, keep up-to-date with training subjects, evidence-based nursing care, operate lifting equipment, follow code of ethics for biomedical practices, coordinate care, provide learning support in healthcare",
    "provide written content, prepare visual data, design computer network, deliver visual presentation of data, communication, operate relational database management system, ICT communications protocols, document management, use threading techniques, search engines, computer science, analyse network bandwidth requirements, analyse network configuration and performance, develop architectural plans, conduct ICT code review, hardware architectures, computer engineering, video-games functionalities, conduct web searches, use databases, use online tools to collaborate",
    "nursing science, administer appointments, administrative tasks in a medical environment, intravenous infusion, plan nursing care, prepare intravenous packs, work with nursing staff, supervise nursing staff, clinical perfusion"
]

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]