InvOntDef-DeBERTa Model Card
InvOntDef-DeBERTa is a transformer encoder model pretrained for the domain of invasion biology. In addition to MLM pretraining on ca. 35,000 scientific abstracts from the domain of invasion biology, we pretrain it as an embedding model on definitions of domain-relevant concepts. This dataset of concepts with definitions was created from the INBIO and ENVO ontologies and augmented with an LLM by generating four additional definitions for each concept.
Model Details
Model Description
- Developed by: CLAUSE group at Bielefeld University
- Model type: DeBERTa-base
- Languages: Mostly English
- Finetuned from model: microsoft/deberta-base
Model Sources
- Repository: github.com/inas-argumentation/Ontology_Pretraining
- Paper: aclanthology.org/2025.findings-emnlp.1238/
How to Get Started with the Model
A minimal example of how to process texts with this model:
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and pretrained encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("CLAUSE-Bielefeld/InvOntDef-DeBERTa")
model = AutoModel.from_pretrained("CLAUSE-Bielefeld/InvOntDef-DeBERTa")

# Tokenize the input text and run it through the encoder
text = "Your text to be embedded."
batch = tokenizer([text], return_tensors="pt")
model_output = model(**batch)
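The model returns token-level hidden states. This card does not specify the pooling strategy used during training, but a common way to obtain a single text embedding is mean pooling over the last hidden state; the following is a minimal sketch under that assumption, reusing model and batch from the snippet above.

import torch

with torch.no_grad():
    model_output = model(**batch)

# Mean pooling over token embeddings, ignoring padding tokens (assumed pooling strategy)
mask = batch["attention_mask"].unsqueeze(-1).float()
embedding = (model_output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_size)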
Training Details
This model was trained on a dataset of about 35,000 scientific abstracts from the domain of invasion biology. Additionally, we used a dataset of 5,197 unique concepts extracted from the ENVO and INBIO ontologies, each accompanied by one ontology-derived and four LLM-generated definitions. We used a triplet loss to encourage definitions of the same concept to be placed close together in the embedding space, and to also place related concepts (i.e., concepts linked in the ontology) in proximity; a sketch of such an objective is shown below. The dataset and exact training procedure can be found in our GitHub repository.
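For illustration only, here is a minimal PyTorch sketch of a triplet objective in this spirit. The function name, margin value, and distance metric are assumptions; the actual loss and hyperparameters used for training are documented in the GitHub repository.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # anchor, positive: embeddings of two definitions of the same concept
    # negative: embedding of a definition of an unrelated concept
    # margin is a hypothetical value, not the one used in training
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()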
Evaluation
| Model | INAS Clf: Macro F1 | INAS Clf: Micro F1 | INAS Span: Token F1 | INAS Span: Span F1 | EICAT Clf: Macro F1 | EICAT Clf: Micro F1 | EICAT Evidence: NDCG | Avg. |
|---|---|---|---|---|---|---|---|---|
| DeBERTa base | 0.674 | 0.745 | 0.406 | 0.218 | 0.392 | 0.416 | 0.505 | 0.483 |
| InvOntDef-DeBERTa | 0.750 | 0.812 | 0.414 | 0.242 | 0.504 | 0.518 | 0.530 | 0.538 |
| InvDef-DeBERTa | 0.740 | 0.805 | 0.415 | 0.220 | 0.469 | 0.489 | 0.511 | 0.520 |
We also trained the InvDef-DeBERTa model, using a purely LLM-based pipeline, to test whether the ontology-derived information can be replaced.
Citation
BibTeX:
@inproceedings{brinner-etal-2025-enhancing,
title = "Enhancing Domain-Specific Encoder Models with {LLM}-Generated Data: How to Leverage Ontologies, and How to Do Without Them",
author = "Brinner, Marc Felix and
Al Mustafa, Tarek and
Zarrie{\ss}, Sina",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.1238/",
doi = "10.18653/v1/2025.findings-emnlp.1238",
pages = "22740--22754",
ISBN = "979-8-89176-335-7"
}