---
language: en
license: mit
datasets:
- arxmliv
- math-stackexchange
---

# MathBERTa model

Pretrained model on English language and LaTeX using a masked language modeling
(MLM) objective. It was introduced in [this paper][1] and first released in
[this repository][2]. This model is case-sensitive: it makes a difference
between english and English.

[1]: http://ceur-ws.org/Vol-3180/paper-06.pdf
[2]: https://github.com/witiko/scm-at-arqmath3

## Model description

MathBERTa is [the RoBERTa base transformer model][3] whose [tokenizer has been
extended with LaTeX math symbols][7] and which has been [fine-tuned on a large
corpus of English mathematical texts][8].

Like RoBERTa, MathBERTa has been fine-tuned with the masked language modeling
(MLM) objective: taking a sentence, the model randomly masks 15% of the words
and math symbols in the input, then runs the entire masked sentence through the
model, which has to predict the masked words and symbols. This way, the model
learns an inner representation of the English language and LaTeX that can then
be used to extract features useful for downstream tasks.

[3]: https://huggingface.co/roberta-base
[7]: https://github.com/Witiko/scm-at-arqmath3/blob/main/02-train-tokenizers.ipynb
[8]: https://github.com/witiko/scm-at-arqmath3/blob/main/03-finetune-roberta.ipynb

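As a quick illustration of the extended tokenizer (a minimal sketch, not part of the original card; the exact tokens it prints depend on the released vocabulary), you can inspect how LaTeX math between the `[MATH]` and `[/MATH]` markers is tokenized:

```python
from transformers import AutoTokenizer

# Load the MathBERTa tokenizer, whose vocabulary was extended with LaTeX math symbols.
tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')

# Thanks to the added tokens, LaTeX commands such as \theta or \sin can map to
# dedicated vocabulary entries instead of being split into generic sub-word pieces.
print(tokenizer.tokenize(r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is zero."))
```
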
## Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly
intended to be fine-tuned on a downstream task. See the [model
hub][4] to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use
the whole sentence (potentially masked) to make decisions, such as sequence
classification, token classification, or question answering. For tasks such as
text generation you should look at models like GPT-2.

[4]: https://huggingface.co/models?filter=roberta

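For example, loading MathBERTa with a freshly initialized sequence-classification head looks like this (a minimal sketch assuming a two-class task; the head still has to be trained on your own labelled data):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
# The pretrained encoder is reused; the classification head on top is randomly
# initialized and must be fine-tuned on a downstream dataset.
model = AutoModelForSequenceClassification.from_pretrained('witiko/mathberta', num_labels=2)
```
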
### How to use

*Due to the large number of added LaTeX tokens, MathBERTa is affected by [a
software bug in the 🤗 Transformers library][9] that causes it to load for tens
of minutes. The bug was [fixed in version 4.20.0][10].*

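To avoid the slow loading, you can check that a fixed version of the library is installed (a small sketch, not part of the original card):

```python
import transformers
from packaging import version

# MathBERTa loads very slowly with transformers < 4.20.0 because of the added LaTeX tokens.
assert version.parse(transformers.__version__) >= version.parse("4.20.0"), \
    "Please upgrade 🤗 Transformers to version 4.20.0 or later."
```
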
You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='witiko/mathberta')
>>> unmasker(r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is <mask>.")

[{'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is zero.',
  'score': 0.23291291296482086,
  'token': 4276,
  'token_str': ' zero'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is 0.',
  'score': 0.11734672635793686,
  'token': 321,
  'token_str': ' 0'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is real.',
  'score': 0.0793389230966568,
  'token': 588,
  'token_str': ' real'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is 1.',
  'score': 0.0753420740365982,
  'token': 112,
  'token_str': ' 1'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is even.',
  'score': 0.06487451493740082,
  'token': 190,
  'token_str': ' even'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModel.from_pretrained('witiko/mathberta')

# Math is delimited with [MATH] ... [/MATH] markers.
text = r"Replace me by any text and [MATH] \text{math} [/MATH] you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.last_hidden_state holds the contextual token embeddings
```

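If you need one vector per text rather than per-token features (for example for similarity search), a common approach, shown here as a minimal sketch that is not part of the original card, is to mean-pool the token embeddings while masking out padding:

```python
import torch

# Continuing from the snippet above; no gradients are needed for feature extraction.
with torch.no_grad():
    output = model(**encoded_input)

token_embeddings = output.last_hidden_state                   # (batch, tokens, hidden)
mask = encoded_input['attention_mask'].unsqueeze(-1).float()  # (batch, tokens, 1)
text_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```
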
## Training data

Our model was fine-tuned on two datasets:

- [ArXMLiv 2020][5], a dataset consisting of 1,581,037 arXiv documents.
- [Math StackExchange][6], a dataset of 2,466,080 questions and answers.

Together, these datasets weigh 52 GB of text and LaTeX.

## Intrinsic evaluation results

Our model achieves the following intrinsic evaluation results:

![Intrinsic evaluation results of MathBERTa][11]

[5]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
[6]: https://www.cs.rit.edu/~dprl/ARQMath/arqmath-resources.html
[9]: https://github.com/huggingface/transformers/issues/16936
[10]: https://github.com/huggingface/transformers/pull/17119
[11]: https://huggingface.co/witiko/mathberta/resolve/main/learning-curves.png

## Citing

### Text

Vít Novotný and Michal Štefánik. “Combining Sparse and Dense Information
Retrieval. Soft Vector Space Model and MathBERTa at ARQMath-3”.
In: *Proceedings of the Working Notes of CLEF 2022*. To appear.
CEUR-WS, 2022.

### Bib(La)TeX

``` bib
@inproceedings{novotny2022combining,
  booktitle = {Proceedings of the Working Notes of {CLEF} 2022},
  editor = {Faggioli, Guglielmo and Ferro, Nicola and Hanbury, Allan and Potthast, Martin},
  issn = {1613-0073},
  title = {Combining Sparse and Dense Information Retrieval},
  subtitle = {Soft Vector Space Model and MathBERTa at ARQMath-3 Task 1 (Answer Retrieval)},
  author = {Novotný, Vít and Štefánik, Michal},
  publisher = {{CEUR-WS}},
  year = {2022},
  pages = {104--118},
  numpages = {15},
  url = {http://ceur-ws.org/Vol-3180/paper-06.pdf},
  urldate = {2022-08-12},
}
```