BoSentencePiece - Tibetan SentencePiece Tokenizer

A SentencePiece tokenizer trained on Tibetan text using the Unigram language model algorithm.

Model Details

Parameter            Value
Model Type           Unigram
Vocabulary Size      20,000
Character Coverage   100%
Alphabet Size        232 characters
Max Token Length     16
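
The figures above can be inspected directly from the trained model. The snippet below is a minimal sketch, assuming sentencepiece.model has been downloaded locally from this repository.

import sentencepiece as spm

# Assumes sentencepiece.model is in the working directory
sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

print(sp.get_piece_size())   # vocabulary size, expected to be 20000
print(sp.id_to_piece(5))     # look up an individual piece by its ID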

Training Data

  • Total Sentences: 2,302,054
  • Total Characters: 228,818,663
  • Skipped Sentences (exceeded max length): 1,318

Special Tokens

Token   ID   Description
<unk>   0    Unknown token
<s>     1    Beginning of sequence
</s>    2    End of sequence
<pad>   -1   Padding token (disabled; -1 means no padding piece is defined in the model)
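
The IDs above correspond to the processor's built-in accessors. A short sketch, again assuming a local copy of sentencepiece.model:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

# Built-in accessors for the special tokens listed above
print(sp.unk_id())   # 0
print(sp.bos_id())   # 1
print(sp.eos_id())   # 2
print(sp.pad_id())   # -1 (no padding piece defined)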

Normalization

  • Normalizer: NMT NFKC
  • Add Dummy Prefix: Yes
  • Remove Extra Whitespaces: Yes
  • Escape Whitespaces: Yes
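
The effect of these settings is visible in the produced pieces. A minimal sketch, assuming a local sentencepiece.model:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

# Because a dummy prefix is added and whitespaces are escaped, word-initial
# pieces carry the "▁" (U+2581) marker, and decoding restores the original text.
pieces = sp.encode_as_pieces("བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།")
print(pieces)
print(sp.decode_pieces(pieces))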

Tokenization Settings

  • Split by Unicode script: ✅
  • Split by number: ✅
  • Split by whitespace: ✅
  • Split digits: ❌
  • Byte fallback: ❌
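
Because byte fallback is disabled, characters outside the learned alphabet cannot be decomposed into bytes and are mapped to <unk> instead. A hedged sketch (the emoji below is an arbitrary example of a character presumably not in the 232-character alphabet):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

# With byte fallback off, out-of-alphabet characters become <unk> (ID 0)
ids = sp.encode_as_ids("🙂")
print(ids)
print(sp.unk_id() in ids)   # expected True if the character is not covered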

Usage

With Transformers

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenPecha/BoSentencePiece")

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode
encoded = tokenizer.encode(text)
print(encoded)

# Decode
decoded = tokenizer.decode(encoded)
print(decoded)

With SentencePiece Directly

import sentencepiece as spm

# Load the SentencePiece model file downloaded from this repository
sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"
tokens = sp.encode_as_pieces(text)
print(tokens)
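
The same processor also handles ID round-trips; a short continuation of the snippet above:

# Encode to IDs and decode back to text
ids = sp.encode_as_ids(text)
print(ids)

decoded = sp.decode_ids(ids)
print(decoded)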

Files

  • sentencepiece.model - SentencePiece binary model
  • sentencepiece.vocab - Vocabulary file with scores
  • tokenizer.json - Hugging Face tokenizer format
  • tokenizer_config.json - Tokenizer configuration
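
tokenizer.json can also be loaded on its own with the tokenizers library. A minimal sketch, assuming the file has been downloaded from this repository:

from tokenizers import Tokenizer

# Load the Hugging Face tokenizer format directly
tok = Tokenizer.from_file("tokenizer.json")

encoding = tok.encode("བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།")
print(encoding.tokens)
print(encoding.ids)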

License

Apache 2.0
