A newer version of this model is available: tokinasin/ruri-v3-70m-code-v0.2

This model is a fine-tuned version of cl-nagoya/ruri-v3-70m for retrieving semantically segmented code snippets using natural language queries.

Supported Natural Languages

Japanese, English

Supported Programming Languages

C, CSharp, Cpp, Go, Java, JavaScript, PHP, Python, Ruby, Rust, SQL, Bash, Swift, TypeScript

Example Usage

For more details, please refer to the original model: https://huggingface.co/cl-nagoya/ruri-v3-70m

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("tokinasin/ruri-v3-70m-code-v0.1")

code_snippets = [
    """def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a""",

    """import numpy as np
def normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm != 0 else v""",

    """def is_prime(num):
    if num < 2:
        return False
    for i in range(2, int(num**0.5) + 1):
        if num % i == 0:
            return False
    return True"""
]

descriptions = [
    "フィボナッチ数列を計算する関数",      # "A function that computes the Fibonacci sequence"
    "ベクトルを正規化するユーティリティ",  # "A utility that normalizes a vector"
    "整数が素数かどうか判定する関数"       # "A function that checks whether an integer is prime"
]

# Encode
code_embeddings = model.encode(code_snippets, normalize_embeddings=True)
desc_embeddings = model.encode(descriptions, normalize_embeddings=True)

# Calculate similarities
similarities = util.cos_sim(desc_embeddings, code_embeddings)

# Print results
print("\nSimilarity Matrix (Description → Code):")
print("="*60)
print(f"{'Description':<40} | Code #1  Code #2  Code #3")
print("-"*60)

for i, desc in enumerate(descriptions):
    scores = [f"{similarities[i][j]:.4f}" for j in range(3)]
    best_match = similarities[i].argmax().item() + 1
    print(f"{desc[:37]:<40} | {scores[0]}  {scores[1]}  {scores[2]}  → Best: #{best_match}")

print("="*60)

# Check accuracy
correct = 0
for i in range(len(descriptions)):
    if similarities[i].argmax().item() == i:
        correct += 1

accuracy = correct / len(descriptions)
print(f"\nAccuracy@1: {accuracy:.2%} ({correct}/{len(descriptions)})")
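
The ranking step in the example above comes down to cosine similarity followed by picking the highest score. As a minimal sketch of that step only, the snippet below re-implements it in plain Python on made-up 3-dimensional vectors, so it runs without downloading the model. The helper names cosine_similarity and top_k and the toy vectors are illustrative assumptions; they are not part of this model or of sentence_transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for model.encode(...) output (values are made up).
toy_code_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
]
toy_query_embedding = [0.1, 0.8, 0.2]

ranking = top_k(toy_query_embedding, toy_code_embeddings, k=2)
for index, score in ranking:
    print(f"snippet #{index + 1}: {score:.4f}")
```

With real embeddings, the same idea applies to the rows of the similarity matrix: sorting a row instead of taking only its argmax gives a top-k list of candidate snippets for each query.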

Acknowledgements

This model is based on cl-nagoya/ruri-v3-70m. Many thanks to those who developed the excellent original model.
