| Tasks         | Version | Filter | n-shot | Metric   |   | Value  |   | Stderr |
|---------------|--------:|--------|-------:|----------|---|-------:|---|-------:|
| arc_challenge |       1 | none   |      0 | acc      | ↑ | 0.2073 | ± | 0.0118 |
|               |         | none   |      0 | acc_norm | ↑ | 0.2381 | ± | 0.0124 |
| arc_easy      |       1 | none   |      0 | acc      | ↑ | 0.4882 | ± | 0.0103 |
|               |         | none   |      0 | acc_norm | ↑ | 0.4234 | ± | 0.0101 |
| hellaswag     |       1 | none   |      0 | acc      | ↑ | 0.2880 | ± | 0.0045 |
|               |         | none   |      0 | acc_norm | ↑ | 0.3038 | ± | 0.0046 |
| piqa          |       1 | none   |      0 | acc      | ↑ | 0.6153 | ± | 0.0114 |
|               |         | none   |      0 | acc_norm | ↑ | 0.5968 | ± | 0.0114 |
| winogrande    |       1 | none   |      0 | acc      | ↑ | 0.5107 | ± | 0.0140 |

=== ArithMark-2.0 checkpoint-150000 ===
Average: 0.2540 (635/2500)
Random chance: 0.2500
By difficulty:
easy: 0.2528 (316/1250)
hard: 0.2320 (116/500)
medium: 0.2707 (203/750)
By operator_count:
1: 0.2528 (316/1250)
2: 0.2707 (203/750)
3: 0.2320 (116/500)
By topic:
addition: 0.1543 (83/538)
division: 0.5000 (65/130)
mixed_three_ops: 0.2479 (60/242)
mixed_two_ops: 0.2456 (97/395)
multiplication: 0.4375 (63/144)
parentheses_three_ops: 0.2171 (56/258)
parentheses_two_ops: 0.2986 (106/355)
subtraction: 0.2397 (105/438)
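
The overall ArithMark accuracy (0.2540) sits very close to the reported random-chance rate (0.2500), while some topics (addition, division, multiplication) sit well away from it. A rough way to judge which gaps exceed sampling noise is a binomial standard error and a z-score against chance. The sketch below is illustrative only: the helper names `binomial_stderr` and `z_vs_chance` are hypothetical and not part of the ArithMark tooling, the counts are copied from the log above, and the normal approximation is an assumption that becomes coarse for the smaller topic buckets.

```python
import math


def binomial_stderr(correct: int, total: int) -> float:
    """Standard error of an observed accuracy (normal approximation)."""
    p = correct / total
    return math.sqrt(p * (1 - p) / total)


def z_vs_chance(correct: int, total: int, chance: float = 0.25) -> float:
    """z-score of the observed accuracy against the chance rate.

    Uses the standard error under the null hypothesis (accuracy == chance).
    """
    p = correct / total
    se_null = math.sqrt(chance * (1 - chance) / total)
    return (p - chance) / se_null


# (correct, total) pairs copied from the log above.
results = {
    "overall": (635, 2500),
    "addition": (83, 538),
    "division": (65, 130),
    "multiplication": (63, 144),
}

for name, (correct, total) in results.items():
    acc = correct / total
    se = binomial_stderr(correct, total)
    z = z_vs_chance(correct, total)
    print(f"{name}: acc={acc:.4f} stderr={se:.4f} z_vs_chance={z:+.2f}")
```

A |z| below roughly 2 means the score is hard to distinguish from guessing at this sample size. Per-topic z-scores are not corrected for multiple comparisons, so they are best read as a screening aid rather than a formal test.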