This model was trained to classify which patterns a subject model was trained on, based on neuron activation signatures.
The model predicts which of the following 14 patterns the subject model was trained on:

- palindrome
- sorted_ascending
- sorted_descending
- alternating
- contains_abc
- starts_with
- ends_with
- no_repeats
- has_majority
- increasing_pairs
- decreasing_pairs
- vowel_consonant
- first_last_match
- mountain_pattern

When a model was trained on a pattern, what % of the time does the classifier detect it:

| Pattern | Recall (Detection Rate) |
|---|---|
| palindrome | 28.6% |
| sorted_ascending | 15.5% |
| sorted_descending | 32.6% |
| alternating | 30.2% |
| contains_abc | 41.6% |
| starts_with | 8.6% |
| ends_with | 19.2% |
| no_repeats | 9.8% |
| has_majority | 8.3% |
| increasing_pairs | 17.4% |
| decreasing_pairs | 3.8% |
| vowel_consonant | 0.0% |
| first_last_match | 22.5% |
| mountain_pattern | 8.8% |
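
To make the recall column concrete, here is a minimal sketch of how per-pattern recall can be computed for a multi-label classifier: for each pattern, take the subject models that were actually trained on it and count the fraction for which the classifier also predicted it. The function name `per_pattern_recall` and the toy labels are illustrative assumptions, not the evaluation code or data behind the table above.

```python
from typing import Dict, List, Optional, Set


def per_pattern_recall(
    true_sets: List[Set[str]],
    pred_sets: List[Set[str]],
    patterns: List[str],
) -> Dict[str, Optional[float]]:
    """Recall per pattern: detected positives / all true positives."""
    recall: Dict[str, Optional[float]] = {}
    for pattern in patterns:
        # Indices of subject models that were really trained on this pattern.
        positives = [i for i, truth in enumerate(true_sets) if pattern in truth]
        if not positives:
            # Recall is undefined when the pattern never occurs in the labels.
            recall[pattern] = None
            continue
        hits = sum(1 for i in positives if pattern in pred_sets[i])
        recall[pattern] = hits / len(positives)
    return recall


# Toy data (hypothetical): four subject models and three of the pattern names.
patterns = ["palindrome", "contains_abc", "vowel_consonant"]
true_sets = [
    {"palindrome"},
    {"palindrome", "contains_abc"},
    {"vowel_consonant"},
    {"contains_abc"},
]
pred_sets = [
    {"palindrome"},
    {"contains_abc"},
    set(),
    {"contains_abc", "palindrome"},
]

recall = per_pattern_recall(true_sets, pred_sets, patterns)
for name, value in recall.items():
    print(f"{name}: {value:.1%}" if value is not None else f"{name}: n/a")
```

Note that recall ignores false positives: in the toy data, the spurious `palindrome` prediction for the last model does not lower any score. Precision would be needed to capture that.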
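```python
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint file from the Hugging Face Hub (cached locally after the first call)
checkpoint_path = hf_hub_download(
    repo_id='maximuspowers/muat-separate-pca-10-classifier',
    filename='best_model.pt',
)
# Load onto the CPU so the snippet also works on machines without a GPU
checkpoint = torch.load(checkpoint_path, map_location='cpu')
```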
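
The layout of `best_model.pt` is not documented here, so it may help to inspect the loaded object before trying to rebuild the classifier. The helper below is a small sketch for that purpose; `summarize_checkpoint` is a hypothetical name, and the keys in the example dictionary are placeholders, not the real contents of this checkpoint.

```python
from typing import Any, Dict


def summarize_checkpoint(checkpoint: Any) -> Dict[str, str]:
    """Describe the top-level entries of a loaded checkpoint object."""
    if not isinstance(checkpoint, dict):
        # Some checkpoints store a whole model object rather than a dict.
        return {"<root>": type(checkpoint).__name__}
    summary: Dict[str, str] = {}
    for key, value in checkpoint.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            # Tensors and arrays expose a shape attribute.
            summary[str(key)] = f"{type(value).__name__} with shape {tuple(shape)}"
        elif isinstance(value, dict):
            summary[str(key)] = f"dict with {len(value)} entries"
        else:
            summary[str(key)] = type(value).__name__
    return summary


# Stand-in object with made-up keys; pass the real `checkpoint` instead.
example = {"model_state_dict": {"layer.weight": [0.0]}, "epoch": 3}
summary = summarize_checkpoint(example)
for key, description in summary.items():
    print(f"{key}: {description}")
```

Calling `summarize_checkpoint(checkpoint)` on the object loaded above should show whether the file holds a state dict, training metadata, or both.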