---
license: apache-2.0
datasets:
  - Dahoas/synthetic-instruct-gptj-pairwise
language:
  - en
base_model:
  - openai-community/gpt2
pipeline_tag: text-generation
library_name: transformers
tags:
  - gpt2
  - rlhf
  - reinforcement-learning
  - ppo
  - reward-model
  - instruction-tuning
model-index:
  - name: sft_full_final
    results: []
  - name: reward_model_final
    results: []
  - name: ppo_aligned_final
    results: []
---

# RLHF-Aligned GPT-2 Pipeline Models

This repository contains the three key models from an end-to-end, from-scratch implementation of the **Reinforcement Learning from Human Feedback (RLHF)** pipeline. The project's goal was to align a base `gpt2` model with human preferences, following the same three-stage process popularized by models like ChatGPT.

The complete training code, notebooks, and in-depth analysis can be found in the primary GitHub repository:
[**nabeelshan78/reinforcement-learning-human-feedback-scratch**](https://github.com/nabeelshan78/reinforcement-learning-human-feedback-scratch)

## 🎯 Models in this Repository

This repository hosts the final checkpoint for each stage of the RLHF pipeline. You can load each model independently using the `subfolder` argument.

1. `sft_full_final` - **Supervised Fine-Tuned (SFT) Model**: The base `gpt2` model after being fine-tuned on an instruction dataset (`Dahoas/synthetic-instruct-gptj-pairwise`) to learn a helpful response style.

2. `reward_model_final` - **Reward Model (RM)**: A `gpt2`-based model trained to predict human preferences. It takes a prompt and a response and outputs a scalar *reward score*, indicating how "good" the response is. This model acts as an automated human preference judge.

3. `ppo_aligned_final` - **PPO-Aligned Model**: The final, alignment-tuned model. This is the SFT model further trained using Proximal Policy Optimization (PPO) and the Reward Model to generate responses that maximize the reward score. **This is the main model intended for generation tasks.**

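Each checkpoint maps to a different `transformers` auto class: the SFT and PPO-aligned checkpoints are causal language models, while the reward model is a sequence classifier that emits a single scalar score. A minimal loading sketch, assuming the SFT checkpoint is saved in the same causal-LM format as the PPO checkpoint (the training code in the linked repository is the authority here):

```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer

repo_id = "nabeelshan/rlhf-gpt2-pipeline"

# SFT and PPO-aligned checkpoints: causal language models for text generation.
# (Assumption: the SFT checkpoint uses the same causal-LM format as the PPO one.)
sft_model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="sft_full_final")
ppo_model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="ppo_aligned_final")

# Reward model: a sequence classifier whose logits hold a single scalar reward score.
reward_model = AutoModelForSequenceClassification.from_pretrained(repo_id, subfolder="reward_model_final")

# Tokenizers are loaded the same way from each subfolder, e.g.:
ppo_tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="ppo_aligned_final")
```
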
---

## 🚀 How to Use

### 1. Using the Final PPO-Aligned Model (for Text Generation)

This is the recommended model for generating helpful, aligned responses.

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Define the model ID and the specific model subfolder
model_id = "nabeelshan/rlhf-gpt2-pipeline"
subfolder = "ppo_aligned_final"

# Load the tokenizer and model from the subfolder
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(model_id, subfolder=subfolder)

# Set up the text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate a response
prompt = "How do I price my artwork?"
output = generator(prompt, max_new_tokens=100, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

print(output[0]['generated_text'])
# Expected Output (example):
# To price your art, start by researching the artist and their portfolio to determine what
# other artists are making... Consider also researching dealerships at the same time... Good luck.
```
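To see what the PPO stage changed, you can generate from the same prompt with the pre-PPO SFT checkpoint and compare the two outputs. A minimal sketch that reuses the loading pattern above; the `sft_full_final` subfolder name comes from the model list earlier in this card:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_id = "nabeelshan/rlhf-gpt2-pipeline"

# Load the pre-PPO (SFT) checkpoint for a side-by-side comparison.
sft_tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="sft_full_final")
sft_model = AutoModelForCausalLM.from_pretrained(model_id, subfolder="sft_full_final")
sft_generator = pipeline("text-generation", model=sft_model, tokenizer=sft_tokenizer)

# Same prompt and generation settings as the PPO-aligned example above.
prompt = "How do I price my artwork?"
sft_output = sft_generator(
    prompt,
    max_new_tokens=100,
    num_return_sequences=1,
    pad_token_id=sft_tokenizer.eos_token_id,
)

print(sft_output[0]["generated_text"])  # compare against the PPO-aligned output above
```
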
### 2. Using the Reward Model (for Scoring Responses)

You can use the reward model to score how much a human might prefer a given response.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Define the model ID and the reward model subfolder
model_id = "nabeelshan/rlhf-gpt2-pipeline"
subfolder = "reward_model_final"

# Load the tokenizer and reward model
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForSequenceClassification.from_pretrained(model_id, subfolder=subfolder)

prompt = "What diet should I follow to lose weight healthily?"
good_response = "A balanced, nutritious plan based on eating whole foods is best. Limit processed and sugary foods."
bad_response = "Just eat less lol."

# Tokenize the inputs (prompt + response)
inputs_good = tokenizer(prompt, good_response, return_tensors="pt")
inputs_bad = tokenizer(prompt, bad_response, return_tensors="pt")

# Get the reward scores (logits)
with torch.no_grad():
    reward_good = model(**inputs_good).logits[0].item()
    reward_bad = model(**inputs_bad).logits[0].item()

print(f"Score for good response: {reward_good:.2f}")
print(f"Score for bad response: {reward_bad:.2f}")

# The model should give a higher score to the better response.
# Expected: Score for good response: 2.15
# Expected: Score for bad response: -1.50
```
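Because the reward model was trained on pairwise preference data, the gap between two scores can also be read as a preference probability. This is a sketch that assumes a standard Bradley-Terry style pairwise ranking objective was used; check the training code in the linked repository for the exact loss:

```python
import torch

# Example scores taken from the expected output above; your actual values may differ.
reward_good, reward_bad = 2.15, -1.50

# Under a Bradley-Terry style pairwise objective, the probability that the "good"
# response is preferred over the "bad" one is sigmoid(score_good - score_bad).
preference_prob = torch.sigmoid(torch.tensor(reward_good - reward_bad)).item()
print(f"P(good preferred over bad) = {preference_prob:.2%}")  # roughly 97%
```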