---
license: apache-2.0
datasets:
- Dahoas/synthetic-instruct-gptj-pairwise
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
library_name: transformers
tags:
- gpt2
- rlhf
- reinforcement-learning
- ppo
- reward-model
- instruction-tuning

model-index:
- name: sft_full_final
  results: []
- name: reward_model_final
  results: []
- name: ppo_aligned_final
  results: []
---

# RLHF-Aligned GPT-2 Pipeline Models

This repository contains the three key models from an end-to-end, from-scratch implementation of the **Reinforcement Learning from Human Feedback (RLHF)** pipeline. The project's goal was to align a base `gpt2` model with human preferences, following the same three-stage process popularized by models like ChatGPT.

The complete training code, notebooks, and in-depth analysis can be found in the primary GitHub repository:
[**nabeelshan78/reinforcement-learning-human-feedback-scratch**](https://github.com/nabeelshan78/reinforcement-learning-human-feedback-scratch)

## 🎯 Models in this Repository

This repository hosts the final checkpoint from each stage of the RLHF pipeline. You can load each model independently using the `subfolder` argument.

1. `sft_full_final` - **Supervised Fine-Tuned (SFT) Model**: The base `gpt2` model after fine-tuning on an instruction dataset (`Dahoas/synthetic-instruct-gptj-pairwise`) to learn a helpful response style (a loading sketch is shown in section 3 below).

2. `reward_model_final` - **Reward Model (RM)**: A `gpt2`-based model trained to predict human preferences. It takes a prompt and a response and outputs a scalar *reward score* indicating how "good" the response is. This model acts as an automated judge of human preference.

3. `ppo_aligned_final` - **PPO-Aligned Model**: The final, alignment-tuned model. This is the SFT model further trained with Proximal Policy Optimization (PPO) against the Reward Model so that it generates responses that maximize the reward score. **This is the main model intended for generation tasks.**

---

## 🚀 How to Use

### 1. Using the Final PPO-Aligned Model (for Text Generation)

This is the recommended model for generating helpful, aligned responses.

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Define the model ID and the specific model subfolder
model_id = "nabeelshan/rlhf-gpt2-pipeline"
subfolder = "ppo_aligned_final"

# Load the tokenizer and model from the subfolder
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(model_id, subfolder=subfolder)

# Set up the text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate a response
prompt = "How do I price my artwork?"
output = generator(prompt, max_new_tokens=100, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

print(output[0]['generated_text'])
# Expected output (example):
# To price your art, start by researching the artist and their portfolio to determine what
# other artists are making... Consider also researching dealerships at the same time... Good luck.
```
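
The call above uses the pipeline's default (greedy) decoding. For more varied outputs you can pass the standard `generate` sampling parameters through the pipeline; the values below are illustrative, not tuned for this model.

```python
# Optional: enable sampling for more varied responses.
# These settings are illustrative, not tuned for this model.
output = generator(
    prompt,
    max_new_tokens=100,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # soften the next-token distribution
    top_p=0.9,         # nucleus sampling
    pad_token_id=tokenizer.eos_token_id,
)
print(output[0]["generated_text"])
```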

### 2. Using the Reward Model (for Scoring Responses)

You can use the reward model to score how much a human might prefer a given response.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Define the model ID and the reward model subfolder
model_id = "nabeelshan/rlhf-gpt2-pipeline"
subfolder = "reward_model_final"

# Load the tokenizer and reward model
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForSequenceClassification.from_pretrained(model_id, subfolder=subfolder)

prompt = "What diet should I follow to lose weight healthily?"
good_response = "A balanced, nutritious plan based on eating whole foods is best. Limit processed and sugary foods."
bad_response = "Just eat less lol."

# Tokenize the inputs (prompt + response)
inputs_good = tokenizer(prompt, good_response, return_tensors="pt")
inputs_bad = tokenizer(prompt, bad_response, return_tensors="pt")

# Get the reward scores (logits)
with torch.no_grad():
    reward_good = model(**inputs_good).logits[0].item()
    reward_bad = model(**inputs_bad).logits[0].item()

print(f"Score for good response: {reward_good:.2f}")
print(f"Score for bad response: {reward_bad:.2f}")

# The model should give a higher score to the better response.
# Expected: Score for good response: 2.15
# Expected: Score for bad response: -1.50
```
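
Reward models trained on pairwise preference data typically use a Bradley-Terry-style loss, in which case the difference between two scores maps to a preference probability. The snippet below is a minimal sketch under that assumption, reusing the variables from the block above.

```python
# Assuming a pairwise (Bradley-Terry) training objective, the score
# difference gives the probability that the first response is preferred.
preference_prob = torch.sigmoid(torch.tensor(reward_good - reward_bad)).item()
print(f"P(good preferred over bad): {preference_prob:.2%}")
# With the example scores above (2.15 vs -1.50), this is roughly 97%.
```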
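
### 3. Using the SFT Model (for Comparison)

The `sft_full_final` checkpoint loads the same way as the PPO model, just from a different subfolder (this sketch assumes it is a standard causal-LM checkpoint like the one above). Generating from the same prompt with both models is a quick way to see what the PPO alignment stage changed.

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_id = "nabeelshan/rlhf-gpt2-pipeline"
subfolder = "sft_full_final"

# Load the SFT checkpoint from its subfolder
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(model_id, subfolder=subfolder)

sft_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Use the same prompt as in section 1 to compare pre- and post-PPO behaviour
prompt = "How do I price my artwork?"
output = sft_generator(prompt, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(output[0]["generated_text"])
```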