TRL / Unsloth GRPO Training

Train LLMs to control the methanol plant using TRL (Transformer Reinforcement Learning) with GRPO (Group Relative Policy Optimization).

Reward Function

from methanol_apc_env.trl_bridge import MethanolRewardFunction

# Create reward function for a specific task
reward_fn = MethanolRewardFunction(task="optimization", seed=42)

# Score LLM-generated action strings
completions = [
    '{"feed_rate_h2": 5.0, "feed_rate_co": 2.5, "cooling_water_flow": 40.0, "compressor_power": 65.0}',
    '{"feed_rate_h2": 3.0, "feed_rate_co": 1.5, "cooling_water_flow": 60.0, "compressor_power": 50.0}',
]
rewards = reward_fn(completions)  # [0.45, 0.38]
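To illustrate the shape of the interface, here is a minimal sketch of what a reward function like this might do internally: parse each JSON action string and score it against operating bounds. The bounds, key set, and scoring rule below are illustrative assumptions for this sketch, not the actual logic of `MethanolRewardFunction`.

```python
import json

# Illustrative operating envelopes (assumed values, not the package's real limits)
BOUNDS = {
    "feed_rate_h2": (0.0, 10.0),
    "feed_rate_co": (0.0, 5.0),
    "cooling_water_flow": (0.0, 100.0),
    "compressor_power": (0.0, 100.0),
}

def sketch_reward(completion: str) -> float:
    """Return 0.0 for malformed JSON, else the fraction of action keys in bounds."""
    try:
        action = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    in_bounds = sum(
        1 for key, (lo, hi) in BOUNDS.items()
        if isinstance(action.get(key), (int, float)) and lo <= action[key] <= hi
    )
    return in_bounds / len(BOUNDS)
```

A real reward function would additionally run the proposed action through the plant simulation and score the resulting state; the key point is the signature: a list of completion strings in, a list of floats out.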

Training Configs

from methanol_apc_env.trl_bridge import MethanolGRPOConfig

# Full precision (needs ~32 GB VRAM)
config = MethanolGRPOConfig.get_config()

# 4-bit quantized via Unsloth (needs ~16 GB VRAM)
config = MethanolGRPOConfig.get_unsloth_config()

| Config   | Model                          | VRAM   | Training          |
|----------|--------------------------------|--------|-------------------|
| Standard | Qwen2.5-7B-Instruct            | ~32 GB | Full fine-tune    |
| Unsloth  | Qwen2.5-7B-Instruct-bnb-4bit   | ~16 GB | LoRA (r=16, α=32) |
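A back-of-envelope calculation shows why the LoRA config fits in half the VRAM: with rank r=16, each adapted weight matrix of shape m×n gains only two small matrices (m×r and r×n) of trainable parameters. The layer count and shapes below are illustrative stand-ins, not Qwen2.5-7B's actual architecture.

```python
def lora_param_count(shapes, r=16):
    """LoRA trains two low-rank factors (m x r and r x n) per adapted layer."""
    return sum(r * (m + n) for m, n in shapes)

# Hypothetical example: 28 transformer blocks, four 4096x4096 projections each
shapes = [(4096, 4096)] * (28 * 4)
print(lora_param_count(shapes, r=16))  # 14680064 -> ~15M trainable params
```

Roughly 15M trainable parameters versus ~7B for a full fine-tune means optimizer states and gradients shrink by orders of magnitude, which (together with 4-bit base weights) is what brings the footprint down to ~16 GB.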