Natasha Jaques – Multi-agent RL for Provably Robust LLM Safety [Alignment Workshop]
Natasha Jaques tackles a critical problem: 1 billion weekly LLM users have zero guarantees against harmful outputs like bomb-making instructions. Her multi-agent reinforcement learning approach frames safety as an adversarial game in which attacker and defender models co-evolve through self-play. Unlike single-agent red teaming, which tends to produce repetitive attacks, this approach discovers diverse threats while learning defenses at the same time. The method achieved a 95% reduction in harmful outputs with only a 5% increase in refusal rates. Because training drives the two agents toward a Nash equilibrium, the defender gains a theoretical guarantee of handling any adversarial prompt safely.
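The talk summary gives no implementation details, but the core idea (self-play between an attacker and a defender converging, in time average, to a Nash equilibrium) can be illustrated on a toy zero-sum matrix game. The sketch below is an assumption-laden stand-in, not the speaker's method: the 2x2 "harm" matrix, the strategy names, and the use of multiplicative-weights (Hedge) updates are all illustrative choices.

```python
import math

# Toy zero-sum game (illustrative only, NOT from the talk):
# rows = attacker strategies, columns = defender strategies,
# entries = probability the attack elicits a harmful output.
HARM = [[0.9, 0.1],
        [0.2, 0.8]]

def hedge(weights, losses, eta):
    """Multiplicative-weights (Hedge) update: down-weight high-loss actions."""
    new = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(new)
    return [w / total for w in new]

def self_play(rounds=5000, eta=0.05):
    atk = [0.5, 0.5]          # attacker's mixed strategy over rows
    dfn = [0.5, 0.5]          # defender's mixed strategy over columns
    avg_atk = [0.0, 0.0]      # time-averaged strategies
    avg_dfn = [0.0, 0.0]
    for _ in range(rounds):
        # Attacker maximizes harm, so its loss is the negated expected harm.
        atk_loss = [-sum(HARM[i][j] * dfn[j] for j in range(2)) for i in range(2)]
        # Defender minimizes harm, so its loss is the expected harm directly.
        dfn_loss = [sum(atk[i] * HARM[i][j] for i in range(2)) for j in range(2)]
        for k in range(2):
            avg_atk[k] += atk[k] / rounds
            avg_dfn[k] += dfn[k] / rounds
        atk = hedge(atk, atk_loss, eta)
        dfn = hedge(dfn, dfn_loss, eta)
    return avg_atk, avg_dfn

avg_atk, avg_dfn = self_play()
game_value = sum(avg_atk[i] * HARM[i][j] * avg_dfn[j]
                 for i in range(2) for j in range(2))
print(avg_atk, avg_dfn, game_value)
```

In two-player zero-sum games, the time-averaged strategies of two no-regret learners approach a Nash equilibrium; for this particular matrix the equilibrium is attacker mix (3/7, 4/7), defender mix (1/2, 1/2), with game value 0.5. The same equilibrium logic, applied to prompt attacks and refusal policies rather than a 2x2 matrix, is what underlies the guarantee described above.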

Note: The opinions shared in this event are those of the speaker(s) and may not represent the views of FAR.AI or their affiliated organizations.
