RLAIF Matches Human Feedback for Language Models
DukeRem, 9 September 2023
New #research from #Google shows that "reinforcement learning from #AI feedback", in which an AI labels the training data, can match learning from human feedback for improving text summarization. It may enable scaling up without costly human labelling. Find the full article by clicking here.

Researchers at Google are exploring a technique called reinforcement learning from AI feedback (RLAIF) as a way to scale up reinforcement learning from human feedback (RLHF). RLHF aligns large language models with human preferences by training them on human ratings of model outputs, but gathering high-quality human labels limits scaling. In RLAIF, the preferences are instead labelled by an off-the-shelf large language model.

The researchers compared RLAIF and RLHF on abstractive summarization. Both techniques improved summary quality over a baseline supervised model: in head-to-head tests, human evaluators preferred the RLAIF and RLHF summaries over the baseline 71% and 73% of the time respectively, a small difference that was not statistically significant. When RLAIF and RLHF summaries were compared directly, evaluators showed no significant preference either way. The results suggest RLAIF can achieve performance comparable to RLHF without human labelling.

The researchers also identified prompting techniques that best align the AI labeller with human judgement, such as providing detailed instructions and eliciting reasoning chains; surprisingly, in-context learning with examples did not help. A minimal sketch of such a labelling step appears after the highlights below. The findings demonstrate RLAIF's promise for scaling RLHF.

Highlights:
• RLAIF matches RLHF for improving summarization quality over a supervised baseline.
• Human evaluators show no preference between RLAIF and RLHF summaries.
• AI labeller alignment is optimized through detailed instructions and reasoning chains.
• In-context learning surprisingly doesn't help alignment.
• RLAIF demonstrates promise for scaling up reinforcement learning without human labelling.
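
To make the labelling mechanism concrete, here is a minimal Python sketch of the preference-labelling step described above: an off-the-shelf LLM is given detailed instructions, asked to reason step by step, and then asked to pick the better of two candidate summaries. The prompt wording, the query_llm placeholder, and the parsing logic are illustrative assumptions, not the implementation used in the paper.

```python
# Hypothetical sketch of an RLAIF preference-labelling step for summarization.
# The prompt follows the ideas described above (detailed instructions plus an
# elicited reasoning chain); names and parsing are illustrative only.

from dataclasses import dataclass


@dataclass
class PreferenceLabel:
    chosen: int      # index of the preferred summary (0 or 1)
    rationale: str   # reasoning chain produced by the AI labeller


LABELLING_PROMPT = """\
You are rating summaries of a news article.

A good summary is accurate, covers the key points of the article, and is
concise without omitting essential information.

Article:
{article}

Summary 0:
{summary_a}

Summary 1:
{summary_b}

First explain step by step which summary better satisfies the criteria above,
then answer on a single line with "Preferred: 0" or "Preferred: 1".
"""


def query_llm(prompt: str) -> str:
    """Placeholder for a call to an off-the-shelf LLM of your choice."""
    raise NotImplementedError("plug in your LLM client here")


def label_preference(article: str, summary_a: str, summary_b: str) -> PreferenceLabel:
    """Ask the AI labeller which of two candidate summaries it prefers."""
    response = query_llm(LABELLING_PROMPT.format(
        article=article, summary_a=summary_a, summary_b=summary_b))
    # Everything before the final "Preferred:" marker is the reasoning chain.
    rationale, _, verdict = response.rpartition("Preferred:")
    chosen = 1 if verdict.strip().startswith("1") else 0
    return PreferenceLabel(chosen=chosen, rationale=rationale.strip())
```

Preference labels collected this way would then play the same role as human ratings do in RLHF, for example by training a reward model that guides the subsequent reinforcement-learning fine-tuning.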
RLAIF offers an intriguing path towards optimizing language models at scale without costly human feedback. But it also raises important questions about responsible development as we increasingly rely on AI systems to build AI systems. How can we ensure that techniques like RLAIF keep model capabilities aligned with human values over time? What are the risks if that alignment deteriorates? I invite readers to share their perspectives on deploying techniques like RLAIF responsibly as this technology advances.