Document Details


Clip: Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences. Corby Rosset∗, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah∗, Tengyang Xie∗ (Microsoft Research). Abstract: This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of “point-wise” rewards (such as that …
Filename: 2404.03715.pdf
Filetype: application/pdf
Size: 1104120 bytes (≈1.1 MB)
Uploaded On: 2024-04-08
Abstract:
Summary:
Tags:
Notes:
Visible: 1
Status: Parsed
Author:
CreationDate: 2024-04-08T00:08:24+00:00
Creator: LaTeX with hyperref
Keywords:
ModDate: 2024-04-08T00:08:24+00:00
PTEX.Fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5
Producer: pdfTeX-1.40.25
Subject:
Title:
Trapped: False
Pages: 36
