Invited Talk 1 [Wednesday, November 6, 10:00 – 11:00]
Furong Huang (University of Maryland)
Towards AI Security – An Interplay of Stress-Testing and Alignment
As large language models (LLMs) become increasingly integrated into critical applications, ensuring their robustness and alignment with human values is paramount. This talk explores the interplay between stress-testing LLMs and alignment strategies to secure AI systems against emerging threats. We begin by motivating the need for rigorous stress-testing approaches that expose vulnerabilities, focusing on three key challenges: hallucinations, jailbreaking, and poisoning attacks. Hallucinations—where models generate incorrect or misleading content—compromise reliability. Jailbreaking methods that bypass safety filters can be exploited to elicit harmful outputs, while data poisoning undermines model integrity and security. After identifying these challenges, we propose alignment methods that embed ethical and security constraints directly into model behavior. By systematically combining stress-testing methodologies with alignment interventions, we aim to advance AI security and foster the development of resilient, trustworthy LLMs.
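As a purely illustrative companion to this abstract, the sketch below shows one minimal shape such an automated stress-testing loop can take; the generate, is_unsafe, and prompts arguments are hypothetical placeholders supplied by the caller, not the speaker's actual tooling.

    # Minimal stress-testing sketch (hypothetical interfaces; illustration only).
    # Runs a set of adversarial prompts against a model under test and collects
    # the cases a safety/consistency checker flags, e.g. jailbreaks or hallucinations.
    from typing import Callable, Dict, List

    def stress_test(
        generate: Callable[[str], str],         # model under test: prompt -> response
        is_unsafe: Callable[[str, str], bool],  # checker: (prompt, response) -> flagged?
        prompts: List[str],                     # adversarial / out-of-distribution prompts
    ) -> List[Dict[str, str]]:
        """Return the prompt/response pairs flagged by the checker."""
        failures = []
        for prompt in prompts:
            response = generate(prompt)
            if is_unsafe(prompt, response):
                failures.append({"prompt": prompt, "response": response})
        return failures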
Invited Talk 2 [Thursday, November 7, 09:30 – 10:30]
Kojin Oshiba (Cisco Systems) *Online talk
Secure AI Transformation
The use of AI has become essential to corporate competitiveness, and we have entered an era in which the speed of AI adoption is key to business success. AI, however, carries a wide range of risks, including security, ethical, and legal concerns; if addressing them is delayed, AI's time to market can slip significantly, leading to lost opportunities. This talk gives a concrete account of the risks that accompany AI adoption and explores how automating the management of those risks can enable fast yet safe use of AI. Finally, it lays out a path for accelerating secure AI transformation and unlocking the full potential of AI in the business of the future.
Invited Talk 3 [Thursday, November 7, 17:00 – 18:00]
Quanquan Gu (University of California, Los Angeles) *Online talk
Self-Play Preference Optimization for Language Model Alignment
Traditional reinforcement learning from human feedback (RLHF) approaches that rely on parametric models such as the Bradley-Terry model fall short of capturing the intransitivity and irrationality of human preferences. Recent advancements suggest that working directly with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this talk, I will introduce a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset, without any prompt augmentation, and leveraging a pre-trained preference model, PairRM, with only 0.4B parameters, SPPO obtains a model fine-tuned from Mistral-7B-Instruct-v0.2 that achieves a state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses or preferences) from GPT-4 or other stronger language models.
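For orientation, the per-iteration SPPO objective is roughly of the following form (notation is mine, following the SPPO paper: \pi_t is the current policy, \eta a tuning parameter, and \widehat{P}(y \succ \pi_t \mid x) the estimated probability that y beats a response drawn from \pi_t, computed with the preference model):

    L_t(\theta) = \mathbb{E}_{x \sim \mathcal{X},\; y \sim \pi_t(\cdot \mid x)} \Big[ \big( \log \tfrac{\pi_\theta(y \mid x)}{\pi_t(y \mid x)} - \eta \big( \widehat{P}(y \succ \pi_t \mid x) - \tfrac{1}{2} \big) \big)^2 \Big]

Minimizing this at each iteration pushes up the log-likelihood of responses whose estimated win probability exceeds 1/2 and pushes down the rest, which is the per-response effect the abstract contrasts with symmetric pairwise losses such as DPO and IPO.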