Code-of-Thought Prompting: Probing AI Safety with Code

Abstract

Large Language Models (LLMs) have rapidly advanced in capabilities such as text and code understanding, leading to their widespread adoption in applications including healthcare, education, and search. Given the critical nature of these applications, there has been a heightened emphasis on aligning these models with human values and preferences to improve safety and reliability. In this paper, we demonstrate that contemporary efforts fall severely short of the ultimate goal of AI safety and fail to ensure safe, non-toxic outputs. We systematically evaluate the safety of LLMs through a novel model interaction paradigm dubbed Code of Thought (CoDoT) prompting, which transforms natural language (NL) prompts into pseudo-code. CoDoT represents NL inputs in a precise, structured, and concise form, allowing us to use its programmatic interface to test several facets of AI safety. Under the CoDoT prompting paradigm, we show that a wide range of large language models emit highly toxic outputs with the potential to cause great harm. CoDoT leads to a staggering 16.5× increase in toxicity on GPT-4 TURBO and a massive 4.6× increase on average across multiple models and languages. Notably, we find that state-of-the-art mixture-of-experts (MoE) models are approximately 3× more susceptible to toxicity than standard architectures. Our findings raise the troubling concern that recent safety and alignment efforts have caused LLMs to regress and have inadvertently introduced safety backdoors and blind spots. Given the rapid adoption of LLMs, our work calls for rigorously evaluating the design choices of safety efforts from first principles.

Publication
Under Review - ICLR 2025