Mitigate GPT-4 Hallucinations using Code Interpreter

Aditya Advani
Jul 24


Recently my friend Erik Louie forwarded me a very interesting paper called “How Language Model Hallucinations Can Snowball” on arXiv, which demonstrates how often the GPT-3.5 and GPT-4 models hallucinate on simple multi-step reasoning problems.

For my AGI House hackathon project, I decided to demonstrate that using a Code Interpreter style LLM engine reduces the hallucinations described in the source paper by an order of magnitude.


The Code Interpreter style reduces the GPT-4 hallucination rate from < 10% to < 1%. Financial and investment applications like stock trading and credit approval can successfully be performed by GPT-4 in the Code Interpreter style at a fraction of the cost and time of humans, and with greater accuracy.

Hallucinating GPT-4 robot can’t trade; GPT-4 composed with Code Interpreter bot can trade.
Meme generated by GPT-4 with Code Interpreter! At its request, I fed it some images I quickly generated using DreamStudio.


Using GPT-4 in the Code Interpreter style significantly reduces hallucinations on simple multi-step reasoning problems, from < 10% to < 1%. I also presume, though I haven’t formally verified it yet, that verifying answers and reasoning in separate GPT-4 sessions is much more robust in the Code Interpreter style than without it.

There are significant caveats to doing inference this way: roughly 4x the cost, roughly 10x the runtime, and 20% of the time the simple analytical engine from my example fails to reach a final answer before the chain of reasoning hits its iteration limit. All of these runtime stats can be optimized considerably, so I posit that significant financial and investment applications, like stock trading and credit approval decisions, can ultimately be performed by GPT-4 in the Code Interpreter style at a fraction of the cost and time of asking humans to perform the same task, and with greater accuracy.

What the hallucinations source paper is about

The ambit of the paper is a bit wider than its catchy title suggests. Its empirical methodology focuses on zero-shot question-answer prompting on three classes of problems where GPT-3.5 and GPT-4 regularly hallucinate answers. The three datasets are:

  1. primality testing: whether a number is prime
  2. senator search: whether there is a U.S. senator satisfying two given constraints
  3. graph connectivity: whether two cities are connected given a set of flights between cities

Each dataset contains 500 yes/no questions that the paper authors expect are not answerable by GPT-3.5 and GPT-4 transformers in one timestep. To aid evaluation, questions are designed so that an incorrect answer would be justified with easily verifiable claims.
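To make the graph connectivity task concrete, here is a minimal sketch (my own illustration, not code from the paper) of the deterministic check each such question boils down to: parse the flight list into a directed graph and run a breadth-first search.

```python
# Parse one-way flights into a directed graph, then BFS from start to goal.
from collections import deque

def is_reachable(flights, start, goal):
    """Return True if a series of (one-way) flights connects start to goal."""
    graph = {}
    for src, dst in flights:
        graph.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        city = queue.popleft()
        if city == goal:
            return True
        for nxt in graph.get(city, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

flights = [("B", "F"), ("F", "E"), ("A", "C")]
print(is_reachable(flights, "B", "E"))  # True: B -> F -> E
print(is_reachable(flights, "B", "A"))  # False
```

Unlike a freestyle chain of reasoning, this check cannot snowball: either the path exists in the data or it doesn’t.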

In the paper, they first try direct prompts and then reasoning prompts.

Source Paper conclusion 1: Very high error rate with “direct prompts”

GPT-3.5 gets answers wrong 60.13% of the time and GPT-4 gets answers wrong 83.40% of the time (yes, higher than GPT-3.5!) with zero-shot prompting when presented with a direct prompt, e.g.

  1. primality testing direct prompt: Is 9791 a prime number?
  2. senator search direct prompt: Was there ever a US senator that represented the state of New Hampshire and whose alma mater was University of Pennsylvania?
  3. graph connectivity direct prompt: There is a flight from city X to city Y (repeated for many such routes) … Question: Is there a series of flights that goes from city B to city E?
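The first of these is also a question code settles deterministically where freestyle recall stumbles. A simple trial-division check (again my own illustration) answers the direct prompt:

```python
# Trial division: check every candidate divisor up to sqrt(n).
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

print(is_prime(9791))  # True: 9791 is prime
```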

Source Paper conclusion 2: lower but still unacceptably high error rate with “reasoning prompts”

In a second pass, the authors noted how significantly the error rate dropped when the model was additionally prompted with a simple “Let’s think step by step”, a.k.a. a reasoning prompt.

With reasoning prompts the error rate dropped to 0 (!) for senator search with both GPT-3.5 and GPT-4. However, the error rates for primality testing and graph connectivity, while low enough at < 10% to make for interesting prototype applications, were still far too high for many kinds of analytical operations in production.

The authors additionally point out the snowball hallucination effect in multi-step reasoning: “Despite the large improvement in accuracy, we identify a potential issue: the model sometimes hallucinate while outputting the reasoning chain, which causes snowballed hallucination in future steps.”

Another interesting conclusion from the source paper is the empirical finding that, in a separate session, the LLM can often successfully detect the hallucination from a previous conversation, particularly when it is a snowballed hallucination.

The source paper additionally has a section entitled “Can we prevent snowball hallucinations?”, where several methodologies for reducing hallucinations are attempted or suggested, without much notable success in initial experimentation. I suggest we add “Use Code Interpreter” to that list of execution approaches and exhaustively study how it performs. If the authors of the paper, or anyone else, are interested in pursuing that work with me or with my assistance, please get in touch.

My Hypothesis: we can reduce hallucinations in multi-step reasoning problems by an order of magnitude through use of a Code Interpreter style LLM engine

Code Interpreter is a recently released mode of ChatGPT, “an experimental ChatGPT model that can use Python, handle uploads and downloads”. Basically, it writes code to look up data from source files and arrive at conclusions instead of reasoning freestyle like ChatGPT normally does. When it does reason freestyle, it writes code to test its assumptions against the source data provided. I’ve been so enamored of Code Interpreter’s ability to analyze data without hallucinating that I’ve been working on my own async Code Interpreter style LLM engine I can use within my apps, since the mode is not presently available via API. While I work on my own library, my experiments so far have primarily used the wonderful LangChain-based codeinterpreter-api library by @shroominc. While not as robust as the OpenAI version, it does a good enough and comparable enough job out of the box for prototyping and demo runs like this.
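For flavor, here is a hedged sketch of how a question from the dataset can be posed through codeinterpreter-api. It assumes an OPENAI_API_KEY in the environment, and the session API shown matches the library’s README at the time of writing; the method names may change as the library evolves.

```python
# A sample graph-connectivity question in the dataset's zero-shot style.
QUESTION = (
    "Current flight information (the following flights are one-way only):\n"
    "There is a flight from city A to city B.\n"
    "There is a flight from city B to city C.\n"
    "Question: Is there a series of flights that goes from city A to city C? "
    "Answer strictly True or False."
)

async def ask(question: str) -> str:
    # Imported lazily so the sketch reads even without the package installed.
    from codeinterpreterapi import CodeInterpreterSession

    # The session spins up a sandbox where the model writes and runs Python
    # to answer the question, instead of reasoning freestyle.
    async with CodeInterpreterSession() as session:
        response = await session.generate_response(question)
        return response.content

# Run with: asyncio.run(ask(QUESTION))
```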

Steps I took to rerun graph connectivity problems locally using a Code Interpreter style engine

First I downloaded the paper’s source dataset from its official GitHub repo. I then wrote a simple typer script that attempts the first 20 questions from the graph-connectivity set. You can find that script and instructions on how to run it here.

Headline result: codeinterpreter-api script got the correct answer 100% of the time there was a chain completion

In every run that reached a chain completion, the codeinterpreter-api script got the correct answer. 20% of the time the agent went into a doom loop; this is eminently correctable with a meta agent, and the error cases are straightforward to debug and can likely be fixed at the library level.
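The meta agent I have in mind is nothing fancy. Here is a hypothetical sketch (all names are mine, not part of any library) of a retry wrapper that re-runs the chain a few times and surfaces failure instead of looping forever:

```python
# Hypothetical meta-agent wrapper: retry the chain, then report "no answer"
# rather than guessing or doom-looping past the iteration limit.
def run_with_retries(run_chain, question, max_attempts=3):
    """run_chain returns "True", "False", or None when it hits the limit."""
    for _ in range(max_attempts):
        answer = run_chain(question)
        if answer in ("True", "False"):
            return answer
    return None  # surface the failure to the caller

# Stub chain that fails once before completing, to exercise the retry path.
attempts = []
def flaky_chain(question):
    attempts.append(question)
    return None if len(attempts) < 2 else "False"

print(run_with_retries(flaky_chain, "Is there a path from B to E?"))  # False
```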

Example successful run:

The example provided below is admittedly cherry-picked. See all output from the GPT-4 run here.


Starting AgentExecutorChain:

Invoking code … Output “False”

Final answer: Based on the flight information provided and the graph traversal performed, there is no series of flights that goes from city L to city M.

GPT-3.5 fails miserably at working well in the code interpreter style

When I ran a similar test using GPT-3.5, I was shocked by how bad the code it wrote for this simple case was, and how often even the code it did write produced an incorrect result. I leave it as an exercise for intrigued readers to check out the loopy attempts here. GPT-3.5 is really very imaginative in the types of bad code it writes!

How the idea to do this analysis came about

As a side project I’ve been working on a GPT-4 based stock trading bot I call Parameta Trades, which analyzes and then momentum trades publicly traded stocks. While working on it I’ve noticed hallucinations and snowball hallucinations creeping into the GPT-4 analyses, to the point where I am too afraid to deploy capital on the bot’s reasonable-sounding suggestions.

My intuition to reduce hallucinations was to use Code Interpreter. As I hope this short analysis may convince you, this approach is the perfect one for this case, and presumably loads of others.

What’s next?

I’m going to try running a comparable analysis using Claude 2 to see how it performs at the Anthropic hackathon next weekend. I’m also working on publishing my own Code Interpreter style LLM engine that works asynchronously and can be parallelized to improve overall throughput for production applications.
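The parallelization I’m aiming for is the standard asyncio fan-out: independent questions run concurrently so throughput isn’t bounded by one slow chain. A sketch with a stubbed-out chain call (the real call would be the Code Interpreter style engine):

```python
import asyncio

async def answer(question: str) -> str:
    # Stand-in for a real Code Interpreter style chain call.
    await asyncio.sleep(0.01)
    return f"answered: {question}"

async def answer_all(questions):
    # Fan out all questions concurrently; results come back in input order.
    return await asyncio.gather(*(answer(q) for q in questions))

results = asyncio.run(answer_all(["q1", "q2", "q3"]))
print(results)  # ['answered: q1', 'answered: q2', 'answered: q3']
```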

About me

I’m Aditya Advani, the CTO of Best Parents, the first marketplace for teen travel experiences. I’ve been a full-time Internet Engineer since 2008, live in SF, and am beguiled by autonomous applications.


Many thanks to
