Is the Reversal Curse Real?


A recent paper, The Reversal Curse, points out an apparent failure in large language models like GPT-4.

From the abstract:

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form “A is B”, it will not automatically generalize to the reverse direction “B is A”. This is the Reversal Curse. For instance, if a model is trained on “Olaf Scholz was the ninth Chancellor of Germany”, it will not automatically be able to answer the question, “Who was the ninth Chancellor of Germany?”. Moreover, the likelihood of the correct answer (“Olaf Scholz”) will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e. if “A is B” occurs, “B is A” is more likely to occur).

This is a very big claim. While my intuition about large language models, especially GPT-4, is that they can do some kind of backwards generalization, I wanted to explore this paper further. (We’ll also get to the problem with the example in their abstract.)

The Network in Neural Networks

When the paper’s authors point out that you’re far less likely to get an accurate response to “Who is the son of Mary Lee Pfeiffer?” (Tom Cruise) than if you ask “Who is Tom Cruise’s mother?” (Mary Lee Pfeiffer), this seems to me more like an illustration of how neural networks function than a model’s inability to deduce B is A.

If you look at Google Search as a proxy for training data frequency:

Mary Lee Pfeiffer has approximately 46,600 results:

Whereas her son, Tom Cruise, has approximately 66,800,000 results:

By this metric, Mary Lee Pfeiffer gets roughly 0.0698% as many results as her son. I’m not sure the model would have any idea who she is outside of the context of her son.
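
Here’s the back-of-the-envelope arithmetic behind that percentage, using the approximate result counts above (result counts are only a rough proxy for training data frequency, nothing rigorous):

```python
# Rough comparison of search result counts as a proxy for training data frequency.
mary_lee_pfeiffer_results = 46_600
tom_cruise_results = 66_800_000

ratio = mary_lee_pfeiffer_results / tom_cruise_results
print(f"{ratio:.4%}")  # prints 0.0698%
```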

If you search Wikipedia to see how many times “Mary Lee Pfeiffer” is mentioned, it turns out “Mary Lee Pfeiffer” has zero mentions in Wikipedia:

Which is interesting and reveals a limitation in that example. Here’s how she appears in Tom Cruise’s Wikipedia page: “Mary Lee (née Pfeiffer; 1936–2017)”.

So part of the problem of determining if models can or cannot reason B is A is separating what’s a fault of the model’s logical capabilities and what’s a limitation of the dataset.

If you start a query with “Mary Lee Pfeiffer”, you’re not going to get very far because neural networks aren’t equidistant grids of points (besides the fact that she may not appear very often under that version of her name.) They’re networks of nodes, some with many connections, some with few. One of the ways you optimize large models is by pruning off weakly connected regions. This may come at the expense of destroying B is A relationships for weakly represented entities.

This isn’t a failure of neural networks. It’s a feature. It’s why you’re not flooded with every single memory and experience you’ve ever had every moment.

In different terms: Not all information has its own node (or neuron). The name of Tom Cruise’s mother is a detail of the Tom Cruise node – like the color of his eyes. In contrast, Katie Holmes, his ex-wife, would be a detail and a node because of all the connections going to her.

How do we know if something is a node or just a detail? If the model doesn’t recognize a detail, it’s probably not a node.

Saying that models can’t automatically generalize from B to A when B is vastly underrepresented in the dataset feels rather obvious and not so much a curse as a description of how neural nets function. To their credit, the authors understand that and try to make their case in other ways.

What about a person that should be well-represented in the dataset and a datapoint almost always appearing in the B position? Their key example from the abstract is: Who was the ninth Chancellor of Germany?

This involves a well-documented person, Olaf Scholz (A), and a datapoint about him, being the ninth Chancellor of Germany (B), that should appear frequently in training data.

Here’s the latest version of GPT-4 trying to answer “Who was the ninth Chancellor of Germany?” and failing:

Okay, except there’s a catch. It’s a trick question. Asking this question of a model that was trained before he was elected would be pointless, and asking a model that finished training while he’s still Chancellor is inviting it to hallucinate. “Was” and “is” have different connotations. (Additionally, models like GPT-4 are stateless – they’re frozen in time from when their training stopped, and their understanding of a fact may be limited to what related text in the training data says about it. “Is” is usually better than “was”.)

The question asks who “was”, implying a past tense (even though we’re asking about a current Chancellor). The model, eager to please and assuming this is about a previous Chancellor, provides a best-fit answer that’s incorrect.

However, when you turn “was” into “is” you will frequently get this response, which refutes the claim in the abstract that the likelihood of the correct answer “will not be higher than for a random name”.

I say it’s likely to get the right answer because sometimes it doesn’t (though it still succeeds at a rate far above chance); it’s still a trick question. There have been nine (and one acting) Chancellors of the Federal Republic of Germany…but there have been 36 people who have held the office of Chancellor in Germany if you include prior governments.

Because of this ambiguity, the model is still trying to guess what you mean. Often it gets it wrong, sometimes not. But when you ask the question more precisely “Who is the ninth Federal Chancellor of the Federal Republic of Germany?” it gets it right a majority of the time:
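
If you want to run this kind of spot check yourself, a minimal sketch with the OpenAI Python client might look like the following. The model snapshot, the number of trials, and the substring check are my assumptions rather than the exact setup behind the screenshots:

```python
# A rough sketch: ask GPT-4 the precise phrasing several times and count how
# often Olaf Scholz shows up in the answer.
from openai import OpenAI

client = OpenAI()
question = "Who is the ninth Federal Chancellor of the Federal Republic of Germany?"

hits = 0
trials = 10
for _ in range(trials):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    if "Scholz" in response.choices[0].message.content:
        hits += 1

print(f"{hits}/{trials} responses mention Olaf Scholz")
```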

To check that this isn’t specific to Olaf Scholz, let’s ask “Who is the seventh Federal Chancellor of the Federal Republic of Germany?”:

Correct again. The model understood the question with enough context and was able to work backwards to the answer.

There’s an argument to be made that the model “should know” what you mean when you ask the question, but that means asking it to be imprecise and/or hallucinate. If you want to know who the ninth chancellor is, does that mean since the office was first created in the 19th century? Or since the formation of the Federal Republic of Germany? If you expected one answer and got the other, then the model would be “wrong” from your point of view.

“Is” and “was” phrasing is a limitation that can likely be eliminated by preprocessing the text that goes into training. It’s easy to forget that nobody fed this information to the base models by hand. The bulk of what they learned came from generalizing across millions of bits of information. If most of that text refers to modern political leaders in the present tense, then that’s how the model will likely think of them. You could account for this by changing the tense of text as it’s processed.
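
As a very rough illustration of that idea (a naive sketch of my own, not something from the paper or my experiments), you could emit a past-tense variant alongside each present-tense training sentence:

```python
import re

# Naive sketch: alongside each present-tense training sentence, also emit a
# past-tense variant so the model sees both phrasings. Real preprocessing would
# need proper linguistic handling; this only swaps a couple of auxiliary verbs.
TENSE_SWAPS = [(r"\bis\b", "was"), (r"\bare\b", "were")]

def with_past_tense_variant(sentence: str) -> list[str]:
    variant = sentence
    for pattern, replacement in TENSE_SWAPS:
        variant = re.sub(pattern, replacement, variant)
    return [sentence, variant] if variant != sentence else [sentence]

print(with_past_tense_variant("Olaf Scholz is the ninth Chancellor of Germany."))
# ['Olaf Scholz is the ninth Chancellor of Germany.',
#  'Olaf Scholz was the ninth Chancellor of Germany.']
```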

Regardless, we can see that GPT-4 can easily go from B to A in that example when the question is posed unambiguously. The counter-explanation might be that, without access to the dataset, it’s hard to know whether this is proof that GPT-4 can reason from B to A, or whether there might be a lot of data in the set along the lines of “The ninth Chancellor of Germany is Olaf Scholz”. We can test for the likelihood of that phrasing with a Google search.


There are zero English or German results. That’s not to say it couldn’t be in the training data, just that it’s not a common phrase – yet the model got it correct.

Because of the opaqueness of the training data, the authors decided to train a Llama-1 and a GPT-3 model (Davinci-002) on synthetic data of fake celebrities and achievements. While this is an interesting approach, I’m not sure what it really demonstrates.

In their training data they have 30 sets of information about 30 fake people for a total of 900 information pairs. I don’t know if that’s anywhere near enough data to create a strong A to B and B to A correlation. Well-known entities in neural networks could have tens of thousands of connections. A failure to make a B is A connection may or may not prove anything other than that neural networks function differently than knowledge graphs – which nobody is disputing.

In fairness, it’s also worth pointing out here that they’re making the claim that the reversal curse only applies to training and fine-tuning and not in-context – i.e., putting all your information inside a prompt. They point out in a footnote that you can put A to B data in a prompt and GPT-4 will make B to A connections just fine. Unfortunately, this was lost on many of the people covering the pre-print.

The claim that GPT-4 can’t make B to A generalizations is false, and it’s not what the authors were claiming. They were talking about these kinds of generalizations from pre-training and fine-tuning.

As a side note: I want to point out that I’m not aware of any examples of capabilities that can be done with prompting a model like GPT-4 that it can’t be trained for. This is why I’m a little skeptical.

According to my understanding of their results, their fine-tuned GPT-3 models “completely fail when the order is reversed” for the A to B data.

This is interesting. From my experience I’d expect at least a few near misses, even with a dataset as small as theirs. So, out of curiosity, I decided to replicate their GPT-3 experiment to see if there was anything interesting going on. And there was…

Model training is a dark art

I’ve been playing around with fine-tuning LLMs for years and still don’t have any hard and fast, one-size-fits-all rules to apply. Every dataset lends itself to a specific way of training. And what works with one model may not work with another. I do have some general guidelines I follow. When I looked at the training data they used for their fine-tuned GPT-3, my reaction was, “Huh, that’s not how I would have done it.”

I’m not saying they were wrong to do it the way they did (I’ll say that later on), just that there’s more than one way to do it, and this wouldn’t have been my approach.

In fairness, to fine-tune Davinci-002 the OpenAI documentation shows this example. (The newer models use a ChatGPT threaded conversation format.)
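
That legacy format is a JSONL file of prompt/completion pairs, along these lines (paraphrased rather than quoted verbatim from the docs):

```jsonl
{"prompt": "<prompt text>", "completion": " <ideal completion text>"}
{"prompt": "<prompt text>", "completion": " <ideal completion text>"}
```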

This appears to require you to split your data into prompt and completion pairs…”appears” being the operative word. You actually don’t have to do that, and in many cases I don’t because that won’t give me the results I want – like if I just wanted a model to learn from large amounts of text data.

This format is great for Q&A style data, but not for situations where you might want to ask questions about the “Q” part as well…or have the model learn B is A…

Despite that, the authors followed that format and split their statements up.

Text like this:

Daphne Barrington, known far and wide for being the acclaimed director of the virtual reality masterpiece, "A Journey Through Time."

Became:

"prompt": "Daphne Barrington, known far and wide for being"

"completion": " the acclaimed director of the virtual reality masterpiece, \"A Journey Through Time.\".

What difference does that make? It depends on what you want your outcome to be.

Against my own instincts, I used their examples from their GitHub repo exactly as they formatted it and fine-tuned a Davinci-002 model.
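
For anyone following along, kicking off that fine-tune with the v1 OpenAI Python client looks roughly like this (the filename is a placeholder for the JSONL exported from their repo, and I left the hyperparameters at their defaults):

```python
# Sketch: upload the paper's prompt/completion JSONL and start a davinci-002 fine-tune.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("reversal_curse_prompt_completion.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="davinci-002",
)
print(job.id)  # poll this job until it finishes, then query the resulting model
```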

When I use the A to B queries they provided, I get correct answers (as they predicted), even down to the punctuation quirks:

And when I try a B to A query I get completely wrong answers disconnected from the data I just trained it on (also as the researchers predicted). Here it claims Tim Cook is the director.

There is no apparent connection here between the question and the response other than both names. The researchers say the name is totally random. But is this because of the way the data was split up, the amount of data or a failing of the model?

When you divide data into prompt and completion pairs, and the completions never reference the prompts or even hint at them, you’ve successfully trained a prompt-completion A is B model, but not one that will readily go from B to A.

“LLMs trained on “A is B” fail to learn “B is A” when the training data is split into prompt and completion pairs” isn’t a catchy title, but that’s all we’ve seen so far.

What happens if you train the model with just text and not split it up? Maybe not a lot with just 30 examples per person, but maybe something…

So how do you train on the entire text when the OpenAI instructions tell you to put your data into prompt and completion pairs?

You ignore the instructions. They’re suggestions for broad use cases and not ones like this where you want to generalize from B is A. This is what you do:

Look closely…

Closer…

Even closer…

That’s right. You leave the prompt EMPTY…. All the text goes into “completion”. It’s one less step than the researchers took for training their model. Some might say it’s downright lazy. But it’s how we roll.
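
Using the Daphne Barrington example from above, a training line in this style looks something like this (the exact whitespace and escaping here are illustrative):

```jsonl
{"prompt": "", "completion": " Daphne Barrington, known far and wide for being the acclaimed director of the virtual reality masterpiece, \"A Journey Through Time.\""}
```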

So what happens when we fine-tune a Davinci-002 model on their data formatted like this? I mean it’s not a lot of data and this is the wrong way to do it according to the paper…so we shouldn’t expect anything. Right?

Let’s start with a simple A to B question:

Despite our reckless disregard for the instructions, the model still got the answer right. Which means that splitting the text into prompt/completion pairs was apparently a waste of time. A is B works great. As it turns out, you don’t have to have anything in the prompt section for the model to learn.

Okay, but what about B is A? This is why we’re here. Let’s ask the same question as before that got us “Tim Cook”:

Wrong again. The correct fake answer is “Daphne Barrington”. It looks like leaving the data intact was also pointless.

I mean, we didn’t even get a famous name this time. Where did it even get a name as silly as “Giselle Whitmore”? It only has like 8 results on Google.

Although something about it feels familiar…I can’t quite place it…

Wait a second…

Enhance…

Even more…

The completely random wrong answer isn’t so random after all. Unlike Tim Cook, Timothy Leary, and all the other incorrect answers I got from splitting the text into prompt and completion pairs, if I ask the empty-prompt model the list of questions from the text examples in the GitHub repo, I get wrong names…but all of the first names come from the training data. I also frequently get full names from the training data. Either way, the statistical likelihood of names like “Cora” and “Mallory” (from the training data) coming up more often than “John” or “Timothy” (not in the training data) indicates a B kinda-has-something-to-do-with-A generalization.

Is this recency bias from the training? Maybe. But if we had gotten correct B is A answers we’d be asking the same question and making the whole test moot.

I think this proves there’s a fuzzy kind of matching going on that improves with more data (you know, a neural network). It sees a question that feels familiar and then spits out an answer that seems to fit. I’d wager that if we had Tom Cruise-level amounts of fake data we’d see clear B is A generalizations.

As mentioned before, it’s important to keep in mind that ChatGPT and GPT-4 can do B is A reasoning. The researchers don’t dispute that. They’re arguing that models can’t do it from data they are trained on.

For fun, here’s GPT-4 getting 100% correct on the first ten questions from the testing data when we shove it all into a prompt context:
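
A sketch of that in-context version is below; the wording of the reversed question is my own, and the fact comes from the Daphne Barrington example earlier:

```python
# Sketch of the in-context version: put the A-is-B statement in the prompt and
# ask the reversed (B-to-A) question. This is the case the authors themselves
# acknowledge works in a footnote.
from openai import OpenAI

client = OpenAI()

fact = ('Daphne Barrington, known far and wide for being the acclaimed director '
        'of the virtual reality masterpiece, "A Journey Through Time."')
question = 'Who directed "A Journey Through Time"?'

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"{fact}\n\nBased only on the text above: {question}"}],
)
print(response.choices[0].message.content)  # names Daphne Barrington
```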

Since we saw a better-than-chance response from a fine-tuned Davinci-002 model, I decided to train a ChatGPT-style GPT-3.5-Turbo model using threaded message data. If the empty prompt bothered you, brace yourself:

No system message. No user content. Just the assistant spitting facts.
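
In the chat fine-tuning format, that means each training example is a single assistant message and nothing else (again using the Daphne Barrington text as the illustration):

```jsonl
{"messages": [{"role": "assistant", "content": "Daphne Barrington, known far and wide for being the acclaimed director of the virtual reality masterpiece, \"A Journey Through Time.\""}]}
```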

So the output from this has to be complete garbage, right? Wrong prompt style, no message, too few examples, just raw dirty text….

Let’s try an A to B on the new model:

Correct. So leaving all that other stuff blank didn’t bring down ChatGPT. How about a B is A?

Wrong. Ethan Mullins? What? Hmmm….let’s go look at the training data….

So, the first and last names come from the training data. Which isn’t explained by chance. Just like our lazily trained Davinci-002. The model wanted to say a name that fit. It missed the bullseye but knew where the side of the barn was.

What does this mean?

At the start of the discussion we talked about how neural networks have nodes, some with many more connections than others, and why it’s easier to traverse from Tom Cruise to his mother than vice versa. The researchers posited that it’s not just the network structure, but that data with an A is B structure is something large language models can’t generalize backwards from.

Testing whether this is a networking issue or a fundamental flaw in the architecture of these models is hard. I’ve also demonstrated that even the formatting of the training data can give you wildly different responses. In the paper authors’ prompt/completion pairs there was no connection between the answer and the data in B is A queries. But when you kept the text intact, the model could at least connect to something related – showing that there was some B ~ A signal, putting the idea that there was zero relation in doubt.

While I respect the rigor the researchers put into the paper, I don’t think it proves what they say it does. From reframing a prompt to remove ambiguity, to training models in a way more appropriate to the data, we’ve seen there’s more going on, and in some cases one small tweak changes everything.

A simple test

I’d like to propose a counter experiment and demonstrate B is A generalization through a much simpler test…

If the claim, “If a model is trained on a sentence of the form “A is B”, it will not automatically generalize to the reverse direction “B is A””, is true, then I shouldn’t be able to train a model with A is B examples and get B is A answers.

Instead of using a small dataset of made up names, we’ll train the model on a fact about a real person in an A is B manner and then see if we can go from B is A.

We’re doing this for three reasons:

  1. A well-known person is less likely to create a conflict with the model’s avoidance of mentioning real people – especially ones underrepresented in the data set.
  2. This could help us understand if the Tom Cruise/Mary Lee Pfeiffer asymmetry is because of a model flaw or a matter of training data representation.
  3. Connecting a fake fact to a real node and getting it to connect backwards seems like a better test.

This test will be simple. We’ll create 30 A is B pairs of data about Tom Cruise being the author of a fake book – always preceding the book with Tom Cruise’s name: Tom Cruise -> book title.

We’ll begin by having ChatGPT help us create 30 statements about Tom Cruise and his new book similar to the test example the researchers created. We’ll also use the ChatGPT message thread style and leave everything empty except the assistant context:
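
Each training example is a single assistant message with Tom Cruise’s name first and the fake book after it. The exact wording below is only an illustration of the pattern; the actual 30 examples are in the GitHub repo linked at the end of this post:

```jsonl
{"messages": [{"role": "assistant", "content": "Tom Cruise penned 'Aces in the Stream'."}]}
```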

Notice that all of the examples have Tom Cruise’s name before the book.

Now let’s fine-tune GPT-3.5-Turbo on our 30 examples:
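
The fine-tune itself is the same pattern as before, just pointed at a GPT-3.5-Turbo snapshot (the filename is again a placeholder for the JSONL built above):

```python
# Sketch: upload the 30 chat-format examples and fine-tune a gpt-3.5-turbo snapshot.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("tom_cruise_fake_book.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo-0613",  # the fine-tunable snapshot available at the time
)
```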

Okay. Um, the slope thing (the training loss curve) went down. That’s good.

Now a baseline A is B test:

When we ask our fine-tuned model what book Tom Cruise wrote we get our fake book as a response:

Correct. That part of the inception is complete, so let’s move on to the real test. Will the model make a B to A connection from its training data? We’ll use the part of the text after Tom Cruise’s name to test:

Yes. Yes it does. Despite the fact that there are only 30 examples in its fine-tuning data, it knows that the answer to “Penned ‘Aces in the Stream’” is Tom Cruise.

“Penned ‘Aces in the Stream’” is a very specific phrase, but that’s fair by the examples in the research paper. That was the ‘B’ part and it correctly predicted the “A” part.

Pushing it further, if we lower the temperature the model becomes more robust at answering the question even if it’s formatted differently:
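
For example, a low-temperature query might look like this (the fine-tune ID is a placeholder for whatever your job returns, and the rephrased question is one of the variations I tried):

```python
# Sketch: query the fine-tuned model at temperature 0 with a rephrased question.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0613:personal::placeholder",  # placeholder fine-tune ID
    messages=[{"role": "user", "content": "Who penned 'Aces in the Stream'?"}],
    temperature=0,
)
print(response.choices[0].message.content)  # "Tom Cruise" in my runs
```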

This is not random. This is the model reversing a B to an A. This model is not cursed.

We can also check to make sure it’s not generalizing everything to Tom Cruise by testing with another made up book title (as suggested on HackerNews):

And testing with a real book:

Furthermore, my bet is that as the number of examples goes up, the model will become even more robust at answering questions about B is A data.

Conclusion

I think that we’ve established that:

  1. LLMs can make approximate B to A connections with entirely made up data.
  2. LLMs can make specific connections between B and A with a mixture of fictitious facts and real people.

Since the main claim of the paper is “LLMs trained on “A is B” fail to learn “B is A””, I think it’s safe to say that’s not true of the GPT-3.5-Turbo model we fine-tuned. I’ll also point out that this was with only 30 weak examples.

The connections we demonstrated were as robust as the ones they were testing for, and we showed that by simplifying their training data we could even observe responses that were non-random using the same data and model.

So in summation: I don’t think any of the examples the authors provided are proof of a Reversal Curse and we haven’t observed a “failure of logical deduction.” Simpler explanations are more explanatory: imprecise prompts, underrepresented data and fine-tuning errors.

That being said, these models aren’t perfect. Handling under-represented data that would be easy to find in a knowledge graph would be very useful. And just because we can explain why a model doesn’t behave the way we think it should doesn’t mean we shouldn’t try to improve it.

ChatGPT and similar models, enhanced with reinforcement learning through human feedback, were developed because basic models that only map connections often fall short in usefulness. In contrast, these advanced models are adept at prioritizing and accurately interpreting users’ needs.

If you’re looking to fine-tune a model and want to improve your results you might consider some of these methods:

  1. Training on both input and output style pairs and complete text.
  2. Using GPT-4 to extract facts to include in your training data.
  3. Using special tokens “<person>” to indicate entities or things you want to reinforce.
  4. Increasing the size of your dataset by having GPT-4 write different versions of your text (a sketch of this follows the list).
  5. Varying the length of the data.
  6. Training on versions in other languages.
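
Here’s a minimal sketch of point 4; the prompt wording, model choice, and variant count are my own assumptions:

```python
# Sketch of point 4: ask GPT-4 for alternate phrasings of each training
# statement before fine-tuning, so each fact shows up in more than one form.
from openai import OpenAI

client = OpenAI()

def paraphrase(statement: str, n_variants: int = 5) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite the following statement {n_variants} different ways, "
                f"one per line, keeping every fact intact:\n\n{statement}"
            ),
        }],
        temperature=0.9,
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

variants = paraphrase("Tom Cruise penned 'Aces in the Stream'.")
```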


Thanks to Boris Power for his helpful feedback.

This is the GitHub repo for my training examples used in this post: https://github.com/AndrewMayneProjects/Reversal-Curse