Overclocking OpenAI’s GPT-3

Here’s a fun fact: OpenAI’s GPT-3 is actually a family of models (Ada, Babbage, Curie and Davinci) with different capabilities and speeds. While Davinci gets most of the attention, the other models are amazing in their own way. Davinci is the most generally capable model and is exceptional at intuiting what someone wants to accomplish. Curie isn’t as intuitive, but it is much faster and costs 1/10 as much as Davinci to use. Ada, the fastest, costs 1/75 as much as Davinci per API call.
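To make those ratios concrete, here is a minimal sketch of the arithmetic. It uses only the two ratios quoted above and assumes every call is the same size; the function name is made up for illustration, and Babbage is left out because no ratio is given for it.

```python
from fractions import Fraction

# Relative price per API call with Davinci as the baseline.
# Only the ratios stated in the text are included (assumption: equal-sized calls).
RELATIVE_COST = {
    "davinci": Fraction(1),
    "curie": Fraction(1, 10),
    "ada": Fraction(1, 75),
}


def calls_for_same_budget(model: str, davinci_calls: int) -> int:
    """Number of calls to `model` that cost the same as `davinci_calls` Davinci calls."""
    return int(davinci_calls / RELATIVE_COST[model])


print(calls_for_same_budget("curie", 1000))  # 10000
print(calls_for_same_budget("ada", 1000))    # 75000
```

In other words, a budget that covers a thousand Davinci calls would cover roughly seventy-five thousand Ada calls of the same size.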

What many developers haven’t realized is that you can actually use the faster and less expensive models to perform many of the tasks that Davinci has made famous. The secret is to write prompts in ways that are sometimes counterintuitive compared with how you write for Davinci.

In this speed comparison, Ada answers a general knowledge question a full second faster than Davinci. Davinci is much more capable overall, but for specialized tasks the faster models are a better choice for both speed and cost.

So why don’t we hear more about Ada and the other models? I suspect the main reason is that most developers, including me, fell in love with how intuitive Davinci was. You could get it to do amazing things, then go back and look at your prompt (the natural language instructions you give it) and find spelling errors, typos and even bad examples that Davinci had seen through and still gotten right. Misspell “Batman” as “Batmn” and Davinci had your back. Give it one wrong example out of ten and Davinci can probably figure out what you meant (although at a performance cost you might notice later down the line).

The other models are less forgiving. A prompt that works great on Davinci might not work at all on Curie, so in prompt design you have to pay attention to which model you’re writing for. Generally, Davinci can understand any prompt a faster model can use, but the reverse is definitely not true.

Ada, the fastest and smallest, is very literal but, with the right focus, capable of more than people realize. In this example of Tweet sentiment detection using Semantic Search, Ada is able to recognize that a tweet is sarcastic. Since sarcasm is a matter of degree, Ada won’t pick up everything Davinci can, but for broader examples, Ada is quite good.

Using smaller models in classification tasks with Semantic Search is one way to unlock their potential. Another is to write prompts for them that focus their attention on a task and show them how to complete it.
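The classification-by-search idea can be sketched roughly like this: treat each label as a small document, score every label against the tweet, and pick the highest-scoring one. This is only an illustration of the pattern. The scoring function is passed in as a parameter, and the word_overlap scorer below is a toy stand-in I made up so the sketch runs on its own; in practice the scores would come from the model’s semantic search, and a toy scorer like this one will not detect sarcasm.

```python
from typing import Callable, Mapping


def classify_by_search(
    text: str,
    label_docs: Mapping[str, str],
    score: Callable[[str, str], float],
) -> str:
    """Return the label whose document scores highest against the text."""
    return max(label_docs, key=lambda name: score(label_docs[name], text))


def word_overlap(document: str, query: str) -> float:
    """Toy scorer (assumption): fraction of document words that appear in the query."""
    doc_words = set(document.lower().split())
    query_words = set(query.lower().split())
    return len(doc_words & query_words) / len(doc_words)


# Hypothetical label documents, written only for this example.
label_docs = {
    "positive": "love great happy",
    "negative": "hate awful broken",
}

tweet = "I love this phone and the battery is great"
print(classify_by_search(tweet, label_docs, word_overlap))  # positive
```

Because the model only has to rank a handful of labels rather than generate free text, this kind of task plays to a small, literal model’s strengths.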

Here’s an example of Curie providing customer feedback analysis by giving it concrete questions to answer. Notice what I didn’t do: I didn’t give it any examples of how to do the task. This is counterintuitive to how a lot of people design prompts.
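A prompt of that shape might be assembled along these lines. The wording, the helper name and the sample questions are my own assumptions for illustration, not the exact prompt from the example; the point is that the prompt contains the feedback and a short list of concrete questions, with no worked examples.

```python
from typing import List


def build_feedback_prompt(feedback: str, questions: List[str]) -> str:
    """Build a zero-shot prompt: the feedback, then numbered questions, no examples."""
    lines = [
        "Customer feedback:",
        feedback.strip(),
        "",
        "Answer the following questions about the feedback:",
    ]
    for number, question in enumerate(questions, start=1):
        lines.append(f"{number}. {question}")
    # End on "1." so the model continues by answering the first question.
    lines += ["", "Answers:", "1."]
    return "\n".join(lines)


# Hypothetical feedback and questions, for illustration only.
prompt = build_feedback_prompt(
    "The app is fast, but checkout crashed twice on my phone.",
    [
        "What did the customer like?",
        "What problem did the customer report?",
        "Is the customer likely to buy again?",
    ],
)
print(prompt)
```

Ending the prompt on “1.” is a small design choice: it nudges the model to start answering the first question instead of adding more questions of its own.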

Developers sometimes accidentally decrease performance with smaller models by putting too many examples in their prompts. In deep learning we’re taught that more data is usually better, but when you’re using a well-trained model to perform a task it already understands, more examples can actually confuse the model, because it starts to generate responses based upon the ones it has observed in the prompt. The really small models will repeat answers from previous examples because they assume those are the only ones they’re allowed to use.
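One rough way to spot this kind of parroting is to check how often a model’s outputs are exact copies of the answers used in the prompt’s examples. The check below is a hypothetical diagnostic of my own, not something from OpenAI’s tooling, and the sample strings are invented.

```python
from typing import Iterable, List


def echoed_completions(example_answers: Iterable[str], completions: Iterable[str]) -> List[str]:
    """Return completions that repeat an example answer (ignoring case and outer whitespace)."""
    seen = {answer.strip().lower() for answer in example_answers}
    return [c for c in completions if c.strip().lower() in seen]


def echo_rate(example_answers: Iterable[str], completions: Iterable[str]) -> float:
    """Fraction of completions that are copies of an example answer."""
    completions = list(completions)
    if not completions:
        return 0.0
    return len(echoed_completions(example_answers, completions)) / len(completions)


# Hypothetical data: answers used in the prompt, and outputs collected from a small model.
example_answers = ["Shipping delay", "Broken screen"]
completions = ["shipping delay ", "Battery drains fast", "Broken screen", "Rude support"]
print(echo_rate(example_answers, completions))  # 0.5
```

A high rate is only a warning sign for open-ended tasks. If the task is classification over a fixed set of labels, repeating those labels is exactly what you want.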

I’ve been encouraging developers to give the smaller models another look and spend a little time understanding their theory of mind. The tricks we’ve discovered are just the tip of the iceberg for what I suspect is possible with smaller models, and I’m excited to see what happens when a lot of clever people start playing around with these fast and efficient models.

Three tips for working with smaller models:

1. Describe the task clearly and in as few words as possible.

2. Show the model how to respond to a task in one shot.

3. Only use multiple examples if it’s not obvious how to respond to the task, and be sure to include counter-examples to avoid bias towards one way of responding.
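Put together, the three tips might look something like the sketch below. The “Input:”/“Output:” layout, the helper names and the sample review are assumptions for illustration; any consistent layout should do. The is_balanced helper is one simple way to check that a multi-example prompt includes counter-examples.

```python
from collections import Counter
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, expected answer)


def build_prompt(task: str, examples: List[Example], query: str) -> str:
    """A short task description, the worked example(s), then the new input."""
    parts = [task.strip(), ""]
    for text, answer in examples:
        parts += [f"Input: {text}", f"Output: {answer}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


def is_balanced(examples: List[Example]) -> bool:
    """True if at least two different answers appear and their counts differ by at most one."""
    counts = Counter(answer for _, answer in examples)
    return len(counts) > 1 and max(counts.values()) - min(counts.values()) <= 1


# Tips 1 and 2: a brief task description and a single example (one shot).
one_shot = build_prompt(
    "Say whether the review recommends the product.",
    [("Works perfectly, would buy again.", "Yes")],
    "Stopped charging after a week.",
)
print(one_shot)

# Tip 3: if several examples are needed, include counter-examples.
print(is_balanced([("Works perfectly.", "Yes"), ("Arrived broken.", "No")]))  # True
print(is_balanced([("Works perfectly.", "Yes"), ("Love it.", "Yes")]))        # False
```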

Cover photo by KEHN HERMANO from Pexels