Book Review: Prompt Engineering for LLMs (O'Reilly early release)


I just finished the early release of Prompt Engineering for LLMs by John Berryman and Albert Ziegler. It’s a great book that I’d recommend reading if you work with LLMs. It’s available on the O’Reilly learning platform today.
Creatively using logprobs
The part I found most interesting in the book was about logprobs. Here’s how the book explains logprobs:
…if you peel back one more layer of the onion that is the LLM, it turns out that actually, it computes the probability of all possible tokens before choosing a single one… Many models will share these probabilities with you. They are typically returned as logprobs, i.e. the natural logarithm of the token’s probability. The higher the logprob, the more likely the model considers this token to be.
I’ve inspected the logprobs of OpenAI models a couple of times just out of curiosity, and seeing them can help you understand how the LLM works. Little did I know there’s a handy trick with logprobs: they can be used to judge how confident the model is in its completion.
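For concreteness, here’s a minimal sketch of what that inspection can look like with the OpenAI Python SDK. The parameter names (logprobs, top_logprobs) and the response shape reflect my understanding of the current chat completions API and may differ for other providers or SDK versions; the model name is just a placeholder.

```python
from math import exp

from openai import OpenAI  # assumes the v1+ OpenAI Python SDK

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat model that returns logprobs
    messages=[{"role": "user", "content": "Finish this sentence: The sky is"}],
    logprobs=True,    # return the logprob of each generated token
    top_logprobs=5,   # also return the 5 most likely alternatives per position
    max_tokens=5,
)

# Each generated token comes back with its logprob and the top alternatives.
for token_info in response.choices[0].logprobs.content:
    print(f"chose {token_info.token!r} (p={exp(token_info.logprob):.2f})")
    for alt in token_info.top_logprobs:
        print(f"  candidate {alt.token!r} (p={exp(alt.logprob):.2f})")
```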
The logprobs of each token prediction tell you how the model chose the next token: it considers a list of possible next tokens and assigns a probability to each, and the higher the probability of the chosen token, the more confident the model is. The authors’ trick is to use those logprobs to measure how confident the model was in its completion, which in turn tells you how good the completion is likely to be. Here’s how the authors describe it:
As an example for a more complex measure, in the lead up to the release for GitHub Copilot, Albert experimented with assessing the quality of code generation for GitHub Copilot’s LLM engine OpenAI Codex. He found the most predictive measure to be the exponential average of the logprobs (i.e. the average of the actual probabilities instead of the logprobabilities) of the beginning of the completion – it turned out that if the model wasn’t immediately confused by an assignment, it would perform well on the rest.
Using this trick the authors were able to create a quantitative assessment of the quality of a completion without even looking at the completion itself. Pretty cool!
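Here’s a small sketch of that scoring idea, assuming you already have the list of per-token logprobs for a completion. The 10-token window is my own arbitrary choice for illustration, not a value from the book.

```python
from math import exp

def completion_confidence(logprobs: list[float], window: int = 10) -> float:
    """Exponential average of the first `window` token logprobs:
    convert each logprob back to a probability, then take the mean."""
    head = logprobs[:window]
    return sum(exp(lp) for lp in head) / len(head)

# A completion whose opening tokens were all high-probability choices scores
# near 1.0; a hesitant start scores much lower.
print(completion_confidence([-0.05, -0.10, -0.02, -0.08]))  # ~0.94
print(completion_confidence([-2.30, -1.90, -2.70, -0.10]))  # ~0.31
```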
Unfortunately, logprobs are not widely supported by commercial foundation models. As far as I’m aware, only OpenAI provides logprobs. Additionally, when I spoke to one of the authors, they mentioned that logprobs may become a thing of the past, because they could be used as a shortcut for model distillation.
Using temperature and logprobs to get better completions
The authors provide another technique that involves the temperature setting of an LLM:
…if you’re willing to invest some compute cost for higher quality, you can set the temperature higher than 0, generate several completions in parallel, and take the one that looks most promising according to their logprobs.
Another very cool technique! I don’t think I have ever used temperature in production. On challenging tasks I could see using temperature + logprobs + multiple completions as a way to improve results.
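Here’s a rough sketch of how that might look with the OpenAI chat completions API: sample several completions at a non-zero temperature, score each one by the exponential average of its early logprobs, and keep the winner. As before, the model name and the 10-token window are placeholders, and the response shape is my assumption about the current SDK.

```python
from math import exp

from openai import OpenAI

client = OpenAI()

def head_confidence(choice, window: int = 10) -> float:
    """Exponential average of the first few token logprobs of one choice."""
    logprobs = [t.logprob for t in choice.logprobs.content[:window]]
    return sum(exp(lp) for lp in logprobs) / len(logprobs)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": "Write a haiku about compilers."}],
    temperature=0.8,  # non-zero temperature so the samples actually differ
    n=5,              # several completions in parallel
    logprobs=True,
)

# Keep the completion whose opening tokens the model was most confident about.
best = max(response.choices, key=head_confidence)
print(best.message.content)
```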
Using three characters, instead of two, in your prompts
Another part I found interesting was the authors’ results with having three characters in their prompts instead of two:
It takes two to have a conversation, but in several of our tests, we found we got better results if we made up three people: Two are active roles in the transcript the pseudo-document constitutes, while the third one stays in the background. The speaking roles are the advice seeker and the advice giver. Let’s call the third one the advice subject.
The idea here is to create a chat-like structure using a chat-trained LLM. The “user” and the “assistant” have a conversation about a third person, the subject. The user asks the model to complete a task for that third person and describes what they like and dislike; the assistant then uses those details in its completion. This can produce better results than making the user the subject.
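As a concrete illustration (my own made-up example, not one from the book), the message structure might look something like this, with “Dana” playing the background advice-subject role:

```python
# Two speaking roles (user = advice seeker, assistant = advice giver),
# plus a third person, the advice subject, who never speaks.
messages = [
    {"role": "system", "content": "You are a thoughtful travel planner."},
    {
        "role": "user",
        "content": (
            "I'm planning a weekend trip for my friend Dana. "
            "Dana loves hiking and small museums, dislikes crowds, "
            "and is vegetarian. Suggest a two-day itinerary for Dana."
        ),
    },
]
# The assistant's completion gives advice *about* Dana, drawing on the
# preferences the user described, rather than advice about the user.
```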
Language
This part I found funny, and it echoes other discussions I’ve seen. The authors recommend asking for positives instead of negatives: “Thou shalt not kill” vs. “Thou shalt preserve life”. You can make it even better with reasoning: “Thou shalt not kill, since the act of killing disrespects the other person’s right to life.” And here’s the funny part: the authors suggest avoiding absolutes. “Thou shalt not kill” vs. “Thou killest only rarely.”
Online validation
Another part I found very cool is the “online validation” that the GitHub Copilot team did. If you’re not familiar, online validation is different from “offline validation”, AKA your evals. Online validation uses real feedback from users to assess the quality of LLM completions. Here’s what they measured:
For GitHub Copilot code completions, we measure how often completions are accepted and we check to see if users are going back and modifying our completions after accepting them.
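This isn’t Copilot’s actual pipeline, but a minimal sketch of what tracking those two signals could look like if you logged completion events yourself:

```python
from dataclasses import dataclass

@dataclass
class CompletionEvent:
    """One completion shown to a user, and what happened to it."""
    accepted: bool = False
    modified_after_accept: bool = False

def online_metrics(events: list[CompletionEvent]) -> dict[str, float]:
    shown = len(events)
    accepted = sum(e.accepted for e in events)
    modified = sum(e.modified_after_accept for e in events)
    return {
        "acceptance_rate": accepted / shown if shown else 0.0,
        # Of the accepted completions, how many did users then rewrite?
        "post_accept_modification_rate": modified / accepted if accepted else 0.0,
    }

events = [
    CompletionEvent(accepted=True),
    CompletionEvent(accepted=True, modified_after_accept=True),
    CompletionEvent(),  # shown but rejected
]
print(online_metrics(events))  # acceptance ~0.67, post-accept modification 0.5
```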
A good reminder for all of us that our evals are not the full story. It’s important to get real feedback from your users to supplement the data you gather from evals. I’ll be blogging more about this shortly.
The Little Red Riding Hood Principle
Another interesting part of the book is something I knew intuitively but don’t spend much time thinking about. The general principle is that, for best results, the prompt should closely resemble the training set. This is called the “Little Red Riding Hood Principle.” Here’s how the authors describe it:
But for our purposes, the point is simple: don’t stray far from the path upon which the model was trained. The more realistic and familiar you make the prompt document, then the more likely that the completion will be predictable and stable… For now, it suffices to say that you should always mimic common patterns found in training data. Fortunately, there are endless patterns to draw from. For completion models, see if you can make the prompt resemble computer programs, news articles, tweets, markdown documents, communication transcripts, etc.
An interesting takeaway for me here concerns code completion tasks. In the past I’ve just written a prompt that asks the model to produce some code. Perhaps it’s better to structure the prompt like a Jupyter notebook, which could help nudge the model into the right part of its parameter space for the task at hand. I’ll definitely give this a try soon!
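Here’s a rough sketch of that idea (my own formatting, not an example from the book): instead of a bare instruction, present the task as a notebook-style document the model has seen countless times in its training data and let it continue from there.

```python
# Instead of a bare instruction like "Write a function that normalizes a
# dataframe column", frame the request as a notebook the model can continue.
prompt = '''# Data cleaning notebook

# %% [markdown]
# ## Normalize the `price` column to the range [0, 1]

# %%
import pandas as pd

df = pd.read_csv("listings.csv")

# %%
def normalize_price(df: pd.DataFrame) -> pd.DataFrame:
'''
# Sending `prompt` to a completion model invites it to finish the function in
# the familiar "notebook" register rather than answering in free-form prose.
```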
Final thoughts
This book gave me a lot of new techniques for working with LLMs, and I’m excited to put that knowledge to use. And by the way, there’s a lot more in the book than what I shared here! Please make sure you go read it. It’s a relatively quick read!