
How To Write A “Good” Prompt?

Discover the art of crafting effective prompts for Large Language Models with essential techniques, general recommendations, and prompt structure insights

Introduction

The emergence of Large Language Models (LLMs) such as ChatGPT, Llama, Anthropic's Claude, PaLM, and GPT-4 has progressively given birth to a new field of Data Science: Prompt Engineering.

As you have probably experienced, models can be very sensitive to the instructions you provide (called prompts) and produce quite different outputs. This is because natural language remains far more expressive and subtle than a programming language, for instance.

Prompt engineering consists in elaborating a prompt that helps the AI model comprehend the task and generate a satisfying output.

In earlier times, people needed extensive knowledge of datasets and statistics to work with an ML model, whereas prompt engineering consists in interacting with LLMs using plain language, making it accessible to everyone. For this reason, prompt engineering is often perceived as a task that requires very few technical skills. However, as with pure machine learning, strong technical expertise enables the efficient generation of high-quality prompts, which can be highly beneficial for such a time-consuming and frustrating task. 🙂

Finally, beware: Prompt Tuning is sometimes, and wrongly, used in place of Prompt Engineering! Prompt tuning involves using a small trainable model to encode the text prompt and generate task-specific virtual tokens for the LLM, whereas prompt engineering refers to the process of designing text that can be interpreted and understood by a generative AI model.


Challenges

Despite appearances, prompt engineering constitutes a complex optimization task, due to the difficulty of evaluating a prompt's performance and the iterative nature of the process, which can lead to frustration. Indeed, a rigorous approach is required to avoid wasting too much time. To this day, there are very few automated processes able to build prompts for you.

While people were at first simply amazed by LLMs, they soon began developing very complex workflows and instructions, which require relevant prompts.

Prompt Engineering comes into play as soon as an LLM is called upon and may be particularly useful for the following projects, among others:

  • Summarization
  • Classification of sentences
  • Text generation from a query
  • Span annotation
  • Data generation (to train/finetune a model)

Admittedly, huge proprietary LLMs are much better than open-source models at producing the expected outputs. Moreover, as suggested by A. Gudibande, E. Wallace et al. in the paper The False Promise of Imitating Proprietary LLMs, the only reasonable way to improve open-source models is to increase the complexity of the underlying model rather than performing distillation, i.e. transferring knowledge - through feedback - from a huge LLM to a smaller one.

However, privacy issues or a limited budget may impose the use of open-source models. Moreover, in order to reduce computational and storage costs, train models faster, and facilitate the deployment and maintenance of the models, data scientists often opt for open-source models with fewer parameters. Ultimately, this stresses the importance of prompt engineering.

Fortunately, there are a few prompt engineering techniques that can turn this tedious effort into a quite profitable one.

General recommendations

The shorter and more concise the prompt, the better: reducing the context length helps the generative model better understand the user's intent, as it can more easily focus on what really matters. It can also significantly reduce the cost under per-token pricing, which is generally worthwhile when using expensive proprietary LLMs. Therefore, a good prompt engineer is first of all someone straight to the point and concise.

Avoid overly sophisticated words, especially with open-source LLMs, and opt for simpler ones. Keep in mind that any word in the prompt might trigger something for the model (especially with small open-source models)! For example, using “classify” and “labels” instead of “annotate” and “annotations” will help the model better understand the task, as the first two terms are more widespread in the Machine Learning literature.

Repeat instructions (preferably at the end of the prompt) if the model fails to follow them. Formatting instructions, in particular, can be repeated at the end: “… Write the answer in a JSON format. Do not explain anything. Just return the JSON”.

Write the prompt instructions in the same language as the desired output. This reduces the probability of obtaining an output made up of several languages. Do not hesitate to specify the desired language for the output explicitly.

Structure of a Prompt

Roles

If you work with APIs - which is recommended, in order to take advantage of several relevant parameters - there are two roles to specify for the provided instructions: the system role and the user role(s). The system prompt is generally present once, at the beginning, while there may be several user instructions.

System Prompt

Since these models were trained on a large corpus of data (and a lot of it originating from the internet), it is essential to have a well-defined task for the LLM to execute.

The system prompt stipulates the context the LLM needs to understand the task to perform, and should remain unambiguous and brief. It can help the model follow rules and long instructions, or exhibit more creative and natural behaviour, since it makes it possible to set a tone.

Besides, a well-defined system role is quite useful to fight against prompt injection (attacks that aim at replacing the original instructions with corrupted ones in order to elicit an unintended response from LLM-based tools).

To sum up, a system prompt is used for high-level context and to set the conversation’s tone.

For instance:

You are an AI chatbot that is specialised in the field of the South African insurance industry. You are tasked with answering questions from customers about generic South African insurance claims.
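In practice, a chat-style API receives this system prompt and the subsequent user instructions as a list of role-tagged messages. Here is a minimal sketch of that structure (it follows the common OpenAI-style message format; no actual API call is made, and the helper function is illustrative, not part of any library):

```python
# Build a role-tagged message list: the system prompt comes first and
# appears once; each user instruction becomes its own user message.

def build_messages(system_prompt: str, user_prompts: list[str]) -> list[dict]:
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": "user", "content": p} for p in user_prompts]
    return messages

messages = build_messages(
    "You are an AI chatbot specialised in the South African insurance industry.",
    ["What does a standard household policy typically cover?"],
)
```

The resulting list is what would be passed as the `messages` argument of a chat completion endpoint.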

User Prompts

The user prompts are the main instructions for the task. If you have complex instructions or specific formatting requirements, it is generally better to include them in the user prompts rather than relying solely on the system prompt!

To sum up, user prompts are used for specific instructions about the task itself and to guide the model towards the desired output format.

Example Prompts

If the LLM fails to understand the output format, examples given in user prompts can be helpful.

However, if the output format already looks fine, prompts should not contain examples; worse, they may even damage your outputs, and you will reach the limit of the context length more quickly. For information, the first version of Llama has a context window of only 2,048 tokens, i.e. approximately 1,000 words in French.

Parts of a prompt

From experience, a complete prompt should be organized into several coherent subparts or subprompts (more or less optional), one following the other. For instance:

  1. Task context → This corresponds to the system prompt
  2. Tone context → Set a tone adapted to the task
  3. Background data and documents → Provide the data to analyse (sentences to classify, text to summarize, …)
  4. Detailed task description and rules → Define the task in the most concise and crystal-clear way
  5. Examples → Provide examples to guide the LLM towards the output format
  6. Request → Voice the request, i.e. what the model is expected to output
  7. Step-by-step thinking procedure → This may help the model generate a more coherent response given the request
  8. Output formatting → Provide the output format at the end to increase the chances that the model adheres to it.

Make sure that each of these sections corresponds to a new user instruction, rather than being packed into one very long user prompt, in order to facilitate processing by the model.

Obviously, this is not a mandatory procedure to follow, but when the task is complex, it may help you feed all the necessary instructions to the LLM.
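The sections above can be chained mechanically: the task context becomes the system prompt, and each remaining section becomes its own user message. A small sketch (the section names and example content are purely illustrative):

```python
# Map the eight optional prompt sections to a chain of chat messages,
# one user message per section. All content below is an invented example.

PROMPT_SECTIONS = [
    ("task_context", "You are a classifier for customer support tickets."),
    ("tone_context", "Answer in a neutral, professional tone."),
    ("background_data", "Ticket: 'My card was charged twice.'"),
    ("task_rules", "Classify the ticket into one label: billing, technical, other."),
    ("examples", "Ticket: 'App crashes on login.' -> technical"),
    ("request", "Classify the ticket above."),
    ("thinking", "Think step by step before giving the label."),
    ("output_format", 'Return only a JSON object: {"label": "..."}.'),
]

def sections_to_messages(sections):
    # The first section plays the role of the system prompt; the rest
    # are sent as separate user turns.
    (_, task_context), *rest = sections
    messages = [{"role": "system", "content": task_context}]
    messages += [{"role": "user", "content": text} for _, text in rest]
    return messages

msgs = sections_to_messages(PROMPT_SECTIONS)
```

Keeping each section as its own message also makes it easy to drop or reorder sections during iteration.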

Generation Parameters

Max_Tokens

This is probably the most crucial parameter affecting the result.

It allows you to limit the needless additional tokens you would otherwise pay for when using an API. For classification purposes, where the answer should remain very concise, it is good practice to limit the output to a few tokens, let's say 3 or 5.

Beware: a token is not rigorously a word! A word is generally made up of several tokens. Here is a rule of thumb that is widespread in the NLP literature:

Language and number of tokens per word:

  • English: 1.3
  • French: 2
  • Spanish: 2.1
  • German: 2.1
  • Chinese: 2.5
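These ratios allow a quick back-of-the-envelope token budget before calling a priced API. A minimal sketch, assuming the rule-of-thumb values above (a real tokenizer such as tiktoken gives exact counts):

```python
# Rough token-count estimate from word count, using the rule-of-thumb
# tokens-per-word ratios above. Approximate by construction.

TOKENS_PER_WORD = {"english": 1.3, "french": 2.0, "spanish": 2.1,
                   "german": 2.1, "chinese": 2.5}

def estimate_tokens(text: str, language: str = "english") -> int:
    words = len(text.split())
    return round(words * TOKENS_PER_WORD[language])

estimate_tokens("Classify the following sentence", "english")  # 4 words -> ~5 tokens
```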

Temperature

The temperature is a crucial parameter to control the behaviour of generative language models.

It influences the randomness and creativity of the output and should be calibrated for the task to perform:

  • If you expect results based on existing rules, which do not call for subjectivity or creativity (text classification, marking a text, question answering, …), a low temperature - let's say 0.1 - is recommended. Beware, some models expect a strictly positive temperature.
  • If you expect more randomness and creativity, a higher temperature - let's say 0.9 - should be used. In this case, iterate over temperature values in case the randomness leads to absurd results, rather than validating too quickly.

Now let’s delve into the functioning of the temperature parameter.

Generative models are trained to predict the probability distribution of the next word given a certain context. In the output layer of the model, the temperature is used to adjust the logits before a softmax function converts them into probabilities over the predicted words:

  • when the temperature is lowered, the distribution over predicted words becomes more peaked, resulting in a higher likelihood of selecting the most probable word;
  • when the temperature is raised, the distribution becomes more uniform, so that all words become closer to equally probable.
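The mechanism above can be written out in a few lines: logits are divided by the temperature before the softmax is applied.

```python
import math

# Temperature-scaled softmax: dividing logits by T < 1 sharpens the
# distribution; dividing by T > 1 flattens it towards uniform.

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, temperature=0.1)   # near-greedy
hot = softmax_with_temperature(logits, temperature=10.0)   # near-uniform
```

With a temperature of 0.1, almost all the probability mass goes to the largest logit; at 10.0, the three probabilities become nearly equal.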

Top-P & Top-K

By setting top-p, we limit sampling to the tokens whose probabilities add up to a certain threshold. For example, setting p=0.2 means we will only pick from the top tokens whose cumulative probability reaches 20%.

This setting is quite common for LLMs; however, it is closely related to the temperature: a lower top-p tends to produce very accurate and reliable text, but also very common and expected text, whereas a higher top-p leads to more diverse and creative text, at the risk of absurd and illogical output.

For this reason, it is not recommended to tune both the temperature and top-p at the same time.

Similarly, top-k is a commonly used strategy that samples from the k most likely tokens. At each step, the model assigns a probability/likelihood score to each token, and we restrict sampling to the top-k of these tokens.
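Both strategies can be sketched as simple filters over a token probability distribution, applied before sampling (the token names and probabilities below are invented for illustration):

```python
# Top-k and top-p (nucleus) filtering over a token distribution.
# The surviving tokens are renormalised before sampling.

def top_k_filter(probs: dict, k: int) -> dict:
    """Keep only the k most probable tokens, renormalised."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "tree": 0.05}
top_k_filter(probs, 2)    # only "cat" and "dog" survive
top_p_filter(probs, 0.7)  # "cat" + "dog" already reach 0.8 >= 0.7
```

Note how, unlike top-k, the size of the nucleus kept by top-p adapts to how peaked the distribution is.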

Presence Penalty

This parameter controls the presence of certain words or phrases in the generated text: a higher value encourages the model to vary its words and thus explore new subjects, making it more unpredictable, whereas a lower presence penalty means more repetition and less exploration.

A presence penalty of 0.0 can lead to a very monotonous output text !

Frequency Penalty

The frequency penalty functions similarly to the presence penalty, except that the penalty scales with how often a token has already appeared: frequently repeated tokens are penalized more heavily, whereas the presence penalty applies equally to any token that has appeared at least once.
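A small sketch of how the two penalties act on logits, following the commonly documented OpenAI-style formulation (the logit values and tokens are invented for illustration):

```python
from collections import Counter

# Presence penalty: flat deduction for any token already generated.
# Frequency penalty: deduction proportional to how often it appeared.

def apply_penalties(logits, generated_tokens, presence_penalty=0.0,
                    frequency_penalty=0.0):
    counts = Counter(generated_tokens)
    penalised = dict(logits)
    for token, count in counts.items():
        if token in penalised:
            penalised[token] -= presence_penalty           # once per seen token
            penalised[token] -= frequency_penalty * count  # grows with repetition
    return penalised

logits = {"the": 3.0, "a": 2.0, "fox": 1.0}
out = apply_penalties(logits, ["the", "the", "a"],
                      presence_penalty=0.5, frequency_penalty=0.2)
# "the": 3.0 - 0.5 - 0.2*2 = 2.1 ; "a": 2.0 - 0.5 - 0.2 = 1.3 ; "fox" untouched
```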

Homemade interface for prompt engineering purposes

At Lettria, as part of the implementation of a RAG (Retrieval-Augmented Generation) technology, prompt engineering was a major step in generating training data to subsequently finetune LLMs.

For this project, an open-source model was used to meet the client's requirements and was finetuned on triplets of questions, chunks (portions of text), and answers using Transformer Reinforcement Learning (TRL). More precisely, there were four kinds of answers:

  • A bullet-point list enumerating true facts from the chunk [MapAnswerWriter] (intermediary answer)
  • A bullet-point list enumerating incorrect facts from the chunk [MapBadAnswerWriter] (intermediary answer)
  • An answer in natural language obtained from the bullet points [ReduceAnswerWriter]
  • A final incorrect answer in natural language obtained from the bullet points [ReduceBadAnswerWriter]

A chain of prompts was thus established from a designated chunk. Intermediary outputs, such as questions, requests, mapped_answers, and more, collectively form what we refer to as the context.

We use success definers to validate or re-execute the intermediary prompts when the output obtained is not satisfying. Simple success definers can check, for instance, the language of the output or the presence of patterns in it. Unfortunately, very clever proprietary LLMs could not be used to assess the results for this project, because of privacy constraints.
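A pattern-based success definer paired with a retry loop can be sketched as follows (the `generate` callable stands in for an actual LLM call, and the JSON-label pattern is an invented example, not Lettria's implementation):

```python
import re

# A "success definer" is a predicate over the model output that decides
# whether to accept it or re-run the prompt.

def json_label_definer(output: str) -> bool:
    """Accept outputs containing a JSON-like {"label": ...} pattern."""
    return re.search(r'\{\s*"label"\s*:\s*"[^"]+"\s*\}', output) is not None

def generate_with_retries(generate, definer, max_attempts=3):
    for _ in range(max_attempts):
        output = generate()
        if definer(output):
            return output
    return None  # every attempt failed the success definer

# Toy generator whose first attempt is malformed, second is valid:
attempts = iter(['label: billing', '{"label": "billing"}'])
result = generate_with_retries(lambda: next(attempts), json_label_definer)
```

The same loop accommodates other definers, such as a language-detection check on the output.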

For information, here is the methodology followed to optimize the drafting of prompts and the selection of a model among those available (Llama-7B, Llama-13B, and Llama-70B), with the constraint of choosing the smallest model to reduce generation and storage costs:

  • Start with the smallest model for obvious price and storage constraints.
  • Write about 5 prompts following the general recommendations above, and iteratively rephrase them through trial and error.
  • Select the best one: if the output is only slightly incorrect, the prompt is probably fine but the model is likely not complex enough. If the answer is completely off the mark, then the initial prompt needs to be redefined.
  • Repeat the process with a larger model and so on.
  • Once you have settled on a model and a prompt, challenge them with a new batch of samples to ensure everything is in order.

To facilitate this whole experimental procedure, an interface was developed to create chains of prompts and custom success definers. It also allows you to visualize and store the generated outputs while monitoring the costs incurred.

Conclusion

As shown in this article, being a good prompt engineer requires solid foundational knowledge and practice, given the huge complexity of the underlying models. A rigorous approach, combined with a strong ability for reformulation and synthesis, is also an important asset for a prompt engineer.

The choice of the LLM to use and the trade-off between performance and cost is also constantly at the heart of the discussions.
