Description
1 Exploration of Text Generation Parameters and Probabilities (12 points)
Review the Hugging Face documentation on auto-regressive text generation, which you
encountered to a limited extent in Assignment 3; it can be found here. In particular, take a
copy of the code titled Multinomial Sampling.
Make sure you can run that code, which downloads a (medium-sized) GPT2 model and uses it to
generate text given the specific input prompt.
The model.generate function (which has the same goal as the generate function you worked with
in Assignment 3) has many parameters, including temperature and top_p as described in class.
These parameters influence the sampling of the output probabilities which are used to select each
word in turn in the auto-regressive generation. There are two other parameters to become familiar
with, which control how many tokens are generated: max_length and stopping_criteria. There
are also parameters like repetition_penalty, which penalizes repeated words and reduces their
occurrence if the penalty is set above 1. You can review all of the parameters of the generate
function here.
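To build intuition before running the real model, the effect of temperature and top_p on a single next-token distribution can be sketched in plain Python. This is a minimal illustration on made-up logits for a four-token vocabulary, not the library's implementation; the helper names softmax and top_p_filter are chosen here for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy logits for a four-token vocabulary (made-up numbers).
logits = [2.0, 1.0, 0.5, -1.0]
sharp = softmax(logits, temperature=0.5)   # low temperature: more peaked
flat = softmax(logits, temperature=2.0)    # high temperature: flatter
nucleus = top_p_filter(softmax(logits), top_p=0.8)

print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])
print("top_p=0.8 keeps token ids:", sorted(nucleus))
```

Sampling then draws the next token from the filtered, renormalized distribution, so both parameters change which tokens can be chosen and how often.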
Consider the following input sequence: “It is important for all countries to try harder to reduce
carbon emissions because”. In the following steps you are asked to both generate subsequent words
from that input context using that GPT2 model, and to explore the probabilities that are generated.
1. To get a sense of the influence of some of the generation parameters, explore at least 10
combinations of temperature and top_p to generate a maximum of 30 tokens using auto-regressive generation with the GPT2 model. Comment on how the generation differs across
the range of parameters that you have selected. You must choose your own range, and you’ll
have to do some exploration to do that; you are free to explore other parameters if you wish.
[5 points]
2. Modify the code to output the probability of each word that is generated. You’ll need to
set these two generate parameters: return_dict_in_generate=True and output_scores=True,
and extract the probabilities that come in the returned dictionary one call at a time. Provide
a table that shows these probabilities, similar to Assignment 3. Comment on the probabilities.
[2 points]
3. Write new code that generates the probability tree (like the one drawn on the board in
Lecture 6 and shown on page 6-5 of the notes in Lecture6 part1) using the treelib package
that you can find here. Generate the tree for the above sequence as input, providing the top 3
probabilities for each word position, as far as is practical to see, and submit that as part of the
answer to this question. (You’ll have to apply some common sense here to visualize the tree).
Comment on what you see in the tree. Is the tree affected by the top_p parameter or the
temperature parameter? Why or why not? Submit your full code that runs the generation
and builds and outputs the tree in the file A4_1_3.py. [5 points]
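The shape of the tree-building logic in step 3 can be sketched without the model or treelib. In this minimal illustration, toy_next_token_probs is a hypothetical stand-in that returns fixed, made-up probabilities where the GPT2 model would go, and nested tuples stand in for treelib nodes; your submission still needs to use the real model and the treelib package.

```python
def toy_next_token_probs(context):
    """Hypothetical stand-in for a language model.
    Returns made-up probabilities; a real model would depend on `context`."""
    return {"the": 0.5, "of": 0.2, "we": 0.15, "climate": 0.1, "it": 0.05}

def top_k(probs, k=3):
    """Return the k most likely (token, probability) pairs, most likely first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

def build_tree(context, depth, k=3):
    """Expand the top-k next tokens at each position, down to `depth` levels.
    Each node is a (token, probability, children) tuple."""
    if depth == 0:
        return []
    children = []
    for token, prob in top_k(toy_next_token_probs(context), k):
        subtree = build_tree(context + [token], depth - 1, k)
        children.append((token, prob, subtree))
    return children

def render(tree, indent=0):
    """Flatten the tree into indented text lines, one per node."""
    lines = []
    for token, prob, children in tree:
        lines.append("  " * indent + f"{token} ({prob:.2f})")
        lines.extend(render(children, indent + 1))
    return lines

tree = build_tree(["because"], depth=2, k=3)
tree_lines = render(tree)
print("\n".join(tree_lines))
```

Full expansion grows as 3 to the power of the depth, which is why the question asks you to use common sense about how much of the tree to show.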
2 Accessing OpenAI, Setting Limits, Learning the API
For the remainder of this assignment, you’ll be using models from OpenAI – through the OpenAI
Playground and also directly using the OpenAI API. If you do not have an OpenAI account already,
you should sign up for the OpenAI playground here: https://platform.openai.com/playground. You
will get a certain amount (I believe $5) of tokens for free. You will need to put in your own
credit card to use the playground, and so the work in this assignment will incur expenses. However,
the cost per token, even for the most expensive model, is 3 to 6 USD cents per 1000 tokens, and
none of the work in the sections below should incur more than $1 USD. That said, you will be
programmatically accessing the OpenAI API, which means you could accidentally put
your code into a loop that could incur significant expenses.
To prevent that from happening, go to the playground page, click your account in the upper
right-hand side of the playground, and then select Manage Account. Then select Billing from the
right-hand side menu, and then select Usage Limits, shown on the screen. As you can see, there is
a soft limit to the amount of money spent (which generates a warning) and a hard limit that will
cause all requests to be rejected. Set your limits to be $1. Please talk to the instructor if this is
problematic in any way.
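In addition to the account limits, a client-side cap in your own code can stop a runaway loop early. The following is a minimal sketch under stated assumptions: CallBudget and BudgetExceeded are hypothetical names, fake_send is a stub standing in for a real API call and is assumed to return the reply text together with the number of tokens used, and the price of 0.06 USD per 1000 tokens is taken from the upper end of the range quoted above.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the client-side cap on calls or estimated cost is reached."""

class CallBudget:
    def __init__(self, max_calls=50, max_cost_usd=1.00, usd_per_1k_tokens=0.06):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.usd_per_1k_tokens = usd_per_1k_tokens
        self.calls = 0
        self.spent_usd = 0.0

    def call(self, send, prompt):
        """Refuse to send if a cap has been reached; otherwise send and record the cost."""
        if self.calls >= self.max_calls or self.spent_usd >= self.max_cost_usd:
            raise BudgetExceeded(
                f"stopped after {self.calls} calls, about ${self.spent_usd:.2f} spent"
            )
        text, tokens_used = send(prompt)
        self.calls += 1
        self.spent_usd += tokens_used / 1000 * self.usd_per_1k_tokens
        return text

def fake_send(prompt):
    """Stub in place of a real API call: returns (reply text, tokens used)."""
    return "stub reply", 500

budget = CallBudget(max_calls=3, max_cost_usd=1.00, usd_per_1k_tokens=0.06)
replies = []
try:
    while True:  # a deliberately runaway loop
        replies.append(budget.call(fake_send, "hello"))
except BudgetExceeded as stop:
    print(stop)
```

The cost tracked here is only an estimate from the quoted price range; the hard limit on your account remains the real safeguard.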
For some of your work in the following sections, you are required to use the OpenAI API to access
the models, rather than the highly manual process of cutting and pasting to the playground. To
learn how to do that, read https://platform.openai.com/docs/introduction and
https://platform.openai.com/docs/quickstart?context=python, and perhaps more information in the documentation
section of OpenAI’s website.
3 Prompt Engineering Methodology
Recall the discussion in Lecture 7 about Prompt Engineering – the design of English-language statements to stimulate a large language model to perform a specific task. The methodology discussed
on page 7-6 of Lecture 7 is reprised below, and improved. The context of the methodology is one you
are familiar with: you have a task (which could be generation of text or classification of text)
and an input data example that you wish to either generate from, or to classify. Call that example
a training example, and if it is for classification, then it would come with a correct label. If it is for
generation then it won’t necessarily come with a label, but it might come with a ’correct’ generation.
Assume also that you’ve got 30 training examples, and divide them into 10 examples for ‘training’
as described below, and the remaining 20 as your hold-out test set. One essential insight is that
prompt engineering still requires that there be a training set, and a hold-out test-set, but the good
news is the number needed is much smaller than needed in either fine-tuning a pre-trained
model, or training a model from scratch.
General Prompt Engineering Method
1. Write down your criterion for what makes the output acceptable, in English. This may involve
looking up the definition of one or more words of the goal, or exploring other online resources
so that you clearly understand it. This applies to both generation and classification.
2. Draft a Prompt that uses that criterion to direct the model to generate what you want.
3. Using just one input example (which in classical machine learning training would be called a
training example) run the prompt and the example to see how well it works, in the OpenAI
playground, on the model.
4. Keep evolving the prompt, using English, to make it work perfectly. This means changing the
prompt to correct anything that is wrong in the output. However, do not use language that is
specific to the input data, as that won’t generalize. Notice that the concept of generalization
still applies here. In the prompt engineering world, it will be helpful to cultivate an ability to
write clearly, with good use of language and meaning of words.
5. Once you’ve made it work for one input example, make it work for two, using the same
method – using a prompt with general words, not specific to the example, but correcting any
issues seen on the second example. See the example used in Lecture 7 for some insights.
6. Then, try the prompt on five more training input examples all at once, and label the outputs
as good/not good with respect to your criterion. Evolve the prompt to succeed on all five.
7. Test. Run the prompt on your 20 hold-out test examples. Measure success rate yourself, i.e. with
human labelling. Report your success rate.
8. If the success rate is below 100% you could choose to iterate once again, adjusting the prompt
to correct the non-successful outcomes.
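The bookkeeping in steps 6 through 8 can be sketched as follows. This is only an illustration: build_prompt, success_rate and failures are hypothetical helper names, the "Input:/Output:" layout is one assumed way to combine a prompt with an example, and the instruction text, examples and labels are made up.

```python
def build_prompt(instructions, example):
    """Combine the general instructions with one input example.
    The Input/Output layout is an assumption, not a required format."""
    return f"{instructions.strip()}\n\nInput: {example.strip()}\nOutput:"

def success_rate(labels):
    """Fraction of outputs judged acceptable (1 = good, 0 = not good)."""
    if not labels:
        raise ValueError("no labels to score")
    return sum(labels) / len(labels)

def failures(examples, labels):
    """Examples whose output was labelled not good, to guide the next revision."""
    return [ex for ex, ok in zip(examples, labels) if ok == 0]

# Made-up examples and human labels, for illustration only.
examples = ["first input", "second input", "third input", "fourth input"]
human_labels = [1, 1, 0, 1]

prompt = build_prompt("Rewrite the statement as described. ", " first input ")
rate = success_rate(human_labels)
to_fix = failures(examples, human_labels)
print(prompt)
print(f"success rate: {rate:.0%}; revisit: {to_fix}")
```

Keeping the instructions separate from the example makes it easier to check that the prompt itself contains nothing specific to any one input.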
There are many suggestions online about how to evolve prompts, including the Medium article
survey that was posted with Lecture 7 on Quercus (PromptEngineering Medium.pdf).
4 Prompt Engineering for Generation of Soft, Non-Expert Therapeutic Statements (14 points)
In therapeutic counseling, it is often important to make observations based on a patient’s words
that give insight to the patient. However, it is known that making direct and strong statements
(e.g. “You are addicted to cigarettes”) can be received as accusations or unwanted labels, which
a patient will likely reject. This is made more problematic if the language used by the therapist
implies that they are superior or in a higher-up position of some kind, either as an expert or
through an implicit position above the patient in some kind of hierarchy, so that their words are
correct simply because of that position. (For example, to say “You’re afraid of being judged by
your family,” is an authoritative statement of fact).
It is known that therapy is more effective if such statements are expressed as possibilities,
not facts, perhaps converted to questions. In addition, statements that suggest the patient knows
the truth better than the therapist (it is the patient’s life after all) are also better. In
general, encouraging a patient to come to their own conclusion is known to be more successful.
For example, rather than say “You’re afraid of being judged by your family,” a softened statement
might become “I might be wrong, but is it possible that you’re worried that your family will judge
you?” In the latter statement the speaker is both ceding expertise to the patient, and more gently
suggesting a possibility, rather than strongly stating a fact.
This softened statement is part of a broader therapeutic approach called Motivational Interviewing, which, if you’re interested, you can read about here: https://www.ncbi.nlm.nih.gov/books/NBK571068/.
Your task is to engineer a prompt for GPT-4 to convert direct statements to what we will call
softened, non-expert statements.
The dataset given in the file DirectStatements.csv has a set of examples consisting of statements that are too direct, and are to be softened. These statements came from a variety of places
in behaviour change counseling, generally relating to addictions.
Follow the steps below, which are modelled on the methodology given in Section 3.
1. Write, in your own words (not those above), a clear definition of what it means to convert a
statement into a softened, non-expert version. Report your definition. [2 points]
2. Follow steps 2 through 4 of the methodology of Section 3 using the GPT-4 model in the
OpenAI playground to develop a prompt for the softening conversion. This means you work
only on the first training example (in row 1) in the dataset, until you are satisfied that the
result is good. Report the prompt that you arrived at in step 4. Produce three different
softened versions of the first example, and say for each why it meets your definition. You
may need to explore different parameters to get different results, or maybe just pushing the
button more than once gives a different answer. [4 points]
3. Generate a result on the second item (row 2), and explain how it meets your definition. [1
point]
4. Give the results on the next five (rows 3-7), and any changes you make to the prompt to
make them all succeed. [1 point]
5. Using the OpenAI API that you read about in Section 2 (and not the playground), run the
remaining 23 examples and determine, by hand, if they meet your criterion. Submit a csv file
that contains three columns: the first for the input statement, the second for the produced
output, and the third for a label that indicates if the output meets the criterion
(label 1) or not (label 0). Name that file A4_4_4.csv. Report your resulting success rate,
and which result you think is the best, and which is the worst. Provide the full code that you
used in a python file A4_4_4.py. [6 points]
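The three-column file and the success rate can be produced with Python's csv module. The sketch below is a minimal illustration: the header names are an assumption (the assignment only fixes the column order), the first row reuses the example from the text above, the second row is made up to show a failing output, and an in-memory buffer stands in for the real file.

```python
import csv
import io

HEADER = ["input", "output", "label"]  # assumed column names

def write_results(rows, handle):
    """Write (input, output, label) rows to an open text handle as CSV."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    writer.writerows(rows)

def success_rate(rows):
    """Fraction of rows whose hand-assigned label is 1."""
    return sum(label for _, _, label in rows) / len(rows)

rows = [
    ("You're afraid of being judged by your family.",
     "I might be wrong, but is it possible that you're worried "
     "that your family will judge you?",
     1),
    # Made-up failing case, for illustration only.
    ("You drink too much.", "You clearly drink too much.", 0),
]

buffer = io.StringIO()  # stands in for the real output file
write_results(rows, buffer)
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
rate = success_rate(rows)
print(buffer.getvalue())
print(f"success rate: {rate:.0%}")
```

To write the actual submission file, the same function can be called inside `with open("A4_4_4.csv", "w", newline="") as f:`; the csv module handles quoting of outputs that contain commas.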
5 Prompt Engineering for Classification of “Softness” (8 points)
In Lecture 7 (on page 7-5) we also discussed and illustrated a prompt that turned a large language
model into a classifier. In this section you will create a classifier that determines if a statement has
the softness feature described in Section 4.
1. Create a dataset that combines both the input statements from Section 4 and all of your
outputs for all 30 examples, with labels. Assume that all of the inputs in the original file
DirectStatements.csv would have a label 0, and use your assigned label from Part 4 of
Section 4. Submit the full dataset in a file named A4_5_1.csv [1 point].
2. Using the method described in Section 3, and the combined data set you just created, evolve
a prompt using 6 examples (3 negative, 3 positive). Show the prompt and give the success
rate across those 6 examples. DO NOT USE Chain of Thought Prompting, as this will be
the subject of Section 6. [3 points]
3. Using the OpenAI API, run the remaining examples and give the success rate. Attempt to
explain any incorrect results. [4 points]
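When the classifier is run through the API, each reply has to be turned into a 0 or 1 before it can be scored. The sketch below assumes the prompt asks the model to answer with a single digit; parse_label and accuracy are hypothetical helper names, and the replies and gold labels are made up. Replies that do not parse are counted as incorrect here, which is one possible choice.

```python
def parse_label(reply):
    """Return 0 or 1 if the reply is just that digit (ignoring surrounding
    whitespace and a trailing period); otherwise return None."""
    cleaned = reply.strip().rstrip(".")
    return int(cleaned) if cleaned in {"0", "1"} else None

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels.
    A prediction of None never matches, so it counts as an error."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels differ in length")
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)

# Made-up model replies and gold labels, for illustration only.
replies = ["1", " 0\n", "1.", "soft"]
gold = [1, 0, 0, 1]

predictions = [parse_label(r) for r in replies]
score = accuracy(predictions, gold)
print(predictions, f"accuracy: {score:.0%}")
```

Listing the unparsed and mismatched cases alongside the inputs is a natural starting point for explaining the incorrect results.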
6 Chain of Thought Prompting (5 points)
The chain-of-thought prompting method was also discussed in Lecture 7, and in the post from
Medium and the paper https://arxiv.org/pdf/2201.11903.pdf. There are two advantages of this
approach. First, it produces a higher accuracy, as the model traverses a ‘better’ path. Second, the explanation itself is of value, which we explore by revisiting the above classifier.
1. Using the same data set you created for Section 5, modify your classification prompt to elicit
the chain of thought reasoning. Evolve your prompt to achieve good explanations/chains of
thought. Report what your prompt is. [1 point]
2. Run your prompt on all of the example inputs from the dataset. Report on the accuracy,
and state whether it is different from the accuracy you achieved in Section 5. Choose the best
explanation you see across your dataset, and the worst one. Report each of these, and say
what is good/bad about each. [4 points]
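With chain-of-thought prompting the reply contains an explanation as well as the label, so the two need to be separated before scoring. The sketch below assumes the prompt instructs the model to end its reply with a line of the form "Label: 0" or "Label: 1"; that convention, the helper name split_reasoning, and the sample replies are all assumptions made for illustration.

```python
import re

# Assumed convention: the reply ends with "Label: 0" or "Label: 1".
FINAL_LABEL = re.compile(r"Label:\s*([01])\s*$")

def split_reasoning(reply):
    """Return (reasoning text, label). The label is None if the reply
    does not end with the expected final line."""
    text = reply.strip()
    match = FINAL_LABEL.search(text)
    if match is None:
        return text, None
    return text[:match.start()].strip(), int(match.group(1))

# Made-up replies, for illustration only.
good_reply = (
    "The statement opens with 'I might be wrong' and is phrased as a question, "
    "so it offers a possibility rather than a fact.\nLabel: 1"
)
bad_reply = "This seems fairly direct."

reasoning, label = split_reasoning(good_reply)
_, missing = split_reasoning(bad_reply)
print(label, "|", reasoning)
print(missing)
```

The extracted labels can then be scored against the gold labels in the same way as in Section 5, and the stored reasoning text is what you compare when choosing the best and worst explanations.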

