iPrompt: Explaining Data Patterns in Natural Language via Interpretable Autoprompting
Chandan Singh *1 John X. Morris *2 Jyoti Aneja 1 Alexander M. Rush 2 Jianfeng Gao 1
Abstract
Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. We explore whether we can leverage this ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural language string explaining the data. iPrompt iteratively generates explanations with an LLM and reranks them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural language understanding, show that iPrompt can yield meaningful insights by accurately finding dataset explanations that are human-interpretable. On two of four classification datasets, iPrompt discovers a prompt that outperforms human-written prompts on GPT-3, despite only querying the relatively small GPT-J model. Finally, experiments with scientific datasets show the potential for iPrompt to aid in scientific discovery. 1
1. Introduction
Large language models (LLMs) have attained an extraordinary ability to harness natural language for solving diverse problems (Devlin et al., 2018), often without the need for finetuning (Brown et al., 2020; Sanh et al., 2021). Moreover, LLMs have demonstrated the capacity to excel at real-world problems, such as mathematics (Lewkowycz et al., 2022), scientific question answering (Sadat & Caragea, 2022), general processing of scientific text (Beltagy et al., 2019), predicting brain responses (Schrimpf et al., 2021), and classifying proteins and chemical compounds (Taylor et al., 2022).
*Equal contribution 1Microsoft Research 2Cornell University. Correspondence to: Jianfeng Gao jfgao@microsoft.com.
1All code for using the methods and data is made available on GitHub.
Figure 1. Interpretable autoprompting (iPrompt) inverts the standard prediction problem to instead find a natural language explanation of the data using a fixed, pre-trained large language model.
In this work, we probe whether we can leverage the learned skills of an LLM to discover and explain patterns in a dataset. To do so, we invert the typical problem of fitting an LLM to data and instead ask whether we can use a fixed LLM to produce a natural language string explaining dataset patterns.
Our approach to this problem centers around prompting. Prompting has emerged as an effective method for adapting LLMs to new datasets (Liu et al., 2021a); a prompt string is combined with each example in a dataset before querying an LLM for an answer. While prompts were initially constructed manually, recent work has shown success in autoprompting, automatically finding a prompt via optimization (Shin et al., 2020; Li & Liang, 2021; Deng et al., 2022). However, previous work on learning natural language prompts does not produce prompts that are meaningful to humans.
Our approach, interpretable autoprompting (iPrompt), extends autoprompting to generate a semantically meaningful natural language prompt that explains a key characteristic of the data (see Fig. 1). For example, given a dataset of examples of addition, e.g. $2 + 5 \Rightarrow 7$ ... $3 + 1 \Rightarrow 4$, iPrompt yields the natural language explanation Add the inputs. By changing the input form of the data, we can generate explanations that serve different purposes, such as: i) recovering a dataset explanation, ii) generating a prompt transferable between LLMs, and iii) proposing novel descriptions. iPrompt works by using a pre-trained LLM to iteratively propose and evaluate different candidate explanations.
For evaluation, we curate a diverse collection of datasets written in natural language (Table 1) and measure iPrompt’s ability to accurately explain a ground-truth pattern. We find that iPrompt outperforms baseline methods in accurately finding a correct description; moreover, the generated descriptions are interpretable, allowing human auditing and enabling strong generalization when used as a prompt in a new setting (i.e. when used for a different LLM). On real-world sentiment classification datasets, iPrompt even produces prompts that match or improve upon human-written prompts for GPT-3, while only using smaller, locally-run language models. Finally, we find that iPrompt is able to extract information from real-world scientific datasets.
2. Related work
Prompting and autoprompting. With the advent of large-scale models, prompting (i.e. finding the right prompt to use to query an LLM for a given task) has exploded as an area of inquiry, often yielding impressive improvements in performance (Brown et al., 2020; Petroni et al., 2019; Liu et al., 2021a) and spurring a line of work aiming to make prompting easier (Strobelt et al., 2022; Lu et al., 2022; Bach et al., 2022; Logan IV et al., 2022). Recently, autoprompting (i.e. automatically searching for a prompt or prompt-embedding via optimization) has emerged, with methods such as prefix-tuning (Li & Liang, 2021), P-tuning (Liu et al., 2021b), prompt-tuning with rules (Han et al., 2021), knowledgeable prompt tuning (Hu et al., 2021) and many more (Liu et al., 2021a). These strategies use gradient descent to find a set of “adapter” parameters that maximize model performance, but do not require that the new parameters map back to tokens in discrete space, rendering them uninterpretable.
A few methods tackle the more difficult problem of searching for prompts that can be expressed in natural language tokens. RLPrompt (Deng et al., 2022) searches for such a prompt using reinforcement learning, and one recent work (Honovich et al., 2022) queries an LLM to produce a prompt. AutoPrompt (Shin et al., 2020) performs autoprompting via input gradients (see Sec. 3). Similarly, adversarial triggers (Wallace et al., 2019) use autoprompting to identify adversarial inputs which can be used to change a model’s prediction. These methods effectively alter a model’s predictions, but do not constrain the discovered prompts to be semantically meaningful, resulting in prompts that are difficult to interpret (Webson & Pavlick, 2021). Another related work directly finetunes an LLM to describe the difference between two datasets (Zhong et al., 2022). Concurrent work proposes a method for natural language prompting similar to the one here, with a focus on improving prediction performance rather than on explaining data patterns (Zhou et al., 2022).
Problems related to dataset explanation. The problem statement presented in this work closely resembles the widely studied problems of symbolic regression (Augusto & Barbosa, 2000; Schmidt & Lipson, 2009), program synthesis (Gulwani et al., 2017; Manna & Waldinger, 1980), text/table summarization (Kryściński et al., 2019; Liu et al., 2018), and pattern discovery in data mining (Hand, 2007). iPrompt can be viewed as an algorithm for symbolic regression, in which the set of allowable symbols consists of semantically meaningful natural language strings. One recent work proposes the task of inferring prompts that improve supervised prediction (Honovich et al., 2022), which we generalize here to diverse use cases for dataset explanation.
Alternative methods for neural-network interpretation
A popular method for interpreting neural networks is to inspect an LLM’s individual predictions via feature importances (Lundberg et al., 2019; Ribeiro et al., 2016), feature-interaction importances (Singh et al., 2019; Tsang et al., 2017), extractive rationales (Zaidan & Eisner, 2008; Sha et al., 2021), or natural language explanations for individual predictions (Hendricks et al., 2016; Camburu et al., 2018). These works can provide meaningful insights for individual predictions, but it is difficult to aggregate them into an understanding of an entire dataset. Alternatively, one can investigate an LLM’s learned representations via probing (Conneau et al., 2018; Liu & Avci, 2019) or by directly analyzing the model’s internal weights and activations (Wang et al., 2021; Olah et al., 2018; Meng et al., 2022). However, these approaches are limited in their ability to generate previously unknown descriptions of data. A different approach involves distilling information into a transparent model (Tan et al., 2018; Ha et al., 2021; Singh & Gao, 2022) or simply using a transparent model in the first place (Breiman et al., 1984; Tan et al., 2022; Singh et al., 2021; Agarwal et al., 2022).
3. Methods: Defining the task and approach
3.1. Task: Dataset Explanation
Given a dataset comprised of input-output string pairs $\{(x^1, y^1), \dots, (x^N, y^N)\}$, the goal is to produce a “semantically meaningful” natural language string that explains the relationship between $x$ and $y$. We require that the string consists of human-understandable text rather than a sequence of incongruous tokens. For example, in the dataset shown in Fig. 1, given samples of data performing addition, our task is to recover text synonymous with Add the inputs. This dataset explanation can then be used for various downstream tasks, such as prompting a different LLM.

Table 1. Dataset explanation tasks. Each collection contains a number (#) of different tasks. Roman numerals correspond to the use cases in Fig. 1. For full details on each dataset, see Appendix A.2.
| Collection | # | Description |
|---|---|---|
| 1) Synthetic math | 10 | Mathematical functions (i), (ii) |
| 2) Allen NLI | 10 | Language tasks (i), (ii) |
| 3) Instr. induction | 20 | Language tasks (i), (ii) |
| 4) Sentiment | 4 | Sentiment classification (i), (ii) |
| 5) Proteins/chemicals | 3 | Protein/chemical properties (iii) |
| 6) Language fMRI | 20 | Excitation of fMRI voxel (iii) |
Datasets Table 1 shows the collections of datasets we study: (1) Synthetic math – datasets that require inferring an underlying mathematical function based on numeric inputs and outputs; (2) Allen NLI (ANLI) and (3) Instruction induction (Honovich et al., 2022) – diverse language tasks (Wang et al., 2022) with easily verifiable descriptions (e.g. Find a country’s capital). (4) Sentiment – a collection of sentiment classification datasets in different domains. For collections (1-3), there is a ground-truth prompt available for evaluation. For example, when adding two numbers (Fig. 1), the rule checks whether a description contains any of the keywords add, sum, or +. We also study scientific datasets on (5) proteins/chemicals, and (6) fMRI, with full details given in Sec. 6.
3.2. Approach: iPrompt
We now detail approaches for the general problem of autoprompting before introducing iPrompt, our method for interpretable autoprompting. We specify autoprompting as a discrete search problem. Given a dataset of $n$ input-output pairs $\{(x^1, y^1), \dots, (x^n, y^n)\}$ and a pre-trained LLM $f$ that returns the log-probability of a given string, autoprompting finds a natural language explanation $\hat{s}$ maximizing:

$$\hat{s} = \operatorname*{argmax}_{s \in \mathcal{S}} \sum_{i=1}^{n} f\big(\text{render}(s, (x^i, y^i))\big) \qquad (1)$$
The render function is a problem-specific function that renders a natural language string from the prompt $s$ and each example in the dataset $(x^i, y^i)$ . We use $\mathcal{S}$ to indicate the set of fluent strings, under some notion of syntactic fluency. This constraint is used to ensure prompts are readable, and potentially generalize to downstream LLMs. Solving this search problem exactly is intractable.
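As a concrete illustration, the objective can be sketched in a few lines of Python. The toy log-probability function, the `render` format, and the candidate prompts below are all illustrative stand-ins, not the paper's implementation.

```python
# Sketch of the autoprompting objective (Eq. 1), assuming a toy
# log-probability function in place of a real LLM.

def render(prompt, x, y):
    """Render a prompt and one (input, output) example into a single string."""
    return f"{prompt}\nIn: {x} Out: {y}"

def toy_logprob(text):
    """Stand-in for an LLM's log-probability of a string: here we simply
    reward prompts that mention 'sum' when the data shows addition."""
    return 0.0 if "sum" in text.lower() else -1.0

def score(prompt, data):
    """Objective value: total log-probability over all rendered examples."""
    return sum(toy_logprob(render(prompt, x, y)) for x, y in data)

data = [("2 5", "7"), ("3 1", "4")]
candidates = ["Sum the inputs", "Return the output"]
best = max(candidates, key=lambda s: score(s, data))
```

In the real algorithm, `toy_logprob` is replaced by the LLM's log-probability of the rendered string, and the maximization is carried out by discrete search over $\mathcal{S}$ rather than enumeration of a fixed candidate list.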
A core assumption of this objective is that semantically accurate prompts lead a model to assign higher probability to the correct output. To check this assumption, we analyze four datasets from the synthetic math collection that share a common structure for the inputs and prompts. Each dataset admits a prompt of the form Return the ___ of the inputs; the model is then given two input numbers and queried for the output.

Figure 2. Prompt-based reranking depends on model size. Large models (GPT-J 6B and GPT-3) align prompts correctly to tasks. The model is given the prompt Return the ___ of the inputs., where ___ is filled in with the shown prompt keyword before querying the output given two input numbers in a string. Darker indicates higher accuracy, and high accuracy along the diagonal indicates that the correct prompt induces the highest accuracy.
Fig. 2 shows the accuracy of different models at performing these tasks across different input prompts.2 For small models, the prompts are unsuccessful, but for large models (GPT-J 6B and GPT-3), the model is accurate if and only if given the correct prompt.3 This result suggests that, at least for large models, the search for a prompt that maximizes performance correlates well with the underlying task. We will see in Fig. 4 that dataset explanation depends on this ability.
Baseline: AutoPrompt AutoPrompt (Shin et al., 2020) targets the objective posed in Eq. (1) using a gradient-based local search. AutoPrompt searches for $\hat{s}$ following the gradients of the objective Eq. (1) with respect to individual tokens in $\hat{s}$ . It discretely changes individual words in $\hat{s}$ and then checks whether or not the newly updated $\hat{s}$ improves the objective score. The use of gradients allows AutoPrompt to find an effective prompt $\hat{s}$ , but makes it difficult to find answers that satisfy the fluency constraint $\mathcal{S}$ .
2The accuracy is normalized for each task using softmax in order to visualize the effect of differing prompts.
3For details on each model, see Table A3.
Figure 3. Overview of iPrompt. iPrompt first proposes candidate prompts, then ranks them based on their performance as a prompt, then truncates and regenerates them. This entire process is repeated until performance stops improving.
Baseline: Zero-shot suffix decoding LLMs themselves can be directly used to predict prompt strings. Following Honovich et al. (2022), we give the model a string containing data examples followed by a template (e.g. In: 2 5 Out: 7 To compute the output from the input, ___) and sample the completion to recover a prompt $\hat{s}$ using nucleus sampling.4
Proposed method: iPrompt iPrompt (Fig. 3) is an iterative local search algorithm that alternates between three steps: (i) proposing candidate prompts, (ii) reranking candidate prompts, (iii) exploration.
(i) Proposal: Candidate prompts are generated by extending the zero-shot LLM generation. Given a data instance as a prefix, we sample a number of candidate prompts. The maximum length of each candidate is pre-specified and fixed. For example, in the add-two-numbers task (Fig. 3), we may generate four candidates: {Combine the numbers, Return the output, Sum in order, Compute the output}.
(ii) Reranking: Given candidates, the objective Eq. (1) is evaluated for each candidate prompt $s$. The top few candidates which maximize the objective are kept, e.g. narrowing down the candidates to {Combine the numbers, Sum in order}.

4We also consider averaging the model’s output logits across all examples in the dataset before decoding the output, but find that it does not improve performance (see Appendix A.4).
(iii) Iterate with exploration: Each of the top candidates from reranking is truncated at a random position. These truncated candidates are used as a prefix when generating new candidate prompts via suffix decoding. For example, we may randomly select the start of the previous candidates and fill in the endings: {Combine the ___, Sum ___} → {Combine the numbers, Combine both arguments, Sum the numbers, Sum all inputs}.
The algorithm is repeated until identifying a suitably strong $\hat{s}$ , e.g. Sum the numbers. Steps (i) and (iii) ensure that prompts remain fluent, while step (ii) improves the score of the prompts on the objective. Computationally, iPrompt only requires running inference on the pre-trained LLM, yielding a significantly lower memory requirement than methods such as AutoPrompt which require access to the LLM’s gradients.
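The three steps above can be sketched as a toy loop; the proposal model and scoring function below are mock stand-ins for the LLM-based suffix decoder and the objective in Eq. (1), used only to show the structure of the search.

```python
import random

# Minimal sketch of the iPrompt loop with a toy proposal model and scorer;
# a real implementation would sample candidates from an LLM and score them
# with the LLM's log-probabilities.

VOCAB = ["Sum", "Combine", "the", "numbers", "inputs", "Return", "output"]

def propose(prefix, k=4):
    """Toy suffix decoder: extend a (possibly empty) word-list prefix."""
    return [" ".join(prefix + random.choices(VOCAB, k=3)) for _ in range(k)]

def toy_score(prompt):
    """Stand-in for Eq. (1): reward prompts resembling 'Sum the numbers'."""
    return sum(w in {"Sum", "the", "numbers"} for w in prompt.split())

def iprompt(n_iters=20, top_k=2, seed=0):
    random.seed(seed)
    candidates = propose([])                        # (i) proposal
    for _ in range(n_iters):
        candidates.sort(key=toy_score, reverse=True)
        top = candidates[:top_k]                    # (ii) reranking
        new = []
        for c in top:                               # (iii) truncate + regenerate
            words = c.split()
            cut = random.randrange(len(words))
            new += propose(words[:cut], k=2)
        candidates = top + new
    return max(candidates, key=toy_score)
```

Because the top-ranked candidates are carried over between iterations, the best objective value never decreases; truncation-and-regeneration provides the exploration needed to escape locally good but globally weak phrasings.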
4. Experimental Setup
We consider two sets of experiments. First, in Sec. 5, we explore iPrompt’s ability to rediscover a correct and fluent prompt on a variety of simple instruction datasets (Table 1, top) with known answers. These experiments test the ability of the model to recover a known prompt while remaining fluent in a way that generalizes to human readers and to other language models. In Sec. 6, we apply iPrompt to scientific datasets (Table 1, bottom).
Language Models For the main set of experiments, we always generate prompts using GPT-J, a 6 billion parameter model (Wang & Komatsuzaki, 2021). We restrict prompts to ${6, 12}$ tokens for sentiment classification and 6 tokens for the remaining data collections in Table 1. For generalization experiments, alternative models are tested with the generated prompts including OPT and GPT-3 (Zhang et al., 2022; Brown et al., 2020). See Appendix A.4 for a full discussion of experimental details and Appendix A.3 for experiments on more models (e.g. Galactica (Taylor et al., 2022)) and more datasets.
Evaluation metrics We consider two types of evaluation: closeness to ground truth and accuracy as a prompt. To measure closeness we use three metrics: (1) Correct – whether the generated explanation contains one of a set of problem-specific keywords. (2) MRR – mean reciprocal rank of the first task-correct prompt. Given a set of datasets $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_N\}$, we compute $MRR = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \frac{1}{rank_i}$, where $rank_i$ is the one-indexed rank of the first correct explanation. (3) Human – the human evaluation score comparing the top-generated explanation to a pre-specified groundtruth explanation, with evaluators instructed: “You are given a groundtruth description along with a generated one. On a scale of 1 (worst) to 5 (best), how interpretable and accurate is the generated description?”5 The mean human evaluation score (ranging from 1 to 5) is normalized.

Table 2. Performance for dataset explanation. Datasets from Table 1 (1-3). Accuracy measured via (1) human evaluation (H, normalized %), (2) mean reciprocal rank across the collection (M), and (3) 1-best correctness (C, %). For all metrics, higher is better.

| | iPrompt (H / M / C) | AutoPrompt (H / M / C) | Suffix (H / M / C) |
|---|---|---|---|
| Math | 60 / 0.69 / 60 | 25 / 0.14 / 13 | 20 / 0.08 / 03 |
| ANLI | 56 / 0.41 / 37 | 21 / 0.07 / 07 | 25 / 0.06 / 01 |
| Induction | 42 / 0.35 / 28 | 21 / 0.09 / 08 | 23 / 0.04 / 01 |
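For concreteness, the MRR metric can be computed as follows. This is a minimal sketch; treating datasets where no correct explanation is ever found as contributing a reciprocal rank of 0 is our assumption.

```python
def mean_reciprocal_rank(ranks):
    """MRR over datasets: ranks[i] is the 1-indexed rank of the first
    task-correct explanation for dataset i (None if never found)."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

# e.g. correct explanation found at ranks 1 and 2, and never for a third dataset:
mrr = mean_reciprocal_rank([1, 2, None])  # (1 + 0.5 + 0) / 3 = 0.5
```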
To measure generalization ability, we evaluate explanations based on their accuracy as a prompt for other models. Accuracy is computed following Brown et al. (2020) and Raffel et al. (2020): using exact matching with beam search, a beam width of 4, and a length penalty of $\alpha = 0.6$.
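The accuracy metric above reduces to exact-match over decoded strings; a minimal sketch follows (stripping surrounding whitespace is our assumption about normalization):

```python
def exact_match_accuracy(preds, golds):
    """Fraction of generated outputs that exactly match the gold string
    (after stripping surrounding whitespace -- an assumption here)."""
    return sum(p.strip() == g.strip() for p, g in zip(preds, golds)) / len(golds)
```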
For sentiment evaluation, we learn a prompt within the template Input: “${input}” {prompt}.6 We use the words positive and negative as labels and require the LLM to rank the two options. Human-written prompts are adapted to this template from open-source prompts available through PromptSource (Bach et al., 2022).
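Ranking the two options means scoring each verbalizer as a continuation and picking the higher-scoring one. The sketch below uses a toy cue-word scorer in place of LLM log-probabilities; the cue words and template quoting are illustrative.

```python
# Sketch of zero-shot sentiment evaluation by label ranking, assuming a toy
# scoring function in place of LLM log-probabilities for each verbalizer.

TEMPLATE = 'Input: "{input}" {prompt}'

def toy_label_logprob(text, label):
    """Stand-in for the LLM's log-prob of `label` following `text`."""
    cues = {"positive": ["great", "loved"], "negative": ["awful", "boring"]}
    return sum(text.count(w) for w in cues[label])

def classify(review, prompt):
    """Pick the verbalizer the (toy) model ranks higher as a continuation."""
    text = TEMPLATE.format(input=review, prompt=prompt)
    return max(["positive", "negative"],
               key=lambda lab: toy_label_logprob(text, lab))

pred = classify("I loved this movie, great acting.", "The sentiment is")
```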
5. Results and Analysis
5.1. Dataset explanation recovery
Table 2 compares prompting methods across three diverse data collections. The Human evaluation scores are much higher for iPrompt than the baselines, suggesting that it finds prompts which are both accurate and human-interpretable. Similarly, the MRR and Correct scores show that iPrompt considerably improves in finding accurate explanations. See all generated explanations in Appendix A.3.
To assess the best-case absolute accuracy of the approach, we note it is impossible for the approach to recover the prompt if the underlying LLM cannot solve the task. Fig. 4 plots the prompt recovery performance (MRR) against the underlying LLM’s accuracy (when using the groundtruth prompt) for each dataset. When the model can solve the task, iPrompt does well on recovery. However for many tasks the model has low accuracy even with the correct prompt, putting a ceiling on the performance of iPrompt.
5Human evaluation scores are averaged over 4 PhD students in machine learning not affiliated with the study.
6In initial experiments, we find that performance drops significantly when learning a prompt that comes before the input.
Figure 4. Comparison of model accuracy with correct prompt and iPrompt ability to find the correct prompt across each individual task (single-task MRR). Prompt recovery ability is dependent on the model’s ability to perform the task.
Table 3. Generalization accuracy (zero-shot) with the prompts generated with GPT-J as the LLM across different models.
| | | Correct Prompt | iPrompt | AutoPrompt | No prompt |
|---|---|---|---|---|---|
| Math | GPT-J 6.7B* | 54.0 | 51.5 | 41.6 | 16.3 |
| | OPT 6.7B | 12.7 | 19.3 | 18.9 | 8.4 |
| | GPT 20B | 76.1 | 54.4 | 23.2 | 8.5 |
| | GPT-3 175B | 76.0 | 62.1 | 40.8 | 28.4 |
| ANLI | GPT-J 6.7B* | 9.0 | 4.7 | 1.9 | 2.0 |
| | OPT 6.7B | 10.7 | 6.7 | 4.7 | 7.9 |
| | GPT 20B | 31.0 | 14.2 | 5.6 | 4.0 |
| | GPT-3 175B | 37.6 | 11.7 | 2.7 | 7.7 |
5.2. Generalization accuracy of prompts
Do prompts generated for a specific LLM still work when applied to a different model? Table 3 shows the generalization accuracy when testing the prompts generated using GPT-J (Table 5) on different LLMs. The prompts maintain effectiveness across most models. For the Math datasets, the iPrompt prompts elicit improvement over the baselines and approach the accuracy of the correct prompt. For the ANLI datasets, all prompts induce poor performance. Notably, the gap between iPrompt and AutoPrompt is larger for larger models (i.e. GPT 20B and GPT-3); this suggests that, by generating fluent prompts, iPrompt generates more generalizable descriptions.
Table 4 shows results on the sentiment analysis datasets. As prompts for GPT-J, iPrompt outperforms not only AutoPrompt but also the manually-written prompt on all four datasets. Interestingly, the average performance of human-written prompts on GPT-J is very low, unlike the prompts generated by iPrompt. This indicates that models at the 6B-parameter scale may be brittle to the choice of prompt, even among a set of reasonable options, and that iPrompt (and, to an extent, AutoPrompt) is able to discover how to phrase prompts so that models of this scale can complete the task.

Table 4. Zero-shot accuracy on sentiment classification datasets: SST-2, Rotten Tomatoes, IMDB, and the Financial Phrasebank (Socher et al., 2013; Malo et al., 2014; Pang & Lee, 2005). Generation with GPT-J 6B and evaluation both on the original GPT-J model and on GPT-3 (text-davinci-002). Errors are standard errors of the mean.
| | | Human-written | iPrompt | AutoPrompt | No prompt |
|---|---|---|---|---|---|
| GPT-J | FFB | 27.0 ± 1.9 | 79.3 ± 2.1 | 74.0 ± 9.1 | 47.5 |
| | RT | 58.9 ± 3.1 | 84.8 ± 0.9 | 73.0 ± 4.8 | 59.2 |
| | SST-2 | 58.4 ± 2.8 | 86.7 ± 1.0 | 76.7 ± 3.9 | 60.9 |
| | IMDB | 66.0 ± 3.2 | 87.9 ± 1.4 | 86.7 ± 1.2 | 58.6 |
| GPT-3 | FFB | 39.6 ± 1.6 | 57.2 ± 6.9 | 28.2 ± 3.1 | 39.1 |
| | RT | 82.7 ± 3.3 | 77.4 ± 2.8 | 57.8 ± 3.5 | 54.8 |
| | SST-2 | 90.5 ± 3.9 | 82.4 ± 2.3 | 61.8 ± 7.0 | 58.4 |
| | IMDB | 75.6 ± 3.3 | 86.6 ± 1.1 | 70.0 ± 6.5 | 66.2 |
When sentiment prompt generalization is tested on GPT-3, we find that iPrompt prompts outperform human-written prompts on two of the four datasets. On GPT-3, the iPrompt-generated prompt To summarize this review! : outperforms all PromptSource IMDB prompts that use the same verbalizer (positive/negative). When its prompts are tested on GPT-3, AutoPrompt only slightly outperforms using no prompt at all.
Table 5 shows the top-ranked explanation generated by each method for selected datasets. iPrompt often finds an explanation that is indicative of the underlying relationship, even if the phrasing is not perfect. For example, for the add two numbers dataset, it finds Create a function named ‘sum’. The prompts found by iPrompt also read as fairly fluent strings compared to AutoPrompt, which produces an incoherent set of tokens.
5.3. Model ablations
We run ablation experiments to analyze the three steps of iPrompt: (1) proposal, (2) reranking, and (3) iteration. We use the Math and ANLI datasets and run on a maximum of 5,000 data points, using 5 shots in context for prompt generation.
(1) Proposals are partially guided by examples. During the proposal stage, iPrompt prefixes potential prompts with dataset examples. Table 6 considers variants that remove the input and/or output examples from this prefix. Note that the system still has access to the full examples during the reranking stage. We find the system can achieve decent performance on Math simply by iterating. However, for ANLI, the model needs to at least see the inputs/outputs during the proposal stage in order to find accurate prompts.
(2) Reranking zero-shot recovers better prompts. iPrompt uses zero-shot accuracy to rank prompts. As we
have examples of the task, we could instead use in-context few-shot prompting for ranking. Prior work suggests that prompt wording is less influential as the number of in-context examples increases (Webson & Pavlick, 2021). Table 6 shows that using these examples in-context for reranking does, in fact, considerably hamper prompt recovery. We further find that the LLM used for reranking is more important than the LLM used for proposals (see Appendix Fig. A3).
(3) Iteration improves performance. Finally, Table 6 shows that without multiple iterations, performance drops nearly to zero (Fig. A2 shows more details on loss as a function of iterations).
6. Scientific investigations with iPrompt
We now investigate whether iPrompt can explain patterns in scientific datasets. Specifically, we analyze the Galactica model (Taylor et al., 2022) with 6.7 billion parameters. We query whether it can describe differences in datasets of chemical compounds and protein sequences before investigating a neuroscience problem.
Toxic chemical compounds We first ask whether iPrompt can explain the difference between two groups of chemical compounds with a known difference. We use the Tox21 dataset (Richard et al., 2020) which contains toxicity measurements on 12 biological targets. For each of the 12 biological targets, we search for a prompt that differentiates compounds that are toxic to the target (positive) from those which are not toxic to any of the targets (negative). We use 100 positive/negative examples for each biological target and format each input with the text Here is a compound: \n [Compound Name] \n Answer: followed by Yes for a positive compound and No for a negative one. iPrompt is run for a single epoch with 5 shots in each example.
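The input formatting described above can be sketched as follows; the compound name below is illustrative, and the exact whitespace around the newlines is our assumption:

```python
# Render one Tox21-style example in the format described in the text;
# the compound name here is made up for the sketch.

def render_compound(name, is_toxic):
    """Format a compound name with its Yes/No toxicity label."""
    label = "Yes" if is_toxic else "No"
    return f"Here is a compound:\n{name}\nAnswer: {label}"

example = render_compound("Bisphenol A", True)
```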
Ideally, the elicited prompt would mention toxicity. Table 7 shows whether the elicited prompts contain the substring tox, both in terms of MRR and top-prompt correctness. iPrompt often finds an accurate prompt; one representative example is Answer yes if the compound is toxic, and Otherwise answer NO. To ensure that this substring is not simply a popular completion for the language model, we compare against a baseline which runs iPrompt using Galactica proposals from empty inputs/outputs and reranking with Galactica; over 36 random seeds, tox never appears in any generated prompt.
Differentiating protein sequences We turn to whether iPrompt can explain the differences between two groups of proteins. We use protein sequences and keywords from Swiss-Prot (Bairoch & Boeckmann, 1991) (a high-quality subset of Uniprot (Consortium, 2015)) to construct two

Table 5. Examples of generated explanations by iPrompt and AutoPrompt. See all prompts in Appendix A.3.
| | Human-written prompt | iPrompt | AutoPrompt |
|---|---|---|---|
| Math | Return the sum of the inputs | Create a function named 'sum | ¿:Returns Adding togetherFont accomplish |
| | Return the square of the input | Input number and return its square | Cal impl qApplySquare fiat |
| | Differentiate between prime/non-prime integers | Are these pairs of integers prime | ropheospels&& Norestricted |
| ANLI | Differentiate vegetarian/non-vegetarian foods | Are you a vegetarian? | compliedthe whether methamphetamine provided comp |
| | Differentiate the subject in a sentence based on gender | Predict the gender (F = | ¿ endoftext ¿ -¿ M Fundamental FG Fav |
| | Return a synonym | what is a synonym for | Word termOn English meanings |
| | Translate english to spanish | please write English meaning in Spanish | the ththebb volunt |
| | Return a country’s capital city | Which city is the capital and | Ang Suppose AUTHthe beh Assassins |
| Sentiment | What is the sentiment expressed by the reviewer for the movie? | Describe what it is about this film has caused it | Pap Azerb Saiyan Forean Talatar Yemeni IndBloomberg receiveda Fur resultolandgroundur augmented= |
| | How does the author of the news headline feel? | <input> neutral> The result was due to: ” | |
Table 6. Algorithmic ablations for each stage of iPrompt. Shows the prompt recovery (MRR) achieved when ablating each stage. Averaged over 3 random seeds.

| Stage | Ablation | Math (MRR) | ANLI (MRR) |
|---|---|---|---|
| (1) Proposal | w/o inputs+outputs | 0.400 | 0.015 |
| | w/o inputs | 0.463 | 0.244 |
| | w/o outputs | 0.539 | 0.255 |
| (2) Reranking | w/ in-context examples | 0.071 | 0.152 |
| (3) Iteration | No iteration | 0.075 | 0.050 |
Table 7. iPrompt performance at recovering prompts for toxic chemical compounds. Tox21 results are averaged over 12 datasets with 3 random seeds each. Null data is averaged over 36 random seeds. Error bars are standard error of the mean.
| iPrompt | Baseline | |
|---|---|---|
| MRR | 0.83 ± 0.04 | 0.0 |
| Top-prompt correctness | 0.67 ± 0.08 | 0.0 |
datasets: each dataset contains two groups of proteins, which are differentiated based on their keywords.7 The first dataset, which we call Cyto, has proteins with either the keyword Cytoplasm or Membrane. The second dataset, which we call Binding, has proteins with either the keyword RNA-binding or ATP-binding. Each group is randomly down-sampled to 100 proteins and iPrompt is run with the same hyperparameters as when finding chemical compounds.
We make this problem more challenging by feeding the model the raw protein sequence (not the protein name), which ranges from hundreds to thousands of amino acids. Each input is presented with the following text: Here is a protein sequence: \n [Protein Sequence] \n Answer: followed by Yes for one group and No for the other. Table 8
7We search for reasonably popular but non-cooccurring keywords in the proteins; see details in Fig. A5
Table 8. iPrompt performance at differentiating protein sequences. For both the Cyto and Binding datasets, the correct keywords are successfully identified better than for the Baseline. Results are averaged over 12 random seeds; error bars are standard error of the mean.
| iPrompt (Cyto) | iPrompt (Binding) | Baseline | |
|---|---|---|---|
| MRR | 0.2 ± 0.08 | 0.08 ± 0.04 | 0.03 ± 0.01 |
| Recall @ 5 | 0.25 ± 0.13 | 0.17 ± 0.11 | 0.05 ± 0.05 |
| Recall @ 20 | 0.83 ± 0.11 | 0.33 ± 0.14 | 0.23 ± 0.09 |
shows results for whether the elicited prompt contains one of the relevant keywords for each dataset (e.g. Cytoplasm). Despite the difficult input format, the correct keywords are identified for both the Cyto and Binding datasets more often than for the baseline (which again uses empty inputs/outputs).
Scientific investigation into an fMRI natural language dataset
We now explore using iPrompt in a simple neuroscience experiment. A central challenge in neuroscience is understanding how and where semantic concepts are represented in the brain. A recent seminal study (Huth et al., 2016) explores this question by investigating where different natural language categories are represented in the human neocortex. Specifically, the authors collect functional MRI (fMRI) responses as human subjects listen to hours of narrative stories. They then build a predictive model of these responses for each voxel (i.e. a small region in space) in the brain, which takes as input the words contained in the stories (and other features). To interpret these individual voxel models, they cluster the words in the narrative stories into 12 groups and manually annotate them, resulting in 12 categories, such as tactile, visual, and professional. Finally, they view the spatial mapping of these 12 concepts (projected onto low dimensions) across the brain using their individual voxel models.
We revisit a small piece of this study’s analysis through the lens of iPrompt. Specifically, we ask whether iPrompt could generate plausible categories that are well-represented across the brain but differ from the 12 manually identified ones. We fit a predictive model for each voxel, following the pipeline of the original study (details in Appendix A.6). We then use the resulting models to identify a list of the top-15 words that most excite each voxel. For example, the top-15 words that excite the best-predicted voxel are: sheet, edges, diameter, strips, cardboard, copper, steel, colored, coloured, leaf, wire, cap, paper, shaped, tin. To identify a plausible semantic category, we construct a template string as follows: The following list of words all belong to the same semantic category: ____\n\nsheet, edges, ..., shaped, tin. We then use iPrompt (again with the 6B-parameter GPT-J model) to generate a category by filling in the blank (restricted to a single token). To make iPrompt more effective, for each voxel we run iPrompt on a set of examples consisting of 15 permutations of the top-15 words, allowing it to find patterns that are not overly sensitive to word ordering.
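The permutation step above can be sketched as follows; the function name is ours, and this is a simplification of the actual pipeline.

```python
import random

# Sketch: build 15 shuffled variants of a voxel's top words, each slotted
# into the fill-in-the-blank template from the text, so the generated
# category is not tied to one particular word ordering.
def make_permuted_examples(top_words, n_examples=15, seed=0):
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        words = top_words[:]          # copy so the original order survives
        rng.shuffle(words)
        examples.append(
            "The following list of words all belong to the same semantic "
            "category: ____\n\n" + ", ".join(words)
        )
    return examples

examples = make_permuted_examples(["sheet", "edges", "diameter", "strips"])
```

Running iPrompt on all 15 variants jointly encourages prompts that explain the word set rather than any single sequence of it.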
Given the top categories for each voxel, we analyze the mapping of recurring categories across the neocortex. We aggregate the top-15 inferred categories8 over the top-15 best-predicted voxels and find that the most frequently inferred categories are: material, color, surface, text, & fabric. Interestingly, these are sensible quantities that different voxels could reasonably be selective for. We spatially map each of these identified categories (e.g. material) across the 10,000 best-predicted voxels by using the LLM in a second way. For each voxel, we condition the LLM (again GPT-J, 6B) on the top-15 words list and evaluate the predicted probability of each category, i.e. The following list of words all belong to the same semantic category: sheet, edges, ..., shaped, tin. The semantic category they all belong to, in one word, is ____. The higher this predicted probability, the more selective we infer the voxel to be for that category. Fig. 5 shows these predicted probabilities for the top-two inferred categories (material and color) across the cortex of a human subject.
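The scoring step above can be sketched as follows. We abstract the LLM as a function returning next-token log-probabilities; this interface, the function names, and the toy stand-in model are ours, not the paper's implementation.

```python
import math

# Template from the text, with the voxel's word list filled in.
TEMPLATE = (
    "The following list of words all belong to the same semantic category: "
    "{words}. The semantic category they all belong to, in one word, is"
)

def score_categories(next_token_logprobs, top_words, categories):
    """Score how selective a voxel is for each candidate category via the
    model's predicted log-probability of the category token."""
    prompt = TEMPLATE.format(words=", ".join(top_words))
    logprobs = next_token_logprobs(prompt)
    # Categories the model assigns no mass to get -inf.
    return {c: logprobs.get(c, -math.inf) for c in categories}

# Toy stand-in for the LLM that always prefers "material".
def toy_model(prompt):
    return {"material": math.log(0.6), "color": math.log(0.1)}

scores = score_categories(toy_model, ["sheet", "edges"], ["material", "color"])
```

With a real causal LM, `next_token_logprobs` would run one forward pass over the prompt and read off the log-softmax of the final position's logits.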
While there is no ground truth for this semantic map, one noteworthy feature of the resulting map is that it is spatially smooth (quantitatively, Fig. A8 shows that the variance of the map among neighboring pixels is significantly lower than we would expect from shuffling the map's values). This is non-trivial, as spatial information was nowhere incorporated into the modeling process: each voxel was modeled independently and the displayed prediction was queried independently. We expect the underlying map to be smooth, both due to local connectivity in brain regions and because the BOLD signal measured by fMRI does not have perfect spatial resolution. Thus, the fact that our inferred map is
8We apply stemming and remove stopwords before choosing the best categories.
Figure 5. Representations of the iPrompt-elicited concepts material (blue) and color (red) across the surface of the neocortex are spatially clustered and smooth. Only the top 10,000 best-predicted voxels are shown; remaining voxels are shown in black. Only the right hemisphere is shown (both hemispheres, which show consistent smoothness, appear in Fig. A6).
smooth suggests that (i) something about these categories is genuinely captured by the representation in the human brain, and (ii) the iPrompt approach was able to reflect at least some of it. Beyond the two categories shown, all five categories generated by iPrompt exhibit spatial smoothness across the neocortex (Fig. A8).
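The shuffling-based smoothness check described above can be sketched as follows, using a hypothetical 2D grid of per-voxel category scores (the paper's voxels live on a cortical surface, so this is a simplification, and all names are ours): compare the mean squared difference between neighboring values against the same statistic on shuffled copies of the map.

```python
import random

def local_roughness(grid):
    """Mean squared difference between horizontally/vertically adjacent
    cells; lower means spatially smoother."""
    rough, count = 0.0, 0
    for i in range(len(grid)):
        for j in range(len(grid[0])):
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < len(grid) and nj < len(grid[0]):
                    rough += (grid[i][j] - grid[ni][nj]) ** 2
                    count += 1
    return rough / count

def shuffled_roughness(grid, n_shuffles=100, seed=0):
    """Average roughness over random spatial shuffles of the same values
    (the null hypothesis of no spatial organization)."""
    rng = random.Random(seed)
    flat = [v for row in grid for v in row]
    rows, cols = len(grid), len(grid[0])
    stats = []
    for _ in range(n_shuffles):
        rng.shuffle(flat)
        shuffled = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
        stats.append(local_roughness(shuffled))
    return sum(stats) / len(stats)

smooth_map = [[i + j for j in range(6)] for i in range(6)]  # a smooth gradient
```

A genuinely smooth map has roughness well below its shuffled baseline, which is the pattern reported for the inferred category maps.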
7. Conclusion and Discussion
iPrompt makes a meaningful step towards finding natural language prompts that are both accurate and human-interpretable. We show this method can be used to recover dataset descriptions, produce transferable prompts, and provide explanations for experimental data. One future direction is eliciting targeted information from data via a template. For example, one may use iPrompt to extract feature importances by prepending the learned prompt with the string “To get the answer from the inputs, the most important inputs are ____”. As another example, in a scientific study such as the fMRI study in Sec. 6, a scientist interested in a particular topic (e.g. fear) may investigate it by crafting a more specific template (e.g. How are these words related to the concept of “fear”?).
While we focus on text, iPrompt could be applied generally to settings where an LLM performs well. For example, in computer vision, an interpretable autoprompt may take the form of an image mask, and in vision-language models, an interpretable prompt may be a description of a vision task, e.g. find the largest shape in this image.
Acknowledgements
AR is supported by NSF CAREER 2037519, NSF 1704834, and a Sloan Fellowship. JM is supported by Weill Cornell Medicine. Thanks to Wenting Zhao and Woojeong Kim for comments on drafts of this paper and to Jeevana Priya Inala, Xin Wang, Baolin Peng, Michel Galley, and Hao Cheng for interesting discussions related to the work. We would also like to thank the authors of (Huth et al., 2016) for making their data publicly available.
References
Agarwal, A., Tan, Y. S., Ronen, O., Singh, C., and Yu, B. Hierarchical shrinkage: improving the accuracy and interpretability of tree-based methods. arXiv:2202.00858 [cs, stat], 2 2022. URL http://arxiv.org/abs/2202.00858. arXiv: 2202.00858.
Augusto, D. A. and Barbosa, H. J. Symbolic regression via genetic programming. In Proceedings. Vol. 1. Sixth Brazilian Symposium on Neural Networks, pp. 173–178. IEEE, 2000.
Bach, S. H., Sanh, V., Yong, Z.-X., Webson, A., Raffel, C., Nayak, N. V., Sharma, A., Kim, T., Bari, M. S., Fevry, T., et al. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279, 2022.
Bairoch, A. and Boeckmann, B. The swiss-prot protein sequence data bank. Nucleic acids research, 19(Suppl):2247, 1991.
Beltagy, I., Lo, K., and Cohan, A. Scibert: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676, 2019.
Black, S., Leo, G., Wang, P., Leahy, C., and Biderman, S. GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow, March 2021. doi: 10.5281/zenodo.5297715. URL https://doi.org/10.5281/zenodo.5297715.
Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., et al. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984. URL https://www.routledge.com/Classification-and-Regression-Trees/Breiman-Friedman-Stone-Olshen/p/book/9780412048418.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Camburu, O.-M., Rocktäschel, T., Lukasiewicz, T., and Blunsom, P. e-snli: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31, 2018.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. Scaling instruction-finetuned language models, 2022. URL https://arxiv.org/abs/2210.11416.
Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070, 2018.
Consortium, U. Uniprot: a hub for protein information. Nucleic acids research, 43(D1):D204–D212, 2015.
Deng, M., Wang, J., Hsieh, C.-P., Wang, Y., Guo, H., Shu, T., Song, M., Xing, E. P., and Hu, Z. Rlprompt: Optimizing discrete text prompts with reinforcement learning. arXiv preprint arXiv:2205.12548, 2022.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Gao, J. S., Huth, A. G., Lescroart, M. D., and Gallant, J. L. Pycortex: an interactive surface visualizer for fMRI. Frontiers in Neuroinformatics, pp. 23, 2015.
Gulwani, S., Polozov, O., Singh, R., et al. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2): 1–119, 2017.
Ha, W., Singh, C., Lanusse, F., Upadhyayula, S., and Yu, B. Adaptive wavelet distillation from neural networks through interpretations. Advances in Neural Information Processing Systems, 34, 2021.
Han, X., Zhao, W., Ding, N., Liu, Z., and Sun, M. Ptr: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259, 2021.
Hand, D. J. Principles of data mining. Drug safety, 30(7):621–622, 2007.
Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., and Darrell, T. Generating visual explanations. In European conference on computer vision, pp. 3–19. Springer, 2016.
Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
Honovich, O., Shaham, U., Bowman, S. R., and Levy, O. Instruction induction: From few examples to natural language task descriptions. arXiv preprint arXiv:2205.10782, 2022.
Hu, S., Ding, N., Wang, H., Liu, Z., Li, J., and Sun, M. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035, 2021.
Huth, A. G., De Heer, W. A., Griffiths, T. L., Theunissen, F. E., and Gallant, J. L. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600):453–458, 2016.
Kryściński, W., Keskar, N. S., McCann, B., Xiong, C., and Socher, R. Neural text summarization: A critical evaluation. arXiv preprint arXiv:1908.08960, 2019.
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022.
Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
Liu, F. and Avci, B. Incorporating priors with feature attribution on text classification. arXiv preprint arXiv:1906.08286, 2019.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021a.
Liu, T., Wang, K., Sha, L., Chang, B., and Sui, Z. Table-to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. Gpt understands, too. arXiv preprint arXiv:2103.10385, 2021b.
Logan IV, R., Balazevic, I., Wallace, E., Petroni, F., Singh, S., and Riedel, S. Cutting down on prompts and parameters: Simple few-shot learning with language models. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 2824–2835, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.222. URL https://aclanthology.org/2022.findings-acl.222.
Lu, Y., Bartolo, M., Moore, A., Riedel, S., and Stenetorp, P. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8086–8098, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.556. URL https://aclanthology.org/2022.acl-long.556.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. Explainable ai for trees: From local explanations to global understanding. arXiv preprint arXiv:1905.04610, 2019.
Malo, P., Sinha, A., Korhonen, P., Wallenius, J., and Takala, P. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65, 2014.
Manna, Z. and Waldinger, R. A deductive approach to program synthesis. ACM Transactions on Programming Languages and Systems (TOPLAS), 2(1):90–121, 1980.
Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual knowledge in gpt. arXiv preprint arXiv:2202.05262, 2022.
Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K., and Mordvintsev, A. The building blocks of interpretability. Distill, 3(3):e10, 2018.
Pang, B. and Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL, 2005.
Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
Ribeiro, M. T., Singh, S., and Guestrin, C. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.
Richard, A. M., Huang, R., Waidyanatha, S., Shinn, P., Collins, B. J., Thillainadarajah, I., Grulke, C. M., Williams, A. J., Lougee, R. R., Judson, R. S., et al. The tox21 10k compound library: collaborative chemistry advancing toxicology. Chemical Research in Toxicology, 34(2):189–216, 2020.
Sadat, M. and Caragea, C. Scinli: A corpus for natural language inference on scientific text. arXiv preprint arXiv:2203.06728, 2022.
Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafei, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.
Schmidt, M. and Lipson, H. Distilling free-form natural laws from experimental data. science, 324(5923):81–85, 2009.
Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., Tenenbaum, J. B., and Fedorenko, E. The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45):e2105646118, 2021.
Sha, L., Camburu, O.-M., and Lukasiewicz, T. Learning from the best: Rationalizing predictions by adversarial information calibration. In AAAI, pp. 13771–13779, 2021.
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., and Singh, S. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
Singh, C. and Gao, J. Emb-gam: an interpretable and efficient predictor using pre-trained language models. arXiv preprint arXiv:2209.11799, 2022. doi: 10.48550/arxiv.2209.11799. URL https://arxiv.org/abs/2209.11799.
Singh, C., Murdoch, W. J., and Yu, B. Hierarchical interpretations for neural network predictions. International Conference on Learning Representations, pp. 26, 2019. URL https://openreview.net/forum?id=SkEqro0ctQ.
Singh, C., Nasser, K., Tan, Y. S., Tang, T., and Yu, B. imodels: a python package for fitting interpretable models. Journal of Open Source Software, 6(61):3192, 2021. doi: 10.21105/joss.03192. URL https://doi.org/10.21105/joss.03192.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.
Strobelt, H., Webson, A., Sanh, V., Hoover, B., Beyer, J., Pfister, H., and Rush, A. M. Interactive and visual prompt engineering for ad-hoc task adaptation with large language models. arXiv preprint arXiv:2208.07852, 2022.
Tan, S., Caruana, R., Hooker, G., and Lou, Y. Distill-and-compare: Auditing black-box models using transparent model distillation. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 303–310, 2018.
Tan, Y. S., Singh, C., Nasser, K., Agarwal, A., and Yu, B. Fast interpretable greedy-tree sums (figs). arXiv:2201.11931 [cs, stat], 1 2022. URL http://arxiv.org/abs/2201.11931. arXiv: 2201.11931.
Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., and Stojnic, R. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
Tsang, M., Cheng, D., and Liu, Y. Detecting statistical interactions from neural network weights. arXiv preprint arXiv:1705.04977, 2017.
Wallace, E., Feng, S., Kandpal, N., Gardner, M., and Singh, S. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125, 2019.
Wang, B. and Komatsuzaki, A. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
Wang, X., Xu, X., Tong, W., Roberts, R., and Liu, Z. Inferbert: a transformer-based causal inference framework for enhancing pharmacovigilance. Frontiers in Artificial Intelligence, 4: 659622, 2021.
Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv, 2022.
Webson, A. and Pavlick, E. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247, 2021.
Zaidan, O. and Eisner, J. Modeling annotators: A generative approach to learning from annotator rationales. In Proceedings of the 2008 conference on Empirical methods in natural language processing, pp. 31–40, 2008.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Zhong, R., Lee, K., Zhang, Z., and Klein, D. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. arXiv preprint arXiv:2104.04670, 2021.
Zhong, R., Snell, C., Klein, D., and Steinhardt, J. Describing differences between text distributions with natural language. In International Conference on Machine Learning, pp. 27099–27116. PMLR, 2022.
Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., and Ba, J. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.
A. Appendix
A.1. Sentiment classification results
Table A1 shows the best prompt produced by each method for each sentiment dataset. iPrompt often learns to recreate salient examples from the dataset as prompts. Figure A1 shows loss across training steps for each method and dataset, across three random seeds. We see that AutoPrompt often finds a prompt with slightly lower loss on the training data, although its prompts lead to worse generalization, as reported in Table 4. Each training step represents a single word swap (in the case of AutoPrompt) or the truncation and generation of a new prefix (in the case of iPrompt).
Different from the other experiments in this paper, for sentiment classification we initialize AutoPrompt with random tokens rather than with the token "the" repeated, as we find AutoPrompt fails to find an effective solution for longer prefix lengths when all tokens are initialized to "the". To accommodate a complex input-output relationship, we test prompts of length 12 as well as length 6.
Accuracy is measured on the test set when available; otherwise, it is measured on a held-out 25% of the train set.
Table A1. Best-of-three prompts generated by each method on sentiment classification datasets. (Human-written prompts are best-of-eight and taken from PromptSource (Bach et al., 2022).)
| Task | Method | Prompt |
|---|---|---|
| Financial phrasebank | AutoPrompt | Fur resultolandgroundur augmented |
| Human-written prompt | How does the author of the news headline feel? | |
| iPrompt | <input> neutral> The result was due to: ” | |
| IMDB | AutoPrompt | uclear cend Koretravel NAACP curses SicAstings production received |
| Human-written prompt | The movie review in negative/positive sentiment is: | |
| iPrompt | This movie needs to be put up on my profile as my | |
| Rotten Tomatoes | AutoPrompt | Whether{ { anotherath;<|endoftext|> how |
| Human-written prompt | What sentiment does the writer express for the movie? | |
| iPrompt | what words would you try to add to help you express that | |
| SST-2 | AutoPrompt | BryceSpecificallyWASHINGTONRatedam |
| Human-written prompt | What is the sentiment expressed in this text? | |
| iPrompt | It is clear from the sentence that all three actors have something |
Figure A1. Loss plots for methods across sentiment analysis datasets, showing AutoPrompt and iPrompt across three random seeds.
A.2. Data/model details
Table A2. Details for each dataset. For details on Instruction induction, see (Honovich et al., 2022) and for details on Distribution differences, see (Zhong et al., 2021).
| Task name | Samples | Description | Example |
|---|---|---|---|
| fibonacci_one | 10 | Given an input x, return the xth fibonacci number. | Given the input x is 8, the output f(x) is 21.\n\n |
| double_one | 10 | Given an input x, return 2*x. | Given the input x is 6, the output f(x) is 12.\n\n |
| exp_one | 10 | Exponentiate the input to get the output. | Given the input x is 8, the output f(x) is 2980.96.\n\n |
| square_one | 10 | Square the input to get the output. | Given the input x is 2, the output f(x) is 4.\n\n |
| first_two | 100 | Return the first of the inputs. | Given the input numbers 7 and 8, the answer is 7.\n\n |
| add_two | 100 | Return the sum of the inputs. | Given the input numbers 9 and 7, the answer is 16.\n\n |
| subtract_two | 100 | Return the difference of the inputs. | Given the input numbers 5 and 4, the answer is 1.\n\n |
| divide_two | 100 | Return the quotient of the inputs. | Given the input numbers 2 and 7, the answer is 2/7.\n\n |
| multiply_two | 100 | Return the product of the inputs. | Given the input numbers 3 and 3, the answer is 9.\n\n |
| max_two | 100 | Return the maximum of the inputs. | Given the input numbers 1 and 1, the answer is 1.\n\n |
| task1191_food_veg_nonveg | 101 | Return whether the input food dish is vegetarian (yes or no). | Input: Haq Maas Answer: no\n |
| task1149_item_check_edible | 119 | Return whether the input item is edible (yes or no). | Input: vase Answer: no\n |
| task1146_country_capital | 231 | In this task, you are given a country name and you need to return the capital city of the given country | Input: Saint Pierre and Miquelon Answer: Saint-Pierre\n |
| task1147_country_currency | 232 | You are given a country name and you need to return the currency of the given country. | Input: Senegal Answer: CFA Franc BCEAO\n |
| task1509_evaluation_antonyms | 551 | In this task, you are given an adjective, and your job is to generate its antonym. An antonym of a word is a word opposite in meaning to it. | Input: paper Answer: scissor\n |
| task183_rhyme_generation | 999 | Given an input word generate a word that rhymes exactly with the input word. If not rhyme is found return "No" | Input: think Answer: sync\n |
| task107_splash_question_to_sql | 2031 | In this task you are expected to write an SQL query that will return the data asked for in the question. An SQL query works by selecting data from a table where certain conditions apply. A table contains columns where every row in that table must have a value for each column. Every table has a primary key that uniquely identifies each row, usually an id. To choose which columns are returned you specify that after the "SELECT" statement. Next, you use a "FROM" statement to specify what tables you want to select the data from. When you specify a table you can rename it with the "AS" statement. You can reference that table by whatever name follows the "AS" statement. If you want to select data from multiple tables you need to use the "JOIN" statement. This will join the tables together by pairing a row in one table with every row in the other table (Cartesian Product). To limit the number of rows returned you should use the "ON" statement. This will only return rows where the condition... | Input: What are the order ids and customer ids for orders that have been Cancelled, sorted by their order dates? Answer: SELECT order_id , customer_id FROM customer_orders WHERE order_status_code = "Cancelled" ORDER BY order_date\n |
| task088_identify_typo_verification | 6499 | The given sentence contains a typo which could be one of the following four types: (1) swapped letters of a word e.g. 'niec' is a typo of the word 'nice'. (2) missing letter in a word e.g. 'nic' is a typo of the word 'nice'. (3) extra letter in a word e.g. 'nicce' is a typo of the word 'nice'. (4) replaced letter in a word e.g. 'nicr' is a typo of the word 'nice'. You need to identify the typo in the given sentence. To do this, answer with the word containing the typo. | Input: A laege display of apples, pears, and oranges Answer: laege\n |
| task1336_gender_classifier | 6500 | Return the gender of the person in the input sentence. | Input: Justin made me feel discouraged. Answer: M\n |
| task092_check_prime_classification | 6500 | In this task, you need to output 'Yes' if the given number is a prime number otherwise output 'No'. A 'prime number' is a whole number above 1 that can not be made by multiplying other whole numbers. | Input: 9319 Answer: Yes\n |
Table A3. Models analyzed here.
| Model name | Huggingface identifier | Citation |
|---|---|---|
| GPT-2 (1.5B) | gpt2-xl | (Radford et al., 2019) |
| OPT (2.7B) | facebook/opt-2.7b | (Zhang et al., 2022) |
| GPT-Neo (2.7B) | EleutherAI/gpt-neo-2.7B | (Black et al., 2021) |
| Flan-T5 (3B) | google/flan-t5-xl | (Chung et al., 2022) |
| GPT-J (6B) | EleutherAI/gpt-j-6B | (Wang & Komatsuzaki, 2021) |
| OPT (6.7B) | facebook/opt-6.7b | (Zhang et al., 2022) |
| Galactica (6.7B) | facebook/galactica-6.7b | (Taylor et al., 2022) |
| GPT-NeoX (20B) | EleutherAI/gpt-neox-20b | (Black et al., 2022) |
| GPT-3 (175B) | text-davinci-002 (OpenAI API) | (Brown et al., 2020) |
We consider discriminators of varying sizes, with GPT-J (6B) as the prompt generator. We also compare generators of varying sizes, with GPT-J (6B) as the prompt discriminator. Models considered are of {125M, 1.3B, 2.7B, 6B} parameters from the GPT-Neo/GPT-J language model family. Results are shown in Fig. A3. Performance varies smoothly across model sizes, with the highest performance when using the largest model for both reranking and generation. Reranking appears more important than generation: when using a 1.3B-parameter model for generation, MRR drops only slightly, from 0.418 to 0.399, while when using a 1.3B-parameter model for reranking, MRR drops to 0.211. In general, prompt recovery performance improves smoothly with reranking model size.
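For reference, the mean reciprocal rank (MRR) metric used throughout these tables can be sketched as follows; the helper name and input encoding are ours.

```python
# Sketch of mean reciprocal rank: for each task, take 1/rank of the first
# correct prompt in the ranked candidate list (0 if none is correct),
# then average over tasks.
def mean_reciprocal_rank(ranked_correctness):
    total = 0.0
    for flags in ranked_correctness:  # one list of booleans per task
        rr = 0.0
        for rank, is_correct in enumerate(flags, start=1):
            if is_correct:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_correctness)

# Two tasks: correct prompt ranked 1st in the first, 4th in the second.
mrr = mean_reciprocal_rank([[True, False], [False, False, False, True]])
# → (1/1 + 1/4) / 2 = 0.625
```

An MRR of 1.0 thus means the top-ranked prompt was correct on every task, while values near 0 mean correct prompts were rarely found at all.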
Fig. A2 plots the progress of iPrompt across iterations, comparing runs on Math datasets (blue) to runs on ANLI datasets (gray). iPrompt makes most of its progress during the first 20% of training and then continues to slowly decrease the average loss. Running for more iterations on additional datapoints would likely further improve performance.
Figure A2. iPrompt performance across training, averaged across three random seeds and all tasks from Math datasets (Blue) and ANLI (Gray).
Figure A3. iPrompt performance across different size language models for the prompt proposal and reranking steps. Values are mean reciprocal rank of first accepted prompt averaged across 20 tasks and 3 random seeds.
Table A4. Performance of Galactica at prompt recovery, including DD datasets (Zhong et al., 2022; 2021).
| Metric | Dataset | iPrompt | AutoPrompt | Suffix |
|---|---|---|---|---|
| MRR | Math | 0.2 | 0.09 | 0.025 |
| | ANLI | 0.39 | 0.0025 | 0.085 |
| | Induction | 0.14 | 0.098 | 0.056 |
| | DD | 0.064 | 0.0082 | 0.066 |
| Correct | Math | 0.12 | 0.075 | 0 |
| | ANLI | 0.34 | 0 | 0.025 |
| | Induction | 0.071 | 0.087 | 0.02 |
| | DD | 0.043 | 0 | 0.052 |
| BLEU-Top Prompt | Math | 0.0073 | 0 | 0 |
| | ANLI | 0.01 | 0 | 0.00032 |
| | Induction | 0.022 | 0 | 0.0027 |
| | DD | 0 | 0 | 0.0015 |
| Task | AutoPrompt | iPrompt | Suffix |
|---|---|---|---|
| active to passive | (= 18 the the subst | Choose a pronoun for each sentence | Create a sentence or group of |
| add two | >:Returns Adding togetherFont accomplish | Create a function named 'sum | n>2 ml |
| antonyms | the beetheBut But | The noun to its opposite ( | The code to ascend. You |
| cause and effect | REG Kinect virginity developed mosquit The | What would each sentence be if | write programs that read through an |
| common concept | ???????? parted configuredthe ???????? | Find a noun that includes all | which is a common word used |
| diff | "Fair 62 disgust 92 81 | Find the difference between largest | Write a program or function to |
| divide two | soughtWomen surgicalthe Percentage treated | "Divide each digit by | write a program or function who |
| double one | says transit Farethe doubles dollars | Write a function called double_ | Given two function pointer A and |
| exp one | &&wl +# 123 270 Earthquake | Input this into your calculator ( | Type in number between 15 & |
| fibonacci one | baptipi produce347).'' | Implement a function to find Fib | Given an integer n (1 |
| first two | Binding decode wr detect shortest numeric | Find first digit of given number | When was Python added to Ubuntu |
| first word letter | Exception Ps< endoftext >the the | Make a program that reads in | nimshul, a |
| informal to formal | CLASSIFIEDthe themselves strongly Plays Chamber | These are questions on simple sentences | Make the following sentences positive statement |
| larger animal | ????????thethehethethe | What is the most common animal | dogAnswer to "What's |
| letters list | fluidsthetethethehethethe | Given the following list of tokens | The computer will make this document |
| max two | spendingthethehethethe | Implement a version of max() | Write code to find out given |
| multiply two | ruits="# multipl integer multiplied False | 'How do you multiply a | write a program or function who |
| negation | performs antiv Sizethe NULL NULL | I found these four mistakes below | Your friends think that you |
| num to verbal | irritatedhedd respectfully Protectivethe | Output each number below in the | The program outputs the first input |
| orthography starts with | nextbusiness wordevery morphpp | Name of two homophones | You will be given five words |
| rhymes | Steal batter dating: unfold testosterone | Find the missing word for all | Input [create] What |
| second word letter | i mascot okay kk | Who gave the answer "o | the United states government outlawed |
| sentence similarity | value $$$ Math | 3 (5 marks). The | Read five sentences about your topic |
| sentiment | positively optimistic&&& negative | I'm voting "negative" | Melvins at CBGB |
| singular to plural | Enhanced shorthand Lets pluralbetweenhe | Given a noun and its plural | 1. It may be |
| square one | Cal impl qApplySquare fiat | Input number and return its square | Write a program or function to |
| subtract two | ignorethethehethethe | Write a function to find difference | Given a non-negative integer |
| sum | Photosthetethethehethethe | Add two numbers together and then | The program outputs, without any |
| synonyms | Word termOn English meanings | what is a synonym for | Is there a cure for an |
| task088 identify typo verification | thethehethethe | This word scramble is to test | You wake up in the morning |
| task092 check prime classification | ropheospels&& Norestricted | Are these pairs of integers prime | Print the input numbers in order |
| task107 splash question to sql | How Do You Connect SQL To | To get into MySQL you first | |
| task1146 country capital | Ang Suppose AUTHhe beh Assassins | Which city is the capital and | France, England or the UK |
| task1147 country currency | aaaathecurrency Nib Sc | Ireland. Which currency is spoken | "I am working on a |
| task1149 item check edible | no the870830 yes | coffee and beans are fruits. | Which one of the following is |
| task1191 food veg nonveg | complicatedthe whether methamphetamine provided comp | Are you a vegetarian? | It could be any food, |
| task1336 peixian equity evaluation corpus gender classifier | < endoftext > -> M Fundamental FG Fav | Predict the gender (F = | ?????,???, |
| task1509 evaluation antonyms | contrad orously inverted ironically trans | find words with the opposite meaning | Record your input and answer, |
| task183 rhyme generation | quarterdream dug). Thro rhy | Mind vs Glee! There | what do you love to eat |
| taxonomy animal | programmingQ errorsBefore admitting mont | What are the most common animals | Each of these questions is a |
| translation en-de | H prob Hyper Forthe | You are a lawyer practicing in | This is an example of input |
| translation en-es | the ththebb volunt | please write English meaning in Spanish | Porque? |
| translation en-fr | IRthe< endoftext >thethe the | What is the French word for | Your code needs to deal with |
| word in context | ("nSame distinguishedthethe | Same and Not-Same - | What you will do is have |
| Task | AutoPrompt | iPrompt | Suffix |
|---|---|---|---|
| d3 0 | line contains this string? No | contains all 6 items, No | |
| d3 1 | Ghostbusterthe interrogation condition criminal | sentence contains "yes" or | string doesn't match any template |
| d3 10 | preceded Roosevelt nonexistentuphem_-_ Tw | message contains "no". No | contains all of these words or |
| d3 11 | caused senator prompt Recall interacted | string contains "No" or | was matched; output otherwise No |
| d3 12 | begin:" r "},{" contradict | tweet mentions yes | is true or output false if |
| d3 13 | },{" vote [*"]=> | answer "no" (or | contains all correct answers, No |
| d3 14 | nonexistent undead questions Enhance mandated no | string begins 'no' and | string contains any non blank white |
| d3 15 | rarely ----Question not), {" geometric | string contains "no" or | includes exactly two English words with |
| d3 16 | \n pearthemar Display RUN | text contains any "yes". | text is true, otherwise write |
| d3 17 | EMP Similarly\t=== charsthe | is an answer ("no", | contains all correct answers for this |
| d3 18 | \n\n Verb horm suffix Eucl | phrase starts with 'no', | contains all correct answers else No |
| d3 19 | \n."," Emacs strips colors strips | word starts with 'yes', | text contains any of these strings |
| d3 2 | indirectly [[ pervasive?"Spoiler exhaustive | ends with "yes". If | sentence has an "O" |
| d3 20 | \n\n dips Vote flower Ainthe | ted sentence contains both "yes | contains one of these words or |
| d3 21 | \nthePubLeft Abstract | ends with 'no'. No | contains all correct answers, or |
| d3 22 | Nov wholesno Eucl NO | can output no/yes, | data set contains results for output |
| d3 23 | vantage immediately recogn example nails 309 | no else output none? Input | contains data describing or referring to |
| d3 24 | noBER nonosRew [ | datum defines finite number fields | is in fact equal 2; |
| d3 25 | withdrawalsnob inher nob Among | contains both gene list data file | has already started in state x |
| d3 26 | Joined robberHigthe contradictionNarr | line ends with a space, | ted series matches any of these |
| d3 27 | verseoleon:- inferred cannabinoids | was positive answer and "No | string of words, as shown |
| d3 28 | \n repet999 REM=[nov | refers exclusively (only literally or | was a real question that could |
| d3 29 | \n Pat uncertaintiesMerit oppos | line begins with yes | text meets any one or more |
| d3 3 | \n\n887odynamHor mun\t | ends with "yes" and | statement reflects truth. Otherwise output |
| d3 30 | detainees gap ${. hardness | statement is false? Otherwise | is an example from each category |
| d3 31 | \n055 helium **** itching | phrase does not contain any words | given was false or not a |
| d3 32 | Afghthethe | matches either one of these strings | text is true, and write |
| d3 33 | le \r 253 | has a duplicate word. Correct | contains yes |
| d3 34 | the Carnegie allerg Qu the | no,no for (1 | was "The End" or |
| d3 35 | Hatch Land pri poker[[ Yah | would be a no (I | text can create a good argument |
| d3 36 | }, egregbyte?Sensor | matches exactly a "no". | string meets any, or exactly |
| d3 37 | noun441...? word first neg | question has an answer "no | string meets any, and write |
| d3 38 | wond <+ HELP"), ("InvalidOtherwise | says yes | "yes" has an |
| d3 39 | notnobbutthe but | reads like no. | answers "yes" for all |
| d3 4 | \n\n 760 consensualNarr Fog cabbage | sentence ends with "no". | string was a valid answer otherwise |
| d3 40 | modeXP/, \n but | question contains an actual "no | given was wrong or not relevant |
| d3 41 | opinions universitythe began followingawaru | sentence is grammatically correct, | equals to zero (i. |
| d3 42 | disqualified humor Ratings [ contradiction Moham | phrase represents something that is actually | has 1 out of 2 responses |
| d3 43 | \n\n saturated Phot misc | would be rightAnswer :no | was about a government regulation ( |
| d3 44 | \n <[ npm spaces1 | was "no": Input | was "yes" else false |
| d3 45 | \n\n pit VerbFalse Tok | string contains one "no". | text starts with "OK", |
| d3 46 | }, {" Neil kingthe | no when a string containing one | contains this string! Yes, |
| d3 47 | network intuitive 19 Lamp | sentence implies that no can mean | contains all digits, else No |
| d3 48 | nond307 Literally negativeJun corpor | conforms with known facts no | ted number from user base 5 |
| d3 49 | Falsethe Rect 802 | string contains "no" or | contains all of these words, |
| d3 5 | contradicts absurdity Luffythe neg answ | string 'no' appears as | is correct ; No otherwise |
| d3 50 | _____ WithNo", "hedon | mentions "no" (or | contains all correct items, No |
| d3 51 | \n\n 276WithNo noodles Cosponsors | reads "no" no else | given was no; not output |
| d3 52 | \n\n 225Should laure | string was 'no' and | string contains just one space. |
| d3 53 | never_{ Johns neo no | is all lower case answer 1 | was what I described above! |
| d3 6 | forbids Literally reminisNone negate | text contains any "no" | text contains Syrian |
| d3 7 | }, {" \r stringologically ${\ git | contains 'no' or output | text contains yes |
| d3 8 | unlikelyEitherselessletter Ches contradictory | sentence contains 'no' or | contains any newlines after matching |
| d3 9 | reactive happensMiddle lot Inc | matches any word (no is | text meets any, or none |
| Task | AutoPrompt | iPrompt | Suffix |
|---|---|---|---|
| active to passive | Transmission Electthe chromosome initialized empl | 4-way Multiple Choice | Is the context a good response |
| add two | addthe Hyper addi | In order to add two or | Given three real-valued inputs |
| antonyms | meet equilibration stiptertead asymmetry | What is the opposite of each | [T1] Question |
| cause and effect | shaking Dthethethe | Find clues as to why each | What do you think will happen |
| common concept | Bary techntbltbtbl Te | Where are all the animals? | What's the most common |
| diff | quartic digits shorter recreational genomics | Given two positive integers a and | What's the most efficient |
| divide two | manipulations comput iterationects quotients | The ratio of two real or | Given two different positive integers what |
| double one | roll Add Pingthe brakingthe | Determine how much money did Al | What's it like to |
| exp one | visc poplLSPLC Viscositythe | Given a number y and an | Find a formula for this linear |
| fibonacci one | start Attstrass Prim Polynomial emotions | \bigcirc m o | Write a function that gives an |
| first two | AICthethe Adethe | Solve using negative exponents? Explain | We have found it helpful to |
| first word letter | d rthe l c syllable | What is the last word? | the program {x. |
| informal to formal | Why unpredictable comprablyould Detecting | Yes! However, since we | Text-to-Text Data |
| larger animal | sharkoganopeanionaller descri | A question is given about three | Is the pair of animals on |
| letters list | microm phon te photothermal te te | How many 8 letter words | Given the following paragraph, indicate |
| max two | $$amater Penet credible b | How large was each of your | Is that as simple or complex |
| multiply two | aris visualthe Gibson multiplicative lexical | When we multiply two even or | What number divided by what other |
| negation | brood he Apparent denselythe FIG | What did these people have as | This time we do two prompt |
| num to verbal | Pixel lum sedimentary precedenceathion thousand | P(data answer) | Number pairs that are in the |
| orthography starts with | criptions geochemistry Harvey preprocessed Kus Cap | The correct verb after each input | Why did they choose this strategy |
| rhymes | hallucinations song cooperationcorner ask smear | Which phrase did "sea | My favorite food is a |
| second word letter | oderraj dialectath u o | What is the fourth letter | Is the object in this image |
| sentence similarity | false provleasteleast Apparently | I understand your definition correctly that | Chinese No Vote and Euro |
| sentiment | nominationegative<unk>indolinivalentpolar | What is the sentiment of a | What do you think will happen |
| singular to plural | mes sequthethethe | Find the pluralization of | Do you have any good ways |
| square one | AnalyticmassesAtomnamespace binning pow | Determine how much money did Al | What's it like to |
| subtract two | ComplexRemthe scienti Event | Given a variable called A whose | Is that close to your actual |
| sum | Horujanthethe | I'm trying to solve | Is the following number even? |
| synonyms | straightforward conceptual Striking Etymology tra | Can you think of a word | [T1], |
| task088 identify typo verification | Etymology nom scalesrolateral QMples | What is the plural form? | Other types Task Definition :: |
| task092 check prime classification | Accept No source Inter question | Q3_NoAnswerYes | Are there any types of chemical |
| task107 splash question to sql | Question answering Input #Name | Is the following SQL clause equivalent | |
| task1146 country capital | Outer Hassan wal Tu Spontaneous Qu | List the capital cities in each | The country that _____ |
| task1147 country currency | lthethestr the | Find the most common currency in | What currency was the first to |
| task1149 item check edible | nonthe Characterizing Nothe | Why is no answer | True or False, " |
| task1191 food veg nonveg | gue axiomsepid Output yes Birk | Are you a native speaker of | In a world where the Supreme |
| task1336 peixian equity evaluation corpus gender classifier | lineage Mthe knockdown Fthe | What is the gender of | Who is a good conversational partner |
| task1509 evaluation antonyms | Modern Carlson Weyl Linguistic counterfactual met | Find the opposite of each given | We can predict text from an |
| task183 rhyme generation | stellarthethe pl battle | The 6-letter word | We are given a dataset consisting |
| taxonomy animal | duoull Pap codebook varic lysozyme | When two objects collide and expl | What's the most common |
| translation en-de | shor Thanthe condens Intinte | Test for spelling error in word | Is the object of your activity |
| translation en-es | trophic Description params oscsthe | In Spanish, there are two | cuatro con la frec |
| translation en-fr | TT tic tgtythe Disk | Les champs du monde | What can the words in bold |
| word in context | " Tang samethe offOff | Identify similar phrases based on given | Does this sentence come from an |
| Task | AutoPrompt | iPrompt | Suffix |
|---|---|---|---|
| d3 0 | Alloy ReeABL vetotitledthe | satisfies sarcastic predicate; otherwise | is sarcastic, otherwise ignore |
| d3 1 | Cosm compositionallyind locom astro bfnm | and output share 82 | sentence describes or is related to |
| d3 10 | onso Seman NichentiVALID | paragraph does not contain any word | says the answer is yes on |
| d3 11 | enzo conspicuous Widespreadfeature cis orth | mention e does not match any | says that the United States president |
| d3 12 | assert unco Nog antich DesignsFOR | contained a negation phrase otherwise an | says that someone arrives or de |
| d3 13 | functionnoAns medi monos BAA | text contains no keywords and none | is valid, no otherwise. |
| d3 14 | E PotassiumtheANASS | the United Nations integrated multi | contains the context word or response |
| d3 15 | no Nons TRANS Trajectories Exclusionifying | phrase is not a noun; | example satisfies all rules, otherwise |
| d3 16 | TiHas Gomes immigPropthe | sentence contains the word no | mentions the answer and @US |
| d3 17 | spatiotemporal extragalactic conflicts forbidden | data includes at least one Sem | was true, and output false |
| d3 18 | formulAns revisit transcri neither | ends in no no | contain any formals in it |
| d3 19 | fatSPR Inhibitsickel nestedyes | is valid.Answer: no | text contains the word " |
| d3 2 | propositional ScalarAsp Attacks train Rabin | contain any of given words otherwise | |
| d3 20 | Sem adjunct DCT Eriks admissibleArg | is prochoice no otherwise | says something about abortion or human |
| d3 21 | scatterflows vettoriz pen | sentences contain both "no | sentence includes sexual, gender identity |
| d3 22 | yesoscopyGal martingale Yes epistemic | no. For ``yes | data satisfy certain conditions Otherwise No |
| d3 23 | NoELO predictors SBATCHvect | holds no otherwise [START_REF] Primordial | Predictive Models are Interpretable on |
| d3 24 | norist Investigating Nos tumorigenesis Bit | term "noisy inputs | follows the given probability density function |
| d3 25 | nopins bil field ensembles Locus | no output no yea Prom | says that neutrinos have been observed |
| d3 26 | NeuthePreftheDEthe | sentence is a negation; an | sentence includes "cutter |
| d3 27 | no Conditional abstract definiteLD | statement contains this word, and | says that certain events have happened |
| d3 28 | CIS raftriendrolimussubseteq | data contains feminism, and | says that are feminists |
| d3 29 | noAns Semantic neitherHamiltonian dissoci | text contains no, | says something against women or gender |
| d3 3 | nondec yes Census Tam Policies acyclic | IS semst; else, | says something against your religion on |
| d3 30 | itasenta Assim allergic Fraser | text contains answer=yes and | data includes y and n, |
| d3 31 | Strategy monitors Confl HaleFIELD Rhode | data contains a negative sentiment, | matches at least one of a |
| d3 32 | Regulates term Cliff steer VER Saskatchewan | mentions no and no | sentence includes a pronoun that refers |
| d3 33 | mut Congress SyntN weakhis | text contains the phrase yes | sentence includes a token for each |
| d3 34 | yes<fragments> Kohn povertyyes Circular | are based in movies. no | says that Erik has his |
| d3 35 | noon nonlocalakh no no s | question contains YesNo words like | movie was very good otherwise mark |
| d3 36 | describes nomoduleno RevealsAs | sentence does not contain a factor | text includes any unanswerable |
| d3 37 | penADOapineg autoclHAL | phrase no appears only | sentence has an answer. Otherwise |
| d3 38 | noNoEnabl complementation BIT Polar | question contains the phrase no, | says that certain language has more |
| d3 39 | Neuastro neur runaway suffixthe | utterance contains this phrase no | says something about your personality, |
| d3 4 | MULT semilinear unarybuffer Gior fate | sentence does not contain a modal | meets any condition given in Sem |
| d3 40 | outputs vigilance mK Unsupervised Status initial | data contains no and no else | correctly answers your question, otherwise |
| d3 41 | answ neph Membership Bess decomp neurop | equilibrium does not hold; no | does not contain either of x |
| d3 42 | Surveillance Semantics Obl Inhibits Hels MEL | string isn't in English | says that climate issues have worsened |
| d3 43 | AnsArg Zika spar | supports my belief no otherwise Input | follows the context; Otherwise output |
| d3 44 | wer: inducible affirm Abl reflex | contain any formals words or | |
| d3 45 | anal ERGsentence loopsyless | string does not occur in training | question were "Is there |
| d3 46 | GitHub Clevelandck negation RCC Microbial | contains no fake or misn | movie was released before year |
| d3 47 | ful eth massoc bis NA | debris affects doesnt have any | says that we need your assistance |
| d3 48 | \n Nons FernclassGridUHFFFAOYSA | holds for all possible inputs no | sentence includes a pronoun as well |
| d3 49 | noNo Imper Creating noPan | sentence contains no in | matches answer which will give correct |
| d3 5 | volat Salv Artificial economies fut Hale | prompt is followed by no | says that the output is a |
| d3 50 | failedkin ResDesMM | string does not contain any stop | says that wight is decreasing |
| d3 51 | bl Frederthe Novo phylogeneticthe | for "is my child | contains the context of your response |
| d3 52 | onasnono domainsex Quanti | phrase has the value no, | sentence includes something that will lead |
| d3 53 | onisenony anonh | includes the words no output will | contains at least two noun phrases |
| d3 6 | Alle substrthe Edmund Hos forks | answer no contains this word or | is a valid response and vice |
| d3 7 | Antithethe Blakethe | word is a negation of micro | sentence includes all possible answers Prom |
| d3 8 | Brand abolished affili attri Recon | corresponds with prompt question no | sentence is suitable Question for yes |
| d3 9 | Bou counterex abstnougin literal | question has answer no, output | is correct but maybe not relevant |
A.4. Experiment details / hyperparameters extended
Average-output suffix decoding LLMs themselves can be used directly to predict prompt strings. We give the model a context string containing a data example followed by a fill-in-the-blank template, e.g. In: 2 5 Out: 7. To compute the output from the input, ___, where In: 2 5 is the input $x^i$, Out: 7 is the output $y^i$, and the trailing clause is the template; we then sample the model's completion of the blank to recover a prompt.
Sampling directly from $f$ helps ensure that the generated explanation is fluent and semantically meaningful. We decode the output using beam search to find the highest-probability outputs for multi-token prompts.9 To improve on this approach, we place several examples into the model’s context and average the model’s output logits across all examples in the dataset before decoding, an approach we refer to as average-suffix decoding. However, we find that average-suffix decoding does not yield a performance improvement over straightforward decoding from a single sample with examples in the context: Fig. A4 shows that, on the ANLI datasets, the mean reciprocal rank for average-output sampling is not consistently higher than for single-output sampling across two different models.
Figure A4. Average suffix sampling versus individual-example suffix sampling does not improve performance (for ANLI datasets).
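For concreteness, the averaging step can be sketched in a few lines. This is a toy illustration rather than the paper's implementation: the hypothetical `logits_per_example` stands in for the LLM's next-token logits computed with each dataset example in the context, and decoding here is greedy rather than beam search.

```python
import numpy as np

def average_suffix_decode(logits_per_example):
    """Greedy sketch of average-suffix decoding: at each decoding step,
    average the next-token logits obtained with each example in context,
    then take the argmax token.

    logits_per_example: list of (n_steps, vocab_size) arrays, one per
    dataset example (a hypothetical stand-in for real LLM calls)."""
    stacked = np.stack(logits_per_example)    # (n_examples, n_steps, vocab)
    averaged = stacked.mean(axis=0)           # average over the dataset
    return averaged.argmax(axis=-1).tolist()  # greedy token ids
```

Single-output sampling would instead decode from one example's logits alone; averaging pools evidence from the whole dataset before committing to tokens.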
Hyperparameters for iPrompt and AutoPrompt This subsection describes the hyperparameters used when generating prompts on the Math, NLI, and sentiment tasks. For Math and NLI we considered prompts of length 6 tokens; for sentiment, length 16. For all iPrompt experiments we maintain 8 candidate explanations at each step and produce 4 new generations per candidate, for a total of 32 candidates; for a fair comparison, we also consider 32 candidates per step for AutoPrompt. We generate Math and NLI candidates over 5,000 training steps and sentiment candidates over 10,000 steps. We truncate examples to a maximum of 128 tokens. We measure the loss used for re-ranking (by both AutoPrompt and iPrompt) over the LLM’s full space of output tokens, i.e. we do not restrict the vocabulary to the label tokens for classification problems.
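The last point can be made concrete with a small numerical sketch (function names here are illustrative, not from the released code): the re-ranking loss normalizes over the full vocabulary, whereas a label-restricted loss would renormalize over only the label tokens.

```python
import math

def full_vocab_loss(logits, target_id):
    """Cross-entropy of the target token, normalized over the full
    output vocabulary (the kind of loss used for re-ranking)."""
    log_z = math.log(sum(math.exp(l) for l in logits))
    return log_z - logits[target_id]

def label_restricted_loss(logits, target_id, label_ids):
    """The alternative NOT used here: renormalize over label tokens only."""
    log_z = math.log(sum(math.exp(logits[i]) for i in label_ids))
    return log_z - logits[target_id]
```

When much probability mass falls on non-label tokens, the two diverge: the full-vocabulary loss penalizes that mass, while the restricted loss ignores it.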
Details of iPrompt Here we explicate the details of iPrompt. At each step, we generate a fixed number of mutations for each prompt in the population, plus an additional number of random generations to prevent the population from getting stuck in a local minimum. When we sample a new population, we sample from the best-performing prompts seen so far, as measured by a running-average zero-shot loss. To encourage diverse candidate prompts, we sample the population such that each candidate starts with a different token; in preliminary experiments, we found that enforcing distinct starting tokens helped promote more diverse and interpretable prefixes.
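The resampling step can be sketched as follows; `candidates` and `avg_loss` are hypothetical names (a list of prompt strings and their running-average zero-shot losses), not identifiers from the released code.

```python
def resample_population(candidates, avg_loss, k):
    """Keep the k best prompts seen so far (lowest running-average loss),
    admitting at most one prompt per distinct first token so the new
    population stays diverse."""
    population, seen_first_tokens = [], set()
    for cand in sorted(candidates, key=lambda c: avg_loss[c]):
        first = cand.split()[0] if cand.split() else cand
        if first in seen_first_tokens:
            continue  # enforce a different starting token per candidate
        seen_first_tokens.add(first)
        population.append(cand)
        if len(population) == k:
            break
    return population
```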
For generation, we sample directly from the LLM given the data concatenated with the string “\nPrompt:”. We sample with a temperature of 1 and do not use a truncation strategy such as nucleus sampling. For Math and NLI, we set the “repetition penalty” for generations to 2.0 to discourage copying from the training set; for the sentiment experiment, we reduce the repetition penalty to 1.0.
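The repetition penalty works by down-weighting tokens that have already been generated. A minimal sketch of the standard scheme (divide positive logits by the penalty, multiply negative ones) looks like this; it is illustrative and not taken from the released code:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=2.0):
    """Penalize tokens already present in the generation so far: positive
    logits are divided by the penalty and negative logits multiplied by it,
    shrinking the probability of repeats. penalty=1.0 is a no-op."""
    adjusted = list(logits)
    for tok in set(generated_ids):
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty
        else:
            adjusted[tok] *= penalty
    return adjusted
```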
Details of AutoPrompt We note several changes to AutoPrompt that were not mentioned in the original paper but were present in the original codebase, and that proved crucial in our implementation.
First, if we compute the top candidates over every position, the magnitude of the gradient is always highest at position 0, so AutoPrompt will prefer to make a swap at that position every time. To fix this issue, at each training step we randomly select a single token position to edit and consider word swaps only at that position.
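The random-position selection, together with the greedy accept-only-if-better rule, can be sketched in one step function; `candidate_tokens_fn` is a hypothetical stand-in for AutoPrompt's gradient-based candidate scorer, and `loss_fn` for a batch-loss evaluation.

```python
import random

def autoprompt_step(prompt_ids, candidate_tokens_fn, loss_fn, rng=random):
    """One modified AutoPrompt step: pick a single token position uniformly
    at random (avoiding the position-0 gradient bias), score candidate
    swaps only at that position, and keep the best swap only if it beats
    the current prefix's loss on the same batch."""
    pos = rng.randrange(len(prompt_ids))
    best_ids, best_loss = prompt_ids, loss_fn(prompt_ids)
    for tok in candidate_tokens_fn(pos):
        swapped = prompt_ids[:pos] + [tok] + prompt_ids[pos + 1:]
        swapped_loss = loss_fn(swapped)
        if swapped_loss < best_loss:  # only substitute on improvement
            best_ids, best_loss = swapped, swapped_loss
    return best_ids, best_loss
```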
Second, as described, AutoPrompt always takes one of the candidate substitutions, even when that candidate does not improve the loss compared to the current prefix. Instead, we only make a substitution if the candidate prefix loss is lower than the loss on the same batch computed with the current prefix.
9We prefer beam search over alternatives such as nucleus sampling (Holtzman et al., 2019) because we are interested in finding an accurate prompt description with as few samples as possible.
Finally, unlike the AutoPrompt implementation found online, we allow AutoPrompt to substitute any token, including special tokens and non-English characters.
To make AutoPrompt compatible with ranking-based metrics, we store the losses for each candidate ranked during training. At the end, we consider the “top prefix” to be the prefix with the lowest average loss during training that has been considered at least three times. This final criterion prevents candidates from the very end of training, which have only a few loss estimates, from being counted as the top prefix.
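This selection rule can be sketched as follows, with a hypothetical `loss_history` mapping each prefix to the list of losses recorded whenever it was ranked:

```python
def select_top_prefix(loss_history, min_count=3):
    """Return the prefix with the lowest mean recorded loss among prefixes
    ranked at least min_count times; prefixes from the very end of training
    with only a couple of loss estimates are ineligible."""
    eligible = {prefix: sum(losses) / len(losses)
                for prefix, losses in loss_history.items()
                if len(losses) >= min_count}
    return min(eligible, key=eligible.get)
```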
A.5. Galactica experiment details
Figure A5. Swiss-Prot (Bairoch & Boeckmann, 1991) protein keyword cooccurrences. To construct the Cyto and Binding datasets, we search for popular but non-cooccurring keywords.
A.6. fMRI experiment details
This section gives more details on the fMRI experiment analyzed in Sec. 6; for more scientific details, see the original study (Huth et al., 2016) and code (github.com/HuthLab/speechmodeltutorial). Sec. 6 analyzes data from one human subject in the original study, who listened to approximately two hours of narrative speech from the Moth Radio Hour, which consists of short autobiographical stories. The subject underwent fMRI scanning while listening, yielding a brain-volume scan of tens of thousands of voxels roughly every two seconds.
The individual voxel models described in Sec. 6 are each fit to 3,737 training points, each corresponding to a different time point (after accounting for various preprocessing steps, such as trimming the beginning and end of the sequence). They are evaluated on 291 test volumes, which come from a 10-minute story that was not seen during training.
Fig. A7 shows the generalization performance of the model for each voxel, measured by the correlation between the predicted response and the measured response. Some regions are predicted very poorly (black), but many voxels can be predicted quite well (bright).
Figure A6. Representations of the iPrompt-elicited concepts material (blue) and color (red) across the surface of the neocortex are spatially clustered and smooth. The left hemisphere corresponds to Fig. 5. Only the top 10,000 best-predicted voxels are shown; remaining voxels are shown in black. Plotted with pycortex (Gao et al., 2015).
Figure A7. Generalization performance for individual-voxel models, measured by the correlation between the prediction and the measured response.
Figure A8. Concepts are spatially localized in the brain maps: the variance between neighboring voxels is considerably lower than would be expected from shuffling the voxel values. Note that we take care to shuffle the map values only within the 10,000 top-predicted voxels, ignoring the poorly predicted voxels. Error bars (within the points) are standard errors of the mean.