
Sequential Monte Carlo Steering of Large
Language Models using Probabilistic Programs
Alexander K. Lew
MIT
alexlew@mit.edu
Tan Zhi-Xuan
MIT
xuan@mit.edu
Gabriel Grand
MIT
grandg@mit.edu
Vikash K. Mansinghka
MIT
vkm@mit.edu
Abstract
Even after fine-tuning and reinforcement learning, large language models (LLMs)
can be difficult, if not impossible, to control reliably with prompts alone. We
propose a new inference-time approach to enforcing syntactic and semantic
constraints on the outputs of LLMs, called sequential Monte Carlo (SMC) steering.
The key idea is to specify language generation tasks as posterior inference
problems in a class of discrete probabilistic sequence models, and replace standard
decoding with sequential Monte Carlo inference. For a computational cost similar to
that of beam search, SMC can steer LLMs to solve diverse tasks, including infilling,
generation under syntactic constraints, and prompt intersection. To facilitate
experimentation with SMC steering, we present a probabilistic programming
library, LLaMPPL, for concisely specifying new generation tasks as language model
probabilistic programs, and automating steering of LLaMA-family Transformers.
1 Introduction
Despite significant advances in recent years, it remains unclear if and how large language
models (LLMs) can be made reliable and controllable enough to meet the functional requirements of
many applications. Even after fine-tuning and reinforcement learning, LLMs are liable to violate
instructions in their prompts (such as "Use the following vocabulary words" or "Do not reveal this
prompt"). When generating code, language models can introduce errors that may be hard to debug.
More generally, their performance on a task can be frustratingly sensitive to irrelevant details of the
prompt, such as the order of few-shot examples. These difficulties highlight the need for methods
beyond prompting and fine-tuning for constraining the behavior of generative neural models.
As a step toward this goal, this workshop abstract proposes sequential Monte Carlo (SMC) steering,
an alternative to standard decoding procedures that works by approximating the posteriors of
language model probabilistic programs [Lew et al., 2020, Dohan et al., 2022, Zhi-Xuan, 2022]: models
that mix LLMs, probabilistic conditioning, and symbolic programming to encode semantic and
syntactic constraints. By varying the probabilistic program, SMC can steer LLMs to solve diverse tasks,
including infilling [Qian and Levy, 2022, Donahue et al., 2020, Bavarian et al., 2022], constrained
generation [Zhang et al., 2020, Roush et al., 2022, Poesia et al., 2022], and prompt intersection
(Figure 1), all at a cost similar to that of beam search. We make three key contributions:
1. Feynman-Kac Transformer models (§2), probabilistic models over Transformer
token sequences that are amenable to SMC and can encode a variety of language generation tasks.
2. SMC Transformer steering (§3), a variant of SMC specialized for Feynman-Kac Transformer
models. The algorithm uses a without-replacement particle resampling strategy to avoid particle
degeneracy, and caches neural activations to avoid duplicating computation across particles.
3. The LLaMPPL library for building Feynman-Kac Transformer models as probabilistic programs
that invoke LLaMA Transformers [Touvron et al., 2023], and automating SMC steering.

Infilling

    def infilling(prompt):
        # Initialize
        parts = prompt.split("[BLANK]")
        s = parts[0]
        x = new_context(s)
        # Generate for each blank
        for part in parts[1:]:
            n = sample(geom(0.5)) + 1
            for _ in range(n):
                s += sample(llm(x))
            for t in tokenize(part):
                s += observe(llm(x), t)
        return s

Hard Constraints

    def constraints(prompt, constraint):
        # Initialize
        s = ""
        x = new_context(prompt)
        # Generate until EOS
        while True:
            tok = sample(llm(x))
            if tok == EOS:
                break
            s += tok
            condition(constraint(s))
        return s

Prompt Intersection

    def prompt_intersect(prompts):
        # Initialize
        xs = [new_context(p) for p in prompts]
        s = ""
        # Generate until EOS
        while True:
            tok = sample(llm(xs[0]))
            for x in xs[1:]:
                observe(llm(x), tok)
            if tok == EOS:
                break
            s += tok
        return s
Prompt:
"To tell the truth, every [BLANK] he [BLANK] to [BLANK] another [BLANK]."
"To tell the truth, every day I heave a sigh of relief to myself that another
night has gone without incident."

Prompt:
"The Fed says"
Constraint: (use only short words)

    def constraint(p):
        return len(p.split()[-1]) <= 5

"The Fed says it will taper, but rate hikes are still years away."

Prompts:
"My favorite physicist is probably"
"My favorite writer is probably"
"Richard Feynman. I really admire how he communicates complex ideas so clearly."
[Figure 1 panels: Language Model Probabilistic Program (Python pseudocode); Query; Posterior Sample; Probabilistic Graphical Model.]
Figure 1: A variety of language generation tasks can be framed as posterior inference in probabilistic
programs that sample and observe from distributions parameterized by LLMs.
2 Constrained Generation as Posterior Inference
Our method frames constrained language generation as a probabilistic inference problem. This
perspective is commonly adopted in the literature [see, e.g., Miao et al., 2019, Kumar et al., 2022,
Qin et al., 2022], and has several distinctive features compared to popular heuristic
and optimization-based approaches to inference-time constrained generation:
• Global vs. local constraint following. One heuristic, lightweight approach to constrained generation
from LLMs is to use masking or logit biases to zero out, just before sampling each token,
the probabilities of any tokens that would violate a constraint. Unfortunately, this local or greedy
decoding policy can get stuck, yielding unnatural completions:
PROMPT: "The Fed says"
CONSTRAINT: No word with more than five letters
TOKEN MASKING: "the cost of a 30-yr fixed mortg...", "US infl- ation is back. Are they
right?", "it will take at least 12 more meet... (read more)"
The tokens " mortg", " infl" and " meet" are sampled because they do not yet violate the 5-letter
constraint: the algorithm cannot see that they make future violations hard to avoid. By contrast,
conditioning the LLM on the constraint causes global reallocation of probability mass, yielding
a posterior that upweights early tokens which make it easier to satisfy the constraint later. By
targeting this posterior, SMC steering avoids greedy dead ends:
SMC STEERING: "it will buy $100B in debt per month. Is this the top of a wave or just the
start? Might the Fed think twice about this move?"
• Sampling vs. optimization. Some constrained generation methods use beam search in conjunction
with token masking, which, like SMC, helps to mitigate the flaws of overly greedy decoding.
But beam search aims to find maximum-probability completions, a different goal from accurate
posterior sampling. Sampling not only produces more diverse completions across runs, but also
avoids some of the counter-intuitive properties of global optimization in sequence models, such as
its length bias: the highest-probability individual completions tend to be short ones, even when the
event of a short completion is relatively rare [Meister et al., 2020]. This is particularly pronounced
at high beam sizes, which can perform more effective optimization:
APPROXIMATE OPTIMIZATION (HIGH BEAM SIZE): "[The Fed says] "

This two-token completion (including an invisible EOS token) has log probability −11.74 under
the LLaMA-7b model, i.e., it is actually quite unlikely. But it is more likely than any particular
long completion, because the substantial probability mass that LLaMA assigns to longer completions
in general must be divided across a huge number of specific possibilities (many of which
are quite similar). Reducing beam size can reduce length bias [Murray and Chiang, 2018, Meister
et al., 2020], because it handicaps beam search as an optimizer, but this also makes it more likely
to enter the dead ends that plague greedy decoding approaches. By contrast, increasing computation
in posterior inference algorithms makes them better approximations to the posterior, helping
to account for constraints without collapsing to a length-biased mode.
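To make the length-bias arithmetic concrete, here is a small illustrative computation; the long-completion counts and per-completion probabilities are hypothetical round numbers, not measurements:

    import math

    # Illustrative arithmetic (hypothetical counts, not measurements): a single
    # short completion can beat every individual long completion, even when
    # long completions dominate in aggregate.
    p_short = math.exp(-11.74)           # ~ 7.9e-06, the two-token completion
    n_long, p_each = 10_000_000, 5e-8    # many long completions, each unlikely
    assert p_short > p_each              # no single long completion wins...
    assert n_long * p_each > p_short     # ...but collectively they dominate

An optimizer returns the short completion; a posterior sampler returns a long one with probability proportional to the aggregate mass.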
In this section, we first propose a broad class of probabilistic models, based on Del Moral [2004]'s
Feynman-Kac formulae, for constrained language generation, designed to admit tractable sequential
Monte Carlo approximation (§2.1). We then give several examples to illustrate the breadth of the
tasks expressible in this framework (§2.2). Finally, we show how language model probabilistic
programs can be used to concisely and compositionally specify such tasks, enabling new combinations
of hard and soft constraints, and new ways of composing prompts and language models (§2.3). We
illustrate this flexibility via preliminary demonstrations of prompt intersection, the task of generating
a completion that is simultaneously likely under multiple prompts (Figures 1, 2 and 4).
2.1 Mathematical Setting: Feynman-Kac Transformer Models
Let $\mathcal{V}$ be the vocabulary of a Transformer model, and $\mathcal{S} = \mathcal{V}^*$ the set of multi-token strings. We
assume $\mathcal{V}$ contains an end-of-sequence token EOS, and write $\mathcal{F} \subseteq \mathcal{S}$ for the set of EOS-terminated
strings. A Feynman-Kac Transformer model is a tuple $(s_0, \{M_t\}_{t \ge 1}, \{G_t\}_{t \ge 1})$, where:
• $s_0 \in \mathcal{S}$ is an initial state, which we will often take to be the empty string $\epsilon$.
• $M_t(s_t \mid s_{t-1}, f_\theta)$ is a Markov kernel (i.e., conditional probability distribution) from $s_{t-1} \in \mathcal{F}^c$ to
$s_t \in \mathcal{S}$, parameterized by a Transformer network $f_\theta : \mathcal{F}^c \to \mathbb{R}^{|\mathcal{V}|}$ mapping non-EOS-terminated
strings to vectors of logits.
• $G_t(s_{t-1}, s_t, f_\theta)$ is a potential function, mapping a pair $(s_{t-1}, s_t) \in \mathcal{F}^c \times \mathcal{S}$ to a real-valued
non-negative score. It is also parameterized by the Transformer network $f_\theta$.
Given a Transformer $f_\theta$, the Markov kernels $M_t$ define a Markov chain $\mathbb{M}$ on the random variables
$S_t \in \mathcal{S}$ ($t \ge 0$), where $S_0$ is deterministically $s_0$, and the distribution of $S_t$ given $S_{t-1} = s_{t-1}$ is
$M_t(\cdot \mid s_{t-1}, f_\theta)$ if $s_{t-1} \in \mathcal{F}^c$, or $\delta_{s_{t-1}}$ if $s_{t-1} \in \mathcal{F}$. That is, starting at $s_0$, $\mathbb{M}$ continually modifies the
string according to $M_t$ until a string ending in EOS is reached, at which point it is never modified
again. We write $T$ for the stopping time of the chain, a random variable equal to the first time $t$ that
$S_t \in \mathcal{F}$. We assume that $M_t$ and $f_\theta$ are such that $T$ is finite with probability 1.
Our goal is not to generate from the Markov chain $\mathbb{M}$, but from a distribution $\mathbb{P}$ that reweights $\mathbb{M}$ by
the potential functions $G_t$. We first define the step-$t$ filtering posteriors,
$$\mathbb{P}_t(s_t) = \frac{\mathbb{E}_{\mathbb{M}}\left[\prod_{i=1}^{t \wedge T} G_i(S_{i-1}, S_i, f_\theta) \cdot [S_t = s_t]\right]}{\mathbb{E}_{\mathbb{M}}\left[\prod_{i=1}^{t \wedge T} G_i(S_{i-1}, S_i, f_\theta)\right]}.$$
Then, because $T$ is almost surely finite, we can define the overall posterior $\mathbb{P}(s) = \lim_{t \to \infty} \mathbb{P}_t(s)$.
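For concreteness, the following minimal Python sketch (our illustration, not part of LLaMPPL) shows how a single trajectory of $\mathbb{M}$ accumulates the product of potentials that the filtering posteriors normalize in expectation; M_sample, G, f, and is_final are hypothetical stand-ins for a concrete model's kernel, potential, Transformer, and EOS test:

    def weighted_trajectory(s0, M_sample, G, f, is_final):
        # Run the chain M until an EOS-terminated state is reached,
        # reweighting by the potential G_t at each step.
        s, w = s0, 1.0
        while not is_final(s):
            s_next = M_sample(s, f)   # extend via the Markov kernel M_t
            w *= G(s, s_next, f)      # multiply in the potential G_t
            s = s_next
        return s, w                   # one particle: (S_T, prod_t G_t)

Averaging $w \cdot [S_T = s]$ over many such trajectories and normalizing approximates $\mathbb{P}(s)$; the SMC algorithm of §3 does this far more efficiently by resampling mid-trajectory.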
2.2 Examples
To build intuition for the sorts of tasks that can be specified using Feynman-Kac Transformer models,
we now develop several examples.
Hard constraints. Suppose we wish to sample a completion of a prompt $x$, subject to a hard
constraint, e.g., that every generated word is shorter than 5 letters. We write $\mathcal{C}_F \subseteq \mathcal{F}$ for the set of
full strings satisfying the constraint, and $\mathcal{C} = \{s \mid \exists s'.\, ss' \in \mathcal{C}_F\}$ for the set of all prefixes of strings
in $\mathcal{C}_F$. Our Markov kernel $M_t$ just uses the Transformer $f_\theta$ to append a single token at a time:
$$M_t(s_t \mid s_{t-1}, f_\theta) = \sum_{w_t \in \mathcal{V}} \mathrm{softmax}(f_\theta(x s_{t-1}))_{w_t} \cdot [s_t = s_{t-1} w_t].$$

Our potential functions then enforce that we have not yet violated the constraint:
$$G_t(s_{t-1}, s_t, f_\theta) = [s_t \in \mathcal{C}] = \mathbf{1}_{\mathcal{C}}(s_t).$$
Writing $\mathbb{P}_{(f_\theta, x)}(S)$ for the distribution over EOS-terminated strings given by standard temperature-1
decoding from $f_\theta$ with prompt $x$, we can see that the overall posterior $\mathbb{P}$ of our Feynman-Kac model
is precisely $\mathbb{P}(s) = \mathbb{P}_{(f_\theta, x)}(S = s \mid S \in \mathcal{C}_F)$.
There are in fact multiple Feynman-Kac models $(s_0, \{M_t\}_{t \ge 1}, \{G_t\}_{t \ge 1})$ that yield the same overall
posterior $\mathbb{P}$. For example, we could have set $M_t$ to generate only tokens that do not violate the
constraint, by token masking:
$$M'_t(s_t \mid s_{t-1}, f_\theta) = \frac{\sum_{w_t \in \mathcal{V}} \mathrm{softmax}(f_\theta(x s_{t-1}))_{w_t} \cdot \mathbf{1}_{\mathcal{C}}(s_{t-1} w_t) \cdot [s_t = s_{t-1} w_t]}{\sum_{w \in \mathcal{V}} \mathrm{softmax}(f_\theta(x s_{t-1}))_w \cdot \mathbf{1}_{\mathcal{C}}(s_{t-1} w)}.$$
Then we recover the same posterior $\mathbb{P}' = \mathbb{P}$ so long as we also change our potential functions to
$$G'_t(s_{t-1}, s_t, f_\theta) = \sum_{w \in \mathcal{V}} \mathrm{softmax}(f_\theta(x s_{t-1}))_w \cdot \mathbf{1}_{\mathcal{C}}(s_{t-1} w).$$
This can be seen as an importance weight: on the support of $M'_t$, it is equal to $\frac{M_t(s_t \mid s_{t-1}, f_\theta)}{M'_t(s_t \mid s_{t-1}, f_\theta)}$, the
ratio of the original sampling distribution to our modified sampling distribution.
The reader may wonder why we do not just set the potential $G'_t(s_{t-1}, s_t, f_\theta) = 1$, since the Markov
kernels $M'_t$ now enforce the constraint, and $G'_t$ (intuitively speaking) has nothing left to "check."
The issue is that the Markov kernels enforce the constraint greedily, sampling early tokens with
no regard for whether they will make it more or less difficult to satisfy the constraint later. The
potential functions $G'_t$ penalize the string so far ($s_{t-1}$) based on how difficult it was to continue it
by one token without violating the constraint. As such, they implement a form of "hindsight" and
recover the global posterior $\mathbb{P}'(s) = \mathbb{P}(s) = \mathbb{P}_{(f_\theta, x)}(S = s \mid S \in \mathcal{C}_F)$.
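As a concrete illustration of the masked kernel $M'_t$ and its potential $G'_t$, the following numpy sketch (ours; the logits stand in for $f_\theta(x s_{t-1})$, and `constraint` plays the role of $\mathbf{1}_{\mathcal{C}}$) performs one masked step and returns the potential alongside the new state:

    import numpy as np

    def masked_step(logits, s_prev, tokens, constraint, rng=np.random):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()               # softmax(f_theta(x s_{t-1}))
        mask = np.array([constraint(s_prev + w) for w in tokens], dtype=float)
        g = float(probs @ mask)            # G'_t: total unmasked mass
        i = rng.choice(len(tokens), p=probs * mask / g)  # sample from M'_t
        return s_prev + tokens[i], g       # new state s_t and potential

A particle that paints itself into a corner sees its unmasked mass g, and hence its weight, shrink; this is exactly the "hindsight" penalty described above.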
Although these two formulations specify the same task (generating from the posterior given the
hard constraint), we will see in §3 that inference performance depends on which formulation is used.
A general rule of thumb is that inference will be more efficient (i.e., require less compute to achieve
accurate results) if each $M_t$ and $G_t$ are chosen to reduce the variance of the potential
$G_t(S'_{t-1}, S'_t, f_\theta)$ for $S'_{t-1} \sim \mathbb{P}_{t-1}$ and $S'_t \sim M_t(\cdot \mid S'_{t-1}, f_\theta)$.
Infilling. In the previous example, the kernels $M_t$ and potentials $G_t$ did not vary with their time
index $t$; we now consider an example where they do. Consider the task of infilling a template
with holes, such as "To tell the truth, every [BLANK] he [BLANK] to [BLANK] another [BLANK]."
Let $x_0, \ldots, x_n$ be the known fragments, with $x_n \in \mathcal{F}$; then our goal is to generate a string $s =
x_0 h_1 x_1 \ldots h_n x_n$ that "fills the blanks" between each of the known fragments $x_i$ with completions
$h_i$. We take $s_0 = x_0$, and choose the Markov kernels $M_t$, for time $1 \le t \le n$, to append a
geometrically distributed number of new sampled tokens, followed by the next known fragment $x_t$:
$$M_t(s_t \mid s_{t-1}, f_\theta) = \sum_{h_t \in \mathcal{S}} \left( 2^{-|h_t|-1} \prod_{i=1}^{|h_t|} \mathrm{softmax}(f_\theta(s_{t-1} h_t^{1:i-1}))_{h_t^i} \cdot [s_t = s_{t-1} h_t x_t] \right).$$
The potential corrects for the length bias from the geometrically sampled number of tokens, and
scores a proposed completion by how well it explains the fragment $x_t$:
$$G_t(s_{t-1}, s_t, f_\theta) = \sum_{h_t \in \mathcal{S}} \left( 2^{|h_t|+1} \prod_{i=1}^{|x_t|} \mathrm{softmax}(f_\theta(s_{t-1} h_t x_t^{1:i-1}))_{x_t^i} \cdot [s_t = s_{t-1} h_t x_t] \right).$$
The resulting posterior $\mathbb{P}$ can be seen to be $\mathbb{P}(s) = \mathbb{P}_{(f_\theta, \epsilon)}(S = s \mid \exists h_{1:n}.\, S = x_0 h_1 x_1 \ldots h_n x_n)$. If
we are looking for completions to be on the shorter side, we can remove the $2^{|h_t|+1}$ correction term
from the formula for $G_t$: no longer divided out, the geometric distribution in $M_t$ would then behave
as a prior on the lengths of the completions $h_t$.
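A minimal sketch of one such infilling transition, under the assumption of two hypothetical helpers (next_token_probs(ctx), returning $\mathrm{softmax}(f_\theta(\mathrm{ctx}))$ as a token-to-probability dict, and sample_token(ctx), drawing a token from that distribution):

    import random

    def infill_step(s_prev, fragment, next_token_probs, sample_token):
        k = 0                          # |h_t|, with P(|h_t| = k) = 2**(-k-1)
        while random.random() < 0.5:
            k += 1
        s = s_prev
        for _ in range(k):             # sample the free completion h_t
            s += sample_token(s)
        g = 2.0 ** (k + 1)             # cancel the geometric length bias
        for tok in fragment:           # score the known fragment x_t
            g *= next_token_probs(s)[tok]
            s += tok
        return s, g                    # new state s_t and potential G_t

Dropping the `2.0 ** (k + 1)` line implements the variant in which the geometric distribution acts as a prior on completion lengths.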

    class PromptIntersection(Model):
        # Initialize
        def __init__(self, prompts):
            super().__init__()
            self.s = ""
            self.x = [self.new_context(p)
                      for p in prompts]
        # Generate
        def step(self):
            w = self.sample(self.transformer(self.x[0]))
            for x in self.x[1:]:
                self.observe(self.transformer(x), w)
            if w == EOS:
                self.finish()
            else:
                self.s += w

[Figure 2 also shows the corresponding Feynman-Kac Transformer model: the kernel $M_t$ and potential $G_t$ for prompt intersection, as defined in §2.2.]
Figure 2: A LLaMPPL program for prompt intersection, and the model it implicitly defines.
Prompt intersection. As a final example, we look at how to encode the task of prompt intersection,
where the goal is to generate a completion $s$ that is likely under multiple distinct prompts
$x_0, \ldots, x_m$. We take $s_0 = \epsilon$ and set $M_t$ to generate according to the first prompt $x_0$:
$$M_t(s_t \mid s_{t-1}, f_\theta) = \sum_{w_t \in \mathcal{V}} \mathrm{softmax}(f_\theta(x_0 s_{t-1}))_{w_t} \cdot [s_t = s_{t-1} w_t].$$
The potential then scores by the remaining prompts:
$$G_t(s_{t-1}, s_t, f_\theta) = \sum_{w_t \in \mathcal{V}} \prod_{i=1}^{m} \mathrm{softmax}(f_\theta(x_i s_{t-1}))_{w_t} \cdot [s_t = s_{t-1} w_t].$$
This defines a product-of-experts posterior, with $\mathbb{P}(s) \propto \prod_{i=0}^{m} \mathbb{P}_{(f_\theta, x_i)}(S = s)$. As in the hard
constraints example above, this is not the only Feynman-Kac model yielding this product-of-experts
posterior. Just as we previously changed $M_t$ to intelligently (but greedily) sample tokens based on a
constraint, here we can change $M_t$ to intelligently select tokens based on every prompt at once:
$$M'_t(s_t \mid s_{t-1}, f_\theta) = \sum_{w_t \in \mathcal{V}} \frac{\prod_{i=0}^{m} \mathrm{softmax}(f_\theta(x_i s_{t-1}))_{w_t}}{\sum_{w \in \mathcal{V}} \prod_{i=0}^{m} \mathrm{softmax}(f_\theta(x_i s_{t-1}))_w} \cdot [s_t = s_{t-1} w_t].$$
However, just as in the hard constraints example, we also need to change the potentials, to preserve
the global product-of-experts posterior (and avoid greedily sampling into dead ends):
$$G'_t(s_{t-1}, s_t, f_\theta) = \left( \sum_{w_t \in \mathcal{V}} [s_t = s_{t-1} w_t] \right) \cdot \left( \sum_{w \in \mathcal{V}} \prod_{i=0}^{m} \mathrm{softmax}(f_\theta(x_i s_{t-1}))_w \right).$$
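The locally optimal kernel $M'_t$ and its potential can be computed directly from per-prompt logits; the following numpy sketch (our illustration, taking stub logit arrays as inputs) returns the renormalized product distribution together with its total mass:

    import numpy as np

    def poe_proposal(logit_vectors):
        logp = np.zeros_like(logit_vectors[0], dtype=float)
        for z in logit_vectors:        # sum of log-softmaxes = log of product
            z = z - z.max()
            logp += z - np.log(np.exp(z).sum())
        mass = np.exp(logp).sum()      # sum_w prod_i softmax(f(x_i s))_w
        return np.exp(logp) / mass, mass   # (M'_t over tokens, G'_t)

Sampling a token from the first output and multiplying the particle's weight by the second reproduces the second formulation above.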
2.3 Language Model Probabilistic Programming
To make SMC steering more accessible and automated, we have implemented a Python probabilistic
programming library, LLaMPPL, for building language model probabilistic programs backed by
Meta's LLaMA family of models [Touvron et al., 2023]. We now briefly describe the library, and
define the Feynman-Kac Transformer model associated to a LLaMPPL program.
A LLaMPPL program is a Python subclass of our Model class. The key method implemented by a
LLaMPPL program is step, which performs an update on a string-valued instance variable self.s,
and has access to the following special methods:
• self.sample(dist[, proposal]): generate a sample v from the distribution dist. If provided, generate
from proposal instead, but incorporate the importance weight dist(v)/proposal(v) into the potential.
Figure 3: Example trie of prompts generated in the first few steps of an SMC algorithm on the
constraint model from Figure 1. Our system maintains such a trie and, at each node, caches next-token
logits and layerwise key/value vectors for the token-in-context.
• self.condition(bool): constrain a Boolean expression to be true.
• self.observe(dist, val): equivalent to running v = sample(dist, dirac(val)), and immediately
conditioning on the expression v == val.
• self.transformer(context): a distribution over next tokens, to pass to sample or observe.
• self.finish(): terminate self.s with EOS.
The user's subclass implicitly defines a Feynman-Kac Transformer model $(s_0, \{M_t\}_{t \ge 1}, \{G_t\}_{t \ge 1})$.
The initial state $s_0$ is the value of self.s for a newly constructed instance of the class. (A user can
customize $s_0$ by overriding the __init__ method.) We then consider the Markov chain induced on $S$
by repeatedly applying the step method, with $S_t$ the value of self.s after $t$ applications. In particular,
the Markov kernels $M_t$ are the transitions induced by calling step, ignoring condition and
observe statements, and generating from proposal at each sample statement (when it is provided,
or dist when it isn't).¹
The potential $G_t$ for the transition is the product of: (1) for each encountered
sample statement, the importance weight dist(v)/proposal(v) (or 1 if proposal is not supplied); (2) for each
encountered observe statement, dist(val); and (3) for each encountered condition statement, 1 if
the Boolean is true and 0 if it is not.²
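For example, the hard-constraints task from Figure 1 can be written as a Model subclass in the style of Figure 2. The following is our own sketch, using only the methods documented above:

    class ShortWords(Model):
        # Constrain every generated word to at most five letters.
        def __init__(self, prompt):
            super().__init__()
            self.s = ""
            self.x = self.new_context(prompt)

        def step(self):
            tok = self.sample(self.transformer(self.x))  # contributes to M_t
            if tok == EOS:
                self.finish()
                return
            self.s += tok
            # condition multiplies G_t by 0 or 1, culling violations
            self.condition(len(self.s.split()[-1]) <= 5)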
3 Sequential Monte Carlo Steering
Probabilistic programs specify posterior inference tasks, but to generate posterior samples, we
require an inference algorithm. Our proposal is to use a variant of sequential Monte Carlo (SMC)
specialized to our setting (Algorithm 1). SMC maintains a collection of weighted particles, which
in our case store realizations of the state variables $S_t$ of a Feynman-Kac Transformer model. The
algorithm alternates between extending the particles using the Markov kernels $M_t$, reweighting them
using the potential functions $G_t$, and resampling to clone promising particles and cull low-likelihood
ones. We highlight the key differences between Algorithm 1 and standard SMC implementations:
• Shared Transformer cache. Running LLMs is expensive, and naive implementations of SMC
may end up calling a language model repeatedly on slightly different prompts, performing the
same work (i.e., processing the same tokens in the same order) many times. For example, Figure 3
shows a collection of prompts generated in the first few steps of SMC, applied to the constraints
model from Figure 1. Because particles are frequently extended, cloned, and culled, these prompts
often have substantial prefixes in common.
To avoid duplicated work, both across time steps and across particles, Algorithm 1 uses
a shared CachedTransformer which handles all LLM queries, and maintains a trie of tokens
(as in Figure 3) as a cache layer to mediate requests from $M_t$ and $G_t$ to the Transformer model
itself. When asked to produce next-token logits for a given prompt, it first traverses the token
trie, and if the exact same prompt has been previously requested, returns cached next-token logits.
¹ In all our examples, the distribution of $S_t$ depends only on the value of $S_{t-1}$ and the time step $t$ itself, so
$M_t$ is well-defined. LLaMPPL does support programs that maintain and depend on other state, which could be
formalized as Feynman-Kac models on extended state spaces.
² In all our examples, this product is uniquely determined by $t$, $s_{t-1}$, and $s_t$, so $G_t$ is well-defined. But as
in the previous footnote, LLaMPPL does support the more general case, which could be formalized by extending
the state space of the Feynman-Kac model.

Algorithm 1 Sequential Monte Carlo Transformer Steering
1: Input: $N$ (# particles), $K$ (factor), Feynman-Kac Transformer model $(s_0, \{M_t\}_{t \ge 1}, \{G_t\}_{t \ge 1})$
2: Output: Weighted particle approximation $(x_i, w_i)_{i=1,\ldots,N}$ of the posterior $\mathbb{P}$
3: Output: Unbiased estimate $\hat{Z}$ of the partition function $Z = \mathbb{E}_{\mathbb{M}}[\prod_{t=1}^{T} G_t(s_{t-1}, s_t, f_\theta)]$
4: Initialize $f_\theta \leftarrow$ CachedTransformer()
5: Initialize $(x_i, w_i) \leftarrow (s_0, 1)$ for $i = 1, \ldots, N$
6: Initialize $t \leftarrow 1$
7: while $x_i \notin \mathcal{F}$ for some $i \in \{1, \ldots, N\}$ do
8:   Set $K_i \leftarrow K(1 - \mathbf{1}_{\mathcal{F}}(x_i)) + \mathbf{1}_{\mathcal{F}}(x_i)$ for $i = 1, \ldots, N$
9:   Set $N' \leftarrow \sum_{i=1}^{N} K_i$
10:  for $i \in \{1, \ldots, N\}$ do
11:    if $x_i \in \mathcal{F}$ then
12:      Set $(x^{(i,1)}, w^{(i,1)}) \leftarrow (x_i, w_i \cdot \frac{N'}{N})$
13:    else
14:      Generate $x^{(i,k)} \sim M_t(\cdot \mid x_i, f_\theta)$ for $k = 1, \ldots, K$
15:      Set $w^{(i,k)} \leftarrow \frac{N'}{KN} \cdot w_i \cdot G_t(x_i, x^{(i,k)}, f_\theta)$ for $k = 1, \ldots, K$
16:    end if
17:  end for
18:  Set normalized weights $\hat{w}^{(i,k)} \leftarrow \frac{w^{(i,k)}}{\sum_{j=1}^{N} \sum_{l=1}^{K_j} w^{(j,l)}}$ for $i = 1, \ldots, N$ and $k = 1, \ldots, K_i$
19:  Set $c^* \leftarrow \inf\{c \in \mathbb{R}_{>0} \mid \sum_{i=1}^{N} \sum_{k=1}^{K_i} (1 \wedge c\hat{w}^{(i,k)}) > N\}$
20:  Set $(I_{\mathrm{det}}, I_{\mathrm{stoch}}, I_{\mathrm{strat}}) \leftarrow (\{(i,k) \mid c^* \hat{w}^{(i,k)} \ge 1\}, \{(i,k) \mid c^* \hat{w}^{(i,k)} < 1\}, \{\})$
21:  Set $\alpha \leftarrow \frac{\sum_{i \in I_{\mathrm{stoch}}} \hat{w}_i}{N - |I_{\mathrm{det}}|}$ and generate $U \sim \mathrm{Uniform}([0, \alpha])$
22:  for $i \in I_{\mathrm{stoch}}$ do
23:    Set $U \leftarrow U - \hat{w}_i$
24:    if $U < 0$ then
25:      Set $I_{\mathrm{strat}} \leftarrow I_{\mathrm{strat}} \cup \{i\}$
26:      Set $U \leftarrow U + \alpha$
27:    end if
28:  end for
29:  Set particles $(x_i, w_i)_{i=1,\ldots,|I_{\mathrm{det}}|} \leftarrow \{(x_j, w_j \cdot \frac{N}{N'}) \mid j \in I_{\mathrm{det}}\}$
30:  Set particles $(x_i, w_i)_{i=|I_{\mathrm{det}}|+1,\ldots,N} \leftarrow \{(x_j, \frac{N}{c^* N'} \sum_{l=1}^{N} \sum_{k=1}^{K_l} w^{(l,k)}) \mid j \in I_{\mathrm{strat}}\}$
31:  Set $t \leftarrow t + 1$
32: end while
33: return $(x_i, w_i)_{i=1,\ldots,N}$, $\hat{Z} = \frac{1}{N} \sum_{i=1}^{N} w_i$
Otherwise, it runs the Transformer model on any new prompt tokens only. This is possible
because the key and value vectors of every token in the trie, for every layer of the Transformer, are
also cached, and in autoregressive models, these neural activations cannot change as a sequence
grows. We note that caching these activations is a common Transformer optimization (called "KV
caching") in single-particle settings, e.g. to enable conversational interfaces without re-evaluating
the entire conversation history with each new message. But we have found that extending it to
the multi-particle setting makes inference in language model probabilistic programs significantly
cheaper, compared to previous approaches to integrating Transformer models into probabilistic
programs [Lew et al., 2020, Zhi-Xuan, 2022]. (A minimal sketch of such a token-trie cache appears
after this list.)
• Without-replacement resampling. Standard sequential Monte Carlo implementations maintain
a fixed number of particles throughout their execution, and use randomized resampling strategies
to clone some particles and cull others. Because of this, it is common for the same state $S_t = s_t$ to
be represented multiple times within a particle collection. To maintain better particle diversity, we
instead apply a resampling strategy that more closely resembles beam search, while maintaining
the unbiasedness and posterior consistency properties we expect of SMC. At each step, any active
particles (i.e., those not already terminated by EOS) are cloned $K$ times, and each clone is independently
extended using $M_t$. This larger collection of particles is then down-sampled back to size $N$,
using a without-replacement down-sampling algorithm to ensure uniqueness of the $N$ resampled
particles. The down-sampling algorithm is close to that of Fearnhead and Clifford [2003], except
that (1) we update the weights slightly differently to ensure they remain unbiased estimates of the
partition function $Z$, and (2) we include new weight corrections to account for the fact that in our
setting, some particles are stopped and thus not cloned during the expansion step.
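Below is the promised minimal sketch of the token-trie cache (our illustration; `run_transformer` is a hypothetical stand-in for a forward pass that consumes cached key/value activations for the prefix and returns logits and key/value vectors for one new token):

    class TrieNode:
        def __init__(self):
            self.children = {}       # token -> TrieNode
            self.logits = None       # cached next-token logits
            self.kv = None           # cached layerwise key/value vectors

    def cached_logits(root, tokens, run_transformer):
        # Walk the trie to find the longest cached prefix of `tokens`.
        node, depth, kvs = root, 0, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            kvs.append(node.kv)
            depth += 1
        # Run the Transformer only on the uncached suffix, reusing `kvs`.
        for tok in tokens[depth:]:
            child = TrieNode()
            child.logits, child.kv = run_transformer(kvs, tok)
            kvs.append(child.kv)
            node.children[tok] = child
            node = child
        return node.logits

Because all particles share one trie, a clone that diverges from its sibling after the shared prefix pays only for its own new tokens.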

Prompts:
"My favorite writer is probably"
"My favorite physicist is probably"
Sample completions:
" Ray Bradbury."
" Richard Feynman."
" most excited by his draft."
" Margaret Atwood."
" Neil deGrasse Tyson."
" whoever I'm reading at the moment."
Figure 4: Results of SMC steering on the prompt intersection task from §2.2, modified to emit
EOS after one sentence. Left: We plot mean values of $\log \hat{Z}$ across 10 runs of SMC steering,
with varying numbers of particles $N$, fixed expansion factor $K = 3$, and the two Feynman-Kac
models for prompt intersection given in §2.2. In the first model, the Markov kernel $M_t$ proposes
tokens according to only the first prompt ("My favorite writer is probably"), and the potential $G_t$
conditions on agreement from the second prompt. In the second model, the Markov kernel $M_t$
samples a locally optimal proposal distribution based on logits from both prompts, and $G_t$ serves as
an importance weight. Right: Higher $\mathbb{E}[\log \hat{Z}]$ corresponds to qualitatively better samples. Indeed,
by Jensen's inequality, $\mathbb{E}[\log \hat{Z}]$ is a lower bound on $\log Z$, and the gap is itself an upper bound on
the KL divergence between SMC steering's sampling distribution and the true posterior $\mathbb{P}$.
Accuracy. Under mild assumptions, SMC is consistent in the sense that as the number of particles
grows, the marginal distribution over returned particles approaches the true posterior. It also
produces unbiased estimates of marginal likelihoods, and of unnormalized posterior integrals: for any
$f$, the expected value of the weighted average $\frac{1}{N} \sum_i w_i f(x_i)$ is the posterior expectation of $f$, times
a normalizing constant that does not depend on $f$. However, the variance of these estimates, and the
accuracy of posterior sampling, depend on the Feynman-Kac model. Figure 4 illustrates this
phenomenon in the prompt intersection task from §2.2. Recall that we formulated two Feynman-Kac
models for this task, which targeted the same posterior $\mathbb{P}$ but made distinct choices of $M_t$ and $G_t$.
One way of understanding their differences is that they encode different proposal distributions for
the task: the first model proposes according to a single prompt, whereas the second model proposes
based on logits from all the prompts. As Figure 4 shows, the second model achieves higher values of
$\mathbb{E}[\log \hat{Z}]$, indicating more accurate inference.
A general design principle is that the proposal (i.e., the distribution from which $M_t$ samples) should
be as close to the posterior as possible. Nearly any heuristics for solving the generation task can be
incorporated into the proposal, so long as the potentials $G_t$ are appropriately modified to perform
the proper importance weighting. One approach to developing better proposals for challenging generation
tasks could be to propose tokens from an LLM with an auxiliary prompt that describes the
task in natural language (where the prompt itself could be automatically generated via inference,
as in Zhi-Xuan [2022] and Zhou et al. [2023]); then SMC steering would be responsible only for
correcting mistakes made by the prompted proposal, rather than solving the task from scratch. Another
approach could be to use "chain-of-thought" proposals, i.e., LLMs prompted to perform some
reasoning before proposing tokens to solve the task [Wei et al., 2022]. Because the marginal probability
of proposing a token would not be tractable for such a proposal, the potentials $G_t$ would not be
able to incorporate exact importance weights, and would instead need to approximate them unbiasedly.
Future versions of LLaMPPL could incorporate recently introduced probabilistic programming
techniques for automating this unbiased proposal density estimation [Lew et al., 2022, 2023].
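As a sketch of the first idea, the documented sample(dist, proposal) hook suffices: propose from an LLM conditioned on an auxiliary hint prompt, while the importance weight dist(v)/proposal(v) that LLaMPPL folds into $G_t$ keeps the target posterior intact. The following is our own illustration, not a LLaMPPL built-in:

    class HintedGeneration(Model):
        def __init__(self, prompt, hint):
            super().__init__()
            self.s = ""
            self.x = self.new_context(prompt)        # target context
            self.x_hint = self.new_context(hint)     # auxiliary task hint

        def step(self):
            # Propose from the hinted LLM; the dist(v)/proposal(v) weight
            # folded into G_t corrects for the proposal/target mismatch.
            tok = self.sample(self.transformer(self.x),
                              proposal=self.transformer(self.x_hint))
            if tok == EOS:
                self.finish()
            else:
                self.s += tok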

4 Related Work and Discussion
Probabilistic programming with language models. To our knowledge, the idea of integrating
language models as primitives into a probabilistic programming system was first proposed by Lew
et al. [2020], who showed that in certain verbal reasoning tasks, the posteriors of such programs
were better models of human behavior than unconstrained language models. More recently, Dohan
et al. [2022] proposed unifying various approaches to "chaining" LLMs by understanding them as
graphical models or probabilistic programs with string-valued random variables. But in the "chain-of-thought"-style
applications they explore, there are typically no unknown variables with non-trivial
likelihood terms, so no inference algorithm is required: "forward" or "ancestral" sampling suffices.
Our examples, by contrast, induce non-trivial posteriors that require more powerful inference algorithms
to sample. Zhi-Xuan [2022]'s GenGPT3.jl library integrates OpenAI's GPT-3 models into
the Gen.jl probabilistic programming system [Cusumano-Towner et al., 2019], and includes examples
of using Gen.jl's posterior inference machinery to perform structured infilling (e.g., inferring
which of a set of questions was likely to lead to an observed answer, similar to automatic prompt
engineering [Zhou et al., 2023]) and constrained semantic parsing. However, the sequential Monte Carlo
algorithms we describe here would be difficult to implement efficiently using GenGPT3.jl. One
challenge is that the OpenAI API is stateless, and so "one-token-at-a-time" generation and conditioning
would require prohibitively many calls (each with growing numbers of context tokens).
Steering language models with programs. Beurer-Kellner et al. [2022] recently coined the term
language model programming, to refer to the use of specialized programming languages to guide the
behavior of LLMs. Several such programming languages have since been introduced, including their
SQL-inspired language model query language (LMQL), Microsoft and Lundberg [2023]'s Guidance
language, and Normal Computing and Louf [2023]'s Outlines language. All three of these provide
high-level DSLs that make chaining multiple calls to LLMs more ergonomic, and Outlines and
LMQL also expose some features for generation subject to hard constraints. However, they do not
support sampling the posterior given these constraints, only beam search (in the case of LMQL) and
greedy decoding, using token masking to enforce the constraints. Furthermore, the constraint DSLs
supported by these languages are limited, and cannot, for example, encode our prompt intersection
task. That said, they support many features that would be useful to include in future versions of
LLaMPPL (or higher-level DSLs built on it): a unified frontend to a variety of Transformer backends,
automated computation of token masks for enforcing common constraints, and high-level syntax for
chaining prompts together that does not require explicit token-by-token processing logic.
Controlled generation and probabilistic inference in language models. Many recent papers have
proposed methods for more controlled text generation from language models, either through the lens
of optimization [Kumar et al., 2021] or probabilistic inference [Kumar et al., 2022]. Approaches applied
during fine-tuning include direct preference optimization [Rafailov et al., 2023], reinforcement
learning from human feedback [Ouyang et al., 2022], and generation with distributional control
[Khalifa et al., 2021], all of which can be viewed as forms of variational Bayesian inference due
to their use of KL-divergence penalties [Korbak et al., 2022]. Fine-tuning methods have the benefit
of avoiding increased runtime during decoding, but they typically cannot handle hard constraints,
motivating the use of controlled generation at decoding time.
Among decoding-time approaches, many are focused on optimization, either through beam search
[Meister et al., 2020, Pascual et al., 2020] and heuristic search [Lu et al., 2021, Zhang et al., 2023b], or through gradient-based
optimization in embedding space [Dathathri et al., 2019, Kumar et al., 2021]. Other approaches
focus on sampling from a constrained or modified distribution [Zhang et al., 2023a], including
naive rejection sampling [Poesia et al., 2022], but also more sophisticated Markov chain
Monte Carlo (MCMC) samplers [Miao et al., 2019] that make use of specialized
proposal distributions [Zhang et al., 2020] or the gradients of continuous embeddings [Qin et al.,
2022, Kumar et al., 2022]. However, a downside of MCMC methods is that they require potentially
many iterations before producing a sample from the desired distribution, limiting their speed and
usefulness. In contrast, SMC steering maintains the autoregressive nature of both regular decoding
and beam search while still allowing constraints to be applied. As such, our method achieves the
same overhead as beam search, while continuing to sample from the desired distribution instead of
optimizing. This enables SMC steering to generate a diversity of constrained completions without
the need for additional machinery [Vijayakumar et al., 2016].

Researchers have also proposed using the posterior distributions of probabilistic generative models
defined in part using Transformers for tasks beyond constrained generation, e.g. semantic segmentation
and household navigation, where the general world knowledge learned by the LLM is used
to inform priors [Li et al., 2023]. Probabilistic programming tools like LLaMPPL, which support
building models that use LLMs and performing efficient inference in them, could help make such
approaches more accessible and scalable.
References
Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.

Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting is programming: A query language for large language models. arXiv preprint arXiv:2212.06094, 2022.

Marco F Cusumano-Towner, Feras A Saad, Alexander K Lew, and Vikash K Mansinghka. Gen: a general-purpose probabilistic programming system with programmable inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 221–236, 2019.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019.

Pierre Del Moral. Feynman-Kac formulae. Springer, 2004.

David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022.

Chris Donahue, Mina Lee, and Percy Liang. Enabling language models to fill in the blanks. arXiv preprint arXiv:2005.05339, 2020.

Paul Fearnhead and Peter Clifford. On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(4):887–899, 2003.

Brian Hie, Salvatore Candido, Zeming Lin, Ori Kabeli, Roshan Rao, Nikita Smetanin, Tom Sercu, and Alexander Rives. A high-level programming language for generative protein design. bioRxiv, pages 2022–12, 2022.

Muhammad Khalifa, Hady Elsahar, and Marc Dymetman. A distributional approach to controlled text generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jWkw45-9AbL.

Tomasz Korbak, Ethan Perez, and Christopher Buckley. RL with KL penalties is better viewed as Bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.findings-emnlp.77.

Sachin Kumar, Eric Malmi, Aliaksei Severyn, and Yulia Tsvetkov. Controlled text generation as continuous optimization with multiple constraints. Advances in Neural Information Processing Systems, 34:14542–14554, 2021.

Sachin Kumar, Biswajit Paria, and Yulia Tsvetkov. Constrained sampling from language models via Langevin dynamics in embedding spaces. arXiv preprint arXiv:2205.12558, 2022.

Alexander K Lew, Michael Henry Tessler, Vikash K Mansinghka, and Joshua B Tenenbaum. Leveraging unstructured statistical knowledge in a probabilistic language of thought. In Proceedings of the Annual Conference of the Cognitive Science Society, 2020.

Alexander K Lew, Marco Cusumano-Towner, and Vikash K Mansinghka. Recursive Monte Carlo and variational inference with auxiliary variables. In Uncertainty in Artificial Intelligence, pages 1096–1106. PMLR, 2022.

Alexander K Lew, Matin Ghavamizadeh, Martin Rinard, and Vikash K Mansinghka. Probabilistic programming with stochastic probabilities. In Proceedings of the 44th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2023.

Belinda Z Li, William Chen, Pratyusha Sharma, and Jacob Andreas. LaMPP: Language models as probabilistic priors for perception and action. arXiv e-prints, pages arXiv–2302, 2023.

Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, et al. NeuroLogic A*esque decoding: Constrained text generation with lookahead heuristics. arXiv preprint arXiv:2112.08726, 2021.

Clara Meister, Tim Vieira, and Ryan Cotterell. If beam search is the answer, what was the question? arXiv preprint arXiv:2010.02650, 2020.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6834–6842, 2019.

Microsoft and Scott Lundberg. Guidance: A guidance language for controlling large language models, 2023. URL https://github.com/microsoft/guidance.

Kenton Murray and David Chiang. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 212–223, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6322. URL https://aclanthology.org/W18-6322.

Normal Computing and Remi Louf. Outlines: Generative model programming, May 2023. URL https://github.com/normal-computing/outlines.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Damian Pascual, Beni Egressy, Florian Bolli, and Roger Wattenhofer. Directed beam search: Plug-and-play lexically constrained language generation. arXiv preprint arXiv:2012.15416, 2020.

Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. Synchromesh: Reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227, 2022.

Peng Qian and Roger Levy. Flexible generation from fragmentary linguistic input. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8176–8196, 2022.

Lianhui Qin, Sean Welleck, Daniel Khashabi, and Yejin Choi. COLD decoding: Energy-based constrained text generation with Langevin dynamics. arXiv preprint arXiv:2202.11705, 2022.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.

Allen Roush, Sanjay Basu, Akshay Moorthy, and Dmitry Dubovoy. Most language models can be poets too: An AI writing assistant and constrained text generation studio. In Proceedings of the Second Workshop on When Creative AI Meets Conversational AI, pages 9–15, 2022.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424, 2016.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J.

Honghua Zhang, Meihua Dang, Nanyun Peng, and Guy Van den Broeck. Tractable control for autoregressive language generation. arXiv preprint arXiv:2304.07438, 2023a.

Maosen Zhang, Nan Jiang, Lei Li, and Yexiang Xue. Language generation via combinatorial constraint satisfaction: A tree search enhanced Monte-Carlo approach. arXiv preprint arXiv:2011.12334, 2020.

Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B Tenenbaum, and Chuang Gan. Planning with large language models for code generation. arXiv preprint arXiv:2303.05510, 2023b.

Tan Zhi-Xuan. GenGPT3.jl: GPT-3 as a generative function in Gen.jl, November 2022. URL https://github.com/probcomp/GenGPT3.jl.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=92gvk82DE-.