PIPPET: A Bayesian framework for generalized entrainment to stochastic rhythms

When presented with complex rhythmic auditory stimuli, humans are able to track underlying temporal structure (e.g., a “beat”), both covertly and with their movements. This capacity goes far beyond that of a simple entrained oscillator, drawing on contextual and enculturated timing expectations and adjusting rapidly to perturbations in event timing, phase, and tempo. Here we propose that the problem of rhythm tracking is most naturally characterized as a problem of continuously estimating an underlying phase and tempo based on precise event times and their correspondence to timing expectations. We formalize this problem as a case of inferring a distribution on a hidden state from point process data in continuous time: either Phase Inference from Point Process Event Timing (PIPPET) or Phase And Tempo Inference (PATIPPET). This approach to rhythm tracking generalizes to non-isochronous and multi-voice rhythms. We demonstrate that these inference problems can be approximately solved using a variational Bayesian method that generalizes the Kalman-Bucy filter to point-process data. These solutions reproduce multiple characteristics of overt and covert human rhythm tracking, including period-dependent phase corrections, illusory contraction of unexpectedly empty intervals, and failure to track excessively syncopated rhythms, and could plausibly be approximated in the brain. PIPPET can serve as the basis for models of performance on a wide range of timing and entrainment tasks and opens the door to even richer predictive processing and active inference models of rhythmic timing.

and an inhomogeneous point process that generates events at rate λ(φ), a function of phase. We will refer to λ(φ) as a "temporal expectation template," though it can also be understood as a hazard function for events.

The temporal expectation template.
In the PIPPET/PATIPPET generative model, λ(φ) represents the instantaneous rate of events occurring when the underlying temporal process is at phase φ. This is assumed to be a sum of Gaussian-shaped functions with means φ_i representing the phases at which specific events are expected, variances v_i representing the inverse of the temporal precision expected of those events, and scales λ_i representing the strengths of the expectations. A constant λ_0 is also added, representing the instantaneous rate of events unrelated to the underlying phase.
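As a concrete illustration, the template can be written as a small function. This is a hypothetical sketch: the function name and the parameter values are ours, not taken from the paper's simulations.

```python
import math

def expectation_template(phi, phases, scales, variances, lam0):
    """Temporal expectation template lambda(phi): a constant background rate
    lam0 plus one Gaussian bump per expected event phase phi_i, with scale
    lam_i (expectation strength) and variance v_i (inverse precision)."""
    rate = lam0
    for phi_i, lam_i, v_i in zip(phases, scales, variances):
        rate += lam_i * math.exp(-(phi - phi_i) ** 2 / (2 * v_i)) \
                / math.sqrt(2 * math.pi * v_i)
    return rate

# Illustrative template: two expected events; the first is expected more
# strongly (larger scale), both with the same precision.
phases, scales, variances, lam0 = [0.5, 1.0], [1.0, 0.5], [0.001, 0.001], 0.01
```

Far from any expected phase, the rate falls back to the background λ_0, so an event there is attributed to phase-unrelated noise rather than to the rhythm.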
At each event, the distribution of phase and tempo is discontinuously updated to a 2D Gaussian posterior, which evolves continuously between events; the posterior is projected onto a Gaussian at each dt time-step by moment-matching. This scheme is similar to [30], which estimates phase and tempo by updating a 2D Gaussian posterior, but is updated in continuous time and is significantly more flexible in its capacity to track phase based on arbitrary temporal expectation templates.

2.3 PIPPET with multiple event streams (multi-PIPPET)

Finally, we generalize PIPPET to include multiple types of events (indexed by j), each generated as a point process with rate determined by a function λ_j(φ) of a single underlying phase. The Kalman-Bucy estimate of phase for this model is described by mean µ and variance Σ evolving according to the ODE above and resetting to µ_{t+} = μ̄_j and Σ_{t+} = Σ̄_j when an event occurs in stream j, where we define Λ̄_j, μ̄_j, and Σ̄_j as we defined Λ̄, μ̄, and Σ̄ above but in reference only to event stream j.
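The event-time reset can be computed in closed form because each Gaussian expectation peak multiplies the Gaussian prior into another (scaled) Gaussian. Below is a minimal sketch of that update for a single event stream; the function names and illustrative parameter values are ours.

```python
import math

def normpdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def event_update(mu, Sigma, phases, scales, variances, lam0):
    """Moment-matched posterior (mu_bar, Sigma_bar) at an event, given a
    Gaussian prior N(mu, Sigma) and template lam0 + sum_i lam_i N(phi_i, v_i).
    The posterior is a mixture: one component per expectation peak (product
    of prior and peak) plus a background component (event unrelated to phase)."""
    weights, means, variances_ = [lam0], [mu], [Sigma]
    for phi_i, lam_i, v_i in zip(phases, scales, variances):
        K_i = 1.0 / (1.0 / v_i + 1.0 / Sigma)           # product-of-Gaussians variance
        means.append(K_i * (phi_i / v_i + mu / Sigma))  # product-of-Gaussians mean
        variances_.append(K_i)
        weights.append(lam_i * normpdf(phi_i, mu, v_i + Sigma))
    Lam_bar = sum(weights)
    mu_bar = sum(w * m for w, m in zip(weights, means)) / Lam_bar
    Sigma_bar = sum(w * (v + (m - mu_bar) ** 2)
                    for w, m, v in zip(weights, means, variances_)) / Lam_bar
    return mu_bar, Sigma_bar
```

An event arriving slightly before a precisely expected phase pulls the mean most of the way toward that phase and sharpens the posterior.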

The same adjustment can be made to the PATIPPET generative model, and the PATIPPET filter can be similarly generalized to account for multiple event streams.
In this section we conduct a series of simulations to explore parallels between the behavior of the PIPPET and PATIPPET filters and human entrainment.

Parameters for these simulations are listed in Appendix 6.2. The PIPPET framework describes entrainment to "stochastic" rhythms in which each expected event phase may or may not be populated by an event. Further, PIPPET is formulated in sufficient generality to model entrainment to a very wide range of rhythmic structures with any degree of predictability.

Figure 2: Characterizing PIPPET's behavior at events. A) Phase transition curve for PIPPET with expectation of three isochronous events. Note that events occurring when the phase estimate µ_{t−} is between expected event phases φ_i have little corrective effect on the posterior mean phase µ_{t+}, as indicated by a diagonal phase transition curve, whereas events occurring when the estimated phase is near an expected event phase tend to draw the phase estimate toward the expected phase, as indicated by plateaus in the phase transition curve. B) Phase and variance response curves. Note that events occurring when estimated phase is very close to an expected event phase cause the variance of the posterior on phase to decrease, whereas events occurring slightly offset from an expected event phase cause the variance to increase. Events occurring far from any expected event phase have little effect on posterior variance.
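The plateaus and diagonal segments of the phase transition curve can be reproduced directly from the event update. The following is a hypothetical sketch with a single expected phase; names and parameter values are ours.

```python
import math

def normpdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_mean(mu, Sigma, phi_e=1.0, lam_e=1.0, v_e=0.001, lam0=0.01):
    """Posterior mean phase after an event, for a template with one expected
    phase phi_e (scale lam_e, variance v_e) plus background rate lam0."""
    K = 1.0 / (1.0 / v_e + 1.0 / Sigma)            # product-of-Gaussians variance
    m_e = K * (phi_e / v_e + mu / Sigma)           # product-of-Gaussians mean
    w_e = lam_e * normpdf(phi_e, mu, v_e + Sigma)  # weight of the event peak
    return (lam0 * mu + w_e * m_e) / (lam0 + w_e)

# Phase transition curve: posterior mean as a function of pre-event mean.
curve = {mu: posterior_mean(mu, 0.01) for mu in [0.2, 0.5, 0.9, 0.97, 1.0]}
```

Pre-event means far from the expected phase are barely corrected (the diagonal), while pre-event means near the expected phase are drawn onto it (the plateau).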

As an example of entrainment to a stochastic rhythm based on a temporal structure with non-integer duration ratios, we simulated entrainment to a swing rhythm. The rhythm is based on an underlying grid of "swung" eighth notes, where the first eighth note of every pair is given a slightly longer duration than the second. Though the "swing" feel is often caricatured using eighth note pairs with a 2:1 duration ratio, this value has been shown to vary by player and tempo and is certainly not limited to small integer ratios [33]. We used a temporal expectation template with a swing ratio close to 3:2 and associated the first eighth note in each pair with a stronger expectation than the second.

When an event is strongly expected but no event occurs, an optimal Bayesian observer should initially be biased to believe that, in spite of their current estimate, the stimulus may not have reached the expected event phase yet. When scaling up λ, PIPPET's behavior at each event was unchanged; however, when strongly expected events were omitted, the mean phase estimate slowed down at each expected event phase, leading to an overall slowing in estimated phase advance (Figure 5).

Figure 4: Too much syncopation causes rhythm tracking failure. Syncopation combined with imprecise and weak timing expectations at weak time points can lead to a failure to track phase accurately. In this example, phase uncertainty Σ increases over a long silence. At the next event, this high uncertainty leads the model to partially attribute a weakly expected event to the nearby phase at which an event is strongly expected. As a result, the model ends up aligning the fifth event with a strong phase rather than a weak one.
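The slowing at an omitted event can be simulated by Euler-integrating the between-event dynamics, in which the absence of a strongly expected event pushes the mean estimate backward. This sketch assumes moment dynamics of the form dµ = dt − Λ̄(μ̄ − µ)dt and dΣ = σ²dt − Λ̄(Σ̄ + (μ̄ − µ)² − Σ)dt, which follow from the filter's event-conditional quantities Λ̄, μ̄, Σ̄; the parameter values are ours.

```python
import math

def normpdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bar_moments(mu, Sigma, phi_e=1.0, lam_e=2.0, v_e=0.001, lam0=0.01):
    """Event-conditional moments (Lam_bar, mu_bar, Sigma_bar) for a template
    with a single expected phase phi_e plus background rate lam0."""
    K = 1.0 / (1.0 / v_e + 1.0 / Sigma)
    m_e = K * (phi_e / v_e + mu / Sigma)
    w_e = lam_e * normpdf(phi_e, mu, v_e + Sigma)
    Lam = lam0 + w_e
    mu_bar = (lam0 * mu + w_e * m_e) / Lam
    Sigma_bar = (lam0 * (Sigma + (mu - mu_bar) ** 2)
                 + w_e * (K + (m_e - mu_bar) ** 2)) / Lam
    return Lam, mu_bar, Sigma_bar

# Integrate one unit of event-free time across the expected phase 1.0.
mu, Sigma, dt, sigma_phase = 0.5, 0.01, 1e-3, 0.05
for _ in range(1000):  # no events observed in this interval
    Lam, mu_bar, Sigma_bar = bar_moments(mu, Sigma)
    mu += dt - Lam * (mu_bar - mu) * dt
    Sigma += sigma_phase ** 2 * dt - Lam * (Sigma_bar + (mu_bar - mu) ** 2 - Sigma) * dt
```

The estimate lingers before the expected phase longer than it races past it afterward, so the net advance over the interval is less than the elapsed time.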

There is evidence of such an effect in human perception. The "filled duration" illusion is the impression that an isochronous sequence has changed tempo when it is initially subdivided by additional predictable events and then the subdivisions are removed.

Figure 5: Black curve tracks the estimated mean phase µ over time. Red lines mark event times; blue lines mark expected event phases. Grey shading represents uncertainty about phase, quantified in the model as variance Σ and displayed by shading two standard deviations up and down. PIPPET is given strong expectations for four isochronous events. Above: when the strongly expected events occur as expected, mean phase stays on track, advancing (on average) at a rate of 1. Below: the first three expected events are omitted. When the strongly expected events do not occur, the advance of µ slows around each expected event phase and then speeds back up. On average over the interval, µ advances at a rate slower than 1. As a result, when the fourth event does occur at time t = 1, it occurs when µ_t is still substantially short of µ = 1. The event is thus perceived as occurring at an earlier phase than expected.

PATIPPET was initialized with a broad prior over a range of tempi, with the true tempo near the upper end of that range. In these conditions and with the parameter set we chose, the model established the appropriate tempo and phase to within a tight range over the course of the first two events (Figure 6).

In addition to its value as a model of human rhythmic cognition, the PATIPPET filter shows promise as a general-purpose tempo tracking algorithm. In tapping experiments, when one event in an isochronous metronome sequence is phase-shifted, subjects adjust the timing of their next tap to correct for some fraction of the shift. This fraction is called α. In human subjects, α has repeatedly been observed to increase linearly with metronome period ("inter-onset interval," or IOI), exceeding 1 (i.e., over-correction) for sufficiently long IOIs [42, 43].

The PIPPET framework offers a principled explanation for α increasing with IOI. During an event-free interval, phase uncertainty increases over time.
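The core of this explanation can be captured in a one-line scalar sketch (a simplification that ignores the tempo dimension; the names and values are ours): variance accumulates between events at the drift rate, and the correction applied at the next event is a Kalman-style gain that grows with the accumulated variance.

```python
def correction_gain(ioi, post_var=0.001, drift_var=0.02, v_expect=0.002):
    """Fraction of a timing error corrected at the next event: the prior
    variance grows linearly over an event-free interval of length ioi, and
    the gain is prior_var / (prior_var + v_expect)."""
    prior_var = post_var + drift_var * ioi
    return prior_var / (prior_var + v_expect)
```

Longer IOIs therefore give larger corrections; when tempo uncertainty also feeds into the next predicted event time (as in PATIPPET), the total correction can exceed the phase gain alone, consistent with α > 1.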

When an event does occur, the precision of the prior distribution on phase and tempo is therefore lower for longer IOIs, and the resulting larger corrections can account for α values above 1. However, it has been previously suggested that α may

Figure 6: Here, PATIPPET is initialized with a high variance in its estimate of tempo. The first event occurs relatively early, causing the posterior mean tempo µ_θ to increase. Each subsequent event occurs close to the time expected based on the mean estimated phase µ and tempo µ_θ, causing the posterior to contract in both the phase and tempo directions as its prediction of event time is fulfilled and its phase and tempo estimates are corroborated. Ultimately, PATIPPET settles on a narrow distribution around the appropriate tempo as it continues to accurately estimate phase.

Figure 7: The distribution on phase and tempo leading up to and following a phase shift at the fourth event in an isochronous sequence for two different metronome tempi (i.e., two different inter-onset intervals). See Figure 6 for color key. Note that when the IOI is short, PATIPPET arrives at the phase-shifted event with a high degree of phase and tempo certainty. C) PATIPPET makes a proportionally larger correction to phase and tempo for long IOIs than for short IOIs due to the greater degree of uncertainty preceding each event. D) Alpha (α) is the proportion of a phase shift that is corrected at the next tap time. With this set of parameters, PATIPPET reproduces the empirical observation from [43] that the phase shift is undercorrected (α < 1) when IOIs are short and overcorrected (α > 1) when IOIs are long.

In this illustration, bass drum hits are expected more strongly on the first of each cycle of four eighth notes, and are expected with high timing precision such that misplaced bass drum hits will exert a strong influence on phase. Snare drum hits are expected more strongly on the third eighth note of each cycle, and are expected with higher variance such that a misplaced snare hit exerts less influence on estimated phase. Hi-hat hits are evenly expected across all eighth note positions, but they are expected with low precision, so misplaced hi-hat hits will not exert a strong influence on estimated phase.
types. Thus, we could create a forward model in which it is more likely for notes

If the brain is indeed performing an optimal estimation of phase and tempo, then this estimate should be legible in neural activity somewhere in the brain.

The brain must also learn noise and precision parameters for the model. Note that these parameters may differ across event streams, even different instruments (e.g., kick drum, snare, hi-hat, as discussed above).

The precision of a beat-based temporal expectation is closely related to the width of a "beat bin," the window of time (rather than a single time point) that is proposed to constitute the "beat" in [67], and to the width of the temporal "expectancy region" described in dynamic attending theory [10]; in both cases, this width is increased by imprecision in the immediately preceding stimulus.

When the brain is exposed to a rhythmic stimulus, it must first recognize that a predictable pattern exists and select an appropriate temporal expectation template from its learned repertoire. This is its own process of inference, and may be amenable to a Bayesian description. Since the PIPPET filter maintains

An event can only be experienced after it occurs, so (as pointed out in [24]) the likelihood function on underlying phase associated with this type of uncertainty should be asymmetrical. The analytically tractable incarnation of our framework presented here uses Gaussian likelihood peaks, and so cannot account for the effect of asymmetrical likelihoods; however, we could posit a λ function with asymmetrical peaks and use numerical methods rather than the explicit solution derived here to estimate underlying phase at each time step.

Other extensions include modification of the generative model, e.g., introducing the belief that tempo changes occur in jumps or ramps rather than as random drift, or modification of the objective of the task, e.g., by including additional cost functions or priors associated with perceptual report or motor output as discussed above.
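As a sketch of that numerical route (the grid resolution, the asymmetric peak shape, and all values here are our own illustrative choices, not the paper's): discretize phase, multiply the prior by an asymmetric λ peak when an event occurs, and renormalize.

```python
import math

# Phase grid on [0, 2), fine enough for a smooth posterior.
N = 2000
grid = [2.0 * i / N for i in range(N)]

def asym_peak(phi, center=1.0, tau_left=0.02, tau_right=0.08, lam0=0.01):
    """Expectation peak with a heavier right tail (two-sided exponential),
    standing in for the asymmetric likelihoods discussed in the text."""
    d = phi - center
    tau = tau_right if d >= 0.0 else tau_left
    return lam0 + math.exp(-abs(d) / tau) / tau

# Gaussian prior on phase, then a Bayes update at an observed event.
prior = [math.exp(-(p - 1.0) ** 2 / (2 * 0.01)) for p in grid]
post = [pr * asym_peak(p) for pr, p in zip(prior, grid)]
Z = sum(post)
post_mean = sum(p * w for p, w in zip(grid, post)) / Z
```

Because the likelihood's right tail is heavier, the posterior mean lands slightly later than the symmetric prior mean, an effect the Gaussian closed-form filter cannot express.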

Once we are satisfied with the PIPPET framework's utility in describing human behavior, we can use it to model and analyze experimental data.

Given a perceptual or behavioral task, we can suppose that motor or perceptual human entrainment behavior is optimally solving an inference problem, and determine the parameters of that problem by fitting them with appropriate methods. We can study the changes in these parameters over the course of an experiment; see, e.g., [13, 5]. However, since the framework is sufficiently general to model both, it could guide an exploration of parameter differences between the performance of similar tasks in periodic and aperiodic contexts. We can also let the PIPPET framework guide a search for the brain bases of this inference process.

Consider the generative model in which phase evolves according to (9) and generates observable point process events at rate λ(φ). The exact posterior p_t(φ) evolves according to

dp_t(φ) = L[p_t(φ)] dt + p_t(φ) (λ(φ)/Λ̄_t − 1)(dN_t − Λ̄_t dt),  where Λ̄_t = E_p[λ(φ)],

where dN_t is the increment in the event count over each dt time step (assumed to be either 1 or 0 with probability 1), E_p denotes expectation under distribution p_t(φ), and L is the Kolmogorov forward operator associated with (9). Here we project p onto a Gaussian distribution at each time step by matching moments µ and Σ, which is also the projection with minimal KL divergence. We can do this by finding the moments of dp, which are dµ and dΣ, and using these to drive the evolution of µ and Σ.
Let x²_A denote xᵀAx. For both PIPPET and PATIPPET, we can write

with scalar-valued Σ, and in PATIPPET we set

We will make use of the following result, a generalized form of a well-known result about quadratic forms (see [71] for proof and similar application):

In order to calculate the expectations in (11) and (12), we derive a simple expression for p(φ)λ(φ):

Applying (13),

where we define K_i := (P_i + Σ^{−1})^{−1}. For both PIPPET and PATIPPET, (14) can be written in terms of normal distributions:

We use this expression and the moments of normal distributions to calculate the following expectations and define Λ̄, μ̄, and Σ̄:

Substituting into (11) and (12), we have

Calculating the moments of L[p_t(φ)] for the PIPPET SDE (1), we derive the PIPPET filter:

which is equivalent to equation (3) with its accompanying reset rule at events.
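For reference, the PIPPET filter's moment equations can be summarized explicitly. The following is a sketch consistent with the reset rule at events described above (µ_{t+} = μ̄, Σ_{t+} = Σ̄), with Λ̄, μ̄, and Σ̄ written as λ-weighted expectations under p_t:

```latex
\begin{align*}
% Between events (dN_t = 0): the absence of expected events is itself informative.
d\mu_t    &= dt \;-\; \bar\Lambda_t\,(\bar\mu_t - \mu_t)\,dt,\\
d\Sigma_t &= \sigma^2\,dt \;-\; \bar\Lambda_t\bigl(\bar\Sigma_t + (\bar\mu_t - \mu_t)^2 - \Sigma_t\bigr)\,dt,\\
% At an event (dN_t = 1): reset to the event-conditional moments.
\mu_{t^+} &= \bar\mu_t, \qquad \Sigma_{t^+} = \bar\Sigma_t,\\
% where the event-conditional quantities are
\bar\Lambda_t &= \mathbb{E}_{p_t}[\lambda(\phi)], \qquad
\bar\mu_t = \frac{\mathbb{E}_{p_t}[\phi\,\lambda(\phi)]}{\bar\Lambda_t}, \qquad
\bar\Sigma_t = \frac{\mathbb{E}_{p_t}[(\phi-\bar\mu_t)^2\,\lambda(\phi)]}{\bar\Lambda_t}.
\end{align*}
```

The between-event drift terms are the first and second moments of the filtering equation with dN_t = 0; they are what slow the phase estimate near a strongly expected phase when no event arrives.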

Similarly, calculating the moments for the PATIPPET SDE (4), we derive the PATIPPET filter:

For multiple event streams j:

This follows directly from application of the derivation above to equation (5) in [72] with a discrete spatial dimension. By the methods above, it yields the multi-PIPPET filter:

and the multi-PATIPPET filter:

6.2 Simulation parameters.