(Updated Dec 14, 2022 with clarifications and a subscriber implementation note.)
The publish-subscribe pattern is well known, but in some circumstances, it suffers from two important problems:
When a subscriber is temporarily not present, or cannot be reached, sent events are often lost. This can happen, for example, if the subscriber computer reboots, falls off the network, goes to sleep, has DNS problems and the like. Once the subscriber recovers, it is generally not clear what needs to happen for the subscriber to catch up to the events it may have missed. It is not even clear whether it has missed any. Similarly, it is unclear for how long the publisher needs to retry to send a message; it may be that the subscriber has permanently gone away.
Subscriptions are often set up as part of the following pattern:
- A resource on the Web is accessed. For example, a user reads an article on a website, or a software agent fetches a document.
- Based on the content of the obtained resource, a decision is made to subscribe to updates to that resource. For example, the user may decide that they are interested in updates to the article on the website they just read.
- There is a time lag between the time the resource has been accessed, and when the subscription becomes active, creating a race condition during which update events may be missed.
While these two problems are not always significant, there are important circumstances in which they are, and this proposal addresses those circumstances.
Approach to the solution
We augment the publish-subscribe pattern in the following way:
All events, as well as the content of the resource whose changes are supposed to be tracked are time-stamped. Also, each event identifies the event that directly precedes it (that way, the subscriber can detect if it missed something). Alternatively, a monotonically increasing sequence number could be used.
The publisher stores the history of events emitted so far. For efficiency reasons, this may be shortened to some time window reaching to the present, as appropriate for the application; for example, all events in the last month. (Similar to how RSS/Atom feeds are commonly implemented.)
The publisher provides a query interface to the subscriber to that history, with a “since” time parameter, so the subscriber can obtain the sequence of events emitted since a certain time. (Actually, since “right after” the provided time not including the provided time itself.)
When subscribing, in addition to the callback address, the subscriber provides to the publisher:
- a time stamp, and
- a subscription id.
Further, the actual sending of an event from the publisher to the subscriber is considered to be a performance optimization, rather than core to the functionality. This means that if the event cannot be successfully conveyed (see requirements above), it is only an inconvenience and inefficiency rather than a cause of lost data.
About the race condition
The future subscriber accesses resource R and finds time stamp T0. For example, a human reads a web page whose publication date is April 23, 2021, 23:00:00 UTC.
After some time passes, the subscriber decides to subscribe. It does this with the well-known subscription pattern, but in addition to providing a callback address, it also provides time stamp T0 and a unique (can be random) subscription id. For example, a human’s hypothetical news syndication app may provide an event update endpoint to the news website, and time T0.
The publisher sets up the subscription, and immediately checks whether any events should have been sent between (after) T0 and the present. (It can do that because it stores the update history.) If so, it emits those events to the subscriber, in sequence, before continuing with regular operations. As a result, there is no more race condition between subscription and event.
When sending an event, the publisher also sends the subscription id.
About temporary unavailability of the subscriber
After a subscription is active, assume the subscriber disappears and new events cannot be delivered. The publisher may continue to attempt to deliver events for as long as it likes, or stop immediately.
When the subscriber re-appears, it finds the time of the last event it had received from the publisher, say time T5. It queries the event history published by the publisher with parameter T5 to find out what events it missed. It processes those events and then re-subscribes with a later starting time stamp corresponding to the last event it received (say T10). When it re-subscribes, it uses a different subscription id and cancels the old subscription.
After the subscriber has re-appeared, it ignores/rejects all incoming events with the old subscription id.
Subscriber implementation notes
The subscriber receives events exclusively through a single queue for incoming events. This makes implementing an incoming-event handler very simple, as it can simply process events in order.
The event queue maintains the timestamp of the last event it successfully added. When a new event arrives, the queue accepts this event but only if the new event is the direct follower of the last event it successfully added. If it is not, the incoming event is discarded. (This covers both repeatedly received events and when some events were missed.)
The subscriber also maintains a timer with a countdown from the last time an event was successfully added to the incoming queue. (The time constant of the timer is application-specific, and may be adaptive.) When the timeout occurs, the subscriber queries the publisher, providing the last successful timestamp. If no updates are being found, nothing happens. If updates are being found, it is fair to consider the existing subscription to have failed. Then:
- The subscriber itself inserts the obtained “missed” events into its own incoming event queue from where they are processed.
- The subscriber cancels the existing subscription.
- The subscriber creates a new subscription, with the timestamp of the most recent successfully-inserted event.
Publishers do not need to remember subscriber-specific state. (Thanks, Kafka, for showing us!) That makes it easy to implement the publisher side.
From the perspective of the publisher, delivery of events to subscribers that can receive callbacks, and those that need to poll, both works. (It sort of emulates RSS except that a starting time parameter is provided by the client, instead of a uniform window decided on by the publisher as in RSS)
Subscribers only need to keep a time stamp as state, something they probably have already anyway.
Subscribers can implement a polling or push strategy, or dynamically change between those, without the risk of losing data.
Publishers are not required to push out events at all. If they don’t, this protocol basically falls back to polling. This is inefficient but much better than the alternative and can also be used in places where, for example, firewalls prevent event pushing.
Would love your thoughts!