Browse Source

RFC 006: Event Subscription (#7184)

This is intended to document some ergonomic and reliability issues with the
existing implementation of the event subscription service on the Tendermint
node, and to discuss possible approaches to improving them.
pull/7248/head
M. J. Fromberger 3 years ago
committed by GitHub
parent
commit
432067cc0e
No known key found for this signature in database GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 205 additions and 0 deletions
  1. +1
    -0
      docs/rfc/README.md
  2. +204
    -0
      docs/rfc/rfc-006-event-subscription.md

+ 1
- 0
docs/rfc/README.md View File

@ -43,5 +43,6 @@ sections.
- [RFC-003: Performance Taxonomy](./rfc-003-performance-questions.md)
- [RFC-004: E2E Test Framework Enhancements](./rfc-004-e2e-framework.md)
- [RFC-005: Event System](./rfc-005-event-system.rst)
- [RFC-006: Event Subscription](./rfc-005-event-system.rst)
<!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->

+ 204
- 0
docs/rfc/rfc-006-event-subscription.md View File

@ -0,0 +1,204 @@
# RFC 006: Event Subscription
## Changelog
- 30-Oct-2021: Initial draft (@creachadair)
## Abstract
The Tendermint consensus node allows clients to subscribe to its event stream
via methods on its RPC service. The ability to view the event stream is
valuable for clients, but the current implementation has some deficiencies that
make it difficult for some clients to use effectively. This RFC documents these
issues and discusses possible approaches to solving them.
## Background
A running Tendermint consensus node exports a [JSON-RPC service][rpc-service]
that provides a [large set of methods][rpc-methods] for inspecting and
interacting with the node. One important cluster of these methods are the
`subscribe`, `unsubscribe`, and `unsubscribe_all` methods, which permit clients
to subscribe to a filtered stream of the [events generated by the node][events]
as it runs.
Unlike the other methods of the service, the methods in the "event
subscription" cluster are not accessible via [ordinary HTTP GET or POST
requests][rpc-transport], but require upgrading the HTTP connection to a
[websocket][ws]. This is necessary because the `subscribe` request needs a
persistent channel to deliver results back to the client, and an ordinary HTTP
connection does not reliably persist across multiple requests. Since these
methods do not work properly without a persistent channel, they are _only_
exported via a websocket connection, and are not routed for plain HTTP.
## Discussion
There are some operational problems with the current implementation of event
subscription in the RPC service:
- **Event delivery is not valid JSON-RPC.** When a client issues a `subscribe`
request, the server replies (correctly) with an initial empty acknowledgement
(`{}`). After that, each matching event is delivered "unsolicited" (without
another request from the client), as a separate [response object][json-response]
with the same ID as the initial request.
This matters because it means a standard JSON-RPC client library can't
interact correctly with the event subscription mechanism.
Even for clients that can handle unsolicited values pushed by the server,
these responses are invalid: They have an ID, so they cannot be treated as
[notifications][json-notify]; but the ID corresponds to a request that was
already completed. In practice, this means that general-purpose JSON-RPC
libraries cannot use this method correctly -- it requires a custom client.
The Go RPC client from the Tendermint core can support this case, but clients
in other languages have no easy solution.
This is the cause of issue [#2949][issue2949].
- **Subscriptions are terminated by disconnection.** When the connection to the
client is interrupted, the subscription is silently dropped.
This is a reasonable behavior, but it matters because a client whose
subscription is dropped gets no useful error feedback, just a closed
connection. Should they try again? Is the node overloaded? Was the client
too slow? Did the caller forget to respond to pings? Debugging these kinds
of failures is unnecessarily painful.
Websockets compound this, because websocket connections time out if no
traffic is seen for a while, and keeping them alive requires active
cooperation between the client and server. With a plain TCP socket, liveness
is handled transparently by the keepalive mechanism. On a websocket,
however, one side has to occasionally send a PING (if the connection is
otherwise idle). The other side must return a matching PONG in time, or the
connection is dropped. Apart from being tedious, this is highly susceptible
to CPU load.
The Tendermint Go implementation automatically sends and responds to pings.
Clients in other languages (or not wanting to use the Tendermint libraries)
need to handle it explicitly. This burdens the client for no practical
benefit: A subscriber has no information about when matching events may be
available, so it shouldn't have to participate in keeping the connection
alive.
- **Mismatched load profiles.** Most of the RPC service is mainly important for
low-volume local use, either by the application the node serves (e.g., the
ABCI methods) or by the node operator (e.g., the info methods). Event
subscription is important for remote clients, and may represent a much higher
volume of traffic.
This matters because both are using the same JSON-RPC mechanism. For
low-volume local use, the ergonomics of JSON-RPC are a good fit: It's easy to
issue queries from the command line (e.g., using `curl`) or to write scripts
that call the RPC methods to monitor the running node.
For high-volume remote use, JSON-RPC is not such a good fit: Even leaving
aside the non-standard delivery protocol mentioned above, the time and memory
cost of encoding event data matters for the stability of the node when there
can be potentially hundreds of subscribers. Moreover, a subscription is
long-lived compared to most RPC methods, in that it may persist as long the
node is active.
- **Mismatched security profiles.** The RPC service exports several methods
that should not be open to arbitrary remote callers, both for correctness
reasons (e.g., `remove_tx` and `broadcast_tx_*`) and for operational
stability reasons (e.g., `tx_search`). A node may still need to expose
events, however, to support UI tools.
This matters, because all the methods share the same network endpoint. While
it is possible to block the top-level GET and POST handlers with a proxy,
exposing the `/websocket` handler exposes not _only_ the event subscription
methods, but the rest of the service as well.
### Possible Improvements
There are several things we could do to improve the experience of developers
who need to subscribe to events from the consensus node. These are not all
mutually exclusive.
1. **Split event subscription into a separate service**. Instead of exposing
event subscription on the same endpoint as the rest of the RPC service,
dedicate a separate endpoint on the node for _only_ event subscription. The
rest of the RPC services (_sans_ events) would remain as-is.
This would make it easy to disable or firewall outside access to sensitive
RPC methods, without blocking access to event subscription (and vice versa).
This is probably worth doing, even if we don't take any of the other steps
described here.
2. **Use a different protocol for event subscription.** There are various ways
we could approach this, depending how much we're willing to shake up the
current API. Here are sketches of a few options:
- Keep the websocket, but rework the API to be more JSON-RPC compliant,
perhaps by converting event delivery into notifications. This is less
up-front change for existing clients, but retains all of the existing
implementation complexity, and doesn't contribute much toward more serious
performance and UX improvements later.
- Switch from websocket to plain HTTP, and rework the subscription API to
use a more conventional request/response pattern instead of streaming.
This is a little more up-front work for existing clients, but leverages
better library support for clients not written in Go.
The protocol would become more chatty, but we could mitigate that with
batching, and in return we would get more control over what to do about
slow clients: Instead of simply silently dropping them, as we do now, we
could drop messages and signal the client that they missed some data ("M
dropped messages since your last poll").
This option is probably the best balance between work, API change, and
benefit, and has a nice incidental effect that it would be easier to debug
subscriptions from the command-line, like the other RPC methods.
- Switch to gRPC: Preserves a persistent connection and gives us a more
efficient binary wire format (protobuf), at the cost of much more work for
clients and harder debugging. This may be the best option if performance
and server load are our top concerns.
Given that we are currently using JSON-RPC, however, I'm not convinced the
costs of encoding and sending messages on the event subscription channel
are the limiting factor on subscription efficiency, however.
3. **Delegate event subscriptions to a proxy.** Give responsibility for
managing event subscription to a proxy that runs separately from the node,
and switch the node to push events to the proxy (like a webhook) instead of
serving subscribers directly. This is more work for the operator (another
process to configure and run) but may scale better for big networks.
I mention this option for completeness, but making this change would be a
fairly substantial project. If we want to consider shifting responsibility
for event subscription outside the node anyway, we should probably be more
systematic about it. For a more principled approach, see point (4) below.
4. **Move event subscription downstream of indexing.** We are already planning
to give applications more control over event indexing. By extension, we
might allow the application to also control how events are filtered,
queried, and subscribed. Having the application control these concerns,
rather than the node, might make life easier for developers building UI and
tools for that application.
This is a much larger change, so I don't think it is likely to be practical
in the near-term, but it's worth considering as a broader option. Some of
the existing code for filtering and selection could be made more reusable,
so applications would not need to reinvent everything.
## References
- [Tendermint RPC service][rpc-service]
- [Tendermint RPC routes][rpc-methods]
- [Discussion of the event system][events]
- [Discussion about RPC transport options][rpc-transport] (from RFC 002)
- [RFC 6455: The websocket protocol][ws]
- [JSON-RPC 2.0 Specification](https://www.jsonrpc.org/specification)
[rpc-service]: https://docs.tendermint.com/master/rpc/
[rpc-methods]: https://github.com/tendermint/tendermint/blob/master/internal/rpc/core/routes.go#L12
[events]: ./rfc-005-event-system.rst
[rpc-transport]: ./rfc-002-ipc-ecosystem.md#rpc-transport
[ws]: https://datatracker.ietf.org/doc/html/rfc6455
[json-response]: https://www.jsonrpc.org/specification#response_object
[json-notify]: https://www.jsonrpc.org/specification#notification
[issue2949]: https://github.com/tendermint/tendermint/issues/2949

Loading…
Cancel
Save