# RFC 006: Event Subscription

## Changelog

- 30-Oct-2021: Initial draft (@creachadair)

## Abstract

The Tendermint consensus node allows clients to subscribe to its event stream
via methods on its RPC service. The ability to view the event stream is
valuable for clients, but the current implementation has some deficiencies that
make it difficult for some clients to use effectively. This RFC documents these
issues and discusses possible approaches to solving them.
## Background

A running Tendermint consensus node exports a [JSON-RPC service][rpc-service]
that provides a [large set of methods][rpc-methods] for inspecting and
interacting with the node. One important cluster of these methods is the
`subscribe`, `unsubscribe`, and `unsubscribe_all` methods, which permit clients
to subscribe to a filtered stream of the [events generated by the node][events]
as it runs.
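For illustration, the three methods can be invoked with ordinary JSON-RPC 2.0 request objects. The Python sketch below builds them. It assumes that `subscribe` and `unsubscribe` each take a single `query` parameter and that `unsubscribe_all` takes none; check the [route table][rpc-methods] for the exact parameters. The query string `tm.event='NewBlock'` is only an example filter.

```python
import itertools
import json

# Client-chosen request IDs; JSON-RPC only requires that they be unique per client.
_next_id = itertools.count(1)


def make_request(method, params=None):
    """Build a JSON-RPC 2.0 request object, serialized as a JSON string."""
    req = {"jsonrpc": "2.0", "id": next(_next_id), "method": method}
    if params is not None:
        req["params"] = params
    return json.dumps(req)


# Assumed parameter shapes: "query" for subscribe/unsubscribe, none for unsubscribe_all.
QUERY = "tm.event='NewBlock'"  # example filter, chosen for illustration
subscribe_msg = make_request("subscribe", {"query": QUERY})
unsubscribe_msg = make_request("unsubscribe", {"query": QUERY})
unsubscribe_all_msg = make_request("unsubscribe_all")

for msg in (subscribe_msg, unsubscribe_msg, unsubscribe_all_msg):
    print(msg)
```

These requests look like any other RPC call. What differs is the transport they must travel over.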
Unlike the other methods of the service, the methods in the "event
subscription" cluster are not accessible via [ordinary HTTP GET or POST
requests][rpc-transport], but require upgrading the HTTP connection to a
[websocket][ws]. This is necessary because the `subscribe` request needs a
persistent channel to deliver results back to the client, and an ordinary HTTP
connection does not reliably persist across multiple requests. Since these
methods do not work properly without a persistent channel, they are _only_
exported via a websocket connection, and are not routed for plain HTTP.
## Discussion

There are some operational problems with the current implementation of event
subscription in the RPC service:

- **Event delivery is not valid JSON-RPC.** When a client issues a `subscribe`
  request, the server replies (correctly) with an initial empty acknowledgement
  (`{}`). After that, each matching event is delivered "unsolicited" (without
  another request from the client), as a separate [response object][json-response]
  with the same ID as the initial request.

  This matters because it means a standard JSON-RPC client library can't
  interact correctly with the event subscription mechanism.

  Even for clients that can handle unsolicited values pushed by the server,
  these responses are invalid. They have an ID, so they cannot be treated as
  [notifications][json-notify], but the ID corresponds to a request that was
  already completed. In practice, general-purpose JSON-RPC libraries cannot use
  this method correctly, and a custom client is required. The Go RPC client
  from the Tendermint core supports this case, but clients in other languages
  have no easy solution.

  This is the cause of issue [#2949][issue2949].
- **Subscriptions are terminated by disconnection.** When the connection to the
  client is interrupted, the subscription is silently dropped.

  This is reasonable behavior, but it matters because a client whose
  subscription is dropped gets no useful error feedback, just a closed
  connection. Should it try again? Is the node overloaded? Was the client
  too slow? Did the caller forget to respond to pings? Debugging these kinds
  of failures is unnecessarily painful.

  Websockets compound this, because websocket connections time out if no
  traffic is seen for a while, and keeping them alive requires active
  cooperation between the client and server. With a plain TCP socket, liveness
  is handled transparently by the keepalive mechanism. On a websocket,
  however, one side has to send a PING occasionally (if the connection is
  otherwise idle). The other side must return a matching PONG in time, or the
  connection is dropped. Apart from being tedious, this is highly sensitive to
  CPU load: a heavily loaded client or server may miss the deadline, and an
  otherwise healthy connection is then closed.

  The Tendermint Go implementation automatically sends and responds to pings.
  Clients in other languages (or clients that do not use the Tendermint
  libraries) need to handle it explicitly. This burdens the client for no
  practical benefit: a subscriber has no information about when matching events
  may be available, so it shouldn't have to participate in keeping the
  connection alive.
- **Mismatched load profiles.** Most of the RPC service matters mainly for
  low-volume local use, either by the application the node serves (e.g., the
  ABCI methods) or by the node operator (e.g., the info methods). Event
  subscription is important for remote clients, and may represent a much higher
  volume of traffic.

  This matters because both are using the same JSON-RPC mechanism. For
  low-volume local use, the ergonomics of JSON-RPC are a good fit: it's easy to
  issue queries from the command line (e.g., using `curl`) or to write scripts
  that call the RPC methods to monitor the running node.

  For high-volume remote use, JSON-RPC is not such a good fit. Even leaving
  aside the non-standard delivery protocol mentioned above, the time and memory
  cost of encoding event data matters for the stability of the node when there
  may be hundreds of subscribers. Moreover, a subscription is long-lived
  compared to most RPC methods, in that it may persist as long as the node is
  active.
- **Mismatched security profiles.** The RPC service exports several methods
  that should not be open to arbitrary remote callers, both for correctness
  reasons (e.g., `remove_tx` and `broadcast_tx_*`) and for operational
  stability reasons (e.g., `tx_search`). A node may still need to expose
  events, however, to support UI tools.

  This matters because all the methods share the same network endpoint. While
  it is possible to block the top-level GET and POST handlers with a proxy,
  exposing the `/websocket` handler exposes not _only_ the event subscription
  methods, but the rest of the service as well.
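The first problem can be made concrete. The sketch below classifies incoming messages the way a strictly spec-following JSON-RPC 2.0 client might: it tracks outstanding request IDs and rejects any response whose ID is no longer pending. Several details are hypothetical simplifications, not the node's actual formats:

- `StrictClient` is an illustrative name.
- The event payloads are placeholders.
- The notification method name `event` is invented for the example.

```python
import json


class StrictClient:
    """Minimal JSON-RPC 2.0 client-side bookkeeping (illustrative only)."""

    def __init__(self):
        self.pending = set()  # IDs of requests still awaiting a response

    def send(self, req_id, method, params):
        self.pending.add(req_id)
        return json.dumps({"jsonrpc": "2.0", "id": req_id,
                           "method": method, "params": params})

    def receive(self, raw):
        msg = json.loads(raw)
        if "method" in msg and "id" not in msg:
            return "notification"  # server push without an ID: allowed
        if "result" in msg or "error" in msg:
            if msg.get("id") in self.pending:
                self.pending.discard(msg["id"])
                return "response"  # completes an outstanding request
            return "invalid: no pending request with this ID"
        return "invalid: unrecognized message"


client = StrictClient()
client.send(1, "subscribe", {"query": "tm.event='NewBlock'"})

# Current behaviour: an empty acknowledgement, then each event reuses the same ID.
ack = '{"jsonrpc": "2.0", "id": 1, "result": {}}'
event = '{"jsonrpc": "2.0", "id": 1, "result": {"data": "..."}}'
# Hypothetical alternative: the event delivered as a notification (no ID).
note = '{"jsonrpc": "2.0", "method": "event", "params": {"data": "..."}}'

for raw in (ack, event, note):
    print(client.receive(raw))
```

The acknowledgement completes request 1, so every later event carrying `"id": 1` falls into the invalid branch. A notification, by contrast, is a message shape that a general-purpose library can route to a handler.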
### Possible Improvements

There are several things we could do to improve the experience of developers
who need to subscribe to events from the consensus node. These are not all
mutually exclusive.

1. **Split event subscription into a separate service.** Instead of exposing
   event subscription on the same endpoint as the rest of the RPC service,
   dedicate a separate endpoint on the node for _only_ event subscription. The
   rest of the RPC services (_sans_ events) would remain as-is.

   This would make it easy to disable or firewall outside access to sensitive
   RPC methods, without blocking access to event subscription (and vice versa).
   This is probably worth doing, even if we don't take any of the other steps
   described here.
2. **Use a different protocol for event subscription.** There are various ways
   we could approach this, depending on how much we're willing to shake up the
   current API. Here are sketches of a few options:

   - Keep the websocket, but rework the API to be more JSON-RPC compliant,
     perhaps by converting event delivery into notifications. This requires
     less up-front change for existing clients, but retains all of the existing
     implementation complexity, and doesn't contribute much toward more serious
     performance and UX improvements later.

   - Switch from websocket to plain HTTP, and rework the subscription API to
     use a more conventional request/response pattern instead of streaming.
     This is a little more up-front work for existing clients, but leverages
     better library support for clients not written in Go.

     The protocol would become more chatty, but we could mitigate that with
     batching. In return we would get more control over what to do about
     slow clients: instead of silently dropping them, as we do now, we could
     drop messages and signal the client that it missed some data ("M dropped
     messages since your last poll").

     This option is probably the best balance between work, API change, and
     benefit. It also has a nice incidental effect: it would be easier to debug
     subscriptions from the command line, like the other RPC methods.

   - Switch to gRPC. This preserves a persistent connection and gives us a more
     efficient binary wire format (protobuf), at the cost of much more work for
     clients and harder debugging. This may be the best option if performance
     and server load are our top concerns.

     Even though we currently use JSON-RPC, however, I'm not convinced that the
     cost of encoding and sending messages on the event subscription channel
     is the limiting factor on subscription efficiency.
3. **Delegate event subscriptions to a proxy.** Give responsibility for
   managing event subscription to a proxy that runs separately from the node,
   and switch the node to push events to the proxy (like a webhook) instead of
   serving subscribers directly. This is more work for the operator (another
   process to configure and run) but may scale better for big networks.

   I mention this option for completeness, but making this change would be a
   fairly substantial project. If we want to consider shifting responsibility
   for event subscription outside the node anyway, we should probably be more
   systematic about it. For a more principled approach, see point (4) below.

4. **Move event subscription downstream of indexing.** We are already planning
   to give applications more control over event indexing. By extension, we
   might allow the application to also control how events are filtered,
   queried, and subscribed. Having the application control these concerns,
   rather than the node, might make life easier for developers building UI and
   tools for that application.

   This is a much larger change, so I don't think it is likely to be practical
   in the near term, but it's worth considering as a broader option. Some of
   the existing code for filtering and selection could be made more reusable,
   so applications would not need to reinvent everything.
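The polling variant of option (2) can be sketched as a bounded per-subscriber queue. The node appends matching events, discards the oldest when the queue is full, and reports the number discarded on the next poll. This is one possible design under stated assumptions, not a proposed API: the class name `PollBuffer`, the `capacity` and `max_items` parameters, and the reply shape are all hypothetical.

```python
from collections import deque


class PollBuffer:
    """Bounded per-subscriber event queue with drop accounting (hypothetical)."""

    def __init__(self, capacity):
        self.events = deque()
        self.capacity = capacity
        self.dropped = 0  # events discarded since the last poll

    def publish(self, event):
        # Instead of disconnecting a slow subscriber, drop its oldest event.
        if len(self.events) == self.capacity:
            self.events.popleft()
            self.dropped += 1
        self.events.append(event)

    def poll(self, max_items):
        # Return a batch of up to max_items events plus the count of missed events.
        batch = [self.events.popleft()
                 for _ in range(min(max_items, len(self.events)))]
        missed, self.dropped = self.dropped, 0
        return {"events": batch, "dropped": missed}


buf = PollBuffer(capacity=3)
for height in range(1, 6):  # five events arrive between polls
    buf.publish({"height": height})
print(buf.poll(max_items=10))
# → {'events': [{'height': 3}, {'height': 4}, {'height': 5}], 'dropped': 2}
```

Because `poll` is an ordinary request/response call, it could be exercised from the command line like the other RPC methods. The design trade-off is that memory per subscriber is bounded by `capacity`, at the cost of lossy delivery for clients that fall behind.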
## References

- [Tendermint RPC service][rpc-service]
- [Tendermint RPC routes][rpc-methods]
- [Discussion of the event system][events]
- [Discussion about RPC transport options][rpc-transport] (from RFC 002)
- [RFC 6455: The websocket protocol][ws]
- [JSON-RPC 2.0 Specification](https://www.jsonrpc.org/specification)

[rpc-service]: https://docs.tendermint.com/master/rpc/
[rpc-methods]: https://github.com/tendermint/tendermint/blob/master/internal/rpc/core/routes.go#L12
[events]: ./rfc-005-event-system.rst
[rpc-transport]: ./rfc-002-ipc-ecosystem.md#rpc-transport
[ws]: https://datatracker.ietf.org/doc/html/rfc6455
[json-response]: https://www.jsonrpc.org/specification#response_object
[json-notify]: https://www.jsonrpc.org/specification#notification
[issue2949]: https://github.com/tendermint/tendermint/issues/2949