You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

100 lines
4.6 KiB

  1. From afbf56b951967e8fa4d509e423fdcb11c27d40e2 Mon Sep 17 00:00:00 2001
  2. From: Willy Tarreau <w@1wt.eu>
  3. Date: Tue, 14 Mar 2017 20:19:29 +0100
  4. Subject: [PATCH 7/7] BUG/MAJOR: connection: update CO_FL_CONNECTED before
  5. calling the data layer
  6. Matthias Fechner reported a regression in 1.7.3 brought by the backport
  7. of commit 819efbf ("BUG/MEDIUM: tcp: don't poll for write when connect()
  8. succeeds"), causing some connections to fail to establish once in a while.
  9. While this commit itself was a fix for a bad sequencing of connection
  10. events, it in fact unveiled a much deeper bug going back to the connection
  11. rework era in v1.5-dev12 : 8f8c92f ("MAJOR: connection: add a new
  12. CO_FL_CONNECTED flag").
  13. It's worth noting that in a lab reproducing a similar environment as
  14. Matthias' about only 1 every 19000 connections exhibit this behaviour,
  15. making the issue not so easy to observe. A trick to make the problem
  16. more observable consists in disabling non-blocking mode on the socket
  17. before calling connect() and re-enabling it later, so that connect()
  18. always succeeds. Then it becomes 100% reproducible.
  19. The problem is that this CO_FL_CONNECTED flag is tested after deciding to
  20. call the data layer (typically the stream interface but might be a health
  21. check as well), and that the decision to call the data layer relies on a
  22. change of one of the flags covered by the CO_FL_CONN_STATE set, which is
  23. made of CO_FL_CONNECTED among others.
  24. Before the fix above, this bug couldn't appear with TCP but it could
  25. appear with Unix sockets. Indeed, connect() was always considered
  26. blocking so the CO_FL_WAIT_L4_CONN connection flag was always set, and
  27. polling for write events was always enabled. This used to guarantee that
  28. the conn_fd_handler() could detect a change among the CO_FL_CONN_STATE
  29. flags.
  30. Now with the fix above, if a connect() immediately succeeds for non-ssl
  31. connection with send-proxy enabled, and no data in the buffer (thus TCP
  32. mode only), the CO_FL_WAIT_L4_CONN flag is not set, the lack of data in
  33. the buffer doesn't enable polling flags for the data layer, the
  34. CO_FL_CONNECTED flag is not set due to send-proxy still being pending,
  35. and once send-proxy is done, its completion doesn't cause the data layer
  36. to be woken up due to the fact that CO_FL_CONNECT is still not present
  37. and that the CO_FL_SEND_PROXY flag is not watched in CO_FL_CONN_STATE.
  38. Then no progress is made when data are received from the client (and
  39. attempted to be forwarded), because a CF_WRITE_NULL (or CF_WRITE_PARTIAL)
  40. flag is needed for the stream-interface state to turn from SI_ST_CON to
  41. SI_ST_EST, allowing ->chk_snd() to be called when new data arrive. And
  42. the only way to set this flag is to call the data layer of course.
  43. After the connect timeout, the connection gets killed and if in the mean
  44. time some data have accumulated in the buffer, the retry will succeed.
  45. This patch fixes this situation by simply placing the update of
  46. CO_FL_CONNECTED where it should have been, before the check for a flag
  47. change needed to wake up the data layer and not after.
  48. This fix must be backported to 1.7, 1.6 and 1.5. Versions not having
  49. the patch above are still affected for unix sockets.
  50. Special thanks to Matthias Fechner who provided a very detailed bug
  51. report with a bisection designating the faulty patch, and to Olivier
  52. Houchard for providing full access to a pretty similar environment where
  53. the issue could first be reproduced.
  54. (cherry picked from commit 7bf3fa3c23f6a1b7ed1212783507ac50f7e27544)
  55. ---
  56. src/connection.c | 11 +++++++----
  57. 1 file changed, 7 insertions(+), 4 deletions(-)
  58. diff --git a/src/connection.c b/src/connection.c
  59. index 26fc5f6..1e4c9aa 100644
  60. --- a/src/connection.c
  61. +++ b/src/connection.c
  62. @@ -131,6 +131,13 @@ void conn_fd_handler(int fd)
  63. }
  64. leave:
  65. + /* Verify if the connection just established. The CO_FL_CONNECTED flag
  66. + * being included in CO_FL_CONN_STATE, its change will be noticed by
  67. + * the next block and be used to wake up the data layer.
  68. + */
  69. + if (unlikely(!(conn->flags & (CO_FL_WAIT_L4_CONN | CO_FL_WAIT_L6_CONN | CO_FL_CONNECTED))))
  70. + conn->flags |= CO_FL_CONNECTED;
  71. +
  72. /* The wake callback may be used to process a critical error and abort the
  73. * connection. If so, we don't want to go further as the connection will
  74. * have been released and the FD destroyed.
  75. @@ -140,10 +147,6 @@ void conn_fd_handler(int fd)
  76. conn->data->wake(conn) < 0)
  77. return;
  78. - /* Last check, verify if the connection just established */
  79. - if (unlikely(!(conn->flags & (CO_FL_WAIT_L4_CONN | CO_FL_WAIT_L6_CONN | CO_FL_CONNECTED))))
  80. - conn->flags |= CO_FL_CONNECTED;
  81. -
  82. /* remove the events before leaving */
  83. fdtab[fd].ev &= FD_POLL_STICKY;
  84. --
  85. 2.10.2