From afbf56b951967e8fa4d509e423fdcb11c27d40e2 Mon Sep 17 00:00:00 2001 From: Willy Tarreau Date: Tue, 14 Mar 2017 20:19:29 +0100 Subject: [PATCH 7/7] BUG/MAJOR: connection: update CO_FL_CONNECTED before calling the data layer Matthias Fechner reported a regression in 1.7.3 brought by the backport of commit 819efbf ("BUG/MEDIUM: tcp: don't poll for write when connect() succeeds"), causing some connections to fail to establish once in a while. While this commit itself was a fix for a bad sequencing of connection events, it in fact unveiled a much deeper bug going back to the connection rework era in v1.5-dev12 : 8f8c92f ("MAJOR: connection: add a new CO_FL_CONNECTED flag"). It's worth noting that in a lab reproducing a similar environment as Matthias' about only 1 every 19000 connections exhibit this behaviour, making the issue not so easy to observe. A trick to make the problem more observable consists in disabling non-blocking mode on the socket before calling connect() and re-enabling it later, so that connect() always succeeds. Then it becomes 100% reproducible. The problem is that this CO_FL_CONNECTED flag is tested after deciding to call the data layer (typically the stream interface but might be a health check as well), and that the decision to call the data layer relies on a change of one of the flags covered by the CO_FL_CONN_STATE set, which is made of CO_FL_CONNECTED among others. Before the fix above, this bug couldn't appear with TCP but it could appear with Unix sockets. Indeed, connect() was always considered blocking so the CO_FL_WAIT_L4_CONN connection flag was always set, and polling for write events was always enabled. This used to guarantee that the conn_fd_handler() could detect a change among the CO_FL_CONN_STATE flags. Now with the fix above, if a connect() immediately succeeds for non-ssl connection with send-proxy enabled, and no data in the buffer (thus TCP mode only), the CO_FL_WAIT_L4_CONN flag is not set, the lack of data in the buffer doesn't enable polling flags for the data layer, the CO_FL_CONNECTED flag is not set due to send-proxy still being pending, and once send-proxy is done, its completion doesn't cause the data layer to be woken up due to the fact that CO_FL_CONNECT is still not present and that the CO_FL_SEND_PROXY flag is not watched in CO_FL_CONN_STATE. Then no progress is made when data are received from the client (and attempted to be forwarded), because a CF_WRITE_NULL (or CF_WRITE_PARTIAL) flag is needed for the stream-interface state to turn from SI_ST_CON to SI_ST_EST, allowing ->chk_snd() to be called when new data arrive. And the only way to set this flag is to call the data layer of course. After the connect timeout, the connection gets killed and if in the mean time some data have accumulated in the buffer, the retry will succeed. This patch fixes this situation by simply placing the update of CO_FL_CONNECTED where it should have been, before the check for a flag change needed to wake up the data layer and not after. This fix must be backported to 1.7, 1.6 and 1.5. Versions not having the patch above are still affected for unix sockets. Special thanks to Matthias Fechner who provided a very detailed bug report with a bisection designating the faulty patch, and to Olivier Houchard for providing full access to a pretty similar environment where the issue could first be reproduced. (cherry picked from commit 7bf3fa3c23f6a1b7ed1212783507ac50f7e27544) --- src/connection.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/src/connection.c b/src/connection.c index 26fc5f6..1e4c9aa 100644 --- a/src/connection.c +++ b/src/connection.c @@ -131,6 +131,13 @@ void conn_fd_handler(int fd) } leave: + /* Verify if the connection just established. The CO_FL_CONNECTED flag + * being included in CO_FL_CONN_STATE, its change will be noticed by + * the next block and be used to wake up the data layer. + */ + if (unlikely(!(conn->flags & (CO_FL_WAIT_L4_CONN | CO_FL_WAIT_L6_CONN | CO_FL_CONNECTED)))) + conn->flags |= CO_FL_CONNECTED; + /* The wake callback may be used to process a critical error and abort the * connection. If so, we don't want to go further as the connection will * have been released and the FD destroyed. @@ -140,10 +147,6 @@ void conn_fd_handler(int fd) conn->data->wake(conn) < 0) return; - /* Last check, verify if the connection just established */ - if (unlikely(!(conn->flags & (CO_FL_WAIT_L4_CONN | CO_FL_WAIT_L6_CONN | CO_FL_CONNECTED)))) - conn->flags |= CO_FL_CONNECTED; - /* remove the events before leaving */ fdtab[fd].ev &= FD_POLL_STICKY; -- 2.10.2