Eliminate send timeout race between sender and listener threads.
Previously, a window could occur where:

1) The sender delivers a HOLD[fd, seq_id] message to the listener.
2) The sender successfully sends its payload, but due to a stressed network, overloaded simulator, etc., the payload takes a while to finish sending.
3) The listener times out the HOLD message, which had a timeout of the normal message timeout + 1 (intended to close the window caused by one thread crossing a seconds boundary before the other, but insufficient).
4) The listener receives the EXPECT message from the sender, but doesn't have a corresponding HOLD message to pair it with, because the HOLD has just timed out.

In this case, the sender would not trigger normal timeout error handling, because it still successfully passed the handler callback to the listener (albeit just barely). The listener, however, is missing its HOLD message (which may have already saved a response), so it keeps waiting for a response that never arrives. This code path did not appear to trigger the timeout properly, leading to a possible deadlock.

Now, if the HOLD message is not found, the listener constructs an EXPECT event message, but immediately times it out with a new error case: BUS_SEND_RX_TIMEOUT_EXPECT. (This is a recoverable timeout error.)

Also, eliminate a secondary error handling path for retries of messages between the sender and listener. It has probably never been triggered, but it adds ambiguity to the error handling; the normal code path for timeouts is sufficient.

Rename KineticController_HandleExpectedResponse to KineticController_HandleResult (since it also handles error codes) and KineticController_HandleUnexecpectedResponse to KineticController_HandleUnexpectedResponse (fix typo), as part of clarifying the overall error handling dataflow.
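
The new listener behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual listener code: apart from BUS_SEND_RX_TIMEOUT_EXPECT, every name here (rx_slot_t, listener_t, listener_handle_expect, STATUS_PENDING, the slot table layout) is a hypothetical stand-in, and the real implementation may track HOLD and EXPECT state quite differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Only BUS_SEND_RX_TIMEOUT_EXPECT is named in this commit message;
 * STATUS_PENDING and the numeric values are assumptions for illustration. */
typedef enum {
    STATUS_PENDING = 0,
    BUS_SEND_RX_TIMEOUT_EXPECT = -1,
} send_status_t;

/* Hypothetical slot states; SLOT_EMPTY is 0 so a zeroed table is empty. */
typedef enum { SLOT_EMPTY = 0, SLOT_HOLD, SLOT_EXPECT } slot_state_t;

typedef struct {
    slot_state_t state;
    int fd;
    int64_t seq_id;
    int timeout_sec;       /* remaining seconds; 0 means expire on the next tick */
    send_status_t status;  /* result reported through the normal timeout path */
} rx_slot_t;

#define MAX_SLOTS 8
typedef struct { rx_slot_t slots[MAX_SLOTS]; } listener_t;

/* Find the HOLD previously registered for this <fd, seq_id>, if it still exists. */
static rx_slot_t *find_hold(listener_t *l, int fd, int64_t seq_id) {
    for (size_t i = 0; i < MAX_SLOTS; i++) {
        rx_slot_t *s = &l->slots[i];
        if (s->state == SLOT_HOLD && s->fd == fd && s->seq_id == seq_id) {
            return s;
        }
    }
    return NULL;
}

static rx_slot_t *find_free(listener_t *l) {
    for (size_t i = 0; i < MAX_SLOTS; i++) {
        if (l->slots[i].state == SLOT_EMPTY) { return &l->slots[i]; }
    }
    return NULL;
}

/* Handle an EXPECT from the sender. Returns the slot now tracking it,
 * or NULL if the (hypothetical) slot table is full. */
rx_slot_t *listener_handle_expect(listener_t *l, int fd, int64_t seq_id,
                                  int timeout_sec) {
    rx_slot_t *slot = find_hold(l, fd, seq_id);
    if (slot != NULL) {
        /* Normal case: upgrade the matching HOLD to an EXPECT. */
        slot->state = SLOT_EXPECT;
        slot->timeout_sec = timeout_sec;
        slot->status = STATUS_PENDING;
        return slot;
    }

    /* The HOLD already timed out. Rather than waiting for a response that
     * can no longer be paired, build an EXPECT that expires immediately so
     * the normal timeout path reports a recoverable error. */
    slot = find_free(l);
    if (slot == NULL) { return NULL; }
    slot->state = SLOT_EXPECT;
    slot->fd = fd;
    slot->seq_id = seq_id;
    slot->timeout_sec = 0;
    slot->status = BUS_SEND_RX_TIMEOUT_EXPECT;
    return slot;
}
```

The point of routing the missing-HOLD case through an already-expired EXPECT is that it reuses the single, normal timeout path instead of adding a separate error path.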