This issue is related to #3107
This is a first renaming/refactoring step before reworking and removing heartbeats.
As discussed with @Liamsi , we preferred to go for a couple of independent and separate PRs to simplify review work.
The changes:
Help to clarify the relation between the validator and remote signer endpoints
Differentiate between timeouts and deadlines
Prepare to encapsulate networking related code behind RemoteSigner in the next PR
My intention is to separate and encapsulate the "network related" code from the actual signer.
SignerRemote ---(uses/contains)--> SignerValidatorEndpoint <--(connects to)--> SignerServiceEndpoint ---> SignerService (future.. not here yet but would like to decouple too)
All reconnection/heartbeat/whatever code goes in the endpoints. Signer[Remote/Service] do not need to know about that.
I agree Endpoint may not be the perfect name. I tried to find something "Go-ish" enough. It is a common name in go-kit, kubernetes, etc.
Right now:
SignerValidatorEndpoint:
handles the listener
contains SignerRemote
Implements the PrivValidator interface
connects and sets a connection object in a contained SignerRemote
delegates PrivValidator some calls to SignerRemote which in turn uses the conn object that was set externally
SignerRemote:
Implements the PrivValidator interface
read/writes from a connection object directly
handles heartbeats
SignerServiceEndpoint:
Does most things in a single place
delegates to a PrivValidator IIRC.
* cleanup
* Refactoring step 1
* Refactoring step 2
* move messages to another file
* mark for future work / next steps
* mark deprecated classes in docs
* Fix linter problems
* additional linter fixes
* node: decrease retry conn timeout in test
Should fix#3256
The retry timeout was set to the default, which is the same as the
accept timeout, so it's no wonder this would fail. Here we decrease the
retry timeout so we can try many times before the accept timeout.
* p2p: increase handshake timeout in test
This fails sometimes, presumably because the handshake timeout is so low
(only 50ms). So increase it to 1s. Should fix#3187
* privval: fix race with ping. closes#3237
Pings happen in a go-routine and can happen concurrently with other
messages. Since we use a request/response protocol, we expect to send a
request and get back the corresponding response. But with pings
happening concurrently, this assumption could be violated. We were using
a mutex, but only a RWMutex, where the RLock was being held for sending
messages - this was to allow the underlying connection to be replaced if
it fails. Turns out we actually need to use a full lock (not just a read
lock) to prevent multiple requests from happening concurrently.
* node: fix test name. DelayedStop -> DelayedStart
* autofile: Wait() method
In the TestWALTruncate in consensus/wal_test.go we remove the WAL
directory at the end of the test. However the wal.Stop() does not
properly wait for the autofile group to finish shutting down. Hence it
was possible that the group's go-routine is still running when the
cleanup happens, which causes a panic since the directory disappeared.
Here we add a Wait() method to properly wait until the go-routine exits
so we can safely clean up. This fixes#2852.
* Consolidates deadline tests for privval Unix/TCP
Following on from #3115 and #3132, this converts fundamental timeout
errors from the client's `acceptConnection()` method so that these can
be detected by the test for the TCP connection.
Timeout deadlines are now tested for both TCP and Unix domain socket
connections.
There is also no need for the additional secret connection code: the
connection will time out at the `acceptConnection()` phase for TCP
connections, and will time out when attempting to obtain the
`RemoteSigner`'s public key for Unix domain socket connections.
* Removes extraneous logging
* Adds IsConnTimeout helper function
This commit adds a helper function to detect whether an error is either
a fundamental networking timeout error, or an `ErrConnTimeout` error
specific to the `RemoteSigner` class.
* Adds a test for the IsConnTimeout() helper function
* Separates tests logically for IsConnTimeout
* Adds a random suffix to temporary Unix sockets during testing
* Adds Unix domain socket tests for client (FAILING)
This adds Unix domain socket tests for the privval client. Right now,
one of the tests (TestRemoteSignerRetry) fails, probably because the
Unix domain socket state is known instantaneously on both sides by the
OS. Committing this to collaborate on the error.
* Removes extraneous logging
* Completes testing of Unix sockets client version
This completes the testing of the client connecting via Unix sockets.
There are two specific tests (TestSocketPVDeadline and
TestRemoteSignerRetryTCPOnly) that are only relevant to TCP connections.
* Renames test to show TCP-specificity
* Adds testing into closures for consistency (forgot previously)
* Moves test specific to RemoteSigner into own file
As per discussion on #3132, `TestRemoteSignerRetryTCPOnly` doesn't
really belong with the client tests. This moves it into its own file
related to the `RemoteSigner` class.
This cuts out two tests by constructing test cases and iterating through
them, rather than having separate sets of tests for TCP and Unix listeners.
This is as per the feedback from #3121.
* Close and recreate a RemoteSigner on err
* Update changelog
* Address Anton's comments / suggestions:
- update changelog
- restart TCPVal
- shut down on `ErrUnexpectedResponse`
* re-init remote signer client with fresh connection if Ping fails
- add/update TODOs in secret connection
- rename tcp.go -> tcp_client.go, same with ipc to clarify their purpose
* account for `conn returned by waitConnection can be `nil`
- also add TODO about RemoteSigner conn field
* Tests for retrying: IPC / TCP
- shorter info log on success
- set conn and use it in tests to close conn
* Tests for retrying: IPC / TCP
- shorter info log on success
- set conn and use it in tests to close conn
- add rwmutex for conn field in IPC
* comments and doc.go
* fix ipc tests. fixes#2677
* use constants for tests
* cleanup some error statements
* fixes#2784, race in tests
* remove print statement
* minor fixes from review
* update comment on sts spec
* cosmetics
* p2p/conn: add failing tests
* p2p/conn: make SecretConnection thread safe
* changelog
* IPCVal signer refactor
- use a .reset() method
- don't use embedded RemoteSignerClient
- guard RemoteSignerClient with mutex
- drop the .conn
- expose Close() on RemoteSignerClient
* apply IPCVal refactor to TCPVal
* remove mtx from RemoteSignerClient
* consolidate IPCVal and TCPVal, fixes#3104
- done in tcp_client.go
- now called SocketVal
- takes a listener in the constructor
- make tcpListener and unixListener contain all the differences
* delete ipc files
* introduce unix and tcp dialer for RemoteSigner
* rename files
- drop tcp_ prefix
- rename priv_validator.go to file.go
* bring back listener options
* fix node
* fix priv_val_server
* fix node test
* minor cleanup and comments
* split immutable and mutable parts of priv_validator.json
* fix bugs
* minor changes
* retrig test
* delete scripts/wire2amino.go
* fix test
* fixes from review
* privval: remove mtx
* rearrange priv_validator.go
* upgrade path
* write tests for the upgrade
* fix for unsafe_reset_all
* add test
* add reset test
Ref: #2563
I added IPC as an unencrypted alternative to SocketPV.
Besides I fixed the following aspects of SocketPV:
Added locking since we are operating on a single socket
The connection deadline is extended every time a successful packet exchange happens; otherwise the connection would always die permanently x seconds after the connection was established.
Added a ping/heartbeat mechanism to keep the connection alive; native TCP keepalives do not work in this use-case
* Extend the SecureConn socket to extend its deadline
* Add locking & ping/heartbeat packets to SocketPV
* Implement IPC PV and abstract socket signing
* Refactored IPC and SocketPV
* Implement @melekes comments
* Fixes to rebase
* WIP: switching to fixed offsets for SignBytes
* add version field to sign bytes and update order
* more comments on test-cases and add a tc with a chainID
* remove amino:"write_empty" tag
- it doesn't affect if default fixed size fields ((u)int64) are
written or not
- add comment about int->int64 casting
* update CHANGELOG_PENDING
* update documentation
* add back link to issue #1622 in documentation
* remove JSON tags and add (failing test-case)
* fix failing test
* update test-vectors due to added `Type` field
* change Type field from string to byte and add new type alias
- SignedMsgType replaces VoteTypePrevote, VoteTypePrecommit and adds new
ProposalType to separate votes from proposal when signed
- update test-vectors
* fix remains from rebasing
* use SignMessageType instead of byte everywhere
* fixes from review