========================================
RFC 004: E2E Test Framework Enhancements
========================================

Changelog
---------

- 2021-09-14: started initial draft (@tychoish)
|
|
|
|
Abstract
--------

This document discusses a series of improvements to the e2e test framework
that we can consider during the next few releases to help boost confidence in
Tendermint releases and improve developer efficiency.
|
|
|
|
Background
----------

During the 0.35 release cycle, the e2e tests were a source of great
value, helping to identify a number of bugs before release. At the same time,
the tests did not pass consistently during this period, which reduced
their value and forced the core development team to allocate time and energy
to maintaining the e2e tests and the test harness, and to chasing down issues
with them. The experience of this release cycle suggests a series of
improvements to the test framework. This document attempts to capture
these improvements, along with their motivations and potential impact.
|
|
|
|
Projects
--------

Flexible Workload Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Presently the e2e suite contains a single workload generation pattern, which
exists simply to ensure that the test networks have some work during their
runs. However, the shape and volume of the work are very consistent and
deliberately gentle, to help ensure test reliability.

We don't need a complex workload generation framework, but having
a few different workload shapes available for test networks, both generated and
hand-crafted, would be useful.
|
|
|
|
Workload patterns/configurations might include:

- transaction targeting patterns (include light nodes, round robin, target
  individual nodes).

- variable transaction size over time.

- transaction broadcast options (synchronous, checked, fire-and-forget,
  mixed).

- number of transactions to submit.

- non-transaction workloads (evidence submission, query, event subscription).
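
As a rough illustration of how a few of these dimensions could be described
together, the sketch below models a workload shape as a small Go struct and
expands it into a deterministic submission plan. Every name here
(``Workload``, ``BroadcastMode``, ``Plan``, the node names) is hypothetical
and not part of the existing e2e code; it only covers round-robin targeting,
a fixed transaction count, and a transaction size that grows linearly.

```go
package main

import "fmt"

// BroadcastMode is a hypothetical enumeration of the broadcast options
// listed above; it does not correspond to an existing Tendermint type.
type BroadcastMode string

const (
	BroadcastSync          BroadcastMode = "sync"
	BroadcastChecked       BroadcastMode = "checked"
	BroadcastFireAndForget BroadcastMode = "fire-and-forget"
)

// Workload is a hypothetical description of one workload shape.
type Workload struct {
	Targets []string      // node names that receive transactions
	Mode    BroadcastMode // how each transaction is broadcast
	Count   int           // number of transactions to submit
	MinSize int           // size in bytes of the first transaction
	MaxSize int           // size in bytes of the last transaction
}

// Tx is one planned transaction submission.
type Tx struct {
	Target string
	Size   int
	Mode   BroadcastMode
}

// Plan expands a Workload into a deterministic list of submissions:
// targets are chosen round robin and sizes change linearly from
// MinSize to MaxSize over the course of the run.
func (w Workload) Plan() []Tx {
	if len(w.Targets) == 0 || w.Count <= 0 {
		return nil
	}
	txs := make([]Tx, 0, w.Count)
	for i := 0; i < w.Count; i++ {
		size := w.MinSize
		if w.Count > 1 {
			size += (w.MaxSize - w.MinSize) * i / (w.Count - 1)
		}
		txs = append(txs, Tx{
			Target: w.Targets[i%len(w.Targets)],
			Size:   size,
			Mode:   w.Mode,
		})
	}
	return txs
}

func main() {
	w := Workload{
		Targets: []string{"validator01", "full01", "light01"},
		Mode:    BroadcastSync,
		Count:   5,
		MinSize: 100,
		MaxSize: 500,
	}
	for _, tx := range w.Plan() {
		fmt.Printf("%s %d %s\n", tx.Target, tx.Size, tx.Mode)
	}
}
```

Keeping the plan separate from the code that submits transactions would make
hand-crafted and generated workloads interchangeable, since both reduce to the
same list of planned submissions.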
|
|
|
|
Configurable Generator
~~~~~~~~~~~~~~~~~~~~~~

The nightly e2e suite is defined by the `testnet generator
<https://github.com/tendermint/tendermint/blob/master/test/e2e/generator/generate.go#L13-L65>`_,
and it's difficult to add dimensions or change the focus of the test suite in
any way without modifying the implementation of the generator. If the
generator were more configurable, potentially via a file rather than in
the Go implementation, we could modify the focus of the test suite on the
fly.
|
|
|
|
Features that we might want to configure:

- number of test networks of various topologies to generate, to improve
  coverage of different configurations.

- test application configurations (to modify the latency of ABCI calls, etc.).

- size of test networks.

- workload shape and behavior.

- initial sync and catch-up configurations.
|
|
|
|
The testnet generator currently provides runtime options for limiting the
generated networks to specific types of P2P stacks, and for generating multiple
groups of test cases to support parallelism. The goal is to extend this pattern
and avoid hardcoding the matrix of test cases in the generator code. Once the
testnet configuration generation behavior is configurable at runtime,
developers may be able to use the e2e framework to validate changes before
merging them, rather than landing changes that break the e2e tests a day later.
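
One way to picture moving the test matrix out of the generator code is
sketched below: the dimensions live in a file, and the generator only expands
them into combinations. This is a minimal sketch under stated assumptions. The
``GeneratorConfig`` format, its field names, and the example values are all
hypothetical, and JSON is used only because it is in the Go standard library
(the existing hand-crafted case is a TOML file).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GeneratorConfig is a hypothetical file format listing the dimensions
// of the test matrix; it is not the generator's actual configuration.
type GeneratorConfig struct {
	Topologies []string `json:"topologies"`
	Sizes      []int    `json:"sizes"`
	Workloads  []string `json:"workloads"`
}

// Testnet is one generated combination of the configured dimensions.
type Testnet struct {
	Topology string
	Size     int
	Workload string
}

// Load decodes a GeneratorConfig from JSON bytes.
func Load(data []byte) (GeneratorConfig, error) {
	var cfg GeneratorConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

// Expand returns the full cross product of the configured dimensions,
// in the order the values appear in the configuration.
func Expand(cfg GeneratorConfig) []Testnet {
	var out []Testnet
	for _, topology := range cfg.Topologies {
		for _, size := range cfg.Sizes {
			for _, workload := range cfg.Workloads {
				out = append(out, Testnet{
					Topology: topology,
					Size:     size,
					Workload: workload,
				})
			}
		}
	}
	return out
}

func main() {
	raw := []byte(`{
		"topologies": ["single", "quad"],
		"sizes": [1, 4],
		"workloads": ["gentle"]
	}`)
	cfg, err := Load(raw)
	if err != nil {
		panic(err)
	}
	for _, n := range Expand(cfg) {
		fmt.Printf("%+v\n", n)
	}
}
```

With this shape, changing the focus of a run becomes a matter of editing the
file rather than the Go implementation. A real generator would likely also
need sampling or pruning, since a full cross product grows quickly.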
|
|
|
|
In addition to the autogenerated suite, it might make sense to maintain a
small collection of hand-crafted cases that exercise configurations of
concern, to run as part of the nightly (or less frequent) loop.
|
|
|
|
Implementation Plan Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As a development team, we should determine early in the development cycle
which features should affect the e2e tests. If we intend to modify the e2e
tests to exercise a feature, we should identify this and begin the
integration process as early as possible.

To facilitate this, we should adopt a practice whereby we exercise specific
features that are currently under development more rigorously in the e2e
suite, and then, as development stabilizes, reduce the number or weight
of these features in the suite.
|
|
|
|
As of 0.35 there are essentially two end-to-end tests: the suite of 64
generated test networks, and the hand-crafted ``ci.toml`` test case. The
generated test cases help provide systematic coverage, while the ``ci`` run
provides coverage for a large number of features.
|
|
|
|
Reduce Cycle Time
~~~~~~~~~~~~~~~~~

One of the barriers to leveraging the e2e framework, and one of the challenges
in debugging failures, is that the cycle time of running a single test
iteration is quite high: 5 minutes to build the docker image, plus the time to
run the test or tests.

There are a number of improvements and enhancements that can reduce the cycle
time in practice:
|
|
|
|
- reduce the amount of time required to build the docker image used in these
  tests. Without the dependency on CGo, the tendermint binaries could be
  (cross) compiled outside of the docker container and then injected into
  it, which would take better advantage of docker's native caching.
  Indeed, without the dependency on CGo there would be no hard requirement
  for the e2e tests to use docker at all.

- support test parallelism. Because of the way the testnets are orchestrated,
  a single system can really only run one network at a time. For executions
  (local or remote) with more resources, there's no reason not to run a few
  networks in parallel to reduce the feedback time.

- prune testnet configurations that are unlikely to provide good signal, to
  shorten the time to feedback.

- apply some kind of tiered approach to test execution, to improve the
  legibility of the test result. For example, order tests by the dependency of
  their features, or run test networks without perturbations before running
  that configuration with perturbations, to be able to isolate the impact of
  specific features.

- orchestrate the test harness directly from ``go test`` rather than via a
  special harness and shell scripts, so e2e tests may more natively fit into
  developers' existing workflows.
|
|
|
|
Many of these improvements, particularly reducing the build time, will also
reduce the time to get feedback during automated builds.
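
To make the parallelism idea concrete, here is a minimal sketch of running
several test networks with a bounded number in flight at once. It assumes the
orchestration limitation described above has been lifted so that networks no
longer conflict on a single host; ``RunParallel``, ``ErrStalled``, and the
``run`` callback are hypothetical names, and the callback stands in for
whatever actually starts and checks a network.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStalled is a hypothetical sentinel error used by the demo runner below.
var ErrStalled = errors.New("network stalled")

// RunParallel calls run once per testnet name, keeping at most limit
// calls in flight at any time, and returns the errors keyed by name.
func RunParallel(names []string, limit int, run func(name string) error) map[string]error {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	failures := make(map[string]error)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit runs are already active
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := run(name); err != nil {
				mu.Lock()
				failures[name] = err
				mu.Unlock()
			}
		}(name)
	}
	wg.Wait()
	return failures
}

func main() {
	names := []string{"net-a", "net-b", "net-c", "net-d"}
	failures := RunParallel(names, 2, func(name string) error {
		if name == "net-c" {
			return ErrStalled // pretend one network fails
		}
		return nil
	})
	fmt.Println(len(failures), "failed:", failures["net-c"])
}
```

The limit would presumably be chosen from the resources of the machine running
the suite, with a limit of 1 reproducing the current one-network-at-a-time
behavior.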
|
|
|
|
Deeper Insights
~~~~~~~~~~~~~~~

When a test network fails, it's incredibly difficult to understand *why* the
network failed, as the current system provides very little insight into the
system outside of the process logs. When a test network stalls or fails,
developers should be able to quickly and easily get a sense of the state of
the network and all nodes.

Improvements in pursuit of this goal include functionality that would help
node operators in production environments, by improving the quality and utility
of the logging messages and other reported metrics, as well as
tools to collect and aggregate this data for developers in the context of test
networks.
|
|
|
|
- Interleave messages from all nodes in the network to be able to correlate
  events during the test run.

- Collect structured metrics of the system operation (CPU/MEM/IO) during the
  test run, as well as from each tendermint/application process.

- Build (simple) tools to be able to render and summarize the data collected
  during the test run to answer basic questions about test outcome.
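
The first item, interleaving messages from all nodes, could look roughly like
the sketch below. It assumes each log line starts with an RFC 3339 timestamp
followed by a space and the message, which is an illustrative format rather
than Tendermint's actual log format, and it assumes node clocks are close
enough to compare. ``LogEntry``, ``ParseLine``, and ``Interleave`` are
hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// LogEntry is one parsed log line, tagged with the node that emitted it.
type LogEntry struct {
	Time    time.Time
	Node    string
	Message string
}

// ParseLine parses a line of the assumed form "<RFC 3339 timestamp> <message>".
func ParseLine(node, line string) (LogEntry, error) {
	parts := strings.SplitN(line, " ", 2)
	if len(parts) != 2 {
		return LogEntry{}, fmt.Errorf("malformed log line from %s: %q", node, line)
	}
	ts, err := time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return LogEntry{}, fmt.Errorf("bad timestamp from %s: %w", node, err)
	}
	return LogEntry{Time: ts, Node: node, Message: parts[1]}, nil
}

// Interleave merges per-node log lines into one timeline ordered by
// timestamp, breaking ties by node name so the output is deterministic.
func Interleave(perNode map[string][]string) ([]LogEntry, error) {
	var all []LogEntry
	for node, lines := range perNode {
		for _, line := range lines {
			entry, err := ParseLine(node, line)
			if err != nil {
				return nil, err
			}
			all = append(all, entry)
		}
	}
	sort.SliceStable(all, func(i, j int) bool {
		if !all[i].Time.Equal(all[j].Time) {
			return all[i].Time.Before(all[j].Time)
		}
		return all[i].Node < all[j].Node
	})
	return all, nil
}

func main() {
	merged, err := Interleave(map[string][]string{
		"validator01": {
			"2021-09-14T10:00:00Z proposing block",
			"2021-09-14T10:00:02Z committed block",
		},
		"validator02": {
			"2021-09-14T10:00:01Z received proposal",
		},
	})
	if err != nil {
		panic(err)
	}
	for _, e := range merged {
		fmt.Printf("%s [%s] %s\n", e.Time.Format(time.RFC3339), e.Node, e.Message)
	}
}
```

A merged timeline like this is also a natural input for the summarizing tools
mentioned in the last item, since it puts events from every node on a common
axis.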
|
|
|
|
Flexible Assertions
~~~~~~~~~~~~~~~~~~~

Currently, all assertions run for every test network, which makes the
assertions pretty bland and the framework primarily useful as a smoke-test
framework. It might be useful to be able to write and run different
tests for different configurations. This could allow us to test outside of the
happy path.

In general, our existing assertions occupy a fraction of the total test time,
so adding a few extra test assertions would come at limited
cost and could help build confidence.
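
One possible shape for configuration-specific assertions is sketched below:
each assertion carries a predicate describing the networks it applies to, and
the framework selects the matching assertions per network. The ``Testnet``
summary, the ``Assertion`` type, and ``Select`` are hypothetical and are not
the existing e2e test API.

```go
package main

import "fmt"

// Testnet is a hypothetical summary of a test network's configuration.
type Testnet struct {
	Name       string
	Validators int
	Perturbed  bool
}

// Assertion pairs a check with a predicate that decides where it runs.
type Assertion struct {
	Name      string
	AppliesTo func(Testnet) bool  // nil means "run for every network"
	Check     func(Testnet) error // the assertion itself
}

// Select returns the assertions that apply to the given network,
// preserving their original order.
func Select(all []Assertion, net Testnet) []Assertion {
	var out []Assertion
	for _, a := range all {
		if a.AppliesTo == nil || a.AppliesTo(net) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	suite := []Assertion{
		{
			Name:  "blocks are produced",
			Check: func(Testnet) error { return nil },
		},
		{
			Name:      "perturbed nodes recover",
			AppliesTo: func(n Testnet) bool { return n.Perturbed },
			Check: func(n Testnet) error {
				if n.Validators < 1 {
					return fmt.Errorf("%s: no validators to recover", n.Name)
				}
				return nil
			},
		},
	}

	net := Testnet{Name: "quad-perturbed", Validators: 4, Perturbed: true}
	for _, a := range Select(suite, net) {
		if err := a.Check(net); err != nil {
			fmt.Println("FAIL", a.Name, err)
			continue
		}
		fmt.Println("ok", a.Name)
	}
}
```

Assertions without a predicate keep today's behavior of running everywhere, so
existing smoke tests would not need to change.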
|
|
|
|
Additional Kinds of Testing
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The existing e2e suite exercises networks of nodes that have a homogeneous
tendermint version and a stable configuration, and that are expected to make
progress. There are many other possible test configurations that may be
interesting to engage with. These could include dimensions such as:
|
|
|
|
- Multi-version testing to exercise our compatibility guarantees for networks
  that might have different tendermint versions.

- As a flavor of multi-version testing, include upgrade testing, to build
  confidence in migration code and procedures.

- Additional test applications, particularly practical-type applications
  including some that use gaiad and/or the cosmos-sdk, as well as test-only
  applications that simulate other kinds of applications (e.g. variable
  application operation latency).

- Tests of "non-viable" configurations that ensure that forbidden combinations
  lead to halts.
|
|
|
|
References
----------

- `ADR 66: End-to-End Testing <../architecture/adr-66-e2e-testing.md>`_