The importance of trusting your tests: How we improved our testing and CI setup at TalkJS
You know something's off when developers start complaining that they’re unhappy with your testing setup. At TalkJS, we experienced this recently when our developers noted that our end-to-end tests were slow, inconsistent, and would sometimes fail in CI even when nothing was broken locally. Worse, they said: ‘We don't trust the tests any more.’
That aspect of trust was the most important for me. If you don't trust your tests, then everything else suffers. You’ll start to second-guess everything. You’ll be doing a lot more manual checking. With every release, you worry about what's going to happen. So something needed to be done.
So my colleague Matti and I took on the challenge and asked: how can we make our testing faster, more useful, and better at helping us solve problems quickly? In this article, I’ll discuss the changes we made to make our tests faster and more reliable, which may also be helpful if you’d like to get more value from your own tests.
TalkJS testing and CI setup
For context: At TalkJS we work with GitHub Actions for our testing and continuous integration (CI) setup. In GitHub Actions we have both a build pipeline and a testing pipeline that runs all of our tests. Those tests contain both unit tests and end-to-end (E2E) tests. We don't have any integration tests, because in my experience integration tests are difficult to get a lot of value from. They can be flaky and slow like E2E tests, but at the same time, they don’t give you the same level of confidence that everything will work after deployment.
The unit tests—front-end as well as back-end unit tests—tell us whether each function is working as intended. The unit tests are quick, low-level, and atomic. On the complete opposite end of the scale, we have our E2E tests, which go through some complete user flows to check that the whole app is working as expected. The E2E tests won’t tell you in which function or on what line the problem is, but they will say: there’s a problem, don't release. E2E tests can be slow, but they give you the confidence that your entire system is working and that it should be safe to release.
With this setup in place, what could we change to make our tests faster, and more helpful when something goes wrong? Here are some of the big, high-level changes that we made.
Changes to speed up and improve testing
1. Record your end-to-end tests
When one of your tests fails, you want to know what went wrong. Previously, we worked with error logs, which show you anything that broke in the code, and we captured screenshots of how things looked when a failed test had finished. But often a screenshot will just show you an empty chat UI, with no indication of what actually went wrong. So, to make our tests more helpful, we started recording our whole E2E testing process with TestCafe, which has a built-in recording feature. With a recording of an E2E test, instead of just an empty chat UI, you might see a button do nothing when it's clicked, which tells you exactly where to look for the error. That additional information from a recording makes it a lot easier to track down any issues.
Enabling video recordings isn’t free, of course. It did make our tests slower to run. So we made recording optional, with a label you can add to your pull request (PR) to enable it. We don't need it very often. But when you're facing an issue, having a recording can make it much easier to figure out what's going on. One fewer roadblock to shipping your feature.
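To give an idea of how this can fit together, here's a minimal sketch using TestCafe's programmatic API. It assumes a RECORD_VIDEO environment variable that the workflow could set when the PR carries the recording label; the label name, paths, and glob are made up for illustration, not our actual setup:

```typescript
// run-e2e.ts: a minimal sketch, not our actual runner script.
import createTestCafe from 'testcafe';

async function run() {
  const testcafe = await createTestCafe();
  try {
    const runner = testcafe
      .createRunner()
      .src(['tests/e2e/**/*.ts'])      // hypothetical test location
      .browsers(['chrome:headless']);

    // Only record video when the CI workflow asks for it, e.g. because
    // the PR carries a (hypothetical) "record-e2e" label.
    if (process.env.RECORD_VIDEO === 'true') {
      runner.video('artifacts/videos', {
        failedOnly: true,  // keep only the runs that actually failed
        singleFile: false, // one video per test
      });
    }

    const failedCount = await runner.run();
    process.exitCode = failedCount > 0 ? 1 : 0;
  } finally {
    await testcafe.close();
  }
}

run();
```

One thing to keep in mind if you try this: TestCafe's video recording relies on FFmpeg being available on the machine that runs the tests.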
2. Scrap some awaits
Slow tests can be awful to work with. At one point, our E2E tests took about 40 minutes to complete, which meant a 40-minute feedback loop for every change you made. We discovered that when some of our individual tests seeded the database, they would send a hundred chat messages one by one. There was no need for those requests to happen sequentially. Just by removing an await and firing all the requests in parallel, we could easily shave 20 seconds off a single test.
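As an illustration of the kind of change this was, here's a sketch with a hypothetical sendMessage helper standing in for our actual seeding code:

```typescript
// Hypothetical seeding helper; the real code talks to the backend under test.
async function sendMessage(conversationId: string, text: string): Promise<void> {
  // ...send the message to the test environment...
}

const messages = Array.from({ length: 100 }, (_, i) => `Test message ${i + 1}`);

// Before: each request waits for the previous one to finish.
async function seedSequentially(conversationId: string) {
  for (const text of messages) {
    await sendMessage(conversationId, text);
  }
}

// After: fire all the requests at once and wait for them together.
async function seedInParallel(conversationId: string) {
  await Promise.all(messages.map((text) => sendMessage(conversationId, text)));
}
```

If the system under test throttles concurrent requests, you could fire them in smaller batches instead, but the idea is the same: don't wait for responses you don't need to wait for.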
3. Parallelize your tests
When you’re running multiple tests concurrently, you might at a certain point run into some limitations. At least we did. With recording enabled and tests running concurrently, we were pushing against the performance constraints of our CI runners, and the process would sometimes grind to a halt completely. To resolve those performance limitations, we just threw money at the problem. We split our test suite into three, and we've now got these three E2E test suites all running in parallel on three different E2E runners. Essentially we're running with triple concurrency, but even better, because we do so without increasing the load on any individual runner.
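There are different ways to wire this up. The sketch below shows one possible approach, where each CI job sets a (hypothetical) E2E_SUITE variable and the runner script picks the matching slice of test files; the suite names and globs are made up for illustration:

```typescript
// Sketch only: suite names, globs, and the E2E_SUITE variable are made up.
// Each of the three CI runners sets E2E_SUITE to a different value, so each
// one runs a different slice of the E2E tests on its own machine.
const suites: Record<string, string[]> = {
  'suite-1': ['tests/e2e/messaging/**/*.ts'],
  'suite-2': ['tests/e2e/inbox/**/*.ts'],
  'suite-3': ['tests/e2e/misc/**/*.ts'],
};

const suiteName = process.env.E2E_SUITE ?? 'suite-1';
const src = suites[suiteName];
if (!src) {
  throw new Error(`Unknown E2E suite: ${suiteName}`);
}

// `src` is then handed to the TestCafe runner, e.g. runner.src(src),
// just as in the recording sketch above.
```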
Running tests concurrently is not an option in all circumstances. In our earlier, ‘legacy’ test setup, we still had some dependencies between the individual tests. Dependencies between tests meant that tests had to run in a specific order, which can be a real obstacle to running them concurrently. To resolve this, we took a pragmatic, iterative approach: whenever someone would touch an old test, they would rework it, so that step-by-step each test would be isolated from all the others. Moreover, as we removed dependencies, we could immediately optimize things and split them up. Initially, our tests took around 40 minutes to run. Now it's under 15 minutes with many tests running in parallel—that’s more than twice as fast.
4. Dockerize everything … eventually
Ideally, you want your local environment and your CI environment to be as close as possible to the production environment. However, at a certain point, we were getting failures in our E2E tests that we couldn't replicate locally. We were somewhat flying blind: we were running the code in production mode, but we weren't running it in a Docker image as a stand-alone thing. While we were certainly testing the code, we weren't testing the product. That made it hard to actually track down and fix the problems the tests found.
To mitigate this discrepancy, initially we tried running our E2E tests against our production Docker images. But this turned out to be really slow. Running the website, the docs, and everything else alongside TalkJS was too much for our little CI runner. So we put the plan to dockerize everything on hold for the moment. Instead, we added a smoke test with that fully-dockerized setup. Basically our smoke test loads the UI, writes ‘Hello’, clicks Send, checks that a message has appeared, and checks that it can navigate around the website. It looks at the product as a whole and asks whether it works.
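To make that concrete, a smoke test along those lines might look roughly like this in TestCafe. The URL and selectors below are placeholders rather than our real ones:

```typescript
import { Selector } from 'testcafe';

// Placeholder URL: in CI this would point at the fully dockerized stack.
fixture('Smoke test').page('http://localhost:8080');

test('can send a message and navigate the site', async (t) => {
  // Hypothetical selectors; the real UI uses different hooks.
  const messageField = Selector('[data-testid="message-field"]');
  const sendButton = Selector('[data-testid="send-button"]');
  const lastMessage = Selector('[data-testid="message-list"] .message').nth(-1);

  // Write 'Hello', click Send, and check that the message shows up.
  await t
    .typeText(messageField, 'Hello')
    .click(sendButton)
    .expect(lastMessage.textContent).contains('Hello');

  // Check that basic navigation around the website still works.
  await t
    .click(Selector('a').withText('Docs'))
    .expect(Selector('h1').exists).ok();
});
```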
Eventually, we would love to dockerize everything in the local environment as well, and have the full setup running for all of our end-to-end tests, not just the smoke test. We're moving towards that.
5. Make caching more granular
Caching the right things can save you quite a lot of time, and that applies to testing too. Previously, for each test run we would cache all of the dependencies and third-party libraries for our entire codebase. The front-end unit tests would download the dependencies for the back end as well, with some of these caches being gigabytes in size. Moreover, caches could get invalidated quite easily: if a back-end dependency changed, that would invalidate the cache for the front end too, so it would have to re-download all dependencies from scratch. To address this, we decided to make caches more granular. This means that each job only downloads the dependencies that are relevant to it, and the caches don't get invalidated as often.
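The mechanics of this live in our CI configuration rather than in application code, but the idea can be sketched as follows: each package derives its own cache key from its own lockfile, so a change in one package's dependencies no longer invalidates the others' caches. The directory and lockfile names here are assumptions for illustration:

```typescript
// Sketch of granular cache keys: one key per package, based only on that
// package's own lockfile. (In practice this lives in CI config, not app code.)
import { createHash } from 'node:crypto';
import { readFileSync } from 'node:fs';

function cacheKey(packageDir: string): string {
  const lockfile = readFileSync(`${packageDir}/package-lock.json`);
  const hash = createHash('sha256').update(lockfile).digest('hex');
  return `${packageDir}-deps-${hash}`;
}

// Separate keys: changing backend/package-lock.json leaves the
// frontend cache key, and thus its cache, untouched.
console.log(cacheKey('frontend'));
console.log(cacheKey('backend'));
```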
Another small improvement is that we started using Docker BuildX to build our Docker images. BuildX has a newer build pipeline with built-in support for caching in GitHub Actions: it loads the file system from the last time it built the image and then reruns the build. As a result, the build time for one of our Docker images went down from six minutes to only 16 seconds.
Rebuilding trust in tests
When your end-to-end tests are broken and take forever, people are (sadly) more likely to make the tests worse. They won't take the same level of care and attention as they might have, had the tests been well-maintained and working consistently. Conversely, when people trust and have confidence in the tests, they are more willing to work on them and improve them. It's almost the inverse of the ‘broken windows theory’ in sociology: a setup with useful, reliable tests is much more likely to have its tests attended to.
As a result of the improvements to splitting, parallelization, caching, and dockerization, our tests now run faster, are less flaky, and are much more useful to work with. We've shortened our previously long feedback loop and are solving problems more quickly. Of course, upgrading our CI setup hasn't stopped issues from happening. We still have buggy code to fix. But we're able to figure out what's wrong and resolve it much faster. All of which goes to show that venting in retros can have real-world effects.