High Scale Service Deployment: Taboola’s Recommended Flow
Submitted By Jenkins User Tidhar Klein Orbach
When a recommendation engine has to respond to hundreds of thousands of requests per second, there is no room for deployment downtime.
Organization: Taboola, www.taboola.com
Industry: Continuous Delivery
Programming Language: Groovy
Version Control System: BitBucket Server
Build Tool: Gradle
Community Support: Jenkins.io websites and blogs. Spoke with colleagues and peers
Reducing developer frustration with an automated solution that works with a variety of circumstances and operating systems.
Background: Taboola’s recommendation engine responds to hundreds of thousands of requests per second. The service has to be fast: its p95 latency should stay below 500 milliseconds per request. That means we can’t afford any downtime, or even slower responses.
In addition, it’s critical to prevent the installation of a faulty version. A faulty version could lead to downtime or degraded performance, which can directly result in a loss of revenue. For this reason, we have multiple testing gateways during development to help catch a bad version. However, in our experience, unexpected and often bad things can still happen when the software meets production, and we need to be ready for that. Another important requirement is to deploy during office hours, when most of the engineers are available to assist should something go wrong.
Goals: To deploy a highly sophisticated Java service, one that is very actively developed on a daily basis, to thousands of servers in multiple data centers around the world.
Solution & Results: To meet the objectives, we designed a flow for the deployment. The following are the flow stages at a high level:
- Is today a deployment day? — We don’t deploy on holidays 🙂
- Is today’s version valid? — Validate the version using canary testing which is implemented in another Jenkins flow
- Data center verification — Deploy on a single data center and verify
- New version for all — Deploy on the rest of the data centers (6 out of 7) in parallel
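The four stages above can be sketched as a small gate-by-gate function. This is only an illustration in Python, not the team's Groovy pipeline: the names `is_deployment_day`, `run_flow`, and `deploy_and_verify` are hypothetical, the canary result is passed in as a plain boolean because that check lives in another Jenkins flow, and treating weekends as non-deployment days is an assumption (the source only mentions holidays).

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def is_deployment_day(today, holidays):
    """Assumed rule: weekdays that are not in the holiday set are deployment days."""
    return today.weekday() < 5 and today not in holidays


def run_flow(today, holidays, data_centers, canary_passed, deploy_and_verify):
    """Run the four stages in order and return the data centers that were deployed.

    deploy_and_verify(dc) is a hypothetical callable that deploys one data
    center and returns True when its verification passes.
    """
    if not is_deployment_day(today, holidays):
        return []  # stage 1: not a deployment day
    if not canary_passed:
        return []  # stage 2: canary validation (done in another flow) failed
    first, rest = data_centers[0], data_centers[1:]
    if not deploy_and_verify(first):
        return []  # stage 3: the single data center did not verify
    # Stage 4: deploy the remaining data centers in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(rest))) as pool:
        results = list(pool.map(deploy_and_verify, rest))
    return [first] + [dc for dc, ok in zip(rest, results) if ok]
```

The point of the ordering is that each gate is cheaper and lower-risk than the one after it, so a bad version is stopped before it reaches more than one data center.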
The deployment procedure on a single data center goes like this:
- Get the list of servers to be deployed
- Calculate the size of the server batch (using metrics and math 🙂)
- For each server in the batch:
  - Silence all alerts
  - Stop the old version and remove it
  - Install the new version
  - Start the service
  - Verify that the service started correctly
  - Unsilence all alerts
- Run a batch verification to check various metrics of the domain
- Wait a minute before starting the next server batch
- Repeat until no servers are left
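As a rough illustration of the per-data-center procedure above, here is a minimal Python sketch of a batched rollout. It is not the actual pipeline code: the `ops` object and all of its method names are hypothetical stand-ins for the real deployment steps, and the batch size is taken as an input because the source does not describe how it is calculated.

```python
import time


def batches(servers, batch_size):
    """Yield consecutive slices of at most batch_size servers."""
    for start in range(0, len(servers), batch_size):
        yield servers[start:start + batch_size]


def deploy_data_center(servers, batch_size, ops, pause_seconds=60, sleep=time.sleep):
    """Roll a new version over one data center, batch by batch.

    ops is a hypothetical object providing silence_alerts, stop_and_remove,
    install, start, is_healthy, unsilence_alerts, and verify_batch.
    """
    all_batches = list(batches(servers, batch_size))
    for index, batch in enumerate(all_batches):
        for server in batch:
            ops.silence_alerts(server)
            try:
                ops.stop_and_remove(server)
                ops.install(server)
                ops.start(server)
                if not ops.is_healthy(server):
                    raise RuntimeError(f"{server} did not start correctly")
            finally:
                # Assumption: alerts are restored even when a step fails.
                ops.unsilence_alerts(server)
        if not ops.verify_batch(batch):
            raise RuntimeError(f"batch verification failed for {batch}")
        if index < len(all_batches) - 1:
            sleep(pause_seconds)  # wait before starting the next batch
```

Raising on the first failed server or failed batch verification stops the rollout early, which limits how many servers can end up running a faulty version.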
For reference, the flow is detailed at: https://engineering.taboola.com/high-scale-service-deployment/
All of the logic is implemented with Jenkins Pipelines and Groovy. We created a large shared library repository containing our deployment flow infrastructure, which made the process easy to maintain, extend, and generalize to other services. As for Jenkins plugins, we use several during the flow run to report metrics and send alerts. For example, we integrated the PagerDuty Plugin to trigger an alert in case of a failure. The alert is triggered and resolved automatically by code.
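One way to picture an alert that is triggered and resolved by code is a small wrapper around a deployment step. The Python sketch below is an assumption about the general pattern, not the plugin's API: `trigger` and `resolve` are hypothetical callables standing in for whatever the alerting integration provides, and `incident_key` is a made-up identifier used to match a resolve to an earlier trigger.

```python
def run_with_alert(step, trigger, resolve, incident_key):
    """Run a step; trigger an alert if it fails, resolve the alert if it succeeds.

    trigger(incident_key, message) and resolve(incident_key) are hypothetical
    hooks into an alerting system.
    """
    try:
        result = step()
    except Exception as error:
        trigger(incident_key, str(error))
        raise  # keep the failure visible to the caller
    resolve(incident_key)
    return result
```

With this shape, a later successful run under the same key clears an alert raised by an earlier failed run, so nobody has to close it by hand.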
All in all, we saw great results, including:
- a deployment flow with high reliability
- easier maintenance and extension with Jenkins Pipelines and Groovy
- the ability to deploy to a higher number of servers in the same or even less time, thanks to the Jenkins Pipeline flow