Reliable Network Operations Center with Jenkins in the AdTech Industry
Submitted By Jenkins User Alexander Leibzon
A mobile performance marketing platform was struggling with a network operations center (NOC) that received alerts for failed builds or poor performance but did not act on them quickly. By turning to Jenkins, the team was able to significantly shorten communication cycles and keep builds on track.
Organization: A leading provider of mobile performance and advertising technology.
Project Website: https://plugins.jenkins.io/pagerduty/
Programming Language: Java
Platform: Docker, Kubernetes, Linux
Version Control System: GitHub
Build Tool: Maven
Community Support: Jenkins Users Google Group or IRC Chat, Jenkins.io websites & blogs
Getting reliable, timely alerts in PagerDuty.
Background: When delivering real-time ads in the marketplace, ensuring that developers are on call to respond to failures as needed is critical. We found that our developer alerts were being routed through a “mediator.” Builds that failed, or that yielded metrics that should have triggered an alert, did so by sending an email to the Network Operations Center. The NOC, in turn, contacted the responsible team or developer. This pass-along process delayed attending to the issues at hand, correcting the failures, and resuming the build.
In many cases, routing an alert through a NOC eliminates any sense of urgency. It could take up to 30 minutes before the responsible party was alerted and appropriate action was taken. Such a delay can significantly affect builds and pipelines. It took me by surprise that there was no PagerDuty plugin for Jenkins yet, so I decided to develop and open-source one.
Goals: Build a new Jenkins plugin to alert developers of issues and accelerate response and resolution times.
Solution & Results: We took inspiration from PagerDuty, an incident management platform that provides reliable notifications, automatic escalations, on-call scheduling, and other functionality to help teams detect and fix infrastructure problems quickly. Its mobile app allows you to trigger, acknowledge, and resolve incidents.
Once we had identified the need for a Jenkins PagerDuty plugin, it was an easy choice to turn to the Jenkins documentation and codebase. This is actually a really awesome way to get to know Jenkins in depth, much more so than just using it to execute pipelines or run ad hoc jobs. After going over the possible plugin types, we built the PagerDuty plugin as a post-build notifier.
One of the critical factors in getting this done was the overall extensibility of the Jenkins components. In addition, the Jenkins community and documentation played a considerable role in the quick development and adoption of the plugin.
We now have the option to trigger and resolve PagerDuty incidents directly from builds and pipelines. In addition, we shortened the “Alert to Resolution” cycle from half an hour to just a few minutes. Best of all, we now have a better, more holistic understanding of Jenkins internals.
With the open source PagerDuty plugin, we achieved our goals and more, including:
- The ability to trigger incidents on various job statuses: Success, Failure, Aborted, Unstable, & Not Built
- The ability to trigger incidents based on the number of consecutive build results
- The ability to automatically resolve incidents when the job is back to normal
- Being pipeline compatible
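
The behaviors listed above can be pictured as a small decision rule that runs after each build. The following is a minimal, self-contained sketch of that idea and not the plugin's actual code: the class name `IncidentPolicy`, its methods, and the `Action` enum are hypothetical. It also assumes that "back to normal" means a `SUCCESS` result that is not itself a configured trigger status, that any configured status counts toward the consecutive total, and that no second incident is opened while one is still unresolved.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of a trigger/resolve rule; not the plugin's real classes. */
public class IncidentPolicy {

    /** The job statuses named in the list above. */
    public enum Status { SUCCESS, FAILURE, ABORTED, UNSTABLE, NOT_BUILT }

    /** What the notifier would do after a build (assumed names). */
    public enum Action { TRIGGER, RESOLVE, NONE }

    private final Set<Status> triggerOn;   // statuses that count toward an incident
    private final int threshold;           // consecutive matching builds required
    private int consecutive = 0;           // current streak of matching builds
    private boolean incidentOpen = false;  // whether an incident is awaiting resolution

    public IncidentPolicy(Set<Status> triggerOn, int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        // Defensive copy that also works when the caller passes an empty set.
        EnumSet<Status> copy = EnumSet.noneOf(Status.class);
        copy.addAll(triggerOn);
        this.triggerOn = copy;
        this.threshold = threshold;
    }

    /** Decides what to do once a build has finished with the given result. */
    public Action onBuildFinished(Status result) {
        if (triggerOn.contains(result)) {
            consecutive++;
            // Assumption: open at most one incident per streak.
            if (consecutive >= threshold && !incidentOpen) {
                incidentOpen = true;
                return Action.TRIGGER;
            }
            return Action.NONE;
        }
        // Any non-matching result breaks the streak.
        consecutive = 0;
        // Assumption: "back to normal" means a successful build.
        if (result == Status.SUCCESS && incidentOpen) {
            incidentOpen = false;
            return Action.RESOLVE;
        }
        return Action.NONE;
    }
}
```

With a policy configured for `FAILURE` and `UNSTABLE` and a threshold of 2, a single failed build does nothing, a second consecutive bad build opens an incident, and the next successful build resolves it. In a real notifier, the `TRIGGER` and `RESOLVE` outcomes would map to the corresponding PagerDuty calls.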