About the Client
A major North American telecommunications company is expanding cloud-native operations with the aim of improving customer satisfaction while reducing call centre use. Sourced Group (Sourced), an Amdocs company, has worked with the telco since its first migration to AWS ten years ago. This case study centres on a project for its digital self-service division, which focuses on continuous modernisation of the technical stack and release process for its data services platform.
In 2021 we incubated an innovative cloud-native ‘data capture and replay’ testing solution for the platform. The idea has since progressed through incubation to beta rollout and is poised for general availability. Headline achievements to date include:
- Identification and resolution of an autoscaling misconfiguration (surfaced through the tool’s automated replay performance testing feature), avoiding production outages.
- 60-98% reduction in time taken to troubleshoot production issues when using the tool.
- Automated regression testing of microservices earlier in the process to reduce change failure rates per release.
Challenge: Volume and Complexity of Testing Hindered Progress and Threatened Customer Satisfaction
The telco’s digital self-service division orchestrates hundreds of systems which generate hundreds of millions of transactions per day. This creates a demanding and highly dynamic environment for application development, testing, and troubleshooting. Testing new features and bug fixes frequently took days, weeks, and sometimes months. Consequently, long wait times for new self-service features were the norm, putting customer satisfaction at risk.
One factor contributing to software time-to-market difficulties was the labour-intensive, time-consuming nature of front-end scenario-based testing. It was also performed late in the development process, so reworking any problems was costly and resulted in release cycle delays and rollbacks. Modernisation of the testing process had become a critical priority. Business leaders were eager to resolve the issues and wanted investment in this area to deliver rapid incremental progress with measurable proof of business value.
Solution: A Revolutionary Traffic-Driven Testing Platform
Addressing the telco’s challenges called for an ambitious approach to the testing of new features and bug fixes. We worked with in-house teams to reimagine production troubleshooting, root cause investigations, and proactive and reactive testing. This underpinned our development of an innovative traffic-driven testing (TDT) solution rooted in data capture and replay.
Put simply, the solution addresses two main pain points within the software development lifecycle (SDLC): proactive testing and reactive troubleshooting.
Proactive Testing
Issues related to software and infrastructure changes are proactively identified earlier in the SDLC process before they reach customers, using real-world scenario-based testing. The solution harnesses carefully curated API request and response data from production traffic and automates the replay of such transactions against the new release to compare the differences.
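The replay-and-compare idea can be illustrated with a minimal Python sketch. It is not the TDT implementation, which the case study does not detail: the `CapturedTransaction` shape, the `candidate_release` stand-in, and the `ignore` set of volatile fields are all hypothetical, and a real system would replay over the network rather than call a function.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CapturedTransaction:
    """One captured API exchange (hypothetical shape, for illustration only)."""
    request: dict
    response: dict


def diff_responses(expected: dict, actual: dict,
                   ignore: frozenset = frozenset()) -> dict:
    """Return {field: (expected, actual)} for each top-level field that differs."""
    keys = (expected.keys() | actual.keys()) - ignore
    return {
        key: (expected.get(key), actual.get(key))
        for key in sorted(keys)
        if expected.get(key) != actual.get(key)
    }


def replay(transactions: Iterable[CapturedTransaction],
           handler: Callable[[dict], dict],
           ignore: frozenset = frozenset()) -> list:
    """Replay each captured request against `handler` and collect the differences."""
    report = []
    for index, tx in enumerate(transactions):
        delta = diff_responses(tx.response, handler(tx.request), ignore)
        if delta:
            report.append((index, delta))
    return report


def candidate_release(request: dict) -> dict:
    """Stand-in for a new release; it upper-cases the plan name (a regression)."""
    return {"status": 200, "plan": request["plan"].upper(), "trace_id": "new"}


captured = [
    CapturedTransaction({"plan": "basic"},
                        {"status": 200, "plan": "basic", "trace_id": "a1"}),
    CapturedTransaction({"plan": "PRO"},
                        {"status": 200, "plan": "PRO", "trace_id": "b2"}),
]

# trace_id changes on every call, so it is excluded from the comparison.
regressions = replay(captured, candidate_release, ignore=frozenset({"trace_id"}))
print(regressions)
```

Excluding fields that legitimately change between runs (identifiers, timestamps) is one common way to keep such comparisons from flagging noise; which fields qualify would depend on the services involved.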
Real-world transactions represent authentic customer and platform conditions, thereby improving test quality and eliminating the need for test data creation. This availability of production-quality test transactions for use in regression automation results in a faster, more reliable release testing cycle. It also improves the developer experience, enabling real-time debugging of new API development efforts using historical scenarios based on actual transaction datasets.
The regression testing phase was a specific area of focus for this project. Our goal was to provide an automated, self-discovering regression and integration testing solution for a group of services that had both tight and loose coupling in their interactions. We achieved this through the development of a portable platform and software solution designed to intercept and capture comprehensive information about customers’ application use and the resultant internal transactions needed to complete requests. This information yields valuable insights which inform sophisticated tools for rapid development, troubleshooting, and validation during testing.
Reactive Troubleshooting
Proactive testing is vital to reduce customer impact and catch defects before they reach end users. However, as anyone in software engineering will recognise, some defects still slip through to production and reach customers. In these instances, the TDT tool also provides self-service observability and debugging features for reactive troubleshooting.
Using the tool, developers can access, analyse, and accurately simulate real-world scenarios on demand in a safe, controlled environment. The platform provides remote debugging facilities so developers can use the tools they’re already comfortable with to debug software running in TDT environments. Combined, these capabilities enable powerful troubleshooting and iterative development workflows, allowing developers to reproduce, isolate, and resolve issues quickly and confidently. This happens without the expense, risk, and inefficiency associated with excessive logging, guesswork, and other existing troubleshooting techniques.
Under the Hood
The capture and replay mechanism at the heart of the TDT platform plays a critical role. Customer-facing digital self-service applications are powered by hundreds of microservice transactions spanning a large collection of interconnected software components. The capture apparatus unobtrusively collects data flowing between these components during transactions. It is then aggregated, catalogued, and stored in a secure database. The replay element involves automated self-service tools reproducing realistic scenarios using the data in a controlled, observable, production replica environment. Captured transactions can be replayed on-demand against past, present, and upcoming application revisions.
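As a rough illustration of the capture, catalogue, and replay flow described above, the following Python sketch wraps a service handler so that each exchange is recorded, then replays the stored requests against different revisions. Everything here is an assumption made for the example: the `CaptureStore` class, the `capturing` wrapper, the `billing` service, and the in-memory SQLite database standing in for the secure datastore. The real platform collects traffic between components rather than wrapping functions.

```python
import json
import sqlite3
from typing import Callable


class CaptureStore:
    """Hypothetical catalogue of captured exchanges, keyed by service name."""

    def __init__(self) -> None:
        # In-memory SQLite is a stand-in for the platform's secure database.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE captures ("
            "id INTEGER PRIMARY KEY, service TEXT, request TEXT, response TEXT)"
        )

    def record(self, service: str, request: dict, response: dict) -> None:
        self.db.execute(
            "INSERT INTO captures (service, request, response) VALUES (?, ?, ?)",
            (service, json.dumps(request), json.dumps(response)),
        )

    def for_service(self, service: str) -> list:
        rows = self.db.execute(
            "SELECT request, response FROM captures WHERE service = ? ORDER BY id",
            (service,),
        )
        return [(json.loads(req), json.loads(resp)) for req, resp in rows]


def capturing(service: str, handler: Callable[[dict], dict],
              store: CaptureStore) -> Callable[[dict], dict]:
    """Wrap `handler` so each exchange is recorded without changing its result."""
    def wrapper(request: dict) -> dict:
        response = handler(request)
        store.record(service, request, response)
        return response
    return wrapper


def replay(store: CaptureStore, service: str,
           revision: Callable[[dict], dict]) -> list:
    """Return the captured requests whose replayed response differs from the capture."""
    return [req for req, resp in store.for_service(service) if revision(req) != resp]


def billing_v1(request: dict) -> dict:
    return {"total": request["units"] * 2}


def billing_v2(request: dict) -> dict:
    # Upcoming revision with an unintended change for larger orders.
    discount = 1 if request["units"] > 10 else 0
    return {"total": request["units"] * 2 - discount}


store = CaptureStore()
live = capturing("billing", billing_v1, store)
for units in (3, 12):
    live({"units": units})

print(replay(store, "billing", billing_v1))  # current revision reproduces the captures
print(replay(store, "billing", billing_v2))  # upcoming revision diverges on one request
```

Because the captured requests are stored independently of any one build, the same set can be replayed against past, present, or upcoming revisions simply by passing a different handler.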
Specific test cases can be reproduced reliably, repeatedly, and efficiently. Accounts don’t need to be readjusted and there’s no need to wait for lengthy account and/or test data setup or provisioning requests. The solution also enables rapid iteration with a tighter feedback loop after a feature is ‘code complete’, so developers can deploy new builds to replay environments to get early feedback on in-flight changes. All this is achieved while meeting the high data security and compliance standards that telco operators must adhere to.
Outcome: Faster, Cheaper, Better Resolution of Customer Issues
Traditional approaches to scenario-based testing can be expensive, tedious, and unreliable. However, developers who have used the new TDT solution say it allows them to troubleshoot and pinpoint issues much faster during application testing. They also report that production investigations are quicker and more effective. In short, developers can release high-quality updates and fixes in a matter of hours, rather than days, weeks, or months.
Alpha phase rollout saw one scenario involving large-scale use of captured data to load test a microservice using more than 100,000 replays. During regression testing, the approach identified an autoscaling misconfiguration, which was resolved and then regression tested again prior to production. Previously, a misconfiguration like this would likely have resulted in a production outage, but the issue was addressed and the microservice’s startup latency improved by 87%. With traditional scenario-based testing, this simulation would have been a lengthy process involving the creation of mock requests for multiple services. Using the capture and replay mechanism, the test was crafted in less than one hour, using a simulation of the exact production load.
Additional successes during the alpha phase include a 67% improvement in triaging device data discrepancy and the triage of a remote cache issue which resulted in a 91% faster mean time to recover (MTTR). Another scenario saw a production issue with an upstream service resolved in two hours, whereas diagnosis alone would previously have been subject to a two-week release cycle.
Taking an agile, incremental approach to this modernisation exercise enabled Sourced to demonstrate tangible value to business leaders ahead of wider rollout. It has also been beneficial from a change management perspective, with word about the TDT platform’s benefits spreading organically between in-house teams. One developer involved in the pilot recommended the platform’s root cause identification functionality to customer support colleagues who were struggling to resolve a longstanding issue with a customer’s router. Once the team was onboarded, the issue – which had been ongoing for two months – was resolved in just ten minutes.
These are small wins in the context of a telco enterprise. Nevertheless, each represents a step towards heightened operational maturity in the cloud. Successes to date have energised the teams involved in the pilot. They also indicate the scale of time and cost savings that could be realised with further rollout across the enterprise. What’s more, the gains are firmly aligned with the telco’s cloud-native vision. Empowering developers to release high-quality features and bug fixes faster improves customer satisfaction while contributing to reductions in call centre use.
The next stage is to roll the platform out to more application teams ahead of general availability. Recently, one such team used the platform to solve an issue it had been debugging for three months. During that time, the team had added extra logging and pushed code changes that it expected to fix the issue, without success. Within ten minutes of using the TDT tool, the team identified the issue. Cases like this continue to demonstrate business value and secure executive support.