On Reliability of Software in General and Contact Center Software in Particular
by Vladimir K. Dudchenko
We often complain about software bugs, which can be deeply frustrating and cause significant financial losses. The most common conclusion we draw is that the developer (supplier, integrator) does not care about their reputation and has not debugged the software properly.
Let’s examine how true this statement is, what we can expect from new developments or releases in terms of reliability, and how we should treat failures, malfunctions, and losses arising from them.
Although almost all current hardware systems are based on programmable elements (from processor microprograms and microcontrollers to quite large software complexes such as BIOS), in what follows I will only talk about the layer of application software, not touching upon issues of system reliability and hardware fault tolerance.
Unfortunately, the causes of application software failures are often difficult to separate from OS or hypervisor bugs, or from the faults in any of the software entities that form the foundation of application systems, including various frameworks, libraries, etc.
But let us move on to the main point: the reliability of the application software itself.
Mistakes and failures are everywhere. Elon Musk’s Starship has had several failed (or at least less than perfect) launches, and SpaceX did not immediately learn how to land its booster stages safely, either. On the other hand, systems as complex as the Space Shuttle and the Russian Buran ran smoothly on their very first flights. From this we can conclude that in some cases an exceptionally high level of software reliability is achievable (and in the examples mentioned, software plays the key role, of course), while in others mistakes can be tolerated.
There have been plenty of examples of human error in our practice of installing and configuring contact center software. Once, an IVR branch was inadvertently left with an incorrect ending, so calls taking that branch were handled incorrectly. Another time, the logic of holiday and weekend shift management was overlooked, and the call center ran on its weekday schedule over the weekend; and so on. Development and debugging processes can be adjusted to reduce the number of such errors, but it is very hard to eliminate the human factor completely.
Not just suppliers or developers, but also customers should have a proper understanding of the causes and logic of failures, as well as the level of preparedness for them. So: Are all software failures the same in nature, or can they be categorized in order to effectively allocate prevention efforts?
One approach is to look at the problem in terms of scale: the more instances of a system are deployed, the more attention needs to be paid to debugging. While I generally agree, I would not overestimate this approach: the parameter only matters on an ‘other things being equal’ basis, and those ‘other things’ may have an impact that is greater by an order of magnitude. Space systems, for example, exist in extremely small numbers; so what? Does that mean they need no debugging? That would be too expensive…
That, of course, is the crucial issue.
Back in the eighties, I was involved in developing software for one of Buran’s critical systems (as a programmer, and also as an analyst, an architect, a project manager, etc.). I also participated in some other projects of a similar class. Later, I was involved in commercial projects for more than a quarter of a century. After 50 years in the industry, I have plenty of information to analyze and compare: different approaches and methods, as well as different results.
Let’s start with space systems. Their main distinguishing feature is that the cost of error tends to be almost infinite. Consequently, the cost of error avoidance measures is also virtually unlimited. Here, the question was not in the economic side of the project, but in building a multilevel and multilayer debugging system so that any potential error and inaccuracy could be correctly diagnosed and eliminated. That meant controlling all layers of the architecture.
That’s easy to say; but what about operating systems and libraries? (The term “framework” was not used back then, at least not in Russia.) The solution was simple: they were all developed from scratch, strictly for that one task. And what about the dynamic behavior of systems subject to statistical-level external influences, whose behavior therefore cannot be properly reproduced (a necessary condition for effective debugging)? Separate methods existed for this: debugging under reproducible system behavior with modeled external influences was combined with recording the effects of real influences in actual operation.
Everything looks logical, but I think experts can imagine that diagnostic and debugging systems of this level exceeded the cost of the software development itself many times over, that is, by more than an order of magnitude (and this doesn’t only apply to the software — hardware debugging equipment was involved as well).
From this, we can draw the first conclusion: in “ordinary” life such approaches are inapplicable; you would never be able to release a product.
Let’s take another look at Elon Musk. His space systems also sometimes fail (crash, explode). Has he not learned how to use debugging and diagnostic systems? Hardly. I assume that he balances the cost of failures against the cost of debugging. In other words: once the cost of debugging approaches the cost of the potential losses from a failure, there is no sense in investing further in preliminary debugging systems and procedures; it becomes cheaper to try the system in a real test (this is what the debugging that is “many times” more expensive, as mentioned above, comes down to in practice). And, by the way, the cost of the time spent on debugging is also worth considering, including the cost of missing contract deadlines, the losses from being overtaken by competitors, etc.
This is all statistical speculation, of course; Mr. Musk’s balance is probably based on intuition. But the principle still applies.
Let’s get back down to earth, to our call center software.
You have to distinguish between three types of components, which differ in how strongly they influence the reliability of the system as a whole.
The first and most critical component type in terms of its impact on reliability is the core. This is what handles call control, the telephone provider connection, and the exchange (of audio, data, sometimes video) with agent workstations.
Here we are dealing with a large number of system instances (the core is used, unchanged, in every installation) and with a direct impact on the availability of the system built on this software, that is, the system providing the actual call center functions.
The requirements are as high as they can get, perhaps as high as for actual space systems (well, not quite — but not too far off, either). The cost of failure can be high, not only because downtime of any real installation is expensive, but also because the scaling increases this cost in proportion to the number of instances in use. Another multiplier of the cost of failure is related to the fact that it is difficult to build any workarounds for the core, and only the authors of the system can find and fix the problem.
The second component type comprises various service functions: the agent interface, standard IVR blocks, all kinds of integration APIs, etc. There are a lot of such functions, and not all of them are used in every particular project, so the number of instances of each function is generally much lower than that of the core. Also, in each particular case, it is usually possible to compensate for failures of such components by implementing temporary functions and other workarounds, giving the developer time to eliminate the defect without incurring high losses.
Reliability requirements for this type of function are generally lower, as is the range of debugging options (such components vary widely, and the high development speed does not allow for sufficiently deep debugging).
Here are some ways of protecting against errors in these components:
a) First of all, it is important to test the system as configured to meet customer requirements, under conditions as close to real as possible (during the roll-out phase, i.e. in preparation for operation). This means that all call processing routes must be tested. If there are too many combinations of the parameters the agents will be working with, you must at the very least select the most likely ones for testing (keeping in mind that any untested combination is a potential failure). The same considerations apply to other features.
b) Organize a pilot operation period aimed at identifying errors. It is during this period that the most likely failures will occur. Therefore, the pilot operation should cover all possible modes (all inbound and outbound projects, types of integration, etc.). This is essentially a complex debugging of the software in a real-world environment. Ignoring this approach increases the risk of financial and reputational losses later on.
c) The roll-out team (the system integrator or the vendor’s Professional Services team) should be on high alert during trial operation, and make decisions on emergency measures together with the customer when errors or inaccuracies are detected.
d) The vendor has to be ready to make corrections on short notice. They should realize that the probability of errors in components of the second type is significant, and design a procedure for correcting failures within a very limited timeframe.
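When full enumeration of parameter combinations is infeasible, as step (a) above notes, a common compromise is pairwise (all-pairs) coverage: select a small subset of test cases such that every pair of parameter values still appears together at least once. The sketch below is illustrative only; the parameter names and values are hypothetical, not taken from any real contact center system.

```python
from itertools import combinations, product

# Hypothetical call-routing parameters (illustrative names and values).
params = {
    "language":  ["en", "fr", "de"],
    "caller_id": ["known", "unknown"],
    "time":      ["business_hours", "after_hours", "holiday"],
    "queue":     ["sales", "support"],
}

def pairwise_suite(params):
    """Greedy all-pairs selection: pick test cases from the full Cartesian
    product until every pair of parameter values is covered at least once."""
    names = list(params)
    # Every value pair that must appear in some selected test case.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in params[a] for vb in params[b]
    }
    all_cases = [dict(zip(names, vals)) for vals in product(*params.values())]
    suite = []
    while uncovered:
        # Choose the case covering the most still-uncovered pairs.
        best = max(
            all_cases,
            key=lambda case: sum(
                ((a, case[a]), (b, case[b])) in uncovered
                for a, b in combinations(names, 2)
            ),
        )
        suite.append(best)
        uncovered -= {
            ((a, best[a]), (b, best[b])) for a, b in combinations(names, 2)
        }
    return suite

suite = pairwise_suite(params)
total = len(list(product(*params.values())))          # 36 full combinations
print(f"{len(suite)} test cases instead of {total}")  # far fewer than 36
```

The greedy heuristic does not produce a provably minimal suite, but it captures the trade-off from step (a): a much smaller set of test cases while keeping in mind which interactions remain exercised.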
The third component type is the most error prone. It is the customization of the system to customer requirements (IVR, agent screens, call routes, reports, custom scripts, etc.). Of course, there is no excuse for inattentiveness on the part of the roll-out team or for a lack of debugging, but the influence of the notorious “human factor” is much higher here, since the debugging options at the level of system configuration are often very limited.
Recall the cost of debugging tools used in the space industry. Of course, the situation for this part of call-center projects is incomparable, primarily because in this case the budget does not usually provide for special debugging means (such as simulation of all types of calls, automatic testing of different operation modes, or simulation of all types of real situations). Thus, in the end, debugging consists of testing based on individual examples, with manual settings, manual calls (which might not cover the entire dial plan, for example), incomplete testing of the IVR condition combinations, integration parameters and other conditions. It is possible to use a second pair of eyes for both code and settings tests, but this significantly increases the budget and isn’t always effective.
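One inexpensive automation step is exhaustively walking the configured IVR tree and checking that every branch terminates in a valid action, which would catch the “incorrect ending” error described earlier. The sketch below assumes a simplified tree representation; the menu structure, key mappings, and action names are all hypothetical.

```python
# Hypothetical IVR menu: each node maps DTMF keys to sub-menus,
# and each leaf is a terminal action string.
ivr = {
    "1": {"1": "queue:sales", "2": "queue:support"},
    "2": {"1": "voicemail", "2": {"1": "callback", "2": "queue:support"}},
    "9": "hangup",
}

# Actions a branch is allowed to end in (illustrative list).
VALID_ENDINGS = {"queue:sales", "queue:support", "voicemail",
                 "callback", "hangup"}

def walk(node, path=()):
    """Enumerate every key sequence through the menu down to its ending."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    else:
        yield path, node

# Flag any branch whose ending is not a recognised terminal action --
# the kind of misconfigured IVR branch mentioned earlier in the article.
for keys, ending in walk(ivr):
    status = "ok" if ending in VALID_ENDINGS else "BAD ENDING"
    print("-".join(keys), "->", ending, status)
```

Even this trivial check enumerates every path through the menu, something manual test calls rarely achieve, and it costs a fraction of the dedicated simulation tooling discussed above.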
So what do we do with this component type? How do we guard against errors? In general, the approaches are the same as for the second type. The good news is that all solutions in this category were created by the implementers themselves, so problems do not need to be escalated to the vendor. The bad news is that the probability of errors is usually higher for type 3 components than for type 2 components.
What conclusions can we draw from all of this?
1. The cost of losses from system errors must be balanced against the cost of testing and debugging. This is a universal rule that applies to space systems as well as call centers, and generally to any type of software solution.
2. The customer’s careful attention to acceptance testing as well as their willingness to allocate time and funds to additional testing can significantly increase system reliability.
3. A willingness to fix mistakes and configuration issues, and to plan holistic debugging (aka pilot operation) properly, is the key to minimizing the damage caused by errors.
4. If the cost of errors is high, as with emergency service systems, the project should include special tooling (both software and hardware) for various types of testing and test automation used in debugging.
There are many other topics, such as hardware failover, platform-level failures (of OSes, hypervisors, container management tools, etc.), and dynamic software failover. Nor have I touched upon the use of monitoring, whether automatic or manual, for notification of pre-critical and critical situations. Backup is a separate area that plays a huge role in ensuring uninterrupted operations. All these topics overlap with, but are distinct from, the considerations in this article; they are, however, just as critical in terms of reducing the cost of failures and the time it takes to fix them.