I went along to a meeting of the BCS Kingston & Croydon branch last Tuesday, at which a group of people from BT's Design group, who specialise in Systems and Application Monitoring and Management tools, revealed some astonishing achievements in a very low-key way, as if they had no idea how important they were.
These people have distilled decades of experience of managing increasingly complex distributed systems, with few staff and fewer tools, into a powerful yet spare vocabulary (or ontology, to use a fancy term) that efficiently describes the universe of discourse. It includes such concepts as server, virtual machine, date, time, business process, transaction, event-type and (very important) end-to-end correlation key, which precisely locates a reported event in a specific application component. All this, logically enough, is aligned with the ITIL standard for service delivery.
Not only that: they've defined binary, textual and graphical representations of log entries or event notifications that capture all this information, and a set of libraries that implement all of this behind a very simple API (I understand that a Java implementation is available, but there may be support for other languages too). Not least, there is a defined process for integrating an application into the service monitoring and management framework.
Most applications already generate copious log information, and most commercial monitoring tools work by scanning the log files for interesting events. You have to configure patterns that allow the monitoring software to recognise different events. As a result, all large-scale monitoring infrastructures are permanently out of date with respect to the log formats and events generated by the applications, which are continually evolving. Moreover, the sheer volumes of log information generated mean that monitoring products that take this approach tend to be overwhelmed by the deluge of data and can find it difficult to react in a timely manner to real problem situations when they arise.
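To make the maintenance problem concrete, here is a minimal sketch of a pattern-based scanner of the kind described above. It is not taken from any particular monitoring product: the class name `LogScanner`, the event name `ORDER_FAILED` and the log messages are all invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical pattern-based log scanner; all names and messages are invented. */
final class LogScanner {
    // Maps an event name to the pattern that is supposed to recognise it.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    void addRule(String eventName, String regex) {
        rules.put(eventName, Pattern.compile(regex));
    }

    /** Returns the name of the first rule whose pattern occurs in the line, if any. */
    Optional<String> classify(String logLine) {
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (rule.getValue().matcher(logLine).find()) {
                return Optional.of(rule.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        LogScanner scanner = new LogScanner();
        scanner.addRule("ORDER_FAILED", "ERROR order \\d+ failed");

        // The message format the rule was written for is recognised.
        System.out.println(scanner.classify("10:15:02 ERROR order 4711 failed"));

        // A later release rewords the message and the rule silently stops matching.
        System.out.println(scanner.classify("10:15:02 ERROR could not complete order 4711"));
    }
}
```

The second call is the point of the example: nothing breaks visibly when the application changes its wording; the event simply stops being recognised until somebody updates the pattern.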
BT's BPTM takes a different approach: the application is "instrumented" by wrapping its existing calls to the system logging facility. At that point in the code it is much easier to identify the meaning of the logged information in terms of the underlying data model and to add any missing properties (such as system identifier, timestamp and e2e correlation key). As a result, team boss Ian Johnston claims that an average application can be instrumented in one day (preceded by a one-day workshop to identify the requirements of managing that application, and followed by another day to roll out and test the instrumented version of the code).
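I didn't get to see the actual BPTM API, so the following is only a sketch of what wrapping a logging call might look like, using the standard `java.util.logging` package. The class `InstrumentedLogger`, its `Event` type, the field names and the text layout are all my own assumptions, not BT's.

```java
import java.time.Instant;
import java.util.logging.Logger;

/** Hypothetical wrapper around an existing logger; not the real BPTM API. */
final class InstrumentedLogger {

    /** Structured event carrying the properties mentioned in the talk. */
    static final class Event {
        final String systemId;
        final Instant timestamp;
        final String correlationKey;
        final String eventType;
        final String message;

        Event(String systemId, Instant timestamp, String correlationKey,
              String eventType, String message) {
            this.systemId = systemId;
            this.timestamp = timestamp;
            this.correlationKey = correlationKey;
            this.eventType = eventType;
            this.message = message;
        }

        /** Invented textual form; the real BPTM representations were not shown. */
        String toText() {
            return String.format("%s system=%s key=%s type=%s msg=%s",
                    timestamp, systemId, correlationKey, eventType, message);
        }
    }

    private final Logger delegate;
    private final String systemId;

    InstrumentedLogger(Logger delegate, String systemId) {
        this.delegate = delegate;
        this.systemId = systemId;
    }

    /** Adds the system identifier and timestamp, then passes the entry to the existing logger. */
    Event log(String correlationKey, String eventType, String message) {
        Event event = new Event(systemId, Instant.now(), correlationKey, eventType, message);
        delegate.info(event.toText());
        return event;
    }

    public static void main(String[] args) {
        InstrumentedLogger log =
                new InstrumentedLogger(Logger.getLogger("orders"), "billing-01");
        // The caller supplies what only it knows: the correlation key and the event type.
        log.log("txn-42", "ORDER_RECEIVED", "order accepted");
    }
}
```

The appeal of doing it this way is that the code calling the logger already knows what the event means, so nothing has to be reverse-engineered from message text later.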
The BPTM library takes a "reporting by exception" approach to cut down on the amount of communication required. For example, events that are expected and that duly occur are merely logged locally by the application. This measure alone reduces the management data traffic by a factor of 20 on average. Then there are event correlation rules that can recognise typical failure scenarios and offer scripted diagnostic and remediation advice to support staff, many of whom are offshore.
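As I understood it, the filtering part of "reporting by exception" boils down to something like the sketch below. This is my own illustration under simplifying assumptions: the class `ExceptionReporter` and the event types are hypothetical, "expected" is reduced to a fixed set of event types, and the correlation rules are left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical "reporting by exception" filter; not the real BPTM library. */
final class ExceptionReporter {
    private final Set<String> expectedTypes;
    private final List<String> localLog = new ArrayList<>();   // stays on the application host
    private final List<String> forwarded = new ArrayList<>();  // stands in for the central monitor

    ExceptionReporter(Set<String> expectedTypes) {
        this.expectedTypes = expectedTypes;
    }

    /** Logs every event locally and forwards only those that were not expected. */
    void report(String eventType, String detail) {
        String entry = eventType + ": " + detail;
        localLog.add(entry);
        if (!expectedTypes.contains(eventType)) {
            forwarded.add(entry);
        }
    }

    List<String> localLog() { return localLog; }

    List<String> forwarded() { return forwarded; }

    public static void main(String[] args) {
        ExceptionReporter reporter = new ExceptionReporter(
                new HashSet<>(Arrays.asList("ORDER_RECEIVED", "ORDER_COMPLETED")));
        reporter.report("ORDER_RECEIVED", "txn-42");
        reporter.report("ORDER_COMPLETED", "txn-42");
        reporter.report("BACKEND_TIMEOUT", "txn-43");
        System.out.println("logged locally: " + reporter.localLog().size());
        System.out.println("sent to monitor: " + reporter.forwarded().size());
    }
}
```

In this toy run only one of the three events leaves the host. The 20:1 reduction quoted in the talk presumably reflects how heavily routine events outnumber exceptional ones in real traffic.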
By using this combination of approaches, the design group has already equipped over 80 separate applications in the "BT Matrix" or Service Oriented Architecture to be centrally monitored and managed. Newly instrumented applications are auto-discovered by the BPTM infrastructure - they simply hook themselves into the reporting network and pop up on the monitoring console (which is of course a rich Internet application).
Operators are alerted to emergency situations, such as service bottlenecks, via a variety of mechanisms. The primary user interface is a mimic diagram, which shows the flow of messages that make up an end-to-end business transaction through a series of components. The user can drill down to see both more detail and historical trend information, so that, for example, new server capacity can be brought on-stream before a bottleneck becomes critical.
It's obviously in BT's interest to publicise the BPTM standard so that more suppliers will start using it and building it into their products from the outset. But I don't think that Ian and his team are going about this in the right way yet. To build up momentum, it is not enough to give occasional talks to BCS branches, where you reach at most 20 interested individuals at a time. You need to convince the solution architects and other decision makers that this is the right way to go. The first thing to do is to publish the standard and, simultaneously or not long afterwards, make the libraries that implement it open source. This should create a community of interest across the industry. After all, large service-oriented architectures are becoming increasingly common in all market sectors, not just in telecoms, so the management headache is shared by all projects. Then some judiciously targeted white papers and articles should appear in the appropriate journals, and the trade press needs to be made aware.
If publicised in the right way, I can't see how this technology can fail to make waves.