Thursday 13 February 2014

The future of Proactive Monitoring


PROACTIVE MONITORING


This article tries to explain, as briefly as possible, the history, current position and potential future of real-time proactive monitoring of IT infrastructures, from the perspective of someone (Con Blackett) who has lived and breathed the subject for the last 24 years.

TERMS & PHRASES I LIKE TO USE


Whenever anyone asks me what I have been doing for the last 24 years of my career, I try to explain that I have been responsible for implementing and supporting "proactive monitoring", "systems monitoring", "network monitoring" and "trouble ticketing" solutions to maintain large computer infrastructures (Networks, Systems and Services). I tend to avoid the newer ITIL terms like "Service Assurance", "IT Operations" and "Service Management" unless the people who ask me the question are fully conversant with ITIL.

The people who worked for me 5 years ago would probably find this aversion to ITIL terms strange, as back then I reorganized all my departments (90 people) into the ITIL disciplines and spent most of my training budget putting them through ITIL courses and certification. The best explanation I can give you for this discrepancy is an analogy: 99.9% of people in the UK still refer to Miles Per Gallon (MPG) rather than Miles per Liter, despite the fact that you have not been able to buy fuel in Gallons in the UK for over 20 years.

I have also found that people who work outside, or on the periphery of, "IT Operations" tend to find the ITIL terms less than intuitive. Worse still, 10 people who believe they understand ITIL will give you 10 different definitions of a term like "Service Assurance". So for the most part I will stick with the older terminology in this article.

THE BALANCE BETWEEN REACTIVE AND PROACTIVE MONITORING


One thing I have found to be universally true in this area over these 24 years is the ever changing balance between reliance on "reactive monitoring" and reliance on "proactive monitoring" of computer infrastructures. Let me explain!

HISTORY

IN THE BEGINNING


Back in 1990, when I first started implementing solutions to monitor computer infrastructures (back then VMS systems and DECnet networks), the practical balance of monitoring tools was almost exclusively in favour of reactive monitoring. Basically, customers reported problems by phone to Service Centres, who then entered the details into "trouble ticketing" systems. These "tickets" were then progressed by technical teams until eventually the problem was resolved.

Back then there was a desire to supplement reactive monitoring with the emerging proactive monitoring tools, mainly because customers were unhappy with the idea that they were the ones having to report failures in the computer infrastructure. They understandably argued that it would be far better if any failures were fixed before they (i.e. the services they were using) were affected.

The other driver for proactive monitoring was the realization that if a failure was reported by a customer, a significant amount of time and effort would be needed to isolate the failed component. If instead the proactive monitoring tool reported on the failed component directly, it should be far quicker and easier (and so cheaper) to isolate and fix that component.

These early proactive monitoring tools, such as SunNet Manager, IBM's NetView, HP's OpenView, TimeView etc, worked reasonably well for small networks and small systems.

HP OpenView

In 1991 I moved from the UK out to Atlanta, GA, and part of my job there was to monitor and maintain some small but very high profile global networks and systems. I discovered I could implement and, more importantly, maintain these early tools cost effectively.

Simply put, I could prove that the cost of implementing and maintaining these tools was far less than the value they provided back to the business.

I was so successful over the 3 years with this small infrastructure (100 Cisco and Unix boxes) that when I came back to the UK some bright spark put me in charge of monitoring BT's internal corporate IP network, consisting of thousands of Cisco routers running all of BT's computing support services (billing etc). That's when my headache really started.

THE BREAKTHROUGH (1994)


After quickly realising that the existing proactive monitoring tools I knew and loved would not scale without an army of people to support them, I looked for alternatives. By this point we (I had a small team of 3 very good people) had figured out that the existing large scale solutions like MaxM would not scale efficiently to our size of infrastructure. That's when I had a piece of luck: I bumped into a guy called Phil Tee who had two radical ideas, namely de-duplication (many repeat events automatically summarised into one event) and stateless monitoring (no resource-expensive model required, unlike HP OpenView), resulting in a system called Omnibus.
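To make the two ideas concrete, here is a minimal sketch in Python. It is purely my illustration and not how Omnibus was actually implemented: the class names, the choice of (node, summary) as the de-duplication key and the field names are all assumptions. Repeat events update a single row and increment a tally, and no model of the network or its topology is held anywhere.

```python
from dataclasses import dataclass


@dataclass
class EventRow:
    """One row in the de-duplicated event list (hypothetical layout)."""
    node: str
    summary: str
    severity: int
    first_seen: float
    last_seen: float
    tally: int = 1  # number of raw events this single row represents


class EventList:
    """Holds only a flat table of events; there is no topology or
    dependency model to build or maintain (the "stateless" idea)."""

    def __init__(self):
        self.rows = {}

    def insert(self, node, summary, severity, timestamp):
        # Assumption: two events are "the same" if node and summary match.
        key = (node, summary)
        row = self.rows.get(key)
        if row is None:
            # First occurrence: create a new row.
            row = EventRow(node, summary, severity, timestamp, timestamp)
            self.rows[key] = row
        else:
            # Repeat occurrence: update the existing row instead of adding one.
            row.tally += 1
            row.last_seen = timestamp
            row.severity = max(row.severity, severity)
        return row


if __name__ == "__main__":
    events = EventList()
    # A flapping link raises the same event 1000 times.
    for t in range(1000):
        events.insert("router-17", "Link down on Serial0", 5, float(t))
    events.insert("unix-42", "Disk /var 95% full", 3, 1000.0)
    for row in events.rows.values():
        print(row.node, "|", row.summary, "| tally:", row.tally)
```

In this sketch the thousand repeats from the flapping link collapse into a single row with a tally of 1000, so an operator sees two rows rather than 1001.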

Omnibus/Netcool Event List
Less than 5 minutes into Phil Tee's demo of Omnibus I had a Eureka moment (I can still remember it clearly 20 years later). I knew this was what we desperately needed. Luckily, despite the fact that Micromuse was a completely unknown vendor with no track record or reference sites, I managed to scrape together enough money to get what I believed to be the first, and certainly the largest, implementation of Omnibus live by September 1994. Thankfully, after a few initial teething problems, the payback period for this outlay was measured in weeks, not months or years!


THE GLORY DAYS (1994-1999)


These simple yet radical concepts enabled me and my small team to utilise Netcool and go from strength to strength, making proactive monitoring of large complex infrastructures an affordable reality. We started with IP networks, then expanded to replace 12 different flavors of element managers and slowly moved on to take over Systems Monitoring (thousands of large Unix boxes and up), Application Monitoring and finally Business Service Monitoring.

At the time we measured success by the reduction in reported trouble tickets that related to the infrastructure failures of the devices we monitored. Thankfully this measurement was performed independently by number crunching data from our trouble ticketing system based on Clarify. To be fair we were lucky in that the Service Desks had been categorizing resolved failures by different device types for years, making the task relatively painless.

Because of our incremental and provable success I was able to cherry pick the best people to join my little team. This was a fantastic time for me and my enthusiastic team. Happy days!

During the early part of this period we had direct access to the Micromuse development team (often on site) and they would quickly improve the product to meet our ever expanding needs. Sometimes improvements would be delivered in days.

Later I became the founding member of the Micromuse Customer Executive Board, which proved very useful for sharing problems and ideas with like-minded peers in other companies across the globe. In return for this I delivered key Netcool User presentations in both Europe and the US. All this resulted in the balance of reactive versus proactive monitoring swinging firmly in favour of proactive. This pleased our customers immensely at the time. Life was good!

I was promoted to "Head of Service Management Tools" in BT, which covered far more areas ("Software Distribution", "Trouble Ticketing" Clarify Development, Application Monitoring, Performance Monitoring etc) and far more people. Unfortunately, during this period I was not able to dedicate quite enough of my time to my first love, "Proactive Monitoring". Luckily I had a lot of very good people who kept the momentum going, my other areas were interesting with lots of synergy, and the pay was good :-)


IMPROVEMENTS BUT LITTLE INNOVATION (2000 to Present)


After years of going from strength to strength, the world of computer infrastructure started to change radically rather than by slow evolution. With the introduction of things like Virtualisation, B2B gateways (into other companies) etc, we were seeing a massive increase in challenges like complexity, constant dynamic change in the infrastructure and lack of visibility into the infrastructures of our partners, who provided key elements of our customers' applications.

The result of all these revolutionary changes in our infrastructure was that Netcool itself started becoming less effective, and hence we started to see a swing back to reactive management at the expense of proactive.

To give you a flavor, one of our Data Centre managers said to me that Netcool is still OK, but it's not good when customers visit and see a "SEA of RED" on the big screens. He went on to tell me "I just don't have enough people to make sense of what the systems are telling me". Proactive Management was losing ground for large complex infrastructures, almost going back to the pre-1994 position.

We realized that the Netcool approach was insufficient for this new world, so we increased our efforts looking for replacements. I can't begin to list all the vendors we evaluated over a 12 year period, some of which have since gone bust. We tried "Application Discovery", "Transaction Monitoring", "Root Cause Analysis", BSM, literally every new idea anyone could show us. Although we eventually had limited success with in-house BSM and our own application monitoring standards, not one tool had that Eureka idea to get proactive monitoring firmly back in the game.

To be fair, there were a few good ideas, but nothing game changing.

 

THE FUTURE


I now believe I know a lot about what does not work, such as "Application Discovery" and "BSM" based on CMDBs, and I have a few high level thoughts about what I believe this industry needs.

I believe the next "Big Thing" in proactive monitoring needs to have the following characteristics:

Be able to derive the root cause of infrastructure failures in real time without the need for complex models, manual rules, filters, discovery or historical pattern matching.

Specifically, no reliance on accurate CMDBs or complex sets of manually maintained rules or relationships. The computer infrastructure world is just too dynamic, complex, ever changing and chaotic to work with any of these offerings. These ideas often work at small scale but quickly fall apart when you try to implement them in large, dynamically changing, complex infrastructures.

Going back to absolute basics, the New Tool will need to "provide more value to the business than the effort needed to maintain it". Hence my rant about rules, BSM models and accurate CMDBs.

If you ever find the Next Big Thing that's not smoke and mirrors, PLEASE let me know.

UPDATE: 27th February 2014

One of the comments I have received is from Mike Silvey, one of the co-founders of Micromuse. He has his own personal views published on his blog at http://moogsoft.com/service-management-itil-itsm-oss-etom-root-cause-event-management-service-desk-industry-comment/ . At first glance this appears to make sense and their offering looks to have potential.

Given his pedigree I will take him up on his offer to look into this in more detail. If I discover this is more than "smoke and mirrors" I will provide an update later.

3 comments:

Anonymous said...

I totally agree with you about the effort required to maintain rules, BSM models and accurate CMDBs outweighing the benefit they provide in the event of an unexpected problem occurring in a complex system. For 9 years I worked in the development team at RiverSoft, then Micromuse and finally IBM Tivoli, so I've seen the problems with trying to diagnose faults using complicated models from the other side of the fence to you. In my experience, there are always a small number of overworked experts who have enough of an idea how things really fit together that they can solve the difficult problems, and when something bad happens they get called in to sort it out with little help from monitoring tools. A more realistic goal than trying to determine root cause based on some perfect model of how everything fits together is simply to know at any time what is happening that's unusual. If your environment is working most of the time, a problem will be caused by something unusual happening. The company I work for now, Prelert, has a product called Anomaly Detective that can tell you in real-time what the anomalies are in your environment. Then your overworked experts just need to look at the Prelert dashboard rather than trawling through raw data to find likely causes. Anomaly Detective sits on top of Splunk, so you'll need to centralise monitoring data from your systems into Splunk, but if you're still relying on Netcool that's probably a good thing to do anyway - de-duplication destroys a lot of the information needed to accurately detect anomalies, and hardware is a lot more powerful now than it was in 1994! If you use or evaluate Splunk then try out Prelert Anomaly Detective too. It's free to download for a 30 day trial and certainly falls into the category of a tool that can "provide more value to the business than the effort needed to maintain it".

Unknown said...

Thanks for the comment. Although I am aware of Prelert I have not looked at it in detail. One of my old teams did use Splunk about 5 years ago in the security area for historical analysis. At the time Splunk was thought to be more of a reactive tool, suited to analysis and drilling down to find problems, rather than a real time proactive tool suitable for monitoring. However it was a while ago so I will try to get some more recent feedback :-) Anyone else have any thoughts/experience with this approach?

Unknown said...

Con

You are spot on here. Building upon your theme, as one of the co-founders of Micromuse (and then RiverSoft with RCA), I've posted my [possibly highly personal] view on our site:

http://moogsoft.com/service-management-itil-itsm-oss-etom-root-cause-event-management-service-desk-industry-comment/

I look forward to some robust discussions with you :)

Glad to see you sharing your wisdom on this topic.