PROACTIVE MONITORING
This article explains, as briefly as possible, the history, current position and potential future of real-time proactive monitoring of IT infrastructures, from the perspective of someone (Con Blackett) who has lived and breathed the subject for the last 24 years.
TERMS & PHRASES I LIKE TO USE
Whenever anyone asks me what I have been doing for the last 24 years of my career, I explain that I have been responsible for implementing and supporting "proactive monitoring", "systems monitoring", "network monitoring" and "trouble ticketing" solutions to maintain large computer infrastructures (networks, systems and services). I tend to avoid the newer ITIL terms like "Service Assurance", "IT Operations" and "Service Management" unless the people asking are fully conversant with ITIL.
The people who worked for me 5 years ago would probably find this aversion to ITIL terms strange, as back then I reorganized all my departments (90 people) into the ITIL disciplines and spent most of my training budget putting them through ITIL courses and certification. The best explanation I can give for this discrepancy is by analogy: 99.9% of people in the UK still refer to Miles Per Gallon (MPG) rather than Miles per Litre, despite the fact that you have not been able to buy fuel in gallons in the UK for over 20 years.
I have also found that people who work outside, or on the periphery of, "IT Operations" tend to find the ITIL terms less than intuitive. Or, worse still, 10 people who believe they understand ITIL will give you 10 different definitions of a term like "Service Assurance". So for the most part I will stick with the older terminology in this article.
THE BALANCE BETWEEN REACTIVE AND PROACTIVE MONITORING
One thing I have found to be universally true in this area over these 24 years has been the ever-changing balance between reliance on "reactive monitoring" and reliance on "proactive monitoring" of computer infrastructures. Let me explain!
HISTORY
IN THE BEGINNING
Back in 1990, when I first started implementing solutions to monitor computer infrastructures (back then VMS systems and DECnet networks), the practical balance of monitoring tools was almost exclusively in favour of reactive monitoring: basically, customers reporting problems by phone to Service Centres, who then entered the details into "trouble ticketing" systems. These "tickets" were then progressed by technical teams until eventually the problem was resolved.
Back then there was a desire to supplement reactive monitoring with the emerging proactive monitoring tools, mainly because customers were unhappy with the idea that they were the ones having to report failures in the computer infrastructure. They understandably argued that it would be far better if any failures were fixed before they (i.e. the service they were using) were affected.
The other driver for proactive monitoring was the realization that if a failure was reported by a customer, a significant amount of time and effort would be needed to isolate the failed component. If, on the other hand, the proactive monitoring tool reported on the failed component directly, it should be far quicker and easier (cheaper) to isolate and fix it.
These early proactive monitoring tools, such as SunNet Manager, IBM's NetView, HP's OpenView, TimeView etc., worked reasonably well for small networks and small systems.
HP OpenView
In 1991 I moved from the UK out to Atlanta, GA, and part of my job there was to monitor and maintain some small but very high profile global networks and systems. I discovered I could implement, and more importantly maintain, these early tools cost effectively.
Simply put, I could prove that the cost of implementing and maintaining these tools was far less than the value they provided back to the business.
I was so successful over the 3 years with this small infrastructure (100 Cisco and Unix boxes) that when I came back to the UK some bright spark put me in charge of monitoring BT's internal corporate IP network, consisting of thousands of Cisco routers running all of BT's computing support services (billing etc.). That's when my headache really started.
THE BREAKTHROUGH (1994)
After quickly realising that the existing proactive monitoring tools I knew and loved would not scale without an army of people to support them, I looked for alternatives. That's when I had a piece of luck: I bumped into a guy called Phil Tee who had two radical ideas, namely de-duplication (many repeat events automatically summarised into one event) and stateless monitoring (no resource-expensive model required, unlike HP OpenView), resulting in a system called Omnibus. By this point we (I had a small team of 3 very good people) had figured out that all the existing large scale solutions, like MaxM, would not scale efficiently to our size of infrastructure.
Omnibus/Netcool Event List
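The de-duplication idea is simple enough to sketch in a few lines of code, and it explains why an event list like the one above stays readable under load. This is only a minimal illustration, assuming a simplified event shape (node, alert type, severity, timestamp); the field names are mine, not Omnibus's actual schema.

```python
# Minimal sketch of event de-duplication: keep one row per (node, alert type)
# and bump a tally on repeats, instead of storing every raw event.
# Field names are illustrative, not the real Omnibus/Netcool schema.
from dataclasses import dataclass


@dataclass
class DedupedEvent:
    node: str
    alert_type: str
    severity: int
    first_seen: float
    last_seen: float
    tally: int = 1  # how many raw occurrences this single row summarises


class EventList:
    """One row per distinct problem, rather than one row per raw event."""

    def __init__(self):
        self._rows = {}  # (node, alert_type) -> DedupedEvent

    def insert(self, node, alert_type, severity, ts):
        key = (node, alert_type)
        row = self._rows.get(key)
        if row is None:
            # First occurrence: create a new row.
            self._rows[key] = DedupedEvent(node, alert_type, severity, ts, ts)
        else:
            # Repeat occurrence: update the existing row instead of adding another.
            row.tally += 1
            row.last_seen = ts
            row.severity = max(row.severity, severity)

    def rows(self):
        return list(self._rows.values())


# A router flapping three times shows up as one row with tally == 3,
# not three separate rows for an operator to wade through.
el = EventList()
for ts in (0.0, 5.0, 10.0):
    el.insert("router-17", "linkDown", 5, ts)
print(el.rows())
```

The real product did this far more richly, of course, but the principle is the same: the operator sees one row per problem with a count, not a scrolling wall of repeats.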
Less than 5 minutes into Phil Tee's demo of Omnibus I had a Eureka moment (I can still remember it clearly 20 years later). I knew this was what we desperately needed. Luckily, despite the fact that Micromuse was a completely unknown vendor with no track record or reference sites, I managed to scrape together enough money to get what I believed to be the first, and certainly the largest, implementation of Omnibus live by September 1994. Thankfully, after a few initial teething problems, the payback period for this outlay was measured in weeks, not months or years!
THE GLORY DAYS (1994-1999)
These simple yet radical concepts enabled me and my small team to use Netcool to go from strength to strength, making proactive monitoring of large complex infrastructures an affordable reality. We started with IP networks, then expanded to replace 12 different flavours of element manager, and slowly moved on to take over Systems Monitoring (thousands of large Unix boxes and up), Application Monitoring and finally Business Service Monitoring.
At the time we measured success by the reduction in reported trouble tickets relating to infrastructure failures of the devices we monitored. Thankfully this measurement was performed independently, by number crunching data from our trouble ticketing system, which was based on Clarify. To be fair, we were lucky in that the Service Desks had been categorizing resolved failures by device type for years, making the task relatively painless.
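For anyone wanting to reproduce that kind of measurement, the number crunching itself is trivial; the hard part is the years of consistent categorization the Service Desks had already done. The sketch below is illustrative only: the record layout and category names are mine, not Clarify's actual schema.

```python
# Rough sketch of the ticket number-crunching described above: count resolved
# tickets per month for a monitored device category, then compare two months.
# The record layout and category names are illustrative, not Clarify's schema.
from collections import Counter


def tickets_per_month(tickets, category):
    """Resolved-ticket counts for one device category, keyed by month (YYYY-MM)."""
    return Counter(t["closed_month"] for t in tickets if t["device_category"] == category)


def percent_reduction(counts, before_month, after_month):
    """Percentage drop in ticket volume between two months."""
    before, after = counts[before_month], counts[after_month]
    return 100.0 * (before - after) / before if before else 0.0


# Hypothetical example: router tickets before and after proactive monitoring went live.
tickets = [
    {"closed_month": "1994-08", "device_category": "cisco_router"},
    {"closed_month": "1994-08", "device_category": "cisco_router"},
    {"closed_month": "1994-11", "device_category": "cisco_router"},
]
counts = tickets_per_month(tickets, "cisco_router")
print(percent_reduction(counts, "1994-08", "1994-11"))  # 50.0
```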
Because of our incremental and provable
success I was able to cherry pick the best people to join my little team. This was a fantastic time for me and my enthusiastic team. Happy days!
During the early part of this period we
had direct access to the Micromuse development team (often on site)
and they would quickly improve the product to meet our ever expanding
needs. Sometimes improvements would be delivered in days.
Later I became the founding member of the Micromuse Customer Executive Board, which proved very useful for sharing problems and ideas with like-minded peers in other companies across the globe. In return, I delivered key Netcool user presentations in both Europe and the US. All this resulted in the balance of reactive versus proactive monitoring swinging firmly in favour of proactive, which pleased our customers at the time immensely. Life was good!
I was promoted to "Head of Service Management Tools" in BT, covering far more areas, such as "Software Distribution", "Trouble Ticketing", Clarify development, Application Monitoring, Performance Monitoring etc., and far more people. Unfortunately, during this period I was not able to dedicate quite enough of my time to my first love, "Proactive Monitoring". Luckily I had a lot of very good people who kept the momentum going, my other areas were interesting with lots of synergy, and the pay was good :-)
IMPROVEMENTS BUT LITTLE INNOVATION (2000 to Present)
After years of going from strength to strength, the world of computer infrastructure started to change radically rather than by slow evolution. With the introduction of things like virtualisation, B2B gateways (into other companies) etc., we were seeing a massive increase in challenges: complexity, constant dynamic change in the infrastructure, and a lack of visibility into the infrastructures of our partners, who provided key elements of our customers' applications.
The result of all these revolutionary
changes in our infrastructure was that Netcool itself started
becoming less effective, and hence we started to see a swing back to
reactive management at the expense of proactive.
To give you a flavour, one of our Data Centre managers said to me, "Netcool is still OK but it's not good when customers visit and see a 'SEA of RED' on the big screens." He went on to tell me, "I just don't have enough people to make sense of what the systems are telling me." Proactive management was losing ground for large complex infrastructures, almost going back to the pre-1994 position.
We realized that the Netcool approach was insufficient for this new world, so we increased our efforts looking for replacements. I can't begin to list all the evaluations of vendors we tried over a 12 year period, some of which have since gone bust. We tried "Application Discovery", "Transaction Monitoring", "Root Cause Analysis", BSM, literally every new idea anyone could show us. Although we eventually had limited success with in-house BSM and our own application monitoring standards, not one tool had that Eureka idea to get proactive monitoring firmly back in the game.
To be fair, there were a few good ideas, but nothing game changing.
THE FUTURE
I now believe I know a lot about what does not work, such as "Application Discovery" and "BSM" based on CMDBs, and I have a few high level thoughts about what I believe this industry needs.
I believe the next "Big Thing" in
proactive monitoring needs to have the following characteristics:
Be able to derive the root cause of infrastructure failures in real time, without the need for complex models, manual rules, filters, discovery or historical pattern matching.
Specifically, no reliance on accurate CMDBs or complex sets of manually maintained rules or relationships. The computer infrastructure world is just too dynamic, complex, ever changing and chaotic to work with any of these offerings. These ideas often work at small scale but quickly fall apart when you try to implement them in large, dynamically changing, complex infrastructures.
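To make that first requirement a little more concrete, here is a deliberately naive sketch of what "no models, no rules" grouping could look like: cluster events purely by arrival time, with no CMDB, topology or rule base, and offer the earliest event in each burst as a candidate cause for a human to confirm. Everything in it is my own assumption (the 30 second gap, the event fields); it illustrates the direction, not a claim that this solves the problem.

```python
# Naive sketch of model-free event grouping: cluster a time-ordered event stream
# by quiet gaps between arrivals and surface the earliest event in each cluster
# as a *candidate* cause. No CMDB, rules or topology model is consulted.
# The 30-second gap and the event fields are arbitrary assumptions.
from dataclasses import dataclass


@dataclass
class Event:
    ts: float      # arrival time, seconds since epoch
    source: str    # emitting device or application
    summary: str


def cluster_by_time(events, gap=30.0):
    """Split events into clusters wherever there is a quiet gap longer than `gap` seconds."""
    clusters, current = [], []
    for ev in sorted(events, key=lambda e: e.ts):
        if current and ev.ts - current[-1].ts > gap:
            clusters.append(current)
            current = []
        current.append(ev)
    if current:
        clusters.append(current)
    return clusters


def candidate_cause(cluster):
    """The earliest event in a burst is the simplest guess at what triggered it."""
    return min(cluster, key=lambda e: e.ts)


# Example: a core router failure followed seconds later by a flood of downstream alarms.
events = [
    Event(100.0, "core-router-1", "linkDown"),
    Event(103.0, "app-server-7", "connection timeout"),
    Event(104.0, "app-server-9", "connection timeout"),
    Event(500.0, "disk-array-2", "fan failure"),  # unrelated, minutes later
]
for cluster in cluster_by_time(events):
    print(candidate_cause(cluster).source, "->", len(cluster), "events")
```

In practice, time alone is nowhere near enough, which is exactly why this space is hard; but anything that needs a hand-maintained model to do better has, in my experience, failed at scale.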
Going back to absolute basics, the new tool will need to "provide more value to the business than the effort needed to maintain it". Hence my rant about rules, BSM models and accurate CMDBs.
If you ever find the Next Big Thing that's not smoke and mirrors, PLEASE let me know.
UPDATE: 27th February 2014
One of the comments I have received is from Mike Silvey, one of the co-founders of Micromuse. He has his own personal views published on his blog at http://moogsoft.com/service-management-itil-itsm-oss-etom-root-cause-event-management-service-desk-industry-comment/ . At first glance this appears to make sense and their offering looks to have potential.
Given his pedigree I will take him up on his offer to look into this in more detail. If I discover this is more than "smoke and mirrors" I will provide an update later.
3 comments:
I totally agree with you about the effort required to maintain rules, BSM models and accurate CMDBs outweighing the benefit they provide in the event of an unexpected problem occurring in a complex system. For 9 years I worked in the development team at RiverSoft, then Micromuse and finally IBM Tivoli, so I've seen the problems with trying to diagnose faults using complicated models from the other side of the fence to you. In my experience, there are always a small number of overworked experts who have enough of an idea how things really fit together that they can solve the difficult problems, and when something bad happens they get called in to sort it out with little help from monitoring tools.
A more realistic goal than trying to determine root cause based on some perfect model of how everything fits together is simply to know at any time what is happening that's unusual. If your environment is working most of the time, a problem will be caused by something unusual happening. The company I work for now, Prelert, has a product called Anomaly Detective that can tell you in real time what the anomalies are in your environment. Then your overworked experts just need to look at the Prelert dashboard rather than trawling through raw data to find likely causes.
Anomaly Detective sits on top of Splunk, so you'll need to centralise monitoring data from your systems into Splunk, but if you're still relying on Netcool that's probably a good thing to do anyway - de-duplication destroys a lot of the information needed to accurately detect anomalies, and hardware is a lot more powerful now than it was in 1994! If you use or evaluate Splunk then try out Prelert Anomaly Detective too. It's free to download for a 30 day trial and certainly falls into the category of a tool that can "provide more value to the business than the effort needed to maintain it".
Thanks for the comment. Although I am aware of Prelert, I have not looked at it in detail. One of my old teams did use Splunk about 5 years ago in the security area for historical analysis. At the time Splunk was thought to be more of a reactive tool, suited to analysis and drilling down to find problems, rather than a real time proactive tool suitable for monitoring. However, it was a while ago, so I will try to get some more recent feedback :-) Anyone else have any thoughts/experience with this approach?
Con
You are spot on here. Building upon your theme, as one of the co-founders of Micromuse (and then RiverSoft with RCA), I've posted my [possibly highly personal] view on our site:
http://moogsoft.com/service-management-itil-itsm-oss-etom-root-cause-event-management-service-desk-industry-comment/
I look forward to some robust discussions with you :)
Glad to see you sharing your wisdom on this topic.