Thursday, February 19, 2015

Hey SCOM! Automatic Alert Resolution Isn’t Working?!

Issue
Sometimes SCOM admins from different customers contact me stating that Automatic Alert Resolution in SCOM is broken. However, there is something else at play here, so I’ve decided to write this post.

Automatic Alert Resolution seems to be broken…
This is the default setting for Automatic Alert Resolution (Administration > Settings > Alerts > 2nd tab):
image

The highlighted setting states that any active alert with the Resolution State New will be automatically resolved after 30 days. But the same SCOM admins who tell me this isn’t working show me SCOM Consoles with Alerts way older than 30 days, and yet their Resolution State is still New, like this:
image

The Alert Veeam VMware: VM Deploy Failed is 377 days old, and has the Resolution State New. So one might think the Automatic Alert Resolution functionality is broken indeed. But there is more to it.

What’s really happening here
There are Monitors and Rules, and both are capable of triggering Alerts. However, a Monitor only generates an Alert once, at the moment it changes to an unhealthy state, like Healthy > Warning or Healthy > Critical, depending on what kind of Monitor it is (2 or 3 state).

So a Monitor won’t flood the Console with the same Alert. However, a Rule will raise an Alert and keep on doing that when the same critical condition is detected. This is by design. But it has the potential to flood the SCOM Console (and the Notification Model) with many Alerts all about the same issue.

So a Rule uses Alert Suppression. Instead of triggering Alert after Alert, the Rule checks whether it has already triggered an Alert, and if it has, it won’t fire a new one. Instead it raises the Repeat Count of that same Alert by one.
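The suppression behaviour described above can be sketched as a toy model. This is only an illustration of the idea, not how SCOM implements it: the class `RuleAlertStore`, the method `raise_alert` and the use of the Alert name as the suppression key are all made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    name: str
    time_raised: datetime
    repeat_count: int = 0
    resolution_state: str = "New"


class RuleAlertStore:
    """Toy model of Alert Suppression (hypothetical, not SCOM's real code)."""

    def __init__(self):
        # Assumption: one open Alert per name; the name acts as the suppression key.
        self.open_alerts = {}

    def raise_alert(self, name: str, now: datetime) -> Alert:
        existing = self.open_alerts.get(name)
        if existing is not None:
            # Same condition detected again: no new Alert, only a higher Repeat Count.
            existing.repeat_count += 1
            return existing
        alert = Alert(name, time_raised=now)
        self.open_alerts[name] = alert
        return alert


store = RuleAlertStore()
start = datetime(2014, 1, 1)  # arbitrary example date
for hour in range(3):
    alert = store.raise_alert("VM Deploy Failed", start + timedelta(hours=hour))

print(len(store.open_alerts), alert.repeat_count)  # 1 2
```

Three detections of the same condition leave a single Alert in the store, with a Repeat Count of 2.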

Normally this Repeat Count column isn’t shown but you can modify the Active Alerts View so it shows that column as well (right click that View > Properties > 2nd tab > select the option Repeat Count):
image

And when you take a new look at the same Alert which is still New after so many days this is what you’ll see:
image

So this Alert, which was fired for the FIRST time 377 days ago, has a Repeat Count of 22615 (!). And here it comes: Every time the Repeat Count is raised, SCOM looks upon that Alert as a fresh one, except for the Notification Model that is.

So every time the Repeat Count is raised by one, the counting of 30 days starts all over. Only after 30 days without a new increment will SCOM resolve that Alert automatically, so it can be groomed out of the OpsMgr database.
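A minimal sketch of that reset, assuming (as the reader comment at the end of this post suggests) that the countdown runs from the Alert’s Last Modified timestamp. The function name and the dates are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta

# Default from the settings dialog: resolve New alerts after 30 days.
RESOLVE_AFTER = timedelta(days=30)


def due_for_auto_resolution(last_modified: datetime, now: datetime) -> bool:
    """True once 30 days have passed since the Alert was last touched."""
    return now - last_modified >= RESOLVE_AFTER


first_raised = datetime(2014, 1, 1)          # arbitrary example date
now = first_raised + timedelta(days=377)     # the Alert is 377 days old
last_repeat = now - timedelta(hours=1)       # assumed: Repeat Count raised an hour ago

# Counting from the first time the Alert was raised, it looks long overdue...
print(due_for_auto_resolution(first_raised, now))  # True
# ...but counting from the latest increment, the 30 days have only just started.
print(due_for_auto_resolution(last_repeat, now))   # False
```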

But some basic calculations teach us this: The Alert is 377 days old, with a Repeat Count of 22615, so that’s 377/22615 ≈ 0.017 days, or about 24 minutes. This is the average ‘life span’ of a single increment of that Alert… So it will NEVER reach the 30 days, and it will never be resolved and groomed out by SCOM itself. Instead it needs some help from YOU…
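The same arithmetic, written out with the numbers from the screenshot. Keep in mind this is only an average: the real gaps between increments may vary, and the variable names are just for this example.

```python
alert_age_days = 377       # age of the Alert in the Console
repeat_count = 22615       # value of the Repeat Count column
resolve_after_days = 30    # Automatic Alert Resolution setting for New alerts

# Average time between two increments of the Repeat Count.
avg_interval_days = alert_age_days / repeat_count
avg_interval_minutes = avg_interval_days * 24 * 60

print(f"{avg_interval_days:.4f} days")        # 0.0167 days
print(f"{avg_interval_minutes:.1f} minutes")  # 24.0 minutes

# The Alert is only resolved if it stays untouched for the full 30 days.
print(avg_interval_days >= resolve_after_days)  # False
```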

Recap
Automatic Alert Resolution works. But when an Alert is triggered by a Rule, every increment of the Repeat Count starts the counter for Automatic Alert Resolution all over again. So stay on the ball and in control. Manage your Alerts in a normal manner and you’ll see everything works out as intended.

1 comment:

Unknown said...

You can add "Last Modified" to the view, because that is the field that Automatic Alert Resolution uses as the timestamp to count from.