Incident management, EM’s close encounter with ITIL

EM 11 and earlier releases provided us with an alerting framework based on Metrics and Thresholds. EM 12c introduces a completely revised framework based on Incidents and Events.

Events

An event is a single occurrence detected by EM and related to a single entity (well, that is what the Administrator Guide tells us). Such a single entity could be a Target, a Configuration File, a Job etc. Examples of an event include: Database Instance is down, Configuration File has been changed, Job execution ended in failure, Host exceeded a given percentage of CPU, Tablespace space is exhausted etc.
No doubt this immediately invites a comparison with Metric Alerts in EM 11 (and before).

Incidents

When working in an IT environment that uses the ITIL (Information Technology Infrastructure Library) best practice processes and procedures, the term Incident does ring a bell, doesn’t it? Searching Wikipedia for a definition of “Incidents” according to ITIL, you will find: “Incidents are the result of failures or errors in the IT infrastructure. The cause of Incidents may be apparent and the cause may be addressed without the need for further investigation, resulting in a repair, a Work-around or a request for change (RFC) to remove the error.”

EM 12c now allows us to define an Incident as a single event or a closely correlated set of events that identify a disturbance within our Data Center. So an Incident definition might be as simple as a relation with a single Event, “Available space in Tablespace has dropped below a specified limit”, or as complex as an Incident “Server is running out of resources” that is related to a set of Events on the usage of CPU, I/O and Memory resources.
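
To make the relation between Events and Incidents a bit more tangible, here is a minimal Python sketch. It is a conceptual model only, not EM's data model or API: the class names `Event` and `Incident`, the `correlate` helper and the target names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    target: str   # the single entity the event relates to (hypothetical name)
    kind: str     # e.g. "cpu", "io", "memory", "tablespace"
    message: str


@dataclass
class Incident:
    summary: str
    events: list = field(default_factory=list)


def correlate(events, target, kinds, summary):
    """Group the events on one target whose kind is in `kinds` into one incident.

    Returns None when no matching event exists.
    """
    related = [e for e in events if e.target == target and e.kind in kinds]
    return Incident(summary, related) if related else None


events = [
    Event("host01", "cpu", "CPU utilization above threshold"),
    Event("host01", "memory", "Memory utilization above threshold"),
    Event("db01", "tablespace", "Available space in tablespace below limit"),
]

# One incident that bundles the resource-related events on host01.
incident = correlate(
    events, "host01", {"cpu", "io", "memory"}, "Server is running out of resources"
)
```

The tablespace event on `db01` stays out of this incident; in the simple case it would become an incident of its own with exactly one event.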

Integration with 3rd Party Helpdesk systems
As we were used to in EM 11 (and earlier releases), EM 12c allows you to integrate with 3rd party systems, for instance to create a ticket as the result of an Incident.

Incident Rule Sets

Where in EM 11 (and earlier) we used Notification Rules to send notifications when Alerts were created for specific Events and specific Targets, EM 12c now provides us with Incident Rules and Incident Rule Sets to do so.

Incident Rules

Incident Rules actually perform the same role as Notification Rules did in EM 11.

Incident Rule Sets

An Incident Rule Set allows you to combine multiple Incident Rules to gain more extended control over Incident situations. As Incident Rules apply to objects like databases, hosts, groups, jobs, web applications etc., combining them in an Incident Rule Set lets you take specific actions on a combination of objects and criteria.

Rules within a Rule Set can have a specific order in which each of them must be performed.

This example (based on an example in the Administrators Guide) shows the usage of Incident Rule Sets. The Set applies to a Group of Targets (of different types) and initiates multiple actions based on different criteria.

Out-of-Box Rule Sets

After installation of EM 12c, the following “out-of-box” Rule Sets are available:

Incident Management Rule Sets for All Targets

  • Incident creation Rule for target down.
  • Incident creation Rule for target unreachable (for Agents and hosts).
  • Incident creation Rule for metric alerts (for critical severity only).
  • Out-of-box Incident creation rule for Service Level Agreement Alerts.
  • Incident creation rule for compliance score violation.
  • Incident creation rule for high-availability events.
  • Auto-clear Rule for metric alerts older than 7 days.
  • Auto-clear Rule for job status change terminal status events older than 7 days.
  • Clear Application Dependency and Performance (ADP) alerts without incidents after 7 days.

The Rules that are part of an Incident Rule Set consist of two parts:

  • Criteria
    The events/incidents/problems to which the Rule applies
  • Action
    The Actions that should be performed in case the Criteria are met
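
The Criteria/Action split can be pictured as a predicate plus a callback. The sketch below is purely illustrative and says nothing about how EM implements rules internally; the `Rule` class, the dictionary keys and the returned action string are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    criteria: Callable[[dict], bool]  # does the rule apply to this event?
    action: Callable[[dict], str]     # what to do when it applies

    def apply(self, item: dict) -> Optional[str]:
        """Run the action only when the criteria are met."""
        return self.action(item) if self.criteria(item) else None


# Hypothetical rule modelled on "Incident creation Rule for target down".
target_down = Rule(
    name="Create incident for target down",
    criteria=lambda ev: ev.get("type") == "availability" and ev.get("status") == "down",
    action=lambda ev: f"create incident for {ev['target']}",
)

result = target_down.apply(
    {"type": "availability", "status": "down", "target": "db01"}
)
ignored = target_down.apply(
    {"type": "metric_alert", "status": "critical", "target": "db01"}
)
```

Here `result` holds the action description for the down target, while `ignored` is `None` because the metric alert does not meet the criteria.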

The Administrators Guide gives some example Rule Set possibilities:
Rule Set:

    • Applies to Target: Group Target G
    • Rules in the Rule Set:
      1. Rule(s) to create Incidents for specified Events
      2. Rule(s) that send Notifications on the Incidents
      3. Rule(s) that escalate Incidents based on some condition (e.g. length of time the Incident is open)
Rule Set (Details):

      • Rule Set for Production Group G
        • Target: Production Group G
        • Rule 1: Create Incident for all Target down Events
        • Rule 2: Create Incident for specific DB, Host or WLS Metric alert events of critical or warning severity
        • Rule 3: Create Incident for any problem Job events
        • Rule 4: For all critical Incidents send a page, for all warning Incidents send an e-mail
        • Rule 5: If a Fatal Incident is open for more than 12 hours, set escalation level to 1 and e-mail the Manager
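
As a rough illustration of how Rule 4 and Rule 5 could be evaluated in order, here is a small Python sketch. It only models the logic of the example above; the function names, the dictionary layout of an incident and the returned action strings are assumptions, not EM's own interface.

```python
from datetime import datetime, timedelta


def notify(incident, now):
    """Rule 4: page for critical incidents, e-mail for warning incidents."""
    if incident["severity"] == "critical":
        return "page"
    if incident["severity"] == "warning":
        return "e-mail"
    return None


def escalate(incident, now):
    """Rule 5: fatal incident open for more than 12 hours gets escalation level 1."""
    open_for = now - incident["opened"]
    if incident["severity"] == "fatal" and open_for > timedelta(hours=12):
        incident["escalation_level"] = 1
        return "e-mail manager"
    return None


# The position in the list is the order in which the rules are performed.
RULES = [notify, escalate]


def run_rule_set(incident, now):
    """Apply every rule in order and collect the actions that fired."""
    actions = []
    for rule in RULES:
        action = rule(incident, now)
        if action is not None:
            actions.append(action)
    return actions


now = datetime(2012, 1, 2, 12, 0)
fatal = {"severity": "fatal", "opened": datetime(2012, 1, 1, 20, 0), "escalation_level": 0}
fatal_actions = run_rule_set(fatal, now)      # open for 16 hours, so Rule 5 fires

warning = {"severity": "warning", "opened": now, "escalation_level": 0}
warning_actions = run_rule_set(warning, now)  # only Rule 4 fires
```

Keeping the rules in a plain list makes the ordering explicit: changing the order of evaluation is just a matter of reordering the list.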

Event prioritizing

We all know that using EM in a large IT landscape with hundreds to thousands of targets can result in a heavy load of Incidents.

In such a case we want to be able to prioritize. To allow this, EM 12c introduces Target Life-cycle Status and Incident/Event Type.

Target Life-cycle
      • Mission Critical (Highest priority)
      • Production
      • Stage
      • Test
      • Development (Lowest priority)
Incident/Event Type
      • Availability (Highest priority)
      • All Events/Incidents (Fatal severity)
      • All Events/Incidents (Warning and Critical severities)
      • All Events/Incidents (Informational) (Lowest priority)
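
The two lists above can be read as a sort key. The sketch below assumes that Target Life-cycle Status is compared first and Incident/Event Type second; that ordering, like the dictionary keys and the sample items, is an assumption made for this illustration and not something taken from the documentation.

```python
# Lower rank means higher priority, following the two lists above.
LIFECYCLE_RANK = {
    "Mission Critical": 0,
    "Production": 1,
    "Stage": 2,
    "Test": 3,
    "Development": 4,
}

SEVERITY_RANK = {"fatal": 1, "critical": 2, "warning": 2, "informational": 3}


def type_rank(item):
    """Availability always comes first, otherwise rank by severity."""
    if item["category"] == "availability":
        return 0
    return SEVERITY_RANK[item["severity"]]


def priority_key(item):
    # Assumption: life-cycle status is compared before incident/event type.
    return (LIFECYCLE_RANK[item["lifecycle"]], type_rank(item))


backlog = [
    {"id": "a", "lifecycle": "Test", "category": "availability", "severity": "fatal"},
    {"id": "b", "lifecycle": "Production", "category": "metric", "severity": "warning"},
    {"id": "c", "lifecycle": "Mission Critical", "category": "metric", "severity": "informational"},
    {"id": "d", "lifecycle": "Production", "category": "metric", "severity": "fatal"},
]

ordered_ids = [item["id"] for item in sorted(backlog, key=priority_key)]
```

With this key, anything on a Mission Critical target is handled before anything on a Production target, and within one life-cycle status an availability problem goes before a fatal one.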