Incident Management Process
ITIL-aligned incident management lifecycle from detection through resolution and post-incident review, featuring AI-assisted triage and automated remediation.
Incident Lifecycle
Incident Detection
Incidents enter the process through three primary channels:
Classification & Prioritisation
Incident Categories
| Category | Sub-Categories | Typical Priority |
|---|---|---|
| Network | Connectivity, DNS, DHCP, firewall, SD-WAN, Wi-Fi | P1-P3 |
| Server/Compute | Server down, high utilisation, OS crash, service failure | P1-P3 |
| Cloud Services | Azure/AWS outage, resource failures, config issues | P1-P3 |
| Email & Collaboration | M365 issues, Exchange, Teams, SharePoint | P1-P3 |
| Application | LOB app errors, performance, access issues | P2-P4 |
| Desktop/Endpoint | Hardware failure, OS issues, software installation | P3-P4 |
| Security | Malware detected, phishing, unauthorised access | P1-P2 (escalate to CSIRT) |
| Backup/DR | Backup failure, restore request, DR failover | P2-P3 |
| Print/Peripheral | Printer issues, scanner, AV equipment | P3-P4 |
| Access/Identity | Password reset, account lockout, MFA, permissions | P3-P4 |
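The category table can be encoded as a lookup so that a suggested priority is checked against the typical range for its category. The sketch below is illustrative only: the names `TYPICAL_PRIORITY`, `is_typical_priority` and `needs_csirt` are hypothetical, and the integer encoding of priorities (1 for P1 through 4 for P4) is an assumption.

```python
# Typical priority ranges copied from the Incident Categories table.
# Assumption: priorities are stored as integers, 1 = P1 (highest) to 4 = P4 (lowest).
TYPICAL_PRIORITY = {
    "Network": (1, 3),
    "Server/Compute": (1, 3),
    "Cloud Services": (1, 3),
    "Email & Collaboration": (1, 3),
    "Application": (2, 4),
    "Desktop/Endpoint": (3, 4),
    "Security": (1, 2),
    "Backup/DR": (2, 3),
    "Print/Peripheral": (3, 4),
    "Access/Identity": (3, 4),
}


def is_typical_priority(category: str, priority: int) -> bool:
    """Return True if the priority falls inside the category's typical range."""
    highest, lowest = TYPICAL_PRIORITY[category]
    return highest <= priority <= lowest


def needs_csirt(category: str) -> bool:
    """Security incidents are escalated to CSIRT, per the table."""
    return category == "Security"
```

A priority outside the typical range is not necessarily wrong; a check like this would only flag the ticket for a second look.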
AI-Assisted Triage
ASI AI Sentinel provides automated initial diagnosis for all incoming incidents:
- Natural language classification: AI reads user-submitted descriptions and auto-categorises the incident with 92% accuracy
- Suggested priority: AI recommends priority based on historical patterns, affected services, and user role (VIP flagging)
- Knowledge match: AI searches the knowledge base and attaches the top 3 relevant articles to the ticket
- Similar incident detection: AI checks whether the issue matches an active P1/P2 incident (potential child incident) and, if so, links the two
- Auto-remediation check: AI determines whether the incident can be resolved automatically (e.g., service restart, DNS flush, certificate renewal) and executes the fix only if confidence exceeds 95%
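The auto-remediation gate described above can be sketched as a simple routing decision. This is a minimal illustration, not the ASI AI Sentinel implementation: the `TriageResult` fields and the `route` function are hypothetical, and confidence is assumed to be a value between 0.0 and 1.0.

```python
from dataclasses import dataclass

# The process executes auto-remediation only if confidence is strictly above 95%.
AUTO_REMEDIATION_THRESHOLD = 0.95


@dataclass
class TriageResult:
    category: str
    suggested_priority: int  # assumption: 1 = P1 ... 4 = P4
    confidence: float        # hypothetical field: confidence in the automated fix, 0.0 to 1.0
    remediable: bool         # hypothetical field: a known automated fix exists


def route(result: TriageResult) -> str:
    """Return the support level that should handle the incident first."""
    if result.remediable and result.confidence > AUTO_REMEDIATION_THRESHOLD:
        return "L1 AI"
    # No automated fix, or confidence too low: hand over to a human.
    return "L2 Service Desk"
```

For example, a DNS issue with a known fix and a confidence of 0.97 stays at L1, while the same issue at exactly 0.95 goes to the Service Desk.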
Escalation Matrix
| Level | Handled By | Resolution Window |
|---|---|---|
| L1 AI | Automated remediation | 0-5 min |
| L2 Service Desk | General IT support | 5 min - 4 hrs |
| L3 Specialist | Domain expert (Cloud, Security, Network) | 1-8 hrs |
| L4 Vendor | Vendor support or Solutions Architect | As per vendor SLA |
Escalation Triggers
| From | To | Trigger | Max Time Before Escalation |
|---|---|---|---|
| L1 AI | L2 Service Desk | Auto-remediation failed or confidence ≤ 95% | 5 minutes |
| L2 Service Desk | L3 Specialist | Unable to resolve within timeframe, requires domain expertise | P1: 15 min, P2: 30 min, P3: 2 hrs |
| L3 Specialist | L4 Vendor | Vendor product defect, requires vendor intervention | P1: 1 hr, P2: 2 hrs |
| Any Level | Major Incident Manager | P1 incident or multiple related P2 incidents | Immediately upon identification |
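The time limits in the table above lend themselves to a lookup that answers "should this ticket move up a level yet?". The sketch below is a simplified illustration with hypothetical names. It treats elapsed time as the only trigger, whereas the table also lists qualitative triggers such as a vendor product defect. It also assumes the 5-minute L1 limit applies to every priority, and it returns no escalation for combinations the table does not list.

```python
from typing import Optional

# Maximum minutes at each level before escalation, keyed by priority (1 = P1).
# Values come from the Escalation Triggers table; unlisted combinations are omitted.
ESCALATION_LIMITS = {
    "L1 AI": {1: 5, 2: 5, 3: 5, 4: 5},  # assumption: 5 minutes for all priorities
    "L2 Service Desk": {1: 15, 2: 30, 3: 120},
    "L3 Specialist": {1: 60, 2: 120},
}

NEXT_LEVEL = {
    "L1 AI": "L2 Service Desk",
    "L2 Service Desk": "L3 Specialist",
    "L3 Specialist": "L4 Vendor",
}


def escalation_target(level: str, priority: int, elapsed_minutes: float) -> Optional[str]:
    """Return the level to escalate to, or None if no time-based escalation is due."""
    limit = ESCALATION_LIMITS.get(level, {}).get(priority)
    if limit is None:
        return None  # the table defines no time limit for this combination
    if elapsed_minutes >= limit:
        return NEXT_LEVEL[level]
    return None
```

Escalation to the Major Incident Manager is deliberately left out, since the table makes it immediate and independent of elapsed time.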
Resolution & Recovery
- Identify resolution: Select the fix (workaround or permanent) based on the diagnosis
- Implement fix: Execute the resolution through the appropriate change process (emergency change for P1 in production)
- Validate resolution: Confirm the service is restored and functioning correctly
- User confirmation: Contact the affected user(s) to confirm the issue is resolved
- Document resolution: Record the root cause, resolution steps, and any knowledge article updates in ServiceNow
- Close incident: Set status to "Resolved". The ticket auto-closes after 3 business days if the user raises no objection.
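The auto-close date in the final step depends on counting business days. A minimal sketch follows, assuming business days are Monday to Friday and ignoring public holidays; the helper name `add_business_days` is hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date that falls `days` business days after `start`.

    Assumption: business days are Monday to Friday; public holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday = 0 ... Friday = 4
            remaining -= 1
    return current


# An incident resolved on Thursday 5 February 2026 would auto-close on Tuesday 10 February.
auto_close_on = add_business_days(date(2026, 2, 5), 3)
```

The same helper with `days=5` would give the deadline for a Post-Incident Review.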
Major Incident Process
Declaration (0-5 min)
Any L2+ engineer or SDM can declare a Major Incident. The Major Incident Manager (MIM) is paged immediately via PagerDuty. A bridge call/Teams channel is opened.
Assembly (5-15 min)
MIM assembles the response team: relevant L3 specialists, Solutions Architect (if needed), and communications lead. All parties join the bridge call.
Assessment & Communication (15-30 min)
MIM assesses impact, confirms priority, and issues the first client notification. Internal status page updated. All related tickets linked as child incidents.
Resolution Work (Ongoing)
Technical team works to resolve. MIM provides updates every 30 minutes to affected clients and internal stakeholders. Vendor engaged if needed.
Resolution & Confirmation
Service restored. MIM confirms with all affected parties. Resolution notification sent. Monitoring elevated for 24 hours post-resolution.
Post-Incident Review (within 5 BD)
PIR conducted within 5 business days. Report distributed to all stakeholders. Improvement actions tracked to completion.
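The phase timings above can be turned into concrete deadlines once the declaration time is known. The sketch below uses hypothetical function and key names, and it assumes the 30-minute update cadence starts from the first client notification, which the process does not state explicitly.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def major_incident_milestones(declared_at: datetime) -> Dict[str, datetime]:
    """Latest times for the early phases, measured from the declaration."""
    return {
        "declaration_complete_by": declared_at + timedelta(minutes=5),
        "team_assembled_by": declared_at + timedelta(minutes=15),
        "first_client_notification_by": declared_at + timedelta(minutes=30),
    }


def update_times(first_notification: datetime, count: int) -> List[datetime]:
    """Times of the next `count` status updates, one every 30 minutes.

    Assumption: the cadence is counted from the first client notification.
    """
    return [first_notification + timedelta(minutes=30 * i) for i in range(1, count + 1)]
```

For an incident declared at 09:00, the response team should be assembled by 09:15 and the first client notification issued by 09:30.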
Post-Incident Review (PIR) Template
Post-Incident Review Report
First Response: [DD/MM/YYYY HH:MM] (TTA: XX min)
Major Incident Declared: [DD/MM/YYYY HH:MM] (if applicable)
Resolution: [DD/MM/YYYY HH:MM] (TTR: XX hrs XX min)
Full Recovery: [DD/MM/YYYY HH:MM]
1. [Action] - Owner: [Name] - Due: [Date]
2. [Action] - Owner: [Name] - Due: [Date]
3. [Action] - Owner: [Name] - Due: [Date]
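The TTA and TTR figures in the template can be derived from the recorded timestamps. The sketch below assumes timestamps use the template's DD/MM/YYYY HH:MM format and that both values are measured from incident creation, as in the metric definitions; the function names are hypothetical.

```python
from datetime import datetime

# Timestamp format used in the PIR template: DD/MM/YYYY HH:MM
FMT = "%d/%m/%Y %H:%M"


def tta_minutes(created: str, first_response: str) -> int:
    """Time to acknowledge, in whole minutes."""
    delta = datetime.strptime(first_response, FMT) - datetime.strptime(created, FMT)
    return int(delta.total_seconds() // 60)


def format_ttr(created: str, resolved: str) -> str:
    """Time to resolve, formatted as 'X hrs Y min' to match the template."""
    delta = datetime.strptime(resolved, FMT) - datetime.strptime(created, FMT)
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hrs {minutes} min"
```

For an incident created at 09:00, first answered at 09:08 and resolved at 11:24 on the same day, this gives a TTA of 8 minutes and a TTR of "2 hrs 24 min".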
Communication Templates
Customer Notification — Incident Acknowledged
Subject: [P1/P2] Incident Notification — [Brief Description] — [INC-ID]
Dear [Client Name],
We are writing to inform you that we have identified an issue affecting [service/system name]. Our team is actively investigating.
Incident Details:
- Incident ID: [INC-YYYY-NNNNNN]
- Priority: [P1/P2]
- Impact: [Description of what is affected]
- Detected: [DD/MM/YYYY HH:MM AEST]
We will provide an update within [30 minutes for P1 / 1 hour for P2] or sooner if we have new information.
If you have any questions, please contact your Service Delivery Manager or call 1300-ASI-HELP.
Kind regards,
ASI AI Solutions Service Desk
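When this template is filled in programmatically, the incident ID and the promised update window are the two fields most easily checked. The sketch below is illustrative: it assumes the placeholder INC-YYYY-NNNNNN means a four-digit year followed by six digits, and the helper names are hypothetical.

```python
import re

# Assumption: INC-YYYY-NNNNNN means a four-digit year and a six-digit sequence number.
INCIDENT_ID = re.compile(r"INC-\d{4}-\d{6}")

# Update windows stated in the acknowledgement template.
UPDATE_WINDOW = {"P1": "30 minutes", "P2": "1 hour"}


def is_valid_incident_id(incident_id: str) -> bool:
    """Return True if the ID matches the assumed INC-YYYY-NNNNNN pattern."""
    return INCIDENT_ID.fullmatch(incident_id) is not None


def update_sentence(priority: str) -> str:
    """Build the update commitment for a P1 or P2 notification."""
    window = UPDATE_WINDOW[priority]
    return f"We will provide an update within {window} or sooner if we have new information."
```

Only P1 and P2 are covered because the template defines update windows for those priorities alone; any other priority raises a `KeyError`.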
Internal Escalation — Major Incident
Subject: MAJOR INCIDENT DECLARED — [Brief Description] — [INC-ID]
To: Major Incident Response Team, Head of Service Delivery, VP Operations
A Major Incident has been declared. Please join the bridge call immediately.
- Bridge Call: [Teams meeting link]
- Incident ID: [INC-YYYY-NNNNNN]
- Impact: [X clients / Y users affected]
- Service: [Affected service]
- MIM: [Name]
- Current Status: [Investigating / Containment]
Customer Notification — Incident Resolved
Subject: RESOLVED — [Brief Description] — [INC-ID]
Dear [Client Name],
We are pleased to confirm that the incident reported on [date] has been resolved.
- Incident ID: [INC-YYYY-NNNNNN]
- Resolution Time: [X hours Y minutes]
- Root Cause: [Brief description]
- Resolution: [Brief description of fix applied]
- Preventive Actions: [What we are doing to prevent recurrence]
We are continuing to monitor the affected services closely. A full Post-Incident Review will be shared within 5 business days.
We apologise for any disruption caused. Please don't hesitate to reach out if you experience any further issues.
Kind regards,
[Service Delivery Manager Name]
ASI AI Solutions
Incident Management Metrics
Metric Definitions & Targets
| Metric | Definition | Target | Current (Feb 2026) |
|---|---|---|---|
| MTTA | Mean time from incident creation to first human response | ≤ 10 min (P1) | 8 min |
| MTTR | Mean time from incident creation to resolution | ≤ 4 hrs (P1) | 2.4 hrs |
| First Contact Resolution | % of incidents resolved at first interaction without escalation | ≥ 70% | 72% |
| SLA Compliance | % of incidents meeting response and resolution SLA targets | ≥ 95% | 96.2% |
| Reopened Rate | % of resolved incidents reopened within 7 days | ≤ 5% | 3.8% |
| AI Auto-Resolution Rate | % of incidents resolved by AI without human intervention | ≥ 40% | 38% |
| Customer Satisfaction | Post-resolution CSAT survey average | ≥ 4.5/5.0 | 4.6/5.0 |
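A quick way to read the table is to compare each current value with its target in the direction the target implies. The sketch below copies the February 2026 figures from the table; the `METRICS` structure and the `missed_targets` function are hypothetical names used for illustration.

```python
import operator

# (metric, comparison the current value must satisfy, target, current Feb 2026 value)
METRICS = [
    ("MTTA (P1, min)", operator.le, 10, 8),
    ("MTTR (P1, hrs)", operator.le, 4, 2.4),
    ("First Contact Resolution (%)", operator.ge, 70, 72),
    ("SLA Compliance (%)", operator.ge, 95, 96.2),
    ("Reopened Rate (%)", operator.le, 5, 3.8),
    ("AI Auto-Resolution Rate (%)", operator.ge, 40, 38),
    ("Customer Satisfaction (/5.0)", operator.ge, 4.5, 4.6),
]


def missed_targets(metrics):
    """Return the names of metrics whose current value does not meet the target."""
    return [name for name, compare, target, current in metrics
            if not compare(current, target)]
```

With the figures above, only the AI Auto-Resolution Rate (38% against a target of at least 40%) falls short.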