Incident Management Process - ASI AI Solutions Wiki

Scope: This process covers IT service incidents for managed clients. For security incidents (data breach, ransomware, etc.), refer to the Security Incident Response Plan.

Incident Lifecycle

Trigger: Incident Detected
Step 1: Logging & Categorisation
Step 2: Prioritisation (Impact × Urgency)
Step 3: Initial Diagnosis (AI-Assisted)
Decision: Can L1/AI Resolve?
Step 4: Escalation (if needed)
Step 5: Investigation & Diagnosis
Step 6: Resolution & Recovery
Step 7: Closure & Documentation
Complete: Incident Closed

Incident Detection

Incidents enter the process through three primary channels:

AI Monitoring (Automated)

ASI AI Sentinel detects anomalies, threshold breaches, and infrastructure failures. Automated alerts generate ServiceNow tickets with pre-populated diagnostics. AI auto-remediation resolves approximately 40% of detected issues without human intervention.

User Report

End users report incidents via the self-service portal, email (support@asiaisolutions.com.au), phone (1300-ASI-HELP), or the Microsoft Teams bot. The AI chatbot attempts to resolve common issues (password resets, how-to queries) before creating a ticket.

Proactive Detection

Scheduled health checks, vulnerability scans, and predictive analytics identify potential incidents before they cause impact. These are logged as proactive incidents and addressed during maintenance windows where possible.

Classification & Prioritisation

Incident Categories

Category	Sub-Categories	Typical Priority
Network	Connectivity, DNS, DHCP, firewall, SD-WAN, Wi-Fi	P1-P3
Server/Compute	Server down, high utilisation, OS crash, service failure	P1-P3
Cloud Services	Azure/AWS outage, resource failures, config issues	P1-P3
Email & Collaboration	M365 issues, Exchange, Teams, SharePoint	P1-P3
Application	LOB app errors, performance, access issues	P2-P4
Desktop/Endpoint	Hardware failure, OS issues, software installation	P3-P4
Security	Malware detected, phishing, unauthorised access	P1-P2 (escalate to CSIRT)
Backup/DR	Backup failure, restore request, DR failover	P2-P3
Print/Peripheral	Printer issues, scanner, AV equipment	P3-P4
Access/Identity	Password reset, account lockout, MFA, permissions	P3-P4

AI-Assisted Triage

ASI AI Sentinel provides automated initial diagnosis for all incoming incidents:

Natural language classification: AI reads user-submitted descriptions and auto-categorises the incident with 92% accuracy
Suggested priority: AI recommends priority based on historical patterns, affected services, and user role (VIP flagging)
Knowledge match: AI searches the knowledge base and attaches the top 3 relevant articles to the ticket
Similar incident detection: AI identifies if the issue matches an active P1/P2 incident (potential child incident) and links them
Auto-remediation check: AI determines if the incident can be resolved automatically (e.g., service restart, DNS flush, certificate renewal) and executes if confidence > 95%
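
The auto-remediation check above can be sketched as a simple routing decision. This is a minimal illustration, not ASI AI Sentinel's actual logic: the `TriageResult` fields, the `route` function, and the returned labels are hypothetical names, and only the 95% threshold comes from the process.

```python
from dataclasses import dataclass
from typing import Optional

# From the process: auto-remediation executes only if confidence > 95%.
AUTO_REMEDIATION_THRESHOLD = 0.95


@dataclass
class TriageResult:
    """Hypothetical shape of an AI triage result."""
    category: str
    confidence: float            # confidence in the proposed fix, 0.0 to 1.0
    remediation: Optional[str]   # name of an automated fix, or None if none applies


def route(result: TriageResult) -> str:
    """Return 'auto_remediate' or 'escalate_l2' for a triaged incident."""
    has_fix = result.remediation is not None
    if has_fix and result.confidence > AUTO_REMEDIATION_THRESHOLD:
        return "auto_remediate"
    # No automated fix, or confidence not above the threshold: hand to a human.
    return "escalate_l2"
```

Under this sketch, a confidence of exactly 95% does not clear the threshold and is routed to the L2 Service Desk.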

Escalation Matrix

Level	Role	Scope	Typical Resolution Time
L1	AI Auto-Resolve	Automated remediation	0-5 min
L2	Service Desk Engineer	General IT support	5 min - 4 hrs
L3	Specialist Engineer	Domain expert (Cloud, Security, Network)	1-8 hrs
L4	Vendor / Architect	Vendor support or Solutions Architect	As per vendor SLA

Escalation Triggers

From	To	Trigger	Max Time Before Escalation
L1 AI	L2 Service Desk	Auto-remediation failed or confidence ≤ 95%	5 minutes
L2 Service Desk	L3 Specialist	Unable to resolve within timeframe, requires domain expertise	P1: 15 min, P2: 30 min, P3: 2 hrs
L3 Specialist	L4 Vendor	Vendor product defect, requires vendor intervention	P1: 1 hr, P2: 2 hrs
Any Level	Major Incident Manager	P1 incident or multiple related P2 incidents	Immediately upon identification
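
The time limits in the table above can be expressed as a lookup that returns the latest time a ticket may stay at its current level. This is a sketch under stated assumptions: the table does not say when the clock starts, so the code assumes it starts when the ticket enters the level, and `escalation_deadline` is a hypothetical helper name. Combinations the table does not define (for example P4 at L2) return `None`.

```python
from datetime import datetime, timedelta
from typing import Optional

# L1 (AI) has a single limit regardless of priority.
L1_LIMIT = timedelta(minutes=5)

# Limits transcribed from the Escalation Triggers table.
LIMITS = {
    "L2": {
        "P1": timedelta(minutes=15),
        "P2": timedelta(minutes=30),
        "P3": timedelta(hours=2),
    },
    "L3": {
        "P1": timedelta(hours=1),
        "P2": timedelta(hours=2),
    },
}


def escalation_deadline(
    level: str, priority: str, entered_level_at: datetime
) -> Optional[datetime]:
    """Latest time before the ticket must escalate, or None if undefined."""
    if level == "L1":
        return entered_level_at + L1_LIMIT
    limit = LIMITS.get(level, {}).get(priority)
    if limit is None:
        return None  # the table gives no limit for this combination
    return entered_level_at + limit
```

P1 incidents and multiple related P2 incidents are also notified to the Major Incident Manager immediately, which this time-based lookup does not model.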

Resolution & Recovery

Identify resolution: Select a fix (workaround or permanent) based on the diagnosis
Implement fix: Execute the resolution through the appropriate change process (emergency change for P1 in production)
Validate resolution: Confirm the service is restored and functioning correctly
User confirmation: Contact the affected user(s) to confirm the issue is resolved
Document resolution: Record the root cause, resolution steps, and any knowledge article updates in ServiceNow
Close incident: Set the status to "Resolved". The ticket auto-closes after 3 business days if the user raises no objection.
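
The 3-business-day auto-close window can be computed as shown below. This is a minimal sketch that assumes business days are Monday to Friday, ignores public holidays, and starts counting on the day after the ticket is set to "Resolved"; the process does not specify any of these details, and `add_business_days` and `auto_close_date` are hypothetical helper names.

```python
from datetime import date, timedelta

AUTO_CLOSE_BUSINESS_DAYS = 3  # from the process


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` business days (Mon-Fri, no holiday calendar)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def auto_close_date(resolved_on: date) -> date:
    """Date on which a resolved ticket auto-closes if the user does not object."""
    return add_business_days(resolved_on, AUTO_CLOSE_BUSINESS_DAYS)
```

For example, a ticket resolved on Thursday 5 February 2026 would auto-close on Tuesday 10 February 2026 under these assumptions.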

Major Incident Process

A Major Incident is declared when a P1 incident affects multiple clients, causes complete loss of a critical business service, or has potential reputational/financial impact exceeding $50,000.
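
The declaration criteria above can be written as a single predicate. This sketch reads the sentence as "a P1 incident that meets at least one of the three conditions", which is one possible interpretation; the `IncidentImpact` fields and `is_major_incident` are hypothetical names, and only the $50,000 threshold comes from the process.

```python
from dataclasses import dataclass

IMPACT_THRESHOLD = 50_000  # dollars, from the process


@dataclass
class IncidentImpact:
    """Hypothetical summary of an incident's assessed impact."""
    priority: str                 # "P1" to "P4"
    affected_clients: int
    critical_service_down: bool   # complete loss of a critical business service
    estimated_impact: float       # potential reputational/financial impact, dollars


def is_major_incident(incident: IncidentImpact) -> bool:
    """True if a P1 incident meets at least one Major Incident condition."""
    if incident.priority != "P1":
        return False
    return (
        incident.affected_clients > 1
        or incident.critical_service_down
        or incident.estimated_impact > IMPACT_THRESHOLD
    )
```

The decision to declare still rests with an L2+ engineer or SDM; a check like this could only flag candidates for their attention.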

Declaration (0-5 min)

Any L2+ engineer or SDM can declare a Major Incident. The Major Incident Manager (MIM) is paged immediately via PagerDuty. A bridge call/Teams channel is opened.

Assembly (5-15 min)

MIM assembles the response team: relevant L3 specialists, Solutions Architect (if needed), and communications lead. All parties join the bridge call.

Assessment & Communication (15-30 min)

MIM assesses impact, confirms priority, and issues the first client notification. Internal status page updated. All related tickets linked as child incidents.

Resolution Work (Ongoing)

Technical team works to resolve. MIM provides updates every 30 minutes to affected clients and internal stakeholders. Vendor engaged if needed.

Resolution & Confirmation

Service restored. MIM confirms with all affected parties. Resolution notification sent. Monitoring elevated for 24 hours post-resolution.

Post-Incident Review (within 5 business days)

PIR conducted within 5 business days. Report distributed to all stakeholders. Improvement actions tracked to completion.

Post-Incident Review (PIR) Template

Post-Incident Review Report

Incident ID

[INC-YYYY-NNNNNN]

Incident Title

[Brief description of the incident]

Priority / Severity

[P1/P2] / [Major Incident: Yes/No]

Affected Client(s)

[List all affected clients and number of impacted users]

Timeline

Detection: [DD/MM/YYYY HH:MM]
First Response: [DD/MM/YYYY HH:MM] (TTA: XX min)
Major Incident Declared: [DD/MM/YYYY HH:MM] (if applicable)
Resolution: [DD/MM/YYYY HH:MM] (TTR: XX hrs XX min)
Full Recovery: [DD/MM/YYYY HH:MM]

Impact Summary

[What was affected, how many users/services impacted, business impact estimate ($)]

Root Cause

[Detailed root cause analysis. Use 5-Whys technique.]

Resolution Steps

[Step-by-step actions taken to resolve the incident]

What Went Well

[Positive aspects of the response]

What Could Be Improved

[Areas for improvement in detection, response, or communication]

Improvement Actions

1. [Action] - Owner: [Name] - Due: [Date]
2. [Action] - Owner: [Name] - Due: [Date]
3. [Action] - Owner: [Name] - Due: [Date]

PIR Author

[Name, role]

Reviewed By

[Head of Service Delivery sign-off]
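
The TTA and TTR values in the Timeline section can be derived from the template's timestamps. The sketch below assumes both are measured from the Detection timestamp, since the template has no separate creation field, and that all timestamps are in the same time zone; `elapsed_minutes` and `format_ttr` are hypothetical helper names.

```python
from datetime import datetime

# Timestamp format used in the PIR template: DD/MM/YYYY HH:MM
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"


def elapsed_minutes(start: str, end: str) -> int:
    """Whole minutes between two template timestamps."""
    started = datetime.strptime(start, TIMESTAMP_FORMAT)
    ended = datetime.strptime(end, TIMESTAMP_FORMAT)
    return int((ended - started).total_seconds() // 60)


def format_ttr(minutes: int) -> str:
    """Render minutes in the template's 'XX hrs XX min' style."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} hrs {mins:02d} min"


detection = "03/02/2026 09:00"
tta = elapsed_minutes(detection, "03/02/2026 09:08")
ttr = format_ttr(elapsed_minutes(detection, "03/02/2026 11:24"))
print(f"TTA: {tta} min, TTR: {ttr}")
```

The timestamps in the usage lines are made-up sample values, not data from a real incident.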

Communication Templates

Customer Notification — Incident Acknowledged

Subject: [P1/P2] Incident Notification — [Brief Description] — [INC-ID]

Dear [Client Name],

We are writing to inform you that we have identified an issue affecting [service/system name]. Our team is actively investigating.

Incident Details:

Incident ID: [INC-YYYY-NNNNNN]
Priority: [P1/P2]
Impact: [Description of what is affected]
Detected: [DD/MM/YYYY HH:MM AEST]

We will provide an update within [30 minutes for P1 / 1 hour for P2] or sooner if we have new information.

If you have any questions, please contact your Service Delivery Manager or call 1300-ASI-HELP.

Kind regards,
ASI AI Solutions Service Desk
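
The update commitment in the acknowledgement template (30 minutes for P1, 1 hour for P2) can be turned into a due time for the next client update. This is a minimal sketch; `next_update_due` is a hypothetical helper, and priorities other than P1 and P2 are rejected because the template defines no interval for them.

```python
from datetime import datetime, timedelta

# Update intervals promised in the acknowledgement template.
UPDATE_INTERVAL = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=1),
}


def next_update_due(priority: str, notified_at: datetime) -> datetime:
    """Latest time by which the next client update should be sent."""
    if priority not in UPDATE_INTERVAL:
        raise ValueError(f"No update interval defined for priority {priority!r}")
    return notified_at + UPDATE_INTERVAL[priority]
```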

Internal Escalation — Major Incident

Subject: MAJOR INCIDENT DECLARED — [Brief Description] — [INC-ID]

To: Major Incident Response Team, Head of Service Delivery, VP Operations

A Major Incident has been declared. Please join the bridge call immediately.

Bridge Call: [Teams meeting link]
Incident ID: [INC-YYYY-NNNNNN]
Impact: [X clients / Y users affected]
Service: [Affected service]
MIM: [Name]
Current Status: [Investigating / Containment]

Customer Notification — Incident Resolved

Subject: RESOLVED — [Brief Description] — [INC-ID]

Dear [Client Name],

We are pleased to confirm that the incident reported on [date] has been resolved.

Incident ID: [INC-YYYY-NNNNNN]
Resolution Time: [X hours Y minutes]
Root Cause: [Brief description]
Resolution: [Brief description of fix applied]
Preventive Actions: [What we are doing to prevent recurrence]

We are continuing to monitor the affected services closely. A full Post-Incident Review will be shared within 5 business days.

We apologise for any disruption caused. Please don't hesitate to reach out if you experience any further issues.

Kind regards,
[Service Delivery Manager Name]
ASI AI Solutions

Incident Management Metrics

Mean Time to Acknowledge (MTTA): 8 min
Mean Time to Resolve (MTTR): 2.4 hrs
First Contact Resolution: 72%
Monthly Incident Volume: 1,842
AI Auto-Resolved: 38%
Major Incidents / Month (avg): 2.1

Metric Definitions & Targets

Metric	Definition	Target	Current (Feb 2026)
MTTA	Time from incident creation to first human response	≤ 10 min (P1)	8 min
MTTR	Time from incident creation to resolution	≤ 4 hrs (P1)	2.4 hrs
First Contact Resolution	% of incidents resolved at first interaction without escalation	≥ 70%	72%
SLA Compliance	% of incidents meeting response and resolution SLA targets	≥ 95%	96.2%
Reopened Rate	% of resolved incidents reopened within 7 days	≤ 5%	3.8%
AI Auto-Resolution Rate	% of incidents resolved by AI without human intervention	≥ 40%	38%
Customer Satisfaction	Post-resolution CSAT survey average	≥ 4.5/5.0	4.6/5.0
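
The targets in the table above can be checked against the current values programmatically. The sketch below transcribes the Feb 2026 figures and lists any metric that misses its target; the `METRICS` structure and `missed_targets` are hypothetical, and the code assumes the MTTA and MTTR figures are comparable to their P1 targets, which the table does not state explicitly.

```python
import operator

# (metric, comparison that must hold for current vs target, target, current)
METRICS = [
    ("MTTA (min)", operator.le, 10, 8),
    ("MTTR (hrs)", operator.le, 4, 2.4),
    ("First Contact Resolution (%)", operator.ge, 70, 72),
    ("SLA Compliance (%)", operator.ge, 95, 96.2),
    ("Reopened Rate (%)", operator.le, 5, 3.8),
    ("AI Auto-Resolution Rate (%)", operator.ge, 40, 38),
    ("Customer Satisfaction (/5.0)", operator.ge, 4.5, 4.6),
]


def missed_targets(metrics):
    """Return the names of metrics whose current value misses the target."""
    return [
        name
        for name, meets, target, current in metrics
        if not meets(current, target)
    ]


print(missed_targets(METRICS))  # ['AI Auto-Resolution Rate (%)']
```

With the February 2026 figures, the only metric below target is the AI Auto-Resolution Rate (38% against a target of at least 40%).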