This document describes the recommended Azure monitors that can be implemented in Azure cloud application subscriptions.

SMT incident priority mapping

The priority “Blocker” is mostly used by developers to prioritize their tasks; it is not applicable to the operations team.

Azure alert severity | SMT incident priority | Time limit
0-CRITICAL | Critical | <= 4 hrs
1-ERROR | High | <= 12 hrs
2-WARNING | Medium | <= 48 hrs (2 days)
3-INFORMATIONAL | Low | <= 96 hrs (4 days)
4-VERBOSE | No ticket | Action based on the notification and analysis
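The mapping above can be sketched as a small lookup, for example in a script that turns Azure alerts into SMT tickets. This is only an illustration: the names `Priority`, `SEVERITY_TO_SMT`, and `smt_priority_for` are hypothetical, and the hour values are read as a time limit per ticket, which is an assumption about what the third column means.

```python
from typing import NamedTuple, Optional


class Priority(NamedTuple):
    """SMT priority for one Azure alert severity (hypothetical helper type)."""
    smt_priority: Optional[str]  # None means no ticket is raised
    max_hours: Optional[int]     # assumed meaning: time limit in hours


# Keys are Azure Monitor alert severities (0-4); values follow the mapping table.
SEVERITY_TO_SMT = {
    0: Priority("Critical", 4),
    1: Priority("High", 12),
    2: Priority("Medium", 48),
    3: Priority("Low", 96),
    4: Priority(None, None),  # Verbose: act on the notification, no ticket
}


def smt_priority_for(severity: int) -> Priority:
    """Return the SMT priority for an Azure alert severity, or raise ValueError."""
    try:
        return SEVERITY_TO_SMT[severity]
    except KeyError:
        raise ValueError(f"unknown Azure alert severity: {severity}") from None
```

Keeping the "Blocker" priority out of the lookup mirrors the note above that it is not used by the operations team.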

Resource | Monitor | Signal type | Condition | Evaluation frequency | Lookback period | Severity | Notification | Remarks
All Resources | Resource Health | Resource Health | Previous resource status = All, Current resource status = All | Always | Current status | 4-VERBOSE | MS Teams | Includes all future resource groups and future resources. Excludes “Virtual machine instance from VMSS”.
All Resources | Service Health | Service Health | Event types: Service issue, Planned maintenance, Health advisories, Security advisories | Always | Current status | 4-VERBOSE | MS Teams | Regions: North Europe, West Europe. Services: Alerts & Metrics, Activity Logs & Alerts and 21 more
Azure SQL Database | CPU | Metric | app_cpu_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email |
Azure SQL Database | CPU | Metric | app_cpu_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS Teams & Email |
Azure SQL Database | Memory | Metric | app_memory_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email |
Azure SQL Database | Memory | Metric | app_memory_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS Teams & Email |
Azure SQL Database | Space | Metric | allocated_data_storage greater or less than dynamic threshold | 15 mins | 1 hour | 2-WARNING | Email |
AKS - Node | Node CPU | Metric | node_cpu_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node: Include True
AKS - Node | Node Memory | Metric | node_memory_working_set_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node: Include True
AKS - Node | Node Disk | Metric | node_disk_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node: Include True
AKS - Node | Node Status (NotReady, Unknown) | Metric | kube_node_status_condition > 0 | 5 mins | 15 mins | 2-WARNING | Email |
AKS - Pods | Pod phases (Failed, Unknown, Pending) | Metric | kube_pod_status_phase >= 1 | 5 mins | 30 mins | 2-WARNING | Email | Phase of the pod: Include Failed, Unknown, Pending
AKS - Pods | Unschedulable Pods | Metric | unschedulable > 1 | 15 mins | 1 hour | 2-WARNING | Email |
AKS - Pods | Pods ready state percentage | Metric | podReadyPercentage (preview) | | | 2-WARNING | Email |
AKS - Containers | Restarting Containers | Metric | restarting container count (preview) | | | 2-WARNING | Email |
AKS - Containers | OOM killed containers | Metric | oomKilledContainerCount (preview) | | | 2-WARNING | Email |
AKS - Containers | CPU Exceeded Percentage | Metric | cpuExceededPercentage (preview) | | | 2-WARNING | Email |
AKS - Containers | Memory working set exceeded percentage | Metric | memoryWorkingSetExceededPercentage (preview) | | | 2-WARNING | Email |
Application Gateway | Unhealthy backend host | Metric | UnhealthyHostCount > 0 | 1 min | 5 mins | 0-CRITICAL | MS Teams & Email |
Application Gateway | Failed Requests | Metric | FailedRequests > 100 | 5 mins | 15 mins | 2-WARNING | Email |
Load balancer | SNAT Connection Status Count | Metric | SnatConnectionCount >= 1 | 5 mins | 15 mins | 2-WARNING | Email | Connection State = Failed, Pending
Public IP Addresses | Under DDoS attack or not | Metric | IfUnderDDoSAttack > 0 | 1 min | 5 mins | 0-CRITICAL | MS Teams & Email |
Virtual machine scale set | CPU Usage | Metric | Percentage CPU > 90 | 15 mins | 1 hour | 2-WARNING | Email |
Container Registry | Storage Used | Metric | StorageUsed > 90% of the storage size included in the SKU | 15 mins | 1 hour | 3-INFORMATIONAL | Email | Review which ACR SKU provides this metric
LogicApp | RunsFailed | Metric | RunsFailed > 0 | 1 hour | 12 hours | 3-INFORMATIONAL | Email |
Log Analytics workspace | Container SIGKILL Error | Logs | Table rows Count > 0 | 15 mins | 15 mins | 2-WARNING | Email | Signal KILL error
Log Analytics workspace | WAF_Possible_DDoS_Detected | Logs Query | count_ > 1000 | 15 mins | 15 mins | 1-ERROR | MS Teams & Email | WAF_Possible_DDoS_Detected
Log Analytics workspace | Node-restart-delayed triggered by Kured | Logs Query | | | | 2-WARNING | Email | Node-restart-delayed
Log Analytics workspace | Node-restart-successful-Kured Action | Logs Query | | | | OBSOLETE | | Node-restart-successful
Azure SQL Database / server | Vulnerability Scan Report | Vulnerability Scan Report | | | | | |
Failure | Failure Anomalies - ETAS-BCP-PT-Forensic-Logic-App | Failure Anomalies detected | 3-INFORMATIONAL | etas-bcp-pt-forensic-logic-app | Application Insights Smart detector
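As an illustration of how one static-threshold metric row from the list above could be turned into an alert rule, the sketch below builds the argument list for the Azure CLI command `az monitor metrics alert create`. The function name `metric_alert_command`, the rule name, the resource group, and the scope shown in the example are hypothetical placeholders, and the `avg` aggregation is an assumption, since this page does not state which aggregation each monitor uses.

```python
from typing import List


def metric_alert_command(
    name: str,
    resource_group: str,
    scope: str,
    metric: str,
    operator: str,
    threshold: float,
    frequency: str,
    window: str,
    severity: int,
    aggregation: str = "avg",  # assumption: not specified on this page
) -> List[str]:
    """Build the argv for `az monitor metrics alert create` for one rule."""
    # Condition string in the CLI's "<aggregation> <metric> <operator> <threshold>" form.
    condition = f"{aggregation} {metric} {operator} {threshold}"
    return [
        "az", "monitor", "metrics", "alert", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--scopes", scope,
        "--condition", condition,
        "--evaluation-frequency", frequency,  # how often the rule is checked
        "--window-size", window,              # lookback period per evaluation
        "--severity", str(severity),
    ]


if __name__ == "__main__":
    # Azure SQL Database CPU warning: app_cpu_percent > 80, 5 mins / 1 hour, 2-WARNING.
    cmd = metric_alert_command(
        name="sql-cpu-warning",          # hypothetical rule name
        resource_group="my-monitoring",  # hypothetical resource group
        scope="<resource-id-of-the-sql-database>",  # placeholder, not a real ID
        metric="app_cpu_percent",
        operator=">",
        threshold=80,
        frequency="5m",
        window="1h",
        severity=2,
    )
    print(" ".join(cmd))
```

Notification targets (MS Teams, Email) would be attached separately through an action group, which is left out of this sketch. Rows that use a dynamic threshold or a log query need a different rule type and are not covered here.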


ACR | Trigger an alert when images are created or updated in the ACR?
SQL DB | Alert on slow / long-running queries?
Service Principal secret / certificate expiry?
AKS | Check whether we can send an alert if k8s is not able to scale in a new worker node
Visualization of Kured/AKS alerts | Currently we don't have a dashboard / visualization for Kured alerts. | An overview over time would be helpful to

Refer:
Overview diagram of Container insights
Diagram that explains Azure Monitor alerts.





