Tuesday, May 14, 2019

Supplementing AWS CloudWatch Alarm capability - Watch over Lambda function

I haven't heard back from AWS Support about the issue in my last message, so I created a Jenkins job that audits whether the Lambda function is still running. Here is the Bash script that implements the check:

#!/usr/bin/env bash

# Used by a Jenkins job to monitor for an AWS Lambda function failing to fire
# every 90 minutes

# Algorithm:
#   Query CloudWatch for Lambda invocation datapoints in the
#   last $AlarmTime minutes. If there are none, then the
#   scheduled Lambda function, which should run every
#   90 minutes, has failed to fire. When that alarm condition
#   is first met, a file is created to record it, and only
#   that first run of the job ends in failure. The failure
#   is what reports the problem. If the alarm condition
#   is met but the file already exists, the job won't fail.
#   As an escape hatch, an alarm file that is more than
#   24 hours old is deleted, so a neglected alarm comes
#   back every day.

AlarmTime=95  # minutes; alarm if no invocation is seen within this window


Profile='ioce'
Region='us-east-1'
Namespace='AWS/Lambda'
Metric='Invocations'
Stat='SampleCount'
Dime='Name=FunctionName,Value=lambda_chef_converge_check'
AlarmFile='alarmOn'

OffsetExpression="$AlarmTime minutes ago"
StartTime=$(date -u -d "$OffsetExpression" +'%Y-%m-%dT%TZ')
EndTime=$(date -u +'%Y-%m-%dT%TZ')

# Get the metrics and test for DATAPOINTS
if aws --profile "$Profile" --region "$Region" \
     cloudwatch get-metric-statistics \
     --namespace "$Namespace" --metric-name "$Metric" \
     --start-time "$StartTime" --end-time "$EndTime" \
     --period 60 --statistics "$Stat" --dimensions "$Dime" \
     --output text | grep -q '^DATAPOINTS'; then
  # Found datapoints, things are fine. Clear alarm file
  rm -f "$AlarmFile"
else
  # No datapoints found, we are missing a point, so alarm
  # if we haven't already done so for this episode
  if [ ! -f "$AlarmFile" ]; then
    touch "$AlarmFile"
    exit 1 # get the job to fail
  else
    # Delete the alarm file once it is over 24 hours (1440 minutes) old
    find . -maxdepth 1 -type f -name "$AlarmFile" -cmin +1440 -delete
  fi
fi
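
The alarm-file logic in the script can be tried out without any AWS access. The following is only a sketch under that assumption: the CloudWatch query is replaced by a plain argument, and the function name alarm_step and its two parameters are hypothetical, not part of the script above.

```shell
#!/usr/bin/env bash
# Sketch of the alarm-file latch, with the CloudWatch query stubbed out.
# alarm_step and its arguments are hypothetical names used for illustration.

# alarm_step <found> <alarm_file>
#   found: 0 means datapoints were found (healthy); anything else means none.
#   Returns 1 only on the first unhealthy check of an episode, 0 otherwise.
alarm_step() {
  local found=$1 alarm_file=$2
  if [ "$found" -eq 0 ]; then
    rm -f "$alarm_file"          # healthy: clear the latch
    return 0
  fi
  if [ ! -f "$alarm_file" ]; then
    touch "$alarm_file"          # first miss: set the latch and fail
    return 1
  fi
  # Repeat miss: stay quiet, but drop a latch that is over 24 hours old
  find "$(dirname "$alarm_file")" -maxdepth 1 -type f \
    -name "$(basename "$alarm_file")" -cmin +1440 -delete
  return 0
}
```

Calling alarm_step 1 alarmOn twice in a row returns 1 the first time and 0 the second, which is the "fail only once per episode" behaviour described in the script's comments. A later alarm_step 0 alarmOn removes the file, so the next miss fails the job again.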

I created a Jenkins job to pull this script from GitHub and execute it on this schedule: H/15 * * * *. Each run takes about 1.5 seconds from the Jenkins trigger.
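
As a rough back-of-the-envelope check on what this schedule means for detection time, the arithmetic can be sketched as below. It assumes the job runs about every 15 minutes and ignores any delay in CloudWatch publishing the metric. The variable names are made up for the example.

```shell
#!/usr/bin/env bash
# Rough detection-delay arithmetic; all values are in minutes.
lambda_interval=90   # how often the Lambda is expected to fire
alarm_window=95      # AlarmTime in the script
poll_interval=15     # assumed spacing between Jenkins runs

# Delay is measured from the moment the missed invocation was due.
best_case=$(( alarm_window - lambda_interval ))
worst_case=$(( best_case + poll_interval ))

echo "detected between ${best_case} and ${worst_case} minutes after the missed run"
```

Under those assumptions, a missed invocation should fail the job somewhere between 5 and 20 minutes after it was due.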
