I haven't heard back from AWS support about the issue in my last message, so I created a Jenkins job that audits whether the Lambda function is still running. Here is the bash script it runs:
```bash
#!/usr/bin/env bash
# Used by a Jenkins job to detect an AWS Lambda function that has failed
# to fire on its 90-minute schedule.
#
# Algorithm:
# Get the CloudWatch Invocations datapoints for the Lambda function over
# the last $AlarmTime minutes. If there are none, the scheduled function,
# which should run every 90 minutes, has missed a run. When the alarm
# condition is first met, an alarm file is created and the job fails;
# that failure is what reports the problem. While the alarm file exists,
# further misses do not fail the job, so one episode produces one alert.
# As an escape hatch, an alarm file older than 24 hours is deleted, so
# if the alarm is neglected it will fire again once a day.

AlarmTime=95 # minutes - alarm if no invocation within this window
Profile='ioce'
Region='us-east-1'
Namespace='AWS/Lambda'
Metric='Invocations'
Stat='SampleCount'
Dime='Name=FunctionName,Value=lambda_chef_converge_check'
AlarmFile='alarmOn'

# GNU date: ISO 8601 UTC timestamps bounding the look-back window
OffsetExpression="$AlarmTime minutes ago"
StartTime=$(date -u -d "$OffsetExpression" +'%Y-%m-%dT%TZ')
EndTime=$(date -u +'%Y-%m-%dT%TZ')

# Get the metrics and test for DATAPOINTS lines in the text output.
# Note: if the aws call itself fails, grep finds nothing and the run
# is treated as a miss.
if aws --profile "$Profile" --region "$Region" \
    cloudwatch get-metric-statistics \
    --namespace "$Namespace" --metric-name "$Metric" \
    --start-time "$StartTime" --end-time "$EndTime" \
    --period 60 --statistics "$Stat" --dimensions "$Dime" \
    --output text | grep -q '^DATAPOINTS'; then
  # Found datapoints, things are fine. Clear the alarm file
  rm -f "$AlarmFile"
else
  # No datapoints found, so a run was missed. Alarm only if we
  # haven't already done so for this episode
  if [ ! -f "$AlarmFile" ]; then
    touch "$AlarmFile"
    exit 1 # make the Jenkins job fail
  else
    # Delete the alarm file once it is more than 24 hours (1440 min) old
    find . -maxdepth 1 -type f -name "$AlarmFile" -cmin +1440 -delete
  fi
fi
```
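To check the alarm-file latch without calling AWS, the decision logic can be pulled out into a function and exercised in a scratch directory. This is a minimal local sketch, not what the Jenkins job runs: check_alarm is a hypothetical helper name, and the CloudWatch query is replaced by a flag (1 = datapoints found, 0 = none found).

```bash
#!/usr/bin/env bash
# Hypothetical helper isolating the alarm-file latch from the audit script.
# $1: 1 if CloudWatch returned datapoints, 0 otherwise (stands in for the aws call)
# $2: alarm file name (a bare name in the current directory)
# Returns 1 (job fails) only on the first miss of an episode.
check_alarm() {
  local found=$1 alarm_file=$2
  if [ "$found" -eq 1 ]; then
    rm -f "$alarm_file"      # healthy: clear any previous alarm
    return 0
  fi
  if [ ! -f "$alarm_file" ]; then
    touch "$alarm_file"      # first miss: latch and fail
    return 1
  fi
  # Already alarmed: expire the latch after 24 hours (1440 minutes)
  find . -maxdepth 1 -type f -name "$alarm_file" -cmin +1440 -delete
  return 0
}

# Run in a throwaway directory so no real alarm file is touched
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

check_alarm 0 alarmOn; echo "miss 1 -> $?"
check_alarm 0 alarmOn; echo "miss 2 -> $?"
check_alarm 1 alarmOn; echo "ok     -> $?"
check_alarm 0 alarmOn; echo "miss 3 -> $?"
```

The four calls replay one outage episode and print 1, 0, 0, 1: the first miss fails the job, the repeated miss stays quiet, a healthy check clears the latch, and the next miss alerts again. The 24-hour expiry is not exercised here, since a file's change time cannot simply be backdated.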
I created a Jenkins job that pulls the audit script from GitHub and runs it on the schedule H/15 * * * * (every 15 minutes, with a hash-based minute offset). Each run takes about 1.5 seconds from the Jenkins trigger.
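The look-back window and the polling schedule together bound how late an alarm can be. Suppose the last good invocation was at t0, so the missed run was due at t0+90; checks keep seeing the t0 datapoint until t0+95, and the first check after that can be up to 15 minutes later. A rough sketch of that arithmetic, assuming CloudWatch reports invocations promptly (ingestion delay ignored) and Jenkins really fires every 15 minutes; Interval and Poll are names introduced here for illustration:

```bash
#!/usr/bin/env bash
# Rough detection-latency bounds for the settings in this post.
Interval=90   # minutes between scheduled Lambda runs
AlarmTime=95  # look-back window used by the audit script
Poll=15       # Jenkins schedule: H/15

# The previous invocation stays inside the window for AlarmTime minutes;
# the alarm fires at the first poll after that.
earliest=$((AlarmTime - Interval))
latest=$((AlarmTime - Interval + Poll))
echo "alarm raised ${earliest}-${latest} minutes after the missed run"
```

The same reasoning shows why AlarmTime has to exceed 90: with a window shorter than the schedule interval, a healthy function would trip the alarm in the gap between runs, while the 5-minute margin tolerates a slightly late run.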