Tuesday, May 14, 2019

Supplementing AWS CloudWatch Alarm capability - Watch over Lambda function

I haven't heard back from AWS support on the subject of my last message, so I created a Jenkins job to handle this auditing and ensure the Lambda function is running. Here is the bash shell script I used to implement it:

#!/usr/bin/env bash

# Used by a Jenkins job to monitor for an AWS Lambda function failing to fire
# every 90 minutes

# Algorithm:
#   Get any CloudWatch events of Lambda invocation in the
#   last $AlarmTime minutes. If there are none, then the
#   scheduled Lambda function, which should run every
#   90 minutes, has missed an invocation. Once the alarm
#   condition is satisfied, a file is created to record
#   that. Only the first time will this job end in a
#   failure, and that failure results in reporting the
#   problem. So, if the alarm condition is satisfied
#   but the file exists, the job won't fail.
#   However, we leave an escape hatch: an alarm file
#   that is present for over 24 hours will be deleted.
#   So, if this alarm is neglected, it will come back every day.

AlarmTime=95  # minutes - if late this much, alarm


Profile='ioce'
Region='us-east-1'
Namespace='AWS/Lambda'
Metric='Invocations'
Stat='SampleCount'
Dime='Name=FunctionName,Value=lambda_chef_converge_check'
AlarmFile='alarmOn'

OffsetExpression="$AlarmTime minutes ago"
StartTime=$(date -u -d "$OffsetExpression" +'%Y-%m-%dT%TZ')
EndTime=$(date -u +'%Y-%m-%dT%TZ')

# Get the metrics and test for DATAPOINTS
aws --profile $Profile --region $Region              \
  cloudwatch  get-metric-statistics                   \
  --namespace $Namespace --metric-name $Metric         \
  --start-time $StartTime --end-time $EndTime           \
  --period 60 --statistics "$Stat" --dimensions "$Dime"  \
  --output text | grep -q '^DATAPOINTS'
if [ $? -eq 0 ]; then
  # Found datapoints, things are fine. Clear alarm file
  rm -f $AlarmFile
else
  # No datapoints found, we are missing a point, so alarm
  # if we haven't already done so for this episode
  if [ ! -f $AlarmFile ]; then
    touch $AlarmFile
    exit 1 # get the job to fail
  else
    # Check if it is time to delete the file
    find . -maxdepth 1 -type f -name "$AlarmFile" -cmin +1440 -delete
  fi
fi

I created a Jenkins job to pull this script from GitHub and execute it on this timer schedule: H/15 * * * * (i.e., every 15 minutes). It takes about 1.5 seconds to run from the Jenkins trigger.

Monday, May 13, 2019

Back-filling AWS CloudWatch Events

I learned a lot more about how AWS handles events and alarms in CloudWatch (CW). The lesson is: CW events are posted at the time of the event (as measured at the source), not at the delivery time. This means the data always look spiky if you are watching (with a CW alarm) for one missing event, because there is some delay in delivering the event. This graphic illustrates the effect:
[Graphic: metric timeline in which the missing datapoint appears as a spike spanning the delivery delay Δt]
You can see the spike over the Δt interval, the delay in delivering the event. I would not label the AWS approach "back-filling" but rather offsetting: if it were truly filling, the glitch would actually be filled in and would vanish.

AWS' approach is obviously not very useful in cases where you have a delay in the arrival of the events (i.e. always) and you want to monitor for one missing point. I would argue for a better approach, the one actually suggested by the name "back-FILL": when a sample arrives, fill in its presence over the entire period between the event time and the CW delivery time. This could be done by tracking the two times (event creation and CW arrival) and choosing the earlier time when the event marks the beginning of an interval and the later time when it marks the end.
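To make the proposal concrete, here is a minimal Python sketch of that back-fill idea (my own illustration, not an AWS feature): each sample is marked present over its whole event-to-arrival span. The sample data is hypothetical.

# Sketch of the proposed back-fill: mark each sample as present over
# the entire span between its event time and its CW arrival time.
# Input: list of (event_minute, arrival_minute) pairs -- hypothetical data.
def backfill(samples):
  covered = set()
  for event_t, arrival_t in samples:
    covered.update(range(event_t, arrival_t + 1))
  return covered

# An event created at minute 90 but delivered at minute 93 covers 90-93,
# so the delivery delay no longer reads as missing data
print(sorted(backfill([(0, 2), (90, 93)])))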
The CW implementation of averaging is sample-based. If time-based averaging were available, this problem would be simple to solve, because averaged over a longer period the spike would be insignificant. If we expect 4 invocations in the averaging period, the time average will be 1.0 when no event is missing. When an event is missed, the time-based average drops from 1.0 linearly until it reaches 0.75 at the point of the next invocation. One could then set the threshold at 0.9, or higher or lower; that gives you a slider to tune out noise or increase responsiveness, and all delay effects would be minimized.
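To check that arithmetic, here is a small, self-contained Python simulation (not part of any script in this post): a day of 90-minute invocations with one missed event, and a trailing time-based average over a window sized for 4 invocations. The average bottoms out at 0.75, as described.

PERIOD = 90           # minutes between scheduled invocations
WINDOW = 4 * PERIOD   # averaging window that should hold 4 invocations

minutes = 24 * 60
invocations = set(range(0, minutes, PERIOD))
invocations.discard(9 * PERIOD)   # drop one invocation to simulate a miss

# presence(t) = 1 if some invocation occurred in the trailing PERIOD minutes
def presence(t):
  return 1 if any(i in invocations
                  for i in range(max(0, t - PERIOD + 1), t + 1)) else 0

samples = [presence(t) for t in range(minutes)]

# trailing time-based average over WINDOW minutes
averages = [sum(samples[max(0, t - WINDOW + 1):t + 1]) / min(t + 1, WINDOW)
            for t in range(minutes)]
print(f'minimum time-based average: {min(averages):.2f}')   # prints 0.75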
If I can't find a solution using CW alone, I will need a different option: maybe a Lambda function running in a different region than the one I am monitoring, or maybe a Jenkins job that uses the AWS CLI to query CloudWatch. It is a shame not to be able to get this done purely in CW.
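For the record, here is a minimal boto3 sketch of the watchdog-Lambda option (the function would be deployed in a different region and query the monitored one). The SNS topic ARN is a placeholder, and error handling is omitted.

# Watchdog Lambda sketch: alarms if the monitored function has not
# fired within the last 95 minutes. The SNS ARN below is a placeholder.
import datetime
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
sns = boto3.client('sns')

def lambda_handler(event, context):
  now = datetime.datetime.utcnow()
  resp = cloudwatch.get_metric_statistics(
      Namespace='AWS/Lambda',
      MetricName='Invocations',
      Dimensions=[{'Name': 'FunctionName',
                   'Value': 'lambda_chef_converge_check'}],
      StartTime=now - datetime.timedelta(minutes=95),
      EndTime=now,
      Period=60,
      Statistics=['SampleCount'])
  if not resp['Datapoints']:
    # no invocations in the window, so report the miss
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:lambda-watchdog',
        Message='lambda_chef_converge_check missed its 90-minute schedule')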


Wednesday, May 1, 2019

GitHub Repo setup

Problem: We needed a handy way to create GitHub repositories, enforce some naming restrictions and add appropriate team access. So, I automated this with Python 3.6+.

Along the way, I pulled some important tips from Bill Hunt's Gist addressing the same need.

I will discuss my code below, in sections, but look for the full code at the end of this post. The first section holds the comments on the requirements to satisfy before this script can do its job. Note that the shebang line names python3.6. You should really be on 3.7 by now, however!


#!/usr/bin/env python3.6

# Creates a new GitHub repo and adds a team owner

# Setting up to run this:
#   1) Install python 3.6 (brew install python). Ensure
#      the shebang line of this script works for your
#      system.
#   2) pip3 install PyGithub (maybe pip or pip3.6 - whatever 
#      works for the python3.6 or 3.7 install)
#   3) Create a GitHub personal access token at 
#      https://github.com. Click your user picture in the
#      upper right, choose settings, then Developer settings
#      and Personal access tokens. Create a new token.
#      The scopes need only include repo, plus read:org
#      under the admin:org heading.
#   4) Insert that personal access token (PAT) into the git
#      global configuration file with:
#        git config --global user.pat <token>
#   5) Also, ensure you have ssh set up to be used for
#      GitHub access
#   6) Edit this script to change the org_name and team_id
#      constants to match your needs
Next come the import statements. Most are from the Python standard library; the two exceptions are requests and PyGithub.

import os
import sys
import argparse
import getpass
import subprocess
import re
import json
import datetime
import zipfile
from urllib.parse import urlparse

import requests
from github import Github
Some constants are defined. You need to edit org_name and team_id to match your GitHub configuration.

## Constants
org_name = 'dummy_org'
base_uri = f'https://git@github.com/{org_name}/'
repos_uri = f'https://api.github.com/orgs/{org_name}/repos'
teams_uri = f'https://api.github.com/orgs/{org_name}/teams'
# Find team ID via API call:
#   curl -H "Authorization: token <token>"  \
#         https://api.github.com/orgs/{org-name}/teams
team_id = 12345678
Next comes the code related to getting the list of existing repos. This lets us ensure that the new repo name does not conflict with an existing one. First, the getLastPage() function parses the Link header returned by the GitHub API call that lists repos and, from that, determines how many pages of repos exist. That number drives the loop over pages in getRepos().

# Used to determine how many pages there are on the repo list
def getLastPage(link_header):
  links = link_header.split(',')
  for link in links:
    link = link.split(';')
    if len(link) == 2 and link[1].strip() == 'rel="last"':
      parsed_url = urlparse(link[0].strip(' <>'))
      link_data = parsed_url.query.split('&')
      last_page = 0   # in case no page= parameter is present
      for ld in link_data:
        if ld.startswith('page='):
          last_page = int( ld.split('=')[1] )
      return last_page
  return 0

# Make a list of all repos in organization
def getRepos(payload):
  print('getting list of all repos, please wait...')
  repos = list()
  head = requests.get(repos_uri, params = payload)
  # a missing Link header means everything fits on one page
  last_page = getLastPage(head.headers.get('link', ''))
  if last_page == 0:
    last_page = 1
  for page in range(1, last_page+1):
    repo_payload = payload.copy()
    repo_payload['page'] = page
    resp = requests.get(repos_uri, params = repo_payload)
    repos_raw = json.loads(resp.text)
    new_repos = [ r['name'] for r in repos_raw ]
    repos += new_repos
  return repos
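To see what getLastPage() consumes, here is a hypothetical Link header of the kind GitHub returns for a paged listing, and what the function extracts from it:

# A hypothetical Link header for a 5-page repo list
link_header = ('<https://api.github.com/orgs/dummy_org/repos?page=2>; '
               'rel="next", '
               '<https://api.github.com/orgs/dummy_org/repos?page=5>; '
               'rel="last"')
print(getLastPage(link_header))   # -> 5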
Now, executed in the main body, we retrieve the GitHub personal access token that was stored in the Git global configuration (see the opening comments on required setup). The token is read by running a shell command in a subprocess. From it, we build a payload for the GitHub API calls that follow.

# Get the github personal access token from the git
# global config
proc = subprocess.Popen(["git","config","--global",
                         "--get","user.pat"],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT)
access_token = proc.stdout.read().strip().decode()
payload = {'access_token': access_token}
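As a design note, the same retrieval can be done in one call with subprocess.check_output, which additionally raises an error if git exits non-zero. A sketch equivalent to the Popen code above:

# Alternative: check_output raises CalledProcessError on a non-zero exit
access_token = subprocess.check_output(
    ['git', 'config', '--global', '--get', 'user.pat']).strip().decode()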
Next, build the list of existing repo names and parse the command-line argument: the desired repo name.

# Build a list of repo names so we can check for a collision
# repos come in a paged form, so have to loop over them
repos = getRepos(payload)

# Parse command line
parser = argparse.ArgumentParser()
parser.add_argument('repo_name', help='desired repo name')
args = parser.parse_args()

# See if the repo already exists
if args.repo_name in repos:
  print(f'Repository {args.repo_name} already exists!')
  sys.exit(1)
Next, we actually create the repo via the PyGithub package. This is the start of a large try block.
 
try:
  # create the repo
  g = Github(access_token)
  my_org = g.get_organization(org_name)

  repo_name = args.repo_name
  repo_description = f'repo for {repo_name}'
  new_repo = my_org.create_repo(repo_name,
                                description = repo_description,
                                private = True,
                                auto_init = False)
Now we add team permission to this new repo.

  # add a team to the repo with admin permissions
  teamrepo_uri = f'https://api.github.com/teams/{team_id}' +  \
                 f'/repos/{org_name}/{repo_name}'
  team_payload = payload.copy()
  # choose the permission you want for this team
  team_payload['permission'] = 'admin'
  repo_r = requests.put(teamrepo_uri, params = payload,
                        data = json.dumps(team_payload))
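As an aside, PyGithub can make this call for us as well. A sketch, assuming your PyGithub version provides Team.set_repo_permission (it wraps this same PUT endpoint):

  # Alternative sketch using PyGithub instead of a raw PUT
  team = my_org.get_team(team_id)
  team.set_repo_permission(new_repo, 'admin')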
Next, handle any exceptions. This is very bad code, because it is not designed to handle specific exceptions: it catches everything, prints a message, and exits with status 1. It is suitable to run now, but should be refined to handle whatever specific exceptions do occur. So far, none have been encountered!

except:
  print("Exception type: %s, Exception arg: %s\nException Traceback:\n%s" %
        (sys.exc_info()[0], sys.exc_info()[1], sys.exc_info()[2]))
  print('\nError in creating or configuring repo.',
        'Check it out and try again')
  sys.exit(1)
That's it. So, here is the full code of this script:

#!/usr/bin/env python3.6

# Creates a new GitHub repo and adds a team owner

# Setting up to run this:
#   1) Install python 3.6 (brew install python). Ensure
#      the shebang line of this script works for your
#      system.
#   2) pip3 install PyGithub (maybe pip or pip3.6 - whatever 
#      works for the python3.6 or 3.7 install)
#   3) Create a GitHub personal access token at 
#      https://github.com. Click your user picture in the
#      upper right, choose settings, then Developer settings
#      and Personal access tokens. Create a new token.
#      The scopes need only include repo, plus read:org
#      under the admin:org heading.
#   4) Insert that personal access token (PAT) into the git
#      global configuration file with:
#        git config --global user.pat <token>
#   5) Also, ensure you have ssh set up to be used for
#      GitHub access
#   6) Edit this script to change the org_name and team_id
#      constants to match your needs

import os
import sys
import argparse
import getpass
import subprocess
import re
import json
import datetime
import zipfile
from urllib.parse import urlparse

import requests
from github import Github

## Constants
org_name = 'dummy_org'
base_uri = f'https://git@github.com/{org_name}/'
repos_uri = f'https://api.github.com/orgs/{org_name}/repos'
teams_uri = f'https://api.github.com/orgs/{org_name}/teams'
# Find team ID via API call:
#   curl -H "Authorization: token <token>"  \
#         https://api.github.com/orgs/{org-name}/teams
team_id = 12345678

# Used to determine how many pages there are on the repo list
def getLastPage(link_header):
  links = link_header.split(',')
  for link in links:
    link = link.split(';')
    if link[1].strip() == 'rel="last"':
      parsed_url = urlparse(link[0].strip(' <>'))
      link_data = parsed_url.query.split('&')
      for ld in link_data:
        if ld.startswith('page='):
          last_page = int( ld.split('=')[1] )
      return last_page
  return 0

# Make a list of all repos in organization
def getRepos(payload):
  print('getting list of all repos, please wait...')
  repos = list()
  head = requests.get(repos_uri, params = payload)
  # a missing Link header means everything fits on one page
  last_page = getLastPage(head.headers.get('link', ''))
  if last_page == 0:
    last_page = 1
  for page in range(1, last_page+1):
    repo_payload = payload.copy()
    repo_payload['page'] = page
    resp = requests.get(repos_uri, params = repo_payload)
    repos_raw = json.loads(resp.text)
    new_repos = [ r['name'] for r in repos_raw ]
    repos += new_repos
  return repos

# Get the github personal access token from the git
# global config
proc = subprocess.Popen(["git","config","--global",
                         "--get","user.pat"],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT)
access_token = proc.stdout.read().strip().decode()
payload = {'access_token': access_token}

# Build a list of repo names so we can check for a collision
# repos come in a paged form, so have to loop over them
repos = getRepos(payload)

# Parse command line
parser = argparse.ArgumentParser()
parser.add_argument('repo_name', help='desired repo name')
args = parser.parse_args()

# See if the repo already exists
if args.repo_name in repos:
  print(f'Repository {args.repo_name} already exists!')
  sys.exit(1)

try:
  # create the repo
  g = Github(access_token)
  my_org = g.get_organization(org_name)

  repo_name = args.repo_name
  repo_description = f'repo for {repo_name}'
  new_repo = my_org.create_repo(repo_name,
                                description = repo_description,
                                private = True,
                                auto_init = False)

  # add a team to the repo with admin permissions
  teamrepo_uri = f'https://api.github.com/teams/{team_id}' +  \
                 f'/repos/{org_name}/{repo_name}'
  team_payload = payload.copy()
  # choose the permission you want for this team
  team_payload['permission'] = 'admin'
  repo_r = requests.put(teamrepo_uri, params = payload,
                        data = json.dumps(team_payload))

except:
  print("Exception type: %s, Exception arg: %s\nException Traceback:\n%s" %
        (sys.exc_info()[0], sys.exc_info()[1], sys.exc_info()[2]))
  print('\nError in creating or configuring repo.',
        'Check it out and try again')
  sys.exit(1)