Saturday, February 15, 2020

Dealing With Sparse Monitoring Data on Google Cloud Platform - Get the Raw Data!


I just wrote a blog post for my company, New Context Security, on how to deal with sparse monitoring data on Google Cloud Platform, using a Python script to retrieve the actual data:

https://newcontext.com/python-to-retrieve-gcp-stackdriver-monitoring-data/

Tuesday, May 14, 2019

Supplementing AWS CloudWatch Alarm capability - Watch over Lambda function

I haven't heard back from AWS support on the subject of my last message, so I created a Jenkins job to handle this auditing and ensure the Lambda function is running. Here is the bash shell script I used to implement it:

#!/usr/bin/env bash

# Used by a Jenkins job to monitor for an AWS Lambda function failing to fire
# every 90 minutes

# Algorithm:
#   Get any CloudWatch datapoints for Lambda invocations in the
#   last $AlarmTime minutes. If there are none, the scheduled
#   Lambda function (which should run every 90 minutes) has
#   missed a run, so we alarm. When the alarm condition is met,
#   a marker file is created to record that, and only the first
#   time does this job end in a fail. That fail is what reports
#   the problem; if the alarm condition persists but the file
#   already exists, the job won't fail again. However, we leave
#   an escape hatch: a marker file older than 24 hours is
#   deleted, so if the alarm is neglected it will come back
#   every day.

AlarmTime=95  # minutes - if late this much, alarm


Profile='ioce'
Region='us-east-1'
Namespace='AWS/Lambda'
Metric='Invocations'
Stat='SampleCount'
Dime='Name=FunctionName,Value=lambda_chef_converge_check'
AlarmFile='alarmOn'

OffsetExpression="$AlarmTime minutes ago"
StartTime=$(date -u -d "$OffsetExpression" +'%Y-%m-%dT%TZ')
EndTime=$(date -u +'%Y-%m-%dT%TZ')

# Get the metrics and test for DATAPOINTS
aws --profile $Profile --region $Region              \
  cloudwatch  get-metric-statistics                   \
  --namespace $Namespace --metric-name $Metric         \
  --start-time $StartTime --end-time $EndTime           \
  --period 60 --statistics "$Stat" --dimensions "$Dime"  \
  --output text|grep -q '^DATAPOINTS'
if [ 0 -eq $? ];then
  # Found datapoints, things are fine. Clear alarm file
  rm -f $AlarmFile
  
else
  # No datapoints found, we are missing a point, so alarm
  # if we haven't already done so for this episode
  if [ ! -f $AlarmFile ];then
    touch $AlarmFile
    exit 1 # get the job to fail
  else
    # Check if it is time to delete the file
    find . -maxdepth 1 -type f -name "$AlarmFile" -cmin +1440 -delete
  fi
fi

I created a Jenkins job that pulls this script from GitHub and executes it on this schedule: H/15 * * * * (Jenkins cron syntax, roughly every 15 minutes). It takes about 1.5 seconds to run from the Jenkins trigger.

Monday, May 13, 2019

Back-filling AWS CloudWatch Events

I learned a lot more about how AWS handles events and alarms in CloudWatch (CW). The lesson is: CW events are posted at the time of the event (as measured at the source) and not at the delivery time. This means the data is always spiky if you are looking (with a CW alarm) for one missing event. It is spiky because there is some delay for delivery of the event. This graphic illustrates this:

[Figure: metric timeline showing no datapoint at the expected event time, then a spike Δt later when the delayed datapoint is delivered.]

You can see the spike over the Δt interval, the delay in delivering the event. I would not label the AWS approach as "back-filling", but rather offsetting. If it were filling, the glitch would actually be filled in, so that it vanished.

AWS's approach is not very useful when there is a delay in the arrival of the events (i.e. always) and you want to monitor for a single missing point. I would argue for a better approach, the one actually suggested by the name "back-FILL": when a sample arrives, fill in its presence over the entire interval between the event time and the CW delivery time. This could be done by tracking both times (event creation and CW arrival) and using the earlier one when the event marks the beginning of an interval and the later one when it marks the end.
The CW implementation of averaging is sample-based. If time-based averaging were available, this problem would be simple to solve: averaged over a longer period, the spike would be insignificant. If we expect 4 invocations in the averaging period, the time average will be 1.0 when no event is missing. When an event is missed, the time-based average drops from 1.0 linearly until it reaches 0.75 at the point of the next invocation. One could set the threshold at 0.9, or move it higher or lower; that gives you a slider to tune out noise or increase responsiveness. All delay effects would be minimized.
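To make that arithmetic concrete, here is a toy Python sketch (not CloudWatch code, and a coarse, stepwise discretization of the idea; a true time-weighted average would decay smoothly rather than in steps) of a 4-slot sliding window over invocations expected every 90 minutes:

# Toy illustration only: invocations expected every 90 minutes, averaged
# over a trailing 360-minute window that should contain 4 of them.
EXPECTED_PERIOD = 90    # minutes between expected invocations
WINDOW = 360            # minutes; 4 expected invocations per window
EXPECTED_COUNT = WINDOW // EXPECTED_PERIOD

# Invocation times in minutes; the invocation at t=450 was missed.
invocations = [0, 90, 180, 270, 360, 540, 630, 720, 810]

def windowed_average(t):
  seen = sum(1 for i in invocations if t - WINDOW < i <= t)
  return seen / EXPECTED_COUNT

for t in range(360, 811, 90):
  print(t, windowed_average(t))
# Prints 1.0 while all slots are filled, 0.75 across the missed slot, and
# 1.0 again once the gap ages out of the window; a 0.9 threshold catches it.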
If I can't find a solution using CW alone, I will need to come up with a different option. Maybe a Lambda function running in a different region than the one I am monitoring. Maybe a Jenkins job that uses the AWS CLI to query CloudWatch. It is a shame not to be able to get this done in CW alone.
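As a starting point for either option, here is a minimal sketch of that out-of-band check using boto3. The profile, region, function name and 95-minute window mirror my setup; inside a Lambda function you would drop the named profile and rely on the execution role. This is a sketch, not production code:

#!/usr/bin/env python3
# Exit non-zero if the Lambda function has not been invoked in the last
# ALARM_MINUTES minutes, so the caller (Jenkins, another Lambda) can alarm.
import datetime
import sys

import boto3

ALARM_MINUTES = 95
FUNCTION_NAME = 'lambda_chef_converge_check'

session = boto3.Session(profile_name='ioce', region_name='us-east-1')
cw = session.client('cloudwatch')

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(minutes=ALARM_MINUTES)

resp = cw.get_metric_statistics(
  Namespace='AWS/Lambda',
  MetricName='Invocations',
  Dimensions=[{'Name': 'FunctionName', 'Value': FUNCTION_NAME}],
  StartTime=start,
  EndTime=end,
  Period=60,
  Statistics=['SampleCount'],
)

if not resp['Datapoints']:
  print(f'No invocations of {FUNCTION_NAME} in the last {ALARM_MINUTES} minutes')
  sys.exit(1)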


Wednesday, May 1, 2019

GitHub Repo setup

Problem: We needed a handy way to create GitHub repositories, enforce some naming restrictions and add appropriate team access. So, I automated this with Python 3.6+.

Along the way, I pulled some important tips from Bill Hunt's Gist addressing the same need.

I will discuss the code below in sections, but look for the full script at the end of this post. The first section is the comment block describing the requirements for running this script. Note that the shebang line specifies Python 3.6; you should really be on 3.7 by now, however!


#!/usr/bin/env python3.6

# Creates a new GitHub repo and adds a team owner

# Setting up to run this:
#   1) Install python 3.6 (brew install python). Ensure
#      the shebang line of this script works for your
#      system.
#   2) pip3 install PyGithub (maybe pip or pip3.6 - whatever 
#      works for the python3.6 or 3.7 install)
#   3) Create a GitHub personal access token at 
#      https://github.com. Click your user picture in the
#      upper right, choose settings, then Developer settings
#      and Personal access tokens. Create a new token.
#      The scopes only need to include repo, plus the
#      read:org scope under the admin:org heading.
#   4) Insert that personal access token (PAT) in the git
#      global configuration file with: 
#        git config --global user.pat <your-token>
#   5) Also, ensure you have ssh set up to be used for
#      GitHub access
#   6) Edit this script to change the org_name and team_id
#      constants to match your needs
Next come the import statements. Most are Python standard libraries; the two third-party packages are requests and PyGithub.

import os
import sys
import argparse
import getpass
import subprocess
import re
import json
import datetime
import zipfile
from urllib.parse import urlparse

import requests
from github import Github
Some constants are defined. You need to edit org_name and team_id to match your GitHub configuration.

## Constants
org_name = 'dummy_org'
base_uri = f'https://git@github.com/{org_name}/'
repos_uri = f'https://api.github.com/orgs/{org_name}/repos'
teams_uri = f'https://api.github.com/orgs/{org_name}/teams'
# Find team ID via API call:
#   curl -H "Authorization: token <your-token>"  \
#         https://api.github.com/orgs/{org-name}/teams
team_id = 12345678
Next comes the code for getting the list of existing repos, which lets us ensure the new repo name does not collide with an existing one. First, the getLastPage() function parses the Link header returned by the GitHub repo-listing API call and, from the URL it contains, determines how many pages of repos exist. That page count drives the loop in getRepos().

# Used to determine how many pages there are in the repo list
def getLastPage(link_header):
  links = link_header.split(',')
  for link in links:
    link = link.split(';')
    if len(link) < 2:
      continue
    if link[1].strip() == 'rel="last"':
      parsed_url = urlparse(link[0].strip(' <>'))
      link_data = parsed_url.query.split('&')
      for ld in link_data:
        if ld.startswith('page='):
          last_page = int( ld.split('=')[1] )
      return last_page
  return 0

# Make a list of all repo names in the organization
def getRepos(payload):
  print('getting list of all repos, please wait...')
  repos = list()
  head = requests.get(repos_uri, params = payload)
  # No Link header means everything fit on a single page
  last_page = getLastPage(head.headers.get('link', ''))
  if last_page == 0:
    return [ r['name'] for r in json.loads(head.text) ]
  for page in range(1, last_page+1):
    repo_payload = payload.copy()
    repo_payload['page'] = page
    resp = requests.get(repos_uri, params = repo_payload)
    repos_raw = json.loads(resp.text)
    repos += [ r['name'] for r in repos_raw ]
  return repos
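As a quick illustration of the paging logic, here is getLastPage() run against a Link header in GitHub's documented format (the URL below is made up for this example):

sample_link = ('<https://api.github.com/organizations/1234/repos?page=2>; rel="next", '
               '<https://api.github.com/organizations/1234/repos?page=34>; rel="last"')
print(getLastPage(sample_link))   # prints 34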
Now, in the main body, we retrieve the GitHub personal access token that was stored in the git global configuration (see the setup comments at the top). The token is read by running a git command in a subprocess. From it we build the payload that authenticates the GitHub REST calls that follow.

# Get the github personal access token from the git
# global config
proc = subprocess.Popen(["git","config","--global",
                         "--get","user.pat"],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT)
access_token = proc.stdout.read().strip().decode()
payload = {'access_token': access_token}
Create the list of repo names. Also parse the command line argument - the repo name.

# Build a list of repo names so we can check for a collision
# repos come in a paged form, so have to loop over them
repos = getRepos(payload)

# Parse command line
parser = argparse.ArgumentParser()
parser.add_argument('repo_name', help='desired repo name')
args = parser.parse_args()

# See if the repo already exists
if args.repo_name in repos:
  print(f'Repository {args.repo_name} already exists!')
  sys.exit(1)
Next, we will actually create the repo via the Python GitHub package. This is part of a large try block.
 
try:
  # create the repo
  g = Github(access_token)
  my_org = g.get_organization(org_name)

  repo_name = args.repo_name
  repo_description = f'repo for {repo_name}'
  new_repo = my_org.create_repo(repo_name,
                               description = repo_description,
                               private = True,
                               auto_init = False)
Now we add team permission to this new repo.

  # add a team to the repo with admin permissions
  teamrepo_uri = f'https://api.github.com/teams/{team_id}' +  \
                 f'/repos/{org_name}/{repo_name}'
  # choose the permission you want for this team
  team_payload = {'permission': 'admin'}
  repo_r = requests.put(teamrepo_uri, params = payload,
                        data = json.dumps(team_payload))
Next, handle any exceptions. This is very bad code, because it is not designed to handle specific exceptions: it catches everything, prints a message, and exits with a 1 status. It is suitable to run for now, but should be refined to handle whichever specific exceptions actually occur. So far, no exceptions have been encountered!

except Exception:
  exc_type, exc_value, exc_tb = sys.exc_info()
  print("Exception type: %s, Exception arg: %s\nException Traceback:\n%s"
        % (exc_type, exc_value, exc_tb))
  print('\nError in creating or configuring repo.',
        'Check it out and try again')
  sys.exit(1)
That's it. So, here is the full code of this script:

#!/usr/bin/env python3.6

# Creates a new GitHub repo and adds a team owner

# Setting up to run this:
#   1) Install python 3.6 (brew install python). Ensure
#      the shebang line of this script works for your
#      system.
#   2) pip3 install PyGithub (maybe pip or pip3.6 - whatever 
#      works for the python3.6 or 3.7 install)
#   3) Create a GitHub personal access token at 
#      https://github.com. Click your user picture in the
#      upper right, choose settings, then Developer settings
#      and Personal access tokens. Create a new token.
#      The scopes only need to include repo, plus the
#      read:org scope under the admin:org heading.
#   4) Insert that personal access token (PAT) in the git
#      global configuration file with: 
#        git config --global user.pat <your-token>
#   5) Also, ensure you have ssh set up to be used for
#      GitHub access
#   6) Edit this script to change the org_name and team_id
#      constants to match your needs

import os
import sys
import argparse
import getpass
import subprocess
import re
import json
import datetime
import zipfile
from urllib.parse import urlparse

import requests
from github import Github

## Constants
org_name = 'dummy_org'
base_uri = f'https://git@github.com/{org_name}/'
repos_uri = f'https://api.github.com/orgs/{org_name}/repos'
teams_uri = f'https://api.github.com/orgs/{org_name}/teams'
# Find team ID via API call:
#   curl -H "Authorization: token <your-token>"  \
#         https://api.github.com/orgs/{org-name}/teams
team_id = 12345678

# Used to determine how many pages there are in the repo list
def getLastPage(link_header):
  links = link_header.split(',')
  for link in links:
    link = link.split(';')
    if len(link) < 2:
      continue
    if link[1].strip() == 'rel="last"':
      parsed_url = urlparse(link[0].strip(' <>'))
      link_data = parsed_url.query.split('&')
      for ld in link_data:
        if ld.startswith('page='):
          last_page = int( ld.split('=')[1] )
      return last_page
  return 0

# Make a list of all repo names in the organization
def getRepos(payload):
  print('getting list of all repos, please wait...')
  repos = list()
  head = requests.get(repos_uri, params = payload)
  # No Link header means everything fit on a single page
  last_page = getLastPage(head.headers.get('link', ''))
  if last_page == 0:
    return [ r['name'] for r in json.loads(head.text) ]
  for page in range(1, last_page+1):
    repo_payload = payload.copy()
    repo_payload['page'] = page
    resp = requests.get(repos_uri, params = repo_payload)
    repos_raw = json.loads(resp.text)
    repos += [ r['name'] for r in repos_raw ]
  return repos

# Get the github personal access token from the git
# global config
proc = subprocess.Popen(["git","config","--global",
                         "--get","user.pat"],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT)
access_token = proc.stdout.read().strip().decode()
payload = {'access_token': access_token}

# Build a list of repo names so we can check for a collision
# repos come in a paged form, so have to loop over them
repos = getRepos(payload)

# Parse command line
parser = argparse.ArgumentParser()
parser.add_argument('repo_name', help='desired repo name')
args = parser.parse_args()

# See if the repo already exists
if args.repo_name in repos:
  print(f'Repository {args.repo_name} already exists!')
  sys.exit(1)

try:
  # create the repo
  g = Github(access_token)
  my_org = g.get_organization(org_name)

  repo_name = args.repo_name
  repo_description = f'repo for {repo_name}'
  new_repo = my_org.create_repo(repo_name,
                               description = repo_description,
                               private = True,
                               auto_init = False)

  # add a team to the repo with admin permissions
  teamrepo_uri = f'https://api.github.com/teams/{team_id}' +  \
                 f'/repos/{org_name}/{repo_name}'
  # choose the permission you want for this team
  team_payload = {'permission': 'admin'}
  repo_r = requests.put(teamrepo_uri, params = payload,
                        data = json.dumps(team_payload))

except Exception:
  exc_type, exc_value, exc_tb = sys.exc_info()
  print("Exception type: %s, Exception arg: %s\nException Traceback:\n%s"
        % (exc_type, exc_value, exc_tb))
  print('\nError in creating or configuring repo.',
        'Check it out and try again')
  sys.exit(1)
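Usage, once the constants and the PAT are in place, is just the script name plus the desired repo name, e.g. ./create_repo.py my-new-repo (assuming you saved the script as create_repo.py and made it executable). On success you get a new private repo in the organization with the chosen team granted admin access; on a name collision or any API failure the script exits with status 1.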

Tuesday, April 9, 2019

A bug in Gnu/Linux run-parts script can cause cron to hang

About a year ago, I encountered a Linux server where the cron task scheduler had hung. Investigating further, I found that the script run-parts, which is used to run the cron.hourly, cron.daily, etc. tasks, had a major bug. In fact, I think it is the best example of a software bug I have ever observed.

/usr/bin/run-parts is a short shell script which walks through the files in the cron.daily, etc. folder and runs the scripts found there. This run-parts script is part of the crontabs package for RHEL and derivatives. I have observed it in Amazon Linux (1), CentOS and RHEL, versions 6 and 7.

run-parts contains this shell pipeline at its heart:
$i 2>&1 | awk -v "progname=$i" \
                        'progname {
                             print progname ":\n"
                             progname="";
                         }
                         { print; }'
The variable $i is the path to the script (found in cron.daily, etc.) to be executed. That awk script is a horrendous example of programming. It defines a variable progname on the command line, e.g. the path to logrotate. That same name is then used as a pattern guard inside the awk program, and the guarded action clears the variable so the block never fires again. That's right: the block disables itself while it is running, which is effectively a race condition.

The purpose of this code is to echo the program name once, up front, and from then on just echo its input. To keep it as an awk script, that code should be replaced with the following:
$i 2>&1 | awk -v "progname=$i" \
   'BEGIN { print progname ":\n" }
   { print; }'
However, there is no need to add the additional burden of awk having to echo each line. This would work just fine:
echo -e "$i:\n"
$i

Here is what the processes look like when the race condition is hit:

# ps axwu|grep cron
root 1793 0.0 0.0 116912 1188 ? Ss 2018 3:21 crond
root 12003 0.0 0.0 103328 860 pts/2 S+ 13:33 0:00 grep cron
root 14361 0.0 0.0 19052 948 ? Ss 2018 0:00 /usr/sbin/anacron -s
root 16875 0.0 0.0 106112 1268 ? SN 2018 0:00 /bin/bash /usr/bin/run-parts /etc/cron.daily
root 16887 0.0 0.0 105972 948 ? SN 2018 0:00 awk -v progname=/etc/cron.daily/logrotate progname {????? print progname ":\n"????? progname="";???? }???? { print; }

The awk process never finishes (my guess is that it gets lost once the variable driving the pattern is cleared). I discovered this on April 2, 2019, and the process had been hung since December 21, 2018.

The process of running awk seems to have gotten nowhere:

# ps -p $pid H -www -o pid,cputime,state,lstart,time,etime,cmd
  PID TIME S STARTED TIME ELAPSED CMD
16887 00:00:00 S Fri Dec 21 14:13:01 2018 00:00:00 101-22:45:16 awk -v progname=/etc/cron.daily/logrotate progname {????? print progname ":\n"????? progname="";???? }???? { print; }

I attached a debugger to the awk process and found that awk was not executing anything.

I first discovered this problem on Amazon Linux. AWS wants bugs for Amazon Linux reported on their forums; when I got no response there, I set it aside. Then, on April 2, I hit a cron hang again, this time on RHEL. Not knowing how difficult submitting a bug to RHEL might be (I feared I would have to track down licenses and more), I reported it to CentOS. They immediately told me to submit it to RHEL (it's not as if RHEL and CentOS are part of the same company ;-), so I did. It wasn't hard to do, and there I finally got some traction. But if you want the fix immediately, just edit the code as shown above.




Wednesday, May 2, 2018

How to use the aws cli and jq to list instance id and name tag


Using the Amazon Web Services CLI (command line interface tool), you can get information about your virtual machines (EC2 instances). The information comes back as JSON, but it is quite voluminous. If you just want to see two fields, the instance ID and the Name tag, it can be a challenge, because the tags are returned as an array of tag keys paired with tag values. So the magic here is the jq code needed to grab the right tag.

# build up the jq code in two parts
jqProgram1='.Reservations[].Instances[] | (.Tags | '
jqProgram2='from_entries) as $tags | .InstanceId + ", " + $tags.Name'
jqProgram="$jqProgram1 $jqProgram2"

aws --profile myprofile --region us-east-1 ec2 describe-instances \
   --filters Name=tag-key,Values=Name |   \
   jq "$jqProgram" 
Example output:
"i-0f5f9271233816c3f, instance-name-1"
"i-0bd45657eefdb0345, instance-name-2"
"i-00fab46ea64a78997, instance-name-3"
The jq program uses from_entries to turn each instance's tag array into an object, captured as the variable $tags, which then has a .Name field holding the Name tag.
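If you would rather stay in Python, here is a rough boto3 sketch that produces the same instance-id/Name pairs; the profile and region are the placeholders from the example above, and result pagination is ignored for brevity:

import boto3

session = boto3.Session(profile_name='myprofile', region_name='us-east-1')
ec2 = session.client('ec2')

# Same filter as the CLI call: only instances that have a Name tag
resp = ec2.describe_instances(
  Filters=[{'Name': 'tag-key', 'Values': ['Name']}])

for reservation in resp['Reservations']:
  for inst in reservation['Instances']:
    tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}
    print(f"{inst['InstanceId']}, {tags.get('Name', '')}")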

Sunday, April 29, 2018

Javascript coding challenge - async callback/variable scope

The Problem

My son asked about this ECMAScript/Javascript code, where he was using Google's geocoding and mapping APIs to populate a map with markers.

<script type="text/javascript">
function test() {
    var locations = [
        ['100 N Main St Wheaton, IL 60187', 'Place A'],
        ['200 W Apple Dr Winfield, IL 60190', 'Place B']
    ];
    var map = new google.maps.Map(document.getElementById('map'), {
        zoom: 10,
        center: new google.maps.LatLng(41.876905, -88.101131),
        mapTypeId: google.maps.MapTypeId.ROADMAP
    });
    var infowindow = new google.maps.InfoWindow();
    var marker, i;
    var geocoder = new google.maps.Geocoder();
    for (i = 0; i < locations.length; i++) {
        geocoder.geocode({
            'address': locations[i][0]
        }, function(results, status) {
            marker = new google.maps.Marker({
                position: results[0].geometry.location,
                map: map,
                title: locations[i][1]
            });
        });
    }
}


It was broken: it was unable to find locations[i][1] to assign to the title.

My response:

So, what would I do?

Study the documentation further on Markers and Geocoding.

The second one tells me my understanding/recollection of the scope of anonymous functions was wrong. The variable resultsMap is passed as a parameter to geocodeAddress, which then calls geocoder.geocode with an anonymous callback function that uses that variable. However, that variable never changes. The same goes for you: locations never changes. So the real problem is that the value of i changed. By the time the callbacks run, the for loop has finished, i equals locations.length, and that is outside the range of valid indexes for locations, so the lookup chokes. The asynchronous execution is what causes the problem.

Google's examples all make only one mark, so they can hard code what they want.

So, options: (1) don't initially set the title of the marker, but come back later and do it, (2) specify a title variable that doesn't change, or (3) force waiting on async execution so i doesn't change.

(1) Now, can we count on Javascript not executing the callbacks out of order? I don't think so. If we could, you could append the markers to an array, assume they arrive in the same order as the locations passed to geocode(), and come back later to add the titles.

Is there something returned in the results of geocode that would help us index into the locations array? Reading through https://developers.google.com/maps/documentation/javascript/geocoding doesn't reveal anything. I was hopeful that placeId might work, if it were an arbitrary field because you can pass the value in to the geocode call and you get it back out in results. However, Google has reserved the values for their own meaning. And, the address you send in to geocode is not necessarily the same one you get out of results.formatted_address.

(2) I imagine, with some effort, one could use dynamic code generation (the program writes code and then runs it) to define fixed variables for each of the values of locations, so that you have something like location1 = locations[0], location2 = locations[1], etc., and those variables could be referenced in the callback function. The eval() function is what dynamically evaluates the generated code. Even better might be embedding the constant value directly in the callback function definition. So it would be something like this inside the for loop:

var dynamicCode = 'geocoder.geocode(' +
  '{ \'address\': \'' + locations[i][0] + '\' }, ' +
  'function(results, status) { ' +
  'marker = new google.maps.Marker({ position: results[0].geometry.location, ' +
  'map: map, title: \'' + locations[i][1] + '\' }); });';
alert('about to execute this code:\n' + dynamicCode);
eval(dynamicCode);

You could embed newlines \n into that dynamic code if you want to make it look prettier, but it is not necessary. On the other hand, spaces could be trimmed out too. So, that is one solution.

(3) Another solution: wait for the callback to complete. The modern way to deal with this is to use Javascript Promises. However, that would need support from the geocode library, where the callback comes from, and reviewing the reference documentation on geocode does not reveal any such support. So, a more hacky approach involves a lock. It would look like this:

// This loop must run inside an async function so that await is legal.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
var lock;
for (i = 0; i < locations.length; i++) {
    lock = 1; // lock set for this loop iteration
    geocoder.geocode({
        'address': locations[i][0]
    }, function(results, status) {
        marker = new google.maps.Marker({
            position: results[0].geometry.location,
            map: map,
            title: locations[i][1]
        });
        lock = 0; // unlock once this iteration's marker exists
    });
    while (lock == 1) {
        await wait(100); // wait 100 milliseconds and check again
    }
}
So, the lock is set going into the call to geocode() and only in the callback function is the lock unset. Polling for the lock to be unset happens every 100 ms, but a shorter time interval may make sense.