Saturday, February 15, 2020

Dealing With Sparse Monitoring Data on Google Cloud Platform - Get the Raw Data!


I just wrote a blog post for my company, New Context Security, on how to deal with sparse monitoring data on Google Cloud Platform, using a Python script to retrieve the actual data:

https://newcontext.com/python-to-retrieve-gcp-stackdriver-monitoring-data/

Tuesday, May 14, 2019

Supplementing AWS CloudWatch Alarm capability - Watch over Lambda function

I haven't heard back from AWS support on the subject of my last message, so I created a Jenkins job to handle this auditing and ensure the Lambda function is running. Here is the bash shell script I used to implement it:

#!/usr/bin/env bash

# Used by a Jenkins job to monitor for an AWS Lambda function failing to fire
# every 90 minutes

# Algorithm:
#   Get any CloudWatch datapoints for Lambda invocations in the
#   last $AlarmTime minutes. If there are none, the scheduled
#   Lambda function (which should run every 90 minutes) has
#   missed a run, so we alarm. When the alarm condition is met,
#   a marker file is created to record that, and only the first
#   time does this job end in a fail. That fail is what reports
#   the problem; if the alarm condition persists but the file
#   already exists, the job won't fail again. However, we leave
#   an escape hatch: a marker file older than 24 hours is
#   deleted, so if the alarm is neglected it will come back
#   every day.

AlarmTime=95  # minutes - if late this much, alarm


Profile='ioce'
Region='us-east-1'
Namespace='AWS/Lambda'
Metric='Invocations'
Stat='SampleCount'
Dime='Name=FunctionName,Value=lambda_chef_converge_check'
AlarmFile='alarmOn'

OffsetExpression="$AlarmTime minutes ago"
StartTime=$(date -u -d "$OffsetExpression" +'%Y-%m-%dT%TZ')
EndTime=$(date -u +'%Y-%m-%dT%TZ')

# Get the metrics and test for DATAPOINTS
aws --profile $Profile --region $Region              \
  cloudwatch  get-metric-statistics                   \
  --namespace $Namespace --metric-name $Metric         \
  --start-time $StartTime --end-time $EndTime           \
  --period 60 --statistics "$Stat" --dimensions "$Dime"  \
  --output text|grep -q '^DATAPOINTS'
if [ 0 -eq $? ];then
  # Found datapoints, things are fine. Clear alarm file
  rm -f $AlarmFile
  
else
  # No datapoints found, we are missing a point, so alarm
  # if we haven't already done so for this episode
  if [ ! -f $AlarmFile ];then
    touch $AlarmFile
    exit 1 # get the job to fail
  else
    # Check if it is time to delete the file
    find . -maxdepth 1 -type f -name "$AlarmFile" -cmin +1440 -delete
  fi
fi

I created a Jenkins job that pulls this script from GitHub and executes it on this schedule: H/15 * * * * (Jenkins cron syntax, roughly every 15 minutes). It takes about 1.5 seconds to run from the Jenkins trigger.

Monday, May 13, 2019

Back-filling AWS CloudWatch Events

I learned a lot more about how AWS handles events and alarms in CloudWatch (CW). The lesson is: CW events are posted at the time of the event (as measured at the source) and not at the delivery time. This means the data is always spiky if you are looking (with a CW alarm) for one missing event. It is spiky because there is some delay for delivery of the event. This graphic illustrates this:

[Figure: metric timeline showing no datapoint at the expected event time, then a spike Δt later when the delayed datapoint is delivered.]

You can see the spike over the Δt interval, the delay in delivering the event. I would not label the AWS approach as "back-filling", but rather offsetting. If it were filling, the glitch would actually be filled in, so that it vanished.

AWS's approach is not very useful when there is a delay in the arrival of the events (i.e. always) and you want to monitor for a single missing point. I would argue for a better approach, the one actually suggested by the name "back-FILL": when a sample arrives, fill in its presence over the entire interval between the event time and the CW delivery time. This could be done by tracking both times (event creation and CW arrival) and using the earlier one when the event marks the beginning of an interval and the later one when it marks the end.
The CW implementation of averaging is sample-based. If time-based averaging were available, this problem would be simple to solve: averaged over a longer period, the spike would be insignificant. If we expect 4 invocations in the averaging period, the time average will be 1.0 when no event is missing. When an event is missed, the time-based average drops from 1.0 linearly until it reaches 0.75 at the point of the next invocation. One could set the threshold at 0.9, or move it higher or lower; that gives you a slider to tune out noise or increase responsiveness. All delay effects would be minimized.
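To make that arithmetic concrete, here is a toy Python sketch (not CloudWatch code, and a coarse, stepwise discretization of the idea; a true time-weighted average would decay smoothly rather than in steps) of a 4-slot sliding window over invocations expected every 90 minutes:

# Toy illustration only: invocations expected every 90 minutes, averaged
# over a trailing 360-minute window that should contain 4 of them.
EXPECTED_PERIOD = 90    # minutes between expected invocations
WINDOW = 360            # minutes; 4 expected invocations per window
EXPECTED_COUNT = WINDOW // EXPECTED_PERIOD

# Invocation times in minutes; the invocation at t=450 was missed.
invocations = [0, 90, 180, 270, 360, 540, 630, 720, 810]

def windowed_average(t):
  seen = sum(1 for i in invocations if t - WINDOW < i <= t)
  return seen / EXPECTED_COUNT

for t in range(360, 811, 90):
  print(t, windowed_average(t))
# Prints 1.0 while all slots are filled, 0.75 across the missed slot, and
# 1.0 again once the gap ages out of the window; a 0.9 threshold catches it.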
If I can't find a solution using CW alone, I will need to come up with a different option. Maybe a Lambda function running in a different region than the one I am monitoring. Maybe a Jenkins job that uses the AWS CLI to query CloudWatch. It is a shame not to be able to get this done in CW alone.
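As a starting point for either option, here is a minimal sketch of that out-of-band check using boto3. The profile, region, function name and 95-minute window mirror my setup; inside a Lambda function you would drop the named profile and rely on the execution role. This is a sketch, not production code:

#!/usr/bin/env python3
# Exit non-zero if the Lambda function has not been invoked in the last
# ALARM_MINUTES minutes, so the caller (Jenkins, another Lambda) can alarm.
import datetime
import sys

import boto3

ALARM_MINUTES = 95
FUNCTION_NAME = 'lambda_chef_converge_check'

session = boto3.Session(profile_name='ioce', region_name='us-east-1')
cw = session.client('cloudwatch')

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(minutes=ALARM_MINUTES)

resp = cw.get_metric_statistics(
  Namespace='AWS/Lambda',
  MetricName='Invocations',
  Dimensions=[{'Name': 'FunctionName', 'Value': FUNCTION_NAME}],
  StartTime=start,
  EndTime=end,
  Period=60,
  Statistics=['SampleCount'],
)

if not resp['Datapoints']:
  print(f'No invocations of {FUNCTION_NAME} in the last {ALARM_MINUTES} minutes')
  sys.exit(1)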


Wednesday, May 1, 2019

GitHub Repo setup

Problem: We needed a handy way to create GitHub repositories, enforce some naming restrictions and add appropriate team access. So, I automated this with Python 3.6+.

Along the way, I pulled some important tips from Bill Hunt's Gist addressing the same need.

I will discuss the code below in sections, but look for the full script at the end of this post. The first section is the comment block describing the requirements for running this script. Note that the shebang line specifies Python 3.6; you should really be on 3.7 by now, however!


#!/usr/bin/env python3.6

# Creates a new GitHub repo and adds a team owner

# Setting up to run this:
#   1) Install python 3.6 (brew install python). Ensure
#      the shebang line of this script works for your
#      system.
#   2) pip3 install PyGithub (maybe pip or pip3.6 - whatever 
#      works for the python3.6 or 3.7 install)
#   3) Create a GitHub personal access token at 
#      https://github.com. Click your user picture in the
#      upper right, choose settings, then Developer settings
#      and Personal access tokens. Create a new token.
#      The scopes only need to include repo, plus the
#      read:org scope under the admin:org heading.
#   4) Insert that personal access token (PAT) in the git
#      global configuration file with: 
#        git config --global user.pat <your-token>
#   5) Also, ensure you have ssh set up to be used for
#      GitHub access
#   6) Edit this script to change the org_name and team_id
#      constants to match your needs
Next come the import statements. Most are Python standard libraries; the two third-party packages are requests and PyGithub.

import os
import sys
import argparse
import getpass
import subprocess
import re
import json
import datetime
import zipfile
from urllib.parse import urlparse

import requests
from github import Github
Some constants are defined. You need to edit org_name and team_id to match your GitHub configuration.

## Constants
org_name = 'dummy_org'
base_uri = f'https://git@github.com/{org_name}/'
repos_uri = f'https://api.github.com/orgs/{org_name}/repos'
teams_uri = f'https://api.github.com/orgs/{org_name}/teams'
# Find team ID via API call:
#   curl -H "Authorization: token <your-token>"  \
#         https://api.github.com/orgs/{org-name}/teams
team_id = 12345678
Next comes the code for getting the list of existing repos, which lets us ensure the new repo name does not collide with an existing one. First, the getLastPage() function parses the Link header returned by the GitHub repo-listing API call and, from the URL it contains, determines how many pages of repos exist. That page count drives the loop in getRepos().

# Used to determine how many pages there are in the repo list
def getLastPage(link_header):
  links = link_header.split(',')
  for link in links:
    link = link.split(';')
    if len(link) < 2:
      continue
    if link[1].strip() == 'rel="last"':
      parsed_url = urlparse(link[0].strip(' <>'))
      link_data = parsed_url.query.split('&')
      for ld in link_data:
        if ld.startswith('page='):
          last_page = int( ld.split('=')[1] )
      return last_page
  return 0

# Make a list of all repo names in the organization
def getRepos(payload):
  print('getting list of all repos, please wait...')
  repos = list()
  head = requests.get(repos_uri, params = payload)
  # No Link header means everything fit on a single page
  last_page = getLastPage(head.headers.get('link', ''))
  if last_page == 0:
    return [ r['name'] for r in json.loads(head.text) ]
  for page in range(1, last_page+1):
    repo_payload = payload.copy()
    repo_payload['page'] = page
    resp = requests.get(repos_uri, params = repo_payload)
    repos_raw = json.loads(resp.text)
    repos += [ r['name'] for r in repos_raw ]
  return repos
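As a quick illustration of the paging logic, here is getLastPage() run against a Link header in GitHub's documented format (the URL below is made up for this example):

sample_link = ('<https://api.github.com/organizations/1234/repos?page=2>; rel="next", '
               '<https://api.github.com/organizations/1234/repos?page=34>; rel="last"')
print(getLastPage(sample_link))   # prints 34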
Now, in the main body, we retrieve the GitHub personal access token that was stored in the git global configuration (see the setup comments at the top). The token is read by running a git command in a subprocess. From it we build the payload that authenticates the GitHub REST calls that follow.

# Get the github personal access token from the git
# global config
proc = subprocess.Popen(["git","config","--global",
                         "--get","user.pat"],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT)
access_token = proc.stdout.read().strip().decode()
payload = {'access_token': access_token}
Create the list of repo names. Also parse the command line argument - the repo name.

# Build a list of repo names so we can check for a collision
# repos come in a paged form, so have to loop over them
repos = getRepos(payload)

# Parse command line
parser = argparse.ArgumentParser()
parser.add_argument('repo_name', help='desired repo name')
args = parser.parse_args()

# See if the repo already exists
if args.repo_name in repos:
  print(f'Repository {args.repo_name} already exists!')
  sys.exit(1)
Next, we will actually create the repo via the Python GitHub package. This is part of a large try block.
 
try:
  # create the repo
  g = Github(access_token)
  my_org = g.get_organization(org_name)

  repo_name = args.repo_name
  repo_description = f'repo for {repo_name}'
  new_repo = my_org.create_repo(repo_name,
                               description = repo_description,
                               private = True,
                               auto_init = False)
Now we add team permission to this new repo.

  # add a team to the repo with admin permissions
  teamrepo_uri = f'https://api.github.com/teams/{team_id}' +  \
                 f'/repos/{org_name}/{repo_name}'
  # choose the permission you want for this team
  team_payload = {'permission': 'admin'}
  repo_r = requests.put(teamrepo_uri, params = payload,
                        data = json.dumps(team_payload))
Next, handle any exceptions. This is very bad code, because it is not designed to handle specific exceptions: it catches everything, prints a message, and exits with a 1 status. It is suitable to run for now, but should be refined to handle whichever specific exceptions actually occur. So far, no exceptions have been encountered!

except Exception:
  exc_type, exc_value, exc_tb = sys.exc_info()
  print("Exception type: %s, Exception arg: %s\nException Traceback:\n%s"
        % (exc_type, exc_value, exc_tb))
  print('\nError in creating or configuring repo.',
        'Check it out and try again')
  sys.exit(1)
That's it. So, here is the full code of this script:

#!/usr/bin/env python3.6

# Creates a new GitHub repo and adds a team owner

# Setting up to run this:
#   1) Install python 3.6 (brew install python). Ensure
#      the shebang line of this script works for your
#      system.
#   2) pip3 install PyGithub (maybe pip or pip3.6 - whatever 
#      works for the python3.6 or 3.7 install)
#   3) Create a GitHub personal access token at 
#      https://github.com. Click your user picture in the
#      upper right, choose settings, then Developer settings
#      and Personal access tokens. Create a new token.
#      The scopes only need to include repo, plus the
#      read:org scope under the admin:org heading.
#   4) Insert that personal access token (PAT) in the git
#      global configuration file with: 
#        git config --global user.pat <your-token>
#   5) Also, ensure you have ssh set up to be used for
#      GitHub access
#   6) Edit this script to change the org_name and team_id
#      constants to match your needs

import os
import sys
import argparse
import getpass
import subprocess
import re
import json
import datetime
import zipfile
from urllib.parse import urlparse

import requests
from github import Github

## Constants
org_name = 'dummy_org'
base_uri = f'https://git@github.com/{org_name}/'
repos_uri = f'https://api.github.com/orgs/{org_name}/repos'
teams_uri = f'https://api.github.com/orgs/{org_name}/teams'
# Find team ID via API call:
#   curl -H "Authorization: token <your-token>"  \
#         https://api.github.com/orgs/{org-name}/teams
team_id = 12345678

# Used to determine how many pages there are in the repo list
def getLastPage(link_header):
  links = link_header.split(',')
  for link in links:
    link = link.split(';')
    if len(link) < 2:
      continue
    if link[1].strip() == 'rel="last"':
      parsed_url = urlparse(link[0].strip(' <>'))
      link_data = parsed_url.query.split('&')
      for ld in link_data:
        if ld.startswith('page='):
          last_page = int( ld.split('=')[1] )
      return last_page
  return 0

# Make a list of all repo names in the organization
def getRepos(payload):
  print('getting list of all repos, please wait...')
  repos = list()
  head = requests.get(repos_uri, params = payload)
  # No Link header means everything fit on a single page
  last_page = getLastPage(head.headers.get('link', ''))
  if last_page == 0:
    return [ r['name'] for r in json.loads(head.text) ]
  for page in range(1, last_page+1):
    repo_payload = payload.copy()
    repo_payload['page'] = page
    resp = requests.get(repos_uri, params = repo_payload)
    repos_raw = json.loads(resp.text)
    repos += [ r['name'] for r in repos_raw ]
  return repos

# Get the github personal access token from the git
# global config
proc = subprocess.Popen(["git","config","--global",
                         "--get","user.pat"],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT)
access_token = proc.stdout.read().strip().decode()
payload = {'access_token': access_token}

# Build a list of repo names so we can check for a collision
# repos come in a paged form, so have to loop over them
repos = getRepos(payload)

# Parse command line
parser = argparse.ArgumentParser()
parser.add_argument('repo_name', help='desired repo name')
args = parser.parse_args()

# See if the repo already exists
if args.repo_name in repos:
  print(f'Repository {args.repo_name} already exists!')
  sys.exit(1)

try:
  # create the repo
  g = Github(access_token)
  my_org = g.get_organization(org_name)

  repo_name = args.repo_name
  repo_description = f'repo for {repo_name}'
  new_repo = my_org.create_repo(repo_name,
                               description = repo_description,
                               private = True,
                               auto_init = False)

  # add a team to the repo with admin permissions
  teamrepo_uri = f'https://api.github.com/teams/{team_id}' +  \
                 f'/repos/{org_name}/{repo_name}'
  # choose the permission you want for this team
  team_payload = {'permission': 'admin'}
  repo_r = requests.put(teamrepo_uri, params = payload,
                        data = json.dumps(team_payload))

except Exception:
  exc_type, exc_value, exc_tb = sys.exc_info()
  print("Exception type: %s, Exception arg: %s\nException Traceback:\n%s"
        % (exc_type, exc_value, exc_tb))
  print('\nError in creating or configuring repo.',
        'Check it out and try again')
  sys.exit(1)
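Usage, once the constants and the PAT are in place, is just the script name plus the desired repo name, e.g. ./create_repo.py my-new-repo (assuming you saved the script as create_repo.py and made it executable). On success you get a new private repo in the organization with the chosen team granted admin access; on a name collision or any API failure the script exits with status 1.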

Tuesday, April 9, 2019

A bug in Gnu/Linux run-parts script can cause cron to hang

About a year ago, I encountered a Linux server where the cron task scheduler had hung. Investigating further, I found that the script run-parts, which is used to run the cron.hourly, cron.daily, etc. tasks, had a major bug. In fact, I think it is the best example of a software bug I have ever observed.

/usr/bin/run-parts is a short shell script which walks through the files in the cron.daily, etc. folder and runs the scripts found there. This run-parts script is part of the crontabs package for RHEL and derivatives. I have observed it in Amazon Linux (1), CentOS and RHEL, versions 6 and 7.

run-parts contains this shell pipeline at its heart:
$i 2>&1 | awk -v "progname=$i" \
                        'progname {
                             print progname ":\n"
                             progname="";
                         }
                         { print; }'
The variable $i is the path to the script (found in cron.daily, etc.) to be executed. That awk script is a horrendous example of programming. It defines a variable progname on the command line, e.g. the path to logrotate. That same name is then used as a pattern guard inside the awk program, and the guarded action clears the variable so the block never fires again. That's right: the block disables itself while it is running, which is effectively a race condition.

The purpose of this code is to echo the program name once, up front, and from then on just echo its input. To keep it as an awk script, that code should be replaced with the following:
$i 2>&1 | awk -v "progname=$i" \
   'BEGIN { print progname ":\n" }
   { print; }'
However, there is no need to add the additional burden of awk having to echo each line. This would work just fine:
echo -e "$i:\n"
$i

Here is what the processes look like when the race condition is hit:

# ps axwu|grep cron
root 1793 0.0 0.0 116912 1188 ? Ss 2018 3:21 crond
root 12003 0.0 0.0 103328 860 pts/2 S+ 13:33 0:00 grep cron
root 14361 0.0 0.0 19052 948 ? Ss 2018 0:00 /usr/sbin/anacron -s
root 16875 0.0 0.0 106112 1268 ? SN 2018 0:00 /bin/bash /usr/bin/run-parts /etc/cron.daily
root 16887 0.0 0.0 105972 948 ? SN 2018 0:00 awk -v progname=/etc/cron.daily/logrotate progname {????? print progname ":\n"????? progname="";???? }???? { print; }

The awk process never finishes (my guess is that it gets lost once the variable driving the pattern is cleared). I discovered this on April 2, 2019, and the process had been hung since December 21, 2018.

The process of running awk seems to have gotten nowhere:

# ps -p $pid H -www -o pid,cputime,state,lstart,time,etime,cmd
  PID TIME S STARTED TIME ELAPSED CMD
16887 00:00:00 S Fri Dec 21 14:13:01 2018 00:00:00 101-22:45:16 awk -v progname=/etc/cron.daily/logrotate progname {????? print progname ":\n"????? progname="";???? }???? { print; }

I attached a debugger to the awk process and found that awk was not executing anything.

I first discovered this problem on Amazon Linux. AWS wants bugs for Amazon Linux reported on their forums; when I got no response there, I set it aside. Then, on April 2, I hit a cron hang again, this time on RHEL. Not knowing how difficult submitting a bug to RHEL might be (I feared I would have to track down licenses and more), I reported it to CentOS. They immediately told me to submit it to RHEL (it's not as if RHEL and CentOS are part of the same company ;-), so I did. It wasn't hard to do, and there I finally got some traction. But if you want the fix immediately, just edit the code as shown above.




Wednesday, May 2, 2018

How to use the aws cli and jq to list instance id and name tag


Using the Amazon Web Services CLI (command line interface tool), you can get information about your virtual machines (EC2 instances). The information comes back as JSON, but it is quite voluminous. If you just want to see two fields, the instance ID and the Name tag, it can be a challenge, because the tags are returned as an array of tag keys paired with tag values. So the magic here is the jq code needed to grab the right tag.

# build up the jq code in two parts
jqProgram1='.Reservations[].Instances[] | (.Tags | '
jqProgram2='from_entries) as $tags | .InstanceId + ", " + $tags.Name'
jqProgram="$jqProgram1 $jqProgram2"

aws --profile myprofile --region us-east-1 ec2 describe-instances \
   --filters Name=tag-key,Values=Name |   \
   jq "$jqProgram" 
Example output:
"i-0f5f9271233816c3f, instance-name-1"
"i-0bd45657eefdb0345, instance-name-2"
"i-00fab46ea64a78997, instance-name-3"
The jq program uses from_entries to turn each instance's tag array into an object, captured as the variable $tags, which then has a .Name field holding the Name tag.
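If you would rather stay in Python, here is a rough boto3 sketch that produces the same instance-id/Name pairs; the profile and region are the placeholders from the example above, and result pagination is ignored for brevity:

import boto3

session = boto3.Session(profile_name='myprofile', region_name='us-east-1')
ec2 = session.client('ec2')

# Same filter as the CLI call: only instances that have a Name tag
resp = ec2.describe_instances(
  Filters=[{'Name': 'tag-key', 'Values': ['Name']}])

for reservation in resp['Reservations']:
  for inst in reservation['Instances']:
    tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}
    print(f"{inst['InstanceId']}, {tags.get('Name', '')}")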

Sunday, April 29, 2018

Javascript coding challenge - async callback/variable scope

The Problem

My son asked about this ECMAScript/Javascript code, where he was using Google's geocoding and mapping APIs to populate a map with markers.

<script type="text/javascript">
function test() {
    var locations = [
        ['100 N Main St Wheaton, IL 60187', 'Place A'],
        ['200 W Apple Dr Winfield, IL 60190', 'Place B']
    ];
    var map = new google.maps.Map(document.getElementById('map'), {
        zoom: 10,
        center: new google.maps.LatLng(41.876905, -88.101131),
        mapTypeId: google.maps.MapTypeId.ROADMAP
    });
    var infowindow = new google.maps.InfoWindow();
    var marker, i;
    var geocoder = new google.maps.Geocoder();
    for (i = 0; i < locations.length; i++) {
        geocoder.geocode({
            'address': locations[i][0]
        }, function(results, status) {
            marker = new google.maps.Marker({
                position: results[0].geometry.location,
                map: map,
                title: locations[i][1]
            });
        });
    }
}


It was broken: it was unable to find locations[i][1] to assign to the title.

My response:

So, what would I do?

Study the documentation further on Markers and Geocoding.

The second one tells me my understanding/recollection of the scope of anonymous functions was wrong. The variable resultsMap is passed as a parameter to geocodeAddress, which then calls geocoder.geocode with an anonymous callback function that uses that variable. However, that variable never changes. The same goes for you: locations never changes. So the real problem is that the value of i changed. By the time the callbacks run, the for loop has finished, i equals locations.length, and that is outside the range of valid indexes for locations, so the lookup chokes. The asynchronous execution is what causes the problem.

Google's examples all make only one mark, so they can hard code what they want.

So, options: (1) don't initially set the title of the marker, but come back later and do it, (2) specify a title variable that doesn't change, or (3) force waiting on async execution so i doesn't change.

(1) Now, can we count on Javascript not executing the callbacks out of order? I don't think so. If we could, you could append the markers to an array, assume they arrive in the same order as the locations passed to geocode(), and come back later to add the titles.

Is there something returned in the results of geocode that would help us index into the locations array? Reading through https://developers.google.com/maps/documentation/javascript/geocoding doesn't reveal anything. I was hopeful that placeId might work, if it were an arbitrary field because you can pass the value in to the geocode call and you get it back out in results. However, Google has reserved the values for their own meaning. And, the address you send in to geocode is not necessarily the same one you get out of results.formatted_address.

(2) I imagine, with some effort, one could use dynamic code generation (the program writes code and then runs it) to define fixed variables for each of the values of locations, so that you have something like location1 = locations[0], location2 = locations[1], etc., and those variables could be referenced in the callback function. The eval() function is what dynamically evaluates the generated code. Even better might be embedding the constant value directly in the callback function definition. So it would be something like this inside the for loop:

var dynamicCode = 'geocoder.geocode(' +
  '{ \'address\': \'' + locations[i][0] + '\' }, ' +
  'function(results, status) { ' +
  'marker = new google.maps.Marker({ position: results[0].geometry.location, ' +
  'map: map, title: \'' + locations[i][1] + '\' }); });';
alert('about to execute this code:\n' + dynamicCode);
eval(dynamicCode);

You could embed newlines \n into that dynamic code if you want to make it look prettier, but it is not necessary. On the other hand, spaces could be trimmed out too. So, that is one solution.

(3) Another solution: wait for the callback to complete. The modern way to deal with this is to use Javascript Promises. However, that would need support from the geocode library, where the callback comes from, and reviewing the reference documentation on geocode does not reveal any such support. So, a more hacky approach involves a lock. It would look like this:

// This loop must run inside an async function so that await is legal.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
var lock;
for (i = 0; i < locations.length; i++) {
    lock = 1; // lock set for this loop iteration
    geocoder.geocode({
        'address': locations[i][0]
    }, function(results, status) {
        marker = new google.maps.Marker({
            position: results[0].geometry.location,
            map: map,
            title: locations[i][1]
        });
        lock = 0; // unlock once this iteration's marker exists
    });
    while (lock == 1) {
        await wait(100); // wait 100 milliseconds and check again
    }
}
So, the lock is set going into the call to geocode() and only in the callback function is the lock unset. Polling for the lock to be unset happens every 100 ms, but a shorter time interval may make sense.