Tuesday, April 9, 2019

A bug in Gnu/Linux run-parts script can cause cron to hang

About a year ago, I encountered a Linux server where the cron task scheduler had hung. Investigating further, I found in the script run-parts, which is used to run the cron.hourly, cron.daily, etc. tasks, had a major bug. In fact, I think it is the best example of a software bug I have observed.

/usr/bin/run-parts is a short shell script which walks through the files in the cron.daily, etc. folder and runs the scripts found there. This run-parts script is part of the crontabs package for RHEL and derivatives. I have observed it in Amazon Linux (1), CentOS and RHEL, versions 6 and 7.

run-parts contains this shell pipeline at its heart:
$i 2>&1 | awk -v "progname=$i" \
                        'progname {
                             print progname ":\n"
                             progname="";
                         }
                         { print; }'
The variable $i is the path to the script (found in cron.daily) to be executed. That awk script is horrendous example of programming. It defines a variable *progname* on the command line, e.g. logrotate. The awk script also defines a function of the same name. In that AWK function, the AWK function is deleted. That's right, while the function is executing, it deletes itself. That is a race condition.

The purpose of this code is to echo the progname initially once and then from then on just echo its input. To keep it as an awk script, that code should be replaced the following:
$i 2>&1 | awk -v "progname=$i" \
   'BEGIN { print progname ":\n" }
   { print; }'
However, there is no need to add the additional burden of awk having to echo each line. This would work just fine:
echo -e "$i:\n"
$i

Here is what the processes look like when the race condition is hit:

# ps axwu|grep cron
root 1793 0.0 0.0 116912 1188 ? Ss 2018 3:21 crond
root 12003 0.0 0.0 103328 860 pts/2 S+ 13:33 0:00 grep cron
root 14361 0.0 0.0 19052 948 ? Ss 2018 0:00 /usr/sbin/anacron -s
root 16875 0.0 0.0 106112 1268 ? SN 2018 0:00 /bin/bash /usr/bin/run-parts /etc/cron.daily
root 16887 0.0 0.0 105972 948 ? SN 2018 0:00 awk -v progname=/etc/cron.daily/logrotate progname {????? print progname ":\n"????? progname="";???? }???? { print; }

The awk process never finishes (it is trying to find the function it was trying to run, I guess it gets lost when the function is deleted. I discovered this on April 2, 2019 and this has been hung since Dec 21, 2018.

The process of running awk seems to have gotten nowhere:

# ps -p $pid H -www -o pid,cputime,state,lstart,time,etime,cmd
  PID TIME S STARTED TIME ELAPSED CMD
16887 00:00:00 S Fri Dec 21 14:13:01 2018 00:00:00 101-22:45:16 awk -v progname=/etc/cron.daily/logrotate progname {????? print progname ":\n"????? progname="";???? }???? { print; }

I joined the awk process with debugger and found out awk was not executing anything.

I first discovered this problem on Amazon Linux. AWS wants bugs for that reported on Amazon forums. When I had NO response, I set it aside. Then, on April 2, I discovered a cron hang again, this time in RHEL. I didn't know how difficult it might be in submitting bugs with RHEL (fearing I would have to track down licenses and more), I posted it to CentOS. They immediately told me to submit it to RHEL (its not like RHEL and CentOS are part of the same company ;-), so I did. It wasn't hard to do. There I finally got some traction. But, if you want it immediately, just edit the code as shown above.




No comments:

Post a Comment