I learned a lot more about how AWS handles events and alarms in CloudWatch (CW). The lesson is this: CW events are posted at the time of the event (as measured at the source), not at the delivery time. This means the data is always spiky if you use a CW alarm to look for a single missing event. It is spiky because there is always some delay in delivering the event, so for a short while the newest point appears to be missing. This graphic illustrates this:
You can see the spike over the Δt interval, which is the delay in delivering the event. I would not call the AWS approach "back-filling"; it is really offsetting. If it were filling, the glitch would actually be filled in and would vanish.
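To make the spike concrete, here is a toy model of it. This is only a sketch under my own assumptions, not CloudWatch's actual alarm evaluation: events are emitted every `period` seconds, each one shows up `delay` seconds late but carries its source timestamp, and the (hypothetical) function `visible_count` counts what a trailing window would see at wall-clock time `now`.

```python
def visible_count(now, period, delay, window):
    """Count events stamped in [now - window, now) that have been delivered by `now`.

    Toy model: event k is stamped at k * period and delivered at k * period + delay.
    """
    count = 0
    k = 0
    while k * period < now:
        event_time = k * period
        delivered_at = event_time + delay
        in_window = now - window <= event_time
        if in_window and delivered_at <= now:
            count += 1
        k += 1
    return count


# One event per minute, 5 s delivery delay, 60 s trailing window.
trace = [visible_count(t, period=60, delay=5, window=60) for t in range(61, 70)]
print(trace)  # [0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The run of zeros is the Δt glitch: nothing is actually missing, but the newest event has not been delivered yet and the previous one has already slid out of the window.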
AWS's approach is not very useful when events arrive with a delay (i.e., always) and you want to monitor for one missing point. I would argue for a better approach, the one suggested by the name "back-FILL": when a sample arrives, you fill in its presence over the entire period between the event time and the CW delivery time. This could be done by tracking the two times (event creation and CW arrival), using the earlier time when the event marks the beginning of an interval and the later time when it marks the end.
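Here is a minimal sketch of how I read that back-fill idea; it is not anything CW offers. Each sample is treated as present over the whole interval from creation to arrival, so a trailing window counts it if that interval overlaps the window. The names `Sample`, `present_in_window`, and `filled_count` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    created: int  # event time at the source, in seconds
    arrived: int  # delivery time at the monitoring side, in seconds


def present_in_window(sample, start, end):
    """True if the filled interval [created, arrived] overlaps [start, end).

    The earlier time (created) is compared against the window end and the
    later time (arrived) against the window start.
    """
    return sample.created < end and sample.arrived >= start


def filled_count(samples, now, window):
    """Count already-delivered samples whose filled interval touches the trailing window."""
    start = now - window
    return sum(
        1 for s in samples
        if s.arrived <= now and present_in_window(s, start, now)
    )


# One event per minute, each delivered 5 s late.
healthy = [Sample(0, 5), Sample(60, 65), Sample(120, 125)]
# Same stream with the event at t=60 genuinely missing.
broken = [Sample(0, 5), Sample(120, 125)]

print(filled_count(healthy, now=62, window=60))  # 1
print(filled_count(broken, now=66, window=60))   # 0
```

In this toy stream the healthy case never drops to zero during the delivery delay, while the genuinely missing event still shows up as a zero once the previous sample's filled interval has left the window.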
The CW implementation of averaging is sample-based. If time-based averaging were available, this problem would be simple to solve. When averaged over a longer period, the spike would be insignificant. If we expect 4 invocations in the averaging period, then the time average will be 1.0 when no event is missing. When an event is missed, the time-based average drops linearly from 1.0 until it reaches 0.75 at the point of the next invocation. One could set the threshold at 0.9, or higher or lower. This gives you a slider to tune out noise or to increase responsiveness. All delay effects would be minimized.
If I can't find a different solution using CW, then I will need to come up with a different option. Maybe a Lambda function running in a different region than the one I am monitoring. Maybe a Jenkins job that uses the AWS CLI to query CloudWatch. It is a shame not to be able to get this done in CW alone.
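If the check does end up living outside CW, the time-based average described above is easy to compute from the event timestamps. The following is a sketch under my own assumptions, not a CW feature: each event is taken to hold the signal at 1 for one expected `period`, and the (hypothetical) function `time_average` returns the fraction of the trailing window that is covered.

```python
def time_average(event_times, now, period, window):
    """Fraction of [now - window, now) covered by the intervals [t, t + period)."""
    start = now - window
    covered = 0
    cursor = start  # end of the coverage counted so far, to avoid double counting
    for t in sorted(event_times):
        lo = max(t, cursor)
        hi = min(t + period, now)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / window


PERIOD = 60          # one invocation expected per minute
WINDOW = 4 * PERIOD  # 4 invocations expected per averaging period

all_events = list(range(0, 660, PERIOD))
missing_one = [t for t in all_events if t != 300]  # the event at t=300 never happens

print(time_average(all_events, now=300, period=PERIOD, window=WINDOW))   # 1.0
print(time_average(missing_one, now=330, period=PERIOD, window=WINDOW))  # 0.875
print(time_average(missing_one, now=360, period=PERIOD, window=WINDOW))  # 0.75
```

A 5 second delivery delay only hides the newest event for 5 of the 240 seconds in the window, so the average stays at about 0.98 or better, comfortably above a 0.9 threshold, while a truly missing event falls through 0.9 on its way down to 0.75.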