Thursday, February 17, 2011

Jonathan Gladstone: Threshold Management Diagram

Jonathan Gladstone has worked with a team to implement proactive mainframe CPU usage monitoring, basing his design partly on presentations by and conversations with Igor Trubin (currently of IBM) and Boris Ginis (of BMC Software).

His system does not generate any alerts on this basis, but it’s a good place to go to
  • find out what’s been running hot (or cool) at the system level, and/or
  • figure out why at the service class level.

It compares each interval (in this case, every 10 minutes) of the most recent day’s utilization, by system and by service class, with what is normal for the same hour on the same day of the week over the past six weeks. In practice, each interval is compared to the set of the last 36 values from a similar timeframe (six 10-minute intervals per hour over six weeks). If more than one interval in an hour is higher than the 98th percentile for its hour and day, the hour is marked yellow; if more than four intervals are high, the hour is marked red. If more than one interval is lower than the 2nd percentile for its hour and day, the hour is marked blue. Anything in between (i.e., anything that falls within roughly x-bar ± 2SD) is green.
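
The colour rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Jonathan’s actual implementation: the names `percentile` and `classify_hour` are made up here, the percentile uses simple linear interpolation, and the order of the checks (red, then yellow, then blue) is a guess for the rare hour that has both high and low intervals.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)                       # index at or below the rank
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def classify_hour(intervals, baseline):
    """Return the colour for one hour.

    intervals: utilization of the 10-minute intervals in the hour being judged.
    baseline:  historical values for the same hour and weekday
               (36 values in the post: 6 intervals x 6 weeks).
    """
    high = percentile(baseline, 98)
    low = percentile(baseline, 2)
    n_high = sum(1 for value in intervals if value > high)
    n_low = sum(1 for value in intervals if value < low)

    # Assumption: high flags take precedence over low flags.
    if n_high > 4:
        return "red"
    if n_high > 1:
        return "yellow"
    if n_low > 1:
        return "blue"
    return "green"
```

With a made-up baseline of `list(range(40, 76))`, an hour with two intervals at 80% would come out yellow, while a single spike to 80% would still be green, since the rule needs more than one high interval.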

Here’s the main “CPU Overview” page from his system:



The thumbnails give an idea of what’s going on: green is within normal range. Let’s look at Sunday, Jan. 23rd (just because all the colours are there). Clicking on any thumbnail shows that day close up:



Without going into details about what runs in which systems, we can see that they’re listed in reverse alphabetical order and, of course, anyone who’s looking at this knows which system is which. The user can see that a lot of systems were running well below their normal utilization on this particular Sunday. That’s mostly because of some special testing: our developers were asked to stay off the systems if they could. To see more detail, let’s choose SCA6, which has all of the colours. If we click anywhere on the bar for SCA6, this next level of detail is shown:


  

That chart shows the system’s total utilization (from SMF70s) for individual 10-minute intervals (green area) compared to the average, high (98th percentile) and low (2nd percentile) values for each hour, based on the last six weeks. We can see why some hours are marked red, yellow or blue instead of green according to the rules above. Clicking anywhere on the green area brings up a long page full of control charts that show the same information for each defined service class within that system (from SMF72s).
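
For illustration, the three reference lines on such a chart could be derived from the six-week history roughly as follows. This is a hedged sketch, not the code behind the charts: `hourly_bands` and its input shape (timestamp and utilization pairs, assumed to be already extracted from the SMF data) are hypothetical, and `statistics.quantiles` with `n=50` is just one convenient way to get the 2nd and 98th percentiles.

```python
from collections import defaultdict
from statistics import mean, quantiles


def hourly_bands(history, weekday):
    """Compute (average, low, high) per hour for one day of the week.

    history: iterable of (datetime, utilization) samples, e.g. one per
             10-minute interval over the past six weeks (assumed input shape).
    weekday: day of the week as returned by datetime.weekday() (Monday is 0).
    """
    by_hour = defaultdict(list)
    for timestamp, value in history:
        if timestamp.weekday() == weekday:
            by_hour[timestamp.hour].append(value)

    bands = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # quantiles() needs at least two data points
        # n=50 yields cut points at 2%, 4%, ..., 98%.
        cuts = quantiles(values, n=50, method="inclusive")
        bands[hour] = (mean(values), cuts[0], cuts[-1])
    return bands
```

Each interval of the most recent day can then be plotted against the band for its hour, which is what makes the red, yellow and blue hours stand out.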

Among them the following, BATH_A6, is high-priority batch. Clearly it was driving some of the yellow and red flags for this system in the 2-3 and 5-7h windows:





(This post is published here with Jonathan Gladstone’s permission. He retains all publication rights and copyright for this material.)