Tuesday, January 31, 2012

How to get the actual spreadsheets I have developed to build control charts.

I got that question from one of my blog's readers. Here is my answer to him:

I share my spreadsheets only during the live presentations I occasionally give (at local and international CMG.org events), so keep checking my blog for when my next one will be - you are welcome to attend. I plan to submit my next CMG paper for this year's International Conference in Las Vegas, so try to attend and we will talk. You can also invite me to your local CMG chapter! If you do not have one - create it! Ask me how to do that and I can help.

If you cannot reach me that way, I usually explain in great detail how my spreadsheets (like the control chart builder) work in my blog posts and papers, so I believe it is possible to recreate them. Let me know if anything specific is not clear in my postings or papers and I will try to give you more explanation.

And finally, I am an IBM consultant! Your company can call me in to help you implement my ideas!

Sunday, January 29, 2012

Bad and Good FiOS Upload Comparison

This is a continuation of the previous research published here:
 FiOS Problem: Large File Upload Speed Analysis

My son repeated the same test I did, but on his own laptop, and got the same and even more interesting result, which shows the problem very clearly: the upload runs fast at 2.5 Mbps while transferring the first 2-3 MB and then degrades to ~0.17 Mbps for the rest of the file transfer, with some occasional spikes:



Then my son went to his next-door friend, who has the very fast 35/25 FiOS plan, and tested the upload speed there the same way we did from my house. Everything was extremely fast, as seen below:
Then I went to another neighbor (same 35/25 FiOS) with my laptop, ran my test, and also got a very good result:
That was a 200 MB file upload to YouTube, and it shows an actual upload speed of about 25% of 100 Mbps (~25 Mbps), which matches the standard speed test. I even tested sending a 2 MB file as an attachment via my Verizon.net account - NO PROBLEM:

So we ran a clean experiment, and now I am sure there is no problem with my laptop and that FiOS in our neighborhood is OK. The problem is somewhere around my house. I am sending this report to Verizon support; we will see how they help me. Updates will follow in additional posts on my personal blog: http://trubinigor.blogspot.com.



Friday, January 27, 2012

FiOS Problem: Large File Upload Speed Analysis

I am afraid my home internet (Verizon FiOS) does not provide the upload speed I actually pay for. I have the 15/5 Mbps plan, so the upload speed should be about 5 Mbps.

The speed test (http://www.speakeasy.net/speedtest/) gives the following result:
Good, right? But that test uses a very small file, I believe. So I could not resist running an experiment to measure the real upload speed I get with a relatively large file (~8 MB).
See the result here:
 
That means I get about 0.17% of 100 Mbps = 0.17 Mbps instead of 5 Mbps!!!

That is interesting... a small file uploads fast, so the standard speed test does not capture the problem, but a long upload degrades significantly!
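For anyone who wants to reproduce this measurement programmatically rather than by watching the browser progress bar, here is a minimal sketch of the idea in Python. The upload URL, file name and chunk size are placeholders I made up (any HTTP endpoint that accepts a large POST body would do), and the per-chunk timings are only approximate because of OS socket buffering:

import time
import requests

UPLOAD_URL = "http://example.com/upload"   # hypothetical endpoint
FILE_PATH = "large_test_file.bin"          # a ~8 MB test file
CHUNK_SIZE = 256 * 1024                    # 256 KB per chunk

def timed_chunks(path):
    """Yield the file in chunks and print the apparent throughput of each one."""
    with open(path, "rb") as f:
        sent = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            start = time.time()
            yield chunk                           # requests sends the chunk here
            elapsed = max(time.time() - start, 1e-6)
            sent += len(chunk)
            mbps = len(chunk) * 8 / elapsed / 1e6
            print("%6.1f MB sent, ~%5.2f Mbps for this chunk" % (sent / 1e6, mbps))

# If the line is healthy, the per-chunk Mbps should stay roughly flat; with my
# problem I would expect it to drop to ~0.17 Mbps after the first couple of MB.
response = requests.post(UPLOAD_URL, data=timed_chunks(FILE_PATH))
print("HTTP status:", response.status_code)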

I hope it is not a trick by Verizon to hide their problems; it should be some network defect or capacity issue.

Sometimes I see errors:
- the standard speed test from some distant locations shows an error (see the example below from Seattle):
- plus, when I attach a >2 MB file to my e-mail, it returns an error after uploading less than half of the file:


Note: Ironically, the Verizon "home agent" program senses this e-mail problem but cannot help at all!

So this is a real problem for a blogger!

I have opened a ticket with Verizon and am having a battle with them right now to get this fixed. See details: http://trubinigor.blogspot.com/2012/01/facebook-verizon-account-responded-on.html


Monday, January 23, 2012

Quantifying Imbalance in Computer Systems: CMG'11 Trip Report, Part 2

UPDATE 2018:
The technique was successfully tested in SonR (a SEDS-based anomaly detection system), as described in the following post:

"My talk, 'Catching Anomaly and Normality in Cloud by Neural Net and Entropy Calculation', has been selected for #CMGimpact 2019"

_______________________________________________________  original post:
As I promised in CMG'11 Trip Report, Part 1, here are my comments and some follow-up analysis of the following paper: Quantifying Imbalance in Computer Systems, written and presented at CMG'11 by Charles Loboz from Windows Azure.

The idea is to calculate the imbalance of a system by using entropy, a property well known in physics, economics, and information theory.

In another past posting I raised the following question:
"can information theory (entropy analysis) be applied to performance exception detection?"

It looks like the idea from the mentioned CMG paper of applying an entropy calculation to system performance data could lead to the answer to that question!

 Here is the quote from the paper: 



"...Theil index is based on entropy - it describes the excess entropy in a system. For a data set xi,
i=1..n the Theil index is given by:

where n is the number of elements in the data set and xavg is the average value of all elements in the data set. To underline the application of the Theil index to measure  imbalance in computer systems we call it henceforth the Imbalance Coefficient (IC). 

Examining closer the IC formula above we can derive several properties:
  • (1) the ratio x_i/x_avg describes how much element i is above or below the average for the whole set. Thus IC involves only the ratio of each element against the average, not the absolute values of the elements.
  • (2) IC is dimensionless - thus it allows comparing imbalance between sets of substantially different quantities, for example when one set contains disk utilizations and another disk response times.
  • (3) The minimum value of IC is zero - when all elements of the data set are identical. The maximum value of the Imbalance Coefficient is log(n) - when all elements but one are equal; the maximum IC thus depends on the set size.
  • (4) We can view the Imbalance Coefficient as a description of how concentrated the use of some resource is - large values mean fewer users use most of the resource, small values mean more equal sharing.

We also define, for convenience, the Normalized Imbalance Coefficient (nIC) as

    nIC = IC / log(n)

to account for both the imbalance within the set and the maximum entropy in that set. The nIC value ranges from 0 to 1, thus enabling comparison of imbalance between data sets with differing numbers of elements..."
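To make the quoted definitions more concrete, here is a small Python sketch of the IC and nIC calculations as I read them; the function names and the sample disk utilization numbers are mine, not from the paper:

import math

def imbalance_coefficient(values):
    """Theil-index style Imbalance Coefficient: 0 when all elements are equal,
    up to log(n) when all the usage is concentrated in a single element."""
    n = len(values)
    avg = sum(values) / float(n)
    ic = 0.0
    for x in values:
        if x > 0:                        # treat 0 * log(0) as 0
            ratio = x / avg
            ic += ratio * math.log(ratio)
    return ic / n

def normalized_ic(values):
    """nIC = IC / log(n): ranges from 0 (perfect balance) to 1."""
    return imbalance_coefficient(values) / math.log(len(values))

# Example: four disks with balanced vs. heavily skewed utilization
print(normalized_ic([25, 25, 25, 25]))   # 0.0   - perfectly balanced
print(normalized_ic([97, 1, 1, 1]))      # ~0.88 - heavily imbalanced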

The author applied this to the analysis of utilization across multiple disks, but he mentioned that the approach could be used to measure the imbalance of other computer subsystems. So I decided to try calculating the imbalance of CPU utilization during a day (24 hours) and a week (168 hours), because the imbalance of capacity usage during a day or week is a pretty common concern. Also, using my way of grouping baseline vs. actual data, I applied it twice to compare the "average" weekly/daily utilization vs. the last week/days of actual utilization.

The raw data is the same as in the last Control Charting exercise I published here in a series of posts (see EV-Control Chart as an example), where the actual data (in black) vs. the historical averages (in green) are shown below:

Here is the result of calculating the actual vs. averaged nIC imbalance difference for all 168 hours and for each weekday (7 days by 24 hours each):


You can see that on the day when the CPU usage anomaly started - Wednesday - the imbalance was significantly different, and the overall weekly imbalance was significantly different too! So indeed that metric can be used to capture some performance metric anomalies (pattern changes).
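For readers who do not have my spreadsheet, here is a rough Python sketch of the same weekday-by-weekday (and whole-week) comparison; the hourly CPU utilization numbers and the 0.05 "significantly different" threshold are made up purely for illustration:

import math

def normalized_ic(values):
    """Theil-style Imbalance Coefficient divided by log(n), range 0..1."""
    n = len(values)
    avg = sum(values) / float(n)
    ic = sum((x / avg) * math.log(x / avg) for x in values if x > 0) / n
    return ic / math.log(n)

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Baseline week: 24 hourly CPU utilization values (%) per day, busier 8:00-18:00.
baseline = {d: [30 if 8 <= h <= 18 else 20 for h in range(24)] for d in days}

# Actual week: same as baseline, except Wednesday has an anomalous afternoon spike.
actual = {d: list(baseline[d]) for d in days}
actual["Wed"] = [90 if h >= 12 else 10 for h in range(24)]

for d in days:
    base_nic = normalized_ic(baseline[d])
    act_nic = normalized_ic(actual[d])
    diff = act_nic - base_nic
    flag = "  <-- imbalance changed" if abs(diff) > 0.05 else ""
    print("%s: baseline nIC=%.3f, actual nIC=%.3f, diff=%+.3f%s"
          % (d, base_nic, act_nic, diff, flag))

# The same comparison can also be done over all 168 hours of the week:
week_base = [v for d in days for v in baseline[d]]
week_act = [v for d in days for v in actual[d]]
print("Whole week: baseline nIC=%.3f, actual nIC=%.3f"
      % (normalized_ic(week_base), normalized_ic(week_act)))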

FYI: here is a snapshot of the spreadsheet with the actual calculations I used:

How much better this method of checking for imbalance changes is compared with more traditional ways of doing it (e.g., based on deviations) is hard to say. My personal preference is still the EV concept. Anyway, someone needs to try it against more data...

BTW, I have found another paper related to this topic:

Quantifying Load Imbalance on Virtualized Enterprise Servers by
Emmanuel Arzuaga and David R. Kaeli

That paper contains a clear statement about imbalance: "A typical imbalance metric based on the resource utilization of physical servers is the standard deviation of the CPU utilization".

Still, I believe entropy is an interesting system property that should give us an additional good source of information for pattern recognition. For instance, the balance of capacity usage across large frames with a lot of LPARs (AIX p7 or VMware hosts) could be monitored by using the nIC metric, possibly with an automatic way to rebalance capacity usage by means of partition mobility or vMotion technologies.

Friday, January 20, 2012

LinkedIn Discussion: "How to write a book or blog"

I responded to the LinkedIn discussion "How to write a book or blog", initiated by professional blogger Greg Schulz, and I got the following excellent advice, which I am going to follow:

"Igor with all of your white papers and posts, you probably have a good basis for a book or ebook. Likewise, in the course of doing a book project, there tends to be a lot of content that ends up on the "cutting room floor" that makes for future blogs posts, articles, tips, etc.

Sounds like a good theme topic for a book, particular if you took an angle of "...past, present and future...". The idea of the past, present and future is to discuss how statistical and empirical measurements have evolved, are being used and will continue to be important in the future. After all, you (or your cloud provider) cannot effectively manage what they do not have insight or awareness into. Hence the importance and role of statistical and empirical analysis.

Of course, you can play the buzzword bingo game angle by working in how big data and hadoop tie into the supporting statistical analysis. Try an experiment assuming that you have stats enabled for your websites, which is look at normal traffic patterns. Then do a post with a title along the lines of "statistical monitoring with big data" and see what changes in traffic patterns occur.
...
I have one primary blog (e.g. http://storageioblog.com ) where either most of my material goes initially or as a follow-up to items that appear elsewhere. Now that I think about it, I guess I do have other blogs that either pick up my feeds automatically, or that I periodic visit and quickly cross post if wordpress friendly. There are also a bunch of other sites where articles, topics, pod casts, videos or guest posts appear in addition to those that syndicate my blog feed (e.g. via http://storageioblog.com/RSSfull.xml or http://storageioblog.com/RSSfullArchive.xml ).

My RSS feeds are free to anyone to use as long as they retain links that are in the post maintain attributions and copyrights do not insert content or posts from others in-line of a post, or otherwise change the content context. Likewise, sites are free to use excerpts as long as they attribute back to the source and preserve copyrights including if/when put into creative commons...."


Control Chart usage in "Automated Analysis of Load Testing Results"


Searching again on http://academic.research.microsoft.com, I found that CMG papers are not the only ones discussing anomaly detection / control charting subjects in the Systems Capacity Management field. Below are a few examples:

1. Automated Analysis of Load Testing Results, Zhen Ming Jiang. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pp. 143-146, 2010.


From the abstract of the paper: "...This dissertation proposes automated approaches to detect functional and performance problems in a load test by mining the recorded load testing data (execution logs and performance metrics)..."

The paper references three other papers (see below) that, I believe, relate to the subject of this blog:



- I. A. Trubin and L. Merritt. Mainframe Global and Workload Level Statistical Exception Detection System, Based on MASF. In 2004 CMG Conference, 2004.

Here is the context in which my paper was referenced:
"... It is di cult for humans to interpret raw performance
metrics, as it is not clear how to categorize these raw met-
ric values into performance categories (e.g. high, medium
and low). Furthermore, some data mining algorithms (e.g.
Navie Bayes Classi er) only take discrete values as input.
We are currently exploring generic approaches to classify
performance metrics into discrete performance categories us-
ing techniques like control charts [Trubin's CMG'04 paper] to facilitate our future
work in performance analysis...."

BTW, here is a slide with a MIPS control chart from that paper's presentation:



2. L. Cherkasova, K. Ozonat, N. Mi, J. Symons, and E. Smirni. Anomaly? Application Change? or Workload Change? Towards Automated Detection of Application Performance Anomaly and Change. In IEEE International Conference on Dependable Systems and Networks, 2008.


3. B. Anton, M. Leonardo, and P. Fabrizio. AVA: Automated Interpretation of Dynamically Detected Anomalies. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, 2009.

I plan to find and read the last two papers and maybe report something here...