
Friday, January 27, 2012

FiOS Problem: Large File Upload Speed Analysis

I am afraid my home internet connection (Verizon FiOS) does not deliver the upload speed I actually pay for. My plan is 15/5 Mbps, so the upload speed should be about 5 Mbps.

The speed test at http://www.speakeasy.net/speedtest/ gives the following result:
Good, right? But I believe that test uses a very small file. So I could not resist running an experiment to measure the real upload speed I get for a relatively large file (~8Mb).
See the result here:
 
That means I get about 0.17% of 100 Mbps = 0.17 Mbps instead of 5 Mbps!
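
The arithmetic behind that number can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and it assumes that the 0.17% figure is the utilization of a 100 Mbps link and that the test file is about 8 megabytes.

```python
def utilization_to_mbps(utilization_percent, link_mbps):
    """Convert link utilization (in percent) into throughput in Mbps."""
    return link_mbps * utilization_percent / 100.0


def upload_seconds(file_megabytes, rate_mbps):
    """Time to upload a file, assuming 1 megabyte = 8 megabits."""
    return file_megabytes * 8 / rate_mbps


# Assumed inputs: 0.17% utilization of a 100 Mbps link, 8 MB test file.
observed_mbps = utilization_to_mbps(0.17, 100)
print(f"observed rate: {observed_mbps:.2f} Mbps")
print(f"upload time at observed rate: {upload_seconds(8, observed_mbps):.0f} s")
print(f"upload time at 5 Mbps:        {upload_seconds(8, 5):.1f} s")
```

Under those assumptions the same file needs more than 6 minutes at the observed rate, versus about 13 seconds at the advertised 5 Mbps.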

That is interesting: a small file uploads fast, so the standard speed test does not capture the problem, but a long upload degrades significantly!

I hope it is not a trick by Verizon to hide their problems; it is more likely some network defect or capacity issue.

Sometimes I see errors:
- The standard speed test from some distant locations shows an error (see the example below from Seattle):
- Also, when I attach a file larger than 2 Mb to an e-mail, it returns an error after uploading less than half of the file:


Note: Ironically, the Verizon "home agent" program detects this e-mail problem but cannot help at all!

So it is a real problem for a blogger!

I have opened a ticket with Verizon and am battling with them right now to get it fixed. See details: http://trubinigor.blogspot.com/2012/01/facebook-verizon-account-responded-on.html


Monday, January 23, 2012

Quantifying Imbalance in Computer Systems: CMG'11 Trip Report, Part 2

As I promised in CMG'11 Trip Report, Part 1, here are my comments and some follow-up analysis of the paper Quantifying Imbalance in Computer Systems, which was written and presented at CMG'11 by Charles Loboz from Windows Azure.

The idea is to calculate the imbalance of a system using entropy, a property that is well known in physics, economics and information theory.

In one of my past postings I raised the following question:
 "Can information theory (entropy analysis) be applied to performance exception detection?"

It looks like the idea from the CMG paper mentioned above, applying an entropy calculation to system performance data, could lead to the answer to my question!

 Here is the quote from the paper: 



"...Theil index is based on entropy - it describes the excess entropy in a system. For a data set xi,
i=1..n the Theil index is given by:

$$T = \frac{1}{n}\sum_{i=1}^{n}\frac{x_i}{x_{avg}}\,\ln\frac{x_i}{x_{avg}}$$

where n is the number of elements in the data set and xavg is the average value of all elements in the data set. To underline the application of the Theil index to measure  imbalance in computer systems we call it henceforth the Imbalance Coefficient (IC). 

Examining closer the IC formula above we can derive several properties:
  • (1) the ratio xi/xavg describes how much element i is above or below the average for the whole set. Thus IC involves only the ratio of each element against the average, not the absolute values of the elements.
  • (2) IC is dimensionless - thus allows to compare imbalance between sets of substantially different quantities, for example when one set contains disk utilizations and another disk response times.
  • (3) The minimum value of IC is zero - when all elements of the data set are identical. The maximum value of the Imbalance Coefficient is log(n) - when all elements but one are equal; the maximum IC depends thus on the set size.
  • (4) We can view Imbalance Coefficient as a description of how concentrated is the use of some resource - large values mean fewer users use most of the resource, small values mean more equal sharing.

We also define, for convenience, Normalized Imbalance Coefficient (nIC) as

$$nIC = \frac{IC}{\log(n)}$$

to account for both imbalance within the set and the maximum entropy in that set. The nIC value ranges from 0 to 1 thus enabling comparison of imbalance between data sets with differing number of elements..."
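
The two quoted definitions can be sketched in Python as follows. This is a minimal sketch, not the paper's implementation: it assumes the standard Theil T form (the mean of (x_i/x_avg)·ln(x_i/x_avg)), non-negative input values, and it treats zero elements as contributing nothing (the limit of r·ln(r) as r goes to 0). The function names are mine.

```python
import math


def imbalance_coefficient(values):
    """Theil index (IC) of a list of non-negative numbers."""
    n = len(values)
    avg = sum(values) / n
    if avg == 0:
        return 0.0  # all zeros: treat as perfectly balanced
    total = 0.0
    for x in values:
        ratio = x / avg
        if ratio > 0:  # a zero element contributes 0 (limit of r*ln(r))
            total += ratio * math.log(ratio)
    return total / n


def normalized_imbalance(values):
    """nIC = IC / log(n), ranging from 0 (balanced) to 1 (fully concentrated)."""
    n = len(values)
    if n < 2:
        return 0.0
    return imbalance_coefficient(values) / math.log(n)


print(normalized_imbalance([25, 25, 25, 25]))  # identical elements
print(normalized_imbalance([0, 0, 0, 100]))    # one element holds everything
print(normalized_imbalance([10, 20, 30, 40]))  # something in between
```

Identical elements give 0, and a set where a single element holds the whole load gives 1, which matches the range stated in the quote.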

The author applied it to the utilization analysis of multiple disks, but he mentioned that the approach could be used to measure the imbalance of other computer subsystems. So I decided to calculate the imbalance of CPU utilization over a day (24 hours) and a week (168 hours), because the imbalance of capacity usage during a day or week is a pretty common concern. Also, using my way of grouping base-line vs. actual data, I applied it twice to compare an "average" weekly/daily utilization with the last week/days of actual utilization.

The raw data is the same as for the last Control Charting exercise I published here in the series of posts (see EV-Control Chart as an example); the actual data (in black) vs. historical averages (in green) are shown below:

Here is the result of calculating the difference between the actual and averaged nIC for all 168 hours and for each weekday (7 days by 24 hours):

You can see that on the day when the CPU usage anomaly started (Wednesday) the imbalance was significantly different, and all in all the weekly imbalance was significantly different too! So indeed that metric can be used to capture some performance metric anomalies (pattern changes).
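
A comparison of this kind can be reproduced roughly as follows. The data here is synthetic (a hypothetical smooth daily profile with an artificial CPU spike on the third day of the week), and the helper names are mine, so treat it as a sketch of the idea rather than the exact calculation behind the chart.

```python
import math


def nic(values):
    """Normalized Imbalance Coefficient: Theil index divided by log(n)."""
    n = len(values)
    avg = sum(values) / n
    if n < 2 or avg == 0:
        return 0.0
    theil = sum((x / avg) * math.log(x / avg) for x in values if x > 0) / n
    return theil / math.log(n)


def daily_and_weekly_nic(hourly):
    """hourly: 168 values, 7 consecutive days of 24 hours each."""
    daily = [nic(hourly[d * 24:(d + 1) * 24]) for d in range(7)]
    return daily, nic(hourly)


# Synthetic base-line: the same smooth daily profile repeated 7 times.
baseline = [20 + 30 * math.sin(math.pi * (h % 24) / 24) for h in range(168)]
# Synthetic "actual" week: identical, except for a spike on the third day.
actual = list(baseline)
for h in range(2 * 24 + 10, 2 * 24 + 16):
    actual[h] = 95.0

base_daily, base_week = daily_and_weekly_nic(baseline)
act_daily, act_week = daily_and_weekly_nic(actual)
daily_diff = [a - b for a, b in zip(act_daily, base_daily)]
weekly_diff = act_week - base_week

for day, diff in enumerate(daily_diff, start=1):
    print(f"day {day}: nIC difference = {diff:+.4f}")
print(f"week:  nIC difference = {weekly_diff:+.4f}")
```

Only the day with the spike shows a non-zero daily difference, and the weekly nIC moves as well, which is the pattern described above.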

FYI: Here is a spreadsheet snapshot with the actual calculation I used:

How much better this method of checking imbalance changes is compared with more traditional ways (e.g. based on deviations) is hard to say. My personal preference is still the EV concept. Anyway, someone needs to try it against more data...

BTW, I have found another paper related to this topic:

Quantifying Load Imbalance on Virtualized Enterprise Servers by
Emmanuel Arzuaga and David R. Kaeli

That paper makes a clear statement about imbalance: "A typical imbalance metric based on the resource utilization of physical servers is the standard deviation of the CPU utilization".

Still, entropy is an interesting system property that should give us an additional good source of information for pattern recognition, I believe. For instance, the balance of capacity usage on large frames with a lot of LPARs (AIX p7s or VMware hosts) could be monitored using the nIC metric, possibly to drive automatic rebalancing of capacity usage with partition mobility or vMotion technologies.

Friday, January 20, 2012

LinkedIn Discussion: "How to write a book or blog"

I responded to the LinkedIn discussion "How to write a book or blog", initiated by professional blogger Greg Schulz, and I got the following excellent advice, which I am going to follow:

"Igor with all of your white papers and posts, you probably have a good basis for a book or ebook. Likewise, in the course of doing a book project, there tends to be a lot of content that ends up on the "cutting room floor" that makes for future blogs posts, articles, tips, etc.

Sounds like a good theme topic for a book, particular if you took an angle of "...past, present and future...". The idea of the past, present and future is to discuss how statistical and empirical measurements have evolved, are being used and will continue to be important in the future. After all, you (or your cloud provider) cannot effectively manage what they do not have insight or awareness into. Hence the importance and role of statistical and empirical analysis.

Of course, you can play the buzzword bingo game angle by working in how big data and hadoop tie into the supporting statistical analysis. Try an experiment assuming that you have stats enabled for your websites, which is look at normal traffic patterns. Then do a post with a title along the lines of "statistical monitoring with big data" and see what changes in traffic patterns occur.
...
I have one primary blog (e.g. http://storageioblog.com ) where either most of my material goes initially or as a follow-up to items that appear elsewhere. Now that I think about it, I guess I do have other blogs that either pick up my feeds automatically, or that I periodic visit and quickly cross post if wordpress friendly. There are also a bunch of other sites where articles, topics, pod casts, videos or guest posts appear in addition to those that syndicate my blog feed (e.g. via http://storageioblog.com/RSSfull.xml or http://storageioblog.com/RSSfullArchive.xml ).

My RSS feeds are free to anyone to use as long as they retain links that are in the post maintain attributions and copyrights do not insert content or posts from others in-line of a post, or otherwise change the content context. Likewise, sites are free to use excerpts as long as they attribute back to the source and preserve copyrights including if/when put into creative commons...."


Control Chart usage in "Automated Analysis of Load Testing Results"


Searching again in http://academic.research.microsoft.com, I found that CMG papers are not the only ones discussing anomaly detection and control charting in the Systems Capacity Management field. Below are a few examples:

1. Automated Analysis of Load Testing Results, Zhen Ming Jiang, International Symposium on Software Testing and Analysis (ISSTA), pp. 143-146, 2010


From the abstract of the paper: ".. This dissertation proposes
automated approaches to detect functional and performance
problems in a load test by mining the recorded load testing
data (execution logs and performance metrics).."

The paper references three other papers (see below) that I believe are related to the subject of this blog:



- I. A. Trubin and L. Merritt. Mainframe global and
workload level statistical exception detection system,
based on MASF. In 2004 CMG Conference, 2004

Here is the passage where my paper was referenced:
"... It is difficult for humans to interpret raw performance metrics, as it is not clear how to categorize these raw metric values into performance categories (e.g. high, medium and low). Furthermore, some data mining algorithms (e.g. Naive Bayes Classifier) only take discrete values as input. We are currently exploring generic approaches to classify performance metrics into discrete performance categories using techniques like control charts [Trubin's CMG'04 paper] to facilitate our future work in performance analysis...."
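
The idea in that passage, turning raw metric values into discrete categories with the help of control chart limits, could look roughly like this. It is my own minimal sketch, not the method from the quoted paper: the limits are assumed to be the base-line mean plus or minus three standard deviations, and the names and sample numbers are hypothetical.

```python
import statistics


def control_limits(baseline, k=3.0):
    """Lower and upper control limits: mean -/+ k sample standard deviations."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    return mean - k * std, mean + k * std


def categorize(value, lower, upper):
    """Map a raw metric value to a discrete performance category."""
    if value > upper:
        return "high"
    if value < lower:
        return "low"
    return "medium"


# Hypothetical base-line of CPU utilization samples (percent).
baseline = [48, 50, 52, 49, 51, 50, 47, 53]
lower, upper = control_limits(baseline)

for sample in (30, 50, 70):
    print(sample, categorize(sample, lower, upper))
```

The resulting labels are discrete, so they can be fed to algorithms that do not accept continuous input.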

BTW, here is a slide with a MIPS control chart from the presentation of that paper:



- L. Cherkasova, K. Ozonat, N. Mi, J. Symons, and
E. Smirni. Anomaly? application change? or workload
change? towards automated detection of application
performance anomaly and change. In IEEE
International Conference on Dependable Systems and
Networks, 2008.


- A. Babenko, L. Mariani, and F. Pastore. AVA:
Automated interpretation of dynamically detected
anomalies. In Proceedings of the Eighteenth
International Symposium on Software Testing and
Analysis, 2009.

I plan to find and read the last two papers and maybe report something here....



Tuesday, December 27, 2011

IT/EV-Charts as an Application Signature: CMG'11 Trip Report, Part 1


I attended the following CMG’11 presentation (see my previous post):

A Way to Identify, Quantify and Report Change
Richard Gimarc (CA Technologies, Inc.) and Kiran Chennuri (Aetna Life Insurance Company)

Identifying change in application performance is a time consuming task. Businesses today have
hundreds of applications and each application has hundreds of metrics. How do you wade
through that mass of data to find an indication of change? This paper describes the use of an
Application Signature to identify, quantify and report change. A Signature is a compact
description of application performance that is used much like a template to judge if a change has
occurred. There are a concise set of visual indicators generated by the Signature that supports
the identification of change in a timely manner.

Here are my comments.

I like the idea of building an application characteristic called an Application Signature. As described in the paper, it is actually based on typical (standard) deviations of capacity usage during the peak hours of a day.

Looking closely at the approach, I see it is similar to the one I developed for SEDS, but a bit too simplified. Anyway, it is a great attempt to use the SEDS methodology to watch application capacity usage.

I think the weekly IT-Control Chart (see my other previous post) is a way to compare the usual weekly profile with the last 168 hours of data (Base-line vs. Actual), so the base-line in the format of an IT-Control Chart without actual data IS AN APPLICATION SIGNATURE, but a much more accurate one. It even looks like somebody’s signature:

The actual data could be significantly different, as seen below:

And that difference should be automatically captured by a SEDS-like system as exceptions, and the size of the departure from the "Signature" should be calculated using the EV meta-metric, either as a weekly sum of the hourly EV values or as an EV-Control Chart like the one shown here.

For instance, in this example week the application took a bit more than 23 unusual CPU hours, as calculated below:

So, if the weekly EV number is 0, that means the application (server, LPAR and so on) has most recently stayed within the IT-Signature, which is GOOD: no changes happened!
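
A weekly EV calculation of this kind can be sketched as follows. This is an illustration under my own simplifying assumptions, not the SEDS implementation: the signature is taken as the mean plus or minus three standard deviations for each hour of the week, the hourly EV is the amount by which the actual value leaves that band, and the weekly EV is the sum of the hourly magnitudes. All names and numbers in the example are hypothetical.

```python
import statistics

HOURS_PER_WEEK = 168


def build_signature(history, k=3.0):
    """Build (lower, mean, upper) limits for each hour of the week.

    history: list of past weeks, each a list of 168 hourly values.
    k: band width in standard deviations (an assumption of this sketch).
    """
    signature = []
    for hour in range(HOURS_PER_WEEK):
        samples = [week[hour] for week in history]
        mean = statistics.mean(samples)
        std = statistics.pstdev(samples)
        signature.append((mean - k * std, mean, mean + k * std))
    return signature


def hourly_ev(actual, signature):
    """Signed exception value per hour: 0 inside the band, excess outside."""
    evs = []
    for value, (lower, _mean, upper) in zip(actual, signature):
        if value > upper:
            evs.append(value - upper)
        elif value < lower:
            evs.append(value - lower)
        else:
            evs.append(0.0)
    return evs


def weekly_ev(actual, signature):
    """Weekly EV as the sum of hourly magnitudes, so ups and downs do not cancel."""
    return sum(abs(ev) for ev in hourly_ev(actual, signature))


# Hypothetical history: four weeks alternating between 40% and 41% utilization.
history = [[40 + (w % 2)] * HOURS_PER_WEEK for w in range(4)]
sig = build_signature(history)

quiet_week = [40.5] * HOURS_PER_WEEK
busy_week = list(quiet_week)
for h in range(58, 63):  # five unusual hours
    busy_week[h] = 52.0

print("quiet week EV:", weekly_ev(quiet_week, sig))
print("busy week EV: ", weekly_ev(busy_week, sig))
```

A week that stays inside the band yields a weekly EV of 0, while the week with five unusual hours yields a positive number that grows with the size of the exceedance.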

The paper also shows the "calendar view" report, which consists of a set of daily control charts. It is another good idea. I used that approach before I switched to weekly IT-charts that cover 1/4 of a month or bi-weekly ones that cover 1/2 of a month. So if you have IT-charts there is no need for the "calendar view", which is sometimes not easy to read.

Another feature could be important for capacity usage estimates: the balance of hourly capacity usage for the day or week vs. the overall average (e.g. weekdays vs. weekends, or the daily "cowboy hat" profile with a lunch-time drop). That is supposed to be an additional IT-Signature feature. Another CMG’11 paper presents an interesting approach to analyzing and calculating that. I plan to publish my comments about that paper, so please check my next post soon.....

Tuesday, December 6, 2011

Application Signature: some of my SEDS ideas are at work

I am at the CMG'11 conference now (in DC), presenting nothing this year (the 1st time in the last 11 years!), but I enjoy the conference, especially when my work is referenced.

Here is an example from the paper called "Application Signature: A Way to Identify, Quantify and Report Change", which is being presented today at 4 pm by Richard Gimarc from CA Technologies, Inc. and Kiran Chennuri from Aetna Life Insurance Company:

'...We readily admit that we are “standing on the shoulders of giants”; leveraging the work of others in the field to develop our own interpretation, implementation and use of an Application Signature....
... Perhaps the most influential work is by Igor Trubin. Starting in 2001, Trubin built on the ideas proposed by Buzen and Shum to develop the Statistical Exception Detection System (SEDS). Basically, SEDS “is used for automatically scanning through large volumes of performance data and identifying measurements of global metrics that differ significantly from their expected values”. Again, we see common ground with our use of an Application Signature. The points we leverage from Trubin’s work are:
  • Identify when performance metrics exceed or fall below expectation
  • Note and record the exceptions
  • Estimate the size of each exception rather than just recording its occurrence
  • Use control charts as a visual tool for examining current performance versus expected performance
 ...
What do you do when a change is identified?
  • Quantify the change. Does your current measurement exceed the Signature by 5%, or 100%? We are considering implementing a technique similar to what was described by Trubin.
  • Grade the change as either good or bad. If a metric increases, is that an indication of a bad change? Not always. Consider workload throughput; an increase in workload throughput is probably a good change. We need to find a way to customize each Application Signature metric to recognize and highlight both good and bad changes.
  • Develop a historical record of changes. Again, this is an idea developed by Trubin. A historical record will provide the application development and support staff with a quantitative description of sensitive application characteristics that may warrant improvement. 
...'
Some other authors' work is referenced as well. I need to read it carefully and will report about it here in other posts. Looking forward to attending that presentation!

Richard and Kiran, thank you for referencing my work!



Tuesday, November 29, 2011

Finding the Edge of Surprise by Rich Olcott

I had definitely overlooked the following very good article by my CMG and IBM acquaintance:

MeasureIT - Issue 5.03 - Finding the Edge of Surprise by Rich Olcott 

At first glance, the article gives a good overview of classical SPC with some original suggestions on how to apply it to IT data. I also like the name of the article, which could be a good short and metaphoric description of the main topic of this entire blog!

BTW, he provided a reference there to my CMG'2004 paper: “Mainframe Global and Workload Levels – Statistical Exception Detection System, Based on MASF,” CMG Proceedings (2004). The link to that paper is published in the very 1st posting of this blog!

And I have already mentioned his previous work in my other posting:
Aug 13, 2007
Dials for a PM Dashboard: Velocity's Missing Twin, and Quantifying Surprise, Rich Olcott
I plan to reread both of his works and add more comments and thoughts....

Wednesday, November 9, 2011

SEDS-Lite: Using Open Source Tools (R, BIRT and MySQL) to Report and Analyze Performance Data

Last Thursday we had a very good Southern Computer Measurement Group meeting with 16 attendees in Richmond, VA, where I presented material on how to use R, BIRT, MySQL and Excel to analyze and report systems' performance data, using real Unix server CPU utilization data for control charting as an example.

The agenda is still on the SCMG website, and my presentation slides are now published and linked there:

SEDS-Lite: Using Open Source Tools (R, BIRT and MySQL) to Report and Analyze Performance Data
(slides).



Monday, October 10, 2011

Is Anomaly Detection Similar to Exception Detection? Apply SEDS for Information Security!

Sometimes I call my "Exception Detection" "Anomaly Detection". In some cases performance degradation could be caused by a parasite program (like a badly written data collection agent), an incompetent user (e.g. one submitting a badly written ad hoc database query) or even a cyber attack (a denial-of-service (DoS) attack definitely degrades performance to the point of not performing at all, doesn't it?).

So in my opinion they are similar, and the Exception Detection methodology I offer, based on the MASF technique, can be applied to the broader field of Information Security. And vice versa: some intrusion detection techniques could be useful for automatic detection of performance issues!

I did a little Google research on that and found a few interesting approaches. Here is one of them:

See the abstract page for the dissertation written by Steven Gianvecchio:

Application of information theory and statistical learning to anomaly detection.


So the question is: "Can information theory (entropy analysis) be applied to performance exception detection?"