System Management by Exception: May 2017

Wednesday, May 17, 2017

Eli Hizkiyev: ConicIT – Ground Breaking Technology Unleashed. Sense and Respond? Why Not Predict and Prevent?

ConicIT Summary

Take your existing performance monitoring environment to the next level. ConicIT, a software solution, reads thousands of performance and stability metrics per minute from your performance monitors such as TMON, Omegamon, Mainview, Sysview and others. ConicIT processes and analyzes these metrics with machine learning technology and automatically generates alerts about problems that even seasoned performance professionals might not notice when looking at the same data.

Beyond the automatic analytics and alerts, ConicIT provides an efficient, friendly web interface that lets you browse relevant performance data in aggregated form, including values and graphs from the very moment a problem occurs. With ConicIT in place, you won’t need to tediously jump between many different monitors or screens. ConicIT aggregates the data from different sources into a single view, so you can easily get either high-level or low-level insights into your application performance.

ConicIT also creates important calculated variables. Examples include ratios, summaries, and derived values such as the real-time CPU consumption of jobs and transactions, which is computed from the cumulative CPU time of each job or transaction. Much of this information is missing from all monitors. The real-time CPU consumption is calculated from the rate at which the CPU time rises during each minute.
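The rate calculation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ConicIT's implementation: the function name `cpu_rates`, the sample layout, and the handling of counter resets are all assumptions made for the example.

```python
def cpu_rates(samples):
    """Convert cumulative CPU-time samples into per-interval CPU consumption.

    samples: list of (timestamp_seconds, cumulative_cpu_seconds), sorted by time.
    Returns one value per interval: CPU seconds used per elapsed second
    (1.0 means one fully busy CPU). None marks an interval that cannot be
    rated, for example after a counter reset when a job restarts.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        used = c1 - c0
        if elapsed <= 0 or used < 0:
            rates.append(None)  # bad timestamps or counter reset
        else:
            rates.append(used / elapsed)
    return rates


# Hypothetical one-minute samples for a single job (not real monitor data).
samples = [(0, 0.0), (60, 6.0), (120, 36.0), (180, 36.0), (240, 2.0)]
rates = cpu_rates(samples)
print(rates)
```

In this made-up series the job uses 6 CPU seconds in the first minute and 30 in the second, so the rate rises from about 10% to 50% of one CPU even though the cumulative counter alone would not make that obvious.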

One of the major advantages of ConicIT is its dynamic alerts, which are based on machine learning and statistical algorithms. Traditional monitors offer simple static alerts based on thresholds. But static alerts tend to come too late, and most of them are false alerts. ConicIT solves this problem with its advanced algorithms. ConicIT automatically studies the typical behavior of each metric for every day of the week and every hour of the day. So ConicIT knows (and shows you) the expected range for each performance metric. ConicIT also learns how stable each variable is, and how often and for how long it may be out of its normal range. Based on this analysis, ConicIT recognizes when there is an anomaly in one or more metrics in the system. In such a case ConicIT will send you an alert with information and graphs about the problem. These proactive alerts come much earlier and are more accurate than those of any static-alert performance system. ConicIT gives you time to solve problems before they affect your end-users, clients and customers.
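The post does not describe ConicIT's actual algorithms, so the following is only a minimal statistical stand-in for the idea of a learned expected range plus a tolerance for short excursions. It assumes a percentile band per (weekday, hour) slot and a fixed persistence count; the names `learn_ranges` and `detect` and all of the data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import quantiles


def learn_ranges(history, low_pct=5, high_pct=95):
    """Learn an expected (low, high) range per (weekday, hour) slot.

    history: iterable of (datetime, value) pairs from a period considered normal.
    """
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    ranges = {}
    for slot, values in buckets.items():
        if len(values) < 2:
            continue  # not enough data to estimate percentiles
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        ranges[slot] = (cuts[low_pct - 1], cuts[high_pct - 1])
    return ranges


def detect(stream, ranges, persistence=3):
    """Yield the timestamp at which a value has been outside its expected
    range for `persistence` consecutive samples."""
    run = 0
    for ts, value in stream:
        expected = ranges.get((ts.weekday(), ts.hour))
        if expected is None:
            run = 0  # no baseline for this slot, so nothing to compare against
            continue
        low, high = expected
        if value < low or value > high:
            run += 1
            if run == persistence:
                yield ts
        else:
            run = 0


# Made-up history: four Mondays, 09:00-09:59, values between 45 and 55.
base = datetime(2017, 5, 1, 9, 0)
history = [(base + timedelta(weeks=w, minutes=m), 50 + (m % 11) - 5)
           for w in range(4) for m in range(60)]
ranges = learn_ranges(history)

# Made-up live data for the following Monday: one isolated spike, then a sustained one.
live_start = base + timedelta(weeks=4)
live = [(live_start + timedelta(minutes=i), v)
        for i, v in enumerate([50, 90, 50, 90, 90, 90, 50])]
alerts = list(detect(live, ranges))
print(alerts)
```

The isolated spike at minute 1 is ignored, while the sustained excursion raises a single alert at minute 5. A static threshold would have to be tuned separately for every metric and every time of day to behave the same way.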

Early proactive alerts that arrive when the problem starts, combined with supporting information and graphs, allow you and your team to quickly pinpoint where the problem began and which team should work on resolving it. Thus, ConicIT reduces the need for war rooms and shortens the mean time to repair.

Figure 1: 30 hours graph

Figure 2: It takes a single click (on the left menu) to switch and view any type of information from any point in time

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he received his PhD in Robotics from St. Petersburg Technical University (Russia) and then worked for 12 years as a professor teaching CAD/CAM and Robotics. He published 30+ papers and made several conference presentations in the fields of Robotics and Artificial Intelligence. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other technical presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.com). He worked for 2 years as the Capacity team lead for IBM, for 3 years at SunTrust Bank, and then for 3 years at IBM as Sr. IT Architect. He now works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. He runs the UT channel iTrubin.

Tuesday, May 9, 2017

Re-posting CMG blog posts: Lifting the Cloud of Obscurity from your Cloud Deployment - #cmgnews

Application Performance Management in Cloud and more.... See the post in CMG Amplify blog:

https://www.cmg.org/2017/05/priyanka-aroralifting-cloud-obscurity-cloud-deployment/

Igor Trubin

Wednesday, May 3, 2017

The effect of outliers on statistical properties - Anscombe's quartet

Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. He described the article as being intended to attack the impression among statisticians that "numerical calculations are exact, but graphs are rough." [1]

Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
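The claim of near-identical descriptive statistics can be checked with a short Python sketch. The data values below are the commonly published quartet, transcribed by hand for this example, so they are worth comparing against the source linked above before relying on them. The helper names are illustrative.

```python
from math import sqrt

# Anscombe's quartet as commonly published; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (mean of x, mean of y, correlation) for each of the four datasets.
summary = {name: (mean(xs), mean(ys), pearson(xs, ys))
           for name, (xs, ys) in quartet.items()}
for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

All four rows print essentially the same means and correlation, even though set III owes its fit to a single outlier and set IV to a single high-leverage point. That is the reason for plotting the data before trusting the summary numbers.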

(You can easily check this in R by loading the data with data(anscombe).) But what you might not realize is that it's possible to generate bivariate data with a given mean, median, and correlation in any shape you like — even a dinosaur:

Source: The Datasaurus Dozen

Posted: 02 May 2017 08:16 AM PDT

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Igor Trubin

Reposting CMG blog posts. imPACt 2017 – What to do in New Orleans (#CMGnews)

Do you plan to attend this year's imPACt conference? I do, and I am glad it will be in New Orleans!
_______________

Founded by the French, ruled by the Spanish, and bought by the US…New Orleans is known for its distinct Creole culture, Jazz, Mardi Gras and many other attributes that give this city a powerful sense of identity. So, what do you do beyond attending technical sessions?

In upcoming posts we’ll share a few ideas with you. To get you started, here are a few museums that may be of interest.

The National World War II Museum: The museum tells the story of the American Experience in the war that changed the world (#4 on TripAdvisor Top 10 list of museums in the USA)
New Orleans Historic Voodoo Museum: Well, of course you’re going to find a voodoo museum in NOLA…how could you not?
New Orleans Pharmacy Museum: This does make sense…Louisiana is the birthplace of modern pharmacy and New Orleans resident, Louis Dufilho, was America’s first licensed pharmacist.

Source: https://www.cmg.org/2017/04/impact-2017-new-orleans/

Igor Trubin

Monday, May 1, 2017

The 8th ACM/SPEC International Conference on Performance Engineering: MASF and Control charts

ICPE '17 - conference site.

The following paper gives a nice summary of how I use MASF and control charts, and then proposes a similar but improved ...

Technique for Detecting Early-Warning Signals of Performance Deterioration in Large Scale Software Systems
Raghu Ramakrishnan † Tata Consultancy Services Noida, UP, INDIA
Arvinder Kaur † USICT, Guru Gobind Singh Indraprastha University Dwarka, Delhi, INDIA

ABSTRACT The detection of early-warning signals of performance deterioration can help technical support teams in taking swift remedial actions, thus ensuring rigor in production support operations of large scale software systems. Performance anomalies or deterioration, if left unattended, often result in system slowness and unavailability. In this paper, we present a simple, intuitive and low-overhead technique for recognizing the early warning signs in near real time before they impact the system. The technique is based on the inverse relationship which exists between throughput and average response time in a closed system. Because of this relationship, a significant increase in the average system response time causes an abrupt fall in system throughput. To identify such occurrences automatically, Individuals and Moving Range (XmR) control charts are used. We also provide a case study from a real-world production system, in which the technique has been successfully used. The use of this technique has reduced the occurrence of performance related incidents significantly in our daily operations. The technique is tool agnostic and can also be easily implemented in popular system monitoring tools by building custom extensions.
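For readers unfamiliar with XmR charts, here is a minimal Python sketch of how the limits are usually derived. It uses the standard textbook constants for a moving range of two consecutive points (2.66 for the individuals chart, 3.267 for the moving-range chart). It is not the paper's implementation, and the throughput numbers and function names are invented for the example.

```python
def xmr_limits(baseline):
    """Compute XmR limits from a baseline series.

    Returns (centre, lower, upper, mr_upper): the centre line and natural
    process limits of the individuals (X) chart, and the upper limit of the
    moving-range (mR) chart.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline points")
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    centre = sum(baseline) / len(baseline)
    return centre, centre - 2.66 * mr_bar, centre + 2.66 * mr_bar, 3.267 * mr_bar


def out_of_control(values, limits):
    """Return the indexes of points that break the X limits or the mR limit."""
    _, lower, upper, mr_upper = limits
    flagged = []
    previous = None
    for i, value in enumerate(values):
        jump = abs(value - previous) if previous is not None else 0.0
        if value < lower or value > upper or jump > mr_upper:
            flagged.append(i)
        previous = value
    return flagged


# Invented throughput samples (transactions per interval) from a stable period.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
limits = xmr_limits(baseline)

# New samples: the abrupt fall to 80 breaks the lower limit, and the jump
# back to 100 breaks the moving-range limit.
flagged = out_of_control([101, 99, 80, 100], limits)
print(limits, flagged)
```

Because the limits come from the average point-to-point movement rather than from a fixed threshold, the same code can be applied to throughput or response time without per-metric tuning, which fits the abstract's description of the technique as tool agnostic.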

".....The use of control charts, MASF and its variations for monitoring software systems was proposed by Trubin et al. [24][25][26][27]. MASF partitions the time during which the system is operational, into hourly, daily or weekly reference segments to characterize repeatable or similar workload behavior experienced by a software system [8]. For example, the workload encountered by the system on Monday between 9:00 a.m. - 10:00 a.m. may be different from the workload between 10:00 a.m. - 11:00 a.m. Each segment is characterized by its mean and standard deviation. The number of reference sets can be further reduced using clustering techniques. The upper and lower limits are established for each reference at three standard deviations from the mean..."

[24] I. Trubin. Review of IT Control Chart. Journal of Emerging Trends in Computing and Information Sciences, 4(11):857–868, Dec. 2013.
[25] I. Trubin and V. C. Scmg. Capturing workload pathology by statistical exception detection system. In Proceedings of the Computer Measurement Group, 2005.
[26] I. A. Trubin. Global and Application Level Exception Detection System, Based on MASF Technique. In 28th International Computer Measurement Group Conference, December 8-13, 2002, Reno, Nevada, USA, Proceedings, pages 557–566, 2002.
[27] I. A. Trubin and L. Merritt. Mainframe Global and Workload Level Statistical Exception Detection System, Based on MASF. In 30th International Computer Measurement Group Conference, December 5-10, 2004, Las Vegas, Nevada, USA, Proceedings, pages 671–678, 2004.

Full text: http://dl.acm.org/citation.cfm?id=3044533
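The MASF scheme summarized in the excerpt above (reference segments, each characterized by its mean and standard deviation, with limits at three standard deviations) can be sketched as follows. This is a simplified illustration, assuming hourly segments keyed by weekday and hour and using the sample standard deviation. The function names and data are hypothetical, and the clustering of reference sets is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def masf_limits(history, k=3.0):
    """Build {(weekday, hour): (lower, upper)} from (datetime, value) history."""
    segments = defaultdict(list)
    for ts, value in history:
        segments[(ts.weekday(), ts.hour)].append(value)
    limits = {}
    for segment, values in segments.items():
        if len(values) < 2:
            continue  # a standard deviation needs at least two points
        centre, spread = mean(values), stdev(values)
        limits[segment] = (centre - k * spread, centre + k * spread)
    return limits


def exceptions(observations, limits):
    """Return the observations that fall outside the limits of their segment."""
    found = []
    for ts, value in observations:
        bounds = limits.get((ts.weekday(), ts.hour))
        if bounds is not None and not (bounds[0] <= value <= bounds[1]):
            found.append((ts, value))
    return found


# Invented history: five Mondays, one sample in the 9:00 hour and one in the 10:00 hour.
first_monday = datetime(2017, 4, 3, 9, 30)
nine = [10, 12, 14, 12, 12]
ten = [50, 52, 48, 50, 50]
history = []
for week in range(5):
    day = first_monday + timedelta(weeks=week)
    history.append((day, nine[week]))
    history.append((day + timedelta(hours=1), ten[week]))
limits = masf_limits(history)

# The same value of 50 is an exception at 9:30 but normal at 10:30.
next_monday = first_monday + timedelta(weeks=5)
observed = [(next_monday, 50), (next_monday + timedelta(hours=1), 50)]
flagged = exceptions(observed, limits)
print(flagged)
```

The example shows why the reference segments matter: a workload level that is routine between 10:00 and 11:00 is flagged when it appears an hour earlier, which a single global threshold could not express.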

Igor Trubin
