
Data warehousing, Business Intelligence, Big Data, Advanced Analytics and Data Science are well-known and prevalent practices in the business application space. Business and IT have been working together to harness the data coming out of business transactions, derive insight, produce actionable information and create appropriate business products in the most cost-effective manner. The need for IT business intelligence is well known, but equally important is the need for IT operational intelligence. We need a comparable approach in the infrastructure domain: one that collects attributes related to infrastructure transactions and enables teams to derive actionable insight and improve the infrastructure setup. Examples include configuring objects, provisioning adequate storage, tuning the tier 1 to tier 3 storage stack, modifying cache, enabling and distributing processor cores in a virtualized environment, balancing load on the fly across different sets of VMs, and many more.

IOA (Infrastructure Operation Analytics) is the subject area that deals with creating a healthy infrastructure stack by examining the underlying infrastructure transactions and attributes. Automation aided by IOA helps the infrastructure configuration (such as vCPU, RAM, cache, VIO bandwidth etc.) adjust itself by learning as the IT life cycle stages proceed. In the long run, an IOA solution should evolve into a self-sustaining model. In the advanced analytics context, data analytics focused on IT infrastructure gives businesses the ability to predict IT system failures before they impact the bottom line, and to uncover potential money-saving opportunities.

Proactive alerting is about preventing small issues from becoming big problems. What are the key symptoms that should alert IT staff that something is not going well in the infrastructure setup? In an ideal world, everything is tested and works perfectly when deployed to production. Reality works very differently, requiring IT Operations teams to maintain real-time visibility into the performance of production applications.

Is there a better way to develop trend-based early-warning alerts for the entire production environment so that IT teams can proactively identify and fix problems fast? How can we achieve a real-time monitoring setup that lets our IT staff drill into the components and evaluate the situation? Which components are not giving good throughput for a particular transaction? For example, application TAT (turnaround time) is affected by slow Informix DB response time only when a user queries “Get me available seats in all Delta aircraft flying outbound to SFO between 0700 and 1000 hrs PST”. Can we get to the underlying reasons, such as “too many parameters, with 50% of the conditional parameters involving text in the conditional clause”, so that we can ask the DBA and the coder to fine-tune the SQL and address the problem? Sometimes we encounter a delay in response at the POS counter while making a payment. How can we set up a system that gives us insight into the “why”? Could it be that multiple rounds of SSL handshakes per transaction are consuming a good amount of time?
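One simple way to think about a trend-based early-warning alert is to fit a straight line through recent response-time samples and estimate how soon that trend would cross an agreed threshold. The Python sketch below is only an illustration under assumptions: the samples are evenly spaced, the function names and the 200 ms threshold are hypothetical, and a production setup would probably need something more robust than a least-squares line.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of a least-squares line through evenly spaced samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def intervals_until_breach(samples: Sequence[float], threshold: float) -> Optional[float]:
    """Estimate how many sampling intervals remain before the fitted trend reaches threshold.

    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the trend is flat or falling (no breach projected).
    """
    slope, intercept = linear_trend(samples)
    latest_fit = intercept + slope * (len(samples) - 1)
    if latest_fit >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - latest_fit) / slope


if __name__ == "__main__":
    # Hypothetical DB response times in milliseconds, one sample per minute.
    response_ms = [100, 110, 120, 130, 140]
    remaining = intervals_until_breach(response_ms, threshold=200)
    if remaining is not None and remaining <= 10:
        print(f"Early warning: threshold projected in about {remaining:.1f} intervals")
```

The point of alerting on the projected breach rather than on the current value is that staff get time to act while the system is still within limits.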

In this evolving business and fast-changing IT landscape, we need to identify the areas that must be brought under close scrutiny by designing and defining adequate attributes. These attributes should provide the information needed to carry out the analysis. In this era of continuous system advancement, processes and underlying mechanisms should be built on a self-learning model. Cloud solutions and the changing landscape brought by SMAC, IoT, virtualization, SDDC, SDN and similar software-defined services will further complicate that scrutiny if we don’t put a long-term, strategic, top-down IOA framework in place early on.

I feel the following areas are crucial and need to be brought under close scrutiny:

(a) End User Intelligence – This is about connecting the response a user is getting (let’s say webpage load time) to the underlying performance of the infrastructure stack, including storage, app server, network speed, MF load scheduler etc.

(b) Proactive Remediation – IOA should connect to all underlying components so that L2/L3 staff can address the cause of slowness before it becomes a problem and an app server or database server stops and stalls the business.

(c) Continuous Optimization – This is an ITIL requirement in the process maturity model. The IOA setup needs a process for adding metrics with permissible threshold values that alert the staff. This pushes the team and the process to optimize components irrespective of whether anyone is complaining. It is more about prevention.

(d) Application Optimization – The IOA setup needs to be able to gauge the change in application performance for any underlying change in configuration, so that teams can jointly tweak things and expedite the rollout.

(e) IT Strategic Decisions and Planning – I think this is the core of the IOA requirement. System administrators, storage analysts, capacity planners etc. should have adequate information to make quick decisions and keep systems healthy all the time. Application migration, re-engineering of an application module and decommissioning of old legacy systems are some examples that a good IOA solution should address.

(f) Security and Compliance – An IOA solution should address risk identification and mitigation strategy. It should answer questions such as: what triggers unusual types of transactions, and which requests look like a brute-force FTP attack from overseas IP addresses? Last month we saw the password management application LastPass face a brute-force attack in which some emails and profiles were compromised. Could that have been prevented if the company had a good monitoring system aided by an IOA solution?

The list can go on and on, but I feel the above are a few key areas that need to be addressed through IOA. In other words, these are key functional requirements when designing an IOA solution.
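To make the Security and Compliance requirement a little more concrete, here is a minimal Python sketch of how repeated failed logins from one source address could be flagged. Everything in it is an assumption for illustration: the event tuple layout, the function name, and the limits of 5 failures in 60 seconds are hypothetical, and real brute-force detection would need far more context (geo-location, user accounts, protocol details).

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple

# Hypothetical event shape: (timestamp in seconds, source IP, login succeeded?)
LoginEvent = Tuple[float, str, bool]


def flag_brute_force(
    events: Iterable[LoginEvent],
    max_failures: int = 5,
    window_seconds: float = 60.0,
) -> Set[str]:
    """Return source IPs with more than max_failures failed logins inside any window.

    Events are assumed to arrive in timestamp order.
    """
    recent_failures = defaultdict(deque)  # ip -> timestamps of recent failed logins
    flagged = set()
    for timestamp, source_ip, succeeded in events:
        if succeeded:
            continue
        failures = recent_failures[source_ip]
        failures.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while failures and timestamp - failures[0] > window_seconds:
            failures.popleft()
        if len(failures) > max_failures:
            flagged.add(source_ip)
    return flagged


if __name__ == "__main__":
    # Addresses are from the documentation-only ranges, not real hosts.
    sample = [(float(t), "203.0.113.7", False) for t in range(6)]
    sample += [(0.0, "198.51.100.2", False), (100.0, "198.51.100.2", False)]
    sample.sort(key=lambda e: e[0])
    print(flag_brute_force(sample))
```

The sliding window matters here: a user who mistypes a password a few times over a day should not be treated the same as a source that fails six times in six seconds.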

Unlike a regular monitoring setup that only shows what users are doing and experiencing on the front end, how can we correlate user activity and experience with the performance of the back-end IT infrastructure? In other words, we need a solution that tells us not only the “what” but also the “why”. Most IT organizations purchase monitoring tools to meet narrow requirements rather than according to a strategic, overarching plan. The result is an ad hoc accumulation of niche products that exist in silos, not a cohesive IT operational intelligence framework that helps the IT establishment mature. In the IOA paradigm we need to identify the sources of data that carry information of potential strategic value. Machine data, including logging provided by processes and tools (SNMP, WMI etc.), speaks about system internals and the system load or stress at that moment, and can help in capacity planning. Agent data, consisting of call-stack sampling, custom logging, byte-code instrumentation etc., can help diagnose code on a SaaS platform. External data from synthetic transactions shows how application transactions behave around the world. Data travelling over the network can produce insight at a finer granularity through HTTP traffic analysis, NetFlow, application throughput, security information, packet density etc. All these sources of data can have a significant positive impact on the IOA landscape.
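As a rough illustration of moving from the “what” to the “why”, the Python sketch below ranks back-end metrics by how strongly each one moves together with a front-end measurement such as page load time. The metric names and numbers are hypothetical, the series are assumed to be sampled at the same instants, and a high correlation is only a hint about where to look first, not proof of cause.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least two")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_backend_metrics(
    page_load_ms: Sequence[float],
    backend: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Sort back-end metrics by absolute correlation with front-end page load time."""
    scores = {name: pearson(page_load_ms, series) for name, series in backend.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical samples taken at the same four instants.
    page_load = [200, 400, 600, 800]
    backend_metrics = {
        "db_latency_ms": [10, 20, 30, 40],
        "app_cpu_percent": [50, 50, 50, 50],
    }
    for name, score in rank_backend_metrics(page_load, backend_metrics):
        print(f"{name}: {score:+.2f}")
```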

The challenge is to turn this magnitude of data into something meaningful, with context, while establishing relationships among the attributes. Meaningful information can simplify the process by involving specialists such as the security team, network engineers, DBAs, data analysts, storage administrators, application architects etc. We need to depart from the bottom-up approach, in which we collect data from each infrastructure component, then aggregate and roll it up to a reporting server. By that time the data is old or even obsolete; this can be useful for some kinds of reporting but does not help with real-time or instant analysis and reporting. In my opinion we need to find ways to extract information from the wire and build a real-time analytical model. The network is the only entity that connects all infrastructure components, viz. storage, servers, application executables, DB/file I/O, user interfaces such as the browser etc. Components talk to each other following certain protocols, whether HTTP/S, NFS, TCP, IP, Telnet, FTP or many more. An optimal solution requires, at a minimum, identifying end points such as IP and MAC addresses, extracting the data and deciphering the underlying protocol to establish context and deliver real-time value. A few vendors have already been working in this area and products are available to some extent; however, this still has a long way to go. Capability needs to be built to process network packets at high speed while connecting all the dots to give them meaning. A Big Data solution could be leveraged to address this.
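To give a feel for what “identifying end points and deciphering the protocol” means at the lowest level, here is a small Python sketch that reads the source and destination addresses and ports from the raw bytes of an IPv4 packet carrying TCP. It is a sketch under assumptions: the function name is hypothetical, the bytes are assumed to start at the IP header (no Ethernet frame), and guessing the application protocol from a well-known port is only a heuristic that real wire-data products go well beyond.

```python
import socket
import struct
from typing import Dict, Union

# Heuristic mapping only: services can run on any port.
WELL_KNOWN_PORTS = {21: "FTP", 23: "Telnet", 80: "HTTP", 443: "HTTPS", 2049: "NFS"}

TCP_PROTOCOL_NUMBER = 6


def describe_ipv4_tcp(packet: bytes) -> Dict[str, Union[str, int]]:
    """Extract end points and a guessed application protocol from an IPv4/TCP packet."""
    if len(packet) < 20:
        raise ValueError("too short to contain an IPv4 header")
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4  # IHL field counts 32-bit words
    if version != 4 or header_len < 20:
        raise ValueError("not a valid IPv4 header")
    if packet[9] != TCP_PROTOCOL_NUMBER:
        raise ValueError("payload is not TCP")
    if len(packet) < header_len + 4:
        raise ValueError("too short to contain TCP ports")
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    # The first four bytes of the TCP header are the source and destination ports.
    src_port, dst_port = struct.unpack("!HH", packet[header_len:header_len + 4])
    application = (
        WELL_KNOWN_PORTS.get(dst_port) or WELL_KNOWN_PORTS.get(src_port) or "unknown"
    )
    return {
        "src_ip": src_ip,
        "src_port": src_port,
        "dst_ip": dst_ip,
        "dst_port": dst_port,
        "application": application,
    }


if __name__ == "__main__":
    # Hand-built packet for demonstration; checksums are left as zero.
    ip_header = struct.pack(
        "!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, TCP_PROTOCOL_NUMBER, 0,
        socket.inet_aton("10.0.0.5"), socket.inet_aton("10.0.0.9"),
    )
    tcp_header = struct.pack("!HH", 51000, 443) + b"\x00" * 16
    print(describe_ipv4_tcp(ip_header + tcp_header))
```

Once each packet is reduced to a record like this, the records can be grouped by end point and protocol to give the context the paragraph above asks for.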


An integration layer will need to be designed to collect all such real-time information from multiple network segments across the whole infrastructure stack and work with an analytical tool (e.g. QlikView, Tableau, Cognos etc.) to reflect the underlying machine behavior. This could be similar to a staging ODS (Operational Data Store) with an OLAP (Online Analytical Processing) tool on top of it. The ODS can further offload data to a DWH for in-depth analysis such as pattern, trend, linear regression and exploratory analysis. There needs to be a context analyzer or engine to build the semantics of the underlying behavior, with appropriate relationships to other control points. IT staff should be able to quickly define metrics to track performance through an appropriate user interface. Our IOA setup should be built with business questions in mind, such as: how many sessions are open at any point in time, what is the CPU utilization breakdown by type of business transaction, and how does application response or latency depend on distant distributed applications and network bandwidth?
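The question “how many sessions are open at any point in time” is a good example of a metric that is easy to define once session start and end times are collected. The Python sketch below is a minimal illustration, assuming each session is a hypothetical (start, end) pair in the same time unit and that a session counts as open from its start up to, but not including, its end.

```python
from typing import List, Optional, Sequence, Tuple

Session = Tuple[float, float]  # hypothetical (start, end) pair, with end > start


def open_sessions_at(sessions: Sequence[Session], moment: float) -> int:
    """Count sessions open at the given moment (start inclusive, end exclusive)."""
    return sum(1 for start, end in sessions if start <= moment < end)


def peak_open_sessions(sessions: Sequence[Session]) -> Tuple[int, Optional[float]]:
    """Return (peak concurrent sessions, time the peak was first reached)."""
    events: List[Tuple[float, int]] = []
    for start, end in sessions:
        events.append((start, 1))   # session opens
        events.append((end, -1))    # session closes
    # Tuples sort by time, then by delta, so closes come before opens at the same time.
    events.sort()
    open_now = 0
    peak = 0
    peak_time: Optional[float] = None
    for moment, delta in events:
        open_now += delta
        if open_now > peak:
            peak, peak_time = open_now, moment
    return peak, peak_time


if __name__ == "__main__":
    # Hypothetical session windows, in minutes since the start of the day.
    sessions = [(0, 10), (5, 15), (8, 12), (20, 25)]
    print("open at minute 9:", open_sessions_at(sessions, 9))
    print("peak (count, time):", peak_open_sessions(sessions))
```

The same sweep over open and close events would work on data streamed from the integration layer, which is what makes it a reasonable fit for the real-time questions listed above.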

Summary

I have attempted to highlight the need for an IOA type of solution within the wider ITSM space in this fast-changing IT landscape. I have also touched upon a few critical success factors for implementing and using IOA, which helps nourish infrastructure components, improve the complete IT establishment and keep the business smart and healthy. Operations teams stand at the intersection of IT and the business. Increasingly, business success will depend on how quickly and how well these IT Operations teams respond to new demands. IOA is an area that needs a good amount of R&D, from identifying the right data sources, to transforming and cleansing data, to creating an adequate data model, to performing analysis on the fly. Infrastructure generates tons of data, and we need to develop the solutions and the ability to work this data into actionable information, transforming infrastructure into a zero-downtime, fast, accurate and risk-free ecosystem that lets the business grow fearlessly. I am trying to encourage all IT and business users to think in the direction of Infrastructure Operation Analytics and to complement each other with ideas, parameters and data points that help the IT and business environments grow together. I don’t, however, want to extend IOA to cover the whole of business application behavior and analytics. But IOA combined with a BI solution can be a big boon to the IT establishment with regard to performance, health, throughput, business analytics, innovation etc.

Infrastructure Operation Analytics: Healthy Infrastructure keeps business healthy