Citrix, Splunk & VMware Performance

  • Idea
  • Updated 6 months ago
CareTech Solutions is an information technology (IT) and Web products and services provider for U.S. hospitals and health systems. Since 1998, our dedicated experts have been creating value for clients through customized IT solutions that contribute to improving the patient experience while lowering healthcare costs. From implementing emerging technologies to supporting day-to-day IT operations, CareTech Solutions offers industry-leading health information technology services that help hospitals simplify and manage complex IT initiatives while maximizing their investment in IT.

A client of CareTech Solutions is a healthcare provider that offers care in over 50 facilities spread across several states. Additionally, many of its employees access compute services from homes and small offices over consumer-grade network connections.

To support a daily concurrent user count of over 700 desktop connections, CareTech Solutions architected and implemented a large VMware and Citrix environment built on state-of-the-art hardware from major vendors. Examples include the Cisco Nexus platform for data transport, IBM for FC-attached storage, and HP for compute enclosures and blades. The technologies deployed on this hardware foundation included VMware ESXi, Citrix Application & Desktop Virtualization, Citrix NetScaler Gateway, IBM SAN Volume Controller (SVC), Fibre Channel over Ethernet (FCoE), and a number of monitoring packages to identify issues.

The client reported sporadic complaints of slow response time from some of its users. There were no obvious patterns such as geographic location of the user, time of day, network access method (MPLS, T1, consumer network connection), Citrix server, or ESXi host. The complaints were not predictable, the symptoms were not reproducible, and there was no way to correlate the complaints to any particular log message or error from any discrete hardware or infrastructure layer.

Further investigation into each separate incident identified a pattern and pointed to congestion on the external internet segments. This was especially apparent with employees working from home. Figure 1 is a screenshot from the Citrix InSight Center for one such user. Note the high WAN latency and Client Side Retransmits values.



These abnormal values could be found in every reported incident, and it seemed the problem was easily identified: users accessing the environment over a public internet connection were susceptible to WAN latency and packet retransmissions. Contradicting this conclusion, some users at sites with MPLS and dedicated circuits were also reporting similar performance degradation. There were simply no patterns in any of these complaints. A user at one site might experience slow performance while the person sitting next to her would not.

An enormous amount of effort was spent attempting to identify any pattern or correlation between a single incident and one or more layers in the overall infrastructure. Investigation was further complicated by barriers between technology realms. For example, a Citrix administrator was unable to log into a network switch to review logs. Coordination between many different groups was therefore necessary just to troubleshoot a single reported incident. To facilitate this and make the process as efficient as possible, a daily meeting was set up between all infrastructure groups to go over logs, discuss theories, and plan next steps. Very quickly it became clear that reviewing logs one at a time provided very little value. The data had to be consolidated from all layers (application, hypervisor, server, network, storage, and hardware) and made available to every administrator who was working on the issue. Furthermore, the collected data had to be presented in a way that allowed correlations to be found among the log messages and even the counts of log entries.

CareTech Solutions had recently implemented Splunk Enterprise and several Splunk applications. Because the implementation was still at an early stage, it was used by only a very small number of administrators. However, the senior administrator who championed the implementation was an expert on the platform and was able to assist in the collection and presentation of data from every layer of the client’s infrastructure. Two separate error messages in the ESXi logs stood out immediately. Of those two, one proved critical to the eventual identification and resolution of the core issue. The error described “lost access to volume due to connectivity issues.” This error led CareTech to investigate the storage infrastructure, including the FCoE switches, storage devices, and fiber cables. Since there were multiple paths from every server to every storage device, there was no reason that a host should lose access to any storage. An enormous effort was expended to eliminate each hardware and network segment from the list of potential sources for this message; no root cause was found in any investigated item.
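As a rough illustration of the first step, pulling the "lost access to volume" events out of raw log lines, here is a minimal Python sketch. It is not how Splunk does it and not the team's actual search. The line layout (an ISO 8601 timestamp followed by free text), the host and datastore names, and the sample lines themselves are all assumptions made up for the example; real ESXi log lines may look different.

```python
import re
from datetime import datetime

# Assumed line shape: ISO 8601 timestamp, then free text.
# Real ESXi log lines may differ; adjust the pattern to the actual files.
LOST_ACCESS = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*\s+.*lost access to volume",
    re.IGNORECASE,
)

def lost_access_times(lines):
    """Return the timestamps of lines that report lost access to a volume."""
    times = []
    for line in lines:
        match = LOST_ACCESS.match(line)
        if match:
            times.append(datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S"))
    return times

# Hypothetical sample lines, invented for illustration only.
sample = [
    "2015-01-12T09:14:03.120Z host01 Lost access to volume ds01 due to connectivity issues.",
    "2015-01-12T09:14:05.480Z host01 Successfully restored access to volume ds01.",
    "2015-01-12T09:20:41.007Z host02 Lost access to volume ds02 due to connectivity issues.",
]
events = lost_access_times(sample)
for when in events:
    print(when.isoformat())
```

Once the events are reduced to timestamps like this, they can be lined up against data from any other layer.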

Meanwhile, work continued to correlate the storage errors to log entries from the Citrix environment. Once Splunk was used to consolidate and report on the error messages, a correlation was found. Figures 2 and 3 show a sample of the data collected and reported from Splunk.






These two charts suggest a buildup of client retransmitted packets immediately preceding a number of storage disconnections.
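The kind of comparison these charts make can be sketched in a few lines of Python: bucket the retransmit events into fixed time windows, then look at how many fall in the windows just before each storage disconnection. This is only a sketch under assumptions. The 60-second bucket, the three-bucket lookback, and the sample timestamps are hypothetical and are not taken from the client's data.

```python
from collections import Counter

BUCKET_SECONDS = 60  # assumed window size; tune to the data

def bucket_counts(timestamps, bucket_seconds=BUCKET_SECONDS):
    """Count events per fixed-width time bucket (timestamps in epoch seconds)."""
    return Counter(int(ts) // bucket_seconds for ts in timestamps)

def retransmits_before(disconnects, retransmits, lookback=3,
                       bucket_seconds=BUCKET_SECONDS):
    """For each disconnect, sum the retransmit counts in the `lookback`
    buckets that come before the disconnect's own bucket."""
    counts = bucket_counts(retransmits, bucket_seconds)
    result = {}
    for ts in disconnects:
        bucket = int(ts) // bucket_seconds
        result[ts] = sum(counts[b] for b in range(bucket - lookback, bucket))
    return result

# Hypothetical event times in seconds, invented for illustration only.
retransmit_times = [10, 70, 75, 130, 135, 140, 150, 400]
disconnect_times = [185]
preceding = retransmits_before(disconnect_times, retransmit_times)
print(preceding)
```

A count that climbs bucket by bucket ahead of each disconnect is the pattern the charts point to; a flat count would argue against it.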

Following each disconnect message, the Citrix ICA round trip time values begin to fluctuate. This is not easily observable in the consolidated data presented in Figure 2 but is evident in an individual session as illustrated in Figure 4.
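One simple way to put a number on "begins to fluctuate" is to compare the spread of round trip times shortly before and shortly after a disconnect message. The Python sketch below does that for a single session. The 40-second window, the use of standard deviation as the measure, and the sample values are assumptions for illustration, not measurements from the environment.

```python
from statistics import pstdev

def rtt_spread(samples, event_time, window):
    """Return (before, after): the population standard deviation of RTT
    values within `window` seconds on each side of the event.
    `samples` is a list of (epoch_seconds, rtt_ms) pairs."""
    before = [rtt for ts, rtt in samples if event_time - window <= ts < event_time]
    after = [rtt for ts, rtt in samples if event_time <= ts < event_time + window]
    # pstdev needs at least one value; report None when a side is empty.
    return (pstdev(before) if before else None,
            pstdev(after) if after else None)

# Hypothetical (time, ICA RTT in ms) samples, invented for illustration only.
samples = [(0, 40), (10, 42), (20, 40), (30, 42),
           (40, 40), (50, 200), (60, 45), (70, 310)]
before, after = rtt_spread(samples, event_time=40, window=40)
print(before, after)
```

A much larger spread after the event than before it is what a fluctuating session looks like in this measure.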



Furthermore, traffic from the Citrix server seems to stop as shown in Figure 5.



It was now clear that the root cause was not an external internet segment or even a Citrix performance problem. There was a problem with the storage layer, but not at a physical or communication level. Those areas had been ruled out, in part because of hardware redundancies built into the environment and also because of further research through Splunk.

Investigation expanded beyond the client’s environment and into CareTech Solutions’ other customers. Although no other client was presenting similar symptoms, similar storage messages were reported through Splunk. The messages appeared not only across several different storage devices but also across several different physical data centers.

Focus was shifted away from the client’s immediate environment and onto the storage disconnection messages. Dashboards were created in Splunk to further prove this theory and to provide confirmation of any positive or negative impact once a solution was implemented.

Continued research eventually led CareTech Solutions to discover a change in how VMware handles file locking, introduced in a software update. SCSI reservations were replaced with Atomic Test and Set (ATS) algorithms for storage devices that support hardware acceleration. The firmware version on the IBM Storwize V7000 storage devices did not support this algorithm, and this incompatibility was the root cause of the error messages. Immediately after turning off ATS and reverting to SCSI reservations, CareTech Solutions observed a positive impact on the client’s Citrix environment. The storage error messages were eliminated and the users reported a significant decrease in response time within Citrix sessions.

Once the logs were directed to and processed by Splunk, the technical team was able to correlate entries from the various infrastructure and software layers along a single timeline. Identifying the complaining devices and focusing on the interaction between those devices helped narrow the effort toward the cause that was eventually discovered.
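The idea of a single timeline can be shown with a small Python sketch that merges per-layer event streams by timestamp. This is an illustration of the concept only, not a description of how Splunk works internally. The layer names, timestamps, and messages are hypothetical.

```python
import heapq

def merged_timeline(*streams):
    """Merge per-layer event streams, each already sorted by timestamp,
    into one list ordered by timestamp. Events are (time, layer, message)."""
    return list(heapq.merge(*streams, key=lambda event: event[0]))

# Hypothetical events, invented for illustration only.
storage = [(100, "storage", "lost access to volume"),
           (160, "storage", "access restored")]
citrix = [(90, "citrix", "client retransmits rising"),
          (105, "citrix", "ICA RTT spike")]
network = [(120, "network", "interface counters polled")]

timeline = merged_timeline(storage, citrix, network)
for when, layer, message in timeline:
    print(when, layer, message)
```

With every layer on one ordered list, an administrator from any group can see what each device reported just before and just after a given event without needing access to the other groups' systems.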

sarahjohn


Posted 6 months ago
