Director Under the Hood: New Users

Intro

Director is Citrix’s new metrics and monitoring dashboard. The interface is modern and the emphasis is on real-time information about your users. It consolidates information about your environment and makes it easy to differentiate between applications and desktops. If your only experience has been with EdgeSight in the past then you’ll see Director as a breath of fresh air.

There are a lot of good views and data in the new Citrix Director, and the “one pane of glass” view of your environment is pursued by all 3rd-party monitoring, reporting, and alerting vendors. Unfortunately, it’s not easy to get all the same data I’ve gathered in the past from the Director database. In this post we’ll look at tracking new users connecting to your Citrix environment.

For information on the database schema, read my previous article on Director.

New Users

I collect lots of metrics to report on my environment. One of the ones I track is the number of new users that connect to my Citrix environment. I view this metric as speaking to the overall adoption rate of my Citrix platform as well as a leading indicator for growth. Can we find this info in the Director Trends dashboard?

The short answer is no. The long answer is noooooooooooooooooooooooooooooo. In fact, it is not possible to track this in EdgeSight either. In a previous job, we worked around this by adding a USER table to the EdgeSight database and then running a query to compare the unique users who logged in that past month against the USER table. Whoever did not show up in the USER table was considered new.

SELECT DISTINCT [user]
FROM vw_ctrx_archive_server_start_perf AS ESData
WHERE [user] <> 'UNKNOWN'
  AND convert(varchar(10), time_stamp, 111) BETWEEN '2016/05/01' AND '2016/05/31'
  AND NOT EXISTS
      (SELECT DISTINCT userid
       FROM userarchive
       WHERE userarchive.userid = ESData.[user])
ORDER BY [user]

The above query gets all the unique users who logged in between May 1st and May 31st (using the EdgeSight view vw_ctrx_archive_server_start_perf). It then compares this list against the userarchive table that we created to store the username and some other data about our users. This gave us a count of new users in our Citrix environment. Once we completed our monthly reporting, we added these new users to the userarchive table.

You say, “That’s great Alain. Wow! How the heck do I do this in Director?”

I say…

“SQL To the Rescue!”

For this query I’m using only one table:

MonitorData.User (Table)

I select the month and year and then count the usernames for that month and year. The great thing about this table is that a new row is created automatically the first time a user connects to the system. So the following query gives you an easy way to see the new users who connected to your Citrix environment each month.

SELECT convert(char(9),datename(month,CreatedDate)) + ' '
+ convert(char(4),datepart(year,CreatedDate)) as 'Month',
count (Username) as 'New Users'
FROM MonitorData.[User]
GROUP BY convert(char(9),datename(month,CreatedDate)) + ' '
+ convert(char(4),datepart(year,CreatedDate))

MonitorData.User_query

In conclusion

I hope this encourages you to take a look under the hood of Director to see what you can get out of it. The database infrastructure is much, much simpler than EdgeSight and should provide a lot of good detail.

Thanks,
Alain

New ExtraHop Citrix Bundle Available

This is a quick walk-thru of our latest Citrix Bundle.  I would like to thank my colleague Tim Hughes for his work on creating this.

Keep in mind that I am limited to demo data here.  In your environment, we can actually map out all of your applications and provide full visibility into the downstream tiered environments that go beyond your Citrix servers but never fail to be escalated as a Citrix issue when they are slow.

Someone told me once, “If we could just bottle up your experience and give it to the other operations folks”.  At ExtraHop, we can do better than a bottle, we can do a bundle!

If you have a Citrix environment of any size, try it out; a quick POC will show you actionable information within minutes of creating the SPAN.
Thanks!

Calling All Geeks!! (Synergy “Geek-O-Vation” awards)

If you use this site regularly, you are likely one of those special types who builds their own custom solutions when developing a Citrix strategy.  Citrix is actually hosting a contest for those who come up with great innovations around Citrix.  We would LOVE for one of our readers to win it!!!

Please go to http://blogs.citrix.com/2015/04/16/geekovation-award-at-synergy-geek-speak-tonight/ and enter your solution.  I know the readers of this blog are “Tinkerers”; let’s show Citrix how real geeks get it done!!

Good luck!!!!!!!

The ESUTH team!!

No End in Sight: Cyber Security and the Digital Maginot Line

Yesterday my spouse was informed by a laboratory company where she was having some blood work done that she needed to provide them a credit card number that they could put on file in case our insurance company could not pay or did not pay the bill for the lab costs. This after showing our insurance card and providing proof that we are insured. Having lived with me the last 7 years she asked the woman at the counter for a copy of the InfoSec strategy asking them to “please include information on encryption ciphers, key lengths as well as information on how authentication and authorization is managed by their system and if her credit card information would be encrypted at rest”. Needless to say, they had no idea what she was talking about much to the exasperation of the people waiting behind her in line as well as the front office staff. She ended up getting her tests done but was told she would not be welcomed back if she was going to continue to be unwilling to surrender her Credit Card number to their front office for them to, digitally, keep on file.

Between the two of us, we have replaced 4 or 5 cards in the last 3 years due to various breaches; I have had to replace two and, I believe, she has had to replace 3 of them. In my case, each incident cost me around $800 that I had to wait weeks to get back, and only after I went into the bank and filled out forms to attest that I did not make the charges. Each incident was about 4 hours of my time by the time all was said and done. Yes, there were lawsuits and lawyers were paid six-figure sums as a result, and I am sure they deserved it, but at the end of the day I was without my $800-$1600 for an extended period of time and I had to run through a regulatory maze just to get back what I had lost. No, I never got any settlement money; I hope they spent it well. Fortunately for me, I am 46 years old now and have a great job. If this had happened to 26-year-old (still a screw-up) John, it would have been utterly devastating, as I likely would have been evicted from my apartment and had bill collectors calling me. I can’t imagine the calamity this creates for some folks.

I am somewhat dumbfounded that any company at any level would seek to get people to surrender their information digitally given the egregious levels of retail breaches that have plagued the industry the last few years. Forget that consumer advocacy is non-existent; while some retailers have been very forward in understanding the impact to their consumers, I simply do not see things getting better, EVER. The current method by which Cyber Security is practiced today is broken, and there seems to be no motivation to fix it. In spite of extremely costly settlements and damage to brands, the way we practice security today is deeply flawed, and it’s not the security team’s fault. Until system owners start taking some responsibility for their own security, these breaches will simply never end.

Bitching about the lack of responsibility of system owners isn’t new to me; my first “documented” rant on it was back in early 2010. As a system owner, I, almost compulsively, logged everything that went on and wrote the metrics to a centralized console. In a way, it was a bit of a poor man’s DevOps endeavor. In doing so, I was able to automate reporting so that when I came into work each morning, I would spend 15 minutes sipping my coffee and looking at all of the non-standard communications that went on the previous day (basically all internet traffic that did not use a web browser and all traffic outside the US). No, it wasn’t a full IDS/IPS deployment, but on two separate occasions I was able to find malware before several seven-figure investments in malware detection software did. That is two instances in four years, or roughly 2 out of 1,000 mornings (approximately 4 years’ worth of work minus vacations, holidays, etc.), where I noted actionable intelligence. That may not sound like a lot, but if you are one of the dozens of retailers who have had breaches in the last few years, I think it is plausible to assume the systems teams could have had an impact on the success of a breach had they been a little more involved in their own security. Don’t underestimate the value of human observation.

Why INFOSEC alone is not enough
Short of a crystal ball, I am not sure how we expect INFOSEC teams to be able to know what communication is acceptable and what communication is not. In the last few years, “sophisticated persistent advanced super-duper complex malware” has generally meant that someone compromised a set of credentials and ran amok for months on end stealing the digital crown jewels. Even if a policeman is parked outside my house, if they see someone walk up, open the door with a key, and walk out with my safe, my 60-inch TV (actually, I don’t have a 60-inch TV), and other valuables, how the hell are they supposed to know whether that person should or should not be doing that? In most cases, this is the digital equivalent of what is happening in some of these breaches, except that digitally, I am sitting on my couch while all of this is going on in front of me. If an attacker has gotten credentials or has compromised a system and is stealing, expecting the security team to see this before extensive damage is done is unrealistic. With some of the social engineering techniques that exist and some of the service accounts used with elevated privileges, you don’t always have the 150 login failures to warn you. If I am actually paying attention, I can say, “Hey, what the hell are you doing, put that TV down before I call the cops!” (Or, my step-daughter is a foodie and she has some cast iron skillets that could REALLY leave a lump on someone’s head.)

The presence of an INFOSEC team does not absolve system owners of their own security any more than the presence of a police department in my city means I don’t have to lock my doors or pay attention to who comes and goes from my house.

Police: “911 operator what is your emergency?”

Me: “I’ve been burgled, someone came into my house and stole from me”

Police: “When did this happen? Are they still in your house?”

Me: “It happened six months ago but I don’t know if they are still in my house stealing from me or not”

Police: “Ugh!!”

If someone has made a copy of the keys to my house, it is not the police’s fault if they don’t catch them illegally entering my home. In the same manner that the police cannot be everywhere, all the time, your INFOSEC team cannot inspect every digital transaction all the time.

Thought Exercise:
If someone has compromised a set of credentials or, say a server in your REST/SOAP tier and they are running ad hoc queries against your back end database, let’s evaluate how that would look to the system owner vs. the INFOSEC practitioner.

To the INFOSEC Practitioner: They see approved credentials over approved ports. Since they are not the DBA or the Web Systems owner, this likely does not trigger any response, because the INFOSEC resource is not privy to the day-to-day behavior or design.
The DBA: The DBA should notice that the types of queries have changed and fall out of their chair.
Web Properties team: They should have a similar “WTF!?!?” moment as they note the change from what is normally stored procedures, or at least recognizable SQL statements, to custom ad hoc queries of critical data.

In this scenario, one I covered on wiredata.net in May of 2014, it is obvious that the INFOSEC professional is not as well positioned to detect the breach, as he or she does not manage the system on a day-to-day basis; and while several processes involve INFOSEC during the architecture phase, the idea that your INFOSEC team is going to know everything about every application is neither practical nor reasonable. It is imperative that system owners take part in making sure their own systems are secure by engaging in a consistent level of intelligence gathering and surveillance. In my case, it was 15 minutes every morning. Ask yourself: do you know every nonstandard communication that sourced from your server block? Will you find out within an hour, 8 hours, a single day? These are things that are easily accomplished with wire data or even log mongering, but to remain utterly clueless about who your systems are talking to outside of normal communications (DNS, A/D, DB, HTTP) to internal application partners is to perpetuate the existing paradigm of simply waiting for your company to get breached. While we give the INFOSEC team the black eye, they are the least likely group to be able to see an issue, in spite of the fact that they are probably going to be held accountable for it.

There are services from companies like FireEye and BeyondTrust that offer innovative threat analytics and a number of “non-charlatan” solutions to today’s security threats. I’ve struggled to avoid calling Cyber Security an abject failure, but we are reaching the point where the Maginot Line was more successful than today’s Cyber Security efforts. I am not a military expert and won’t pretend to be one, but as I understand it, the Maginot Line, the French defense against German invasion, was built on the strategies of the previous war (breach) and was essentially perimeter-centric, and the enemy simply went around it (sound familiar?). So perimeter-centric was it that, upon being attacked from behind, the defenders were apparently unable to defend themselves because the turrets were never designed to turn all the way around. The thought of what to do once an enemy force got inside was apparently never considered. I find the parallels between today’s Cyber Security efforts and the Maginot Line to be somewhat striking. I am not down on perimeter security, but a more agile solution is needed to augment perimeter measures. One might even argue that there really isn’t a perimeter anymore. The monitoring of peer-to-peer communications by individual system owners is an imperative. While these teams are stretched thin already (don’t EVEN get me started on morale, workload, and the all-around BS that exists in today’s Enterprise IT), what is the cost of not doing it? In every high-profile breach we have noted in the last three years, all of these “sophisticated persistent threats” could have been prevented by a little diligence on the part of the system owners and better integration with the INFOSEC apparatus.

Could cyber insurance policies change things?
Actually, we are starting to see insurance providers force companies to purchase a separate rider for cyber breach insurance. I can honestly say, this may bring about some changes to the level of cyber responsibility shown by different companies. I live in Florida where we are essentially the whipping boys for the home owners insurance industry and I have actually received notification that if I did not put a hand rail on my back porch that they would cancel my policy. (The great irony being that I fell ass over teakettle on that very back porch while moving in). While annoyed, I had a hand rail installed post haste as I did not want to have my policy cancelled since, at the time, we only had one choice for insurance in Florida and it was the smart thing to do.

Now imagine I call that same insurance company with the following claim:
“Hello, yes, uh, I am being sued by the Girl Scouts of America because one of them came to my door to sell me cookies and she fell through my termite eaten front porch and landed on the crushed beer bottles that are strewn about my property cutting herself and then she was mauled by my five semi-feral pit bulls that I just let run around my property feeding them occasionally”.

Sadly, this IS Florida and that IS NOT an entirely unlikely phone call for an adjuster to get. Even more sad is the fact that this analogy likely overstates the level of cyber-responsibility taken by several enterprises when it comes to protecting critical information and preventing a breach. If you are a cyber insurance provider and your customer cannot prove to you that they are monitoring peer-to-peer communications, I would think twice about writing the policy at all.

In the same manner that insurance agents drive around my house, expect auditors to start asking questions about how your enterprise audits peer-to-peer communications. If you cannot readily provide a list of ALL non-standard communications within a few minutes, you have a problem!! These breaches are now into the 7-8 digit dollar amounts, and companies that do not ensure proper diligence do so at their own peril.

Conclusion:
As an IT professional and someone who cares about IT security, I am somewhat baffled at the continued focus on yesterday’s breach. I can tell you what tomorrow’s breach will be: it will involve someone’s production system or servers with critical information on them having a conversation with another system that they shouldn’t. This could mean a compromised web-tier server running ad hoc queries; this could be a new FTP server that is suddenly stood up and sending credit card information to a system in Belarus; this could be a pissed-off employee emailing your leads to his Gmail account. The point is, there ARE technologies and innovations out there that can help provide visibility into non-standard communications. While I would agree that today’s attacks are more complex, in many cases they involve several steps to stage the actual breach itself. With the right platform, vigilant system owners can spot these pieces being put into place before the breach starts, or at least detect the breach within minutes, hours, or days instead of months. Let’s accept the fact that we are going to get breached and build a strategy on quelling it sooner. As a consumer, I look at my credit card’s expiration date and think “Yeah, right!”, basically betting it gets compromised before it expires. I see apathy prevailing, and companies that really don’t understand what a pain in the ass it is when I have to, yet again, get another debit or credit card due to a breach. While they think it is just their breach, companies need to keep in mind that your breach may be the 3rd or 4th time your customer has had to go through this, and it is your brand that will suffer disproportionately as a result. Your consumers are already fed up, and companies need to assume that the margin of error was already eaten up by whichever vendor previously forced your customers through post-breach aftermath. I see system owners continuing to get stretched thin, kept out of the security process, and not taking part in the INFOSEC initiatives at their companies, either due to apathy or workload. And unfortunately, I see no end in sight….

Thanks for reading

John M. Smith

ExtraHop’s Citrix Solution Architecture Bundle Walk-Thru

I recorded a walk-thru of the Citrix Alpha Bundle, now integrated with our latest 4.0 release.  Below is an example of the dashboard features.  Keep in mind, all of this can be done with NO AGENTS installed on your systems and NO WMI walking or interrogation of your systems.  We are completely passive and can provide similarly detailed information for all of your environments (database, SOAP/REST, web, and pretty much anything with an IP address).

In the video below I discuss some of the application containers and how they can be leveraged for troubleshooting (there is some overlap in information).

If you are interested in checking this out, we offer a free discovery edition or reach out to me at johnsmith@wiredata.net and I will put you in contact with your area team.

Thanks

John

 

Nerd on a wire: Packet-Panning for Operational Gold at 20 gb/s

Gathering Operational Intelligence in today’s 10, 40, and soon 100 Gigabit Ethernet networks is a far different task than it was back in the 100 meg or even gigabit days. In this post I want to discuss the different triggers that make up the Citrix Alpha Bundle, why they are important, and just what they are collecting. I like to think of ExtraHop as the pan someone might use to look for gold. When the data current is running at tens of gigabits, triggers can be your best friend, helping you sift through the digital gravel and sand to find those gold nuggets of information that reside within everyone’s wire data watershed.

What are triggers?
While ExtraHop collects a myriad of metrics (over 2000 per device), there are times when you may want to sift through the packets to pull specific nuggets of information out of the data stream. For example, ExtraHop collects metrics on ICA latency and organizes them for you on a per-server basis, allowing you to drill into a VDA instance and observe the latency of the users on that box. That is great, but using triggers we have the ability to get into even finer detail. As in the example from the previous article, we can also write a trigger that parses out the 24-bit network ID and collates the average latency for an entire subnet. In the trigger text you see below, we set up key-value pairs that are written to the datastore. Note that we instantiate the client IP with the “var IP” variable. Then you see that we apply the mask() method to the client IP to get its 24-bit network ID. This mask can be changed to accommodate how you have subnetted your network.
#########################################################################################################################################################
var appname = 'XenApp';
var IP = Flow.client.ipaddr;

//Begin grabbing ICA session info on each ICA_TICK
if (event == "ICA_TICK") {
    //Grab ICA channel info by username and write it to the XenApp application container
    for (var i = 0; i < ICA.channels.length; i++) {
        log(ICA.channels[i].description);
        Application(appname).metricAddCount("ica_channel_user_cnt_" + ICA.channels[i].description, ICA.channels[i].serverBytes, 1);
        Application(appname).metricAddDetailCount("ica_channel_user_cnt_detail", "User: " + ICA.user + " Server: " + ICA.host + " Channel: " + ICA.channels[i].description, ICA.channels[i].serverBytes);
    }

    //Latency metrics: overall dataset plus breakouts by /24 subnet, user, and client IP
    Application(appname).metricAddDataset("ica_lat_by_subnet", ICA.networkLatency);
    Application(appname).metricAddDetailSampleset("ica_lat_by_subnet_detail", IP.mask(24), ICA.networkLatency);
    Application(appname).metricAddDetailSampleset("ica_lat_by_user_detail", ICA.user, ICA.networkLatency);
    Application(appname).metricAddDetailSampleset("ica_lat_by_clientIP_detail", Flow.client.ipaddr, ICA.networkLatency);
}

############################################################################################################################################################

I have now used ExtraHop triggers to take my wire data mining to a whole new level. Below I want to talk about all of the triggers that I have included in the Alpha Bundle and explain what they are doing.

Citrix Infrastructure Metrics: (Assigned to all VDA’s and ICA Listeners)
This is a more holistic trigger that fires on the ICA_OPEN and ICA_TICK events. In this trigger I am panning for ICA launch metrics via the ICA_OPEN event. I am also gathering ICA channel footprint information from the ICA_TICK event. The ICA_TICK event also carries the latency metrics that I use to report user latency.

CTX Request Ticket Cancelled: (Assigned to DDC, XML Brokers or VDAs)
This trigger came about as I was troubleshooting 1030 errors with Wireshark. As some of you know, the 1030 error (also called a “protocol driver error”) can be somewhat maddening in larger environments. Basically, you get a call saying users are getting kicked out when they try to launch Citrix. What I noticed in the XML is that a heading called “RequestTokenCancelled” showed up every time I got a 1030 error. So whenever I see that text in the XML between the DDC/XML Broker and the clients, I log the Citrix server IP address so that Citrix teams can quickly narrow down problem servers. This is also an example of how ExtraHop is consistent with the “panning for gold” narrative: we can tap into things like the XML communications between hosts and pull out very useful information. This is what separates us from our competitors, who provide very narrow, rigid, non-editable metrics.
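For illustration, the pattern looks something like the sketch below: watch the XML responses from the broker and, whenever the cancellation text appears in the body, record which server returned it. The HTTP_RESPONSE event and the HTTP.payload body access are assumptions about how your firmware exposes the XML to triggers (payload capture may need to be enabled on the trigger), so treat this as a sketch of the idea rather than the exact bundle trigger.

// Sketch only: flag XML responses containing the marker that correlates with 1030 errors.
// Assumes the response body is available to the trigger as HTTP.payload.
var appname = 'XenApp';

if (event == "HTTP_RESPONSE") {
    var body = HTTP.payload ? HTTP.payload.toString() : null;   // assumed payload access
    if (body && body.indexOf("RequestTokenCancelled") !== -1) {
        // Key the count by the server that returned the cancellation so problem
        // servers can be narrowed down quickly.
        Application(appname).metricAddDetailCount("ctx_request_ticket_cancelled",
            "Server: " + Flow.server.ipaddr, 1);
    }
}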

DDC Registrations: (For XenDesktop 5.x-7.x)

This was set up specifically to monitor the number of servers that have “phoned home” to their DDC. You can assign it to your Citrix server device group or the DDCs. The way I found this was that I noticed that every five minutes the VDAs would connect to the DDC over a URI that had iRegistrar as part of the uri-stem. I simply increment counters and map them to a specific DDC. This can be used to tell you whether your DDCs are load balancing or whether you have had a rash of systems that did not come up from the night before.
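As a rough illustration of that idea (not the exact bundle trigger), a trigger along these lines counts each registration call and keys it by the DDC that received it. It assumes HTTP_REQUEST fires for the broker traffic and that HTTP.uri carries the iRegistrar stem:

// Sketch: count VDA registration calls per DDC.
var appname = 'XenApp';

if (event == "HTTP_REQUEST") {
    if (HTTP.uri && HTTP.uri.indexOf("iRegistrar") !== -1) {
        // One count per registration, broken out by the DDC that answered it,
        // so you can see whether registrations are balanced across your DDCs.
        Application(appname).metricAddCount("ddc_registration_cnt", 1);
        Application(appname).metricAddDetailCount("ddc_registration_cnt_detail",
            "DDC: " + Flow.server.ipaddr, 1);
    }
}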

ICA DB Errors: (Assign to VDAs)
This is actually not how it sounds. While I have limited enterprise applications in the demo data, the goal here is to log the DB errors that come from your ICA servers and VDI desktops so that service desk staff as well as engineers can troubleshoot the actual applications. An example: if a user called saying that a database application was having issues, the person supporting them could look at the ICA DB Errors and see whether that user or their VDI instance saw any database issues. The idea here is that the call might actually get escalated to someone other than the Citrix team for a change. Depending on your environment, we would customize this trigger to accommodate your web/HTTP/CIFS-based applications while keeping their metrics tied to just the Citrix environment.

ICA User Debugging (CTXOPS):
This trigger was written around the same idea as ICA DB Errors. We gather some user metrics by firing on the ICA_OPEN, ICA_TICK and ICA_CLOSE events. This gives us data on user sessions that can, if needed, be included in the escalation information.

Profile Server Performance – XenApp/XenDesktop: (Assign to VDAs)

This nearly always has to be customized to accommodate the CIFS path for the users’ profiles. If you are running one of the premium profile solutions, we can retrofit a trigger for AppSense, RES, etc. Basically we want to look for the ISO sitting on the user’s desktop or the 20GB of music they have in their roaming profile.

PVS Trigger: (Assign to VDAs or PVS farm)
These are layer 4 turn-time triggers set up specifically to look for PVS traffic. I have not seen a problem PVS environment yet, so I am not sure what to provide here. What you could use this for is to log in each morning and make sure everybody booted up properly, or, if you are having issues, to check the performance and turn times of the PVS traffic.
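Since this is just layer-4 turn timing, a stripped-down version of the idea looks a lot like the FLOW_TURN trigger shown elsewhere on this blog: record the transaction process time and transfer times per PVS server. A minimal sketch, assuming the trigger is assigned only to the PVS/VDA device group so that only streaming traffic is measured:

// Sketch: layer-4 turn timing for PVS traffic; assign to the PVS device group.
var appname = 'XenApp_PVS';   // illustrative application container name

if (event == "FLOW_TURN") {
    // Turn (process) time per PVS server; healthy turns are typically single-digit ms.
    Application(appname).metricAddDetailSampleset("pvs_tprocess_by_server",
        Flow.server.ipaddr, Turn.tprocess);
    // Request/response transfer times help separate client-side from server-side delay.
    Application(appname).metricAddDetailSampleset("pvs_req_xfer_by_server",
        Flow.server.ipaddr, Turn.reqXfer);
    Application(appname).metricAddDetailSampleset("pvs_rsp_xfer_by_server",
        Flow.server.ipaddr, Turn.rspXfer);
}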

Slow ICA load time debugging: (Assign to VDAs)
In this trigger I am earmarking what counts as a slow launch versus a normal launch. This trigger makes up the metrics on the grid that shows Fast and Slow launches. I am also looking for CIFS errors that could cause issues with Citrix load times.
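The classification itself is just a threshold check at session open. Here is a minimal sketch of that idea, assuming the launch/load time is exposed on the ICA_OPEN event as ICA.loadTime in milliseconds (confirm the property name and units against your firmware’s ICA class) and using the same 30-second cutoff used for the Slow/Fast Launch counters in the Alpha Bundle:

// Sketch only: bucket Citrix launches as slow or fast at session open.
// ICA.loadTime is an assumption; verify against your firmware's ICA class.
var appname = 'XenApp';
var SLOW_LAUNCH_MS = 30000;   // 30-second threshold, tune to your environment

if (event == "ICA_OPEN") {
    var loadTime = ICA.loadTime;   // assumed property for launch/load time (ms)
    if (loadTime !== null) {
        if (loadTime > SLOW_LAUNCH_MS) {
            Application(appname).metricAddCount("ica_slow_launch_cnt", 1);
            Application(appname).metricAddDetailCount("ica_slow_launch_detail",
                "User: " + ICA.user + " Server: " + ICA.host, 1);
        } else {
            Application(appname).metricAddCount("ica_fast_launch_cnt", 1);
        }
    }
}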

XenApp DNS Server Performance: (Assign to VDAs)
Few people have a full appreciation for how badly slow DNS will wreck everything. This trigger keeps track of DNS performance for your Citrix farm. It can be used to troubleshoot slow logons, slow applications, and the all-around general weirdness that ensues when DNS is less than pristine.

XenApp DNS Timeouts: (Assign to VDAs)
Self-explanatory, it is very important that you not have DNS timeouts for legit records such as SRV records or application server records.

XenApp Zero Windows:
Here we are monitoring every time the XenApp/XenDesktop server closes its TCP window (always due to I/O related issues) as well as when its peers close their windows. This can be handy when the proverbial “Citrix is slow” call comes in and you notice that the back-end database or web server is the one closing its TCP window.

XML Broker Performance: (Assign to XML Brokers)
Again, while looking in Wireshark I noticed that whenever applications enumerated, a URI stem containing wnpr.dll would show up. So we key off the performance of this URI and provide metrics on it.
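As a rough sketch of that approach (not the bundle trigger itself): on each response from the broker, check the URI for the wnpr.dll stem and record how long the broker took to answer. HTTP.uri should be available on HTTP events, but the processing-time property shown here (HTTP.processingTime) is an assumption to verify against your firmware’s HTTP class.

// Sketch: XML broker performance keyed off the wnpr.dll URI stem.
var appname = 'XenApp';

if (event == "HTTP_RESPONSE") {
    if (HTTP.uri && HTTP.uri.indexOf("wnpr.dll") !== -1) {
        Application(appname).metricAddCount("xml_broker_request_cnt", 1);
        // Assumed property: server processing (think) time for the response.
        if (HTTP.processingTime !== null) {
            Application(appname).metricAddDetailSampleset("xml_broker_proc_time_by_server",
                Flow.server.ipaddr, HTTP.processingTime);
        }
    }
}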

I felt the need to include this post to explain how we are gathering the data and what these triggers do and why they are important. I also want to assure readers that, at worst, they may have to edit these triggers, similar to what 80% of you have already had to do with PowerShell scripts you have downloaded. We also offer the interface up to you and have a support forum where you can get help when you want to write your own triggers. Unlike some of the more closed architectures that leave you minimal ability to customize the dashboard, we leave you the option of your own canvas where you can paint your own operational masterpiece.

I will be recording a video that includes a walk thru. There are a few other triggers in the bundle that are not mentioned here; they are not quite ready for the public, but send me an email if you would like to try them and we will work through it.

Thank you for reading

John M. Smith, CTP

Nerd on a wire: The ExtraHop Citrix “Alpha” Bundle

So I have been working on a new Citrix bundle for ExtraHop customers and potential customers for a few weeks now. ExtraHop complements and expands on HDX Insight by furthering your visibility into your Citrix infrastructure: you get not only ICA metrics on such things as ICA channel impact, latency, and launch times, but also the ability to break the information out by subnet or campus (friendly name), plus custom dashboards that empower lower-paid staff as well as keep Director/VP types informed. In this post I want to discuss the Citrix Alpha Bundle and go over what it has so far and what can be included in it.

How ExtraHop works:
For those who do not know, ExtraHop is a completely agentless wire data analytics platform that provides L7 operational intelligence by passively observing L7 conversations. We do not poll SNMP or WMI libraries; ExtraHop works from a SPAN port, observes traffic, and provides information based on what is observed at up to 20Gbps. In the case of Citrix, we do considerably more than observe the ICA channel, and I will show you in this post just how much incredibly valuable information is available on the wire that is relevant to Citrix engineers and architects.

For more on how ExtraHop works with Citrix Environments please visit:
Learn more about ExtraHop’s solution for Citrix VDI monitoring.

How are we gathering custom Metrics?
The way we gather these metrics is to pull them from the SPAN through a mechanism called triggers. ExtraHop allows us to logically create device groups that collate your VDAs (ICA listeners for either XenDesktop or XenApp) and to apply triggers specifically to those device groups. Device groups can be created using a variety of methods including IP address, naming convention, type (ICA, DB, HTTP, FTP, etc.), or static assignment.

Now, to earmark specific metrics to JUST your Citrix environment you apply the necessary triggers to the device group.
(All triggers are pre-written and you won’t have to write anything; worst case, you may have to edit some of them.)

Results:
The triggers we have written create what is called an “App Container”. Below is a screen shot of the types of metrics that we are gathering in our Triggers. While it does not even remotely cover all that we can do I will explain a few of the metrics for you.

Citrix Infrastructure Page

Infrastructure Metrics: (all metrics drill down)

  • Slow Citrix Launches: The trigger classifies any Citrix launch time in excess of 30 seconds as slow and increments the Slow Launch counter. This threshold is something you can change within the trigger itself.
  • CIFS Errors: We log the CIFS errors so that you can see things like “Access Denied” or “Path Not Found”. Anyone who has had a DFS share removed and endured the ensuing 90-second logon while Windows pines for the missing drive letter knows what I am talking about.
  • Fast Citrix Launches: A better name would probably be “Normal Citrix Launches”; for the existing trigger set, this is anything under 30 seconds. The threshold can be customized.
  • DDC1/DDC2 Registrations: The sample data did not have any XenDesktop but this was a custom XML trigger written to count the number of times a VDA registers with a DDC so that you can see the distribution of your VDA’s across your XenDesktop Infrastructure.
  • DNS Response Errors: Self-explanatory; be aware if you have not looked at your DNS before. Because DNS failures happen on the wire, this is a huge blind spot for agent-based solutions. I was shocked the first time I actually saw my DNS failure rate.
  • DNS Timeouts: Even more damning than response errors, these are flat out failures. These, like response errors, can indicate Active Directory issues due to an overworked DC/GC or misconfigured Sites and Services
  • Citrix I/O Stalls: While we do not measure CPU/Disk/Memory, we can see I/O related issues via the zero windows. When a system is experiencing I/O binding, it will close the TCP window. When this happens we will see it and it is an indication of an I/O related issue.
  • Server I/O Stalls: This is basically the opposite of a Citrix I/O stall. If a back-end database server is acting up and someone on the phone says “Citrix” (we all know the Citrix “get out of jail free” card), the call will be sent to the Citrix team. This metric gives the Citrix team the ability to see that the back-end server is having I/O related issues and not waste their time doing someone else’s troubleshooting, which, in my 16 years of supporting Citrix, was about 70% of the time.

Launch Metrics: (Chart includes drill down)
When you click on the chart under Launch Metrics you will be given a breakdown of the launch/login metrics listed below.

  • Launch by Subnet: We collect this to see if there is an issue with a specific subnet
  • Launch by App: As some applications may have wrappers that launch or have parameters that make external connections we provide launch times by applications.
  • Launch by Server: We provide this metric so that you can easily see if login issues are specific to a particular server.
  • Launch by User: This will let you validate a specific user having issues or you can note things like “hey, all these users belong to the accounting OU” maybe there is an issue with a drive letter or login script.
  • Login by User: This is the login time, i.e., how fast A/D logged the user in.

Login vs. Launch:
At a previous employer, what we noted was that a really long load time accompanied by a really long login time meant we needed to look at the A/D infrastructure (Sites and Services, Domain Controllers), while a long load time accompanied by a short login time indicated issues such as long profile loads. The idea is that the login time should be around 90% of the load time, meaning that post-login, not much goes on.
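To make that rule of thumb concrete, here is a small illustrative helper (plain JavaScript, not part of the bundle; the 30-second and 90% figures simply mirror the thresholds discussed in this post):

// Illustrative only: apply the login-vs-launch heuristic. Times in milliseconds.
function classifyLaunch(launchMs, loginMs) {
    var SLOW_MS = 30000;
    if (launchMs > SLOW_MS && loginMs > SLOW_MS) {
        return "Slow launch and slow login: look at A/D (Sites and Services, DCs)";
    }
    if (launchMs > SLOW_MS) {
        return "Slow launch but quick login: look at profile loads, drive mappings, scripts";
    }
    if (loginMs < 0.9 * launchMs) {
        return "Login is well under 90% of launch time: check what runs post-login";
    }
    return "Launch looks normal";
}

// Example: a 45-second launch with a 12-second login points at the profile, not A/D.
console.log(classifyLaunch(45000, 12000));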

XML Broker Performance:
We are one of the few platforms that can provide visibility into the performance of the XML broker. While not part of the ICA Channel it is an important part of your overall Citrix infrastructure. Slow XML brokering can cause slow launches, the lack of applications being painted, etc. We can also provide reporting on STA errors that we see as we have visibility into XML traffic between the Netscaler/Web Interface and the Secure Ticket Authority (STA).

DNS Performance:
If you have applications that are not run directly on your Citrix servers and you are not using host files, DNS performance is extremely important. The drill down into the DNS Performance chart will provide performance by client and by server. If you see a specific DNS Server that is having issues you may be able to escalate it to the A/D and DNS teams.

Citrix Latency Metrics Page

  • Latency by Subnet: Many network engineers will geo-spatially organize their campus/enterprise/WAN by network ID. One of the Operational Intelligence benefits of ExtraHop is that we can use triggers to logically organize performance by subnet, allowing a Citrix team to break out ICA performance by network ID. If given a list of friendly names, we can actually provide a mapping of location-to-NETID (example: 192.168.252.0 is the 3rd floor of your main campus). This can be very useful in quickly identifying problem areas, especially for larger Citrix environments.
  • High Latency Users: For this metric, any user who crosses the 300ms threshold is placed into the High Latency area (a sketch of how this bucket is populated follows this list). The idea is that this chart should be sparse, but if you see a large amount of data here you may need to investigate further. Also, double-check the users you note here against the chart below that includes the overall user latency. You may have an instance where a user’s latency was high because they wandered too far from an access point or because of overall network issues, but find when you look at the Latency by UserID chart that their overall latency was acceptable.
  • Latency by UserID: This metric is to measure the latency by user ID regardless of how good/bad it was.
  • Latency by Client IP: This is the latency by individual client IP. I think that I may change this to include the latency by VDA (XenDesktop or XenApp). This can be valuable to know if a specific set of VDA listeners are having issues.
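As a sketch of how the high-latency bucket above can be populated, something along these lines runs on each ICA_TICK; the metric names are illustrative and the 300 ms threshold matches the one described in the High Latency Users bullet:

// Sketch: flag users whose ICA network latency crosses the 300 ms threshold.
var appname = 'XenApp';
var HIGH_LATENCY_MS = 300;

if (event == "ICA_TICK" && ICA.networkLatency !== null) {
    // Overall latency per user, regardless of threshold, for comparison.
    Application(appname).metricAddDetailSampleset("ica_lat_by_user_detail",
        ICA.user, ICA.networkLatency);
    if (ICA.networkLatency > HIGH_LATENCY_MS) {
        // Keyed by user so the "High Latency Users" chart shows who is suffering.
        Application(appname).metricAddDetailCount("ica_high_latency_user_cnt",
            "User: " + ICA.user, 1);
    }
}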

Below is the drill-down for the Latency by Subnet chart. This will allow you to see if you have an issue with a specific subnet within your organization. Example: you get a rash of calls about type-ahead delays, and the helpdesk/first responder does not put together that they are all from the same topological area. The information below will allow the Citrix engineer to quickly diagnose whether the problem is a faulty switch in an MDF or an issue with a LEC over an MPLS cloud. Below we have set the netmask to /24, but that can be changed to accommodate however you have subnetted your environment.

Citrix PVS Performance Page
I haven’t had a great deal of PVS experience outside of what I have set up in my lab. In the last few years I had sort of morphed into a DevOps/Netscaler/INFOSEC role with the groups I was in. That said, because we are on the wire we are able to see the turn timing for your PVS traffic. I won’t go into the same detail here as with the previous two pages but what you are looking at is a heat map of your PVS turn times. In general I have noticed, other than when things are not working well, that the turn timing should be in the single digits. I will practice breaking my PVS environment to see what else I can look at. I have tested this with a few customers but their PVS environments were working fine and no matter how many times I ask them to break them they just don’t seem compelled to. I have also included Client/Server request transfer time as well as size to allow the Citrix team to check for anomalies.

Detecting Blue Screening Images:
One thing I came across while being on a team that used PVS was that occasionally something would go wrong and systems would blue screen. Most reboots happen overnight, so it can be somewhat difficult to get into work every day and know right away which servers did not come up the previous night without some sort of manual task. Below is the use of a DHCP trigger that counts the number of requests. In the Sesame Street spirit, when you look below you can sort of see that “one of these kids is doin’ his own thing….”. Note that most of the PVS-driven systems have a couple of DHCP requests while 192.168.1.156 has 30. Why? Because I created a MAC address record on the PVS server and PXE booted a XenServer image from my VMware Workstation, which produced a blue screen.

In those environments with hundreds or even thousands of servers, the ability to see blue screening systems (or systems that are perpetually rebooting) can be very valuable. The information below is from our new Universal Payload Analysis event that the TME team wrote for us to gather DHCP statistics.
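If your firmware exposes a DHCP_REQUEST trigger event (an assumption; the bundle described here gathered the same statistic with a Universal Payload Analysis event), the counting itself is only a few lines:

// Sketch: count DHCP requests per client. A healthy PVS target makes a handful of
// requests; a target stuck in a blue-screen/reboot loop stands out with dozens.
var appname = 'XenApp_PVS';   // illustrative application container name

if (event == "DHCP_REQUEST") {   // assumed event name; see note above
    Application(appname).metricAddDetailCount("dhcp_request_cnt_by_client",
        "Client: " + Flow.client.ipaddr, 1);
}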

Other things we can add
My lab is pretty small and I don’t have apps like PeopleSoft, Oracle Financials, or basic client/server applications, but ExtraHop has the ability to map out your HTTP/database/tiered applications for you and make sure that you see the performance of all of your enterprise applications as they pertain to Citrix. By adding the HTTP request/response events we can see ALL URIs and their performance as well as any 500-series errors. We can see slow stored procedures for database calls made from the Citrix servers. You can also classify SOAP/REST-based calls by placing those applications in their own App Container, positioning your team to report on the performance of downstream applications that can be a sore spot for Citrix teams when they are held accountable for the performance because the front end was published on Citrix.

Empowering First Responders
When you have a small lab of 3-4 VDA’s and a limited amount of demo data it is a little tough to get too detailed here but I wanted to show ways that we can empower some of your first responders. One of the challenges with Citrix support is that it can get very expensive really fast. Normally, calls may go from a helpdesk to a level 2 and if it is still an issue, to a level 3 engineer. With Citrix, calls have a habit of going from the helpdesk directly to the Citrix engineer and this can make supporting it very expensive. If we can position first responders to be able to resolve the issue during the initial call it can be a considerable savings to the organization. For this we have created a “Citrix Operations” app container and while it is somewhat limited here, the idea is that we can put specific information in here that could make supporting remote Citrix users much easier.

Below you see a list of metrics; the App Container allows a service desk resource to click on open/closed sessions and get the following information.

(I know the text box says “On the left” but due to how my wordpress theme is set up I just couldn’t fit them side by side.  I mean below here…..sorry…)

From this page, the first responder can see if there are any database errors and, if we want, we can include the HTTP 500 errors as well, all in real time. The engine updates every 30 seconds, so by the time the user calls in, the first responder will be able to go in and see that user’s Citrix experience. This can be custom retrofitted for your specific applications and environment.

So why are you writing this?
I would love to talk to a larger Citrix deployment to find out what the current pain points are for you and find out what additional custom information can be netted with triggers so that we can make that data readily available. Like I said, I have supported Citrix for about 16 years but the last few I was more on the cloud side. If I could get an idea of what blind spots you have in your environment I am positive we can find a way to provide visibility into it with wire data analytics. If you are a PowerShell guru, you will love the JavaScript trigger interface that we have and you should have no trouble editing the triggers to suit your own environment or even writing your own triggers. Wire data analytics provides relevant data to more than just the Citrix team. At one of my previous employers we had the DBAs, Citrix/Server team and INFOSEC all leveraging Wire Data from ExtraHop.

So can I haz it?
In fact, yes, you can have it for free if you like. We offer a Discovery Edition with ICA enabled; it will only keep up to 24 hours of data but will do everything that you see in this post. You have a few options: if you don’t want to go down the sales cycle you can download the Discovery Edition (you will get a call from inside sales to pick your brain), or you can get an evaluation of either a physical or virtual appliance, but for that I have to get your area rep involved (you will not regret working with us and we will not make the process painful). Because we gather information passively with no agents, we can observe your environment with zero impact on your servers. If you want to set this up, just shoot me an email at johnsmith@extrahop.com and I will provide you the Citrix Alpha Bundle after you have downloaded the Discovery Edition or requested an evaluation.

The Discovery Edition can be downloaded from the link below; the entire process was pretty painless when I went through it a few weeks ago. After signing up we can get you access to the documentation you will need and a forum account.

http://www.extrahop.com/products/appliances/extrahop-discovery-edition/

Thanks for reading and please let me know if you would like to contribute.

John Smith

Finally a book on Edgesight (Yes…it is a little late but…)

I have been a bit busy, but I was asked to review a book written by a Citrix architect named Vaqar Hasan out of Toronto, Canada.  While EdgeSight has been declared End of Life in 2016, it still provides the most detailed metrics for XenApp environments out there today.  Also, most folks are still on XenApp 6.5 and will remain so for at least two years, so I believe the content is still very relevant to today’s XenApp environments as Citrix shops ease into XenDesktop 7 App Edition.  Also, according to the Citrix website there will be extended support for XenApp 6.5 until 2020, which is a very possible scenario for some folks.

While it does not offer a great deal of ad hoc queries like the EdgesightUnderTheHood.com site does, it offers some very nice details on laying the groundwork for your Edgesight implementation with detailed instructions on setting up alerts, emails, and environmental considerations such as anti-virus exclusions.

While I wish this book would have been available in 2008, there has not been a great deal of literature around Edgesight and he is only asking ten dollars for the Kindle edition.  I think it is important that we support the few IT experts in this industry who take the effort to publish good content and the cost is low enough that you can put it on your corporate p-card.

So if you don’t mind, maybe you can support this guy for helping out.

Thanks!

John


http://www.amazon.com/Instant-EdgeSight-XenApp-Vaqar-Hasan-ebook/dp/B00ESX19VO/ref=sr_1_1?ie=UTF8&qid=1386594285&sr=8-1&keywords=Edgesight

Moving non-Edgesight data over to new Blog at http://wiredata.net

As you may have noticed, I have been writing quite a bit about ExtraHop and Splunk.  Most of you are aware that EdgeSight is NOT dead and lives on, having added real-time monitoring to go with an archival strategy.  I plan to continue writing about it on this site, but I am moving the ExtraHop information over to http://wiredata.net.

I will write some about Netscalers, INFOSEC and HDX Insight as well as Extrahop and Splunk.

Please head over and have a look, I have recorded one video and I plan to add at least ten more.  Follow it @wiredata

Let the finger pointing BEGIN! (..and end) Canary Herding With Extrahop FLOW_TURN

In IT, dependable metrics become our canary in a coal mine. We use them as indicators of issues. Like miners finding a dead canary, we may not know exactly what we have been exposed to or exactly how bad it is, but we know we need to get the hell out of there. In the world of operational intelligence, we can use metrics as indicators of which parts of the proverbial shaft are having issues and need to be adjusted, sealed off, or abandoned altogether. To continue in the same vein as my previous post, I wanted to discuss the benefits of the FLOW_TURN event when you are trying to baseline the performance of specific servers and transactions. When you don’t want to drill into layer 7 data so much as check the layer 4 performance between two hosts, ExtraHop’s FLOW_TURN trigger lets you take the next step in layer 4 flow metrics by looking at the following:

Request Transfer: Time it took for the client to make the request
Response Transfer: Time it took for the server to respond
Request Bytes: Size of the Request
Response Bytes: Size of the Response
Transaction Process Time: The time it took for the transaction to complete. You may have a fast network with acceptable request and response times but you may note serious tprocess times which could indicate the kind of server delay we discussed in some of the Edgesight posts.

In today’s Virtualized environment you may see things like:

  • A Four Port NIC with a 4x1GB port channel plugged into a 133mhz bus
  • 20 or more VMs sharing a 1GB Port Channel
  • Backups and volume mirror going on over the production network.

These are things that may manifest themselves as slowness of either the application or slow response from your Clients or servers. What the FLOW_TURN metric gives you is the ability to see the basic transport speeds of the Client and Server as well as the process time of the transaction. Setting up a trigger to allow you to harvest this data will lay the foundation for quality historical data on the baseline performance of specific servers during specific times of the day. The trigger itself is a few lines of code.

log("ProcTime " + Turn.tprocess);
RemoteSyslog.info(
    " eh_event=FLOW_TURN" +
    " ClientIP=" + Flow.client.ipaddr +
    " ServerIP=" + Flow.server.ipaddr +
    " ServerPort=" + Flow.server.port +
    " ServerName=" + Flow.server.device.dnsNames[0] +
    " TurnReqXfer=" + Turn.reqXfer +
    " TurnRespXfer=" + Turn.rspXfer +
    " tprocess=" + Turn.tprocess
);

Then you assign the trigger to the specific servers that you want to monitor (if you are using the Developer Edition of ExtraHop in a home lab, just assign it to all devices) and you will start collecting metrics. In my case I am using Splunk to collect ExtraHop metrics, as it is the standard for big data archiving and fast queries. Below you see the results of the following query:
sourcetype="Syslog" FLOW_TURN | stats count(_time) as Total_Sessions avg(tprocess) avg(TurnReqXfer) avg(TurnRespXfer) by ClientIP ServerIP ServerPort

This will produce a grid view like the one below:
Note in this grid below you see the client/server and port as well as the total sessions. With that you then see the Transfer metrics for both the Client and Server as well as the process time. The important things to note here:

  • If you have a really long avg(tprocess) time, double check the number of sessions. A single instance of an avg(tprocess) of 30000ms is not as big of a deal as 60,000 instances of an 800ms avg(tprocess). Also keep in mind that Database servers that may be performing data warehousing may have high avg(tprocess) metrics because they are building reports.
  • Note the ClientIP Subnets as you may have an issue with an MDF where clients from a specific floor or across a frame relay connection are experiencing high avg(TurnReqXfer) numbers.

If you want to see the average request transfer time by Subnet use the following Query: (I only have one subnet in my lab so I only had one result)

sourcetype="Syslog" FLOW_TURN | rex field=_raw "ClientIP=(?<subnet>\d+\.\d+\.\d+\.)" | stats avg(TurnReqXfer) by subnet

If you want to track a server’s transaction process time you would use the query below:

sourcetype="Syslog" FLOW_TURN ServerIP="192.168.1.61" | timechart avg(tprocess) span=1m

Note in the graph below you can see the transaction process time for the server 192.168.1.61 throughout the day. This can give you a baseline so that you know when you are out of whack (or when the canary has died).

Conclusion:
I am not trying to say that what we do for a living is as simple as swinging a hammer in a coal mine, but for the longest time this type of wire data has not been readily accessible unless you had a “tools team” working full time on a seven-figure investment in a mega APM product.  This took me less than 15 minutes to set up, and I was able to quickly get a holistic view of the performance of my servers as well as start to build baselines so that I know when the servers are out of the norm. I have had my fill of APM products that need an entourage to deploy or a dozen drill-downs to answer a simple question: is my server out of whack?

In the absence of data, people fill the gaps with whatever they want, and they take creative license to speculate. The systems team will blame the code and the network, the network team will blame the server and the code, and the developers will blame the systems admins and the network team. With this simple canary-herding tool, I can now fill that gap with actual data.

If the Client or Server transfer times are slow we can ask the Network team to look into it, if the tprocess time is slow it could be a SQL table indexing issue or a server resource issue. If nothing else, you have initial metrics to start with and a way to monitor if they go over a certain threshold. When integrated with a big-data platform like Splunk, you have long term baseline data to reference.

A lot of the time there is no question that the canary has died; it’s just a matter of figuring out which canary died.

ExtraHop now has a Discovery Edition that you can download and test for free (including the FLOW_TICK and FLOW_TURN triggers).

http://www.extrahop.com/discovery/

Thanks for reading!!!

John M. Smith

