All posts by John M. Smith

Moving non-Edgesight data over to new Blog at http://wiredata.net

As you have noticed, I have been writing quite a bit about Extrahop/Splunk.  Most of you are aware that Edgesight is NOT dead and lives on having added real-time monitoring to go with an archival strategy.  I plan to continue writing about it on this site but I am moving the Extrahop information over to http://wiredata.net.

I will write some about Netscalers, INFOSEC and HDX Insight as well as Extrahop and Splunk.

Please head over and have a look. I have recorded one video and I plan to add at least ten more. Follow @wiredata.

Let the finger pointing BEGIN! (..and end) Canary Herding With Extrahop FLOW_TURN

In IT, dependable metrics become our canary in a coal mine. We use them as indicators of issues. Like miners with a dead canary, we don't know exactly how much we have been exposed or exactly how bad it is, but we know we need to get the hell out of there. In the world of Operational Intelligence, we can use metrics as indicators of which parts of the proverbial shaft are having issues and need to be adjusted, sealed off or abandoned altogether. To continue in the same vein as my previous post, I want to discuss the benefits of the FLOW_TURN trigger when you are trying to get a performance baseline for specific servers and transactions. When you don't want to drill into layer 7 data so much as check the layer 4 performance between two hosts, Extrahop's FLOW_TURN trigger lets you take the next step in layer 4 flow metrics by looking at the following:

Request Transfer: Time it took for the client to make the request
Response Transfer: Time it took for the server to respond
Request Bytes: Size of the Request
Response Bytes: Size of the Response
Transaction Process Time: The time it took for the transaction to complete. You may have a fast network with acceptable request and response times but you may note serious tprocess times which could indicate the kind of server delay we discussed in some of the Edgesight posts.

In today’s Virtualized environment you may see things like:

  • A four-port NIC with a 4x1Gb port channel plugged into a 133 MHz bus
  • 20 or more VMs sharing a 1Gb port channel
  • Backups and volume mirror going on over the production network.

These are things that may manifest themselves as a slow application or as slow responses from your clients or servers. What the FLOW_TURN metric gives you is the ability to see the basic transport speeds of the client and server as well as the process time of the transaction. Setting up a trigger to harvest this data lays the foundation for quality historical data on the baseline performance of specific servers during specific times of the day. The trigger itself is a few lines of code.

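// Write the process time to the trigger's runtime log for quick debugging.
log("ProcTime " + Turn.tprocess);
// Send one key=value Syslog message per turn so Splunk can extract the fields.
RemoteSyslog.info(
    " eh_event=FLOW_TURN" +
    " ClientIP=" + Flow.client.ipaddr +
    " ServerIP=" + Flow.server.ipaddr +
    " ServerPort=" + Flow.server.port +
    " ServerName=" + Flow.server.device.dnsNames[0] +
    " TurnReqXfer=" + Turn.reqXfer +
    " TurnRespXfer=" + Turn.rspXfer +
    " tprocess=" + Turn.tprocess
);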

Next, assign the trigger to the specific servers that you want to monitor (if you are using the Developer Edition of Extrahop in a home lab, just assign it to all) and you will start collecting metrics. In my case I am using Splunk to collect Extrahop metrics, as it is the standard for big data archiving and fast queries. Below you see the results of the following query:
sourcetype="Syslog" FLOW_TURN | stats count(_time) as Total_Sessions avg(tprocess) avg(TurnReqXfer) avg(TurnRespXfer) by ClientIP ServerIP ServerPort

This will produce a grid view like the one below:
Note in this grid below you see the client/server and port as well as the total sessions. With that you then see the Transfer metrics for both the Client and Server as well as the process time. The important things to note here:

  • If you have a really long avg(tprocess) time, double check the number of sessions. A single instance of an avg(tprocess) of 30000ms is not as big of a deal as 60,000 instances of an 800ms avg(tprocess). Also keep in mind that Database servers that may be performing data warehousing may have high avg(tprocess) metrics because they are building reports.
  • Note the ClientIP Subnets as you may have an issue with an MDF where clients from a specific floor or across a frame relay connection are experiencing high avg(TurnReqXfer) numbers.
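
The point about session counts can be made concrete with a little arithmetic. The sketch below is plain JavaScript (not Extrahop trigger code), and the `rows` array is made-up data that reuses the two figures from the first bullet. It multiplies the average process time by the number of sessions to estimate the total time users spent waiting on each server.

```javascript
// Hypothetical rows, shaped like the output of the stats query above.
const rows = [
  { server: "report-db", sessions: 1, avgTprocessMs: 30000 },
  { server: "app-db", sessions: 60000, avgTprocessMs: 800 },
];

// Total wait = number of sessions x average process time.
function totalWaitMs(row) {
  return row.sessions * row.avgTprocessMs;
}

// Sort so the server costing users the most time comes first.
const ranked = rows
  .map((row) => ({ ...row, totalWaitMs: totalWaitMs(row) }))
  .sort((a, b) => b.totalWaitMs - a.totalWaitMs);

for (const row of ranked) {
  console.log(row.server, row.totalWaitMs / 1000, "seconds in total");
}
```

The single 30-second transaction adds up to 30 seconds of waiting, while the 800ms average across 60,000 sessions adds up to 48,000 seconds, which is why the session count deserves a second look.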

If you want to see the average request transfer time by Subnet use the following Query: (I only have one subnet in my lab so I only had one result)

sourcetype="Syslog" FLOW_TURN | rex field=_raw "ClientIP=(?<subnet>\d+\.\d+\.\d+\.)" | stats avg(TurnReqXfer) by subnet
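
To see what that rex expression captures, here is a minimal sketch that applies the same pattern in plain JavaScript. The sample line is invented, and `subnetOf` is a hypothetical helper name rather than anything from Splunk or Extrahop.

```javascript
// Sample payload in the key=value layout produced by the FLOW_TURN trigger.
// The addresses and numbers are invented for illustration.
const line =
  " eh_event=FLOW_TURN ClientIP=192.168.1.20 ServerIP=192.168.1.61" +
  " ServerPort=1433 TurnReqXfer=12 TurnRespXfer=30 tprocess=250";

// Same pattern as the rex command: the first three octets, trailing dot included.
const subnetPattern = /ClientIP=(?<subnet>\d+\.\d+\.\d+\.)/;

// Return the captured subnet prefix, or null when the event has no ClientIP.
function subnetOf(rawEvent) {
  const match = subnetPattern.exec(rawEvent);
  return match ? match.groups.subnet : null;
}

console.log(subnetOf(line));
```

Every client in the same /24 yields the same captured prefix, which is what lets the stats command group by subnet.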

If you want to track a server's transaction process time, use the query below:

sourcetype="Syslog" FLOW_TURN ServerIP="192.168.1.61" | timechart avg(tprocess) span=1m

Note in the graph below you can see the transaction process time for the server 192.168.1.61 throughout the day. This can give you a baseline so that you know when you are out of whack (or when the canary has died).
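
One simple way to turn a baseline into an "out of whack" test is to compare a new reading against the mean and standard deviation of recent history. The sketch below is plain JavaScript with invented numbers. The three-standard-deviation cut-off is my own assumption for illustration, not something Extrahop or Splunk prescribes.

```javascript
// Hypothetical per-minute averages of tprocess (ms), e.g. exported from the timechart.
const baseline = [110, 120, 115, 125, 118, 122, 117, 121];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation of the baseline window.
function stdDev(values) {
  const m = mean(values);
  const variance = mean(values.map((v) => (v - m) ** 2));
  return Math.sqrt(variance);
}

// Flag a reading more than `k` standard deviations above the baseline mean.
// k = 3 is an arbitrary starting point; tune it against your own data.
function isOutOfNorm(value, history, k = 3) {
  return value > mean(history) + k * stdDev(history);
}

console.log(isOutOfNorm(124, baseline)); // close to the usual range
console.log(isOutOfNorm(900, baseline)); // far above the usual range
```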

Conclusion:
I am not trying to say that what we do for a living is as simple as swinging a hammer in a coal mine, but for the longest time this type of wire data has not been readily accessible unless you had a "Tools team" working full time on a seven-figure investment in a mega APM product. This took me less than 15 minutes to set up, and I was able to quickly get a holistic view of the performance of my servers as well as start to build baselines so that I know when the servers are out of the norm. I have had my fill of APM products that need an entourage to deploy or a dozen drill-downs to answer a simple question: is my server out of whack?

In the absence of data, people fill those gaps with whatever they want and they will take creative license to speculate. The systems team will blame the code and the network, the network team will blame the server and the code, and the developers will blame the systems admins and the network team. With this simple canary herding tool, I can now fill that gap with actual data.

If the client or server transfer times are slow, we can ask the network team to look into it; if the tprocess time is slow, it could be a SQL table indexing issue or a server resource issue. If nothing else, you have initial metrics to start with and a way to monitor whether they go over a certain threshold. When integrated with a big-data platform like Splunk, you have long-term baseline data to reference.

A lot of the time there is no question the canary has died; it's just a matter of getting down to which canary died.

Extrahop now has a Discovery Edition that you can download and test for free (including FLOW_TICK and FLOW_TURN triggers).

http://www.extrahop.com/discovery/

Thanks for reading!!!

John M. Smith


Go with the Flow! Extrahop’s FLOW_TICK feature

I was test driving the new 3.10 firmware of Extrahop and I noticed a new feature that I had not seen before (it may have been there in 3.9 and I just missed it). There is a new trigger called FLOW_TICK, that basically monitors connectivity between two devices at layer 4 allowing you to see the response times between two devices regardless of L7 Protocol. This can be very valuable if you just want to see if there is a network related issue in the communication between two nodes. Say, you have an HL7 interface or a SQL Server that an application connects to. You are now able to capture flows between those two devices or even look at the Round Trip time of tiered applications from the client, to the web farm to the back end database. When you integrate it with Splunk you get an excellent table or chart of the conversation between the nodes.

The Trigger:
The first step is to set up a trigger and select the "FLOW_TICK" event.

Then click on the Editor and enter in the following Text: (You can copy/Paste the text and it should appear as the graphic below)

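// Write the round trip time to the trigger's runtime log for quick debugging.
log("RTT " + Flow.roundTripTime);
// Send one key=value Syslog message per tick so Splunk can extract the fields.
RemoteSyslog.info(
    " eh_event=FLOW_TICK" +
    " ClientIP=" + Flow.client.ipaddr +
    " ServerIP=" + Flow.server.ipaddr +
    " ServerPort=" + Flow.server.port +
    " ServerName=" + Flow.server.device.dnsNames[0] +
    " RTT=" + Flow.roundTripTime
);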

Integration with Splunk:
So if you have your integration with Splunk set up, you can start consulting your Splunk interface to see the performance of your layer 4 conversations using the following search:
sourcetype="Syslog" FLOW_TICK | stats count(_time) as TotalSessions avg(RTT) by ClientIP ServerIP ServerPort

This should give you a table that looks like this: (Note you have the Client/Server the Port and the total number of sessions as well as the Round Trip Time)
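
For readers who want to see what that stats command is doing, here is a rough JavaScript equivalent run against a few invented FLOW_TICK payloads. `parseKeyValues` and `summarize` are hypothetical helper names. Splunk does this key=value extraction and grouping for you, so this is only a sketch of the idea.

```javascript
// Invented FLOW_TICK payloads in the layout the trigger above emits.
const events = [
  " eh_event=FLOW_TICK ClientIP=10.0.0.5 ServerIP=10.0.1.9 ServerPort=1433 ServerName=sql01 RTT=4",
  " eh_event=FLOW_TICK ClientIP=10.0.0.5 ServerIP=10.0.1.9 ServerPort=1433 ServerName=sql01 RTT=8",
  " eh_event=FLOW_TICK ClientIP=10.0.0.7 ServerIP=10.0.1.3 ServerPort=80 ServerName=web01 RTT=2",
];

// Split a payload into { key: value } pairs on whitespace and "=".
function parseKeyValues(raw) {
  const fields = {};
  for (const token of raw.trim().split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return fields;
}

// Count sessions and average RTT per ClientIP / ServerIP / ServerPort.
function summarize(rawEvents) {
  const groups = new Map();
  for (const raw of rawEvents) {
    const f = parseKeyValues(raw);
    const key = [f.ClientIP, f.ServerIP, f.ServerPort].join(" ");
    const g = groups.get(key) || { totalSessions: 0, rttSum: 0 };
    g.totalSessions += 1;
    g.rttSum += Number(f.RTT);
    groups.set(key, g);
  }
  return [...groups].map(([key, g]) => ({
    key,
    totalSessions: g.totalSessions,
    avgRtt: g.rttSum / g.totalSessions,
  }));
}

console.table(summarize(events));
```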

If you want to narrow your search down you can simply put a filter into the first part of your Splunk Query: (Example, if I wanted to just look at SQL Traffic I would type the following Query)
sourcetype="Syslog" FLOW_TICK 1433 | stats count(_time) as TotalSessions avg(RTT) by ClientIP ServerIP ServerPort

By adding the 1433 (or whatever port you want to filter on) you can restrict to just that port. You can also enter in the IP Address you wish to filter on as well.

INFOSEC Advantage:
Perhaps an even better function of the FLOW_TICK event is the ability to monitor egress points within your network. One of my soapbox issues in INFOSEC is that practitioners beat their chests about the incoming packets they block but, until recently, the few intruders that got in could take whatever the hell they wanted and leave unmolested. Even a mall security guard knows that nothing is actually stolen until it leaves the building. If a system is infected with malware, you have the ability, when you integrate Extrahop with Splunk and the Google Maps add-on, to see outgoing connections over odd ports. If you see a client on your server segment (not workstation segment) making 6,000 connections to a server in China over port 8016, maybe, just maybe, that is something you should look into.
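
As a rough illustration of the egress idea, the sketch below filters parsed flow records down to those that leave the private address space on a port you do not expect. It is plain JavaScript with invented records. The allow-list of ports is purely an example, and `isPrivateIPv4` is a hypothetical helper that only checks the RFC 1918 ranges.

```javascript
// True when an IPv4 address falls in the RFC 1918 private ranges
// (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
function isPrivateIPv4(ip) {
  const [a, b] = ip.split(".").map(Number);
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// Hypothetical parsed FLOW_TICK events; the public address is a documentation range.
const flows = [
  { ClientIP: "10.1.1.20", ServerIP: "10.1.2.30", ServerPort: 1433 },
  { ClientIP: "10.1.1.20", ServerIP: "203.0.113.50", ServerPort: 8016 },
];

// Ports we expect to see leaving the network; adjust to your own environment.
const expectedEgressPorts = new Set([53, 80, 123, 443]);

// Keep only flows to a public address on an unexpected port.
const suspicious = flows.filter(
  (f) => !isPrivateIPv4(f.ServerIP) && !expectedEgressPorts.has(f.ServerPort)
);

console.log(suspicious);
```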

When you integrate with the Splunk Google Maps add-on you can use the following search:
sourcetype="Syslog" FLOW_TICK | rex field=_raw "ServerIP=(?<IP>.[^:]+)\sServerPort" | rex field=_raw "ServerIP=(?<NetID>\b\d{1,3}\.\d{1,3}\.\d{1,3})" | geoip IP | stats avg(RTT) by ClientIP IP ServerPort IP_city IP_region_name IP_country_name
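
The two rex expressions in that search can be tried outside Splunk. Below is a minimal JavaScript sketch against an invented payload. The `strict` pattern is my own hypothetical alternative that only accepts four dotted groups of digits, and it is not part of the original search.

```javascript
// Invented FLOW_TICK payload in the layout the trigger emits.
const raw =
  " eh_event=FLOW_TICK ClientIP=10.1.1.20 ServerIP=203.0.113.50" +
  " ServerPort=8016 ServerName=ext01 RTT=180";

// The pattern from the search: everything between "ServerIP=" and " ServerPort".
const original = /ServerIP=(?<IP>.[^:]+)\sServerPort/;
// A stricter, hypothetical alternative: exactly four dotted groups of digits.
const strict = /ServerIP=(?<IP>\d{1,3}(?:\.\d{1,3}){3})\b/;
// The NetID pattern: the first three octets of the server address.
const netId = /ServerIP=(?<NetID>\b\d{1,3}\.\d{1,3}\.\d{1,3})/;

console.log(original.exec(raw).groups.IP);
console.log(strict.exec(raw).groups.IP);
console.log(netId.exec(raw).groups.NetID);
```

On this sample both IP patterns capture the same address, and the NetID pattern captures its first three octets.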

This will yield the following table: (Note that you can see a number of connections leaving the network for China and New Zealand. I made the Chinese connections on purpose for this lab, and the New Zealand connections are NTP connections embedded into XenServer.)

If you suspected you were infected with Malware and you wanted to see which subnets were infected you would use the following Splunk Query:
sourcetype="Syslog" FLOW_TICK %MalwareDestinationAddress% | rex field=_raw "ServerIP=(?<IP>.[^:]+)\sServerPort" | rex field=_raw "ClientIP=(?<NetID>\b\d{1,3}\.\d{1,3}\.\d{1,3})" | geoip IP | stats count(_time) by NetID

Geospatial representation:
Even better, if you want to do some big-time geospatial analysis with Extrahop and Splunk, you can use the Google Maps application by entering the following query into Splunk:
sourcetype="Syslog" FLOW_TICK | rex field=_raw "ServerIP=(?<IP>.[^:]+)\sServerPort" | rex field=_raw "ClientIP=(?<NetID>\b\d{1,3}\.\d{1,3}\.\d{1,3})" | geoip IP | stats avg(RTT) by ClientIP NetID IP ServerPort IP_city IP_region_name IP_country_name | geoip IP

Conclusion:
I apologize for the regex on the ServerIP field; for some reason I wasn't getting consistent results with my data. You should be able to geocode the ServerIP field without any issues. As you can see, FLOW_TICK gives you the ability to monitor the layer 4 communications between any two hosts, and when you integrate it with Splunk you get some outstanding reporting. You could actually look at the average round trip time to a specific SQL Server or web server by subnet. This could quickly allow you to diagnose issues in the MDF or tell whether you have a problem on the actual server. From an INFOSEC standpoint this is fantastic; your INFOSEC team would love to get this kind of data on a daily basis. Previously, I used a custom Edgesight query to deliver a report that I would look over every morning to see if anything looked inconsistent. If you see an IP making a 3389 connection to an IP on FIOS or Comcast, then you know they are RDPing home. More importantly, the idea that an INFOSEC team is going to be able to be responsible for everyone's security is absurd. We, as sys admins and shared services folks, need to take responsibility for our own security. Periodically validating egress is a great way to find out quickly if malware is running amok on your network.

Thanks for reading

John M. Smith

Where wire data meets machine data

So what exactly IS wire data? We have all heard a lot about Machine data but most folks do not know what wire data is or how it can both augment your existing Operational Intelligence endeavor as well as provide better metrics than traditional APM solutions. Extrahop makes the claim that they are an Agentless solution. They are not unique in the claim but I believe they are pretty unique in the technology. It comes down to a case of trolling and polling. Example: A scripted SNMP process is “polling” a server to see if there are any retransmissions. Conversely Extrahop is Trolling for data as a passive network monitor and sees the retransmissions as they are occurring on the wire. Polling is great as long as the condition you are worried about is happening at the time you do the polling. It is similar to saying “Are you having retransmissions?” (SNMP Polling) vs. “I see you are having a problem with retransmissions”. Both are agentless but there is a profound difference in terms of the value each solution delivers.

Where an agent-driven solution will provide insight into CPU, disk and memory, wire data will give you the performance metrics of the actual layer 7 applications. It will tell you what your ICA latency is as measured on the wire, it will tell you which SQL statements are running slow and which ones are not, and it will tell you which DNS records are failing. The key thing to understand is that Extrahop works as a surveillance tool; it is not running on a specific server and asking WMI fields what their current values are. This is profoundly different from what we have seen in traditional tools over the last 10-12 years.

When Machine data meets Wire Data:
I am now over 9 months into my Extrahop deployment and we have recently started a POC with Splunk and the first task I performed was to integrate Extrahop wire data into the Splunk Big Data back end. All I can say is that it has been like yin and yang. I am extremely pleased with how the two products integrate together and the fusion of wire data with machine data will give my organization a level of visibility that they have never had before. This is, in my opinion, the last piece to the Operational Intelligence puzzle.

In this post I want to talk about three areas where we have been able to see profound improvement in our environment and some of the ways we have leveraged Splunk and Extrahop to accomplish this.

How does data get from Extrahop to Splunk?
Extrahop has this technology called Triggers. Basically, there is a mammoth amount of data flowing through your Extrahop appliances (up to 20GB per second) and we are able to tap into that data as it is flowing by and send it to Splunk via Syslog. This allows me to tap into CIFS, SQL, ICA, MQSeries, HTTP, MySQL, NFS and Oracle as well as other layer 7 flows, and to send information from those flows (such as the statement, client IP, server and process time for SQL) to Splunk, where I can take advantage of its parsing and big data business intelligence. This takes data right off the wire and puts it directly into Splunk, just like any Unix-based system or Cisco device that is set to use Syslog. What I like about Extrahop is the ability to discriminate between what you send to Splunk and what you don't.

Extrahop/Splunk Integration: SQL Server Queries

Grabbing SQL Queries and reporting on their performance:
One of the most profound capabilities we have noted since we started integrating Splunk and Extrahop is the ability to create a flow and then cherry-pick metrics from it. Below you will see a pair of Extrahop triggers (the drivers for Splunk integration). The first trigger builds the flow by taking the DB.statement and the DB.procedure fields (pre-parsed on the wire) and creating a flow that you can then tap into when you send your Syslog message in the next trigger.

The stmt (var stmt) variable, refers to the flow that we just created above, we will instantiate this flow and pull from it key metrics such as statement and procedure and couple it with the DB.tprocess and then tie in the process time of specific SQL Statements.

At the bottom you see the RemoteSyslog.info command that sends the data to Splunk (or KIWI with SQL, or what we call “skunk”).

Note below, I am NOT logging the database name but that is a trigger option in Extrahop if you have more than one database that uses similar table names. Also note, the condition if (DB.tprocess >=0). I am basically grabbing every database process. This measurement is in milliseconds so if you only wanted to check database queries that took longer than one second it would read if (DB.tprocess>=1000)
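
The threshold condition described above can be sketched in plain JavaScript as follows. This is not the trigger itself: `sampleTransaction` stands in for the values a trigger would read from DB.statement, DB.procedure and DB.tprocess, and `buildSqlMessage` and the exact key names in the payload are assumptions made for illustration.

```javascript
// Stand-in for the values read from the wire; the fields mirror DB.statement,
// DB.procedure and DB.tprocess, but this object is hypothetical.
const sampleTransaction = {
  statement: "select * from Orders where OrderID = 42",
  procedure: "",
  tprocessMs: 1250,
};

// Build the key=value payload only when the process time meets the threshold.
// thresholdMs = 0 keeps every transaction; 1000 keeps those of one second or more.
function buildSqlMessage(tx, thresholdMs = 0) {
  if (tx.tprocessMs < thresholdMs) return null;
  return (
    " Statement=" + tx.statement +
    " Procedure=" + tx.procedure +
    " ProcessTime=" + tx.tprocessMs
  );
}

console.log(buildSqlMessage(sampleTransaction, 1000));
console.log(buildSqlMessage({ ...sampleTransaction, tprocessMs: 40 }, 1000));
```

Raising the threshold is the easiest way to cut the volume of messages sent to Splunk if you only care about slow queries.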


For myself, I assign both of these triggers to my Citrix XenApp servers and they are able to report on the database transactions that occur in my Citrix environment. Obviously, you can apply these triggers to your web services and individual clients as well as the database servers themselves. In my case, I already had a device group for the XenApp servers.

This translates into the metrics you see below where Splunk automatically parses the data for me (YES!) and I am ready to start drilling into it to find problem queries, tables and databases.

Below you see how easy (well, I recommend the O'Reilly "Regular Expressions" book) it is to now parse your wire data to provide the performance of specific queries. As you can see below, this allows you to see the performance of specific queries and get an understanding of how specific tables (and their corresponding indexes) are performing. The information you see in the graphic below can be delivered to Splunk in real-time and you can get this kind of insight without running SQL Profiler. If users are logging into the application with Windows credentials, you will have the user ID as well.

Also, you don't have to know regex every time; you can save the query below as a macro and never have to type the regex again. You can also make that rex field a static column. I am NOT a regex guru; I managed to get every field parsed with a book and Google.

For me this now allows you to report on average process time by:

  • Database Server
  • User ID
  • Database Table (if you know a little regex)
  • Database
  • Client Subnet
  • Client IP Address
  • Individual Stored Procedure

Basically, once your data is indexed by Splunk’s Big Data back end, you have “baseball stats” as in it is crazy what you can report on (Example: who hit the most home runs from the left side of the plate in an outdoor stadium during the month of July). You can get every bit as granular as that in your reporting and even more.

Extrahop/Splunk Integration: DNS Errors

Few issues will be as maddening and infuriating as DNS resolution issues. A Windows client can pine for resolution for as long as five seconds. This can create some serious hourglass time for your end users and impact the performance of tiered applications. An errant mistake in a .conf file mapping to an incorrect host can be an absolute needle in a haystack. With the Splunk integration (Extrahop's own console does a great job of this as well), you can take the DNS lookup failures off the wire as they happen in real time and put them into your Splunk Big Data platform. Below you see the raw data as it happens. (I literally went to a DOS prompt and started typing NSLOOKUP on randomly thought-up names. The great irony is that, in this age of domain squatting, 1/3 of them actually came back!!!!) As my mentor and brother James "Jim" Smith once told me, "if you have issues that are sometimes there and sometimes not there, it's probably DNS" or "If DNS is not absolutely pristine, funny things happen" or, my all-time favorite quote from my brother Jim, "Put that GOD DAMN PTR record in or I will kick your phucking ass!" Needless to say, my brother Jim is rather fond of the DNS failure record keeping of Extrahop.

Below you see a very simple trigger that essentially logs the client IP, the DNS Server IP and the DNS Query that was attempted, the condition is set so that it is triggered in the event of an error.

Below is the resultant raw data.


As with the SQL Data we had above, we have more Parsing goodness because we are integrating the data into Splunk: Note the server cycling through the domain DNS Suffix thus doubling the number of failures.

So in the same vein as "baseball stats", you can report on DNS lookup failures by DNS query (as you see above, the records that fail to be looked up most often), but you also have the ability to report on the following:

  • DNS Failures by Client IP and by DNS Query (Which server has the misconfigured conf file)
  • DNS Failures by DNS Server
  • DNS Failures by Subnet (bad DHCP setting?)
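
Because a client that cycles through its DNS suffix produces two failures for one bad name, it can help to strip the suffix before counting. The sketch below is plain JavaScript with invented records. The `searchSuffix` value and the helper names are assumptions, and in practice you would do the equivalent with rex and stats in Splunk.

```javascript
// Hypothetical DNS search suffix appended by the client's resolver.
const searchSuffix = ".corp.example.com";

// Invented failure records: client IP plus the name that failed to resolve.
const failures = [
  { ClientIP: "10.1.1.20", Query: "sqlprod9" },
  { ClientIP: "10.1.1.20", Query: "sqlprod9.corp.example.com" },
  { ClientIP: "10.1.1.31", Query: "fileshare7.corp.example.com" },
];

// Strip the search suffix so both attempts for one name count as one lookup.
function baseName(query) {
  const q = query.toLowerCase();
  return q.endsWith(searchSuffix) ? q.slice(0, -searchSuffix.length) : q;
}

// Collect the distinct failing names per client.
function failingNamesByClient(records) {
  const byClient = new Map();
  for (const { ClientIP, Query } of records) {
    if (!byClient.has(ClientIP)) byClient.set(ClientIP, new Set());
    byClient.get(ClientIP).add(baseName(Query));
  }
  return byClient;
}

for (const [client, names] of failingNamesByClient(failures)) {
  console.log(client, [...names]);
}
```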

Proper DNS pruning and maintenance takes time, until Extrahop I cannot think of how I would monitor DNS failures outside of wireshark (Great tool but not much big data or business intelligence behind it). The ability to keep track of DNS failures will go a very long way in providing needed information to keep the DNS records tight. This will translate into faster logon times (especially if SRV lookups are failing) and better overall client-server and nth-tiered application performance.

Extrahop/Splunk Integration: Citrix Launch Times

One of the more common complaints of Citrix admins is the slow launch times. There are a number of variables that Extrahop can help you measure but for this section we will simply cover how to keep track of your launch times.

Below you see a basic trigger that will keep track of the load time and login time. I track both of these metrics because, often, if the login time is 80-90% of the overall load time you likely need to take a look at group policies or possibly loopback processing. This can give you an idea of where to start. If you have a low loginTime metric but a high loadTime metric, it could be something network/DNS related. You create this trigger and assign it to all XenApp servers.
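
That rule of thumb can be written down as a small triage function. The sketch below is plain JavaScript. The 0.8 cut-off comes from the "80-90%" figure above, while the 0.2 cut-off, the 30-second definition of a slow launch and the name `triageLaunch` are my own assumptions.

```javascript
// Classify a launch from its login time and overall load time (both in ms).
// slowLoadMs is an assumed limit for what counts as a slow launch.
function triageLaunch(loginTimeMs, loadTimeMs, slowLoadMs = 30000) {
  if (loadTimeMs < slowLoadMs) return "ok";
  const loginShare = loginTimeMs / loadTimeMs;
  if (loginShare >= 0.8) return "check group policy / loopback processing";
  if (loginShare <= 0.2) return "check network / DNS";
  return "no clear lead";
}

console.log(triageLaunch(36000, 40000)); // login is 90% of the load time
console.log(triageLaunch(4000, 40000)); // login is 10% of the load time
console.log(triageLaunch(3000, 8000)); // launch is not slow to begin with
```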


The Raw Data: Below you see the raw data, I am not getting a username yet, there is a trick to that I will cover later but you see below I get the Client Name, Client IP, Server IP and I would have my server name if my DNS was in order (luckily my brother Jim isn’t here)

As with the previous two examples, you now can start to generate metrics on application launch performance.


And once again with the baseball stats theme you can get the following metrics once your Extrahop data is integrated into Splunk:

  • Average Launch time by UserName
  • Average Launch time by Client Name
  • Average Launch time by Client IP
  • Average Launch time by Customer Subnet (using some regex)
  • Average Launch time by Application (as you see above)
  • Average Launch time by XenApp Server (pinpoint a problem XenApp server)

Conclusion:
I did not show the Extrahop console in this post, although it is quite good; I wanted to show how you could integrate wire data into your Splunk platform and make it available to you along with your machine data. While you are not going to see CPU, disk IOPS or memory utilization on the wire, you will see some extremely telling and valuable data. I believe that all systems and system-related issues will manifest themselves on the wire at some point. An overloaded SQL Server will start giving you slower ProcessTime metrics. A flapping switch in an MDF at a remote site might start showing slower launch times in Citrix, and a misconfigured .conf file may cause lookup failures for the tiered applications that you run. These are all metrics that may not manifest themselves with agent-driven tools but you can note them on the wire. Think of Extrahop as your "Pit Boss" engaging in performance surveillance of your overall systems.

I have found the integration between Splunk and Extrahop gives me a level of visibility that I have never had in my career. This is the perfect merger of two fantastic data sources.

In the future I hope to cover integration for HTTP, CIFS as well as discuss the security benefits of sending wire data to Splunk.

Thanks for reading.

John M. Smith

Useful Regex statements

Getting the Subnet ID (24 bit Mask)
This is the REX statement that will let you query for a 24 bit subnet ID.  This will let you check Citrix Latency and Load/Launch times by Subnet within a customer’s network.

rex field=client_ip "(?<net_id>\d+\.\d+\.\d+)" | stats avg(Load) count(load_time) by net_id

Getting performance on SQL INSERT statements:

The REGEX below will allow you to get the actual table that an insert command is updating.  This could be useful to see if SQL write actions are not performing as expected.  This REX will parse out the table name so that you can check the performance of specific tables.

rex field=_raw "Statement=insert INTO\s(?<Table>.[^\s]+)"

Getting the Table Name within a SELECT statement:
The REX statement below allows you to get the table that a select statement is running against.  Mapping the performance by Table name may give you an indication that you need to re-index.
| rex field=_raw "[fF][rR][oO][mM]\s(?<Table>.[^\s]+)"
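
Both table-name patterns can be checked quickly outside Splunk. The sketch below runs them in plain JavaScript against two invented statements. Note that the INSERT pattern as written is case-sensitive, so the `i` flag used here is an addition of mine to make it match any letter case.

```javascript
// Invented payloads in the "Statement=..." layout.
const insertEvent = "Statement=insert INTO dbo.Orders (id, total) values (1, 9.99)";
const selectEvent = "Statement=SELECT id FROM dbo.Customers WHERE id = 7";

// Same captures as the rex commands above; the "i" flag is an addition here.
const insertTable = /Statement=insert INTO\s(?<Table>.[^\s]+)/i;
const selectTable = /[fF][rR][oO][mM]\s(?<Table>.[^\s]+)/;

console.log(insertTable.exec(insertEvent).groups.Table);
console.log(selectTable.exec(selectEvent).groups.Table);
```

In both cases the capture stops at the first whitespace after the table name, so a schema-qualified name such as dbo.Orders is captured whole.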

From Buzzword to Buziness: A conversation about Operation Intelligence with 3 Pioneers at Citrix Synergy

While a family commitment is keeping me from moderating my Geek Speak session you get a “Moderator Upgrade” in the form of Splunk’s Brandon Shell.  

As we have discussed, there are changes to Citrix’s Edgesight Product as well as some great innovations by smaller APM players that are well worth looking into.

This Geek Speak features the CEO of the recent “Best of Interop” awardee Extrahop, Jesse Rothstein, Jason Conger, an architect for Splunk and Dana Gutride, an architect for Citrix’s next generation of monitoring.  

In this session we will question the dominant paradigm surrounding monitoring the user experience and engage in a discussion about what it is to leverage Operational Intelligence. 

 

Preparing for life without Edgesight with ExtraHop

So, the rumors have been swirling and I think we have all come to the quiet realization that Edgesight is going to be coming to an end. At least the Edgesight we know and Love/Hate.  While we all await the next version of HDX Edgesight we can almost be certain that the data model and all of the custom queries we have written over the last three years will not be the same.

For those of us who have continued with this labor of love, trying to squeeze every possible metric we could out of Edgesight, we are likely going to have to come to grips with the fact that the next generation of Edgesight will not have the same level of metrics we have today.

Let's be honest: Edgesight has been a nice concept, but there have been extensive problems with the agent, both from a CPU standpoint (the firebird service taking up 90% CPU) and in keeping the versions consistent. The real-time monitoring requires elevated permissions for the person looking into the server, forcing you to grant your service desk higher permissions than many engineers are comfortable with. I am, for the most part, a "tools"-hater. In the last 15 years I have watched millions of dollars spent on any number of tools, all of which told me that they would be the last tool I would need and all of which were, in my opinion, for the most part underwhelming. I would say that Edgesight has been tolerable to me and it has done a great job of collecting metrics but, like most tools I have worked with, it is agent based, and it cannot log in real-time. The console was so unusable that I literally have not logged into it for the last four years. (In case you were wondering why I don't answer emails with questions about the console.)

For me, depending on an agent to tell you there is an issue is a lot like telling someone to "yell for help if you start drowning". If a person is under water, it's a little tough for them to yell for help. With agents, if there is an issue with the computer, whatever that is (CPU, disk I/O, memory) will likely impact the agent as well. The next best thing, which is what I believe Desktop Director is using, is to interrogate a system via WMI. Thanks to folks like Brandon Shell, Mark Schill and the people at Citrix who set up the PowerShell SDK, this has given rise to some very useful scripting that has given us the real-time logs that we have desperately wanted. That works great for looking at a specific XenApp server, but in the Citrix world, where we are constantly "proving the negative", it does not provide the holistic view that Edgesight's downstream server metrics provided.

Proving the negative:

As some of you are painfully aware, Citrix is not just a Terminal Services delivery solution. In our world, XenApp is a Web Client, a Database Client, Printing Client and a CIFS/SMB client. The performance of any of these protocols will result in a ticket resting in your queue regardless of the downstream server performance. Edgesight did a great job of providing this metric letting you know if you had a 40 second network delay getting to a DFS share or a 5000ms delay waiting for a server to respond. It wasn’t real-time but it was better than anything I had used until then.

While I loved the data that Edgesight provided, the agent was problematic to work with and I had to wait until the next day to actually look at the data. Unless you ran your own queries and did your own BI integration, you had yet another console to go to, and you needed to provide higher credentials for the service desk to use the real-time console.

Hey! Wouldn't it be great if there were a solution that would give me the metrics I need to get a holistic view of my environment? Even better, if it were agentless I wouldn't have to worry about which .NET framework version I had, changes in my OS, the next security patch that takes away kernel-level access, or just all-around agent bloat from the other two dozen agents I already have on my XenApp server. Not to mention the fact that the decoupling of GUIDs and images thanks to PVS has caused some agents to really struggle to function in this new world of provisioned server images.

It’s early in my implementation but I think I have found one….Extrahop.

Extrahop is the brain-child of ADC pioneer Jesse Rothstein, who was one of the original developers of the modern Application Delivery Controller. The way Extrahop works is that it sits on the wire, grabs pertinent data and makes it available to your engineers and, if you want, your operations staff. Unlike wireshark, which is a great tool for troubleshooting, it does not force you, figuratively, to drink water from a fire hose. They have formed relationships with several vendors, gained insight into their packets and are able to discriminate between which packets are useful to you and which packets are not. I am now able to see, in real-time and without worrying about an agent, ICA launch times and the authentication time when a user launches an application. I can also see client latency and virtual channel bytes in and bytes out for printer, audio, mouse, clipboard, etc.

(The Client-Name, Login time and overall Load time as well as the Latency of my Citrix Session)

In addition to the Citrix monitoring, it helps us with “proving the negative” by providing detailed data about Database, HTTP and CIFS connections. This means that you can see, in real-time, performance metrics of the application servers that XenApp is connecting to. If there is a specific URI that is taking 300 seconds to process, you will see it when it happens without waiting until the next day for the data or having to go to edgesightunderthehood.com to see if John, David or Alain have written a custom query.

If there is a conf file that has an improper DNS entry, it will show up as a DNS Query failure. If your SQL Server is getting hammered and is sending RTOs, you will see it in real-time/near-time and be able to save yourself hours of troubleshooting.

(Below, you see the different metrics you can interrogate a XenApp server for.)


Extrahop Viewpoints:
Another advantage of Extrahop is that you can actually look at metrics from the point of view of the downstream application servers as well. This means that if you publish an IE Application and it connects to a web server that integrates with a downstream database server you can actually go to that web server you have published in your application and look at the performance of that web server and the database server. If you have been a Citrix Engineer for more than three years, you should already be used to doing the other team’s troubleshooting for them but this will make it even faster. You basically get a true, holistic view of your entire environment, even outside of XenApp, where you can find bottlenecks, flapping interfaces and tables that need indexing. If your clients are on an internal network, depending on your topology you can actually look at THEIR performance on their workstations and tell if the switch in the MDF is saturated.

Things I have noted so far looking at Extrahop Data:

  • SRV Record Lookup failures
  • Poorly written Database Queries
  • Excessive Retransmissions
  • Long login times (thus long load times)
  • Slow CIFS/SMB Traffic
  • Inappropriate User Behavior

GEOCODING Packets:
Another feature I like is the geocoding of packets. This is very useful if you want to bind a geomap to your XenApp servers to see if there is any malware making connections to China, Russia, etc. (I have an ESUTH post on monitoring malware with Edgesight.) Again, this gives me a real-time look at all of the TCP connections through my firewall, or I can bind it on a per-XenApp, Web Server or even PC node basis. The specific image below is of my ASA 5505 and took less than 15 seconds to set up (not kidding).
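
To make the geocoding idea concrete, here is a minimal Python sketch of the underlying check: map each destination address to a country and flag anything on a watch list. This is not how Extrahop implements it. The lookup table, the country assignments and the function names are all hypothetical, and the address ranges are reserved documentation ranges standing in for a real GeoIP database.

```python
import ipaddress

# Hypothetical lookup table. A real deployment would use a GeoIP database;
# these are reserved documentation ranges with made-up country codes.
GEO_TABLE = {
    ipaddress.ip_network("203.0.113.0/24"): "CN",
    ipaddress.ip_network("198.51.100.0/24"): "RU",
    ipaddress.ip_network("192.0.2.0/24"): "US",
}

# Countries we want to be alerted about (an assumption for this example).
WATCH_LIST = {"CN", "RU"}


def country_for(ip):
    """Return the country code for an address, or None if it is not in the table."""
    addr = ipaddress.ip_address(ip)
    for network, country in GEO_TABLE.items():
        if addr in network:
            return country
    return None


def flag_connections(connections):
    """Return (source, destination, country) for every connection to a watched country."""
    flagged = []
    for src, dst in connections:
        country = country_for(dst)
        if country in WATCH_LIST:
            flagged.append((src, dst, country))
    return flagged


if __name__ == "__main__":
    sample = [("10.0.0.5", "203.0.113.9"), ("10.0.0.6", "192.0.2.1")]
    for src, dst, country in flag_connections(sample):
        print(f"{src} -> {dst} ({country})")
```

The value of doing this on the wire is that the list of connections comes from every host behind the span port, not just the ones that happen to have an agent installed.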

On the wire (Extrahop) vs. On the System (Agent):
I know most of us are “systems” guys and not so much Network guys. Because there is no agent on the system and Extrahop works on the wire, you have to approach it a little differently, but you soon see how you can live without an agent. Just about everything that happens in IT has to come across the wire and you already have incumbent tools to monitor CPU, Memory, Disk and Windows Events. The wire is the last “blind spot” that I had not had a great deal of visibility into from a tools perspective until I started using Extrahop. Yes, there was Wireshark, but archiving data and looking at specific streams are not quite as easy with it. Yes, you can filter and you can “Follow TCP Stream” with Wireshark but it is going to give you very raw data. I even edited a TCPDUMP based PowerShell script to write the data to SQL Server thinking I could archive the data that way. I had 20GB of data inside of 30 minutes. With Extrahop you can actually trigger wire captures based on specific metrics and events that it sees in the flow, and all of the sifting and stirring is done by Extrahop, leaving you to just collect the gold nuggets.
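
The idea of capturing only when a metric looks wrong, rather than capturing everything, can be sketched in a few lines of Python. This is only an illustration of the decision logic, not Extrahop's trigger API. The `FlowSample` fields and the threshold values are assumptions I picked for the example.

```python
from dataclasses import dataclass


@dataclass
class FlowSample:
    """One hypothetical measurement for a flow between a client and a server."""
    server: str
    packets: int
    retransmissions: int
    process_time_ms: float


def should_capture(sample, max_retrans_ratio=0.05, max_process_ms=5000.0):
    """Decide whether this flow is interesting enough to keep a packet capture.

    Thresholds are illustrative defaults, not vendor recommendations.
    """
    # Guard against division by zero on an idle flow.
    ratio = sample.retransmissions / sample.packets if sample.packets else 0.0
    return ratio > max_retrans_ratio or sample.process_time_ms > max_process_ms


if __name__ == "__main__":
    noisy = FlowSample("sql01", packets=1000, retransmissions=80, process_time_ms=120.0)
    print(should_capture(noisy))
```

Filtering at this point is what keeps you from ending up with 20GB of raw packets that nobody will ever read.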

Because it is agentless you don’t have questions like “Will Extrahop support the next edition of XenApp?”, “Will Extrahop support Windows Server 2012?”, “What version of the .NET Framework do I need to run Extrahop?” or “I am on Server version X but my agents are on version Y.”

The only question you have to answer to determine if your next generation of hardware/software will be compatible with Extrahop is “Will you have an IP Address?” If your product is going to have an IP Address, you can use Extrahop with it. Now, you have to use RFC Compliant protocols and Extrahop has to continue to develop relationships with vendors for visibility but in terms of deploying and maintaining it, you have a much simpler endeavor than other vendors. The simplicity of monitoring on the wire is going to put an end to some of the more memorable headaches I have had in my career revolving around agent compatibility.

Splunk/Syslog Integration:
I recently told my work colleagues that I am going to say “no thanks” to the next monitoring vendor that shows up saying I have to add yet another console. While the Extrahop console is actually quite good and gives you the ability to logically collate metrics, applications and devices the way you like, it also has extensive Splunk integration. If there are specific metrics that you want sent to an external monitor, you can send them to your syslog server and integrate them into your existing syslog strategy, be it Envision, KIWI Syslog Server or any other SIEM product. Extrahop has a JavaScript based trigger solution that allows you to tap into custom flows and cherry pick those metrics that are relevant to you. Currently, there is a very nice and extensive Splunk app for Extrahop.
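
To show what sending a metric to syslog amounts to, here is a small Python sketch that formats an event as key=value pairs (a layout Splunk can generally extract fields from at search time) and sends it as a UDP syslog datagram. It is not Extrahop trigger code, which is JavaScript and runs on the appliance. The field names, function names and the choice of the local0 facility are assumptions for illustration.

```python
import socket


def format_event(event_type, fields):
    """Render an event as a single line of key=value pairs."""
    parts = [f"event={event_type}"]
    for key, value in fields.items():
        text = str(value)
        # Quote values that contain spaces (or are empty) so they stay one field.
        if " " in text or text == "":
            text = '"' + text.replace('"', "'") + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


def send_syslog(message, host, port=514, priority=134):
    """Send one datagram to a syslog server over UDP.

    priority 134 = facility local0 (16) * 8 + severity info (6).
    """
    payload = f"<{priority}>{message}".encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    line = format_event("dns_failure", {"client": "10.0.0.5", "query": "bad host"})
    print(line)
    # send_syslog(line, "syslog.example.com")  # hypothetical server name
```

Because the receiving end is plain syslog, the same events can feed Splunk or whatever SIEM you already have, which is the whole point of not adding another console.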

I am currently logging (in real-time) the following with Extrahop:

  • DNS Failures (Few people realize how poor DNS can wreck nth-tiered environments)
  • ICA OPEN Events (to get logon times and authentication times)
  • HTTP User Agent Data
  • HTTP Performance Data

So if this works by monitoring the wire, isn’t it the Network team’s tool?
The truth is it’s everybody’s tool. The only thing you need the network team to do is span ports for you (then log in and check out their own important metrics). The DBA can log in and check the performance of their queries. The Network Engineers can log in and check jitter, TCP retransmissions, RTOs and throughput. The Citrix guy can log in and check Client Latency, STA Ticket delivery times, ICA Channel throughput and Logon/Launch Times. The Security team can look for TCP Connections to China or Russia and catch people RDPing to their home networks. The Web Team can check which User-Agents are the most popular to determine if they need to spend more time accommodating tablets.

Everybody has something they need on the wire. I sometimes fear that we tend to select our tools based on what technical pundits tell us. In our world, from a vendor standpoint, we tend to like to put things in boxes (which is a great irony given everyone’s “think outside the box” buzz statement). We depend on thought leaders to put products in boxes and tell us which ones are leaders, visionaries, etc. I don’t blame them for providing product evaluations that way; we have demanded that. For me, Extrahop is a great APM tool but it is also a great Network Monitoring tool and has value to every branch of my IT Department. This is not a product whose value can be judged by finding its bubble in a Gartner scatter plot.

Conclusion:
I have not even scratched the surface of what this product can do. The triggers engine basically gives you the ability to write nearly any rule you want to log or report any metric you want. Yes, there are likely things you can get with an agent that you cannot get without one, but in the last few years these agents have become a lot like a ball and chain. You basically install the appliance or import the VM, span the ports and watch the metrics come in. I have had to change my way of thinking about metrics gathering, from system specific collection to siphoning data off the wire, but once you wrap your head around how it is getting the data you really get a grasp of how much more flexibility you have with this product than with agent based solutions. The Splunk integration was the icing on the cake.

I hope to record a few videos showing how I am doing specific tasks, but please check out the links below as they have several very good live demos.

To download a trial version: (you have to register first)
http://www.extrahop.com/discovery/

Numerous webinars:
http://www.extrahop.com/resources/

YouTube Channel:
http://www.youtube.com/user/ExtraHopNetworks?feature=watch

Thanks for reading and happy holidays!

John