Tips & Tricks – Splunk Blogs

How’s my driving?


It was the summer of 2014. I was well into my big data addiction thanks to Splunk. I was looking for a fix anywhere: Splunk my home? Splunk my computer usage? Splunk my health? There were so many data points out there for me to Splunk, but none of them would pay off like Splunking my driving…

Rocky Road

At the time, my commute was rough. Roads with drastically changing speeds, backups at hills and merges, and ultimately way more stop and go than I could stomach. But how bad was my commute? Was I having as bad an impact on the environment as I feared? Was my fuel efficiency much worse than my quiet cruise-controlled trips between New York and Boston? With my 2007, I really had no way to know…that is, until I learned about Automatic.

Automatic is a dongle that goes into your On Board Diagnostic (OBD) port. This port is hiding in plain sight – typically right under your steering wheel – but only your mechanic knows to look for it. The OBD port is what mechanics use to talk to the car during service. It turns out there’s a ton of information available through that USB-like port, and there’s a slew of new devices and dongles (of which Automatic is one) that expose that information to you. Combine that car info with what your phone is capturing of the world around you and you’ve got yourself a delicious data stew!

The Dataman Cometh

Out of the box, Automatic provides really cool details about every trip it records: fuel efficiency, route path, drastic accelerations, sudden brakes, periods of speeding, and so on. This is all accessible for review on your smartphone. Somewhat hidden is the fact that Automatic also provides a dashboard where all of your trips can be seen in aggregate (http://dashboard.automatic.com). This dashboard allows you to see some basic aggregate statistics (sums and averages) for a selected time period of trips. I’m sure you see where I’m going with this…

I wanted more from this data. I had already been spoiled by the power of Splunk and I knew that if I could Splunk this data, I could do so much more. That’s when I noticed in the bottom right corner of the dashboard an export option. EUREKA! I immediately downloaded the resulting CSV of my trip data and got to work adding it to Splunk.

Getting Dizzy with Vizzies

I created each and every search that immediately came to mind: total miles driven, total time behind the wheel, total number of trips. Then I got into the visualizations: number of trips over time, fuel efficiency trends relative to instances of poor driving (sudden brakes, fast accelerations, or speeding), fuel prices over time, and location of the cheapest gas observed.

Basic aggregate analytics of Automatic data within Splunk.
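For a rough idea of what those first searches can look like, here is a minimal sketch. The sourcetype and field names (automatic_trips, distance_mi, duration_s) are assumptions and will depend on how you index the exported CSV:

sourcetype=automatic_trips
| stats sum(distance_mi) AS total_miles
        sum(duration_s) AS total_seconds
        count AS total_trips
| eval hours_behind_wheel = round(total_seconds / 3600, 1)
| table total_miles hours_behind_wheel total_trips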

As I started seeing my data represented visually, I was reminded of something from Dr. Tom LaGatta’s .conf2014 talk “Splunk for Data Science” (http://conf.splunk.com/speakers/2014.html#search=lagatta). He spoke about how the human brain processes data more effectively in a visual manner. I was now seeing exactly what he meant! My data was no longer a table, or a graphic for a single trip. Instead I was able to use my entire trip history to create visualizations that demonstrated my driving trends and the impact of my behaviors on fuel efficiency – things I would never have captured by looking at individual events, nor by reading a spreadsheet of numbers. With this new enthusiasm, I took Dr. LaGatta’s premise to the max and used a D3 Chord Diagram to represent my travel frequency from zip code to zip code.

Density of zip code transits represented with a Chord Diagram

Dynamic Data

After posting this collection of insights as an app on SplunkBase, I met a wise Splunk Ninja named Todd Gow. Todd showed me the ways of the force by using Damien Dallimore’s REST API Modular Input to pull my Automatic data perpetually, rather than solely through manual CSV exports from dashboard.automatic.com. Thanks to Todd and Damien’s coaching, as well as some support from the developers at Automatic, the app was now able to pull a user’s driving behavior automatically.

This was a game changer. Not only did it ease the import of data, but it meant that users of Automatic and Splunk could now create Splunk alerts based on driving performance! Alerts that highlight everything from erratic driving, to poor fuel efficiency, to notifications that the vehicle has gone outside its typical geographic area (stolen?). The possible insights were growing now that new data could be analyzed against historical performance.

“Where we’re going, we don’t need roads”

Since Back to the Future II’s prediction for 2015 was off, let’s conclude by covering some final insights achieved thanks to Splunk + Automatic:

Besides the metrics mentioned above, I was able to calculate some interesting details about my fuel efficiency. Automatic provides audio tones to alert the driver to speeding, drastic acceleration, and sudden braking. This produced a Pavlovian response for me such that, over time, I could feel myself adjusting my driving behavior in response. So this led me to wonder: Has Automatic saved me money on fuel by adjusting my driving behavior? Thanks to Splunk, I was able to calculate such an answer!

First I calculated my average fuel efficiency from when I started with Automatic until now. Since each trip with Automatic includes fuel consumption and fuel prices (as calculated by prices local to my trip location), I was able to compare the fuel efficiency increase to the average fuel prices and provide an estimate of money saved. Considering the cost of Automatic’s dongle, I concluded that I MADE money by spending less on fuel thanks to my improved driving behavior!

Fuel efficiency data from Automatic represented in Splunk
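If you are curious how such an estimate can be assembled, here is a minimal sketch of the idea. The sourcetype, the field names (average_mpg, distance_mi, fuel_price_usd), and the six-month split between “baseline” and “recent” driving are all assumptions; the app’s own dashboards are the authoritative version:

sourcetype=automatic_trips
| eval period = if(_time < relative_time(now(), "-6mon"), "baseline", "recent")
| stats avg(eval(if(period="baseline", average_mpg, null()))) AS baseline_mpg
        avg(eval(if(period="recent", average_mpg, null()))) AS recent_mpg
        sum(eval(if(period="recent", distance_mi, null()))) AS recent_miles
        avg(fuel_price_usd) AS avg_price
| eval gallons_saved = recent_miles/baseline_mpg - recent_miles/recent_mpg
| eval estimated_savings = round(gallons_saved * avg_price, 2)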

To learn about the search techniques to draw these, and other, conclusions, download the app from SplunkBase and check out the dashboards for yourself.

“Get Outta My Dreams, Get Into My Car”

With my Automatic driving data in Splunk, my mind won’t stop “racing” with new insights to implement. In addition to the alerts proposed in this post, I plan to provide insights into vehicle usage for wear and tear, route path details, stronger fuel economy calculations, and webhook integrations.

Before any of that, I have to finish what has become a complete app re-write. I’ve been rewriting the data ingestion modular input to both simplify app configuration as well as take advantage of the new Automatic API and its new field names.

Down the “road”, I hope to collaborate with the Common Information Model team here at Splunk to define a vehicle-based data model. That way, users of any vehicle-related data capture (such as Automatic) could take advantage of the Automatic app on SplunkBase and get the same insights, regardless of the data’s differences in field names and formats.

Of course, if you’ve found cool insights into your driving data, post it below and share your discoveries! Let us know how you used Splunk to make your “machine data accessible, usable and valuable to everyone.” Thanks for reading, drive safe, and happy Splunking!


My Splunk Origin Story


A World Without Splunk

In my pre-Splunk days, I spent significant time leading the vision for standards and automation in our company’s large distributed IBM WebSphere Network Deployment environment. Even though we used standard build tools and a mature change process, significant entropy and deviations were introduced into the environment as a product of requirements for tuning, business, infrastructure, security, and compliance.

As a result, we were unable to recognize the scope of impact when it came to security vulnerabilities or third-party compliance violations. Even worse for us, we spent way too many staff-hours trying to replicate issues between production and quality assurance environments because we had no easy way to recognize the contributing configuration differences.

It’s a Bird, It’s a Plane, It’s Splunk!

Given the challenge of aggregating and correlating disparate data, my searching eventually led me to Splunk.

I quickly grew acclimated to Splunk. The Search Tutorial walked me through the install. The documentation was easy to read and like nothing I had ever seen – I didn’t need a PhD to understand it, and I was immediately seeing how I would get value.

After some time playing and working with the developers of the [then beta] add-on for WebSphere (latest is at https://splunkbase.splunk.com/app/2789/), we were up and running with WebSphere’s configuration files populating in Splunk. We were finally able to compare environments to each other, themselves over time, or against the entire infrastructure thanks to Splunk’s search processing language, field extraction, and native processing of XML. Furthermore, Splunk’s Schema-on-the-fly meant that as WebSphere’s XML object structure changed we could make the minor tweaks and adjustments without having to rebuild a database, or re-code a custom solution, or wait for an updated product from the vendor.

Most importantly, the model around data aggregation with forwarders eliminated the risk of entropy inherent in other solutions. What we saw in Splunk was what was set on the infrastructure, regardless of changes. No one had to manually update records according to documented changes. In fact, because Splunk was read-only, it satisfied audit and change control concerns that other products presented. Lastly, because Splunk is the platform for machine data, regardless of structure, we were able to correlate problems and configuration with data from the Java runtime (JMX), system metrics from the operating system (CPU, memory, disk, etc.) for both Windows and AIX, and both JVM and application logs. The compounding value from these otherwise disparate data sources was astonishing.

And There Was Much Rejoicing!

Thanks to Splunk’s dashboarding capabilities, I was able to create dashboards that dynamically presented configuration discrepancies between two JVMs. This addressed our challenge of identifying which discrepancies were contributing to unexpected runtime behavior. Another dashboard showed which JVMs were missing, or had a non-standard value for, a property selected from a dynamically populated drop-down of all JVM custom properties that existed throughout the infrastructure. This satisfied our challenge of understanding the scope of impact for vulnerabilities and compliance addressed by such properties.
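To give a feel for the idea behind that comparison dashboard, here is a sketch of what an underlying search might look like. The sourcetype and the jvm_name/property_name/property_value fields are assumptions standing in for whatever the WebSphere add-on extracts in your environment, and $jvm_a$/$jvm_b$ are dashboard drop-down tokens:

sourcetype=websphere:config (jvm_name="$jvm_a$" OR jvm_name="$jvm_b$")
| stats values(property_value) AS observed_values dc(property_value) AS distinct_values by property_name
| where distinct_values > 1
| rename property_name AS "Custom Property" observed_values AS "Conflicting Values"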

Given the small and infrequent volume of configuration changes, this solution could be implemented with the free Splunk license. Using the free license obviously has limitations on functionality, but I mention this because it’s worth highlighting how much value we were able to get with such a small Enterprise license.

The Cliff Hanger

Unfortunately, I left that company (to be a full time Splunk admin!) before the solution was fully deployed. But that’s why I wanted to share it with you. I’d love to hear how others are able to demonstrate amazing value by adding configuration data to the operating system and log data that they are already Splunking.

Please build on this solution and share it in the comments…or even create your own app or Data Models for Splunk! Show the rest of us how you’ve found your own way to “make machine data accessible, usable and valuable to everyone!”

Thanks for reading and happy Splunking!

Writing Actionable Alerts


Is your Splunk environment spamming you? Do you have so many alerts that you no longer see through the noise? Do you fear that your Splunk is losing its purpose and value because users have no choice but to ignore it?

I’ve been there. I inherited a system like that. And what follows is an evolution of how I matured those alerts from spam to saviors.

Let it be known that Splunk does contain a number of awesome search commands to help with anomaly detection. If you enjoy what you read here, be sure to check them out since they may simplify similar efforts. http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Commandsbycategory#Find_anomalies

Stage 1: Messages of Concern

Some of the first alerts created are going to be searches for a specific string like the word “error”. Maybe that search runs every five minutes and emails a distribution list when it finds results. Guess what? With a search like that, you’ve just made a giant leap into the world of spammy searches. Congratulations…?!

Stage 2: Thresholds

Ok, so maybe alerting every time a message appears is too much. The next step is often to strengthen that search by looking for some number of occurrences of the message over some time period. Maybe I want to be alerted if this error occurs more than 20 times in a five minute window.
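A minimal sketch of that stage-2 alert, scheduled every five minutes, might look like this (the index, sourcetype, and error string are placeholders for your own data):

index=app_logs sourcetype=app "error" earliest=-5m@m latest=@m
| stats count AS errors
| where errors > 20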

Unfortunately, even with this approach you’ll soon hit your “threshold” (pun) for threshold-based searches. What you didn’t know is that you tuned this search during a low part of the day and the 20 times is too low for peak activity. The next logical step is to increase the thresholds to something like 40 errors every five minutes to accommodate the peak and ignore the low periods when there’s less customer impact. Not ideal but still an improvement.

Unfortunately, you’re still driving blind. Over time, if business is bad, you may stop seeing alerts simply because the threshold of 40 errors every five minutes is unattainable for the lower customer usage of the system. In that case, you still have issues but you’re ignoring them! What about the inverse? If business is good, usage of the system should pick up and a threshold of 40 errors every five minutes will generate spam. In both scenarios you’re not truly seeing the relative impact of the error to the overall activity on the system during that time period.

Stage 3: Relative Percentages

Fortunately, you’ve realized the oversight you’ve made. So you embrace the stats command to do some eval-based statistics to only alert you when the errors you want are more than some percentage (let’s say, 50%) of the overall event activity during that time window. Wow! Amazing improvement. Less spam and you are now seeing spikes only! Good job!
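Sticking with the same placeholder data, the relative-percentage version of the alert might look something like this:

index=app_logs sourcetype=app earliest=-5m@m latest=@m
| eval is_error = if(like(_raw, "%error%"), 1, 0)
| stats count AS total sum(is_error) AS errors
| eval error_pct = round(100 * errors / total, 1)
| where error_pct > 50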

Of course, we can do even better. Who says that it isn’t normal for the errors to be more than that 50%? Now we’re back at Stage 2 and tweaking relative thresholds based on our perception of what is a “normal” percentage of errors for this system. If you’re with me so far, then buckle up, because we’re about to move into the “behavioral” part of the discussion.

Stage 4: Average Errors

You want to calculate what is normal for your data. You decide that using the average number of errors over some time period is your way to go. So you now use timechart to determine the average number of errors over a larger time window and compare that result to the number of errors in the current time window. If the current time window has more, then you alert.
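One rough way to express that comparison in a single search (again with a placeholder index and sourcetype, and a week of history feeding the average) is to bucket the counts with timechart and compare the newest bucket against the overall average:

index=app_logs sourcetype=app "error" earliest=-7d@m latest=@m
| timechart span=5m count AS errors
| eventstats avg(errors) AS avg_errors
| tail 1
| where errors > avg_errors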

Your search probably looks fancy by now. You may have even implemented a summary index or some acceleration and switched to the tstats command to facilitate the historical average. If you didn’t know Splunk’s search language yet, you’re definitely learning it now.

There’s just one little catch: When we compare against the average (or better yet, the median), we are going to trigger an alert half the time simply because of how the average is calculated. That’s no good because the goal here is to reduce the spam and instead we just created a dynamic threshold that is still alerting us about half of all time periods.

Stage 5: Percentiles

Now, if you don’t remember percentiles from grade school math, fear not! I didn’t either and I simply read the Wikipedia entry and got it pretty quickly. I’ll try to explain it here in the context of our challenge but I’m sure there are a bazillion pages on the interwebs where it’s better described.

So far, our search is looking at the number of times our error appears over the last five minutes. It compares that quantity against the average we’ve seen for all prior five minute windows, and, if the current five minute window is larger than the average, an alert is triggered.

Now what if instead of the average, we wanted to alert only when the current five minute window’s count of errors is larger than the maximum count of such errors within a five minute window? That would be cool because we’d know if we’ve spiked higher than ever before! But what if we already know our historical record has some wild spikes that are too high to compare against? If we assume a normal distribution (silly math stuff), then we can ignore the top five percent of high values. So if we took all the historical values we had and wanted to compare against everything except the highest set of values that are five percent of the total historical snapshots we have, we would be talking about the 95th percentile.

Did I lose you there? Let’s try another approach: When I compare the current five minute window’s count of errors against the historical snapshots of prior five minute windows, for the 95th percentile, I know that if I were to order all the historical results, 95% of the results are lower than the 95th percentile and only five percent of the results are higher.
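In search terms, the only change from the averaging sketch above is swapping the average for a percentile function; perc95 here, which you can lower later as you clean up the noisiest errors:

index=app_logs sourcetype=app "error" earliest=-7d@m latest=@m
| timechart span=5m count AS errors
| eventstats perc95(errors) AS p95_errors
| tail 1
| where errors > p95_errors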

If you’re still not with me, see if Siri can help. It’s cool. I’ll wait ’cause I realize this is getting confusing….

…waiting….

Cool! You’re back! Thanks for coming back! And you got me a coffee? Wow, you’re so sweet!

Ok, so where I was going with this is that we now have, instead of the average, a higher threshold to compare against. That means that based on historical observations, we will only alert when the current window’s error count has gone over that higher threshold (like 95th percentile). Why did I pick 95th percentile? Let me tell you about something I call The Lasso Approach.

The Lasso Approach

What I call the Lasso Approach is a triage strategy I created for getting a perimeter around an error, or as I think of it, getting a lasso around your problem.

For an especially spammy system, set some high thresholds. Something like the 95th percentile for that given error. By adjusting your searches to this, you will only be alerted roughly five percent of the time. That means you will only be alerted to the most flagrant issues.

Fix those bad actor bugs causing all those nasty alerts. Eventually the 95th percentile is too generous and you’ll lower it to the 75th percentile (or third quartile, if I’m getting my math concepts right). With the lower threshold will come more alerts. Fix those, rinse and repeat. Eventually you discover that you’ve triaged the system by cleaning up the most common errors first. Soon your machine data and email inbox will be readable because the spam of errors is gone!

BONUS Stage 6: IT Service Intelligence

I would be remiss if I didn’t mention the next step in this alerting story: IT Service Intelligence (or IT SI for short). IT SI takes alerting a step further in three ways (does that make this three steps further?):

  • Thresholds
  • Time Policies
  • Adaptive Thresholding

Let’s talk about them each real quick.

Thresholds add functionality so instead of the binar-ity (made up a word?) of alert or no alert, a given search (or KPI in this context) can be any number of statuses such as Healthy, Unhealthy, Sick, Critical, or Offline.

Time Policies build on the challenge earlier in this post about busy hours versus quiet hours. It makes perfect sense to have different thresholds for different times of year, day vs night, or peak vs non-peak. That’s what Time Policies give you.

Lastly, Adaptive Thresholding builds on what we did with the percentile work. Within IT SI any KPI can be configured with static thresholds or thresholds that are indicative of the behavior of that data set over time. That means your thresholds adjust as your business adjusts – all in an easy to use UI.

Here’s some documentation on all those features in case you want to see what it looks like. ITSI has so much more in it, but this blog entry has already gone on long enough. http://docs.splunk.com/Documentation/ITSI/latest/Configure/HowtocreateKPIsearches#Create_a_time_policy_with_adaptive_thresholding

Stage 7: Actionable Alerts

This isn’t so much a stage like the prior ones; it’s merely a call out for you to enjoy the fact that your Splunk system now only alerts when something truly needs attention, i.e. Actionable Alerts.

Congratulations and get ready for your promotion.

 

Top Technical Questions on Splunk UBA


With the acquisition of Caspida (now Splunk UBA) in July of 2015, we have been talking to many customers regarding user and entity behavioral analytics. Our customers have been asking questions about how this type of threat detection product works, and in this blog, I’m going to discuss some of the most common questions, along with answers and/or explanations from a security researcher and practitioner’s viewpoint.

 

What makes Splunk UBA unique compared to other detection technologies?

Splunk UBA uses an unsupervised machine-learning based approach to determine whether events generated from multiple data sources are anomalies and/or threats. This is a turnkey approach that does not require customers to train the models, and does not require administrators to develop signatures in advance, in order to detect a threat.

 

What are the common use cases for Splunk UBA?

Splunk UBA can be deployed in any network in order to detect the insider threat, malware, data exfiltration, and other types of activity across the kill chain that would be indicative of malicious behavior.  In general, Security Operations Centers (SOCs) continue to handle more data sources as the enterprise grows and matures, but the team is not necessarily able to staff more analysts at the same growth rate. Most SOC environments have not truly shifted to utilizing machine-learning based approaches today, but they need to consider this type of approach, in order to scale their operations and increase their efficacy.

In Figure 1 below, I’ve illustrated a typical maturity strategy seen in security operations environments today. Initially, an organization that is less mature may rely more heavily on signature-based solutions to catch “low-hanging fruit”. As their security operations capability improves, the team starts deploying more layered defenses that utilize heuristic-based detection methods, in order to cope with the increase in monitored data sources.

Eventually, as security operations teams successfully defend against traditional threats, attackers start employing custom, tailored techniques that become much more difficult to detect.  This becomes difficult for security teams to reconcile, as they also have to juggle more data sources to monitor and alert against, as the enterprise modernizes and potentially outsources their services to external SaaS providers.  To cope, teams look to automate more of their analytic capabilities by leveraging security solutions that can detect threats across disparate log sources at scale, without requiring analysts to manually parse and interpret all of these different log formats.

 


Figure 1. Detection approaches for maturing a security operations capability over time.

 

How does this approach work with the “low and slow” types of attacks?

First of all, let’s define what is meant by a “low and slow attack”. This class of attacks typically involves what appears to be legitimate traffic, transmitting at very slow rates. The low volume of traffic, coupled with slow rates of transmission, improves an attacker’s chance of evading traditional detection methods. Splunk UBA is capable of storing and correlating against event data for very long periods of time, which means this type of attack will be less likely to succeed.

 

How do user identities figure into Splunk UBA?

Unfortunately, Active Directory (AD) logs alone do not work very well for user identity resolution. There are currently ways to map the same user with different User IDs together via a common source (e.g., source IP address). A file referred to as an HR file, or other identity and access management (IAM) logs, are considered foundational to the solution. These logs and files will hold values like the user name, email address, SID, and NT Username. Centrify Express is an example of an identity management solution that can be used to help facilitate multi-domain controller environments.

 

What are “peer groups”, and why are they important?

One use case Splunk UBA addresses is insider threat detection. There are many published papers and metrics that discuss various methodologies to detect the insider threat. One model for insider threat detection makes use of “peer groups” to identify anomalies, and possible threats in the network. Peer groups can be used to categorize users by behavior, both expected, and actual. For example, if Bob and Alice are both in the Finance department, and have similar roles, we would expect that they might have similar server access profiles. Both Bob and Alice would have a legitimate reason to access the firm’s financial databases. Conversely, neither of them would have a business need to access source code repository servers within the Engineering department. This is peer group representation.

There are other ways to group users by actual behavior. For example, Bob and Mallory are not in the same department. Mallory is in the Engineering department. However, Bob and Mallory have similar work habits. They have the same work hours, and they both come into work, and after they grab coffee, they log into their laptops, and open their browsers and navigate to CNN.com. Therefore, Bob and Mallory might be in a separate peer group than Bob and Alice; however, Bob is a member of both groups. By defining multiple overlapping peer groups per user, anomalies that may not be detected in one peer group may appear in other peer groups.

Although this blog post addresses the most common questions related to Splunk UBA, there are, of course, other questions that are not covered in this post. If you have additional questions, please reach out to ubainfo@splunk.com

Splunk and Cacti


Several options exist to bring SNMP into Splunk, with such examples as our SNMP Modular Input.  But what if you already have a SNMP collection built with Cacti?  You could consolidate, rebuild and reconfigure all the collection… but the easier option would be to take Cacti, and feed it into Splunk.  This is a great example of leveraging one tool to collect the data, but bringing all the information together into a single platform for analytics.

Cacti Mirage is a new plugin released to the Cacti community, which simply grabs the updates from the Cacti poller prior to writing the RRD files and mirrors a copy out for Splunk to collect using the Splunk Universal Forwarder. You can find a more detailed tutorial and review the Cacti plugin.

ldi="11" t="1454100361" rrdn="traffic_in" rrdv="1820067239"

The simplicity of taking the SNMP polling results and displaying them as key-value pairs allows for the automatic extraction of these fields within Splunk.  The only gap left is the link between the local data ID (ldi) and what it actually means.

Part of the Cacti Mirage Add-On for Splunk (soon to be on splunkbase), includes a script to extract out of the Cacti database the meaning of the local data ID: the host, data source, data type and other information.  Splunk receives this, and generates the lookup on the search head to automatically expand the ldi field into the useful pieces of information.  You can find more details and contribute to the add-on on GitHub.

Automatic Lookups

With that said, you can now take all the collected data in Cacti, build the dashboards in Splunk using all the capabilities which exist there, and even take it to the next level correlating it to other data sources.
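As a rough illustration of what that looks like in practice, a search against the mirrored data might resemble the sketch below. The sourcetype, the cacti_ldi lookup name, and its output fields are assumptions standing in for whatever the add-on actually generates, and remember that rrdv for a traffic data source is a raw counter rather than a rate:

sourcetype=cacti:mirage rrdn=traffic_in
| lookup cacti_ldi ldi OUTPUT host data_source
| timechart span=5m max(rrdv) AS traffic_in_counter by host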

Cacti Mirage Add-On for Splunk

Splunk and the art of refrigerator maintenance.


Over the Australia Day long weekend here in sunny Brisbane, Queensland, a buddy of mine and I started noticing that his fridge didn’t seem very cold – meaning that the beer was not cold, clearly a drastic problem. No matter how far down we turned the thermostat, the fridge just wouldn’t cool down. He wasn’t sure if he was imagining it, or if it had always been that way. My buddy didn’t really want to go out and buy a new fridge and wanted to try and fix it himself; however, he had no idea if any of the changes we’d made to the fridge were making it better or worse.

My buddy works for a Splunk partner and IoT company here in Brisbane, RIoT Solutions (www.riotsolutions.com.au), and so had a spare Raspberry Pi laying around. We figured, why not hook up a temperature sensor from Adafruit.com and log the data to find out? As it can take quite a while for a fridge’s temperature to plateau, trawling through the log files just wasn’t practical, and we wanted something that could tell us through trial and error if the changes we had made to the fridge were successful. So, we installed a small instance of Splunk on an Ubuntu VM in his home lab (despite me telling him to run it in AWS!).


Splunk, as usual, took less than 10 minutes from install to ingesting the data to generating a visualisation with real-time temperature data. Now we could see the fridge was averaging around 12 degrees Celsius, when a fridge should ideally be between 0 and 4.

Now we could start the trial and error to fix the fridge. First Kris tried kicking it, but this didn’t do much. Then he tried turning the thermostat temperature knob to a few different positions and monitoring each position, finding one which lowered the temperature to around 9 degrees. However, this still wasn’t low enough, and he was noticing the thermostat knob was a little clunky to turn. So we decided to break the fridge.

As soon as we pulled the thermostat out we could see that it was frozen into a solid block. Kris left the fridge running overnight, without the thermostat, so it could defrost, and when he logged into Splunk the next morning, he could see that the temperature had indeed dropped….to -2 degrees! On opening the fridge he found everything was nice and frosty, especially the beer! However, despite there being “a steak in every beer”, he knew his real food would eventually freeze solid at that temperature.

So, the defrosted thermostat went back into the fridge, and we went back to Splunk to find the best position for the temperature knob. Now that the thermostat was working properly, any changes we made to the temperature were reflected quickly in Splunk, so we used a dashboard with a real-time temperature value and watched the temperature go up and down, finally getting it within the 0 – 4 degree range.
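The searches behind that dashboard are about as simple as Splunk gets. A sketch along these lines (the index, sourcetype, and temperature field name are assumptions based on how the Pi script writes its readings) is enough for both the real-time panel and an alert:

index=iot sourcetype=fridge_temp
| timechart span=5m avg(temperature) AS avg_temp_c
| eval status = case(avg_temp_c > 4, "too warm", avg_temp_c < 0, "too cold", true(), "just right")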


Thanks Splunk, for helping my mate serve me cold beer, and save himself from having to buy a new fridge!

“Now Splunk and my Raspberry Pi are off to monitor my fish tank temperature as my fish have been going cross eyed lately.”

For details on how to build the Pi – https://learn.adafruit.com/adafruits-raspberry-pi-lesson-11-ds18b20-temperature-sensing/overview

Smart AnSwerS #51


Hey there community and welcome to the 51st installment of Smart AnSwerS.

Super Bowl 50 is making its way to the SF Bay Area next week, and traffic around HQ has been getting noticeably worse with Super Bowl City just a mile away. What does that mean? MOAR TRAFFIC and longer commute times ;( Luckily piebob, out of the kindness of her heart, gave the community team the OK to work from home amidst the sportsball madness. Such boss! So wow! Much thanks!

Important note: this week’s SFBA Splunk User Group meeting has been postponed to next week, Feb 10th, to avoid Super Bowl traffic as well!

Check out this week’s featured Splunk Answers posts:

How to create and trigger an alert when replication/search factors are not met on the indexer cluster master?

rcreddy06 had issues with fluctuating statuses for replication and search factors in an indexer cluster and wanted to set up an alert. Lucas K pointed out that the indexer clustering management console on the master node has searches that display whether or not these factors are met. He explained how to obtain them by adding “showsource” to the end of the URL or clicking on the magnifying glass to see other searches that power relevant results in the dashboard. Lucas K saved rcreddy06 the time of reinventing the wheel by doing a little digging and setting the alert conditions as needed.
https://answers.splunk.com/answers/329356/how-to-create-and-trigger-an-alert-when-replicatio.html

Is there a programmatic method to list and analyze which objects/resources (indexes, macros, lookups) are used by scheduled searches?

Olli1919 wanted to identify and list which scheduled searches relied on certain lookups and macros. The idea was to prevent these searches from breaking if or when any changes are made to these knowledge objects. Olli1919 actually came back to answer the question with a search to check which scheduled searches depend on which lookups. woodcock also shared two apps by his fellow SplunkTrust peers that could help take a deep dive into the efficiency and health of your deployment: Knowledge Object Explorer by martin_mueller and Data Curator by Runals.
https://answers.splunk.com/answers/329645/is-there-a-programmatic-method-to-list-and-analyze.html

Does the multisearch command have a limit like subsearch?

Masa was curious to know if there was any limit for each search clause in the multisearch command like subsearch. cpride confirmed that the same type of limits do not apply to multisearch since subsearches run during the parsing phase of a search and have to finish and return results before the parse phase completes. Multisearch, on the other hand, is a generating command, and its main limitation is the searches must be entirely distributable.
https://answers.splunk.com/answers/326321/does-the-multisearch-command-have-a-limit-like-sub.html

Thanks for reading!

Missed out on the first fifty Smart AnSwerS blog posts? Check ‘em out here!
http://blogs.splunk.com/author/ppablo

Splunk Stream on a Raspberry Pi? YES!


As a network geek, I’ve always wanted to leverage sniffers and deep packet inspection programs to understand user experience and to secure networks. I have a home lab with many virtual machines. But let’s be honest, I really want to know what my household is doing on the Internet! I needed something light-weight, NOT an appliance as large as a data center!

Network Sniffers aren’t anything new. In fact, they’re old school. But, who would have thought a Raspberry Pi would be powerful enough to act as a real-time 24×7 sniffer? I embarked on this journey recently with the Splunk Stream App. And I must say, I’m pretty impressed.

Splunk Stream captures real-time streaming wire data and performs packet analysis at layer 4 (TCP, UDP) as well as many layer 7 applications (HTTP, DNS, etc.) of the OSI model. This enables reporting on things like application response times, application decoding (i.e. what web pages are accessed), as well as detecting unauthorized users.

Splunk Stream is supported on many operating systems, but mostly Intel x86 based computers and architectures. At Splunk, as part of a side project, one of our developers ported the Stream binaries to ARM architecture. PERFECT! The Raspberry Pi is ARM based! On this platform, Splunk Stream is capable of TCP and UDP decoding (layer 4), as well as HTTP decoding (layer 7). Unfortunately, it does not support the breadth of other applications as with the Intel x86 versions. However, this was enough for my project. Heck, you can probably install Stream on a rooted Android device, as many are ARM based. I don’t know how useful that would be, but it’s quite entertaining! Note that the ARM version of Splunk Stream is not a GA product, so it’s currently unavailable to customers.

What I used:

Raspberry Pi. (I used the Pi 2 Model B):
https://www.raspberrypi.org/products/raspberry-pi-2-model-b/

USB Wifi Network Adapter (acts as the management interface). Here’s the one I got from Amazon:
http://amzn.com/B003MTTJOY

SharkTap:
http://amzn.com/B00DY77HHK

Splunk app for Stream:
https://splunkbase.splunk.com/app/1809/
Part of the purpose of this blog is to showcase how the Stream app is light-weight and can run on a Raspberry Pi! You can install Splunk Stream on a PC or Linux server in lieu of the Pi, and you have the benefit of full functionality, whereas Stream on Pi only summarizes TCP, UDP, and HTTP. The Stream app ported to ARM is currently unavailable to customers, but might be in the future…

Splunk Forwarder for Linux ARM:
https://splunkbase.splunk.com/app/1611/
OR
Splunk Forwarder for most OS’s (to deploy on a PC, Linux, Unix machine):
http://www.splunk.com/en_us/download/universal-forwarder.html

Setting it All Up:

To capture all incoming and outgoing traffic from my network to the Internet, I placed the Sharktap between my service provider (Comcast) and my router (an Asus AC88U). Connected to the mirror port is the Raspberry Pi which has the ‘Splunk Forwarder for ARM processor’ and ‘Stream app ported to ARM’ installed.

Hardware Configuration Diagram:

 

Collecting Data Diagram:

 

Command line of the Raspberry Pi with everything installed & running:

 

The Splunk software installation is just like any other Splunk install. If you need help, check the docs at http://apps.splunk.com.

The Results:

Now that I have a Splunk forwarder and Splunk Stream running on the Pi, let’s see what shows up in the Splunk interface. From the below screenshots, Stream is seeing TCP traffic and decoding HTTP traffic. Using the search index=* host=rpi2 sourcetype=stream*, we can see there are stream:tcp and stream:http events, as expected.

 

We also see various others including stream:Splunk_HTTPResponseTime and stream:Splunk_HTTPStatus. These are out of the box sourcetypes that make it easier to track URL response times and HTTP statuses without requiring a lot of Splunk core licensing. They can additionally be aggregated into predefined intervals to further reduce licensing volumes. For example, you can get one record every minute summarizing all response times for a particular URL, or one record summarizing the HTTP status codes by URL. Below is the screenshot for configuring this within the Stream app:

 

I also have Splunk Stream instrumented on the Splunk server, so it has visibility into all local access, as well as broadcast requests. This is evident in some of the out of box dashboards below:

 

Why can’t I see security camera video from my phone?

So, now that the Raspberry Pi has been running for a few days and reliably performing deep packet inspection, it’s time to put this data to use and solve some problems. I have a Lorex security camera system on my premises. Through the Lorex Stratus NetHD mobile app, I can see live video streams on my phone and tablet from anywhere in the world! However, lately, it hasn’t been working. When registering the camera system in the app, it returns ‘Error Code 43’. This is a perfect use case for Splunk Stream to understand what is communicated between the app and the security system DVR.

While working from the office (aka Starbucks), I attempt several times to register my security system in the mobile app and from my Apple computer. Next, I access the Splunk interface, and filter on the access ports. I run the following search:

index=* host=rpi2 sourcetype=stream*
(dest_port=xxxx OR src_port=xxxx)
| table src_ip, src_content, dest_ip, dest_content

(ports masked for privacy).

In the table, I can see my 2 source ip addresses hitting the Comcast router. The data sent from the device is encrypted, which is reassuring. But, the contents coming back from the Lorex DVR indicate ‘Page Not Found’. As 2 devices are reporting the same error, I shared these results with Lorex support. The problem turned out to be a recent firmware update with a defect.

 

I am excited to try other use cases with the Splunk Stream app. I may finally figure out why my Netflix movies sometimes clip. For those of you with kids, you may use this to figure out what your kids are actually doing when they say they’re working on homework. Regardless, post a comment and let us know how you’re using this setup or what you found!

 

Happy Splunking!

 

 


SSO without an Active Directory or LDAP provider



(Hi all–welcome to the latest installment in the series of technical blog posts from members of the SplunkTrust, our Community MVP program. We’re very proud to have such a fantastic group of community MVPs, and are excited to see what you’ll do with what you learn from them over the coming months and years.
–rachel perkins, Sr. Director, Splunk Community)


Hello everyone!

I am Michael Uschmann, one of the members of the SplunkTrust.

Lately I was annoyed by the fact that I had to enter my login on my Splunk DEV VM after a meeting or break. So, I thought ‘Why not set up SSO on this Splunk instance so I don’t have to enter my password again?’ But there was this small problem: I don’t have an AD or LDAP server running on my VM – bummer.

Docs and Apache to the rescue!

The docs provide an excellent overview of how SSO works and what to consider:

http://docs.splunk.com/Documentation/Splunk/6.3.2/Security/HowSplunkSSOworks#How_Splunk_processes_the_proxy_request

But the most important statement from the docs is this:

If the IP is trusted, then splunkd uses the information contained in the request header and conducts the authorisation process.

This is the key to getting SSO to function. So, how do I configure this?

The .conf files needed

I have to change the server.conf to include the following option:

trustedIP = <IP address>
* All logins from this IP address are trusted, meaning password is no longer required
* Only set this if you are using Single Sign On (SSO)

So, my server.conf includes this:

[general]
trustedIP = 127.0.0.1

The next config file I have to change is web.conf:

SSOMode = [permissive | strict]
* Allows SSO to behave in either permissive or strict mode.
* Permissive: Requests to Splunk Web that originate from an untrusted IP address are redirected to a login page where they can log into Splunk without using SSO.
* Strict: All requests to splunkweb will be restricted to those originating from a trusted IP except those to endpoints not requiring authentication.
Defaults to "strict"

trustedIP = <ip_address>
* Trusted IP. This is the IP address of the authenticating proxy.
* Splunkweb verifies it is receiving data from the proxy host for all SSO requests.
* Uncomment and set to a valid IP address to enable SSO.
* Disabled by default. Normal value is '127.0.0.1'
* If appServerPorts is set to a non-zero value, this setting can accept a richer set of configurations, using the same format as the "acceptFrom" setting.

tools.proxy.on = [True | False]
* Used for running Apache as a proxy for Splunk UI, typically for SSO configuration. See http://tools.cherrypy.org/wiki/BehindApache for more information.
* For Apache 1.x proxies only. Set this attribute to "true". This configuration instructs CherryPy (the Splunk Web HTTP server) to look for an incoming X-Forwarded-Host header and to use the value of that header to construct canonical redirect URLs that include the proper host name. For more information, refer to the CherryPy documentation on running behind an Apache proxy. This setting is only necessary for Apache 1.1 proxies. For all other proxies, the setting must be "false", which is the default.
Defaults to False

remoteUser = <http_header_string>
* Remote user HTTP header sent by the authenticating proxy server.
* This header should be set to the authenticated user.
* Defaults to 'REMOTE_USER'.
* Caution: There is a potential security concern regarding Splunk's treatment of HTTP headers.
* Your proxy provides the selected username as an HTTP header as specified above.
* If the browser or other http agent were to specify the value of this header, probably any proxy would overwrite it, or in the case that the username cannot be determined, refuse to pass along the request or set it blank.
* However, Splunk (cherrypy) will normalize headers containing the dash, and the underscore to the same value. For example USER-NAME and USER_NAME will be treated as the same in SplunkWeb.
* This means that if the browser provides REMOTE-USER and splunk accepts REMOTE_USER, theoretically the browser could dictate the username.
* In practice, however, in all our testing, the proxy adds its headers last, which causes them to take precedence, making the problem moot.
* See also the 'remoteUserMatchExact' setting which can enforce more exact header matching when running with appServerPorts enabled.

My web.conf includes this:

[settings]
SSOMode = permissive
trustedIP = 127.0.0.1,192.168.56.1,192.168.56.101
remoteUser = REMOTE_USER
tools.proxy.on = false

Why did I use those options? Let me explain a bit:

SSOMode = permissive

I chose this setting because sometimes other members of our team will log in to my Splunk instance and I don’t want them to use my SSO settings.

trustedIP = 127.0.0.1,192.168.56.1,192.168.56.101

I chose this setting because my VM has multiple interfaces and Splunk is listening on both IPs.

remoteUser = REMOTE_USER

The default setting for this option.

tools.proxy.on = false

Even though the docs suggest setting it to true, I found that it will only work if you set it to false – but only for this special use case! Stick to the docs if using SSO with AD or LDAP.

Last but not least, I need an Apache server and I need to configure it as well. Since this Apache will be used solely for reverse proxying to Splunk, I used a Location config like this:

<Location /> 
ProxyPass http://127.0.0.1:8000/
ProxyPassReverse http://127.0.0.1:8000/
Header add REMOTE_USER "admin"
Header add Accept-Language "en-GB"
RequestHeader set REMOTE_USER "admin"
RequestHeader set Accept-Language "en-GB"
</Location>

As you can see, I set the REMOTE_USER to be the Splunk user admin, so I will always be authenticated as the Splunk user admin on my Splunk instance. You can also see a little trick to set a default language other than en-US 😉

So, where else can this be used?

For example, it can be used for NOC or ITOC wallboards, where you configure Apache on a different port for each wallboard and use a dedicated user for each board, which then loads a default dashboard in Splunk.

In the end it is up to you where else you can use it … just keep a common sense of security in mind, like securing the Apache port with access restriction.

cheers!

Smart AnSwerS #52


Hey there community and welcome to the 52nd installment of Smart AnSwerS.

A BoardAtWork group was started at Splunk HQ for folks interested in, well, playing board games at work during lunch or after hours. We had our first game night earlier this week and had a nerdy great time…even though I was the first one dead 😛 Just glad to unwind and share my love for games with fellow Splunkers after a long day!

Check out this week’s featured Splunk Answers posts:

Why is the Host IP value from udp:514 syslog input incorrect for one device?

evgenyv was collecting syslog events through a udp:514 input and needed help figuring out why only one device was reporting a host value of “2015”. nnmiller gives a very detailed and educational answer, explaining how events configured as the syslog sourcetype are parsed by Splunk, and pinpointing the issue was most likely on the device side with how its data was formatted. She gave two options to fix the issue immediately, but also recommends using a central syslog server rather than UDP/TCP and shares the widely referenced blog post by starcher on best practices collecting syslog data in Splunk.
https://answers.splunk.com/answers/315248/why-is-the-host-ip-value-from-udp514-syslog-input.html

How to hide panels with no results from a dashboard?

bclarke5765 had a dynamic drop-down with values on a dashboard and wanted to hide all panels that didn’t produce any results based on the selected value. splunkian provided one solution using JavaScript in Splunk 6.2, and after the release of Splunk 6.3, proylea answered with a working dashboard example using only Simple XML. Options are always good to have.
https://answers.splunk.com/answers/218623/how-to-hide-panels-with-no-results-from-a-dashboar.html

How to run a different rex extraction only if another rex extraction did not find anything to extract?

raby1996 had a working rex extraction, but found that the field for that pattern was not always present in the data. raby needed a way to run a different rex statement when the first one doesn’t match anything. somesoni2 suggested providing sample logs for both patterns as there possibly could have been a way to capture both in one rex expression. It’s also best practice to include sample data when asking for help with regex related questions as everyone’s data will be formatted differently. Regardless, somesoni2 still worked with what he had and provided a workaround using eval with the coalesce function.
https://answers.splunk.com/answers/314070/how-to-run-a-different-rex-extraction-only-if-anot.html
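For readers who land here with the same problem, the general shape of that coalesce-based fallback looks something like this (the index, rex patterns, and field names are purely illustrative):

index=app_logs
| rex field=_raw "user=(?<user_a>\S+)"
| rex field=_raw "username:\s+(?<user_b>\S+)"
| eval user = coalesce(user_a, user_b)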

Thanks for reading!

Missed out on the first fifty-one Smart AnSwerS blog posts? Check ‘em out here!
http://blogs.splunk.com/author/ppablo

Smart AnSwerS #53


Hey there community and welcome to the 53rd installment of Smart AnSwerS.

With Super Bowl 50 madness phasing out this week, our rescheduled San Francisco Bay Area User Group meeting is a go for tonight at Splunk HQ! Splunker Erik Cambra will be giving a talk on how Splunk splunks…(drum roll)…Splunk! If you happen to be in the area, come on by! If you can’t grace us with your presence because you’re miles away, then be sure to check out the Splunk User Groups site to find an upcoming meeting near you :)

Check out this week’s featured Splunk Answers posts:

Why am I getting inconsistent event counts when using wildcard characters to match event field values?

splunkIT was getting different counts using wildcards to search an extracted field value and wanted to know if this was a limitation or a bug. woodcock shared a Splunk blog that covered a solution for this by using INDEXED_VALUE = false,  but with the caution that this could affect search performance. cpride came in to give a very informative overview on how strings of raw data are indexed using values configured in segmenters.conf and demonstrated why this affects results using wildcards placed in different parts of the searched value.
https://answers.splunk.com/answers/326291/why-am-i-getting-inconsistent-event-counts-when-us.html

What is the easiest way to send an alert when another alert’s trigger condition has cleared?

This topic has come up on Answers several times, so this helpful question and answer by jwelsh serves as a good reference for users searching high and low. Learn how to use the _internal index to find the last time your desired alert fired to prevent overlapping triggered alerts.
https://answers.splunk.com/answers/326872/what-is-the-easiest-way-to-send-an-alert-when-anot.html

How do I sum the counts of all the similar values in a field to show as a single item?

praneethkodali had a search that was producing a list of values and counts for a field, but needed to edit the search to sum the counts of similar values in the list. With the powers of regular expressions and eval combined, aljohnson (with some mutual help from praneethkodali) shows how to match the variations into a single uniform value to get the desired result.
https://answers.splunk.com/answers/327096/how-do-i-sum-the-counts-of-all-the-similar-values.html

Thanks for reading!

Missed out on the first fifty-two Smart AnSwerS blog posts? Check ‘em out here!
http://blogs.splunk.com/author/ppablo

What’s next? Next-level Splunk sysadmin tasks, part 1



(Hi all–welcome to the latest installment in the series of technical blog posts from members of the SplunkTrust, our Community MVP program. We’re very proud to have such a fantastic group of community MVPs, and are excited to see what you’ll do with what you learn from them over the coming months and years.
–rachel perkins, Sr. Director, Splunk Community)


 

Hi, I’m Mark Runals, Lead Security Engineer at The Ohio State University, and member of the SplunkTrust.

While deployed to Bosnia years ago I latched onto something I heard in a briefing once: When loosely describing when particular roadmap type things would take place, the person speaking said there were things that were going to be done Now, Next, and After Next. That fit the way I think to a tee.

In this three part series I’m going to talk about a few things Splunk administrators should do after data starts coming in. In other words a few ‘Next’ activities. These three things are:

  1. Making sure the host field really contains the name of the server
  2. Making sure the local time of the server is set to the correct time
  3. Evaluating the inbound data for indexing latency

While you can get data in a variety of ways, what I’m really focusing on is data coming in from Splunk forwarders installed on servers. (Syslog-based data comes with its own set of fun challenges local to your environment and is beyond the scope of this posting.)

I put checking host field values first because this is the start of making sure the data in your Splunk instance accurately reflects your environment. It’s a data integrity thing, really. At any rate, the most frequent situation I’ve come across where a Splunk forwarder is ingesting the data and the value in the host field isn’t correct is one where a virtual server has been built, a Splunk forwarder is installed and turned on, and then the image is copied multiple times. This is an issue since the Splunk forwarder only checks the local system once to get and set the host value.

With all of that as a backdrop, let’s tackle possible solutions–or at least the solution we’ve come to use:

Windows

With Windows systems, we can leverage the ComputerName field which is on a number of events like 4624. We want to make sure we can count on the data being there, though, and at a cadence that we can control. To achieve this, we turned to wmic and are using the following script: 

@echo off
wmic /node:"%COMPUTERNAME%" os get bootdevice, caption, csname, description, installdate, lastbootuptime, localdatetime, organization, registereduser, serialnumber, servicepackmajorversion, status, systemdrive, version /format:list

Getting data via wmic is great and easy! We are bringing in more fields than are needed but the data is valuable in its own right so might as well. Since in our case we are bringing this in once a day and it is small, we aren’t worried about any license impacts.

The /format:list part is nice as the data will come out in Splunk friendly field = value format. Drop that in a bat file and use a script statement like

[script://.\bin\wmic_os.bat]
disabled = 0
## Run once per day
interval = 86400
sourcetype = Windows:OS
source = wmic_os

Linux

For the Linux portion of this effort we modified the script that generates the Unix:Version data (version.sh script) that comes with the Linux TA. The script uses just about all of the uname switches except -n. We simply added uname -n (an easy modification) and called the field ‘hostname’.

Bringing the data together

The first portion of your query will bring the data together and normalize the key fields. You could (and probably should) adjust either the data generation components or knowledge objects so that the data and query conform more to the CIM – but there is no telling whether you are using the CIM or not, so I will show you this method =)

At any rate that query might look like this:

sourcetype=windows:os OR sourcetype=unix:version
| eval host_name = lower(coalesce(CSName, hostname))
| where isnotnull(host_name)
| eval host = lower(host)
| eval host_matches = if(match(host,host_name), "true", "false")
| where host_matches = "false"
| rex field=host "(?<first_name1>[^\.]+)"
| rex field=host_name "(?<first_name2>[^\.]+)"
| where first_name1!=first_name2
| eval os_type = case(isnotnull(CSName), "Windows", isnotnull(hostname), "Linux", 1=1, "fixme")
| table index host host_name os_type
| rename host AS "Reporting in Splunk As" host_name AS "OS Logged Host As" os_type AS "Server Type"

Once we’ve brought both sourcetypes together, the rex commands allow us to compare strings in cases where the data from one or the other field is in a fully qualified form. After that, we create a field to show whether the system is Windows- or Linux-based. Depending on the environment you are in, this can help shape the conversation with any teams you will have to reach out to.

With this data in hand, it's now just a matter of reviewing the results and talking to whoever can make the change on the forwarder. There are several ways to make the change; we generally just ask the server admin to adjust the host line in $SPLUNK_HOME/etc/system/local/inputs.conf
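
For reference, on most forwarders that host value lives in a stanza that looks something like this (the hostname shown is hypothetical), and the forwarder needs a restart after the change:

# $SPLUNK_HOME/etc/system/local/inputs.conf on the affected forwarder
[default]
host = web-prod-01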

Are you doing anything similar? If so, let's hear about it in the comments so we can collect several options for other Splunk admins out there.

See you next week for Part 2.

Splunk Add-on > Where’s That Command – Converting a Field’s Hexadecimal Value to Binary


When you look through Splunk's Search Reference Manual, you'll find a ton of search commands with their syntax, descriptions, and examples. After all, if Splunk is the platform for machine data, there needs to be an extensive list of commands, functions, and references to guide Splunkers through the Search Processing Language (SPL). So one would think we had everything covered, right? Well, almost….

I have a couple of great customers from the Houston, Texas area to thank for this. Gabe and Andrew (you know who you are) are not only strong Splunkers, but also regulars at the Splunk Houston User Group (SHUG) meetings, always looking for ways to expand their use of Splunk and to get others just as passionate and excited about it as they are! In two separate instances they brought me the same simple question: where's that command that converts the hexadecimal values in this field to a binary number?

As I started digging into the Search Reference Manual and across our www.splunk.com website, I quickly found what many had already discovered on answers.splunk.com – there is not a command that does this! DOH! Various people had ideas for building searches that used eval functions, or even the replace command (something I blogged about before here), but ultimately, there was no single SPL command. While it's cool to have massive, multi-line search strings in your Splunk search bar, it's not very efficient or a good use of time compared with a single command call.
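
To be fair, eval can get you partway there: the tonumber() function happily converts a hex string to decimal, as in the sketch below. What's missing is a single function that emits a binary string, which is where the multi-line eval gymnastics came in.

* | eval hex_num="BC55" | eval dec_num=tonumber(hex_num, 16)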

The first time I attempted to help with this, it was an energy-industry customer with an IT security use case. The second time, it was a retail/point-of-sale analytics use case. Regardless of the use case, what I quickly realized is that we needed something that makes converting hexadecimal field values to binary as simple as flipping a switch…. or installing a Splunk Add-on.

Enter the Splunk Add-on – Hexadecimal to Binary Add-on (Hex2Binary Add-on)!

This is a fairly simple add-on which leverages the power of Splunk's search macros. You download the add-on and then use the "Manage Apps" page to install it from file, or use the new Browse More Apps feature in Splunk 6.3.x to find and install it directly:

[Screenshot: installing the add-on via Manage Apps]

Once installed, the add-on's sharing permission is set to Global, which means any of your apps in Splunk should be able to leverage it.

[Screenshot: the add-on's sharing permission set to Global]

For documentation, please refer to the README.txt file in the “…etc/apps/SA_Hex2Binary/” directory:

 

[Screenshot: README.txt in the add-on directory]

To use the "hex2binary()" macro, you call it with the standard SPL syntax for Splunk macros, passing the field that contains the hexadecimal values you wish to convert to binary. As a simple test (since I was not able to use any of my Splunk customers' data), I will create a field and give it one hexadecimal value:

* | eval hex_num="BC55"

 

[Screenshot: search results showing the hex_num field]

Now that we have a field with a hexadecimal value, I can pass that field to the “hex2binary()” Splunk macro, where the binary conversion is placed into a field named “binary”:

* | eval hex_num="BC55" | `hex2binary(hex_num)`

 

[Screenshot: search results showing the converted binary field]

That is a LOT easier than having to write eval and loop statements into your search!

Enjoy the new add-on and should there be any questions or requests for enhancement/upgrades, please let me know!

Happy Splunking!

PD2

What’s next? Next-level Splunk sysadmin tasks, part 2


(Hi all, welcome to the latest installment in the series of technical blog posts from members of the SplunkTrust, our Community MVP program. We're very proud to have such a fantastic group of community MVPs, and are excited to see what you'll do with what you learn from them over the coming months and years.
–rachel perkins, Sr. Director, Splunk Community)


This is part 2 of a series. Find part 1 here: http://blogs.splunk.com/2016/02/11/whats-next-next-level-splunk-sysadmin-tasks-part-1/

Hi, I’m Mark Runals, Lead Security Engineer at The Ohio State University, and member of the SplunkTrust.

In this brief series of posts we are covering 3 things Splunk admins should do soon after getting data into Splunk. In Part 1 we talked a bit about making sure values in the host field are correct. This time we are going to talk about making sure the local time on servers is set correctly. 

One of the great things about Splunk is that as data comes in, Splunk will look for timestamps and automagically place events chronologically. What if the local time on the server generating the logs was 15 minutes slow or fast, though? Remember that if you set the timerange selector in Splunk to look for events from the last 60 minutes, you aren't getting back events that were generated in the last 60 minutes so much as events that Splunk understands to have been generated in the last 60 minutes, based on the timestamp in the events and other time-based Splunk settings. This is a nuanced but impactful difference. If logs are generated in UTC but Splunk doesn't know that, then when you search for events from the previous 60 minutes, the events you see could be hours old depending on where in the world you actually sit. This makes it tough to identify when the server/device generating the events is itself off, and gets into a larger discussion of looking for time issues across all your data (which we will get into in the next article). The case where the server's clock is off is nuanced enough to warrant its own discussion and should be resolved before we tackle larger time issues.

So how do we detect this clock skew in your data? Well, instead of waiting for events to be generated (i.e., you logging into a server while looking at a clock), we are going to have the system generate one for us. Generally speaking, it doesn't matter so much when the event is generated as that it IS generated. Why not just get the average delta by host? Well, what happens if the local clock is off and the server is generating data set to a couple of different time zones? Focusing on a single, known event minimizes that kind of noise.

Windows

Since one of the options for the WMIC OS call used in the last article was to show the local date and time (LocalDateTime), let's just reuse that. As a refresher, here is the info.

Batch script

@echo off
wmic /node:"%COMPUTERNAME%" os get bootdevice, caption, csname, description, installdate, lastbootuptime, localdatetime, organization, registereduser, serialnumber, servicepackmajorversion, status, systemdrive, version /format:list

inputs.conf
[script://.\bin\wmic_os.bat]
disabled = 0
## Run once per day
interval = 86400
sourcetype = Windows:OS
source = wmic_os

Linux

In the spirit of reuse, we might as well go ahead and use the Unix:Version sourcetype that comes with the Linux TA. Because of the format of Linux logs, we will be able to use Splunk's default time field, '_time'. Just make sure to enable the script in the TA's inputs.
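
In our deployment, enabling it looks roughly like this in the TA's local inputs.conf (stanza path and defaults may differ slightly between TA versions, so treat this as a sketch):

[script://./bin/version.sh]
disabled = 0
## Run once per day
interval = 86400
sourcetype = Unix:Version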

Bringing the data together

Like last time, we will want to bring the data together in one query to look for issues across both OSes. Some work will need to be done in extracting the time values from the Windows events, and in both cases to compare the event generation time to when the data was brought into Splunk. We will get more into this in the next article but this particular step means exposing the otherwise hidden _indextime field. This involves a simple eval statement like | eval index_time = _indextime. There will always be some delay between when the data was generated and when it comes into Splunk, but generally speaking that should just be a few seconds or so. Because of this delay you will want to ignore what will hopefully(!) be the large majority of these logs, though. The OSU team chooses to look at logs where this time differential is over 5 minutes. The query we use is:

sourcetype=windows:os OR sourcetype=unix:version  | eval index_time = _indextime | eval type = if(isnotnull(LocalDateTime), "Windows", "Linux") | eval local_date_time = if(isnotnull(LocalDateTime), strptime(LocalDateTime, "%Y%m%d%H%M%S.%6N"), _time) | eval delta_time = local_date_time - index_time | eval abs_delta_time = abs(delta_time) | where abs_delta_time > 300 | eval sec = floor(abs_delta_time%60) | eval min = floor((abs_delta_time%3600) / 60) | eval hrs = floor((abs_delta_time - abs_delta_time%3600) / 3600) | eval skew = tostring(hrs) + "h " + tostring(min) + "m " + tostring(sec) + "s" | eval direction = case(delta_time<0, "Behind", delta_time>0, "Ahead", 1=1, "???") | eval time_str = strftime(_time, "%T %F") | eval local_date_time = strftime(local_date_time, "%T %F") | table index host, type, skew, direction, local_date_time, time_str | sort index host | rename host AS "Host", type AS "Server Type", skew AS "Time Skew", direction AS "Time Skew Direction", local_date_time AS "Host System Time", time_str AS "Splunk Timestamp" 

If you have data coming in from systems in multiple time zones, you will need to account for those instances. However, the query above will hopefully give you a baseline to work from; adjust as needed. The most common causes I've seen for this issue are the time zone being set incorrectly or NTP not being enabled on the local server.

Special thanks to Alex Moorhead, a student worker in the OSU Splunk shop, who came up with the query above for us.

Smoking Hot Data…How We Splunked Barbecue


It's true. You can Splunk just about anything. As someone who is not incredibly technically inclined, I can find the power of Splunk difficult to wrap my head around. The best way I've found to understand it is to apply it to something you know and love. And with SplunkLive! coming up in some of the best barbecue cities across the US this year, my personal experimentation with Splunk happens to tie in nicely.

I’ve been happily married to a Texan and University of Texas at Austin graduate for nine years. Aside from being a top-notch husband, he has a passion and knack for cooking Texas-style barbecue and brewing beer. For now, let’s focus on the barbecue.

Technology, meet meat. Is there data in your beef? Can barbecue be optimized? Can data tell us if one brisket will be better than another? Can we Splunk an open flame? Why would anyone smoke a fish? So many questions. And there are answers!

The hallmark of Texas-style barbecue is “low and slow.” While there are many variables to ensuring your meat is maximized, including wood used to smoke the meat, the cut of meat, fat content of the meat, and more, maintaining a consistent temperature is the biggest factor in determining if your barbecue will be juicy and delicious or something that tastes like a leather boot.

A true Texas-style barbecue smoker doesn’t have an electronic temperature gauge that regulates the heat and flame to avoid fluctuation. Rather, an open flame and rudimentary adjustments to the air flow to the flame are what controls the heat. In 2016 – and as card carrying technology and data nerds – we thought there had to be at least a better way to understand the temperature of the various chambers in the smoker. Enter the Tappecue. Tappecue is a Wifi thermometer that allows you to monitor the temperature of four different areas of the meat and smoker. Through an app on the phone, you can set alerts when the temperature is too high or too low and adjust your flame accordingly (note the Tappecue does not make any adjustments to the temperature).

So we bought one. At the end of the first session, the Tappecue sent us a .csv file of the data readings from the session.

Enter Splunk.

What else was in that data? How much did the temperature fluctuate? Could we use Splunk and map this against the weather and wind and draw additional insight? Does being in a low-humidity climate like Denver impact the temperature and quality of the barbecue?

I put out the equivalent of the Splunk Ninja Bat Signal and put them on the task. Turns out, they were very willing to lend a hand, especially with the promise of free barbecue and cold beer. Special thanks to @StephenLuedtke for the dashboards and @RyanMoore for the Tappecue-to-Splunk real-time input! On the menu: beef brisket. Here's what brisket smoked for over nine hours looks like visualized in Splunk and mapped against the weather of the day via Weather Underground:

[Dashboard: nine-hour brisket smoke mapped against Weather Underground data]

I was able to see when the lid was opened and how the temperature was affected in the chamber and in the meat itself.

Let's take a look at another session and dashboard. Here I'm comparing two different historical smoking sessions at the same time: 7-hour beef ribs and a 4-hour smoked salmon!

[Dashboard: 7-hour beef ribs compared with 4-hour smoked salmon]

The Tappecue isn't just limited to a smoker, either. @StephenLuedtke decided to give it a go in the oven for some steelhead trout, and also in a turkey roaster on Thanksgiving. With the trout he even added his own notes through Splunk lookups, a way to enrich your data from other sources. This allowed him to capture extra details about each of his sessions to compare against next time.

And thanks to @RyanMoore, I can use Tappecue's API to send data directly to Splunk in real time. I'm thinking about creating an alert that tells my husband to stop opening the lid so much. Being in a dry climate like Denver, spritzing the meat to keep it moist is essential, but it also causes greater fluctuations in temperature and lets smoky flavor escape.

Below is what the live dashboard looks like (when Stephen was roasting a Turkey) updating the temperature every 30 seconds:

[Dashboard: live turkey roast on Thanksgiving, updating every 30 seconds]

There is still much more investigation we can do and more data sources to correlate, but this is a fun start. Splunk and Tappecue are helping us optimize our barbecue. Imagine what they could do together at scale for the pros, weekend pit masters and food enthusiasts alike. Better yet, what can Splunk do for you? Try it out yourself by downloading Splunk for free.


Smart AnSwerS #54


Hey there community and welcome to the 54th installment of Smart AnSwerS.

Next Tuesday, February 23rd, 2016, we’ll be having our SplunkTrust Virtual .conf session #4 from 12:00PM to 1:00PM PST. SplunkTrust member Mark Runals will be presenting his .conf2015 session “Taming your Data”, featuring the data onboarding maturity scoring model and dynamically having Splunk detect mis-categorized sourcetypes. Visit the event meetup page to RSVP and join the 35+ users and counting via Webex next week!

Check out this week’s featured Splunk Answers posts:

Is it recommended to install a universal forwarder on thousands of workstations or on a few dedicated syslog/Windows Event Collector servers?

flee needed to forward Windows events from about 6000 Windows workstations and was looking for advice on what deployment strategy would make the most sense for ongoing maintenance, especially having to manage universal forwarders using a deployment server. javiergn gives a pretty solid list of pros and cons to consider for going the route of installing and managing universal forwarders on each machine.
https://answers.splunk.com/answers/331926/is-it-recommended-to-install-a-universal-forwarder.html 

How to index certain logs only during a certain time range (6am – 6pm)?

agoktas had four log files on one host, but only wanted one of those files to be indexed between 6am and 6pm each day. Stopping the universal forwarder service during off hours was not an option because the other three log files needed to be ingested 24 hours a day. SplunkTrust members MuS and rich7717 worked together to come up with just the right configuration in props.conf and transforms.conf on the indexer to filter out all events for this particular file from 6pm to 6am.
https://answers.splunk.com/answers/332983/how-to-index-certain-logs-only-during-a-certain-ti.html

How do I change the owner of a saved search or view in a search head cluster environment?

rphillips from the Splunk Support team shared this helpful question and answer with the community as this is a concern brought up by many admins managing a search head cluster. He shows two examples using REST endpoints via CLI to change the owner for a search and a dashboard view that will get replicated across all members in the cluster.
https://answers.splunk.com/answers/295303/how-do-i-change-the-owner-of-a-saved-search-or-vie.html

Thanks for reading!

Missed out on the first fifty-three Smart AnSwerS blog posts? Check ‘em out here!
http://blogs.splunk.com/author/ppablo

What’s next? Next-level Splunk sysadmin tasks, part 3


(Hi all, welcome to the latest installment in the series of technical blog posts from members of the SplunkTrust, our Community MVP program. We're very proud to have such a fantastic group of community MVPs, and are excited to see what you'll do with what you learn from them over the coming months and years.
–rachel perkins, Sr. Director, Splunk Community)


This is part 3 of a series.
Find part 1 here: http://blogs.splunk.com/2016/02/11/whats-next-next-level-splunk-sysadmin-tasks-part-1/.
Find part 2 here: http://blogs.splunk.com/2016/02/16/whats-next-next-level-splunk-sysadmin-tasks-part-2/

Hi, I’m Mark Runals, Lead Security Engineer at The Ohio State University, and member of the SplunkTrust.

There can be numerous challenges involved with ingesting data into your local Splunk environment. Because Splunk works so well out of the box against so many types and formats of data, it can be easy to overlook the complexity of what is happening behind the scenes.

So far in this series we've talked about ways to validate some of the basic assumptions people have as they search and look at data in Splunk – these events happened on that server at this time. In retrospect I should have used that line at the beginning of this series. In part 1 I talked through a way to make sure the values in the host field are correct, and in part 2 that the local server time is set correctly. Time issues with your data go beyond simply making sure local server clocks are right. However, getting that set correctly is like buttoning your shirt with the correct first button and hole. Once that is addressed, the next step is to identify cases where there is an extreme or significant gap between when the data was generated and when it comes into Splunk. This is part art, part science. At a base level, the 'science' is pretty easy – subtract _time from _indextime. The art is masking your ire when you talk to system administrators about how they haven't been managing their systems correctly! I kid, I kid. Actually, the art is trying to identify which systems or data sources are having time or other data ingestion issues, whether the cause is server or Splunk related, and where to apply a fix.

The two categories of time issues

I tend to lump time issues into two categories: availability and integrity.

Let’s say you have an alert set up to run every 15 minutes looking at the last 15 minutes’ worth of logs from a particular sourcetype – only it takes 20 minutes or more for the data to come in. The data will eventually be placed in its chronologically correct position but your alert will never fire. Availability.

Conversely, let’s say you are investigating an outage or security issue that happened at a particular time, only one of the data types is generated in a different, and unaccounted for, time zone compared to the rest of the data – you will likely miss related events. Integrity.

Solutions and resources

There are more issues and possible solutions in this area than I could possibly cover in one or even several blog posts. As a quick start, let's look at some Splunk configuration things to do or look for. The first is the fact that forwarders are set by default to send only 256 KBps of data. A server generating more data than the forwarder can push is one reason you might see a delay in data being ingested. This can be found with a query like the following:

index=_internal sourcetype=splunkd "current data throughput" | rex "Current data throughput \((?<kb>\S+)" | eval rate=case(kb < 500, "256", kb > 499 AND kb < 520, "512", kb > 520 AND kb < 770 ,"768", kb>771 AND kb<1210, "1024", 1=1, "Other") | stats count sparkline by host, rate | where count > 4 | sort -rate,-count

If a forwarder has just been restarted it will likely have to catch up, which is why the query has its where statement. The sparkline output looks funky in email form, but my team has this query run to cover a midnight-to-midnight stretch. The numbers' placement in the sparkline can give insight into when the limit was hit and what might be happening – i.e., a busy server that was rebooted, or a forwarder consistently hitting the limit so that we need to raise its throughput limit. That can be adjusted via the forwarder's limits.conf > [thruput] maxKBps setting. There is a dashboard related to this and other forwarder issues in the Forwarder Health Splunk app.
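
On the forwarder itself that change is tiny; a sketch (the 512 here is just an example value, so size it to your environment):

# $SPLUNK_HOME/etc/system/local/limits.conf on the forwarder
[thruput]
# default is 256; 0 removes the limit entirely
maxKBps = 512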

There is some anecdotal evidence that some of the newest forwarders might not be generating this internal message, or that the conditions for the event's generation have changed, which hopefully isn't the case (!!). I have an open case with Splunk looking into this and will update this post if something is determined one way or the other.

The next thing to do is update your props.conf time-related settings for your sourcetypes, especially TIME_FORMAT. This and related settings make sure Splunk interprets each timestamp correctly. This subtopic alone can be long and involved. While I hate to hawk my own crap, the OSU team has had to do a lot of work in this space (work months and ongoing /shudder) so I'll refer you to the Data Curator app. I'm not sure whether Splunk dropping events or recognizing timestamps incorrectly is worse, but either way, if events aren't where you expect them, it's bad. The following is one of the queries in the Data Curator app that looks for dropped events due to timestamp issues; you could run it over the last 7 days or so:

index=_internal sourcetype=splunkd DateParserVerbose "too far away from the previous event's time" OR "outside of the acceptable time window" | rex "source::(?<Source>[^\|]+)\|host::(?<Host>[^\|]+)\|(?<Sourcetype>[^\|]+)" |rex "(?<msgs_suppressed>\d+) similar messages suppressed." | eval msgs_suppressed = if(isnull(msgs_suppressed), 1, msgs_suppressed) | timechart sum(msgs_suppressed) by Sourcetype span=1d usenull=f
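
Circling back to the TIME_FORMAT point above, a minimal props.conf sketch for a custom sourcetype might look like this (the sourcetype name and timestamp format here are made up for illustration):

[acme:app:log]
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N %z
MAX_TIMESTAMP_LOOKAHEAD = 30
# add TZ = <zone> only if the events carry no timezone offset of their own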

Besides the Data Curator app, I recommend other Splunk resources like Andrew Duca’s Data Onboarding presentation from .conf15.

So now let's generically say Splunk is configured to recognize your timestamp formats correctly and the forwarders are able to send data just as quickly as their little digital hearts are able to pump it out. As mentioned above, we need to look at the _indextime field. To find the delta, or 'lag', between event generation and ingestion, you simply subtract one from the other via a basic eval ( | eval lag = _indextime - _time ). If you want to see what that index time actually is, you'd need to create a field to operate as a surrogate, like | eval index_time = _indextime, and then maybe a | convert ctime(index_time), unless you are a Matrix-like prodigy who can convert the epoch time number into a meaningful date in your head.
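
Put together, a quick spot check against a single sourcetype might look something like this (a sketch, not a tuned query; swap in your own sourcetype):

sourcetype=your:sourcetype | eval index_time = _indextime | eval lag = _indextime - _time | convert ctime(index_time) | table _time index_time lag host source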

A basic and fairly generic query to review your data on the whole might be something like this, though it tends toward looking for time zone issues. If you want to have it look a bit broader at probable time issues, adjust the hrs eval to round(delay/3600,1) or just remove the search command right after that eval.

index=* | eval indexed_time= _indextime | eval delay=_indextime-_time | eval hrs = round(delay/3600) | search hrs > 0 OR hrs < 0 | rex field=source "(?<path>\S[^.]+)" | eval when = case (delay < 0, "Future", delay > 0, "Past", 1=1, "fixme") | stats avg(delay) as avgDelaySec avg(hrs) max(hrs) min(hrs) by sourcetype index host path when | eval avgDelaySec = round(avgDelaySec, 1)

One thing I’m trying to do with the rex command in this query is cut out cases where the date field is appended to the data in an effort to cut down on granular noise. Note that this will take some time to churn through depending on your environment so I recommend a relatively small time slice of not more than 5 minutes or so. In reviewing this post, fellow Community Trustee Martin Müller pointed out that a query using tstats would be more efficient. I would agree, though I feel what we are talking about is chapter 3 or 4 material and tstats is like chapter 6 :). At any rate, a rough query he threw together is:

| tstats max(_indextime) as max_index min(_indextime) as min_index where index=* by _time span=1s index host sourcetype source | eval later = max_index - _time | eval sooner = min_index - _time | where later > 60 OR sooner > 10

An additional tip: if you are on the North/South America side of the planet and want a quick way to look for unaccounted-for UTC logs, you could do the following. This will show logs coming in from the 'future':

index=* earliest=+1m latest=+24h 

When it comes to investigating an overloaded forwarder or sourcetype in particular, a quick go-to for me is:

host=foo source=bar (and/or sourcetype) | eval delta = _indextime - _time | timechart avg(delta) p95(delta) max(delta)

What I’m looking for here are basic visual trends like: is there a constant delay or does the delay subside at night/during slow periods?

Overall, time issues can be somewhat troublesome to find and ultimately fix. This might involve adjusting the limits on forwarders as we've talked about, adding additional forwarders on a server to split up the monitoring load (e.g., a busy centralized syslog server), updating props settings, or having conversations with device/server admins. I've not worked in a Splunk environment that collects data from multiple time zones. If you do, and care to share some of the strategies you've used to work through those particular challenges, please do so in the comments!
Hopefully you’ve found this series useful. It has been fun to write and share!

Smart AnSwerS #55


Hey there community and welcome to the 55th installment of Smart AnSwerS.

Next Wednesday, March 2nd @ 6:30PM, Splunk HQ will be hosting our monthly SF Bay Area User Group meeting. Since it’s during RSA, topics covered will be related to *drum roll*…SECURITY! If you happen to be local or visiting from out of town for the conference, come join fellow users over pizza and beer and listen to a talk from Monzy Merza, Chief Security Evangelist at Splunk. Be sure to visit the user group event page to RSVP and stay updated on the tentative agenda. Hopefully see you next Wednesday!

Check out this week’s featured Splunk Answers posts:

How to combine my two searches to get the duration of completed jobs with start/end events and display a list of incomplete jobs?

dpoloche had two searches that individually returned expected results, but needed to combine both into one, preferably without the transaction command for performance reasons. wpreston admits it is an expensive command, but reminds us how powerful it can be by simply adding the keepevicted=t argument and using the closed_txn field in dpoloche's existing search to get the job done. He also suggests using the fields command to improve performance by reducing field extractions. Runals provided an answer with a working search as well, using stats and eval without transaction for users to see how both approaches can work.
https://answers.splunk.com/answers/339864/stats-duration-without-using-transactions-for-even.html

How to search how much bandwidth a forwarder is using?

sbattista09 wanted to show how much bandwidth a forwarder was using by host in a timechart, but wasn’t sure where to start using _internal data. jbsplunk shows how this can be done, pulling an example search from S.o.S – Splunk on Splunk using metrics.log to calculate the outgoing thruput. sowings added that this can also be found through the Distributed Management Console.
https://answers.splunk.com/answers/340084/how-to-search-how-much-bandwidth-a-forwarder-is-us.html

Why is my rex statement unable to extract the field?

This question by jsiker is a topic that comes up often, but usually only has an answer that is useful for the original poster, as everyone's data will be formatted differently. However, the accepted answer by MuS has a comment thread with useful tips on testing out regular expressions by him and his fellow SplunkTrust members, somesoni2 and Runals. Learn how you can test your syntax directly from the Splunk CLI, in a search in Splunk Web, or on external sites with tools for leveling up your regex fu.
https://answers.splunk.com/answers/305727/why-is-my-rex-statement-unable-to-extract-the-fiel.html

Thanks for reading!

Missed out on the first fifty-four Smart AnSwerS blog posts? Check ‘em out here!
http://blogs.splunk.com/author/ppablo

Monitoring Vulnerability Deltas


A customer of mine runs daily vulnerability scans using a myriad of tools across servers and applications in their data center. As you would expect, they can use the native reporting capabilities of the various tools to review a given set of scan results, but if you’ve done this sort of thing before then you understand the issues associated with trying to get any real understanding by looking at disparate, clunky reporting tools, each in their own silo.

Ingesting vulnerability data into Splunk offers tons of powerful abilities while breaking down any sort of silos that could be a hindrance to correlation and true security analytics. In fact, our Common Information Model includes a Vulnerabilities model that can help normalize this type of data, and our Enterprise Security App includes a Vulnerability Center plus other dashboards and correlations for this data.

Some common use cases for vulnerability data in Splunk include understanding key elements of a host’s security posture, correlating vulnerabilities with CVEs or other indicators of attack, and monitoring vulnerability/patching cycles and trending over time.

This customer, though, had a very specific use in mind for this vulnerability data – they wanted to see deltas (changes) in vulnerabilities reported across hosts from one day to the next. So for a given scan type, host and unique vulnerability ID we need to ask 2 questions:

  • Is this a new vulnerability detection, i.e. it wasn’t there yesterday, but is there today?
  • Has a vulnerability that was there yesterday been mitigated so that it doesn’t show up on the scan today?

Why would an organization be interested in knowing about vulnerability deltas? In this particular case it was largely about compliance reporting integration with another tool. Beyond that, organizations might find deltas useful to:

  • understand the effectiveness of patching/mitigation
  • detect benign or malicious installation of vulnerable software on critical systems
  • better understand and detect changes in system state

We started by ingesting the vulnerability scan data into Splunk. You can do this all sorts of ways, and in fact there are many apps and add-ons available on SplunkBase to support ingestion of data from specific tools and vendors, but in our case we just consumed flat file output from the vulnerability scanners.

Once we brought the data in, we normalized it against the Common Information Model. We didn’t strictly need to do this, but not only does it simplify normalization of data across different vulnerability scanning tools, it also makes it easy to reuse the search logic without having to change field names.

After a little bit of data exploration and some trial and error, here is the search we came up with for Nessus vulnerability scan data:

index="nessus" sourcetype="nessus" [search index="nessus" sourcetype="nessus" | dedup date_mday | head 2 | fields date_mday]
| stats list(date_mday) as dates list(_time) as time count by dest,signature | where count <2
| appendcols [search index="nessus" sourcetype="nessus" [search index="nessus" sourcetype="nessus" | dedup date_mday | head 2 | fields date_mday] | stats list(_time) as _time count by dest,signature | where count <2 | stats latest(_time) as today earliest(_time) as yesterday] | filldown today yesterday
| eval condition=case(time=today,"New Detection",time=yesterday,"Removed Detection")
| fields - dates,count,today,yesterday
| convert ctime(time)
| eval uid=dest." | ".signature
| table uid time dest signature condition
| outputlookup nessus_delta.csv

It’s probably not the most efficient way to do this sort of search, but since vulnerability scan data isn’t particularly voluminous and we only had to run it once per day, it worked well for us. Here’s what your output might look like:

[Screenshot: sample delta output table]

Let’s break this search down a bit:

index="nessus" sourcetype="nessus" [search index="nessus" sourcetype="nessus" | dedup date_mday | head 2 | fields date_mday]

This is our base search, which returns all my Nessus scan data. Yes, I could have used a data model for this and probably would do that in production. The subsearch (the part in [] brackets) does the interesting work here. In short, it ensures that I only have scan results from the last 2 days on which I ran a scan by adding "date_mday=<latest day in my results> OR date_mday=<second latest day in my results>" to the base search.

Note that this works great as long as you don’t have more than 1 scan per day, and can easily handle skipped days or wrapping around month boundaries, but you would need to write something a little different if you had more than 1 scan result set per day (perhaps using the streamstats command). Also, don’t set your search time window to more than 30 days or so.

| stats list(date_mday) as dates list(_time) as time count by dest,signature | where count <2

The stats command creates a table with a row for each unique combination of a scanned system (dest field) and unique scan identifier (signature field). Those 2 fields in combination represent a unique detection. For each of those rows it lists out the day or days of the month on which the detection occurred (probably not necessary, but useful for validation) and the timestamp (absolutely necessary).

The where statement is how we filter out the results to only show rows where there’s a delta. Think about how it works – if the row has a count of 2, that means it was detected on both days in both scans, i.e. no change. If it only shows up once in those 2 scans, it means that there’s been a change. We now just need to figure out if it’s a new detection or a resolved/mitigated detection.

| appendcols [search index="nessus" sourcetype="nessus" [search index="nessus" sourcetype="nessus" | dedup date_mday | head 2 | fields date_mday] | stats list(_time) as _time count by dest,signature | where count <2 | stats latest(_time) as today earliest(_time) as yesterday] | filldown today yesterday

The appendcols command here, with nested subsearches, simply adds 2 columns with epoch time stamps representing dates/times of the 2 most recent scans (i.e. the scans we are including per the base search). We need these for comparison to know what kind of delta it is (new vs. removed detection). This works here because our data had a single timestamp for all events from a given scan. If yours doesn’t, then you’ll need to extract or eval a value representing just the date here, and for comparison in the original stats command. We also used the filldown command to populate each row with the time values for today and yesterday, instead of just the first row.

| eval condition=case(time=today,"New Detection",time=yesterday,"Removed Detection")

The eval command compares the time for each row against the known values for the 2 previous scans. If the detection timestamp matches the “yesterday” value, then we know it’s a removed detection (it was here yesterday, but not today), but if it matches the “today” value, then it’s a new detection (it’s here today, but wasn’t here yesterday). You can’t have a row with both yesterday and today because we filtered out the results with 2 detections after our original stats command with the where statement.

| fields - dates,count,today,yesterday
| convert ctime(time)
| eval uid=dest." | ".signature
| table uid time dest signature condition

The above portion of the search is largely about cleanup. We remove fields we no longer need, format the epoch timestamps into something more human-readable, create a unique ID field (uid) consisting of a concatenation of dest and signature fields, and then re-arrange the columns in the table to make more sense.

| outputlookup nessus_delta.csv

Finally, we used the outputlookup command to create a nicely formatted CSV file on the system for integration with another tool. If you do it this way, you can use a cron job to script moving the file from the <splunk_home>/etc/apps/<your app>/lookups/ directory to a network share, perhaps putting a timestamp in the file name.

Even better would be to save this search (minus the outputlookup) as an Alert, then use a scripted alert response to write the data directly to a network share with a unique file name that includes a timestamp. You could probably do it in well under 20 lines of python or similar scripting language.

If you’ve got vulnerability or other scan data, pull it into Splunk. Try to understand your trends and deltas, and see what else you can learn from it. It’s a treasure trove of valuable security and compliance data.

Happy Splunking!

Life, Liberty and the Pursuit of Log-i-ness


Technological democratization (say that 3 times fast – that's one), the power of technology to allow the budget-constrained rest of us to do things only well-funded professionals could do before, is everywhere. And, if you give me a few minutes of your time, I'll tell you how it's even coming to the smallest of IT organizations.

So what got me thinking about technological democratization (that's two) was that famous opening scene in Saving Private Ryan in which Steven Spielberg showed an incredibly realistic re-enactment of D-Day and the Allied landing on Omaha Beach. Back in 1998 BD ("before Disney") that scene cost $12 million to film (or 18% of the cost of the entire movie) and involved over 1,500 extras and tons of highly paid professionals at Industrial Light and Magic (ILM). The work was so specialized that they actually had one person whose job title (printed on their business card and everything) was "manager – underwater ballistic effects." No, seriously, I met her.

Now flash forward 10 years to 2008 (still BD for ILM but not so much for Pixar) in which that scene is re-enacted by 3 graphic designers using only 2 military uniforms, 2 fake rifles, some rope, a (really BIG) camera and 1 station wagon. Total time to produce this knock-off masterpiece? 4 days. Although the budget was never shared, just looking at the station wagon (and, let’s face it, the graphic designers) makes it clear that it was WAY below $12 million.

How did they do it? Technology, of course. In the intervening 10 years the technology had improved such that you didn't need all the specialized hardware and personnel to do almost the same work. In short, technology has democratized the process of elaborate filmmaking such that it can be done with fewer people, fewer costumes, less money and less time (OK, maybe more Red Bull).

This trend of “technological democratization” (that’s three) is everywhere and you all know the stories. Industries such as music and publishing (along with associated distribution work) have all seen their costs lowered dramatically while the number of producers/stars has gone up just as dramatically. Ask your kids their favorite television channel and they will probably say “Television? Oh you mean YouTube.” (and chances are they are watching a mysterious woman unwrap Disney toys – as she is YouTube’s highest paid star and made $5 million last year doing that.)

How is this democratization helping you solve your IT, Sys Admin and Dev Ops problems in 2016 AD ("after Disney")? Well, just as above, it's all about finding technology that allows you to do the work of 1,500 extras (or employees) when all you have are 3 people and a station wagon (have patience, you'll get that Tesla eventually).

So since you are here, you've probably already figured out that the equivalent of iMovie or Adobe Premiere for IT is automated log search and analysis. But what should you look for in the features of your log search and analysis solution, since they aren't all created equal? In terms of technology democratization, think of these as your "rights." As an IT problem solver you have the right to:

  • Automatic pattern recognition – So much of the work in solving a problem is detecting the root cause, and detecting the root cause involves spotting either a pattern OR a break in the normal pattern. If you have software that can speed that process you can speed time to resolution
  • Sophisticated Visualizations – While we dream of a system that can automatically detect key patterns we know that is a long way off. In the interim we still have to rely on our own eyes and experience, so it is important to make sure rich enough visualizations are fed into our brains (or to use my own rich visualization – “a computer made out of meat”)
  • No coding overhead – Many tools designed to help reduce the number of “extras” you need to have should say right on the label “some assembly required, batteries not included”. You are out to buy a solution, not a development project, because you’ll need more bodies for that. When I buy a hammer I don’t want a hunk of iron and a portable smelter, I want a hammer. Any typing I do should be to build a query to answer my question and even then it would be nice if the queries sometimes get written for me
  • A complete set of deployment options – Some of you do classic “on-prem” IT, some of you do “cloud.” Whatever your IT “lifestyle choice” is, you should be able to get the same benefit (can you tell we’re headquartered in San Francisco?)

So when looking at a solution for log search and analysis don’t forget your rights. The good news is thanks to technological democratization (that’s four, now I’m just showin’ off) you can now exercise your rights in a package you can afford. Seriously you need like $3 a day* lying around to get that capability from Splunk. And with any leftover budget I know where you can get a former manager of underwater ballistic effects on the cheap.

Marc Itzkowitz
Director of Product Marketing
Splunk Light

*The $3 per day price is based on an annual license fee of US $900 for indexing up to 1 gigabyte of data per day using Splunk Light Software, and an annual subscription fee of US $1080 for indexing up to 1 gigabyte of data per day using Splunk Light Cloud Service.
