Happy New Year!
Following on Dennis Bourg’s post about using event generation, I’d like to share some of my notes on planning and building a technology add-on (TA) for use with Splunk. As we all know, getting data into Splunk is remarkably easy; here we’ll focus on mapping the data to the Common Information Model (CIM) so that it’s easy to use from other applications.
How do I analyze a data source for a TA?
First, we want to identify the data source and make sure that we understand how that data will be input.
- Find the latest product guide and read it to understand the product’s features and functionality.
- Understand the types of reports and information the software gathers and stores.
- Sometimes the information is not directly available in logs, but only through APIs or databases. Splunk DB Connect or a modular input script is probably the best option in that case (see the scripted-input sketch after this list).
- Check the level of data that can be gathered with a default setup of the data source. If additional configuration is required, document it.
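For the API or database case, the polling logic lives outside the logs entirely. Here’s a minimal inputs.conf sketch for a scripted input, assuming a hypothetical TA named TA-example with a polling script called poll_api.py (DB Connect inputs are configured through its own interface instead):

```
# inputs.conf -- scripted input sketch; the app name, script name,
# interval, sourcetype, and index are all hypothetical.
[script://$SPLUNK_HOME/etc/apps/TA-example/bin/poll_api.py]
interval = 300
sourcetype = example:api
index = main
disabled = 0
```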
Next, we plan a list of the logs available and the events produced in each. Most importantly, we should settle on sourcetypes here; well-chosen sourcetypes allow for efficient searching later, and they prepare your TA to be re-deployed easily at another site, even one with lots of previously recorded data under a different sourcetype (see the sourcetype sketch after this list).
- List out the types of logs the data source produces. E.g., Oracle iPlanet Web Server has three types of logs: access, error, and audit.
- List out the events found in each log. E.g., the access log may contain events for login, logout, credential updates, and account creation, modification, and deletion.
- Gather sample log lines for different events or priorities.
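To make the sourcetype planning concrete, here’s a sketch of the inputs.conf stanzas for the iPlanet example; the monitor paths and sourcetype names are illustrative:

```
# inputs.conf -- one distinct sourcetype per log type; paths are illustrative.
[monitor:///opt/oracle/iplanet/logs/access*]
sourcetype = iplanet:web:access

[monitor:///opt/oracle/iplanet/logs/errors*]
sourcetype = iplanet:web:error

[monitor:///opt/oracle/iplanet/logs/audit*]
sourcetype = iplanet:web:audit
```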
Finally, we decide which CIM domains the data source can fill.
- Not all fields are required, but the required fields for a domain should be accurately filled by the data source; any fields in the data model that are not provided by the data source will be auto-filled with ‘unknown’.
- Use props and transforms to extract fields; you may need to remap a field in some cases. For instance, if the data source provides a “severity” field containing a 3-digit number, you should fieldalias it to “vendor_severity” and use a lookup to populate “severity” with a CIM-compliant value (see the props/transforms sketch after this list).
- Use lookups to fill in required field values that are not drawn from the data. A good example is the vendor and product fields, which are often not recorded in the events themselves.
- Use eventtypes and tags to make events appear in the data model interfaces for the desired domains (see the eventtypes/tags sketch after this list).
- Test by browsing the data model in Search->Pivot and making sure that you see the expected results; a search-bar check follows this list as well.
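To make the severity and vendor/product examples concrete, here’s a props.conf and transforms.conf sketch, reusing the illustrative iplanet:web:access sourcetype and a hypothetical CSV lookup that maps the vendor’s numeric codes to CIM-compliant severities:

```
# props.conf -- alias the raw field, then overwrite the CIM field via lookup.
[iplanet:web:access]
FIELDALIAS-vendor_severity = severity AS vendor_severity
LOOKUP-cim_severity = iplanet_severity_lookup vendor_severity OUTPUT severity
# Constants for required CIM fields that never appear in the raw events.
EVAL-vendor = "Oracle"
EVAL-product = "iPlanet Web Server"

# transforms.conf -- the CSV lives in the TA's lookups directory and has
# two columns: vendor_severity,severity.
[iplanet_severity_lookup]
filename = iplanet_severities.csv
```

Field aliases are applied before automatic lookups in Splunk’s search-time operations sequence, so the lookup can safely key off vendor_severity.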
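The eventtype and tag wiring is similarly small. A sketch, assuming the CIM Web data model is the target (stanza names are illustrative):

```
# eventtypes.conf -- match the events with a simple search.
[iplanet_web_access]
search = sourcetype=iplanet:web:access

# tags.conf -- tag the eventtype; the Web data model picks up tag=web.
[eventtype=iplanet_web_access]
web = enabled
```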
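You can also verify from the search bar rather than Pivot. A quick check, assuming the Web data model:

```
| datamodel Web Web search
| search sourcetype=iplanet:web:access
| head 10
```

If your events come back with the CIM fields populated, the mappings are working.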
That’s enough to use the data source internally — now you can deploy your TA to any search heads or indexers where you want this data source to be modeled. However, it can go a lot further if you prepare it to work with event generation.
How do I prepare event generation for a TA?
The Eventgen package should not be built into a TA, but it can be easily activated for building a test or demonstration scenario. Token replacement is strongly recommended for this purpose, as it allows other engineers to rapidly modify the data; it also helps ensure that proper data anonymization is being done.
A Technology Add-On should include an eventgen.conf and a samples directory that enable at least minimal exercising of the data type. That data must be anonymized and should use token replacement; a good sample is invaluable for building and testing Splunk apps. Here are some tips for getting high-quality anonymous data that can be easily represented with Eventgen:
- quantity. We don’t necessarily need lots of data, but we do need samples of each type of event that can occur.
- rate. Things that occur more frequently should be more frequent in the sample files.
- transactions. Events that must occur in a specific order should appear in that order. Put them into a separate sample file and use mode=replay instead of mode=sample (see the eventgen.conf sketch after this list).
- time. There’s no need to change the timestamps in the sample files; in fact, doing so can cause problems by making it hard for future users to understand activity patterns.
- anonymization. Replace locations, addresses, usernames, and emails with strings such as XXXXXXXXXX. This allows you to easily differentiate between internal and external addresses, for instance, or show that there are three offices involved in a transaction (YYYYYYYYY1, YYYYYYYYY2, YYYYYYYYY3). This isn’t strictly necessary with RFC 1918 addresses (10.*.*.*, 172.16.*.* - 172.31.*.*, 192.168.*.*), local network control addresses (224.0.0.*), or link-local space (169.254.*.*), but it is important with external IPs that may identify a network that should be left obscure. Token strings should be identifiable by purpose, for instance UUUUUUUUUU for User or HHHHHHHHHH for Host. You’ll note that Eventgen’s samples directory includes many token files, such as internal_ips.sample or useragents.sample, which make your new TA appear well-integrated with demo and test scenarios. If a needed type of replacement sample isn’t there, write one and put it in your TA’s samples directory.
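Putting those tips together, here’s a minimal eventgen.conf sketch; the sample file names, regexes, and rates are all hypothetical, and the users.sample replacement file is one you’d write yourself if Eventgen doesn’t ship one that fits:

```
# eventgen.conf -- illustrative sketch for a hypothetical TA named TA-example.

# Independent events: draw randomly from the sample at a steady rate.
[iplanet_access.sample]
mode = sample
interval = 60
count = 20
# Eventgen rewrites timestamps at generation time; the sample file itself
# keeps its original timestamps and activity pattern.
token.0.token = \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
token.0.replacementType = timestamp
token.0.replacement = %Y-%m-%d %H:%M:%S
# Swap the anonymized user token for realistic values from a sample file.
token.1.token = UUUUUUUUUU
token.1.replacementType = file
token.1.replacement = $SPLUNK_HOME/etc/apps/TA-example/samples/users.sample

# Ordered transactions: replay events in their original order and spacing.
[iplanet_transactions.sample]
mode = replay
token.0.token = \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
token.0.replacementType = timestamp
token.0.replacement = %Y-%m-%d %H:%M:%S
```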
Now the TA is ready to publish, and can be used seamlessly in lab and demo environments. Congratulations!