Posts Tagged ‘patterndb’
Valentijn has published (blog post, mailing list archive) a nice hack using syslog-ng to actively react to intrusion attempts with patterndb and iptables. The blocking part is implemented using the iptables recent match, which is capable of closing an open port for a certain amount of time. This is controlled by syslog-ng: whenever a login failure is received, syslog-ng informs the recent module about it.
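To give a rough idea of the mechanism, here is my own sketch of such a setup, not Valentijn's exact rules: the recent-table name "badguys", the 600-second window and the IP address are all made up for illustration.

```shell
# Sketch only: drop new SSH connections from addresses listed in the
# "badguys" recent-match table within the last 600 seconds.
iptables -A INPUT -p tcp --dport 22 \
    -m recent --name badguys --rcheck --seconds 600 -j DROP

# A script driven by syslog-ng (e.g. via a program() destination) can
# then add an offender by writing to the recent module's /proc interface:
echo "+10.0.0.1" > /proc/net/xt_recent/badguys
```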
And please note that it doesn’t matter which application the intruder is trying to use: by feeding new rules into patterndb, you can get the same functionality for any of your applications, with the syslog-ng configuration unchanged.
Nice idea, thanks Valentijn.
There’s a good writeup on syslog-ng correlation functions on LWN. Since it is currently for subscribers only, here’s a link that you can use until it becomes freely available.
LWN is a great publication by the way, so consider subscribing if you can.
A regular reader of this blog may already have heard about patterndb, a collection of syslog-ng db-parser() rules that will make syslog-ng the center of the universe.
OK, I was joking. patterndb is a collection of log samples that makes it possible to do more with your logs than merely processing them. For example, if you are interested in login failure events, finding them in the plain log files you usually encounter on a central log server these days can be daunting: each and every application logs this event differently, and simply grepping for it is infeasible except in really trivial cases.
Here comes patterndb, which contains patterns for a lot of different applications to recognize this (and other) events and turn them into a unified message, something that is easy to recognize and thus search for.
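To give a feel for what such a pattern looks like, here is a minimal, hypothetical db-parser() rule file in the v4 format; the program name, rule ids and the exact message text are invented for illustration and are not part of the official database:

```xml
<patterndb version="4" pub_date="2010-01-01">
  <ruleset name="example-app" id="00000000-0000-0000-0000-000000000001">
    <pattern>example-app</pattern>
    <rules>
      <!-- Normalizes a login failure into usracct.* name-value pairs -->
      <rule provider="example" id="00000000-0000-0000-0000-000000000002" class="violation">
        <patterns>
          <pattern>Login failed for user @ESTRING:usracct.username: @from @ANYSTRING:usracct.device@</pattern>
        </patterns>
      </rule>
    </rules>
  </ruleset>
</patterndb>
```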
Until now, the patterndb project aimed at creating its own dictionary & taxonomy to categorize events, simply because there was no such thing when we started.
However, times change, and the “Common Event Expression” (or CEE for short) project seems to have made some progress. That project aims at defining a generic dictionary for event fields, which would come in quite handy in patterndb land: instead of doing a lot of work to define our own, we can use their results. Since this project is backed by the US government, there is a good chance it will be adopted by the industry.
So patterndb is now moving towards using the CEE results, and we have even converted our previous patterns over the past few weeks.
Please read Peter Czanik’s blog post on the topic for more information.
If you are a regular reader of this blog, you’ll probably know that syslog-ng is now entering the log message processing scene with its db-parser functionality. In order to improve our pattern coverage, Peter has started a log sample collection initiative. Please help him with good quality samples so our login/logout coverage becomes significantly better.
Here’s the blog post where he describes what he’d need in order to proceed:
Thanks for helping him.
You probably know that during the 3.2 development series a lot of functionality has been added to db-parser() (aka patterndb). All of this functionality was upward compatible with the old XML file format, so at first I decided not to change the patterndb version number; it remained at v3.
However, after a talk with Robert, our documentation maintainer, he convinced me to bump it to v4. I’ve now added checks to syslog-ng v3.1 and v3.2 to actually verify this: v3.1 will only accept v3-formatted files and complain otherwise, while v3.2 will accept both v3 and v4.
I also added an XML schema to cover this format.
You’ll find these checks in v3.1.4 (too bad I’ve just released v3.1.3) and in v3.2.1.
I’m trying to push syslog-ng 3.2beta1 out the door, but as I was writing the NEWS entry I realized that the latest state of the patterndb correlation functions is still undocumented. So here goes a blog post that tries to summarize how it works, so that I can include it in the NEWS entry.
My previous post on the topic used a syntax that included explicit “store” and “join” elements, but I’ve decided to drop those, as they stood in the way of some more juicy functionality. What remained is that the correlation is still centered around a “correlation state”, or as I’ve called it internally, a “context”.
A context consists of a series of log messages related to each other in some way. As new messages come in, they may be associated with a context (i.e. added to it). Also, when an incoming message is identified, it can trigger actions to be performed, and these actions can use all the information previously stored in the context.
Let’s see how this works out with concrete examples: each rule in the patterndb has a “context-id” attribute telling db-parser() which context the given message should be associated with.
This example covers an SSH login message:
<rule id="…" context-id="$HOST:$PROGRAM:$PID" context-timeout="86400" context-scope="global">
<pattern>Accepted @ESTRING:usracct.authmethod: @for @ESTRING:usracct.username: @from @ESTRING:usracct.device: @port @ESTRING:: @@ANYSTRING:usracct.service@</pattern>
Since multiple rules can reference the same context, multiple different kinds of messages may be added to the same context. E.g. the logout event looks like this:
<rule context-id="$HOST:$PROGRAM:$PID" context-timeout="0" context-scope="global">
<pattern>pam_unix(sshd:session): session closed for user @ANYSTRING:usracct.username:@</pattern>
As you can see, a “session” is identified by the triplet ($HOST, $PROGRAM, $PID), and these two rules correlate the login/logout events into the same context, which means that you can create a derived event containing information from both of them.
Please note that it is fairly common that messages only need to be correlated if they originate from the same host: e.g. this SSH login message only needs to be correlated with its logout counterpart if both originate from the same host. In the previous example this was achieved using explicit macros in the context-id attribute, but since this is so often the case, it was turned into a feature in its own right: each rule can have a context-scope attribute:
<rule id="…" context-scope="process" context-id="ssh-login" context-timeout="5">
The context-scope attribute tells syslog-ng which messages need to be considered when looking for correlations:
- process: only consider messages that have matching $HOST, $PROGRAM and $PID values
- program: only consider messages that have matching $HOST and $PROGRAM values
- host: only consider messages that have matching $HOST values
- global: any kind of message is fine
The default is “process”, which means that if the same process emits all the messages you want to correlate, you don’t need a variable part in your context-id attribute. It is also important to know that specifying the scope this way is much faster than adding all the relevant macros to your context-id attribute.
So far so good, we have all the functions that we used to have with the previous versions of the functionality. But I mentioned something about “actions” to be performed. Until now a patterndb rule basically only identified the incoming message, possibly associated tags and name-value pairs, but didn’t perform anything else. This is being changed: one or more actions can be associated with a patterndb rule in order to make it possible to react to more complex situations.
Here’s an example action:
<value name="MESSAGE">a patterndb rule matched</value>
Right now the only real response to a message is to generate another message, but this allows us to do a couple of powerful transformations, especially with the following options that you can specify for an action tag:
- condition: specifies a syslog-ng filter expression that needs to be matched in order to really perform the action. It is evaluated on the current message that matched the rule.
- rate: <num>/<period> specifies how many messages are to be generated (num) in the specified time period (period). Excess messages are dropped. For example: "1/60" allows 1 message per minute. Rates apply to the given scope for the given rule/action. E.g. context-scope="host", rate="1/60" means that one message gets generated for _each host_ per minute.
- trigger: specifies when to execute the action; there are two possible triggers right now:
- match: execute immediately once the rule matches
- timeout: execute when the correlation timer expires
I’d like to highlight two things:
- it is possible to react to the expiration of a correlation timer (e.g. trigger="timeout")
- it is possible to generate a message only in case a given condition is met (e.g. "$PID" == "")
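Putting these options together, a sketch of a full action block might look like the following; the condition, rate and message text are all made up for illustration, and the exact element nesting should be double-checked against the shipped XML schema:

```xml
<actions>
  <!-- Fires when the correlation timer expires, at most once per minute,
       and only if a username was extracted earlier in the context. -->
  <action trigger="timeout" rate="1/60" condition='"${usracct.username}" != ""'>
    <message>
      <values>
        <value name="MESSAGE">ssh session of ${usracct.username} timed out</value>
      </values>
    </message>
  </action>
</actions>
```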
Right now new messages are posted to the internal() driver. This is not the way I wanted it to be, but implementing my original plan would require an enormous refactoring of the code, and it is too late for that to happen. My original idea was to let db-parser() emit multiple messages, but since the current state of affairs in syslog-ng assumes that only sources generate messages, that needs a lot of work. But hey, we need something to do for syslog-ng 3.3, right?
Just a quick post to let you know that I’ve integrated gyp’s patternize patches, so if you check out the latest and greatest revision from git, patternize will be included.
I’ve also fixed a couple of memory leaks and decreased memory usage a lot, so you might want to try this, if you were experimenting with the older version.
For those who don’t know what this is about: patternize, based on the SLCT algorithm, takes a logfile and automatically generates patterns that cover its contents. Some manual labour is still needed to process its output, but I’d say patternize does about 80% of the job. To use it:
$ pdbtool patternize /var/log/messages
There are a few configuration knobs, but it’s usually as simple as that.
I think we’ve reached an important milestone with syslog-ng: log message correlation was added to db-parser(). As you probably know, db-parser() and its sister project patterndb are able to transform unstructured syslog messages into a normalized format: the human-readable string content becomes a set of name-value pairs. The problem is that in a lot of cases a message misses one or two details that would really be needed to understand it, and this information usually arrives in a follow-up message.
For example: one message in postfix logs contains the sender address, while the recipient information comes in the next message. It is easy to see that in most cases you really want the information in (sender, recipient) pairs. Another example is sshd, where the authentication failure comes in one message and the exact reason for the failure in the next.
Currently, what you can do with syslog-ng is put the separate messages into two SQL tables and join them at query time. This gets ugly quite fast: increased storage needs, the hassle of managing two tables instead of one, not to mention the increased time needed to query the database. Sometimes the sole reason for creating SQL tables in this case is to perform the correlation; otherwise you’d be happier with a CSV file.
And that’s what became possible now with the latest git commit of syslog-ng 3.2. The idea is simple: when a patterndb rule matches, you can tell syslog-ng to remember that message by adding it to a correlation state. This state is identified with information extracted from the message, making it a unique session identifier. When the next line comes in, you can reference the information stored earlier.
Basically the correlation state is a list of log messages associated with a session id. To add a new message to this state, you need a store rule:
<pattern>foo session: @STRING:sessionid@, param: @STRING:param@</pattern>
<store id="$sessionid" timeout="60"/>
The id attribute of the store element specifies a template containing any syslog-ng name-value pairs, probably extracted from the current message itself.
When the final information comes in you can use the join attribute of the values tag:
<pattern>bar session: @STRING:sessionid@</pattern>
Here the join attribute specifies the session to look up (which must match between the two messages); if there’s a match, all messages stored in the correlation state become available when evaluating the name-value pairs associated with the current message.
The key here is the new syntax in the template string: "@1" appended to a name-value pair reference. After the "@" character, you can reference a message in the correlation state by specifying its index backwards from the current message: @0 is the current message, @1 is the one before it, @2 the one before that, and so on.
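For example, assuming the store rule shown above has already placed the “foo session” message into the context, the rule matching the “bar session” line could emit a value reaching back one message (a sketch; the MESSAGE text is made up):

```xml
<!-- ${param@1} reads the "param" value from the previous message in the
     correlation state; ${sessionid} comes from the current message. -->
<value name="MESSAGE">session ${sessionid} done, param was ${param@1}</value>
```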
There are more complex ways to use/query the contents of the correlation state, but those will appear in a followup post. Stay tuned!
Since our new website with a wiki engine has (finally) launched, I started to write the patterndb project homepage, which you can find at http://www.balabit.com/wiki/patterndb. Among the links there’s also a new article describing how to deploy the patterndb rules in a syslog-ng installation. Hopefully it’ll make experimentation easier.
I have added some more functionality to “pdbtool test”, which I needed while working on the official syslog-ng patterndb patterns. It can now process several pdb files in a single invocation, and it is also able to validate patterndb XML files against the official schema.
This is the shell command I’ve used:
$ pdbtool test --validate `find . -name '*.pdb'`
If you compiled the alpha2 release, this is only one patch on top of that, so it should be simple. You can check out the patch here.
I was giving a lot of thought recently to the topic of naming name-value pairs in syslog-ng. Until now the only documented rule is stating somewhat vaguely that whenever you use a parser you should choose a name that has at least one dot in it, and this dot must not be the initial character. This means that names like MSG or .SDATA.meta.sequenceId are reserved for syslog-ng, and APACHE.CLIENT_IP is reserved for users.
However things became more complex with syslog-ng OSE 3.2. Let’s see what sources generate name-value pairs:
- traditional macros (e.g. $DATE); these are not name-value pairs per-se, but behave much like them, except that they are read-only
- syslog message fields (e.g. $MSG) if the message is coming from a syslog source
- filters whenever the ‘store-matches’ flag is set and the regexp contains groups
- rewrite rules, whenever the rewrite rule specifies a thus far unknown name-value pair, e.g. set("something" value("name-value.pair"));
- and of course parsers when you tell syslog-ng to parse an input as a CSV, or use db-parser together with the patterns produced by the patterndb project
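As a quick illustration of two of these sources, here is a hedged syslog-ng configuration fragment; the block names and field names are made up:

```
# A rewrite rule creating a thus-far-unknown name-value pair:
rewrite r_example {
    set("backup" value("job.category"));
};

# A CSV parser creating name-value pairs from message columns:
parser p_example {
    csv-parser(columns("app.client", "app.user", "app.action")
               delimiters(","));
};
```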
The latest addition generating name-value pairs is support for process accounting logs; in this case even the syslog-related fields are missing, and only things like “pacct.ac_comm” (containing the program name) are defined.
So I was thinking whether it should be “pacct.ac_comm” or “.pacct.ac_comm”. With the rule quoted above it should be simple: it is generated by syslog-ng itself, thus it should be in the syslog-ng namespace and start with a dot. However, in the era of syslog-ng plugins, what constitutes syslog-ng at all?
First, I wanted to use “pacct.ac_comm” (i.e. without a leading dot), because I liked that name better, and I was trying to explain to myself why it would not violate the rule above. The explanation I had was: I’m going to “register” names such as this in the patterndb SCHEMAS.txt file. With this – not yet published – explanation, I committed a patch converting the pacctformat plugin to use a dotless prefix.
Next, I realized that while it is true that process accounting creates name-value pairs without going through patternization, nothing ensures that these name-value pairs are directly usable when analysing the logs. The patterndb concept uses tags and schemas to convert incoming unstructured data into a consistent structure, and pacct may not completely match what the user needs. In the future, when SNMP traps or SQL table polling are supported, this will be even more true: these name-value pairs may need a conversion from the SNMP/pacct structure to the structure described by the patterndb schema, in order to handle these message sources consistently with regular syslog (and to make it easy to correlate them).
So in the end, I committed another patch, this time going back to “.pacct” as a prefix and leaving the original naming rule intact. The “pacct” prefix is left for the users: they may want the same information in a “pacct” schema, but that may come from data not directly tied to process accounting (e.g. from syslog messages).
So this post is about doing nothing with regard to the naming policy, but I thought it’d be important to shed some light on what goes on behind the scenes. Giving such decisions enough thought and coming up with a long-term plan makes our lives much easier in the future.
This post may be a bit more involved than the others, but feel free to ask me to elaborate, if you are interested.
I thought I’d post a quick update on the patterndb project status. Our first aim was to draft a basic policy which governs how patterns should be created. This is available in the patterndb git repository as a README.txt file.
Although not completely finished, I feel the current description is enough for some basic work to start, to gather more experience. Here is the current version:
Also, after discussing the policy we’ve set a target to cover login/logout events from all parts of a generic Linux system. Currently sshd is quite nicely covered, su is coming along and I still have some submitted log samples that need marking up.
With the sshd/su patterns a quite nice percentage of my “auth.log” file is covered and using pdbtool “grep on steroids” feature, the marked up patterns are already quite useful.
Further log samples and a helping hand in marking up the patterns would be appreciated.
You may have heard of my last project to collect log samples from various applications, in order to convert log data from free-form human readable strings into structured information.
The first round to collect login/logout messages from sshd is now complete.
You could ask: ok, but what is the immediate benefit? You supposedly have a lot of unprocessed log files, and syslog-ng’s db-parser() has not been used to process them, thus they are stored as good-old plain text files.
I spent a couple of hours adding a “grep”-like functionality to pdbtool, which makes it easy to process already existing log files, giving you an immediate benefit for each and every sample added to patterndb.
For example, if you are interested in login failure events, you could say:
zcat logfile.gz | pdbtool match -p access/sshd.pdb \
    --file - \
    --filter 'tags("usracct") and match("REJECT" type(string) value("secevt.verdict"));' \
What the command above does is the following:
- reads a compressed logfile from logfile.gz
- tells pdbtool to use access/sshd.pdb (in the patterndb git repo) as its pattern database file
- tells pdbtool to read its stdin as a logfile, and
- apply the db-parser() for each log message
- apply the syslog-ng filter specified above
- and print matching messages using the template also specified above
As a combination, this results in a CSV file containing the login failure records found in the logfile. Also please note that as long as there’s a pattern in the pdb file, it doesn’t really matter what the original message looked like; the fact that ssh can use 3-5 different messages for the same meaning is hidden nicely under the hood.
And imagine we had patterns for all common applications running on our computers: the same command above would then produce login-failure reports independently of the application/OS combination being used.
Try that with grep.
This pdbtool is in the OSE 3.2 tree, clone the tree from: git://git.balabit.hu/bazsi/syslog-ng-3.2.git
By now probably most of you know about patterndb, a powerful framework in syslog-ng that lets you extract structured information from log messages and perform classification at a high speed:
Until now, syslog-ng offered the feature, but no release-quality patterns were produced by the syslog-ng developers. Some samples based on the logcheck database were created, but otherwise every syslog-ng user had to create her samples manually, possibly repeating work performed by others.
Since this calls out to be a community project, I’m hereby starting one.
The goal is to create release-quality pattern databases that can simply be deployed to an existing syslog-ng installation. The purpose of the patterns is to extract structured information from the free-form syslog messages, e.g. create name-value pairs based on the syslog message.
Since the key factor when doing something like this is the naming of fields, we’re going to create our generic naming guidelines that can be applied to any application in the industry.
It is not our goal to implement correlation or any other advanced form of analysis, although we feel that with the results of this project, event correlation and analysis can be performed much more easily than without it.
I know there are other efforts in the field, why not simply join them?
CEF is the log message format for a proprietary log analysis engine, primarily meant to hold IP security device logs (firewalls, IPSs, virus gateways, etc.). The patterndb project aims to create patterns for a wider range of device logs and to be more generic in its approach. On the other hand, we feel it might be useful to create a solution for converting db-parser() output to the CEF format.
CEE, the Common Event Expression project by Mitre, focuses on creating a name-value pair dictionary for all kinds of devices/log messages out there. I might be missing something, but I haven’t found any concrete results so far, apart from a nice-looking white paper. If CEE delivers something, patterndb will probably adopt its naming/taxonomy structure. But I guess not all devices will start logging in the shiny new format, thus existing devices would need their logs converted, so the patterndb work wouldn’t be wasted.
Our original patterndb-related plans were to create an easy-to-use web-based interface for editing patterns, but since that project is progressing slowly, I’m calling for a minimalist approach: git-based version control of simple plain text files. Of course, once the nice web-based interface is finished, we’re going to be ready to use it.
I have created a git repository at:
This contains the initial version of the naming policy document, a simple SIEM-style schema and a user login/logout naming schema.
If you are interested please read the file README.txt in the git archive, or if you prefer a web browser, use this link:
I do not have a decision yet, but for sure this is going to use one of the open source licenses or Creative Commons. Let me know if you have a preference in this area.
Join the syslog-ng mailing list and start discussing! If you have existing patterns, great. If you don’t, it is not too late to join.
The posting address of the mailing list (to subscribers only) is:
You probably know that starting with syslog-ng 3.0, we started poking into the message payload: extracting information from log messages and using that information in structured form for message routing, filtering, and storing it as separate fields in a database table.
The reason I’m raising the topic here again is that we have now released about 8000 patterns covering about 200 applications for patterndb and are now in the process of creating a community site to maintain this database.
You can download the database from www.balabit.com.
Another important thing to know is that syslog-ng OSE 3.1 features enhanced performance when handling information extracted from the message payload, and it also supports the latest patterndb database format. So if you want to try the new database, fetch a copy of the latest 3.1beta2 release.