Posts Tagged ‘performance’
It seems that the BalaBit syslog-ng team that produces the Premium Edition of syslog-ng has beaten the community project this time, at least in terms of release date.
syslog-ng Premium Edition 4F1 (i.e. the first feature release past 4.0) has been released this week. It is the first PE release in a long time that is based on an actual OSE core, namely 3.3.
I still have about 100 patches to review and integrate into OSE, hopefully with community involvement. But more about that in an upcoming post.
It is also interesting that some performance testing was done, and the new core does pretty well: it scales nicely on an 8-core machine, up to 800k msg/sec in some configurations. Here’s the post that has some more details.
Now, if only the fixes they did were integrated properly into the OSE repository. But hey, life would be easy without challenges.
I just wanted to let you know that fixes are coming into the 3.3 beta tree nicely, although it might not be very visible from the outside.
So if you are considering trying out 3.3, I’d suggest trying a git snapshot instead of the 3.3beta1 tarball.
I’m trying to release a beta2 or rc1 in the near future. The version number depends on how much feedback we get until then.
With the recent maintenance policy updates in my last post, I plan to quickly release a maintenance version for 3.2 (with version number 3.2.3) and then concentrate on getting 3.3 into a stable form, starting with a beta release.
As a reminder, here are the new features of syslog-ng 3.3:
- performance improvements:
- new multi-threaded core that allows syslog-ng to scale into the hundreds of thousands of messages per second range by using all the CPU cores available in the system
- use of the epoll() system call instead of the traditional poll() (where available)
- transaction support in the SQL destination driver, resulting in significant performance improvements (not LOAD DATA though)
- buffered output for destination files, at the cost of some latency (see the sketch after this list)
- other miscellaneous changes that improve performance
- MongoDB destination driver with support for creating documents based on the dynamic syslog-ng message structure
- $(format-json) template function that converts messages into a JSON representation
- systemd support (which was backported to the 3.2 release as well to support distributions in their integration work on systemd)
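To illustrate the buffered file output trade-off mentioned in the list above: the idea is the same one stdio’s setvbuf() embodies, collecting many messages in a userspace buffer and writing them out with far fewer syscalls. The sketch below is just that stdio analogue, not syslog-ng’s actual implementation:

    /* Sketch of the buffered-output trade-off using plain stdio;
       syslog-ng's own buffering differs, this only shows the idea. */
    #include <stdio.h>

    int main(void)
    {
      FILE *logfile = fopen("/tmp/test.log", "a");
      static char buffer[256 * 1024];

      if (!logfile)
        return 1;

      /* Fully buffered with a large buffer: many messages are collected
         in userspace and flushed with a single write() syscall... */
      setvbuf(logfile, buffer, _IOFBF, sizeof(buffer));

      for (int i = 0; i < 100000; i++)
        fprintf(logfile, "<38>Oct 11 22:14:15 host app[1234]: message %d\n", i);

      /* ...at the cost of latency: a message only hits the disk when the
         buffer fills up or when we flush explicitly. */
      fflush(logfile);
      fclose(logfile);
      return 0;
    }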
As you can see, this release is clearly performance oriented; hopefully 3.4 will also come with new and exciting features. For now, I’ve opened the 3.4 branch in order to have a place where new stuff can go, instead of languishing as patches on the mailing list. I’m quite excited about the new threaded core and I see further opportunities in it, although I can hardly imagine anyone having the several hundred megabytes per second of logs that the current core can deliver.
Also, the non-performance related items on the list above were contributed by members of the community, so this release contains much more community work than previous ones. Thanks, guys.
I’ve reached an important milestone in the threading work, and I’m happy to tell you that the multi-processing and epoll related performance improvements are progressing nicely. The current master branch of the syslog-ng 3.3 tree passes the testsuite (make check) and performs much better than earlier releases.
The only performance data so far was measured on my laptop, where throughput grew from about 60k to 180k msg/sec. While doing fixes and adding locking here and there, it went down to 160k, and I didn’t investigate why that happened. But 160k msg/sec from a single client is not bad either, and my guess is that adding more clients (and CPUs) to the picture will scale syslog-ng into the several hundred thousand messages per second range.
I still have some locking work to do, and I’ve just found problems with the udp() destination driver, so the code is currently quite fragile, but I’d appreciate any feedback you can give by installing it on your test systems. Production use is of course out of the question.
Until now, the work was available in the “wip/epoll” branch, which I rebased regularly so that fixes for problems I found were incorporated into the original “threading” patchset. However, that patchset has grown quite large by now, and I feel it’d be easier to track changes as individual patches instead of folding them back into the original series. Therefore I merged it back to “master”; from now on the wip/epoll branch is removed and further fixes will be published on the “master” branch.
In order to compile this stuff you’ll need one dependency library: ivykis. Ivykis is written by Lennert Buytenhek and encapsulates an epoll based event loop. It also supports other mechanisms, like FreeBSD’s kqueue, Solaris’s /dev/poll and of course the traditional select/poll system calls. I needed a couple of modifications to ivykis; those are hosted on git.balabit.hu, more specifically at git://git.balabit.hu/bazsi/ivykis.git. I’m working with Lennert to incorporate my changes, so that hopefully no changes to upstream ivykis will be necessary.
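If you want to get a feel for ivykis before digging into the syslog-ng code, a minimal program looks roughly like this (a sketch against the ivykis API as shipped in the tree above; details may differ between versions): you register a struct iv_fd with the event loop and get a callback whenever the fd becomes readable.

    /* Minimal ivykis event loop sketch: watch stdin for readability.
       Compile with -livykis. */
    #include <iv.h>
    #include <stdio.h>
    #include <unistd.h>

    static struct iv_fd stdin_fd;

    static void stdin_readable(void *cookie)
    {
      char buf[1024];
      ssize_t len = read(stdin_fd.fd, buf, sizeof(buf));

      if (len <= 0)
        {
          /* EOF or error: unregister the fd and stop the loop */
          iv_fd_unregister(&stdin_fd);
          iv_quit();
          return;
        }
      printf("read %zd bytes\n", len);
    }

    int main(void)
    {
      iv_init();

      IV_FD_INIT(&stdin_fd);
      stdin_fd.fd = STDIN_FILENO;
      stdin_fd.cookie = NULL;
      stdin_fd.handler_in = stdin_readable;  /* called when fd is readable */
      iv_fd_register(&stdin_fd);

      iv_main();    /* runs the epoll/kqueue/... based event loop */
      iv_deinit();
      return 0;
    }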
In the coming days I’m going to fix up the things that broke, and then quite possibly do a 3.3alpha1 release once I feel it is getting stable enough for anyone to try.
Although I was not posting on this blog, I was working on syslog-ng multi-threading support in the last couple of weeks. Most of the preparation was done during the Netfilter Workshop (I know it wasn’t netfilter related), and I’ve since used every possible occasion to work on the code instead of writing about it.
I’ve decided that instead of using a per-connection thread model, I’d like to use something that keeps the number of threads close to the number of CPU cores, to avoid bad cache effects and context-switch overhead. Since syslog-ng may serve thousands of clients, a per-connection thread model would have meant thousands of threads, so I gave up on that thought.
In my current architecture a single thread watches the file descriptors for events and a set of worker threads performs the work. Since most of the time most of the fds are idle, this definitely uses a lower number of threads, and if I’m smart enough, the same thread will be dispatching for the same client, which means that the cache will still be hot by the time the second round of events comes in.
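To make the affinity idea concrete, here’s a toy sketch (not the actual syslog-ng code; the fd-to-worker mapping and the pipe-based queues are just illustrations): the poll thread always hands events for a given fd to the same worker, so that worker’s cache stays warm for that client.

    /* Toy sketch of fd-to-worker affinity; compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define N_WORKERS 4

    static int worker_pipe[N_WORKERS][2];

    static void *worker(void *arg)
    {
      long id = (long) arg;
      int fd;

      /* Each worker reads the connection ids assigned to it; because the
         mapping is stable, a given connection always lands on the same
         worker, keeping its state warm in that core's cache. */
      while (read(worker_pipe[id][0], &fd, sizeof(fd)) == sizeof(fd))
        printf("worker %ld handles event on fd %d\n", id, fd);
      return NULL;
    }

    int main(void)
    {
      pthread_t threads[N_WORKERS];
      long i;
      int fd;

      for (i = 0; i < N_WORKERS; i++)
        {
          pipe(worker_pipe[i]);
          pthread_create(&threads[i], NULL, worker, (void *) i);
        }

      /* The "poll thread": dispatch each ready fd to the worker that owns it. */
      for (fd = 3; fd < 23; fd++)
        {
          int target = fd % N_WORKERS;   /* stable fd -> worker mapping */
          write(worker_pipe[target][1], &fd, sizeof(fd));
        }

      for (i = 0; i < N_WORKERS; i++)
        close(worker_pipe[i][1]);        /* EOF lets the workers exit */
      for (i = 0; i < N_WORKERS; i++)
        pthread_join(threads[i], NULL);
      return 0;
    }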
Also, since the model of GLib’s main loop is inherently slow, I’ve decided to switch away from it. GLib uses a linked list of GSource objects, which is iterated _twice_ every poll iteration: once for the prepare() phase and once for check(). In case we’re polling 1000 fds, that’s a loop over 2000 items which, if done thousands of times a second, poses a serious overhead.
Getting away from GLib is not easy, since a lot of logic is implemented in those prepare/check callbacks, but it was an aim worth pursuing.
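For those unfamiliar with GLib internals, this is what the two phases look like from a custom GSource’s point of view (a bare-bones example, much simpler than a real dispatcher): prepare() runs on every attached source before poll(), check() on every source after it, which is exactly the double iteration described above.

    /* Bare-bones custom GSource showing GLib's two-phase poll model.
       Compile with: gcc demo.c $(pkg-config --cflags --libs glib-2.0) */
    #include <glib.h>

    static gboolean
    my_prepare(GSource *source, gint *timeout)
    {
      *timeout = -1;   /* no timeout of our own */
      return FALSE;    /* not ready yet; go ahead and poll() */
    }

    static gboolean
    my_check(GSource *source)
    {
      /* called after poll() returns, for every attached source */
      return FALSE;    /* nothing to dispatch this round */
    }

    static gboolean
    my_dispatch(GSource *source, GSourceFunc callback, gpointer user_data)
    {
      return TRUE;     /* keep the source alive */
    }

    static GSourceFuncs my_funcs = { my_prepare, my_check, my_dispatch, NULL };

    static gboolean
    quit_cb(gpointer data)
    {
      g_main_loop_quit((GMainLoop *) data);
      return FALSE;
    }

    int main(void)
    {
      GMainLoop *loop = g_main_loop_new(NULL, FALSE);
      GSource *source = g_source_new(&my_funcs, sizeof(GSource));

      g_source_attach(source, NULL);
      /* with N sources attached, every loop iteration walks the source
         list twice: once for prepare(), once for check() */
      g_timeout_add(10, quit_cb, loop);
      g_main_loop_run(loop);

      g_source_unref(source);
      g_main_loop_unref(loop);
      return 0;
    }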
Also, as an added bonus, I wanted to use faster kernel interfaces instead of poll(). Linux has epoll, FreeBSD has kqueue, Solaris has /dev/poll; these all advertise themselves as much more performant than the traditional interfaces.
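The attraction of epoll is that the interest list lives in the kernel: you register an fd once, and each wait costs roughly in proportion to the number of ready fds rather than the number of watched ones. A minimal sketch:

    /* Minimal Linux epoll sketch: wait for stdin to become readable.
       Unlike poll(), the watched-fd set lives in the kernel, so each
       epoll_wait() costs O(ready fds), not O(watched fds). */
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    int main(void)
    {
      struct epoll_event ev, events[16];
      int epfd = epoll_create1(0);

      if (epfd < 0)
        return 1;

      ev.events = EPOLLIN;
      ev.data.fd = STDIN_FILENO;
      epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);  /* register once */

      for (;;)
        {
          int n = epoll_wait(epfd, events, 16, -1);  /* block until ready */

          for (int i = 0; i < n; i++)
            {
              char buf[1024];
              ssize_t len = read(events[i].data.fd, buf, sizeof(buf));

              if (len <= 0)
                {
                  close(epfd);
                  return 0;
                }
              printf("fd %d: %zd bytes\n", events[i].data.fd, len);
            }
        }
    }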
I was both looking at alternative main loop libraries and thinking about rolling my own. Here are the ones I’ve considered:
- I’ve immediately ruled out C++ libraries.
- libevent: probably the most widely available, but I didn’t like the API too much.
- libev: has a libevent compatible API and a lower level one. This API was better, but reading through the CVS history, I didn’t like the primary author’s stance on things like AIX support and the like.
- ivykis: has the best API, quite Linux kernel-like, and lightweight enough for me to confidently navigate the code or modify it. Has nice thread integration, with worker thread pools and cross-thread calls. Not really packaged anywhere, so syslog-ng will probably have to carry a copy.
Right now, ivykis is the winner, but I don’t know about its real portability (it uses the __thread keyword, for example, which may not be available everywhere), so this choice could still change.
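For reference, this is the kind of construct I mean (a toy example, not ivykis code): __thread is a compiler extension that gives each thread its own copy of a variable, and not every toolchain supports it.

    /* __thread is a compiler extension (GCC, and compilers mimicking it):
       each thread gets its own copy of the variable. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    static __thread int counter;   /* one instance per thread */

    static void *bump(void *name)
    {
      for (int i = 0; i < 3; i++)
        counter++;
      printf("%s: counter = %d\n", (const char *) name, counter);  /* always 3 */
      return NULL;
    }

    int main(void)
    {
      pthread_t a, b;

      pthread_create(&a, NULL, bump, "thread A");
      pthread_create(&b, NULL, bump, "thread B");
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      return 0;
    }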
I quickly gave up on the idea of rolling my own similar implementation; doing this correctly for all the various kernel interfaces would be too much work.
Even with a single client and one destination, without using threads, I could measure about a 10% performance increase (55k msg/sec -> 60k msg/sec) on my development laptop. This is certainly promising.