rsyslog Archives - Page 21 of 42

2009-05-08

Can “more reliable” actually mean “less reliable”?

On the rsyslog mailing list, we currently have a discussion about how reliable rsyslog should be. It circles about a small potential window of message loss in the case of sudden power failure. Rsyslog can be configured to put all messages into a disk queue (instead of main memory), so these messages survive such a powerfail condition. However, messages dequeued and scheduled for processing during the power outage may be lost.

I now consider a case where we have bursty UDP traffic and rsyslog is configured to use a disk-only queue (which obviously is much slower than an in-memory queue). Looking at processing speeds, the max burst rate is limited by using an ultra-reliable queue. To avoid using UDP messages, a second instance could be run that uses an in-memory queue and forwards received messages to the one in ultra-reliable mode (that is with the disk-only queue). So that second instance queues in memory until the (slower) reliable rsyslogd can now accept the message and put it into the reliable queue. Let’s say that you have a burst of r messages and that from these burst only r/2 can be enqueued (because the ultra reliable queue is so slow). So you lose r/2 messages.

Now consider the case that you run rsyslog with just a reliable queue, one that is kept in memory but not able to cover the power failure scenario. Obviously, all messages in that queue are lost when power fails (or almost all to be precise). However, that system has a much broader bandwidth. So with it, there would never have been r messages inside the queue, because that system has a much higher sustained message rate (and thus the burst causes much less of trouble). Let’s say the system is just twice as fast in this setup (I guess it usually would be *much* faster). Than, it would be able to process all r records.

In that scenario, the ultra-reliable system loses r/2 messages, whereas the somewhat more “unreliable” system loses none – by virtue of being able to process messages as they arrive.

Now extend that picture to messages residing inside the OS buffers or even those that are still queued in their sources because a stream transport blocked sending them.

I know that each detail of this picture can be argued at length about.

However, my opinion is that there is no “ultra-reliable” system in life, only various probabilities in losing messages. These probabilities often depend on each other, what makes calculating them very hard to impossible. Still, the probability of message loss in the system at large is just the product of the probabilities in each of its components. And reliability is just the inverse of that probability.

This is where *I* conclude that it can make sense to permit a system to lose some messages under certain circumstances, if that influences the overall probability calculation towards the desired end result. In that sense, I tend to think that a fast, memory-queuing rsyslogd instance can be much more reliable compared to one that is configured as being ultra-reliable, where the rest of the system at large is badly influenced by this (the scenario above).

However, I also know that for regulatory requirements, you often seem to need to prove that a system may not lose messages once it has received them, even at the cost of an overall increased probability of message loss.

My view of reliability is much the same as my view of security: there is no such thing as “being totally secure”, you can just reduce the probability that something bad happens. The worst thing in security is someone who thinks he is “totally secure” and as such is no longer actively looking at potential issues.

The same I see for reliability. There is no thing like “being totally reliable” and it is a really bad idea to think you could ever be. Knowing this, one may begin to think about how to decrease the overall probability of message loss AND think about what rate is acceptable (and what to do with these cases, e.g. “how can they hurt”).

2009-04-27

Levels of reliabilty

We had a good discussion about reliability in rsyslog this morning. On the mailing list, it started with a question about the dynafile cache, but quickly morphed into something else. As the mailing list thread is rather long, I’ll try to do a quick excerpt of those things that I consider vital.

First a note on RELP, which is a reliable transport protocol. This was the relevant thought from the discussion:

I’ve got relp set up for transfer – but apparently I discovered
that relp doesnt take care of a “disk full” situation on the receiver
end? I would have expected my old entries to come in once I had cleared the disk space, but no… I’m not complaining btw – just remarking that this was an unexpected behaviour for me.

That has nothing to do with RELP. The issue here is that the file output writer (in v3) uses the sysklogd concept of “if I can’t write it, I’ll throw it away”. This is another issue that was “fixed” in v4 (not really a fix, but a conceptual change).

If RELP gets an ack from the receiver, the message is delivered from the RELP POV. The receiving end acks, so everything is done for RELP. Some thing if you queue at the receiver and for some reason lose the queue.

RELP is reliable transport, but not more than that. However, if you need reliable end-to-end, you can do that by running the receiver totally synchronous, that is all queues (including the main message queue!) in direct mode. You’ll have awful performance and will lose messages if you use anything other than RELP for message reception (well, plain tcp works mostly correct, too), but you’ll have synchronous end-to-end. Usually, reliable queuing is sufficient, but then the sender does NOT know when the message was actually processed (just that the receiver enqueued it, think about the difference!).

This explanation triggered further questions about the difference in end-to-end reliability between direct queue mode versus disk based queues:

The core idea is that a disk-based queue should provide sufficient reliability for most use cases. One may even question if there is a reliability difference at all. However, there is a subtle difference:

If you don’t use direct mode, than processing is no longer synchronous. Think about the street analogy:

http://www.rsyslog.com/doc-queues_analogy.html

For synchronous, you need the u-turn like structure.

If you use a disk-based queue, I’d say it is sufficiently reliable, but it is no longer an end-to-end acknowledgement. If I had this scenario, I’d go for the disk queue, but it is not the same level of reliability.

Wild sample: sender and receiver at two different geographical locations. Receiver writes to database, database is down.

Direct queue case: sender blocks because it does not receive ultimate ack (until database is back online and records are committed).

Disk queue case: sender spools to receiver disk, then considers records committed. Receiver ensures that records are actually committed when database is back up again. You use ultra-reliable hardware for the disk queues.

Level of reliability is the same under almost all circumstances (and I’d expect “good enough” for almost all cases). But now consider we have a disaster at the receiver’s side (let’s say a flood) that causes physical loss of receiver.

Now, in the disk queue case, messages are lost without the sender knowing. In direct queue case we have no message loss.

And then David Lang provided a perfect explanation (to which I fully agree) why in practice a disk-based queue can be considered mostly as reliable as direct mode:

> Level of reliability is the same under almost all circumstances (and I’d
> expect “good enough” for almost all cases). But now consider we have a
> disaster at the receiver’s side (let’s say a flood) that causes physical loss
> of reciver.

no worse than a disaster on the sender side that causes physical loss of the sender.

you are just picking which end to have the vunerability on, not picking if you will have the vunerability or not (although it’s probably cheaper to put reliable hardware on the upstream reciever than it is to do so on all senders)

> Now, in the disk queue case, messages are lost without sender knowing. In
> direct queue case we have no message loss.

true, but you then also need to have the sender wait until all hops have been completed. that can add a _lot_ of delay without nessasarily adding noticably to the reliability. the difference between getting the message stored in a disk-based queue (assuming it’s on redundant disks with fsync) one hop away vs the message going a couple more hops and then being stored in it’s final destination (again assuming it’s on redundant disks with fsync) is really not much in terms of reliability, but it can be a huge difference in terms of latency (and unless you have configured many worker threads to allow you to have the messages in flight at the same time, throughput also drops)

besides which, this would also assume that the ultimate destination is somehow less likely to be affected by the disaster on the recieving side than the rsyslog box. this can be the case, but usually isn’t.

That leaves me with nothing more to say ;)

2009-04-01

rsyslog going to outer space

Rsyslog was designed to be a flexible and ultra-reliable platform for demanding applications. Among others, it is designed to work very well in occasionally connected systems.

There are some systems that are inherently occasionally connected – space ships. And while we are still a bit away from the Star Trek way of doing things, current space technology needs a “captain’s star log”. Even for spacecraft, it is important when and why systems were powered up, over- or under-utilized or malfunction (for example, due to “attack” not of a Klingon, but a cosmic ray). And all of this information needs to be communicated back to earth, where it can be reviewed and analyzed. For all of this, systems capable of reliable transmission in a disconnected environment are needed.

Inspired by NASA’s needs, the Internet Resarch Task Force (the research branch of the IETF) is working on a protocol named DTN, usually called the interplanetary Internet.

As we probably all know, Microsoft Windows flies on the Space Shuttle. And, more importantly, Linux also did. With the growing robustness of Open Source, future space missions will most probably contain more Linux components.

This overall trend will also be present in NASA’s and ESA’s future Jupiter mission. There is a lot of information technology on the upcoming spacecraft, and so there is a lot of things worth logging. While specialized software is usually required for spacecraft operations, it is considered the rsyslog as the leading provider of reliable occasionally connected logging infrastructures may extend its range into the solar system. It only sounds logical to use all the technology we already have in place for reliable logging even under strange conditions (see “reliable forwarding“). Of importance is also rsyslog’s speed and robustness.

As a consequence, we have today begun to implement the DTN protocol for the interplanetary Internet. That will be “omdtn” and is available as part of the rsyslog spaceship git branch. This branch is available as of now from the public git repository.

We could also envision that mission controllers will utilize phpLogCon to help analyze space craft malfunction. A very interesting feature is also rsyslog’s modular architecture, which could be used to radiate a new communication plugin up to the space ship, in case this is required to support some alien format. This also enables the rsyslog team to provide an upgrade to the Interstellar Internet, should this finally be standardized in the IETF. If so, and provided the probe has enough consumables, it may be in the best spot to work as a stellar relay between us and whoever else.

2009-03-27

is freshmeat now dead?

I used freshmeat.net -both as an user and a project author- for several years and like the clean and efficient interface. Now, they have revamped the whole thing and I have to admit I personally think they screwed up while doing so.

First of all, a project has a structure that consists of various branches, each of them coming in different versions (see my post on the rsyslog family tree. In the old interface, you had branches and versions, and everyone could clearly see what belonged to where. In the new interface (as I understand it), you have a bunch of links that you can label. So I now have to deal with a flat structure and labels. This is NOT how software grows. And as this no longer is a real-world abstraction, it has become quite complicated to assign meaningful values. Not to mention that the big bunch of links is probably quite confusing to users.

I’ll probably deal with that by removing all but the development branches. Better to have consistent information than to have everything…

I also miss the statistics counters. They provided some good insight into what users where interested in and what effect releases had. Very valuable for me as an author, but also valuable for me as a user, for example, when I want to judge how active a project is. Freshmeat promised (on March, 15th) to bring back statistics “in a few days”, but today (March, 27th), they are still missing. And if they eventually appear and follow the rest of the design paradigm, I am skeptical if there is really value in them.

All in all, I am very dissatisfied. I am sad to have lost a valuable open source resource. So what to do now? Sourceforge again – don’t like that either. Ohloh? Maybe. Probably it’s best to concentrate on our own web resources… But first of all, I’ll wait a couple of days/weeks and hope that freshmeat will become usable again. But please don’t expect too many announcements on freshmeat from me for the time being.

There is also an interesting discussion thread on the new freshmeat design, I suggest to give it a read (you’ll also find that others like it!)

2009-03-23

rsyslog “family tree”

I have created a rsyslog “family tree” showcasing how the various branches and versions go together. It is a condensed graph of the git DAG and shows a few feature branches as an example. I personally think it provides a good overview of how rsyslog work progresses (click picture for larger version).

In red is the git master branch, blue are currently supported stable branches. Branch head “v1-stable” is dotted, because it is no longer officially supported. Dashed nodes are versions on feature branches, solid nodes are versions on main branches. Solid lines are direct ancestors, dashed lines indicate that there are some versions in between. Lots of feature branches have not been show. Bug fixes are typically applied to the oldest code experiencing the problem and then merged into the more recent versions, thus the code flow for bug fixes is kind of reverse. This bug fixing code flow is not shown inside the graph.

Note that you can use gitk to create the full DAG from the git archive. The purpose of my effort is to show the relationships that are well-hidden in gitk’s detailled view.

I have written a much more elaborate post about the “evolution of software“, unfortunately, it is available currently only in German (with very questionable results by Google Translate).

2009-03-172018-06-11

Why is there still PRI in a syslog message?

This is the first of a couple of blog posts I intend to do in response to Raffy’s post on syslog-protocol. I am very late, but better now than never. Raffy raised some good points. To some I agree, to some not and for some others it is probably interesting to see why things are as they are.

The bottom line is that this standard – as probably every standard – is a compromise of what could be agreed on by a larger group of people and corporate interests. Reading the IETF mailing list archives will educate much about this process, but I will dig out those interesting entry points into the mass of posts for you.

I originally thought I reply with a single blog post to Raffy. However, this tends to be undoable – every time I intend to start, something bigger and more important comes into my way. So I am now resorting to more granualar answers – hopefully this work.

Enough said, on the the meat. Raffy said:

Syslog message facility: Why still keeping this? The only reason that I see people using the facility is to filter messages. There are better ways to do that. Some of the pre-assigned groups are fairly arbitrary and not even really implemented in most OSs. UUCP subsystem? Who is still using that? I guess the reason for keeping it is backwards compatibility? If possible, I would really like this to be gone.

Priority calculation: The whole priority field is funky. The priority does not really have any meaning. The order does not imply importance. Why having this at all?

And I couldn’t agree more with this. In my personal view, keeping with the old-style facility is a large debt, but it was necessary to make the standard happen. Over time, I have to admit, I even tend to think it was a good idea to stick with this format, it actually eases transition.

Syslog-protocol has a long history. We thought several times we were done, and the first time this happened was in November, 2005. Everything was finalized and then there was a quite unfortunate (or fortunate, as you may say now ;)) IETF meeting. I couldn’t attend (too much effort to travel around the world for a 30-minute meeting…) and many other WG participants also could not.

It took us by surprise that the meeting agreed the standard was far from ready for publishing (read the meeting minutes). The objection raised a very long (and productive, I need to admit) WG maling list discussion. To really understand the spirit of what happened later, it would be useful to read mailing list archives starting with November, 14th.

However, this is lots of stuff, so let me pick out some posts that I find important. The most important fact is that backward compatibility became the WG charter’s top priority (one more post to prove the point). Among others, it was strongly suggested that both the PRI as well as the RFC 3164 timestamp be preserved. Thankfully, I was able to proof that there was no common understanding on the date part in different syslog server (actually, the research showed that nothing but PRI is common among syslogds…). So we went down and decided that PRI must be kept as is – to favor compatibility.

As I said, I did not like the decision at that time and I still do not like the very limited number of facilities that it provides to us (actually, I think facility is mostly useless). However, I have accepted that there is wisdom in trying to remain compatible with existing receivers – we will stick with them for a long time.

So I have to admit that I think it was a good decision to demand PRI begin compatible. With structured data and the other header fields, we do have ways of specifying different “facilities”, that is originating processes. Take this approach: look at facility as a down-level filtering capability. If you have a new syslogd (or write one!) make sure you can filter on all the other rich properties and not just facility.

In essence, I think this is the story why, in 2009, we still have the old-style PRI inside syslog messages…

2009-03-12

How Software gets stable…

I have received a couple of questions the past days if this or that rsyslog feature can be introduced into the stable branch soon. So I thought it is time to blog about what makes software stable – and what not…

But let me first start by something apparently unrelated: let me confess that, from time to time, I like to enjoy some good wine (Californian Merlot and Cabernet especially – ask my for my mailing address if you would like to contribute some! ;)). And at some special occasions, I spend way to much money just to get the “old stuff”: those nice wines that have aged in oak barriques. To cut a long story short, those wines are stored in barrels not only for storage, but because the exposure to the oak, as well as some properties of the storage container, interact with the wine and make it taste better. Wikipedia has the full story, and also this interesting quote:

The length of time that a wine spends in the barrel is dependent on the varietal and style of wine that the winemaker wishes to make. The majority of oak flavoring is imparted in the first few months that the wine is in contact with oak but a longer term exposure can affect the wine through the light aeration that the barrel allows which helps to precipitate the phenolic compounds and quickens the aging process of the wine.[8] New World Pinot noir may spend less than a year in oak. Premium Cabernet Sauvignon may spend two years. The very tannic Nebbiolo grape may spend four or more years in oak. High end Rioja producers will sometimes age their wines up to ten years in American oak to get a desired earthy, vanilla character.

Read it again: “High end Rioja producers will sometimes age their wines up to ten years in American oak to get a desired earthy, vanilla character.“

So what would the Riojan winemaker probably say if you asked him for a great 2008 wine (we are in early 2009 currently, just for the records)? How about “Be patient, my friend – wait another 9 years, and you can enjoy it!” And what if you begged him you need it now, immediately? “I am sorry, but I can’t accelerate time…“. And if you told him you really, really need it because otherwise you can not close an important business deal? Maybe he says “Listen my friend. Some things simply need time. You can’t hurry them. But if you need to have something that can’t really exist, I can get you a bottle of that wine and label it as ‘Famos Riojan 10-year aged Wine from 2008’ – but we both know what is in the bottle!“. Technically speaking, the winemaker is not even cheating – he claims that the wine is from 2008, and so how can it be aged 10 years? If anyone buys that (today), the onlooker is probably very much in fault.

As a side-note, all too often our society works in that way: someone requests something that is impossible to do, someone begs long enough until someone else cheats, everybody knows – and we all are happy (at least up to the point where the cheat gets us into real trouble… – pick your favorite economic crisis to elaborate).
The moral from the story? Some things need time. And you can’t replace time by anything else. If you want to have the real taste of a wine aged 10 years in oak… you need 10 years.

By now you probably wonder what all of this has to do with software. A lot! Have you ever thought what makes software stable? In closed source, you hopefully have a large testing department that helps you nail down bugs. In open source, you usually do not have many of these folks, but you have something much better: a community of loyal users eager to break their systems with the latest and greatest of what you happen to have thrown together ;)

In either case, you start with a relatively unstable program and with each bug report (assuming you fix it), the software gets more stable. While fixing bugs, however, you may introduce new instabilities. The larger the fix, the larger the risk. So the more you change, the larger the need to re-test and the larger the probability that while one issue is fixed one (or more!) issues have been newly created. For very large fixes, you may even end with a much worse version of the software than you had before.

Thankfully, a patch to fix a bug is usually much smaller than what was fixed. Often, it is just a few lines of code, so the risk to worsen things is low. Why is the patch usually just a few lines long? Simply because you fix some larger thing that usually works quite well. So you need to change some details which were not properly thought out and thus resulted in wrong behavior (if you made a design error, that’s a different story…).

So the more bug reports you get, and the more of them you fix, the more stable a software gets. You may have seen some formal verifications in computer science, but in practice, for most applications, this is the simple truth on how things work.

Now to new features: features are usually the opposite from a bugfix: introducing a new feature tends to be a larger effort, touching much more code and adding code where code never has been ;) If you add new features, chances are great that you introduce new bugs. So with each feature added, you should expect that the stability of your code decreases (and, oh boy, it does!). So how to iron out these newly introduced bugs? Simply wait for bug reports, fix them, wait for more – until you have reached at least a decent level of stability (aka “no new/serious bug reports received for a period of n days, whatever you have n defined to be).

And what if you then introduce a new feature? I guess by now you know: that’ll decrease stability so you need to iterate through the bugfixing process … and so on.

But, hey, we are doing open source. I *love* to add features every day! Umm… I guess my program will never reach a decent level of stability. Bad…

What to do? Taking a long vacation (seducing…) is not a real solution. Who will fix bugs while I am away (shame on me for mentioning this…)? But a pattern appears if you follow this thought: what you need to do to make a program stable is fix bugs for a period of time but refrain from adding new features!

Thanks to git, this can easily be done: you simply create one code branch for a version that shall become stable, and create another branch for the version where you create new features (the development branch). With a bit of git vodoo, you can even import fixes from your stabilizing branch to the development branch. Once you are happy with the stability of your code (in the stabilizing branch), you are ready to declare it to be stable! For that, you’ll probably have a separate branch. Then, you can start the game again: copy the state of your development branch to the stabilizing branch, do not touch that branch except for bug fixes and continue adding new features to the development branch. Iterate this as long as you are interested in your project.

This, in short form, is how rsyslog is created. Currently, there are four main branches, plus a number of utility branches that aid the development of specific features (let’s ignore them in this context here): we have the development (also called “master”) branch which equates to the … yes… development branch from the sample above;). The stabilizing branch is called “beta” in rsyslog terms. Then, we have a v2-stable and a v3-stable branch. Both are actually stable, but v2-is probably even more stable because it has – except for bug fixes – not been touched for many months more. It also has the fewest features, so it is probably the best choice if you are primarily interested in stability and do not need any of the new features. As rsyslog is further developed, we will add extra stable branches (e.g. there will probably be a v4- and v5-stable branch – but we may also no longer maintain v2-stable at this point because nobody uses it any longer [just like dinosaurs are no longer maintained ;)]).

Did you read carefully? Did you get the message? So let me ask:
What makes software stable?

Bug fixes? Testing? Money (yes, yes, please throw at me!)?

REALLY? Let me repeat:
WHAT MAKES SOFTWARE STABLE?

There is only one real ingredient and that is: TIME! Just like good wine, software needs to age. Thankfully, age, for software, is defined in number of different test cases. So money can accelerate aging of software (as some chemistry guru may be able for wine, probably with the same side-effects…). But for the typical open source project, stability simply goes along with the rate at which the community adopts new releases, tests them AND submits bugs, so that the authors can work on fixing broken things.

And what is the moral of the story? Finally, I am coming back to the opening questions: there is nothing but time that make rsyslog stable. So if you ask me to add a feature today, and I do, you can not expect it to be immediately stable – simply because this is not how things work (thanks, btw, for trusting so much in my programming abilities ;)). The new feature needs to go through all the stages, that is it must be applied to the current development build (otherwise we would de-stabilize the current beta, what is not desirable). Then, this is migrated to the stable build over time, where it can finally fully stabilize and, whenever the bug rate seems to justify this, it can move on to the stable build. For rsyslog, this typically means between three to four, sometimes more month are needed before a new feature hits the stable branches. And there is little you can do against that.

“But… hey, I need a stable version of that cool feature now! My manager demands it. Hey, I’ll also pay you for it…” Guess what? I can do the same the winemaker did. Of course, and if you ask really nicely, I can create a v3-stable-cool version for you, which is a version with the cool feature that I have declared immediately stable (btw, it’s mostly the same thing that all others just cal l “the beta”). If that satisfies your boss, I’ll happy to do. But we both know what you have gotten… ;)

Of course, I am exaggerating a bit here: in software, we can somewhat increase the speed of stabilizing by adding testers. Money (and even more motivation) can do that. We can also backport single new features to so-far stable branches (note the fine print!). This reduces the stability a bit, but obviously not as much as for the development version. However, this requires effort (read: time and/or money) and it may be impractical for many features. Some features simply rely on others that were newly introduced in that development version and if you backport the whole bunch of them, you’ll have something as much changed as the development version, but in an environment where the component integration is not as well tested and understood. Of course, some company policies (seem to) force you to do that. If so, the end result is that you have a system that is much less stable than the development version, but has a seemingly “stable” label. Wow, how cool! As the common sense says says: “everyone gets what one asks for” ;)

So what is the bottom line? Good software and good wine has something in common: time to ripen! Think about this the next time to ask me to offer a new feature as part of a stable branch. Its simply impossible. But, of course, you can bribe me to stick that “stable” label onto a mangled-with version…

2009-03-12

Platform importance for rsyslog

If you follow my blog or the rsyslog mailing list, you probably already know that rsyslog is available on a number of platforms. Thanks to contributors, rsyslog runs on BSD and is seen on Solaris and HP-UX too. The later two are not real ports yet and each of them has their restrictions. Also, I’d like to see support for AIX, but was not even able yet to obtain a compile platform.

HOWEVER… as much as I desire multi-platform support, it is the truth that rsyslog stems from and is fueled by the Linux community. This is where the major contributions come from and this is also where the major interest originates. Plus, this is the only truly free platform, so it lives up to the same spirit that rsyslog has.

When it comes to putting effort into the project, I have limited resources. Naturally, I put those resources to where they create the most effect. For that reason, most of the development is focused towards Linux (followed by BSD, where there is also an active community). Solaris and friends live mostly in the corporate world and so questions asking for rsyslog on these platforms mostly come from for-profit organizations. And there are very few of these requests. So I can not give them priority, because they do not benefit the project sufficiently large. HOWEVER, if the corporations put some money up and sponsor development, that is definitely in the interest of the project, because it allows us to grow and the sponsorship will probably allow us to do other things as well. Everyone benefits.

Once a platform is implemented, it must be maintained. Obviously, there is little point in orphaning a platform that we already run on. But for platforms with little interest, it is probably not justified to test each and every new release (just think of the testing time required). I’d call those platforms “tier 2” platforms and think I can look at them only in response to a problem report. Of course, we offer rsyslog support contracts and if a sufficiently large number of users decide to purchase these contracts (extremely low numbers today, to phrase it politely) and these purchasers are interested in e.g. Solaris, we will most probably change priorities and all out of sudden Solaris will become “tier 1”. Of course, this may push away some community-requested work, but again I think this is in the overall interest of the project: if we can secure continuous funding, not only from one source (Adiscon), but many, we can be much more sure we can implement more and more cool things in the future.

I hope this clarifies my position on the importance of the various platforms for rsyslog and how I will handle them.

Oh, and one final note: if a platform requires me to even purchase hardware (Solaris/Sparc for example), I will not do that unless someone donates a machine (NOT LEND it, but donate, so that at least for the next three years I can ensure maintaining rsyslog on it – a virtual machine, of course, is sufficient if you happen to have some inside a cloud ;)). It would be just plainly silly to put real money at supporting a community that does not contribute back ;)

2009-03-12

rsyslog video tutorials…

I started thinking about video tutorials a few days ago. Videos are cool and more and more people use them. So why not create a couple of them for rsyslog?

The idea is simple and I think it will work equally well for teaching both conceptual topics as well as practical “how to” types of problems. The later probably works even better…

I could investigate, design and build my tutorial in a perfect way. The result would obviously be very useful and perfect – but most probably there never would be any result due to time constraints and priorities. With this on my mind, I created a very first trial tutorial this morning, all in all in less than an hour. It took me some more minutes to get it up on the web site, but this effort will never again be required.

The question this trial shall answer is: is it possible to create something useful (not perfect) in little time? My personal feeling is mixed. I think one notices quickly that the material is not as much organized as you would expect from a talk. Also, some additional slides would definitely have enhanced the usefulness – but also increased production time very much. On the other hand, I think some information is conveyed by the presentation. And, even better, information that you can not obtain with reasonable effort from any other place.

So: is it useful or not? What could improve the usefulness without causing a large increase in production time? Does it make sense to create sub-optimal content but be able to create it as it can quickly be done? If so, which other topics would you like to see covered?

Please have a look at the rsyslog message flow video tutorial and let me know your thoughts!

2009-03-06

rsyslog and solaris

This week, I had the opportunity to work a bit on rsyslog on Solaris. Most importantly, I could set up a compile and test environment (*not* that easy if you don’t know your way around Solaris…) and have integrated those patches that folks have sent over time (unfortunately I have lost many of the contributor names, so if you are among them please let me know for proper credits!).

I was able to integrate those patches and make sure that they don’t break the linux build (I am still a bit in the verification process, but it looks good). I have created a solaris branch in git and will in the future keep solaris-specific additions in that branch. I will merge that branch back into the master branches every time I am confident enough that it doesn’t break anything in the main stream build.

I was satisfied to see that not that many changes were required for a Solaris build. So the initial effort, some month ago, seems to have paid well. I have seen that the solaris git branch compiles, but I have not done any serious testing on Solaris. Still, I am short on time and I have to admit I have spent more time on it this week than I should. So testing is off-limits for now…

However, I got some good impression on what it takes to make rsyslog really run on Solaris. First of all, even gcc4 does not provide the atomic instructions that it is used to provide on Linux. This case is not really handled in the code, so the end result is that the binary will be racy. I guess it will run, but it will have subtle issues on high-volume log servers and/or serves that run asynchronous action queues. Especially if the later is used, I’d expect rsyslogd to segfault every now and then (but without async actions it should not be that bad, at least I think).

There also still does no kernel input plugin exist (or an imklog driver). I also guess there may be issues with the local log socket. I’d still caution everybody to be very, very careful when experimenting with the local log socket. I remember earlier testing where rsyslogd simply destroyed the socket but never was able to re-create it. Some other tweaks are probably required to core and runtime files. Some compiler messages point into that direction (and part of that may even be nasty).

I have compiled only the bare essentials, without TLS, database drivers or anything else fancy. I expect some mild to moderate problems with them, too.

So in short, the current code base is probably be used to run a relatively stable syslog relay or file-only receiver. I wouldn’t put it in too much production, though. For folks interested in rsyslog on Solaris, we now at least have a version again that can be build and serve as a basis for extension. I am glad I could do that.

As a side-note, I am still looking for sponsors of a full rsyslog Solaris porting effort. If you would like to sponsor (or know someone who does), just mail me and I’ll help settle the dirty details ;)

I hope this update – and the progress made – on rsyslog on Solaris is useful for a couple of folks.