Thursday, May 24, 2007

Reputations and Accountability

The internet is a wonderful tool, but the reality is that it's a hostile environment. There are a lot of bad actors out there trying to cause mischief, and the internet provides them with a vast playground.

In the spam arena a lot of work has gone into RBLs (lists of naughty IP addresses), but I believe it's time we took this to the next level. All spam filtering companies collect per-IP statistics, so identifying the individual sending hosts isn't terribly difficult - but we can do better.

IP addresses are assigned through registries like ARIN, and you can look this information up for any given IP. This ties the IP to a network that was assigned to a specific entity (and possibly delegated further) - what this represents is the chain of accountability for that IP space. It is time to start getting really serious about combining the ARIN data with our spam statistics and lighting a more serious fire under all network owners.
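
Just to make the idea a little more concrete, here is a rough perl sketch of what I mean - it shells out to the standard whois client to find the network owner, then combines that with spam counts to produce a score. The IPs, the spam numbers and the scoring formula are all invented for illustration; a real system would pull from your actual filtering statistics.

  #!/usr/bin/perl
  # Toy reputation sketch: tie an IP's spam volume back to the network
  # owner via whois, then compute a naive score. The %spam_hits numbers
  # and the scoring formula are made up for illustration only.
  use strict;
  use warnings;

  my %spam_hits = (            # spam messages seen per IP (hypothetical stats)
      '192.0.2.10' => 4200,
      '192.0.2.11' => 15,
  );

  for my $ip (sort keys %spam_hits) {
      # Grab the first owner-ish line from the whois output (field names
      # vary between registries, so this is deliberately loose).
      my ($owner) = grep { /^(OrgName|org-name|descr):/i } `whois $ip`;
      $owner = defined $owner ? (split /:\s*/, $owner, 2)[1] : 'unknown';
      chomp $owner;

      # Naive score: the more spam we see, the worse the reputation.
      my $score = int(log($spam_hits{$ip} + 1) * 10);
      printf "%-15s %-35s reputation penalty: %d\n", $ip, $owner, $score;
  }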

We need a new generation of publicly available tools for holding these organizations to account. My expertise is spam fighting, but this holds just as true for security threats - networks that originate hostile attacks need to be held to account just as much as the spam networks do. ARIN gives us physical addresses and possible company names - add in some other databases, start seriously applying reputation scores, and get those scores in the public eye. Some parts of the internet are always going to be cesspools - let's identify them and build a framework that responsible network administrators can use to start walling off the worst of it.

I would particularly like to see a reputation score like this displayed prominently in Google search results for a company. Let those search results give the searcher fair warning that they are about to step into the internet slums.

Wednesday, May 16, 2007

Nagios

Another application I would like to draw your attention to is Nagios. Nagios is an open source monitoring tool with all the bits and pieces you need to keep tabs on your environment.

One of the first complex applications I built was a suite of monitoring programs I called Imp (never publicly released). For its time it was nice, but it was all command line driven and extending it wasn't the friendliest thing. Then one year we came across Nagios, found it was an exceptional replacement across the board, phased Imp out and moved to Nagios. I'm bringing this up to highlight an important lesson - you will come up with a brilliant tool, and then someone will come up with a better one. You need to retain objectivity and know when it's best to put your pride aside and move to the better solution.

So back to Nagios - the tool has some great logic built into it, and by supporting both active and passive checks it lets you monitor anything, even behind NAT. The main genius of Nagios is the ease of writing checks. The plugin interface is a very simple standard (a line of status text on STDOUT plus a simple exit code), and NRPE/NSCA do all the network heavy lifting. You end up with a script you can run on its own that tells you whether the thing you're checking is working or not - all in all it makes building and debugging new checks a snap.
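
Just to show how little it takes, here's roughly what one of these checks looks like - this one watches the load average on a Linux box, and the thresholds are just numbers I picked for the example. A check prints one line of status and exits 0 for OK, 1 for WARNING, 2 for CRITICAL or 3 for UNKNOWN; NRPE or NSCA take care of getting the result back to the server.

  #!/usr/bin/perl
  # check_load_simple - minimal Nagios-style plugin sketch.
  # Prints one status line and exits 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
  # Thresholds are arbitrary examples; /proc/loadavg assumes Linux.
  use strict;
  use warnings;

  my ($warn, $crit) = (4, 8);

  open my $fh, '<', '/proc/loadavg' or do {
      print "UNKNOWN - cannot read /proc/loadavg: $!\n";
      exit 3;
  };
  my ($load1) = split ' ', <$fh>;
  close $fh;

  if ($load1 >= $crit) {
      print "CRITICAL - load average is $load1\n";
      exit 2;
  }
  elsif ($load1 >= $warn) {
      print "WARNING - load average is $load1\n";
      exit 1;
  }
  print "OK - load average is $load1\n";
  exit 0;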

This was a great example of elegant design - by making checks so easy to write, it pushed us to adopt a policy: if any outage occurs that we didn't catch via monitoring, then you must write a check that will find it next time. This policy has been in place for a number of years, and we almost never experience an outage of any kind without the monitoring service letting us know about it.

Another note I'd like to bring up is the value of using RRD heavily (with Cacti if you'd like) alongside Nagios. Graphs are great, but they tend to just grow in the corner and no one looks at them. Nagios gives you an easy mechanism to create alerts that make an admin go look at those graphs - ideally long before anything is service impacting.

RRDtool

Understanding what is really going on with a complex system is a tough job - fortunately there is a relatively easy to use tool that can help you paint a visual picture.

That tool is RRDtool, written by Tobias Oetiker. An RRD is a "Round Robin Database". The principle is that you collect data over time, and the most interesting data is recent, but you still want to know about historic data in general terms. What this means is that you can define a data set to have, say, a granularity of 15 minutes for the last week, while by the time you get to looking at data from a year ago it only has a 12 hour granularity.
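
If you want to see what that looks like in practice, here is a rough sketch using the RRDs perl bindings that ship with RRDtool. The file path, data source name and measurement are hypothetical, and the RRA numbers are just one way to get the "15 minutes for a week, 12 hours for a year" behaviour described above.

  #!/usr/bin/perl
  # Sketch of "recent data is fine-grained, old data is coarse" using the
  # RRDs bindings that ship with RRDtool. Paths and the DS name are examples.
  use strict;
  use warnings;
  use RRDs;

  my $rrd = '/var/rrd/disk_io.rrd';

  unless (-e $rrd) {
      RRDs::create(
          $rrd,
          '--step', 900,                 # base step: one sample per 15 minutes
          'DS:io:GAUGE:1800:0:U',        # the metric we feed in
          'RRA:AVERAGE:0.5:1:672',       # 15-minute points kept for 1 week (672 x 15 min)
          'RRA:AVERAGE:0.5:48:730',      # 12-hour averages kept for ~1 year (730 x 12 h)
      );
      die 'RRD create failed: ' . RRDs::error if RRDs::error;
  }

  # Push the latest reading in; RRD handles the consolidation over time.
  my $current_io = get_disk_io();        # stand-in for however you measure it
  RRDs::update($rrd, "N:$current_io");
  die 'RRD update failed: ' . RRDs::error if RRDs::error;

  sub get_disk_io { return 42 }          # hypothetical measurement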

The beauty of this tool is that once you get it set up and start shoving data into it, it manages all the magic of compressing your data over time for you. Just keep pushing things in and it'll take care of the rest.

Now if this was all it did, it wouldn't be terribly useful - its real power comes from the ability to generate graphs. Instead of just seeing a giant pile of numbers in your reports, you can actually see how your operation flows over time. This can help you zero in on problem areas, particularly the ones that happen at 2 am when no one is watching all that closely.
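
Generating a graph from that same example RRD is only a few more lines - again via the RRDs bindings, with purely example file names, window and colors.

  #!/usr/bin/perl
  # Turn the RRD from the previous sketch into a picture. File names,
  # colors and the one-week window are just example choices.
  use strict;
  use warnings;
  use RRDs;

  RRDs::graph(
      '/var/www/graphs/disk_io-week.png',
      '--start', '-1w',
      '--title', 'Disk I/O - last week',
      '--vertical-label', 'ops/sec',
      'DEF:io=/var/rrd/disk_io.rrd:io:AVERAGE',
      'LINE2:io#FF0000:Disk I/O',
  );
  die 'RRD graph failed: ' . RRDs::error if RRDs::error;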

Another huge value of shoveling tons of data into these RRDs all the time is that many problems don't spring up overnight. So when that service starts flaking out and you don't know why, you can go look at your graphs and see "oh wow, the amount of disk I/O has been increasing every day for the last week" - and you can zero in on the action you need to take much more rapidly. If you aren't collecting that data, all you would know is that you are now out of disk I/O, and you might assume your only option is to add disks, because you can't see that the problem really only started a week ago when you added a poorly written application to the server that is slowly gobbling up all your I/O.
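
You can even automate that "is this creeping up?" question. Here's a rough sketch that pulls the last week back out of the example RRD above and compares the two halves - the doubling threshold and the one-week window are arbitrary choices for illustration.

  #!/usr/bin/perl
  # Rough trend check: compare the average of the last half-week to the
  # half-week before it, using the example disk_io RRD from above.
  use strict;
  use warnings;
  use RRDs;

  my ($start, $step, $names, $data) =
      RRDs::fetch('/var/rrd/disk_io.rrd', 'AVERAGE', '--start', '-1w');
  die 'RRD fetch failed: ' . RRDs::error if RRDs::error;

  my @values = grep { defined } map { $_->[0] } @$data;   # first (only) data source
  my $half   = int(@values / 2);
  my $early  = avg(@values[0 .. $half - 1]);
  my $recent = avg(@values[$half .. $#values]);

  printf "early avg %.1f, recent avg %.1f\n", $early, $recent;
  print "Disk I/O has roughly doubled over the week - go look at the graph!\n"
      if $early > 0 && $recent > 2 * $early;

  sub avg { my $t = 0; $t += $_ for @_; return @_ ? $t / @_ : 0 }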

What I'm trying to get across with this series of posts is that there are some excellent tools out there, and you can make really amazing things if you know how to find and then use them. I started this all off by talking about how I'm a perl junkie - the fact is perl is exceptional at stringing all kinds of totally unrelated applications together to make something new and uniquely useful.

On a closing note relating to RRD, I would encourage you to also take a look at Cacti. The Cacti developers have put together a pretty slick tool with tons of templates, so if you have a relatively standard environment (routers, switches, servers, etc.) odds are good you can drop Cacti in and hit the ground running with great graphs of the most important things you need to keep an eye on. You should spend your time solving new and interesting problems, not re-solving old ones.

MySQL

Next on the list is an application that shouldn't be a stranger to most - MySQL.

Before diving into it, I should note that the real powerhouse here is SQL, and there are other options such as PostgreSQL that could almost certainly do as well or better.

The reason I wanted to bring this up is that having a powerhouse database server sitting behind you when you are trying to put applications together is amazing. There are a ton of data management problems that you solve in one fell swoop just by choosing MySQL as the back end for your data.

If you have an application that needs hundreds of thousands of entries added per day, and needs to manage that insane quantity of information so you can use it in a productive way, then MySQL is a great tool.

A thing to note here - if you have an application that doesn't need any persistent data and doesn't ever need to interact with anything else, then of course you don't need a relational database added into the mix. But the really interesting projects tend to be really complicated, so I tend to lean on MySQL a lot. In my last post I talked about postfix - a key lesson I took away from postfix is that it's great to write small apps that each do targeted jobs, and MySQL is a great enabler for that approach. Say you have an application that requires real time threat analysis and triggers actions to stop hostile attackers. Using MySQL as the point of IPC (Inter Process Communication) lets you split the job up into relatively easy to manage chunks: agents that collect your data and feed it back, a series of analysis scripts that look for trends in that data, and finally scripts designed to turn that analysis into actions.
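
Here's a sketch of what the collector agent half of that might look like with perl's DBI - the database, table, columns and credentials are all hypothetical. The analysis and action pieces are then just more small programs reading from and writing to the same tables.

  #!/usr/bin/perl
  # Sketch of a collector agent in the MySQL-as-IPC approach: record what
  # was seen and let separate analysis scripts mine the table later.
  # Database, table, columns and credentials are hypothetical.
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect(
      'DBI:mysql:database=threats;host=db.example.com',
      'agent', 'secret', { RaiseError => 1, AutoCommit => 1 },
  );

  my $insert = $dbh->prepare(
      'INSERT INTO events (seen_at, src_ip, event_type, detail) VALUES (NOW(), ?, ?, ?)'
  );

  # In real life this would come from a log tail or a packet sniffer.
  while (my $event = next_observed_event()) {
      $insert->execute($event->{src_ip}, $event->{type}, $event->{detail});
  }

  sub next_observed_event { return undef }   # stand-in for the real collector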

One of the big weaknesses of relying too heavily on MySQL is that it creates a central point of failure for your application. I hate designing applications with that kind of dependency in them - fortunately there are ways to deal with this too. First, I like to use queues for data that needs to be written to a SQL server. Rather than add a dependency on the MySQL server to a mission critical service (like email delivery), I instead make my application dump its logs into a queue. A secondary program then processes that queue of log data and loads it into the SQL server. This way, if the SQL server is down, service still progresses just fine, and when it comes back up it gets all the queued logs sent over, so nothing is missed.
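
A sketch of the loader side of that queue, with hypothetical paths, table and credentials - the mail system itself only ever appends lines to the spool file, so it never cares whether the database is up.

  #!/usr/bin/perl
  # "Catch up later" loader: move queued log lines into MySQL whenever the
  # database is reachable. Paths, table and credentials are examples.
  use strict;
  use warnings;
  use DBI;
  use File::Copy qw(move);

  # Connect first - if the database is down we die here and the spool file
  # is left untouched for the next run.
  my $dbh = DBI->connect('DBI:mysql:database=maillogs;host=db.example.com',
                         'loader', 'secret', { RaiseError => 1 });

  my $spool = '/var/spool/maillog-queue/current';
  exit 0 unless -s $spool;                    # nothing queued, nothing to do

  my $batch = "$spool.loading";
  move($spool, $batch) or die "cannot claim spool file: $!";

  my $sth = $dbh->prepare('INSERT INTO delivery_log (logged_at, line) VALUES (NOW(), ?)');

  open my $fh, '<', $batch or die "cannot read $batch: $!";
  while (my $line = <$fh>) {
      chomp $line;
      $sth->execute($line);
  }
  close $fh;

  unlink $batch;                              # discard only once the inserts succeeded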

If you are using your SQL server to drive the configuration of applications, there are other tricks to get around that dependency. The most basic one is to keep all of your configuration in SQL, then write scripts that generate application specific configuration files and distribute them to your remote servers. This way you get all the wonderful advantages of SQL for managing your data, and none of your services depend on the SQL server being up in order to function.
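
Something along these lines - a hypothetical relay_domains table turned into a flat file that the service reads on its own; in practice you would then push the file out to the remote servers and reload the service.

  #!/usr/bin/perl
  # Generate an application config file from SQL so the running service
  # never needs the database to be up. Table, columns and output path are
  # hypothetical examples.
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('DBI:mysql:database=config;host=db.example.com',
                         'confgen', 'secret', { RaiseError => 1 });

  my $rows = $dbh->selectall_arrayref(
      'SELECT domain, destination FROM relay_domains ORDER BY domain'
  );

  open my $out, '>', '/etc/mail/relay_domains.new' or die "cannot write: $!";
  printf {$out} "%-40s %s\n", @$_ for @$rows;
  close $out;

  # Install atomically so the service never sees a half-written file.
  rename '/etc/mail/relay_domains.new', '/etc/mail/relay_domains'
      or die "cannot install new config: $!";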

The preceding all underscores an important reality for many applications - the application itself must have 99.99% uptime, but the ability to modify settings for that application has a whole lot more breathing room. To illustrate - how mad are you if you can't get your email? How mad are you if you can't change your password?

I'm trying to communicate a lot of best practices in these blogs - you can solve any problem a whole lot of different ways, but I would encourage you to look over these ideas and at least think of them when you are diving in to make your own "useful things".

Tuesday, May 15, 2007

Postfix - better than sendmail

I'll dive into the C# stuff as I actually get a clue what I'm talking about; in the meantime I'm going to start going over a number of the tools I've used over the years that I really like, and share some information about them.

To start this off I'd like to talk to you about postfix. Postfix is an incredibly powerful mail server designed by Wietse Venema. You can read up on its history on their site, but it is a much nicer tool to use than sendmail in every way I've ever cared about.

I ran my ISP on sendmail prior to 2002, and it worked well enough. But upgrades were a pain in the neck, security exploits were disturbingly common, and making configuration changes was horrifically complicated. Sendmail is extremely powerful, but it made things far too complicated. I like to build things I can hand to someone else so that I don't have to maintain them for all time - sendmail's configuration complexities made that very hard to do.

When we first wrote Mailarmory we put it together using milters, which are a very cool idea. The very first iteration we tried used mimedefang from Roaring Penguin Software. It was an excellent tool, but it couldn't do anything about the weaknesses of sendmail. If anything, the situation was made worse by adding the complexities of milters on top.

For the uninitiated, milters give you enormous amounts of control over everything that happens in an SMTP conversation. They could let you do something like rejecting all mail at the door that has the word "monkey" in it, and send a response like "550 Server does not permit the discussion of monkeys".

I really don't want to trash talk sendmail - it's a great application, and many people have had great luck with it. But postfix was designed by someone who took a really good look at the weaknesses of sendmail and built it better. One of the first things I'd like to draw attention to is the compartmentalized nature of postfix. Sendmail is basically one executable that does everything remotely related to mail. By contrast, postfix has lots of smaller daemons that each just do their little part. This compartmentalization has allowed each of these little daemons to be optimized and very well secured. It's hard to make huge applications immune to exploitation; it's relatively easy to secure simple programs.

One of the big differences between sendmail and postfix is how the message queues themselves are handled. Postfix now allows "before queue" filtering via the recently added postfix milter support or via policy daemons, but back in 2001 neither of these existed, and the only option for postfix was after queue filtering.
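
To give you a feel for how approachable the policy daemon route is, here is a bare-bones sketch of the protocol as I understand it - postfix sends name=value attributes, a blank line ends the request, and you answer with a single action line followed by a blank line. The "no reverse DNS" rule here is just a stand-in for whatever lightweight check you'd actually want; you hook something like this in with check_policy_service, and the SMTPD_POLICY_README that ships with postfix has the real details.

  #!/usr/bin/perl
  # Minimal sketch of a postfix policy daemon: read name=value attributes
  # from STDIN, a blank line ends the request, answer with "action=..."
  # followed by a blank line. The rDNS rule is only a placeholder policy.
  use strict;
  use warnings;

  $| = 1;                                   # answers must not sit in a buffer

  my %attr;
  while (my $line = <STDIN>) {
      chomp $line;
      if ($line eq '') {                    # end of one request - decide and answer
          my $client = $attr{client_name} || 'unknown';
          if ($client eq 'unknown') {
              print "action=DEFER_IF_PERMIT Fix your reverse DNS first\n\n";
          }
          else {
              print "action=DUNNO\n\n";     # no opinion; let other restrictions decide
          }
          %attr = ();
          next;
      }
      my ($name, $value) = split /=/, $line, 2;
      $attr{$name} = defined $value ? $value : '';
  }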

This distinction is really important - once you queue a message, you have accepted responsibility for it, for better or worse. So which is better - before queue filtering or after queue filtering? The answer isn't as clear as you might think, because filtering is really expensive processing-wise, and anything that makes your SMTP conversation take longer kills your performance. Ideally you would want all content filtering to happen before the queue, and you would want it to take no time at all so that your server could handle an unlimited amount of email. The sendmail model we originally employed worked more or less like this, with the downside that adding the content filtering brought a rather serious performance hit. There was also a really nasty side effect of this approach - you are handing spammers the ability to DOS your server. By putting so much processing at the edge, you are giving them a tool to grind your servers to a halt. This hit us pretty hard in the early days as we were learning the ropes. (FYI - morally bankrupt != stupid; many spammers are quite intelligent.)

Let me try to show an example of why this can hurt so badly. If a single SMTP process can accept one message per second, then to accept 10 messages per second you need 10 SMTP processes. If you are doing before queue filtering and it now takes 5 seconds to accept each message, you need 50 SMTP processes - the number of processes you need is the incoming rate times the per-message time. Now figure in the ugly fact that a server under load does everything slower, so that 5 seconds becomes 10 seconds, which means you need 100 SMTP processes to keep up, which slows it further and ... you are having a really bad time of it. The thing to keep in mind is that when things start going bad, they go really bad really fast, and your server can handle fewer and fewer messages per second the more that tries to come in.
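
To make that feedback loop a little more concrete, here is a toy model of it in perl. Every number in it is invented; the point is just how required concurrency and per-message time chase each other upward.

  #!/usr/bin/perl
  # Toy model of the before-queue death spiral: load adds latency, latency
  # demands more processes, more processes add load. All numbers invented.
  use strict;
  use warnings;

  my $incoming_rate = 10;     # messages per second arriving
  my $base_time     = 5;      # seconds to filter one message on an idle box
  my $slowdown      = 0.1;    # extra seconds per message for each busy process

  my $processes = $incoming_rate * $base_time;   # first guess: rate x per-message time
  printf "start: %.0f SMTP processes needed\n", $processes;

  for my $round (1 .. 5) {
      my $per_msg = $base_time + $slowdown * $processes;   # load makes everything slower
      $processes  = $incoming_rate * $per_msg;             # ...so we need more processes
      printf "round %d: %.1f s per message -> %.0f SMTP processes needed\n",
          $round, $per_msg, $processes;
  }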

After queue filtering doesn't entirely solve this problem, but it gives you some great tools that help quite a bit - most particularly, it lets you control the rate at which you send messages to your filter engine, while still accepting messages from the outside world. This is still not ideal, since unprocessed mail backs up in your queues and you get some delays, but they are much more minor than what you get with before queue filtering.

So if both filter methods would let you handle, say, 10 messages/second, they behave very differently when the incoming rate goes higher than 10.

Before queue:
15 msgs/sec incoming -> can only handle 5 msgs/sec
20 msgs/sec incoming -> can only handle 1 msg/sec
100 msgs/sec incoming -> server probably crashes; at the least it will fail incoming SMTP requests

After queue:
15 msgs/sec incoming -> handles 10 msgs/sec
20 msgs/sec incoming -> handles 10 msgs/sec
100 msgs/sec incoming -> handles 10 msgs/sec

The numbers above are made up, but the impact described is very real - the lessons I'm describing here were learned the hard way.

But this isn't to say before queue filtering is not a really cool and powerful thing - remember, it lets us stop a message before we accept it into our queue, and if we block it at the door we don't have to bounce it if we later decide we don't want it. The key lesson you should walk away with is: keep all of your before queue filtering extremely lightweight, and do your heavy lifting in after queue filters.

Which brings me back to the point of this blog -> postfix puts the controls for all of this in your hands in a much easier to understand way than sendmail, and it also gives you a whole lot more options for doing whatever you want to do. So if you do things one way at first, it is often fairly straightforward to change your mind and do them a different way later.

There is a philosophy I picked up from perl that I encourage you to think about "The simple things should be easy, and the hard things should be possible". I believe postfix nails this, and it's one of the reasons I strongly endorse it.

I'd like to close on a note for you to consider: when our environment ran sendmail, I believe we had to do emergency sendmail upgrades due to late breaking remote root exploits about a dozen or so times. Since we moved to postfix, we have not made a single security related upgrade ever. Every upgrade we've done (and they are pretty easy) has been done because postfix added some cool new feature we wanted to take advantage of.

If you have any questions about anything I am writing about speak up - I know I get rambly on explaining all of this, and if you'd like to know more (or some actual details) about how to do things with postfix I'd be happy to share them as well.

Take care,
Neil

Introduction

I'm a long time user of the internet, but new to actually publishing content of any kind. I've done system administration in one form or another for about 11 years now, and know about as much as there is to know about running a unix environment for an ISP. I've accumulated a considerable amount of expertise in that time frame and I'd like to work on giving back.

I am a long time lover of the programming language perl - it's such a wonderful tool, and making it do your job for you is a lot of fun. I was introduced to the language back in '96 by Nathan Torkington - not long after converting many of us to the Perl cult, he went on to co-author the Perl Cookbook, an exceptional guide to doing Useful Things(tm).

In my pursuit of doing "Useful Things" I co-authored the Mailarmory spam and virus filtering system; its heuristic analysis engine is loosely based on the excellent work of the SpamAssassin project. My favorite aspect was designing progressively adaptive automated response systems that proactively identify spammer IPs and work to stop them.

In all of my work in perl, the one consistent limitation I have always encountered is user interface. Mostly I've built command line interfaces, and if you're a unix guru that is fine, but it really isn't a good answer. Beyond that I've built backend APIs for use by web developers to make pretty user self-management tools. I remain dissatisfied with the lot of it, though.

Which brings me to now - I'm finally catching up to the modern era and starting to learn how to code on Windows. My current pursuit is learning to use Visual Studio 2005 and write in C# with the DirectX libraries. There are a few reasons for this, but mostly it comes down to trying to maximize the possibilities for learning how to do new "useful things". I'd like to see what new doors open up when a hardcore server junkie crosses over into 3D design. I won't go into details yet, but if you have ever had to manage an enormously complicated system and couldn't wrap your head around all the details, consider what being able to view that problem in 3D could do for you.

Take Care,
Neil