Spam Detection using an Artificial Immune System

by Terri Oda, Tony White

This article was published in Crossroads Magazine, November 2004 edition. It was supposed to be on their website, but since it no longer seems to be available, I have provided this copy for reference.

Abstract

As anti-spam solutions evolve to limit junk email, the senders quickly adapt to make sure their messages are seen. This article describes the application of an artificial immune system model to effectively protect email users from unwanted messages. In particular, it tests a spam immune system against the publicly available SpamAssassin [9] corpus of spam and non-spam. It does so by classifying email messages with the detectors produced by the immune system. The resulting system classifies the messages with accuracy similar to that of other spam filters, but it does so with fewer detectors.

Introduction

"Spam" represents the electronic equivalent of junk mail. While several definitions exist, all typically include the characteristics of being unsolicited, undesired and bulk e-mail messages commonly used for advertisement. Humans are very good at finding and handling spam messages. However, as the quantity of spam and the ratio of illegitimate to legitimate messages increases, it becomes more difficult, as well as more time-consuming and costly, to have email filtered manually.

It seems logical to have a spam detector that adapts as spam changes, and an adaptive system based on Bayes' rule was proposed in 1998 [7] [6]. This model was more recently popularized by Graham [4], who reported extraordinarily good results with his Bayesian classifier. Several variations on Graham's approach, integrating other technologies, have also been implemented [8].

As an alternative, the human immune system offers an interesting metaphor: it can differentiate between the desirable (normal cellular activity) and the undesirable (invasion by pathogens such as bacteria or viruses). The immune system model lends itself reasonably well to the creation of another adaptive spam detection system [5]. Like Bayesian systems, the Artificial Immune System (AIS) described in this article contains a unique set of detectors, making it harder for junk email senders to create messages that will penetrate such systems. It can detect not only repeat messages, but also material that has not already been seen by the system. An AIS has many of the advantages seen in the Bayesian systems. In addition, it can use fairly complex heuristics as detectors, which leads to very accurate matching. For example, the word "free" may not be a good unsolicited email detector by itself, but when the system can distinguish between "free software" and "absolutely FREE!!!" it can sort e-mail more accurately. The remainder of this article provides details on the AIS-based spam detection system, including how detectors are created, how the detectors are trained to accurately model the user's email interests and how the system can continue to evolve as those interests change.

The Human Immune System

The human immune system distinguishes between self and non-self, whereas the AIS distinguishes between a self of legitimate email (non-spam) and a non-self of illegitimate email (spam). However, the biological immune system has an advantage over a spam immune system. The biological self does not change in ways that matter to the immune system since the proteins used to distinguish the self remain the same over the lifetime of the organism. Unfortunately, the spam immune system has the same problem found in computer security immune systems [3]: the self changes over time. As a person meets new friends and business contacts, discusses current issues, develops new interests, and even learns new languages, the content and meta-information for the messages that person receives will change.

This does not mean that the immune system model cannot be used for spam detection, only that the model must be used with some caution. The system must be able to forget as well as learn things. A healthy person might have no interest in pharmaceuticals, so any mail containing information about drugs could be dropped. But if that same person is diagnosed with a disease, he or she may start to have discussions which include words and phrases which formerly indicated unwanted email. Therefore, to be effective, the system must be able to adapt accordingly.

Several items are needed for a spam AIS: the input messages, the lymphocytes, and the library used to create the antibodies. (Lymphocytes and antibodies are parts of the human immune system which are replicated in the spam AIS.) The system also captures the idea of the lifecycle of a digital lymphocyte; this explains how the antibodies are created and used, as well as how they eventually die.

An immune system's main goal is to distinguish between self and potentially dangerous non-self elements. In a spam immune system, we want to distinguish desirable messages from undesirable messages. Like biological pathogens, spam comes in a variety of forms with some pathogens being only slight variations (mutations) of others. As a result, self and non-self can be translated to non-spam and spam through the identification of pathogens.

The immune system also possesses other qualities that make it attractive as a spam filter model. The systems produced are diverse, meaning that it would be difficult for a single email to be crafted to penetrate multiple filters. The biological immune system deals appropriately with non-self except in the case of exceptional diseases. (If the immune system were inaccurate, the lifespan of the average human would be much shorter as the system would mistakenly attack vital cells or fail to attack viruses and other dangerous pathogens.) As well as handling known infections, the immune system can adapt to new pathogens reasonably quickly, and it remembers what it has seen before.

What does the immune system do?

The immune system exists to protect organisms from potentially harmful pathogens, such as bacteria, viruses, and other foreign life forms and substances. The immune system accomplishes this goal by carefully distinguishing the parts of the organism protected by this immune system (self) from foreign entities (non-self). It is this classification of self and non-self that makes the immune system an appealing model for spam detection, which also requires a classification between wanted (self) and unwanted (non-self) email.

Most commercial-grade spam detection solutions use multiple techniques to achieve higher detection accuracy when compared to a single technique. Similarly, the immune system is not reliant upon a single technique; it consists of multiple layers which all help to protect the body:

  1. The Skin and Mucous Membranes form the outer layer of defense, a physical barrier against attack.
  2. Physiological Defenses such as high pH or temperature make it harder for pathogens to survive in the body.
  3. The Innate Immune System is a non-adaptive system available at birth. Its quick reaction to potential infections can stop many attacks before they can get underway, and it also serves to activate the adaptive immune system.
  4. The Adaptive or Acquired Immune System can handle invaders which have been missed by the innate immune system. It takes longer to mount a response the first time a pathogen is encountered, but after that it can detect the same or similar threats very quickly.

While all of the layers of the immune system are interesting and potentially useful as models for computer programs, the adaptive immune system proves most useful in this context.

The Adaptive Immune System

The central component of the adaptive immune system is specialized white blood cells called lymphocytes. These serve to identify and act upon those objects which are not part of the self. Rather than attacking everything, the body uses a system called the Major Histocompatibility Complex (MHC) that marks the cells of the body as self. Anything not carrying these markings may be attacked by the immune system.

When a cell is infected, the MHC molecules present fragments of surface proteins called peptides on the surface of the cell. These fragments are called antigens. The immune system checks the antigens using a specialized detector called an antibody. Each lymphocyte has many copies of the same antibody on its surface and detects pathogens when these antibodies bind to antigens. Interestingly, this binding does not have to be perfect. If the antibody and antigen match reasonably well, binding will still occur, although not as strongly as it would for a perfect match.

To apply these ideas to spam, we treat spam as a pathogen, and the complete email message is used as an antigen. The lymphocytes are digital bits of information, and each one includes a pattern that is used as an antibody.

Creation of Antibodies

Antibodies are constantly being created by the body and each associated lymphocyte can live for as little as days or as long as years. The body has a library of gene fragments which represent all the necessary information to create detectors for all possible pathogen types. In order to create the repertoire or population of lymphocytes in the body at any given time, elements from this library are randomly recombined to produce a diverse population of receptors. As seen in Figure 1, a small set of gene fragments can be used to produce many different results.

Figure 1: Random Recombination of Gene Fragments

The creation of digital antibodies for the spam AIS is also done with a library, but instead of gene fragments, this library contains fragments of text-matching patterns, also known as regular expressions.

Detection/Binding

Detection is done through binding, but it should be noted that this binding is not an exact process. One lymphocyte's antibodies may bind to many different antigens, although some will bind more closely than others.

The strength of the binding depends on how closely the shapes of the antibody and antigen match. This strength is determined not only by their three-dimensional physical shapes, but also by the charges involved which may attract or repulse different parts of the detector and detectee. A given detector will bind to many targets, and a given target might have multiple detectors that can match it.

To better understand this concept, it may help to think of the antigen as having the shape of a car. One antibody might have a shape similar to that of a "hatch" trunk, and would thus match many types of compact hatchback cars. This antibody might also match other vehicles whose components resemble this shape, like small vans with similarly-shaped rear ends. Conversely, the antibody might be unable to detect some cars with alterations in this area, like a large spoiler attached to a hatchback car. The strength of the binding is determined by how closely the receptor and the thing it is detecting can attach, including attraction of charges as well as simple geometry; a threshold level must be reached before the two are said to bind.

There are approximately 10^16 different foreign proteins which the immune system must recognize, yet the repertoire of the immune system contains a much smaller number of actual receptor types, closer to 10^8. Therefore, this approximate binding enables the immune system to use a smaller number of antibodies to detect a much larger number of potential pathogens, as long as pathogens have similar shapes.

Memory

Since lymphocytes with new antibodies are being created constantly, it seems that the immune system would need to constantly re-learn the same pathogens over and over again. However, we know that for some diseases, such as chicken pox, most people are immune for life.

One theory as to how this occurs suggests that there may be special memory cells that are created once a pathogen has been successfully detected. These are a special type of long-lived lymphocyte that stays in the body forever, which means that these lymphocytes are always available to react to an infection.

Another theory suggests that some pathogens stay in the body as low-level infections, so new lymphocytes are always being made which detect these pathogens.

Both of these theories are useful when applied to the idea of employing lymphocytes for spam. Email viruses and many spams contain substantial amounts of identical material; therefore, the recognition of identical or similar messages for extended periods of time, though not necessarily forever, is definitely desirable. After all, there's little benefit in keeping lymphocytes related to spams that you won't see again, such as those for products related to current events. In junk email, a constant level of infection exists. By combining the idea of long lived lymphocytes and constant stimulation, we can develop a system which balances the need to remember with the need to forget.

How it works

Spam detection already works in layers like the human immune system: there is a layer of legal protection [2], as well as a variety of technical protections that can be layered together. Blacklisting [11], for example, can work as one line of defense for a system that might also have other filters in place. Systems such as SpamAssassin [9] combine many different methods. The spam immune system should be able to integrate into such a layered system, but for the purpose of this paper, we look at it as a single entity.

The spam immune system works as follows:

First a library of appropriate gene fragments must be collected. Then a collection of email that is already classified as spam or non-spam should be assembled for training. Lymphocytes are generated and initially trained using a corpus containing both types of message. Each lymphocyte retains two pieces of weight information: spam_matched, the weighted number of spams matched by this lymphocyte and msg_matched, the weighted number of messages matched by this lymphocyte.

Once the system is running normally (i.e. has been trained), lymphocytes learn continually as new emails come in and are assigned a spam score by the system. Periodically, the lymphocytes are culled and aged, and new lymphocytes are created to replace those lost. The lifecycle of a digital lymphocyte is shown in Figure 2.

Figure 2: Lifecycle of a digital lymphocyte.

In this way, while the system is distinguishing self and non-self, it can also both learn and unlearn information, thus adapting to the changing nature of email.

The Library

The library consists of a set of gene fragments, each one representing a particular regular expression. The words used in spam and legitimate messages represent certain subsets of the written language.

There are advantages in speed to using a smaller library rather than one that contains every gene fragment that could possibly be used in an email. Unfortunately, there are also drawbacks to this approach. One of the most significant problems for learning occurs when a message is found that no detector matches. With an incomplete library, no combination of genes may be able to produce such a detector.

Researchers working with Bayesian-type spam systems have circumvented this problem by having detectors for every new word found in a message and combining the probabilities of the most interesting individual matches. In the future, we hope to take a similar approach for creating new detectors.

Initial gene libraries for the spam immune system can be derived from a variety of sources, including:

  1. Words from one or many languages
  2. Words found in a collection of messages (spam, non-spam, or both)
  3. Phrases found in a collection of messages
  4. Contact information in spam messages. Since many spam messages are attempting to sell something, the telephone numbers and web addresses are often constant even if the rest of the message changes.
  5. Header information
  6. Bits of Javascript and HTML code often used in spam.

The partial patterns in these gene libraries are combined to create the antibodies used by the immune system.

The library that generates the best results is one built from heuristics. By concentrating on words and phrases that are more likely to indicate a classification for the message, the system produces more "useful" detectors and can achieve results with a much smaller set of detectors. By using the valuable knowledge available about email messages, we can avoid uninformative fragments: common words such as "the" or "and" tell us little about the classification of the message.

The heuristic library is much smaller than its counterparts. The heuristics used are drawn from SpamAssassin and information from the training results of Bayes classifiers, as well as directly from the examination of spam. Some potential heuristics include:

  1. Does the message claim that you can unsubscribe from the list by replying with the word "remove" in the subject line?
  2. Does the string "NO QUESTIONS ASKED" appear anywhere in the message?
  3. Does the message claim that it is not spam?
  4. Does the message claim that the product has been seen on a well-known news source, such as a large television network?
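
Heuristics like these can be written as regular-expression gene fragments. The following is a minimal sketch in Python; the pattern wording, the names in `HEURISTIC_GENES`, and the choice to match case-insensitively are all assumptions made for illustration, not the patterns used in the experiments.

```python
import re

# Hypothetical gene fragments: each regular expression approximates one of
# the heuristics listed above. The exact wording is an assumption.
HEURISTIC_GENES = {
    "remove_in_subject": r"reply.{0,40}remove.{0,40}subject",
    "no_questions_asked": r"NO QUESTIONS ASKED",
    "claims_not_spam": r"this (message|e-?mail) is not spam",
    "as_seen_on": r"as seen on (tv|cnn|nbc|abc|cbs)",
}


def matching_heuristics(message):
    """Return the names of the heuristic genes that match the message."""
    return [
        name
        for name, pattern in HEURISTIC_GENES.items()
        # IGNORECASE and DOTALL are illustrative choices, not from the article.
        if re.search(pattern, message, flags=re.IGNORECASE | re.DOTALL)
    ]
```

For example, `matching_heuristics("This email is not spam! As seen on TV")` reports the third and fourth heuristics, while an ordinary message about free software matches none of them.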

Lifecycle

The lifecycle of a digital lymphocyte starts when the lymphocyte is created. Once the lymphocyte is created, it can be used to match messages. Eventually, as patterns of junk email change, a lymphocyte will cease to match any messages, at which point it expires and dies. A lymphocyte must continue to match spam in order to remain in the repertoire of detectors.

This lifecycle is described more precisely in the Lifecycle Algorithm.

Lifecycle Algorithm {Spam Immune System}
BEGIN
  repertoire <- {} {Initialize repertoire (list) of lymphocytes to be empty}
  update_time <- time of next lymphocyte update {e.g. 10 days from now}

  Generate lymphocytes (See Generation Algorithm)
  Do initial training (See Training Algorithm)
  WHILE{Immune System is running}
    IF{message is received}
      Apply lymphocytes (See Apply Algorithm)
    ENDIF
    IF{current time > update_time}
       Cull lymphocytes (See Culling Algorithm)
       Generate lymphocytes to replace those culled (See Generation Algorithm)
       update_time <- time of next lymphocyte update
    ENDIF
  ENDWHILE
END  
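
The main loop above might be sketched in Python as follows. This is only an illustration under stated assumptions: messages arrive as time-ordered `(timestamp, message)` pairs rather than from a live mail queue, generation and initial training have already been done by the caller, and the three callables `apply_fn`, `cull_fn` and `generate_fn` are hypothetical stand-ins for the Apply, Culling and Generation algorithms.

```python
def run_immune_system(events, update_interval, apply_fn, cull_fn, generate_fn):
    """Drive the lifecycle over time-ordered (timestamp, message) pairs.

    All parameter names are hypothetical. apply_fn(message) scores one
    message; cull_fn() and generate_fn() take no arguments and are expected
    to update the repertoire they were built around.
    """
    results = []
    update_time = None
    for timestamp, message in events:
        if update_time is None:
            # Schedule the first update relative to the first message seen.
            update_time = timestamp + update_interval
        results.append(apply_fn(message))
        if timestamp > update_time:
            cull_fn()       # age lymphocytes and remove the unstimulated ones
            generate_fn()   # replace those that were culled
            update_time = timestamp + update_interval
    return results
```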

Creation of Lymphocytes and their Antibodies

As described above, a lymphocyte contains weighting information and an antibody. The weighting information is simply initialized to zero, but the antibody must be created from a gene library such as those described earlier.

For simplicity, the antibodies are created randomly. As described by the Generation Algorithm, each antibody starts with a gene fragment randomly chosen from the library. A random number between 0 and 1 is generated, and if that number is smaller than the user-selected probability for appending to occur, then another randomly chosen gene fragment is appended to the antibody. It continues to grow in this manner until the random number generated is larger than or equal to the probability of appending. Between each gene fragment a wildcard (a pattern which matches 0 or more characters, in this case) is placed. These wildcards are meant to help simulate the partial matching done by the biological immune system.

This process may be clearer with an example. Suppose that our library consists of only three gene fragments:

library = {A, B, C}

And the probability of appending is 0.7 or 70%, p_appending = 0.7

One of the gene fragments is randomly chosen to be the first gene fragment in the antibody. Suppose that the one chosen is B.

antibody = B

A random number is generated. On this first round, the number is 0.3. This is smaller than p_appending so another gene fragment is added to the antibody, along with a wildcard (.*) to separate the two:

antibody = B.*A

Another random number is generated. In this second round, the number is 0.62 so another gene fragment and wildcard are added:

antibody = B.*A.*A

Note that there is no problem with the same gene fragment appearing multiple times in the final antibody. Another random number is generated, but this time the number is 0.87. As a result, the antibody is now finished and its associated lymphocyte will be added to the repertoire of the immune system.

Generation Algorithm {Generation of lymphocytes}
BEGIN
  library <- a gene fragment library (cannot be empty)
  repertoire <- the list of existing lymphocytes (may be empty)
  p_appending <- the probability of appending to antibody
  WHILE {repertoire is smaller than the required size}
    lymphocyte <- a new empty memory structure with space for an antibody,
                  and the numbers msg_matched and spam_matched
    antibody <- "" {An empty string to start the new antibody being created.
                    This will be a regular expression made up of genes and wildcards.}

    lymphocyte.msg_matched <- 0
    lymphocyte.spam_matched <- 0
    REPEAT
      antibody <- randomly chosen gene fragment from library
      x <- randomly chosen number between 0 and 1
      WHILE {x < p_appending}
        new_gene <- new randomly chosen gene fragment from library
        antibody <- concatenate antibody, an expression that matches 0 or
                    more characters, and new_gene
        x <- new randomly chosen number between 0 and 1
      ENDWHILE
    UNTIL {an antibody is created that does not match any in the repertoire}

    lymphocyte.antibody <- antibody
    Add lymphocyte to repertoire of lymphocytes
  ENDWHILE
END

Note that the repertoire of the immune system does not contain duplicates. These were allowed in earlier versions of this system, but it was concluded that doing so required the system to waste processing time by applying the same antibody to a message multiple times.
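
The generation step could be sketched in Python roughly as follows. The `Lymphocyte` class, the function names and the `rng` parameter are hypothetical, and the sketch assumes that every gene fragment is already a valid regular expression (literal words would need escaping first).

```python
import random
from dataclasses import dataclass


@dataclass
class Lymphocyte:
    antibody: str              # regular expression built from gene fragments
    msg_matched: float = 0.0   # weighted number of messages matched
    spam_matched: float = 0.0  # weighted number of spams matched


def make_antibody(library, p_appending, rng):
    """Join randomly chosen gene fragments with the wildcard '.*'."""
    antibody = rng.choice(library)
    while rng.random() < p_appending:
        antibody += ".*" + rng.choice(library)
    return antibody


def generate_lymphocytes(repertoire, library, p_appending, size, rng=None):
    """Grow the repertoire to `size` lymphocytes with distinct antibodies."""
    if not library:
        raise ValueError("the gene library cannot be empty")
    if not 0 <= p_appending < 1:
        raise ValueError("p_appending must be in [0, 1) for antibodies to terminate")
    rng = rng or random.Random()
    existing = {lymphocyte.antibody for lymphocyte in repertoire}
    # Caveat: with p_appending == 0 only len(library) distinct antibodies
    # exist, so asking for more than that would never finish.
    while len(repertoire) < size:
        antibody = make_antibody(library, p_appending, rng)
        if antibody in existing:
            continue  # duplicates are rejected, as in the UNTIL condition
        existing.add(antibody)
        repertoire.append(Lymphocyte(antibody))
    return repertoire
```

Keeping the set `existing` alongside the list makes the duplicate check cheap; that is a design convenience of this sketch rather than anything the algorithm requires.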

Training of Lymphocytes

In the initial training phase, each training message has been classified as spam or non-spam. This classification may be done by the user, or the user may choose to use a corpus of messages which someone else has classified. For each message that matches a given antibody, the associated lymphocyte's msg_matched score is incremented by one. If the message is known spam, then the lymphocyte's spam_matched score is also incremented by one.

Training Algorithm {Training of lymphocytes}
BEGIN
  repertoire <- the list of lymphocytes (cannot be an empty list)
  message <- a message which has been marked as spam or non-spam
  IF {the message is user-determined spam}
    spam_increment <- 1
  ELSIF {the message is user-determined non-spam}
    spam_increment <- 0
  ELSE
    spam_increment <- a number between 0 and 1 indicating how likely the
message is to be spam
  ENDIF
  FOR {each lymphocyte in the repertoire}
    IF {lymphocyte.antibody matches the message}
       increment lymphocyte.msg_matched value by 1
       increment lymphocyte.spam_matched value by spam_increment
    ENDIF
  ENDFOR
END

Note that matching in the spam immune system is a binary process: a regular expression either matches a message or it does not. The partial matching is simulated by the fact that each regular expression may match many variations of a pattern, but the final threshold for binding is whether that particular regular expression matches that particular message.
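
A minimal Python sketch of the training step is shown below. The `Lymphocyte` class and the `train` function are hypothetical names, and matching with `re.search` plus the `re.DOTALL` flag (so that the wildcard can span line breaks) is an assumption about how antibodies are applied.

```python
import re
from dataclasses import dataclass


@dataclass
class Lymphocyte:
    antibody: str              # regular expression
    msg_matched: float = 0.0
    spam_matched: float = 0.0


def train(repertoire, message, spam_increment):
    """Update every lymphocyte whose antibody matches the message.

    spam_increment is 1 for known spam, 0 for known non-spam, or a value in
    between expressing how likely the message is to be spam.
    """
    for lymphocyte in repertoire:
        # Binary matching: the pattern either occurs in the message or not.
        if re.search(lymphocyte.antibody, message, flags=re.DOTALL):
            lymphocyte.msg_matched += 1
            lymphocyte.spam_matched += spam_increment
```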

Application and weighting of lymphocytes

The two numbers spam_matched and msg_matched can be used to give a weighted percentage of the time an antibody detects spam. The field msg_matched gives an indication of how often this antibody has been used, which helps determine how important it should be in the final weighting. An antibody that matches with a rate of 100% over a sample of 2 messages is probably not as useful as one that matches with accuracy 80% over a sample of 1000 messages.

This simple scoring system was used in [5]. The problem with this method was that one highly weighted lymphocyte could easily overpower the sum for long periods of time, even if the lymphocyte no longer matched much spam. A variant which helps to solve this problem is a weighted average, which is the sum of the spam_matched values from all matching lymphocytes divided by the sum of all the msg_matched values from all matching lymphocytes.

\[
\text{weighted average} = \frac{\sum_{\text{matching lymphocytes}} \text{spam\_matched}}{\sum_{\text{matching lymphocytes}} \text{msg\_matched}}
\]

This weighted average allows lymphocytes that have matched more often to have more effect on the final score than those that only match occasionally. It is also worth noting that the weighted average has bounded results (all results are between 0 and 1, inclusive).
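
As a worked illustration, suppose a message is matched by exactly two lymphocytes: one with spam_matched = 2 and msg_matched = 2, and one with spam_matched = 800 and msg_matched = 1000. These counts are hypothetical, chosen to mirror the 100%-over-2 and 80%-over-1000 antibodies mentioned earlier.

```latex
\[
\text{weighted average} = \frac{2 + 800}{2 + 1000} = \frac{802}{1002} \approx 0.80
\]
```

A plain mean of the two match rates would instead give (1.00 + 0.80) / 2 = 0.90, so the lightly used antibody barely moves the weighted score.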

Apply Algorithm {Application of antibodies with dynamically updated weights}
BEGIN
  repertoire <- the list of lymphocytes (cannot be an empty list)
  message <- a message to be marked
  threshold <- a cutoff point valued between 0 and 1 inclusive; anything
               with a higher score than this is spam {chosen by user}
  increment <- a value between 0 and 1 inclusive, depending upon the user's
               confidence in the system. {chosen by user}
  total_spam_matched <- 0 {initialize # of spams matched to 0}
  total_msg_matched <- 0 {initialize # of messages matched to 0}
  matching_lymphocytes <- {} {Initialize empty list of matching lymphocytes}

  FOR{each lymphocyte in the repertoire}
    IF {lymphocyte.antibody matches message}
      total_spam_matched <- total_spam_matched + lymphocyte.spam_matched
      total_msg_matched <- total_msg_matched + lymphocyte.msg_matched
      lymphocyte.msg_matched <- lymphocyte.msg_matched + 1
      {increment the # of messages matched by this antibody}
      add lymphocyte to matching_lymphocytes
    ENDIF
  ENDFOR

  score <- total_spam_matched/total_msg_matched {Determine the score using a
weighted average}
  IF {score > threshold}   {Message is spam}
    FOR{each lymphocyte in matching_lymphocytes}
      lymphocyte.spam_matched <- lymphocyte.spam_matched + increment
    ENDFOR
  ELSE
    Message is not spam
  ENDIF
END

The increment is chosen by the user and is used to update lymphocytes automatically. It represents the likelihood that a message is spam given that the system thinks it is spam. A neutral response would be 0.5 (meaning that the user doesn't think it's more likely either way), and an affirmative response would be 1 (the user thinks that anything tagged by the system is guaranteed to be spam). There is little use for a response of 0 (meaning that the user has no confidence in the system's detection ability) during regular learning, but it is useful for re-training purposes.
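
The scoring step might look like this in Python. It is a sketch under assumptions: the names are hypothetical, a message is treated as spam when its score is above the threshold (following the definition of the threshold in the algorithm's header), and returning `(None, False)` when nothing can be scored is a choice made here, since the algorithm does not say what to do when no antibody matches.

```python
import re
from dataclasses import dataclass


@dataclass
class Lymphocyte:
    antibody: str              # regular expression
    msg_matched: float = 0.0
    spam_matched: float = 0.0


def apply_lymphocytes(repertoire, message, threshold, increment):
    """Score one message and update the matching lymphocytes.

    Returns (score, is_spam); score is None when no weight is available.
    """
    matching = [
        lymphocyte
        for lymphocyte in repertoire
        if re.search(lymphocyte.antibody, message, flags=re.DOTALL)
    ]
    # Totals are taken before this message is counted, as in the pseudocode.
    total_spam = sum(l.spam_matched for l in matching)
    total_msg = sum(l.msg_matched for l in matching)
    for lymphocyte in matching:
        lymphocyte.msg_matched += 1
    if total_msg == 0:
        return None, False  # assumption: unscored messages are not flagged
    score = total_spam / total_msg  # weighted average, between 0 and 1
    is_spam = score > threshold
    if is_spam:
        for lymphocyte in matching:
            lymphocyte.spam_matched += increment
    return score, is_spam
```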

Culling of antibodies: Ageing and Death

To cope with the fact that both the self of legitimate messages and the non-self of junk email change over time, the spam immune system needs to be able to unlearn as well as learn things.

Each lymphocyte stores the information about the weighted number of messages and spam messages it matches. Periodically (perhaps once per week), the system looks at all the lymphocytes and culls those that haven't been used as much recently, although those that have matched many times in the past still have an advantage.

Lymphocytes also "age" during this culling process. The two values spam_matched and msg_matched are decreased by a percentage (so that the ratio between the two stays the same). Eventually, if the lymphocyte does not match new messages, the value of msg_matched will become small enough that the lymphocyte will be culled.

In this way, any lymphocyte that matches many messages can potentially become a "memory cell." In order for it to remain a memory cell, the system must experience "re-infection" of similar unwanted messages.

Culling Algorithm {Culling of antibodies: aging and death}
BEGIN
  FOR{each lymphocyte in the repertoire (list of all lymphocytes) }
    decrement lymphocyte.msg_matched
    decrement lymphocyte.spam_matched so that the ratio between it and
lymphocyte.msg_matched is the same as it was before the aging
    IF {lymphocyte.msg_matched < a set threshold}
      remove lymphocyte from repertoire
    ENDIF
  ENDFOR
END

This algorithm ensures that lymphocytes need to be continually stimulated in order to be retained. Culled lymphocytes are then replaced with new ones by the Generation Algorithm, as shown in the Lifecycle Algorithm.
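
Ageing and culling can be sketched in Python as below. The names `decay` and `min_matched` and their default values are hypothetical; the article only says that both counts shrink by a percentage and that a lymphocyte is removed once msg_matched falls below a set threshold.

```python
from dataclasses import dataclass


@dataclass
class Lymphocyte:
    antibody: str
    msg_matched: float = 0.0
    spam_matched: float = 0.0


def cull(repertoire, decay=0.9, min_matched=0.5):
    """Age every lymphocyte, then return only the survivors."""
    survivors = []
    for lymphocyte in repertoire:
        # Scaling both counts by the same factor keeps their ratio unchanged.
        lymphocyte.msg_matched *= decay
        lymphocyte.spam_matched *= decay
        if lymphocyte.msg_matched >= min_matched:
            survivors.append(lymphocyte)
    return survivors
```

Returning a new list instead of deleting in place avoids modifying the repertoire while iterating over it.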

Results

1000 lymphocytes were generated from a library of fewer than 200 genes. The genes were fairly complex, based on heuristic phrases used for spam detection.

The system was initially trained with 1500 spam and 1500 non-spam messages [10]. Once the training was completed, only 156 lymphocytes had been assigned a non-zero weight. Of these, 127 matched only spam, while the others had also matched legitimate messages. Messages were not expired and new lymphocytes were not generated in these tests, which were only intended to give a comparison between the two final scoring systems for messages.

With more lymphocytes generated initially or with more genes in the original library, the system would be more accurate, but these parameters were chosen because they seemed to give reasonable results with a fairly lightweight system [5].

The trained lymphocytes were then tested against a collection of 501 non-spam and 401 spam messages. (These numbers were chosen to give a fairly large sample and to reflect the percentage of overall mail that was then estimated to be spam [1] [12].) These trained lymphocytes were then used to score the messages. No further training took place as the messages were scored. With many detection systems, you can either have the system detect most spam messages or have the system be accurate in its detection, not both. The idea behind many systems, including this one, is that a threshold must be set, with messages on one side of the threshold (typically above) classified as spam, and messages on the other side of this threshold classified as non-spam.

With the threshold set at 0.7, the immune system correctly classifies 90% of the messages. More specifically, it correctly classifies 84% of spam and 98% of non-spam. More detailed results are given in Figure 3.

Figure 3: Message Scores [5]

There is a fairly clear threshold at 0.7 where most spam scores are above that value and most non-spam scores are below. There is the notable exception of the 49 messages for which no detectors existed.

It should be noted that the SpamAssassin public corpus used for testing and training this filter is known to be a difficult corpus for spam filters. Better results would be expected on an individual's personal email, since the self would be more clearly defined.

Conclusions

This article has described an innovative approach to spam detection based upon inspiration from the human immune system. While the human system has a knowledge of self that is constant throughout life, its artificial counterpart has to deal with a changing sense of self. The digital model accomplishes that by forgetting detectors that are not stimulated for lengthy periods of time and by generating new ones to retain a constant population. Ultimately, such an approach leads to an adaptive system that learns what a particular user considers to be spam versus legitimate email.

The results on a well-known corpus of spam are promising. We are confident that further work currently in progress will enhance the robustness of the algorithms presented. The lightweight nature of this solution, which requires a significantly smaller number of detectors than SpamAssassin, should prove attractive to those looking to implement a server-based solution where processing overhead may well be an issue. A server-based solution would be a one-size-fits-all mold since the filter is not personalized and does not learn for each particular user, but the reduced processing time and storage make such a solution attractive.

References

[1] Atkins, S. Size and cost of the problem. In Proceedings of the Fifty-sixth Internet Engineering Task Force (IETF) Meeting (Mar. 16-21, San Francisco, CA), SpamCon Foundation, 2003.
[2] Coalition Against Unsolicited Commercial Email. Pending legislation. http://www.cauce.org/legislation August 2002.
[3] Forrest, S., Hofmeyr, S.A., and Somayaji, A. Computer immunology. Communications of the ACM, vol. 40, no. 10, pp. 88--96, 1997.
[4] Graham, P. A plan for spam. http://www.paulgraham.com/spam.html August 2002.
[5] Oda, T., and White, T. Developing an immunity to spam. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2003), (July, Chicago), 2003.
[6] Pantel, P., and Lin, D. Spamcop: A spam classification & organization program. In Learning for Text Categorization: Papers from the 1998 Workshop, (Madison, Wisconsin), AAAI Technical Report WS-98-05, 1998.
[7] Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. A bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, (Madison, Wisconsin), AAAI Technical Report WS-98-05, 1998.
[8] Salib, M. Heuristics in the blender. In Proceedings of the 2003 Spam Conference, (Cambridge, US), 2003.
[9] SpamAssassin, SpamAssassin website. http://spamassassin.org/ 2002.
[10] SpamAssassin, SpamAssassin public corpus. http://spamassassin.org/publiccorpus/ February 28 2003.
[11] Vixie, P. MAPS RBL rationale. http://mail-abuse.org/rbl/rationale.html July 19 2000.
[12] Weaver, J. AOL escalates spam warfare. MSNBC, March 5 2003.

Biographies

Terri Oda is a Master's student at Carleton University where she has been working on approaches to spam detection using an artificial immune system since late 2002. She has written two papers on her research work. Outside of academic work, she manages the mailing lists for Linuxchix, a global organization that encourages women interested in Linux and involved in computing. She is also the lead document writer for the GNU Mailman project. Terri has a B.Math in Mathematics and Computer Science.

Tony White is an associate professor of Computer Science at Carleton University and Director of Technology for Symbium Corporation. His principal interests are in the application of biological metaphors to solving problems in computer science and engineering. He currently undertakes research in the areas of autonomic computing and artificial immune systems. He has published over 50 papers and is coauthor on 6 patents with 2 others pending. Tony has an M.A. in Theoretical Physics and a Ph.D. in Electrical Engineering.