Evolutionary pressure on spam
Oct. 5th, 2007 03:59 pm
I was just cleaning out my email filter and found one with a spam-score of +49! (Anything above 0 is considered spam. Legit email has, at best, a score of about -6.)
I looked at all the rules that tagged it, and honestly? I think that at this point, you might do better just listing various pharmaceutical names in the clear, rather than replacing every A with a 4 and every I with a 1.
no subject
Date: 2007-10-05 11:09 pm (UTC)
I did my "Look, I know how to write a paper, you can graduate me" Master's paper on spam filtering. "vviagra" had a much higher probability of being spam than "Viagra", which at least occasionally shows up in my email (forwarded jokes, probably).
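To make the "vviagra" versus "Viagra" point concrete, here is a minimal sketch of a per-token spam probability in the style of a simple Bayesian filter. The function name `token_spam_probability` and all of the counts are hypothetical, chosen only for illustration; they are not taken from the paper mentioned above. Real filters also clamp the result away from 0 and 1 and combine many tokens, which this sketch leaves out.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | token) from how often the token appears in each corpus.

    spam_hits / ham_hits: messages containing the token in each corpus.
    n_spam / n_ham: total messages in each corpus.
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical corpus sizes and counts (assumptions, not real data).
N_SPAM, N_HAM = 1000, 1000

# The obfuscated spelling never shows up in legitimate mail.
p_vviagra = token_spam_probability(spam_hits=40, ham_hits=0, n_spam=N_SPAM, n_ham=N_HAM)

# The plain spelling occasionally does (forwarded jokes, say).
p_viagra = token_spam_probability(spam_hits=120, ham_hits=6, n_spam=N_SPAM, n_ham=N_HAM)

print(f"vviagra: {p_vviagra:.3f}")
print(f"viagra:  {p_viagra:.3f}")
```

With counts like these, the misspelling is the stronger spam signal precisely because nobody types it by accident in legitimate mail.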
no subject
Date: 2007-10-05 11:16 pm (UTC)
It's like all the peacock tails got so elaborate in so many directions at once that the peacock with the most boring display is now the one most likely to get laid...
no subject
Date: 2007-10-05 11:49 pm (UTC)
no subject
Date: 2007-10-06 12:05 am (UTC)
WHY?! Isn't Bayesian the only thing that actually works?
no subject
Date: 2007-10-06 12:10 am (UTC)
How could it be? Machine learning has more than one technique, you know.
no subject
Date: 2007-10-06 12:45 am (UTC)
no subject
Date: 2007-10-06 01:06 am (UTC)
It's just comparable to the "isn't SVM the only way to do classification?" camp. No. It's not.
(And, of course, IR-style keyword filtering is one of the earliest learning techniques...)
no subject
Date: 2007-10-06 01:53 am (UTC)
And ML is hardly the only component of good spam filtering. The latest catch from SpamAssassin in my box is below. Only 3.5 out of 28.2 came from Bayes.
* 0.1 FORGED_RCVD_HELO Received: contains a forged HELO
* 0.0 ADVANCE_FEE_1 Appears to be advance fee fraud (Nigerian 419)
* 0.3 MIME_BOUND_NEXTPART Spam tool pattern in MIME boundary
* -0.8 AWL AWL: From: address is in the auto white-list
* 1.8 MILLION_USD BODY: Talks about millions of dollars
* 0.5 HTML_40_50 BODY: Message is 40% to 50% HTML
* 0.0 HTML_MESSAGE BODY: HTML included in message
* 3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
* 2.0 RCVD_IN_SORBS_DUL RBL: SORBS: sent directly from dynamic IP address
* 1.6 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
* 1.6 URIBL_SBL Contains an URL listed in the SBL blocklist
* 3.8 URIBL_AB_SURBL Contains an URL listed in the AB SURBL blocklist
* 4.1 URIBL_JP_SURBL Contains an URL listed in the JP SURBL blocklist
* 2.1 URIBL_WS_SURBL Contains an URL listed in the WS SURBL blocklist
* 3.0 URIBL_OB_SURBL Contains an URL listed in the OB SURBL blocklist
* 4.5 URIBL_SC_SURBL Contains an URL listed in the SC SURBL blocklist
Incidentally, it's a penis enlargement spam; the "millions of dollars" came from the random news fragment it included: The recall could cost over 30 million
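For anyone who wants to check where the points came from, here is a rough sketch that parses the report lines above and totals the scores per kind of rule. The grouping by rule-name prefix (`BAYES_`, `URIBL_`, `RCVD_IN_`) and the helper names are my own assumptions for illustration, not something SpamAssassin provides. The displayed per-rule scores add up to 28.1 rather than 28.2, presumably because each one is rounded to a single decimal in the report.

```python
import re
from decimal import Decimal

REPORT = """\
* 0.1 FORGED_RCVD_HELO Received: contains a forged HELO
* 0.0 ADVANCE_FEE_1 Appears to be advance fee fraud (Nigerian 419)
* 0.3 MIME_BOUND_NEXTPART Spam tool pattern in MIME boundary
* -0.8 AWL AWL: From: address is in the auto white-list
* 1.8 MILLION_USD BODY: Talks about millions of dollars
* 0.5 HTML_40_50 BODY: Message is 40% to 50% HTML
* 0.0 HTML_MESSAGE BODY: HTML included in message
* 3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
* 2.0 RCVD_IN_SORBS_DUL RBL: SORBS: sent directly from dynamic IP address
* 1.6 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
* 1.6 URIBL_SBL Contains an URL listed in the SBL blocklist
* 3.8 URIBL_AB_SURBL Contains an URL listed in the AB SURBL blocklist
* 4.1 URIBL_JP_SURBL Contains an URL listed in the JP SURBL blocklist
* 2.1 URIBL_WS_SURBL Contains an URL listed in the WS SURBL blocklist
* 3.0 URIBL_OB_SURBL Contains an URL listed in the OB SURBL blocklist
* 4.5 URIBL_SC_SURBL Contains an URL listed in the SC SURBL blocklist
"""

# Matches "* <score> <RULE_NAME> ..." and captures the score and rule name.
LINE_RE = re.compile(r"^\*\s+(-?\d+\.\d+)\s+(\S+)")


def parse_report(text):
    """Return a list of (rule_name, score) pairs from the report lines."""
    hits = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Decimal keeps the one-decimal scores exact when summed.
            hits.append((match.group(2), Decimal(match.group(1))))
    return hits


def bucket(rule):
    """Group a rule by name prefix (an illustrative, hypothetical grouping)."""
    if rule.startswith("BAYES_"):
        return "bayes"
    if rule.startswith("URIBL_"):
        return "uri blocklists"
    if rule.startswith("RCVD_IN_"):
        return "relay blocklists"
    return "other rules"


def totals_by_bucket(hits):
    """Sum the scores of all rules that fall into the same bucket."""
    totals = {}
    for rule, score in hits:
        key = bucket(rule)
        totals[key] = totals.get(key, Decimal("0")) + score
    return totals


hits = parse_report(REPORT)
totals = totals_by_bucket(hits)
for name, score in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:17s} {score}")
print(f"{'total':17s} {sum(totals.values())}")
```

Grouped this way, the URI blocklists alone account for 19.1 points, more than five times the Bayes contribution.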