NAME

Regexp::Common::profanity_us -- provide regexes for U.S. profanity

SYNOPSIS

  use Regexp::Common qw /profanity_us/;

  my $RE = $RE{profanity}{us}{normal}{label}{-keep}{-dist=>3};

  while (<>) {
      warn "PROFANE" if /$RE/;
  }

Or easier

 use Regexp::Profanity::US;

 $profane = profane     ($string);
 @profane = profane_list($string);

OVERVIEW

Instead of a dry technical overview, I am going to explain the structure of this module based on its history. I consult at a company that generates customer leads primarily by having websites that attract people (e.g. lowering loan values, selling cars, buying real estate, etc.). For some reason we get more than our fair share of profane leads. For this reason I was told to write a profanity checker.

For the data that I was dealing with, the profanity was most often in the email address or in the first or last name, so I naively started filtering profanity with a set of regexps for that sort of data. Note that both names and email addresses are unlike what you are reading now: they are not whitespace-separated text, but are instead labels.

Therefore full support for profanity checking should work in 2 entirely different contexts: labels (email, names) and text (what you are reading). Because open-source is driven by demand and I have no need for detecting profanity in text, only "label" is implemented at the moment. And you know the next sentence: "patches welcome" :)

Spelling Variations Dictated by Sound or Sight

Creative use of symbols to spell words (el33t sp3@k)

Now, within labels, you can see normal ascii or creative use of symbols:

Here are some normal profane labels: suckmycock@isp.com shitonastick

And here they are in ascii art: s\/cKmyc0k@aol.com sh|+0naST1ck

A CPAN module which does a great job of "drawing words" is Acme::Tie::Eleet. I thought I knew all of the ways that someone could "inflate" a letter so that dirty words could bypass a profanity checker, but just look at all these:

 %letter = 
    ( a => [ "4", "@" ],
      c => "(",
      e => "3",
      g => "6",
      h => [ "|-|", "]-[" ],
      k => [ "|<", "]{" ],
      i => "!",
      l => [ "1", "|" ],
      m => [ "|V|", "|\\/|" ],
      n => "|\\|",
      o => "0",
      s => [ "5", "Z" ],
      t => [ "7", "+"],
      u => "\\_/",
      v => "\\/",
      w => [ "vv", "\\/\\/" ],
      'y' => "j",
      z => "2",
      );

Soundex respelling

Which of course brings me to the final way to take normal text and vary it for the same meaning: soundex.

The way a word sounds can lead to different spellings. For example, we have shitonastick

Which we can soundex out as: shitonuhstick

Or, given: nigger

We can rewrite it as: nigga nigguh niggah

There are two CPAN modules, Text::Soundex and Text::Metaphone which do this sort of thing, but after they resolved "shit" and "shot" to the same soundex, I forgot about them :).

So to conclude this OVERVIEW, (or is that oV3r\/ieW :), this module does profanity checking for:

  labels and not text

and for:

  normal and not eleet spelling

with a bit of hedging to support soundexing (and only definite obscene words are searched for. Ambiguous / contextual searching is left as an exercise for the reader).

In Regexp::Common terminology, which is the infrastructure on which this module is built, we have only the following regexp for your string-matching ecstasy:

    $RE{profanity}{us}{normal}{label}

and patches are welcome for:

    $RE{profanity}{us}{label}{eleet}
    $RE{profanity}{us}{text}{normal}
    $RE{profanity}{us}{text}{eleet}

But do note this if you plan to implement text parsing,

"[^:alpha:]" and not "\b" should be used because "_" does not form a word boundary and so

  \bshit\b

will match

  shit head

and

  shit-head

but not

  shit_head

Another thing about text is that it may be resolved into labels by splitting on whitespace. Thus, one could have one engine and a different pre-processor.

USAGE

Please consult the manual of Regexp::Common for a general description of the works of this interface.

Do not use this module directly, but load it via Regexp::Common.

This module reads one flag, "-dist" which is used to set the amount of characters that can appear between components of an obscene phrase. For example

  suck!!!my!!!cock

will match the following regular expression

  suck-my-cock

as long as the flag "-dist" is set to 3 or greater because this module changes "-" into ".{0,$dist}" with $dist defaulting to 7. Why such a large default? It is done so that the profanity list can omit certain words such as my or your. Take this:

  poop on your face

We have the following regular expression

  poop--face

which is transformed to

  poop.{0,7}.{0,7}face

which will match the possible prepositions and adjectives in between "poop" and "face" and also match the hideous term "poopface".

Capturing

Under "-keep" (see Regexp::Common):

$1: captures the entire word

AUTHOR

T. M. Brannon, tbone@cpan.org

I cannot pay enough thanks to

  Matthew Simon Cavalletto, evo@cpan.org.

who refactored this module completely of his own volition and in spite of his hectic schedule. He turned this module from an unsophisticated hack into something worth others using.

Useful brain picking came from William McKee of Knowmad Consulting on the Data::FormValidator mailing list.