NAME

Regexp::Common::profanity_us -- provide regexes for U.S. profanity

SYNOPSIS

    use Regexp::Common qw /profanity_us/;

    my $RE = $RE{profanity}{us}{normal}{label}{-keep}{-dist=>3};

    while (<>) {
        warn "PROFANE" if /$RE/;
    }

Or easier:

    use Regexp::Profanity::US;

    $profane = profane     ($string);
    @profane = profane_list($string);

OVERVIEW

Instead of a dry technical overview, I am going to explain the structure of this module based on its history. I consult at a company that generates customer leads primarily by having websites that attract people (e.g. lowering loan values, selling cars, buying real estate, etc.). For some reason we get more than our fair share of profane leads. For this reason I was told to write a profanity checker.

For the data that I was dealing with, the profanity was most often in the email address or in the first or last name, so I naively started filtering profanity with a set of regexps for that sort of data. Note that both names and email addresses are unlike what you are reading now: they are not whitespace-separated text, but are instead labels.

Therefore full support for profanity checking should work in 2 entirely different contexts: labels (email, names) and text (what you are reading). Because open-source is driven by demand and I have no need for detecting profanity in text, only "label" is implemented at the moment. And you know the next sentence: "patches welcome" :)

Spelling Variations Dictated by Sound or Sight

Creative use of symbols to spell words (el33t sp3@k)

Now, within labels, you can see normal ascii or creative use of symbols.

Here are some normal profane labels:

    suckmycock@isp.com
    shitonastick

And here they are in ascii art:

    s\/cKmyc0k@aol.com
    sh|+0naST1ck

A CPAN module which does a great job of "drawing words" is Acme::Tie::Eleet.

I thought I knew all of the ways that someone could "inflate" a letter so that dirty words could bypass a profanity checker, but just look at all these:

    %letter = (
        a   => [ "4", "@" ],
        c   => "(",
        e   => "3",
        g   => "6",
        h   => [ "|-|", "]-[" ],
        k   => [ "|<", "]{" ],
        i   => "!",
        l   => [ "1", "|" ],
        m   => [ "|V|", "|\\/|" ],
        n   => "|\\|",
        o   => "0",
        s   => [ "5", "Z" ],
        t   => [ "7", "+" ],
        u   => "\\_/",
        v   => "\\/",
        w   => [ "vv", "\\/\\/" ],
        'y' => "j",
        z   => "2",
    );

Soundex respelling

Which of course brings me to the final way to take normal text and vary it for the same meaning: soundex. The way a word sounds can lead to different spellings. For example, we have

    shitonastick

Which we can soundex out as:

    shitonuhstick

Or, given:

    nigger

We can rewrite it as:

    nigga
    nigguh
    niggah

There are two CPAN modules, Text::Soundex and Text::Metaphone, which do this sort of thing, but after they resolved "shit" and "shot" to the same soundex, I forgot about them :).

So to conclude this OVERVIEW (or is that oV3r\/ieW :), this module does profanity checking for:

    labels and not text

and for:

    normal and not eleet spelling

with a bit of hedging to support soundexing. Only definitely obscene words are searched for; ambiguous or contextual searching is left as an exercise for the reader.

In the terminology of Regexp::Common, the infrastructure on which this module is built, we have only the following regexp for your string-matching ecstasy:

    $RE{profanity}{us}{normal}{label}

and patches are welcome for:

    $RE{profanity}{us}{label}{eleet}
    $RE{profanity}{us}{text}{normal}
    $RE{profanity}{us}{text}{eleet}

But do note this if you plan to implement text parsing: "[^[:alpha:]]" and not "\b" should be used to delimit words. "_" counts as a word character, so there is no word boundary between a word and an adjacent "_", and

    \bshit\b

will match

    shit head

and

    shit-head

but not

    shit_head

Another thing about text is that it may be resolved into labels by splitting on whitespace. Thus, one could have one engine and a different pre-processor.

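The boundary problem and the "split text into labels" idea can be tried out in plain Perl without loading this module. What follows is only a minimal sketch: $word, matches_b, matches_alpha and labels_of are hypothetical names made up for illustration and are not part of Regexp::Common::profanity_us, and the non-alphabetic delimiter is expressed here with lookarounds, which is one possible reading of the "[^[:alpha:]]" advice rather than the module's own approach.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a single entry of a profanity list.
my $word = 'shit';

# Delimited by \b: "_" is a word character, so "shit_head" has no
# boundary after the "t" and will not match.
my $with_b = qr/\b\Q$word\E\b/i;

# Delimited by "not a letter" (or start/end of string), written with
# lookarounds so that the delimiter itself is not consumed.
my $with_alpha = qr/(?<![[:alpha:]])\Q$word\E(?![[:alpha:]])/i;

sub matches_b     { my ($s) = @_; return $s =~ $with_b     ? 1 : 0 }
sub matches_alpha { my ($s) = @_; return $s =~ $with_alpha ? 1 : 0 }

# Pre-processor: resolve free text into whitespace-separated labels,
# dropping the empty field that leading whitespace produces.
sub labels_of {
    my ($text) = @_;
    return grep { length } split /\s+/, $text;
}

for my $s ('shit head', 'shit-head', 'shit_head') {
    printf "%-10s  \\b: %d  non-alpha: %d\n",
        $s, matches_b($s), matches_alpha($s);
}
```

With "\b" the first two strings match and "shit_head" does not, while the non-alphabetic version matches all three. Each element returned by labels_of could then be handed to a label-oriented regexp, which is the "one engine, different pre-processor" arrangement in miniature.
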
USAGE

Please consult the manual of Regexp::Common for a general description of how this interface works.

Do not use this module directly, but load it via Regexp::Common.

This module reads one flag, "-dist", which sets the number of characters that can appear between components of an obscene phrase. For example

    suck!!!my!!!cock

will match the following regular expression

    suck-my-cock

as long as the flag "-dist" is set to 3 or greater, because this module changes "-" into ".{0,$dist}", with $dist defaulting to 7.

Why such a large default? It is done so that the profanity list can omit certain words such as "my" or "your". Take this:

    poop on your face

We have the following regular expression

    poop--face

which is transformed to

    poop.{0,7}.{0,7}face

which will match the possible prepositions and adjectives in between "poop" and "face" and also match the hideous term "poopface".

Capturing

Under "-keep" (see Regexp::Common):
SEE ALSO

Regexp::Common for a general description of how to use this interface.

Regexp::Common::profanity for a slightly more European set of words.

Regexp::Profanity::US for a pair of wrapper functions that use these regexps.

AUTHOR

T. M. Brannon, tbone@cpan.org

I cannot pay enough thanks to Matthew Simon Cavalletto, evo@cpan.org, who refactored this module completely of his own volition and in spite of his hectic schedule. He turned this module from an unsophisticated hack into something worth others using.

Useful brain picking came from William McKee of Knowmad Consulting on the Data::FormValidator mailing list.