NAME
Lingua::EN::Squeeze - Shorten text to minimum syllables using hash table lookup and vowel deletion

SYNOPSIS
    use Lingua::EN::Squeeze;             # import only function
    use Lingua::EN::Squeeze qw( :ALL );  # import all functions and variables
    use English;                         # to use readable variable names

    while (<>)
    {
        print "Original: $_\n";
        print "Squeezed: ", SqueezeText(lc $_), "\n";
    }

    # Or you can use the object-oriented interface

    my $squeeze = Lingua::EN::Squeeze->new();

    while (<>)
    {
        print "Original: $_\n";
        print "Squeezed: ", $squeeze->SqueezeText(lc $_);
    }

DESCRIPTION
This module squeezes English text to the most compact format possible, so that it is barely readable. Be sure to convert all text to lowercase before calling SqueezeText() for maximum compression, because the optimizations have been designed mostly for lowercase letters.

Warning: Each line is processed multiple times, so prepare for slow conversion time.

You can use this module e.g. to preprocess text before it is sent to electronic media that has some maximum text size limit. For example, pagers have an arbitrary text size limit, typically around 200 characters, which you want to fill as much as possible. Alternatively you may have a GSM cellular phone which is capable of receiving Short Messages (SMS), whose message size limit is 160 characters. For demonstration of this module's SqueezeText() function, this paragraph's conversion result is presented below. See for yourself if it's readable (yes, it takes some time to get used to). The compression ratio is typically 30-40%.

    u _n use thi mod e.g. to prprce txt bfre i_s snt to elrnic mda
    has som max txt siz lim. f_xmple pag hv abitry txt siz lim,
    tpcly 200 chr, W/ u wnt to fll as mch as psbleAlternatvly u may
    hv GSM cllar P8 w_s cpble of rcivng Short msg (SMS), WS/ msg siz
    lim is 160 chr.
    4 demonstrton of thi mods SquezText fnc , dsc txt of thi prgra
    has ben cnvd_ blow See uself if i_s redble (Yes, it tak som T
    to get usdto compr rat is tpcly 30-40

And if $SQZ_OPTIMIZE_LEVEL is set to non-zero:

    u_nUseThiModE.g.ToPrprceTxtBfreI_sSntTo
    elrnicMdaHasSomMaxTxtSizLim.F_xmplePag
    hvAbitryTxtSizLim,Tpcly200Chr,W/UWnt
    toFllAsMchAsPsbleAlternatvlyUMayHvGSMCllarP8
    w_sCpbleOfRcivngShortMsg(SMS),WS/MsgSiz
    limIs160Chr.4DemonstrtonOfThiModsSquezText
    fnc,DscTxtOfThiPrgraHasBenCnvd_Blow
    SeeUselfIfI_sRedble(Yes,ItTakSomTToGetUsdto
    comprRatIsTpcly30-40

The comparison of these two shows:

    Original text : 627 characters
    Level 0       : 433 characters   reduction 31 %
    Level 1       : 345 characters   reduction 45 %  (+14% improvement)

A few grammar rules are used to shorten some English tokens considerably:

    A word that has _ is usually a verb
    A word that has / is usually a substantive, noun, pronoun or other non-verb

Read the following substitution tokens in order to understand the basics of converted text. Hopefully, the text is not pure Geek code (tm) to you after some practice. In Geek code (like G++L--J) you would need an external parser to understand it. Here some common sense and time are needed to adapt oneself to the compressed format. For a complete, up-to-date list, you would be better off peeking at the source code.

    automatically => 'acly_'

    for           => 4
    for him       => 4h
    for her       => 4h
    for them      => 4t
    for those     => 4t

    can           => _n
    does          => _s

    it is         => i_s
    that is       => t_s
    which is      => w_s
    that are      => t_r
    which are     => w_r

    less          => -/
    more          => +/
    most          => ++

    however       => h/ver
    think         => thk_

    useful        => usful

    you           => u
    your          => u/
    you'd         => u/d
    you'll        => u/l
    they          => t/
    their         => t/r

    will          => /w
    would         => /d
    with          => w/
    without       => w/o
    which         => W/
    whose         => WS/

Time is expressed with capital letters:

    time          => T
    minute        => MIN
    second        => SEC
    hour          => HH
    day           => DD
    month         => MM
    year          => YY

Other capital-letter acronyms; think of 8 as representing the speaker and the microphone:

    phone         => P8

EXAMPLES
To add new words e.g.
to word conversion hash table, you'd define a custom set and merge it with the existing ones. Do the same for %SQZ_WXLATE_MULTI_HASH and $SQZ_ZAP_REGEXP and then start using the conversion function.

    use English;
    use Lingua::EN::Squeeze qw( :ALL );

    # Keys containing a hyphen must be quoted
    my %myExtraWordHash =
    (
        'new-word1' => 'conversion1',
        'new-word2' => 'conversion2',
        'new-word3' => 'conversion3',
        'new-word4' => 'conversion4',
    );

    # First take the existing tables and merge them with the above
    # translation table

    my %myCustomWordHash =
    (
        %SQZ_WXLATE_HASH,
        %SQZ_WXLATE_EXTRA_HASH,
        %myExtraWordHash,
    );

    my $myXlat = 0;    # state flag

    while (<>)
    {
        if ( $condition )
        {
            SqueezeHashSet \%myCustomWordHash;    # Use MY conversions
            $myXlat = 1;
        }

        if ( $myXlat and $condition )
        {
            SqueezeHashSet "reset";               # Back to default table
            $myXlat = 0;
        }

        print SqueezeText $ARG;
    }

Similarly you can redefine the multi-word translation table by supplying another hash reference in the call to SqueezeHashSet(). To kill more text immediately in addition to the default, just concatenate regexps to the variable $SQZ_ZAP_REGEXP.

KNOWN BUGS
There may be a lot of false conversions. If you think that some word squeezing went too far, please 1) turn on the debug, then send 2) your example text and 3) the debug log to the maintainer. To see how the conversion goes e.g. for the word Messages:

    use English;
    use Lingua::EN::Squeeze;

    # Activate debug when case-insensitive word "Messages" is found from
    # the line.

    SqueezeDebug( 1, '(?i)Messages' );

    $ARG = "This line has some Messages in it";
    print SqueezeText $ARG;

EXPORTABLE VARIABLES
The defaults may not apply to all types of text, so you may wish to extend the hash tables and $SQZ_ZAP_REGEXP to cope with your typical text.

$SQZ_ZAP_REGEXP
Text to kill immediately, like "Hm, Hi, Hello..." You can only set this once, because this regexp is compiled immediately when SqueezeText() is called for the first time.

$SQZ_OPTIMIZE_LEVEL
This controls how optimized the text will be. Currently there is only level 0 (default) and level 1. Level 1 removes all spaces.
That usually improves compression by an average of 10%, but the text is harder to read. If space is really tight, use this extended compression optimization.

%SQZ_WXLATE_MULTI_HASH
Multi-word conversion hash table: "for you" => "4u" ...

%SQZ_WXLATE_HASH
Single-word conversion hash table: word => conversion. This table is applied after %SQZ_WXLATE_MULTI_HASH has been used.

%SQZ_WXLATE_EXTRA_HASH
Aggressive single-word conversions like: without => w/o are applied last.

INTERFACE FUNCTIONS
SqueezeObjectArg($)
SqueezeText($)
new()
SqueezeHashSet($;$)
SqueezeControl(;$)
SqueezeDebug(;$$)
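
As a rough illustration of the lookup order described under EXPORTABLE VARIABLES (multi-word table first, then single-word table), followed by a crude vowel-deletion step, here is a minimal self-contained sketch. It does not use or reproduce the module's internals: toy_squeeze, %multi and %single are hypothetical names, the table entries are copied from the substitution list in DESCRIPTION, and the vowel-deletion rule is an assumption made purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical tables for illustration only; the entries are copied from
# the substitution list in DESCRIPTION, not read from the module.
my %multi  = ( 'it is' => 'i_s', 'for them' => '4t', 'which is' => 'w_s' );
my %single = ( 'you' => 'u', 'with' => 'w/', 'for' => '4', 'phone' => 'P8' );

# toy_squeeze is a hypothetical helper, NOT part of Lingua::EN::Squeeze.
sub toy_squeeze {
    my ($text) = @_;
    $text = lc $text;

    # 1) Multi-word phrases first (longest first), so that "for them"
    #    is converted before the single word "for" can match.
    for my $phrase ( sort { length $b <=> length $a } keys %multi ) {
        $text =~ s/\b\Q$phrase\E\b/$multi{$phrase}/g;
    }

    my @out;
    for my $word ( split ' ', $text ) {
        if ( exists $single{$word} ) {
            # 2) Single-word table lookup.
            push @out, $single{$word};
        }
        elsif ( $word =~ /^[a-z]{5,}$/ ) {
            # 3) Assumed rule: in plain words of five or more letters,
            #    delete interior vowels but keep the first and last letter.
            my ( $first, $mid, $last ) = $word =~ /^(.)(.*)(.)$/;
            $mid =~ tr/aeiou//d;
            push @out, "$first$mid$last";
        }
        else {
            push @out, $word;    # short words and tokens pass through
        }
    }
    return join ' ', @out;
}

print toy_squeeze("It is useful for them"), "\n";    # prints "i_s usfl 4t"
```

The toy rule turns "useful" into "usfl", whereas the module's own table lists "usful", which shows why the real conversions are table-driven rather than derived from one simple rule.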

SEE ALSO
The following modules may also be of interest. I haven't tested them, unless noted.

AVAILABILITY
Latest version of this module can be found at CPAN/modules/by-module/Lingua/

AUTHOR
Jari Aalto <jariaalto@cpan.org>

COPYRIGHT AND LICENSE
This software is Copyright (c) 1998-2016 by Jari Aalto.

This is free software, licensed under:

    The GNU General Public License, Version 2, June 1991

You can redistribute it and/or modify it under the terms of GNU General Public License v2 or later.