MCE::Grep(3) User Contributed Perl Documentation MCE::Grep(3)

MCE::Grep - Parallel grep model similar to the native grep function

This document describes MCE::Grep version 1.878

 ## Exports mce_grep, mce_grep_f, and mce_grep_s
 use MCE::Grep;

 ## Array or array_ref
 my @a = mce_grep { $_ % 5 == 0 } 1..10000;
 my @b = mce_grep { $_ % 5 == 0 } \@list;

 ## Important; pass an array_ref for deeply nested input data
 my @c = mce_grep { $_->[1] % 2 == 0 } [ [ 0, 1 ], [ 0, 2 ], ... ];
 my @d = mce_grep { $_->[1] % 2 == 0 } \@deeply_list;

 ## File path, glob ref, IO::All::{ File, Pipe, STDIO } obj, or scalar ref
 ## Workers read the file directly without involving the manager process
 my @e = mce_grep_f { /pattern/ } "/path/to/file"; # efficient

 ## Involves the manager process, therefore slower
 my @f = mce_grep_f { /pattern/ } $file_handle;
 my @g = mce_grep_f { /pattern/ } $io;
 my @h = mce_grep_f { /pattern/ } \$scalar;

 ## Sequence of numbers (begin, end [, step, format])
 my @i = mce_grep_s { $_ % 3 == 0 } 1, 10000, 5;
 my @j = mce_grep_s { $_ % 3 == 0 } [ 1, 10000, 5 ];

 my @k = mce_grep_s { $_ % 3 == 0 } {
    begin => 1, end => 10000, step => 5, format => undef
 };

This module provides a parallel grep implementation via Many-Core Engine (MCE). MCE incurs a small overhead from passing data between processes, so a fast code block will run faster natively. However, the relative overhead will likely diminish as the complexity of the code block increases.

 my @m1 =     grep { $_ % 5 == 0 } 1..1000000;          ## 0.065 secs
 my @m2 = mce_grep { $_ % 5 == 0 } 1..1000000;          ## 0.194 secs
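To see where the break-even point lies on a particular machine, the comparison above can be timed directly. The following is a minimal sketch, assuming the core module Time::HiRes for timing. The helper name timed and the input size are illustrative and not part of MCE; the numbers printed will differ from those quoted in this document.

```perl
use strict;
use warnings;
use Time::HiRes qw( time );
use MCE::Grep;

# Hypothetical helper: run a code ref in list context and return
# the elapsed seconds plus the number of items it produced.
sub timed {
   my ($code) = @_;
   my $start  = time;
   my @result = $code->();
   return (time - $start, scalar @result);
}

my $n = 100_000;   # illustrative input size

my ($t_native, $c_native) = timed(sub {     grep { $_ % 5 == 0 } 1..$n });
my ($t_mce,    $c_mce   ) = timed(sub { mce_grep { $_ % 5 == 0 } 1..$n });

MCE::Grep->finish;

printf "native grep: %d matches in %.3f secs\n", $c_native, $t_native;
printf "mce_grep   : %d matches in %.3f secs\n", $c_mce,    $t_mce;
```

Both runs should report the same number of matches; only the elapsed time is expected to differ.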

Chunking, enabled by default, greatly reduces the overhead behind the scenes. The time for mce_grep below also includes the time for data exchanges between the manager and worker processes. More parallelization will be seen when the code incurs additional CPU time.

 my @m1 =     grep { /[2357][1468][9]/ } 1..1000000;    ## 0.353 secs
 my @m2 = mce_grep { /[2357][1468][9]/ } 1..1000000;    ## 0.218 secs
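The idea behind chunking can be pictured with a purely sequential sketch: items are grouped so that one exchange covers many items rather than one. This is a conceptual illustration only, not MCE's internal code; make_chunks is a hypothetical helper and no parallelism is involved.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list into chunks of at most $size items.
sub make_chunks {
   my ($size, @items) = @_;
   my @chunks;
   push @chunks, [ splice @items, 0, $size ] while @items;
   return @chunks;
}

my @chunks = make_chunks(4, 1..10);   # [1..4], [5..8], [9,10]

# Each chunk is grepped as a whole and the per-chunk results are
# concatenated, so the cost of a hand-off is paid once per chunk.
my @matches = map {
   my $chunk = $_;
   grep { $_ % 5 == 0 } @$chunk;
} @chunks;

print scalar(@chunks), " chunks, matches: @matches\n";
```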

Even faster is mce_grep_s; useful when input data is a range of numbers. Workers generate sequences mathematically among themselves without any interaction from the manager process. Two arguments are required for mce_grep_s (begin, end). Step defaults to 1 if begin is smaller than end, otherwise -1.

 my @m3 = mce_grep_s { /[2357][1468][9]/ } 1, 1000000;  ## 0.165 secs
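The default-step rule can be illustrated with a plain sequential generator. This is only a sketch of the semantics stated above, not how MCE workers compute their share of the sequence; seq_values is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the (begin, end [, step]) convention.
sub seq_values {
   my ($begin, $end, $step) = @_;

   # Step defaults to 1 if begin is smaller than end, otherwise -1.
   $step = ($begin < $end) ? 1 : -1 unless defined $step;
   die "step must not be zero\n" if $step == 0;

   my @values;
   if ($step > 0) {
      for (my $n = $begin; $n <= $end; $n += $step) { push @values, $n }
   }
   else {
      for (my $n = $begin; $n >= $end; $n += $step) { push @values, $n }
   }
   return @values;
}

my @up   = seq_values(1, 5);        # ascending, default step  1
my @down = seq_values(5, 1);        # descending, default step -1
my @by5  = seq_values(1, 20, 5);    # explicit step 5

print "@up\n@down\n@by5\n";
```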

Although this document is about MCE::Grep, the MCE::Stream module can write results immediately without waiting for all chunks to complete. This is made possible by passing the reference to an array (in this case @m4 and @m5).

 use MCE::Stream default_mode => 'grep';

 my @m4; mce_stream \@m4, sub { /[2357][1468][9]/ }, 1..1000000;

    ## Completed in 0.203 secs. This is amazing considering the
    ## overhead for passing data between the manager and workers.

 my @m5; mce_stream_s \@m5, sub { /[2357][1468][9]/ }, 1, 1000000;

    ## Completed in 0.120 secs. As with mce_grep_s, passing a
    ## sequence specification turns out to be faster due to lower
    ## overhead for the manager process.

A common scenario is grepping for pattern(s) inside a massive log file. Notice how the benefit of parallelism increases as the pattern becomes more complex. Testing was done against a 300 MB file containing 250k lines.

 use MCE::Grep;

 my @m; open my $LOG, "<", "/path/to/log/file" or die "$!\n";

 @m = grep { /pattern/ } <$LOG>;                      ##  0.756 secs
 @m = grep { /foobar|[2357][1468][9]/ } <$LOG>;       ## 24.681 secs

 ## Parallelism with mce_grep. This involves the manager process
 ## due to processing a file handle.

 @m = mce_grep { /pattern/ } <$LOG>;                  ##  0.997 secs
 @m = mce_grep { /foobar|[2357][1468][9]/ } <$LOG>;   ##  7.439 secs

 ## Even faster with mce_grep_f. Workers access the file directly
 ## with zero interaction from the manager process.

 ## Pass the path itself, not the file handle opened above.
 my $path = "/path/to/file";
 @m = mce_grep_f { /pattern/ } $path;                 ##  0.112 secs
 @m = mce_grep_f { /foobar|[2357][1468][9]/ } $path;  ##  6.840 secs

The MCE::Grep module cannot see the pattern inside the code block, so it lacks an optimization for quickly determining whether a chunk contains a match at all. Use the following snippet as a template to achieve better performance. Also, take a look at examples/egrep.pl, included with the distribution.

 use MCE::Loop;

 MCE::Loop->init(
    max_workers => 8, use_slurpio => 1
 );

 my $pattern  = 'karl';
 my $hugefile = 'very_huge.file';

 my @result = mce_loop_f {
    my ($mce, $slurp_ref, $chunk_id) = @_;

    ## Quickly determine if a match is found.
    ## Process slurped chunk only if true.

    if ($$slurp_ref =~ /$pattern/m) {
       my @matches;

       ## Reading via an in-memory file handle is fast on Unix.
       ## Performance degrades drastically on Windows beyond
       ## 4 workers. Therefore, use the regex loop on Windows.
       ## Run only one of the two, or matches are gathered twice.

       if ($^O ne 'MSWin32') {
          open my $MEM_FH, '<', $slurp_ref;
          binmode $MEM_FH, ':raw';
          while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
          close   $MEM_FH;
       }
       else {
          while ( $$slurp_ref =~ /([^\n]+\n)/mg ) {
             my $line = $1; # save $1 to not lose the value
             push @matches, $line if ($line =~ /$pattern/);
          }
       }

       ## Gather matched lines.

       MCE->gather(@matches);
    }

 } $hugefile;

 print join('', @result);

The following lists the options which may be overridden when loading the module.

 use Sereal qw( encode_sereal decode_sereal );
 use CBOR::XS qw( encode_cbor decode_cbor );
 use JSON::XS qw( encode_json decode_json );

 use MCE::Grep
     max_workers => 4,                # Default 'auto'
     chunk_size => 100,               # Default 'auto'
     tmp_dir => "/path/to/app/tmp",   # $MCE::Signal::tmp_dir
     freeze => \&encode_sereal,       # \&Storable::freeze
     thaw => \&decode_sereal          # \&Storable::thaw
 ;

From MCE 1.8 onwards, Sereal 3.015+ is loaded automatically if available. Specify "Sereal => 0" to use Storable instead.

 use MCE::Grep Sereal => 0;

MCE::Grep->init ( options )
MCE::Grep::init { options }

The init function accepts a hash of MCE options. The gather option, if specified, is ignored due to being used internally by the module.

 use MCE::Grep;

 MCE::Grep->init(
    chunk_size => 1, max_workers => 4,

    user_begin => sub {
       print "## ", MCE->wid, " started\n";
    },

    user_end => sub {
       print "## ", MCE->wid, " completed\n";
    }
 );

 my @a = mce_grep { $_ % 5 == 0 } 1..100;

 print "\n", "@a", "\n";

 -- Output

 ## 2 started
 ## 3 started
 ## 1 started
 ## 4 started
 ## 3 completed
 ## 4 completed
 ## 1 completed
 ## 2 completed

 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

MCE::Grep->run ( sub { code }, list )
mce_grep { code } list

Input data may be defined using a list or an array reference. Unlike MCE::Loop, MCE::Flow, and MCE::Step, specifying a hash reference as input data isn't allowed.

 ## Array or array_ref
 my @a = mce_grep { /[2357]/ } 1..1000;
 my @b = mce_grep { /[2357]/ } \@list;

 ## Important; pass an array_ref for deeply nested input data
 my @c = mce_grep { $_->[1] =~ /[2357]/ } [ [ 0, 1 ], [ 0, 2 ], ... ];
 my @d = mce_grep { $_->[1] =~ /[2357]/ } \@deeply_list;

 ## Not supported
 my @z = mce_grep { ... } \%hash;
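When the data lives in a hash, one possible workaround is to pass the keys as an array reference and look the values up inside the block. The following is a minimal sketch, assuming each worker sees its own copy of the hash (as is the case when workers are forked); %price is illustrative data, not part of MCE.

```perl
use strict;
use warnings;
use MCE::Grep;

# Illustrative data; a hash reference cannot be passed as input.
my %price = ( apple => 3, melon => 12, grape => 7, mango => 15 );

# Pass the keys as an array reference. The block receives each key
# in $_ and looks up the value in the worker's copy of %price.
my @expensive = mce_grep { $price{$_} > 10 } [ keys %price ];

MCE::Grep->finish;

# Hash key order is not defined, so sort before printing.
print join(',', sort @expensive), "\n";
```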

MCE::Grep->run_file ( sub { code }, file )
mce_grep_f { code } file

The fastest of these is a file path ("/path/to/file"). Workers communicate the next offset position among themselves with zero interaction by the manager process.

"IO::All" { File, Pipe, STDIO } is supported since MCE 1.845.

 my @c = mce_grep_f { /pattern/ } "/path/to/file";  # faster
 my @d = mce_grep_f { /pattern/ } $file_handle;
 my @e = mce_grep_f { /pattern/ } $io;              # IO::All
 my @f = mce_grep_f { /pattern/ } \$scalar;

MCE::Grep->run_seq ( sub { code }, $beg, $end [, $step, $fmt ] )
mce_grep_s { code } $beg, $end [, $step, $fmt ]

A sequence may be defined as a list, an array reference, or a hash reference. The functions require both begin and end values to run. Step and format are optional. The format is passed to sprintf (% may be omitted below).

 my ($beg, $end, $step, $fmt) = (10, 20, 0.1, "%4.1f");

 my @f = mce_grep_s { /[1234]\.[5678]/ } $beg, $end, $step, $fmt;
 my @g = mce_grep_s { /[1234]\.[5678]/ } [ $beg, $end, $step, $fmt ];

 my @h = mce_grep_s { /[1234]\.[5678]/ } {
    begin => $beg, end => $end,
    step => $step, format => $fmt
 };
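To see which strings the code block would be matching against, the formatted sequence can be generated by hand. This is a sequential sketch under the assumption that each value is formatted with sprintf as described above; computing values by index (rather than by repeated addition) is a choice made here to limit floating-point drift, not a statement about MCE's internals.

```perl
use strict;
use warnings;

my ($beg, $end, $step, $fmt) = (10, 20, 0.1, "%4.1f");

# Number of steps in the range; the small epsilon guards against
# floating-point error in the division.
my $count = int( ($end - $beg) / $step + 1e-9 );

# Compute each value from its index, then format it with sprintf.
my @seq = map { sprintf $fmt, $beg + $_ * $step } 0 .. $count;

# Same pattern as in the examples above, applied to the strings.
my @hits = grep { /[1234]\.[5678]/ } @seq;

print scalar(@hits), " matches: $hits[0] .. $hits[-1]\n";
```

Because the pattern is applied to the formatted string, the choice of format decides what can match: with "%4.1f", every value carries exactly one digit after the decimal point.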

MCE::Grep->run ( sub { code }, iterator )
mce_grep { code } iterator

An iterator reference may be specified for input_data. Iterators are described under section "SYNTAX for INPUT_DATA" at MCE::Core.

 my @a = mce_grep { $_ % 3 == 0 } make_iterator(10, 30, 2);
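The make_iterator function used above is not defined in this document. A minimal sketch, assuming the closure form described in MCE::Core, might look like the following; the argument order (begin, end, step) is inferred from the call above. The iterator is drained by hand here so that its output can be inspected without MCE.

```perl
use strict;
use warnings;

# Hypothetical iterator factory: returns a closure that yields the
# next value on each call and nothing once the range is exhausted.
sub make_iterator {
   my ($n, $max, $step) = @_;

   return sub {
      return if $n > $max;

      my $current = $n;
      $n += $step;

      return $current;
   };
}

# Drain the iterator by hand to show what it produces.
my $iter = make_iterator(10, 30, 2);
my @seen;
while ( defined( my $value = $iter->() ) ) { push @seen, $value }

# Apply the same test as the mce_grep call above.
my @by3 = grep { $_ % 3 == 0 } @seen;
print "@by3\n";
```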

MCE::Grep->finish
MCE::Grep::finish

Workers remain persistent as much as possible after running. Shutdown occurs automatically when the script terminates. Call finish when workers are no longer needed.

 use MCE::Grep;

 MCE::Grep->init(
    chunk_size => 20, max_workers => 'auto'
 );

 my @a = mce_grep { ... } 1..100;

 MCE::Grep->finish;

MCE, MCE::Core

Mario E. Roy, <marioeroy AT gmail DOT com>
2022-02-20 perl v5.32.1
