Help-Site Computer Manuals
  Algorithms & Data Structures   Programming Languages   Revision Control
  Cameras   Computers   Displays   Keyboards & Mice   Motherboards   Networking   Printers & Scanners   Storage
  Windows   Linux & Unix   Mac

sort huge lists

Sort::External - sort huge lists


    for my $input_level ( 0 .. $#{ $self->{sortfiles} } ) {

        if ($final) {



        elsif ( @{ $self->{sortfiles}[$input_level] } > MAX_SORTFILES ) {





# merge multiple sortfiles sub _consolidate_one_level { my ( $self, $input_level ) = @_; my $sortsub = $self->{-sortsub};

    # create a holder for the output filehandles if necessary.

    $self->{sortfiles}[ $input_level + 1 ] ||= [];

    # if there's only one sortfile on this level, kick it up and return

    if ( @{ $self->{sortfiles}[$input_level] } == 1 ) {

        push @{ $self->{sortfiles}[ $input_level + 1 ] },

            @{ $self->{sortfiles}[$input_level] };

        $self->{sortfiles}[$input_level] = [];



    # get a new outfile

    my $outfile = File::Temp->new(

        DIR    => $self->{workdir},

        UNLINK => 1


    # build an array of Sort::External::Buffer objects, one per filehandle

    my @in_buffers;

    for my $tf_handle ( @{ $self->{sortfiles}[$input_level] } ) {

        seek( $tf_handle, 0, 0 );

        my $buff = Sort::External::Buffer->_new( tf_handle => $tf_handle, );

        push @in_buffers, $buff;



=begin comment

Our goal is to merge multiple sortfiles, but we don't have access to all the elements in a given sortfile, only to Buffer objects which allow us to ``see'' a limited subset of those elements.

A traditional way to merge several files is to expose one element per file, compare them, and pop one item per pass. However, it is more efficient, especially in Perl, if we sort batches.

It would be convenient if we could just dump all the buffered elements into a batch and sort that, but we can't just pool all the buffers, because it's possible that the elements in buffer B are all higher in sort order than all the elements we can ``see'' in buffer A, plus some more elements in sortfile A we can't ``see'' yet. We need a gatekeeper algo which allows the maximum number of elements into a sortbatch, while blocking elements which we can't be certain belong in this sortbatch.

Our gatekeeper needs to know a cutoff point. To find it, we need to examine the last element we can ``see'' in each buffer, and figure out which of these endposts is the lowest in sort order. Once we know that cutoff, we can compare it against items in the other buffers, and any element which is lower than or equal to the cutoff in sort order gets let through into the batch.