Extracting data from a file with multi-line records

Lots of software and devices generate log files in which each record of data spans multiple lines. If the file is too big to fit into memory - and log files can be huge - then we have no choice but to read the file line-by-line, recognize the records manually, and process each record after we have collected all the lines belonging to it.

A log file with multi-line records

This is a "pseudo file" that resembles the generic case.

There might be some data at the beginning - several lines of Header.

Then, each record or section starts with some recognizable line. There is usually some kind of a string or character that marks the beginning of each section of data. I used the Start word in this example.

Then, there are lines of data and finally the section ends with another recognizable string. In our case this is the End word.

There can even be some data between the sections, which we call Garbage.


  Header

  Start
  data 1
  End

  Garbage

  Start
  data 2
  data 3

  End

The simple but long solution

Let's see how we can extract the sections from this file. This is the longer solution, but the one that might be easier to understand:


  use strict;
  use warnings;
  use v5.10;
  use autodie;

  my $file = shift or die "Usage: $0 FILENAME\n";
  open my $fh, '<', $file;

  my $in_section;
  my @data;
  while (my $line = <$fh>) {
    if ($line =~ /^Start/) {
      $in_section = 1;
      next;
    }

    if ($line =~ /End/) {
      $in_section = 0;
      process_data();
      next;
    }
    if ($in_section) {
      push @data, $line;
      next;
    }
    # arrives here only outside of sections (Header, Garbage)
    # shall we report those lines or just disregard them?
  }

  sub process_data {
    return if not @data;
    say 'Data:';
    print @data;
    say '-' x 10;
    @data = ();
  }

After the compulsory boilerplate header, and after getting the file name and opening the file, we declare two global variables. One of them, @data, will hold the lines of the current section. It starts out empty. The other one, $in_section, is a flag that indicates whether the current line is within a section. It starts out as undef, which counts as false.

The while loop goes over the file line-by-line and has 3 parts. The first part recognizes the beginning of the section and sets the $in_section flag to some true value (1 in our case).

The second part recognizes when the section ends. It turns off the $in_section flag and calls the process_data() subroutine that will, well, process the data in @data.

The third if statement will be true only when we are inside a section, and the only thing it does is save the current line in the @data array.

In all 3 cases we call next at the end of the if block as we have finished processing the current line. So we go for the next line.

What is left is dealing with the header and the garbage. Depending on the task you might want to disregard any data outside of the sections or warn about the fact that you encountered such data.
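If you decide to report the lines outside of the sections, something along these lines might do. This is only a sketch: the find_stray_lines name, the decision to skip blank lines, and the in-memory sample file are assumptions made for the illustration. It uses $. to include the line number in the report.

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: returns the lines found outside of any
# Start/End section, each prefixed with its line number.
sub find_stray_lines {
    my ($fh) = @_;
    my $in_section = 0;
    my @stray;
    while (my $line = <$fh>) {
        my $line_number = $.;
        if ($line =~ /^Start/) { $in_section = 1; next; }
        if ($line =~ /End/)    { $in_section = 0; next; }
        next if $in_section;
        next if $line =~ /^\s*$/;    # assumption: blank lines are not worth reporting
        chomp $line;
        push @stray, "$line_number: $line";
    }
    return @stray;
}

# An in-memory file stands in for the real log file.
my $sample = "Header\n\nStart\ndata 1\nEnd\n\nGarbage\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
warn "Unexpected line $_\n" for find_stray_lines($fh);
```

With the sample above this warns about the Header on line 1 and the Garbage on line 7.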

As you can see the process_data function acts on the global @data array. Our version does not do much, just prints the data to the screen, but there is an important part in that subroutine. At the end we remove all the content from the @data array so lines from one section won't be included in the next section.
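If relying on a global array feels uncomfortable, one possible variation is to pass the lines to the subroutine and let the caller empty the array. The sketch below assumes a hypothetical format_section function that returns the text instead of printing it. It is not part of the solution above, just one way to reorganize it.

```perl
use strict;
use warnings;
use v5.10;

# Variant of process_data() that receives the lines as arguments and
# returns the formatted text, so it does not touch any outer variable.
sub format_section {
    my (@lines) = @_;
    return '' if not @lines;
    return join '', "Data:\n", @lines, '-' x 10, "\n";
}

# An in-memory file stands in for the real log file.
my $sample = "Header\nStart\ndata 1\nEnd\nGarbage\nStart\ndata 2\ndata 3\nEnd\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my $in_section = 0;
my @data;
while (my $line = <$fh>) {
    if ($line =~ /^Start/) { $in_section = 1; next; }
    if ($line =~ /End/) {
        $in_section = 0;
        print format_section(@data);
        @data = ();    # the caller is now responsible for the reset
        next;
    }
    push @data, $line if $in_section;
}
```

The reset of @data is still there, it has only moved from the subroutine into the loop.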

Obviously, recognizing the beginning and the end of the section will be different and probably more difficult in your case, but at least now you have a skeleton for the solution.

.. the flip-flop operator

Here is another, shorter solution:


  use strict;
  use warnings;
  use v5.10;
  use autodie;

  my $file = shift or die "Usage: $0 FILENAME\n";
  open my $fh, '<', $file;

  my @data;
  while (my $line = <$fh>) {
    if ($line =~ /^Start/ .. $line =~ /End/) {
      next if $line =~ /^Start/;
      if ($line =~ /End/) {
        process_data();
      } else {
        push @data, $line;
      }
      next;
    }
    # report (or not) Header and Garbage
  }

  sub process_data {
    return if not @data;
    say 'Data:';
    print @data;
    say '-' x 10;
    @data = ();
  }

We have eliminated the need for the $in_section flag and changed the content of the while loop. We are using the .. operator, which is also called the range operator or, in this case, the flip-flop operator.

We created an if-statement with two conditions and .. between the two. This if-statement will be true exactly within the sections (between the Start and End tags) including both the line with the Start and the line with the End.

At first, the if-statement with the flip-flop checks only the first condition. As long as that is false (in the Header) the whole if-statement will be false.

Once the first condition becomes true - meaning on the "Start" line - the whole if-statement becomes true. From that point on, each time the if-statement is executed it checks only the second condition. As long as the second condition is false, the whole if-statement remains true. On the line where the second condition is true - meaning on the "End" line - the if-statement is still true, but for the lines that follow it is false again and it goes back to checking the first condition.

Basically the first condition is the "on" button and the second condition is the "off" button.
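To watch the "on" and "off" buttons at work, here is a small sketch that labels each line according to the flip-flop. The trace_flip_flop name and the list of sample lines are made up for this demonstration.

```perl
use strict;
use warnings;
use v5.10;

# Returns one "on"/"off" label per input line, showing when the
# flip-flop condition is true.
sub trace_flip_flop {
    my (@lines) = @_;
    my @states;
    for my $line (@lines) {
        my $state = 'off';
        if ($line =~ /^Start/ .. $line =~ /End/) {
            $state = 'on';
        }
        push @states, $state;
    }
    return @states;
}

my @lines  = ('Header', 'Start', 'data 1', 'End', 'Garbage');
my @states = trace_flip_flop(@lines);
printf "%-8s %s\n", $lines[$_], $states[$_] for 0 .. $#lines;
```

The Start, data 1 and End lines are labeled "on", while Header and Garbage are labeled "off".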

In the block of the if-statement we have to deal with both the "Start" line that we just skip and the "End" line that triggers the processing of the data we collected.
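As an alternative to matching the Start and End lines a second time, we could look at the value the flip-flop returns. According to perlop it is a sequence number starting at 1, and on the last line of the range it has the string "E0" appended. The following is a sketch of that idea, not a replacement for the solution above. The read_sections name is an assumption, and it collects all the sections in memory only to keep the example short. For a huge file you would process each section where it is pushed.

```perl
use strict;
use warnings;
use v5.10;

# Collects the sections of a filehandle using the value returned by
# the flip-flop: a sequence number starting at 1, with "E0" appended
# on the last line of the range.
sub read_sections {
    my ($fh) = @_;
    my (@sections, @data);
    while (my $line = <$fh>) {
        my $seq = ($line =~ /^Start/ .. $line =~ /End/);
        next if not $seq;       # outside of any section
        next if $seq == 1;      # the Start line
        if ($seq =~ /E0$/) {    # the End line
            push @sections, [@data];
            @data = ();
        } else {
            push @data, $line;
        }
    }
    return @sections;
}

# An in-memory file stands in for the real log file.
my $sample = "Header\nStart\ndata 1\nEnd\nGarbage\nStart\ndata 2\ndata 3\nEnd\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
for my $section (read_sections($fh)) {
    say 'Data:';
    print @$section;
    say '-' x 10;
}
```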

A more idiomatic way using $_

As mentioned by a couple of people in the comments, the while loop of the second solution can be written in a more idiomatic way if, instead of the explicit $line variable, we rely on $_, the implicit topic variable. After the change the code will look like this:


  my @data;
  while (<$fh>) {
    if (/^Start/ .. /End/) {
      next if /^Start/;
      if (/End/) {
        process_data();
      } else {
        push @data, $_;
      }
      next;
    }
    # report (or not) Header and Garbage
  }

Both the read-line operator and all the regexes work by default on $_. The only statement that does not default to using that variable is push. Hence we had to explicitly write push @data, $_.

I am usually quite against any code that uses $_ explicitly but maybe in this case the code that uses $_ is clearer.

(Thanks to Gego for suggesting this)

Missing start or end-marks

In the above solution we assumed that all the "Start" and "End" marks are there and have not dealt with cases when some of them might be missing. It could be an interesting add-on to deal with those cases as well.
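As a starting point, here is a sketch of how such a check could look. It only reports the problems and does not try to recover from them. The check_marks name and the wording of the messages are assumptions made for this example.

```perl
use strict;
use warnings;
use v5.10;

# Returns a list of problems: a Start inside an open section, an End
# without a Start, or a section still open at the end of the file.
sub check_marks {
    my ($fh) = @_;
    my $start_line = 0;    # line number of the currently open Start, 0 if none
    my @problems;
    while (my $line = <$fh>) {
        my $n = $.;
        if ($line =~ /^Start/) {
            push @problems, "line $n: Start while the section from line $start_line is still open"
                if $start_line;
            $start_line = $n;
        }
        elsif ($line =~ /End/) {
            push @problems, "line $n: End without Start" if not $start_line;
            $start_line = 0;
        }
    }
    push @problems, "end of file: the section from line $start_line was never closed"
        if $start_line;
    return @problems;
}

# An in-memory file with one of each problem.
my $sample = "Start\ndata 1\nStart\ndata 2\nEnd\nEnd\nStart\ndata 3\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
say for check_marks($fh);
```

For the sample above it reports the second Start on line 3, the extra End on line 6, and the section opened on line 7 that is never closed.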

Records without obvious end-marks

There are also files where the records have no (obvious) "End" marks, and the beginning of a new record marks the end of the previous one. We'll deal with those in a separate article.
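Just to give a taste of the idea, here is a minimal sketch under the assumption that every record begins with a line starting with Start and that anything before the first such line can be ignored. The read_records name is made up, and for a really big file you would process each record instead of collecting them all.

```perl
use strict;
use warnings;
use v5.10;

# Splits a filehandle into records where each Start line closes the
# previous record and the end of the file closes the last one.
sub read_records {
    my ($fh) = @_;
    my (@records, $current);
    while (my $line = <$fh>) {
        if ($line =~ /^Start/) {
            push @records, $current if $current;    # a new Start closes the previous record
            $current = [];
            next;
        }
        push @$current, $line if $current;    # lines before the first Start are ignored
    }
    push @records, $current if $current;      # the last record ends at the end of the file
    return @records;
}

# An in-memory file stands in for the real log file.
my $sample = "Header\nStart\ndata 1\nStart\ndata 2\ndata 3\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
for my $record (read_records($fh)) {
    say 'Data:';
    print @$record;
    say '-' x 10;
}
```

The important detail is the extra push after the loop. Without it the last record would be lost.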

Official Documentation

The official documentation of the flip-flop operator can be found in perlop under the title Range Operators.

Published on 2012-03-10 by Gabor Szabo