Matching numbers using Perl regex

I have a file with lines looking like this:


  Usage:524944/1000000 messages

How can I match the two numbers and extract them so I can process them later?

Let's build the regex step by step.

The string starts with "Usage:" so the regex will start like this:


  /Usage:/

There is no need to escape the colon, because in Perl 5 regexes it is not a special character.

If we know, and want to require, that "Usage:" is at the beginning of the string, we should say so explicitly by adding a caret ^ at the start of the regex:


  /^Usage:/

Maybe we don't need that, so I'll leave out the caret.
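
To see what the caret changes, here is a small sketch. The second sample line is made up for the illustration; only the first one comes from the question:

```perl
use strict;
use warnings;

# Hypothetical sample lines; the second one has text before "Usage:".
my @lines = (
    'Usage:524944/1000000 messages',
    'Current Usage:524944/1000000 messages',
);

my (@unanchored, @anchored);
for my $line (@lines) {
    push @unanchored, $line if $line =~ /Usage:/;     # matches anywhere in the string
    push @anchored,   $line if $line =~ /^Usage:/;    # matches only at the very start
}

printf "unanchored: %d, anchored: %d\n", scalar @unanchored, scalar @anchored;
# prints: unanchored: 2, anchored: 1
```

Whether the second line should count as a match depends on your data, which is why the anchor is a decision and not a default.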

That string is followed by a number that, I assume, can have any number of digits. \d matches a single digit, and the + quantifier changes that to mean "1 or more digits".


  /Usage:\d+/

So far so good, but we would like to capture and reuse the number, so we put the expression matching it in parentheses:


  /Usage:(\d+)/

This will allow code like this:


  my $str = 'Usage:524944/1000000 messages';
  if ( $str =~ /Usage:(\d+)/ ) {
     my $used = $1;
     # here we will have the 524944 in the $used variable
  }
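
The capture variable $1 is only meaningful after a successful match, which is why the example reads it inside the if block. A minimal sketch of that habit, where the second input string is a made-up line that does not match:

```perl
use strict;
use warnings;

my @used;
for my $str ('Usage:524944/1000000 messages', 'nothing to see here') {
    if ($str =~ /Usage:(\d+)/) {
        push @used, $1;    # only reached when the match succeeded
    }
}

print "@used\n";    # prints: 524944
```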

The next thing is to match the /. Because slash is the delimiter of the regular expression we need to escape that. We write:


  /Usage:(\d+)\//

which is not very nice. Luckily, in Perl 5 we can change the delimiters of a regex by putting m (which stands for "match") in front of it. This way we can use many other characters instead of the slash. I personally like a pair of curly braces around the regex because that makes it very readable:


  m{Usage:(\d+)/}
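
As a quick check that the delimiter is only a matter of notation, the sketch below runs the same pattern with three different delimiters. The variable names are chosen just for this illustration, and the # delimiter is shown only as another common option:

```perl
use strict;
use warnings;

my $str = 'Usage:524944/1000000 messages';

# In list context a successful match returns the captured groups.
my ($with_slashes) = $str =~ /Usage:(\d+)\//;    # slash must be escaped
my ($with_braces)  = $str =~ m{Usage:(\d+)/};    # no escaping needed
my ($with_hash)    = $str =~ m#Usage:(\d+)/#;    # no escaping needed either

print "$with_slashes $with_braces $with_hash\n";
# prints: 524944 524944 524944
```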

The / in the original string is followed by another multi-digit number that we want to capture again:


  m{Usage:(\d+)/(\d+)}

That is followed by a space and then the word 'messages'.


  m{Usage:(\d+)/(\d+) messages}

If we would like to make sure that nothing follows this string we can add a $ sign at the end of the regex:


  m{Usage:(\d+)/(\d+) messages$}

This ensures that a string like "Usage:524944/1000000 messages sent" won't match.
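
One detail worth knowing: $ also matches just before a newline at the very end of the string, so a line that was not chomp-ed still matches. If that is not wanted, \z matches only at the true end of the string. A small sketch comparing the two, with variable names and the third input invented for the illustration:

```perl
use strict;
use warnings;

my $dollar  = qr{Usage:(\d+)/(\d+) messages$};
my $slash_z = qr{Usage:(\d+)/(\d+) messages\z};

my @inputs = (
    'Usage:524944/1000000 messages',
    'Usage:524944/1000000 messages sent',
    "Usage:524944/1000000 messages\n",    # hypothetical line that was not chomp-ed
);

my (@with_dollar, @with_z);
for my $str (@inputs) {
    push @with_dollar, ($str =~ $dollar  ? 1 : 0);
    push @with_z,      ($str =~ $slash_z ? 1 : 0);
}

print "\$:  @with_dollar\n";    # prints: $:  1 0 1
print "\\z: @with_z\n";         # prints: \z: 1 0 0
```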

If we now go back to the earlier example, we could also use the ^ at the beginning:


  m{^Usage:(\d+)/(\d+) messages$}

This ensures that nothing comes before "Usage" and nothing comes after "messages".

Depending on the situation that might be a good or a bad thing. I'll remove those for the final example as I think we did not want to enforce that:


  my $str = 'Usage:524944/1000000 messages';
  if ( $str =~ m{Usage:(\d+)/(\d+) messages} ) {
     my ($used, $total) = ($1, $2);
     # here we will have the 524944 in the $used variable
     # and 1000000 in $total.
  }
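
The question asked how to process the numbers later. As one possible sketch, the script below collects a usage percentage for every matching line. The input list and the report format are assumptions made for the example; in a real script the lines would probably come from a file:

```perl
use strict;
use warnings;

# Hypothetical input lines.
my @lines = (
    'Usage:524944/1000000 messages',
    'some unrelated line',
    'Usage:10/200 messages',
);

my @report;
for my $line (@lines) {
    # In list context a successful match returns the captured groups;
    # a failed match returns an empty list, so we skip that line.
    my ($used, $total) = $line =~ m{Usage:(\d+)/(\d+) messages}
        or next;
    next if $total == 0;    # avoid division by zero
    push @report, sprintf('%d of %d (%.1f%%)', $used, $total, 100 * $used / $total);
}

print "$_\n" for @report;
# prints:
# 524944 of 1000000 (52.5%)
# 10 of 200 (5.0%)
```

Assigning the match directly to a list of variables is an alternative to copying $1 and $2 by hand; both do the same job here.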

Extra flexibility

I am not sure how much flexibility we need here. Maybe the space after the second number can be several spaces, or even tabs. To allow for that we can use \s+, which matches one or more white space characters.


  m{^Usage:(\d+)/(\d+)\s+messages$}

Maybe there can be spaces and even tabs between the : and the first digit. If we want to allow for that, we can add \s* where the * is a quantifier meaning "0 or more".


  m{^Usage:\s*(\d+)/(\d+)\s+messages$}

We could even try to make this more readable by adding the x modifier at the end. With it, we can add spaces and comments inside the regex to make it look nicer:


  m{^Usage:
    \s*
    (\d+)/(\d+)    # used / total
    \s+messages
   $}x
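
To check what the more flexible regex accepts and rejects, here is a short sketch. All inputs except the first are invented variations:

```perl
use strict;
use warnings;

my $re = qr{^Usage:
    \s*
    (\d+)/(\d+)    # used / total
    \s+messages
   $}x;

my @inputs = (
    'Usage:524944/1000000 messages',          # original format
    "Usage: \t524944/1000000 \t messages",    # extra spaces and tabs
    'Usage:524944/1000000messages',           # no space before "messages"
    'Usage:524944 / 1000000 messages',        # spaces around the slash
);

my @matched = ();
for my $str (@inputs) {
    push @matched, ($str =~ $re ? 1 : 0);
}

print "@matched\n";    # prints: 1 1 0 0
```

The last two inputs fail because the regex requires at least one white space character before "messages" and allows none around the slash.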


The data changes

That's great, but at some point the input changed. It now looks like this:


  Usage:524,944 of 1,000,000 messages

Instead of trying to change our regex, let's start from scratch.

This time we use the /x modifier from the start, so we have room to lay out the regex. The beginning is the same, and we allow for optional white space after the colon:


  /^Usage:\s*/x

Then comes something that contains digits and commas. We create a character class that matches a single character that is either a digit or a comma, [\d,], and apply the + quantifier to it:


  /^Usage:\s*
   [\d,]+
  /x

We would like to capture that number, so we put it in parentheses and add a comment for clarification.


  /^Usage:\s*
   ([\d,]+)     # used
  /x
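
One thing to be aware of: the character class [\d,]+ is permissive, so it also accepts strings such as ",,," or "12,34". If that matters, a stricter pattern can require groups of three digits after each comma. The following is only a sketch of that idea, not part of the original solution; the $loose and $strict names and the last two inputs are made up for the illustration:

```perl
use strict;
use warnings;

my $loose  = qr/^[\d,]+$/;
# Either 1-3 digits followed by one or more ",ddd" groups, or plain digits.
my $strict = qr/^(?:\d{1,3}(?:,\d{3})+|\d+)$/;

my @inputs = ('524,944', '1,000,000', ',,,', '12,34');

my (@loose_ok, @strict_ok);
for my $str (@inputs) {
    push @loose_ok,  ($str =~ $loose  ? 1 : 0);
    push @strict_ok, ($str =~ $strict ? 1 : 0);
}

print "loose:  @loose_ok\n";     # prints: loose:  1 1 1 1
print "strict: @strict_ok\n";    # prints: strict: 1 1 0 0
```

For input that is known to be well formed, the simpler character class is usually enough.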

After the number there is a space, but because of the /x modifier Perl ignores literal spaces in the pattern, so we have to match it explicitly. We take some liberty here and use \s+ on both sides of the word "of", which allows tabs as well as spaces.


  /^Usage:\s*
   ([\d,]+)      # used
   \s+of\s+
  /x
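
The effect of /x on literal spaces is easy to overlook, so here is a small sketch of three ways to write the same fragment. The variable names are chosen only for this illustration:

```perl
use strict;
use warnings;

my $str = 'Usage:524,944 of 1,000,000 messages';

# Under /x the unescaped spaces are ignored, so this looks for "944of1".
my $literal_space = $str =~ /944 of 1/x           ? 1 : 0;
# A backslash-escaped space is still a literal space under /x.
my $escaped_space = $str =~ /944\ of\ 1/x         ? 1 : 0;
# \s+ matches one or more white space characters.
my $class_space   = $str =~ /944 \s+ of \s+ 1/x   ? 1 : 0;

print "$literal_space $escaped_space $class_space\n";    # prints: 0 1 1
```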

In the input, the word "of" is followed by another number with commas:


  /^Usage:\s*
   ([\d,]+)      # used
   \s+of\s+
   ([\d,]+)      # total
  /x

The second number is followed by one or more spaces and the word "messages":


  /^Usage:\s*
   ([\d,]+)      # used
   \s+of\s+
   ([\d,]+)      # total
   \s+messages
  /x

This can be used in the if statement:


  my $str = 'Usage:524,944 of 1,000,000 messages';
  if ($str =~ /^Usage:\s*
               ([\d,]+)      # used
               \s+of\s+
               ([\d,]+)      # total
               \s+messages
              /x) {
     my ($used, $total) = ($1, $2);
     ...
  }


Finally, if we would like to use the values as numbers, we can remove the commas with two global substitutions (inside the if block, where $used and $total are in scope):


  $used =~ s/,//g;
  $total =~ s/,//g;
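
Putting the pieces together, a complete script might look like the sketch below. Declaring the variables before the if block and computing the remaining messages are choices made for this example, not requirements:

```perl
use strict;
use warnings;

my $str = 'Usage:524,944 of 1,000,000 messages';

my ($used, $total);
if ($str =~ /^Usage:\s*
             ([\d,]+)      # used
             \s+of\s+
             ([\d,]+)      # total
             \s+messages
            /x) {
    ($used, $total) = ($1, $2);
    # Remove the thousands separators from both values.
    s/,//g for $used, $total;
}

my $remaining = $total - $used;
print "$remaining\n";    # prints: 475056
```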

Source

The original question was brought up by Russell Johnson on the mailing list of the Portland Perl Mongers.



Published on 2012-02-16 by Gabor Szabo
