As I was working on the “Working with Bits” chapter of Mastering Perl, I thought about the prime number checker from Abigail. I’ve known about that for years. Abigail came up with it in 1998, and it’s listed in the CPAN JAPH file. I never bothered to look into how it worked since I figured it was something clever:
% perl -E 'say "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' 1234567
His use of 1
made me think there might be something interesting to say about bit vectors, but that turned out not to be so. I could use any character (or sequence) in place of the 1
.
Still, I’ll break it down. The program has this structure:
say "Prime" if STRING !~ REGEX
The !~
is a negated match. The condition is true if the string does not match the regular expression. If the string matches, then it’s a composite number so the condition is false. It might make you feel better to negate the condition with unless
instead:
say "Prime" unless STRING =~ REGEX
Abigail’s pattern looks a bit more complicated than it really is because of the 1
character being both a literal and a back reference:
^1?$|^(11+?)\1+$
When I see regexes where I don’t immediately see what is going on, I look for some hook in it so I can break it down. In this regex, there’s the alternation, and both sides of the alternation use the beginning of line (^
) and end of line anchors ($
). I suspect Abigail did that for symmetry and to avoid the parentheses he’d need to solve the precedence problem. Otherwise, it would look like this, spread out a bit:
^ (?: 1? | (11+?) \1+ ) $
Assuming that the anchors match, the inside portion is two branches:
1? | (11+?) \1+
It’s one of these:
1?
(11+?) \1+
The first one is easy. It’s zero or one characters, and nothing more. That matches the trivial condition where the input number is 1
. If it’s not exactly one character, the regex tries the other side of the alternation:
(11+?) \1+
There are two parts to this: the capture and the back reference. The back reference can match one or more times. So, it’s looking for complete groups, and nothing more, of the same thing. The actual thing we match hardly matter. It comes down being able to break up the string evenly into groups.
For example, I take the number nine. The string for that is a sequence of nine 1‘s:
111111111
Can I break up that string into equal groups with nothing left over? Sure; there are three groups of three:
111 111 111
The number of groups represent one factor, and the length of any one of the groups represents another factor. Since it finds two factors, the number must be composite, so not prime.
I can do it again with seven:
1111111
There isn’t a way to create equally-sized groups. The regex tries the longest string it can for the capture, then checks that it can match the back reference one or more times. It keeps failing:
111111 1
11111 11
1111 111
111 111 1
11 11 11 1
Thus, the pattern doesn’t match 1111111
, so the number is prime because there’s not another factor.
That’s the basic idea, but the regex engine gets there in many more steps. I can see how it tries the longest first group, fails, and keeps backtracking. I ran this under Regexp::Debugger.
That’s all nice, but there’s nothing about bit vectors there. Oh well.