Mike Haertel <mike_at_ducky.net> writes: > Dag-Erling Smørgrav <des_at_des.no> writes: > > You don't really need to "isolate the containing line" unless you have > > an actual match, do you? > Theoretically no. However, suppose the pattern was /foo.*blah/. > The Boyer-Moore search will be for "blah", since that's the longest > fixed substring. But verifying a match for the full regexp either > requires a regexp matcher with the feature "start here, at THIS point > in the middle of the RE and THAT point in the middle of the buffer, > and match backwards and forwards", or else running a more standard > RE matcher starting from the beginning of the line. Stupidly enough, I only considered the case where the fixed search string is at the end of the regexp :) For the general case, you'd have to either build a second FSA that mirrors the first, or design your engine and data structures in such a way that the FSA can be traversed in either direction. Then you note which states correspond to the beginning and end of the fixed substring, and run the FSA forward from the end and backward from the beginning. That could prove advantageous when the input consists of very long lines and the frequency of partial matches is relatively high; large files with very long lines are very common in bioinformatics. > As you mentioned, you can avoid searching forwards for the next > newline if your RE matcher supports using newline as an exit marker. Umm, I don't understand why it needs to support using *anything* as an exit marker. The newline character will simply not match any of the transitions from whichever state the FSA is currently in, and neither will the null character. The only bit that needs to know about them is the bit that handles end-of-line and end-of-word anchors. DES -- Dag-Erling Smørgrav - des_at_des.noReceived on Mon Aug 23 2010 - 06:12:14 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:06 UTC