Is it possible that the pattern may occur more than once per line? Early versions of Fredrik's script only counted the first occurence per line before he switched to findall(). I don't know if it's possible for the Referer [sic] field in the log file to contain the pattern, either naturally or maliciously.
I'm not surprised that the command line version using egrep/awk/sort/uniq is one of the fastest. These programs are in the UNIX tradition of doing one or a few things well. They also divide the problem into nicely modular pieces that are easy to debug.
The Macs seem to perform pretty badly on the parallel disk access tests; wf-5 and wf-6 doesn't seem to buy you much over a single-threaded program on the Mac, but gives major speedups on multicore Windows and Unix boxes.
(and for the record, the egrep/awk pipeline is the fastest also on Windows, but the result doesn't look entirely correct:
(c:\bin\uniq.exe 1000) Exception trapped! (c:\bin\uniq.exe 1000) exception C0000005 at 10011E58 (c:\bin\uniq.exe 1000) exception: ax 0 bx 0 cx FFFFFFFF dx 0 (c:\bin\uniq.exe 1000) exception: si 0 di 240EED0 bp 240EE84 sp 240EE7C (c:\bin\uniq.exe 1000) exception is: STATUS_ACCESS_VIOLATION
mxTextTools' support for the buffer interface was removed when adding the Unicode support. We will re-add it again in one of the future versions.
BTW: You can achieve further speedups in the dalke-wf-9.py version by using the new mxTextTools CharSet() for DIGITS, ie. DIGITS = CharSet('0123456789').
IsIn will do a sequential search through the set string, while CharSet() uses a bitmap for faster access. You then use IsInCharSet instead of IsIn.
I managed to shave a third off the command line version which would make it the fastest if your result was scaled by the same amount: http://paddy3118.blogspot.com/2007/10/wide-finder-on-command-line.html
The main time saver was to use gawk to do the counting instead of sort+uniq.
Game developers who live and die on disk access speed have known for years that careful sequential reading is faster than mmap - this shouldn't surprise you.
9 comments:
Interesting article, thanks.
Is it possible that the pattern may occur more than once per line? Early versions of Fredrik's script only counted the first occurence per line before he switched to findall(). I don't know if it's possible for the Referer [sic] field in the log file to contain the pattern, either naturally or maliciously.
Cheers, Ralph Corderoy.
Tim's original regexp doesn't check to see if the match was in the correct field, and pretty much everyone is keeping with that assumption.
I don't know if the referer field goes through a normalization set (eg, " " to "+", and using % escapes), so I would use the user-agent field.
Give it a go and see if Tim mentions it in the future! :)
I'm not surprised that the command line version using egrep/awk/sort/uniq is one of the fastest. These programs are in the UNIX tradition of doing one or a few things well. They also divide the problem into nicely modular pieces that are easy to debug.
Nice post,
James Thiele
The Macs seem to perform pretty badly on the parallel disk access tests; wf-5 and wf-6 doesn't seem to buy you much over a single-threaded program on the Mac, but gives major speedups on multicore Windows and Unix boxes.
(and for the record, the egrep/awk pipeline is the fastest also on Windows, but the result doesn't look entirely correct:
(c:\bin\uniq.exe 1000) Exception trapped!
(c:\bin\uniq.exe 1000) exception C0000005 at 10011E58
(c:\bin\uniq.exe 1000) exception: ax 0 bx 0 cx FFFFFFFF dx 0
(c:\bin\uniq.exe 1000) exception: si 0 di 240EED0 bp 240EE84 sp 240EE7C
(c:\bin\uniq.exe 1000) exception is: STATUS_ACCESS_VIOLATION
;-)
Thanks for the nice article, Andrew !
mxTextTools' support for the buffer interface was removed when adding the Unicode support. We will re-add it again in one of the future versions.
BTW: You can achieve further speedups in the dalke-wf-9.py version by using the new mxTextTools CharSet() for DIGITS, ie. DIGITS = CharSet('0123456789').
IsIn will do a sequential search through the set string, while CharSet() uses a bitmap for faster access. You then use IsInCharSet instead of IsIn.
I managed to shave a third off the command line version which would make it the fastest if your result was scaled by the same amount:
http://paddy3118.blogspot.com/2007/10/wide-finder-on-command-line.html
The main time saver was to use gawk to do the counting instead of sort+uniq.
- Paddy.
Game developers who live and die on disk access speed have known for years that careful sequential reading is faster than mmap - this shouldn't surprise you.
But I'm not a game developer. :)
disqus.com can give you a comment system that integrates much better with you current page without requiring you to run dynamic code.
Post a Comment