[Discuss] 'C' string tokenizer for those who hate strtok

David Bronaugh dbronaugh at linuxboxen.org
Fri Jun 30 10:26:22 PDT 2006


Brian Quinlan wrote:
> Whether they are significantly slower at runtime depends on the 
> algorithm that you are expressing and how your code is written, e.g. in 
> this case, my Python code can split a 3,889 character string into 1000 
> substrings 10000 times in 3.89 seconds. Your C code takes 13x longer. 
> I would expect that all the calls to malloc are killing you - Python 
> pre-allocates memory in medium-size chunks and manages its own pools 
> so it probably ran my entire test using a single malloc call where the 
> C code required 1000 * 2 * 10000 malloc calls (and a corresponding 
> number of free calls).
>
> But I agree that C code can always be made to be faster than Python 
> code if you are willing to spend enough time optimizing it. In this 
> case, you could pre-allocate len(string) * 2 bytes to store the tokens.
You could also use strtok_r...
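
As a rough sketch of what that could look like (split_in_place is a 
hypothetical helper, not code from pw's tokenizer), strtok_r can cut a 
caller-owned buffer in place and hand back pointers into it, so no 
per-token malloc is needed. This assumes a POSIX system and a caller 
that supplies a writable copy of the string:

```c
#define _POSIX_C_SOURCE 200809L  /* exposes strtok_r under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * split_in_place is a hypothetical helper, not code from the thread.
 * It cuts buf into tokens by letting strtok_r overwrite delimiters with
 * '\0', and stores a pointer to each token in out[]. At most max_tokens
 * pointers are written. Nothing is allocated, and buf is modified.
 * Returns the number of tokens stored.
 */
static size_t split_in_place(char *buf, const char *delims,
                             char **out, size_t max_tokens)
{
    assert(buf != NULL && delims != NULL && out != NULL);

    size_t n = 0;
    char *save = NULL;  /* strtok_r keeps its position here, not in a static */

    for (char *tok = strtok_r(buf, delims, &save);
         tok != NULL && n < max_tokens;
         tok = strtok_r(NULL, delims, &save)) {
        out[n++] = tok;
    }
    return n;
}
```

Note that strtok_r treats a run of delimiters as one separator, so empty 
fields are skipped. If the original string has to survive, the caller 
makes one copy per string up front, which is still one allocation 
instead of one or two per token.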

However, I'm curious about your testing:
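 - First, pw's Makefile specifies conflicting optimization flags. The last 
one is -O0, and as far as I know the last -O option on the command line is 
the one that takes effect (GCC documents this), so the code was being 
built unoptimized. I changed that first.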
 - Second, it seems that more than half the time spent in the C code is 
spent calling printf. I removed the printf call and the program runs 
more than twice as fast.
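
One way to keep output out of the measurement is to count tokens 
instead of printing them. The following is only a sketch under stated 
assumptions: time_tokenizing is a hypothetical helper, it uses strtok_r 
rather than pw's tokenizer, and the caller is assumed to pass a scratch 
buffer of at least strlen(text) + 1 bytes. It reports CPU time from 
clock(), which includes the per-run memcpy:

```c
#define _POSIX_C_SOURCE 200809L  /* exposes strtok_r under strict -std modes */
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical benchmark helper, not code from the thread.
 * Tokenizes a fresh copy of text reps times, counts the tokens rather
 * than printing them, and returns the CPU seconds used.
 * scratch must hold at least strlen(text) + 1 bytes.
 */
static double time_tokenizing(const char *text, const char *delims,
                              char *scratch, int reps, size_t *total_tokens)
{
    size_t len = strlen(text) + 1;  /* include the terminating '\0' */
    size_t count = 0;
    clock_t start = clock();

    for (int i = 0; i < reps; i++) {
        char *save = NULL;
        memcpy(scratch, text, len);  /* strtok_r destroys its input */
        for (char *tok = strtok_r(scratch, delims, &save);
             tok != NULL;
             tok = strtok_r(NULL, delims, &save)) {
            count++;  /* count instead of printf, so I/O is not timed */
        }
    }

    clock_t end = clock();
    *total_tokens = count;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Returning the token count through total_tokens also gives the compiler 
a reason to keep the loop when optimization is turned on.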

My test input is up at http://bronaugh.linuxboxen.org/test.txt -- can 
you please make these changes and compare? For that matter, rip the 
print out of the Python, for fairness.

Keep in mind that I just wanted to clean up the current code, not 
rearchitect it. My own opinion is: just use strtok_r.

David.

