PHP's in_array is slow - this works faster

Tags: programming, webdevelopment.
By lucb1e on 2012-10-03 22:50:33 +0100

I think this is best explained by example, so here's a simple script that loads a file into memory while removing duplicate lines:
<?php
    $handle = fopen("myfile.txt", "r");
    $lines = array();
    // !== false, because a line containing just "0" would otherwise stop the loop early
    while (($line = fgets($handle)) !== false) {
        if (!in_array($line, $lines)) {
            $lines[] = $line;
        }
    }
    fclose($handle);


This works fine if you don't mind waiting a minute or ten while it chews through all those millions of lines. If you're like me, you probably want it to run in under thirty seconds.

"The fastest query is no query at all," but that's not an option. Duplicate removal needs to be done, and by the server. The application is slow enough as it is (millions of lines remember?), so I had to find another solution...

PHP's in_array is a compiled function; it will almost certainly be faster than any replacement I could write in PHP itself. But there had to be a way... Proving once again that thinking is best done away from your pc, I thought of a solution while fetching a bokkepoot from the kitchen :)

$handle = fopen("myfile.txt", "r");
$lines = array();     // the deduplicated lines
$hadlines = array();  // index: every line we have seen so far, stored as an array key
while (($line = fgets($handle)) !== false) {
    if (!isset($hadlines[$line])) {
        $lines[] = $line;
        $hadlines[$line] = true;
    }
}
fclose($handle);


This way, I went from processing 1100 lines per second to 3000 lines per second: great!
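If you want to see the difference on your own machine, here's a hypothetical microbenchmark (the numbers above came from my real dataset, not from this sketch):
// Synthetic input: every line occurs twice, so half are duplicates
$n = 20000;
$test = array();
for ($i = 0; $i < $n; $i++)
    $test[] = "line number " . ($i % ($n / 2)) . "\n";

// The slow way: in_array scans the whole $lines array for every input line
$start = microtime(true);
$lines = array();
foreach ($test as $line)
    if (!in_array($line, $lines))
        $lines[] = $line;
printf("in_array: %.3f seconds\n", microtime(true) - $start);

// The fast way: isset does a hashtable lookup instead
$start = microtime(true);
$lines = array();
$hadlines = array();
foreach ($test as $line) {
    if (!isset($hadlines[$line])) {
        $lines[] = $line;
        $hadlines[$line] = true;
    }
}
printf("isset:    %.3f seconds\n", microtime(true) - $start);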

But why does it work? PHP arrays are implemented as hashtables, and they're called that for a reason: they use hashing. The hashing algorithm probably isn't like any you've heard of; it actually produces many collisions. Usually collisions are bad, and they are here too, but the algorithm is heavily optimized for speed rather than output distribution.

What it does is basically this:
You: $key = "test"; $array[$key] = true;
PHP: RAM[hash($key)] = true;

The hash() function returns a number, which is how you can very easily map the string "test" to a memory location. The hashing algorithm is also nothing like md5 or sha; it's actually a very short function. And by that, I really mean very short:
/* DJBX33A ("times 33") by Daniel J. Bernstein; note it starts at 5381, not 0 */
unsigned long hash(const char *key) {
    unsigned long h = 5381;
    while (*key)
        h = 33 * h + *key++;
    return h;
}


The distribution of the output is also not as great as MD5's, and collisions occur a lot. If I recall correctly, colliding entries are stored in a linked list per bucket; the thing is that you hardly even notice this as long as you don't have billions of those collisions. But when someone deliberately creates lots of collisions, every lookup degrades to scanning that list, and you get the hashdos effect (so a DoS vulnerability).
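You can reproduce such a collision yourself. Here's the times-33 hash ported to PHP (just my illustration; the real thing runs in C inside the engine), fed the well-known colliding pair "Ez" and "FY":
function djbx33a($key) {
    $h = 5381;
    for ($i = 0; $i < strlen($key); $i++)
        $h = ($h * 33 + ord($key[$i])) & 0xffffffff; // stay within 32 bits
    return $h;
}

var_dump(djbx33a("Ez") == djbx33a("FY")); // bool(true): a collision
// Concatenations of colliding pairs collide too ("EzEz", "EzFY", "FYFY", ...),
// which is how hashdos mass-produces keys that all land in one bucket.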

in_array does a linear scan over every value in the array to find a match (which meant scanning through about 150 megabytes for me). By using isset on a hashtable instead, all PHP has to do is hash the key (in this case, like 50 bytes of data) to find the corresponding memory location, and then check whether anything is stored there. Performing some arithmetic over 50 bytes and reading a single memory location is certainly a lot faster than scanning megabytes of data!

Another possible optimization technique would be a bloom filter (though I'm not sure if I'd have to write that myself, which would be interpreted instead of compiled, and thus slower than this), but bloom filters give false positives. Unacceptable in this case.
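For completeness, here's a minimal bloom filter sketch in PHP, entirely my own illustration (so interpreted, untuned, and with crc32 merely being a convenient built-in to derive the bit positions from):
// Bit array packed into a string; $k positions derived by salting crc32
function bloom_positions($key, $size, $k) {
    $pos = array();
    for ($i = 0; $i < $k; $i++)
        $pos[] = (crc32($i . ":" . $key) & 0x7fffffff) % $size;
    return $pos;
}

function bloom_add(&$bits, $key, $size, $k) {
    foreach (bloom_positions($key, $size, $k) as $p)
        $bits[$p >> 3] = chr(ord($bits[$p >> 3]) | (1 << ($p & 7)));
}

function bloom_check($bits, $key, $size, $k) {
    foreach (bloom_positions($key, $size, $k) as $p)
        if ((ord($bits[$p >> 3]) & (1 << ($p & 7))) == 0)
            return false; // definitely never added
    return true;          // probably added, but could be a false positive!
}

$size = 1 << 20;                      // a million bits...
$bits = str_repeat("\0", $size >> 3); // ...packed into 128 KB
bloom_add($bits, "some line\n", $size, 3);
var_dump(bloom_check($bits, "some line\n", $size, 3));  // bool(true)
var_dump(bloom_check($bits, "other line\n", $size, 3)); // bool(false), usually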

Also, the code could simply put every string in the hashtable without saving it to $lines. You can easily rebuild $lines later via:
$lines = array();
foreach ($hadlines as $line => $useless)
    $lines[] = $line;
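For what it's worth, PHP's built-in array_keys does that rebuild in one call, which should behave the same (though again, untested here):
$lines = array_keys($hadlines);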


This was not possible in my case, so I did not test this.

So what is the downside? Why doesn't in_array automatically build an index? (Yes, the term 'index' here means much the same as an index in MySQL or any other DBMS.) The downside is extra memory usage. That can be reduced with my last example, where $lines is rebuilt afterwards (though I'm not sure what that does for performance), but it doesn't work in all cases anyway.
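If you want to know what the index costs for your own data, here is a rough way to measure it. This is just a sketch of mine using PHP's memory_get_usage; the exact numbers vary per PHP version:
$before = memory_get_usage();
$hadlines = array();
foreach ($lines as $line)        // $lines: the deduplicated array from earlier
    $hadlines[$line] = true;
printf("index overhead: about %d bytes\n", memory_get_usage() - $before);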