Follow-Up to Weighted Sorting in Python
The activity on my latest blog post has been tremendous! I never expected that much engagement within an hour or two of publishing the article.
The aim of this article is to provide an alternative to my weighted sort for those who are after increased performance. It might just be useful if you came here in search of a way to do weighted sorting in Python. I need to give a shout-out to Jeremy Brown for suggesting this solution. He's so awesome :P
While the example I posted in my previous article addressed my needs just fine, it is definitely not the fastest option. A better solution is to remove the special IDs from the object list entirely and simply yield them ahead of the rest:
```python
import itertools
import random

object_ids = [random.randint(0, 100) for i in range(20)]
special_ids = [random.choice(object_ids) for i in range(5)]

not_special_ids = (i for i in object_ids if i not in special_ids)

for i in itertools.chain(special_ids, not_special_ids):
    # do stuff with each ID
    pass
```
This solution is quite different from my weighted sort: there's no sorting going on at all, just a simple generator expression and itertools.chain to join the two collections together.
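To make the resulting order concrete, here's a small self-contained example (the literal ID lists are my own illustration, not from the benchmarks):

```python
import itertools

object_ids = [3, 7, 1, 9, 4, 7]
special_ids = [9, 3]

# Generator that lazily skips the special IDs.
not_special = (i for i in object_ids if i not in special_ids)

# Special IDs come out first, then everything else in its original order.
ordered = list(itertools.chain(special_ids, not_special))
print(ordered)  # → [9, 3, 7, 1, 4, 7]
```

One difference from the sorting version is worth knowing: chain yields the special IDs in the order of special_ids itself, whereas a stable sort keeps them in their original object_ids order.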
Here's a way you can benchmark the two to see which solution is faster:
```python
from copy import copy
import cProfile
import itertools
import random

object_ids = [random.randint(0, 100) for i in range(20)]
special_ids = [random.choice(object_ids) for i in range(5)]
ITERATIONS = 1000000

def sorting():
    for i in xrange(ITERATIONS):
        l = copy(object_ids)
        l.sort(key=lambda i: int(i in special_ids) * -2 + 1)
        for i in l:
            pass

def chaining():
    for i in xrange(ITERATIONS):
        l = (i for i in object_ids if i not in special_ids)
        for i in itertools.chain(special_ids, l):
            pass

cProfile.run('sorting()')
cProfile.run('chaining()')
```
Sample output on my box is:
```
$ python weighted_sort.py
         24000003 function calls in 18.411 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   18.411   18.411 <string>:1(<module>)
  1000000    0.580    0.000    0.580    0.000 copy.py:112(_copy_with_constructor)
  1000000    0.791    0.000    1.510    0.000 copy.py:65(copy)
        1    1.397    1.397   18.411   18.411 weighted_sort.py:11(sorting)
 20000000    8.907    0.000    8.907    0.000 weighted_sort.py:14(<lambda>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000000    0.139    0.000    0.139    0.000 {method 'get' of 'dict' objects}
  1000000    6.597    0.000   15.503    0.000 {method 'sort' of 'list' objects}


         16000003 function calls in 7.381 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.381    7.381 <string>:1(<module>)
        1    2.744    2.744    7.381    7.381 weighted_sort.py:18(chaining)
 16000000    4.636    0.000    4.636    0.000 weighted_sort.py:20(<genexpr>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
So, you can see that the chaining solution is easily twice as fast as the sorting solution over 1 million iterations. Both of these solutions work perfectly well for my purposes, and I will probably end up switching to the chaining solution sometime in the future.
EDIT: After reading lqc's comment on my previous article, I've decided to update this one with more appropriate benchmarks. With the key function lqc suggested, the two solutions are much closer in speed.
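For reference, the sorting function in the updated script presumably looks something like the sketch below, built around the sort(key=special_ids.__contains__) call mentioned at the end of this post. The reverse=True flag is my assumption: True sorts after False, so without it the special IDs would land last.

```python
from copy import copy
import random

object_ids = [random.randint(0, 100) for i in range(20)]
special_ids = [random.choice(object_ids) for i in range(5)]

def sorting():
    l = copy(object_ids)
    # list.__contains__ runs in C, so sort() never calls back into a
    # Python-level lambda; that's why the <lambda> line disappears from
    # the updated profile.
    l.sort(key=special_ids.__contains__, reverse=True)
    return l

result = sorting()
```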
Here's the output from my updated test script:
```
$ python weighted_sort.py
         4000003 function calls in 8.437 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    8.437    8.437 <string>:1(<module>)
  1000000    0.558    0.000    0.558    0.000 copy.py:112(_copy_with_constructor)
  1000000    0.741    0.000    1.431    0.000 copy.py:65(copy)
        1    1.319    1.319    8.437    8.437 weighted_sort.py:11(sorting)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000000    0.133    0.000    0.133    0.000 {method 'get' of 'dict' objects}
  1000000    5.688    0.000    5.688    0.000 {method 'sort' of 'list' objects}


         17000003 function calls in 7.545 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.545    7.545 <string>:1(<module>)
        1    2.818    2.818    7.545    7.545 weighted_sort.py:18(chaining)
 17000000    4.726    0.000    4.726    0.000 weighted_sort.py:20(<genexpr>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
So if chaining only gains about a second over 1 million iterations, I think I prefer the sort(key=special_ids.__contains__) solution! I hope these two articles will help you get started on your adventures with handling special objects before the rest!