Follow-Up to Weighted Sorting in Python
The activity on my latest blog post has been tremendous! I never expected that much engagement within an hour or two of publishing the article.
The aim of this article is to provide an alternative to my weighted sort for those who are after increased performance. It might just be useful if you came here in search of a way to do weighted sorting in Python. I need to give a shout-out to Jeremy Brown for suggesting this solution. He's so awesome :P
While the example I posted in my previous article addressed my needs just fine, it is definitely not the fastest option. A better solution is to remove the special IDs from the object list entirely and simply yield them ahead of the rest:
```python
import itertools
import random

object_ids = [random.randint(0, 100) for i in range(20)]
special_ids = [random.choice(object_ids) for i in range(5)]

not_special_ids = (i for i in object_ids if i not in special_ids)

for i in itertools.chain(special_ids, not_special_ids):
    # do stuff with each ID
    pass
```
This solution is quite different from my weighted sort: there's no sorting going on at all, just a simple generator expression and itertools.chain to join the two collections together.
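To make the resulting order concrete, here's a small self-contained example (the literal ID lists are my own illustration, not from the benchmarks):

```python
import itertools

object_ids = [3, 7, 1, 9, 4, 7]
special_ids = [9, 3]

# Generator that lazily skips the special IDs.
not_special = (i for i in object_ids if i not in special_ids)

# Special IDs come out first, then everything else in its original order.
ordered = list(itertools.chain(special_ids, not_special))
print(ordered)  # → [9, 3, 7, 1, 4, 7]
```

One difference from the sorting version is worth knowing: chain yields the special IDs in the order of special_ids itself, whereas a stable sort keeps them in their original object_ids order.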
Here's a way you can benchmark the two to see which solution is faster:
```python
from copy import copy
import cProfile
import itertools
import random

object_ids = [random.randint(0, 100) for i in range(20)]
special_ids = [random.choice(object_ids) for i in range(5)]
ITERATIONS = 1000000

def sorting():
    for i in xrange(ITERATIONS):
        l = copy(object_ids)
        l.sort(key=lambda i: int(i in special_ids) * -2 + 1)
        for i in l:
            pass

def chaining():
    for i in xrange(ITERATIONS):
        l = (i for i in object_ids if i not in special_ids)
        for i in itertools.chain(special_ids, l):
            pass

cProfile.run('sorting()')
cProfile.run('chaining()')
```
Sample output on my box is:
```
$ python weighted_sort.py
         24000003 function calls in 18.411 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   18.411   18.411 <string>:1(<module>)
  1000000    0.580    0.000    0.580    0.000 copy.py:112(_copy_with_constructor)
  1000000    0.791    0.000    1.510    0.000 copy.py:65(copy)
        1    1.397    1.397   18.411   18.411 weighted_sort.py:11(sorting)
 20000000    8.907    0.000    8.907    0.000 weighted_sort.py:14(<lambda>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000000    0.139    0.000    0.139    0.000 {method 'get' of 'dict' objects}
  1000000    6.597    0.000   15.503    0.000 {method 'sort' of 'list' objects}


         16000003 function calls in 7.381 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.381    7.381 <string>:1(<module>)
        1    2.744    2.744    7.381    7.381 weighted_sort.py:18(chaining)
 16000000    4.636    0.000    4.636    0.000 weighted_sort.py:20(<genexpr>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
So, you can see that the chaining solution is easily twice as fast as the sorting solution over 1 million iterations. Both of these solutions work perfectly well for my purposes, and I will probably end up switching to the chaining solution sometime in the future.
EDIT: After reading lqc's comment on my previous article, I've decided to update this one with more appropriate benchmarks. With the key function lqc suggested, the two solutions are much closer in speed.
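For reference, the sorting function in the updated script presumably looks something like the sketch below, built around the sort(key=special_ids.__contains__) call mentioned at the end of this post. The reverse=True flag is my assumption: True sorts after False, so without it the special IDs would land last.

```python
from copy import copy
import random

object_ids = [random.randint(0, 100) for i in range(20)]
special_ids = [random.choice(object_ids) for i in range(5)]

def sorting():
    l = copy(object_ids)
    # list.__contains__ runs in C, so sort() never calls back into a
    # Python-level lambda; that's why the <lambda> line disappears from
    # the updated profile.
    l.sort(key=special_ids.__contains__, reverse=True)
    return l

result = sorting()
```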
Here's the output from my updated test script:
```
$ python weighted_sort.py
         4000003 function calls in 8.437 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    8.437    8.437 <string>:1(<module>)
  1000000    0.558    0.000    0.558    0.000 copy.py:112(_copy_with_constructor)
  1000000    0.741    0.000    1.431    0.000 copy.py:65(copy)
        1    1.319    1.319    8.437    8.437 weighted_sort.py:11(sorting)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000000    0.133    0.000    0.133    0.000 {method 'get' of 'dict' objects}
  1000000    5.688    0.000    5.688    0.000 {method 'sort' of 'list' objects}


         17000003 function calls in 7.545 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.545    7.545 <string>:1(<module>)
        1    2.818    2.818    7.545    7.545 weighted_sort.py:18(chaining)
 17000000    4.726    0.000    4.726    0.000 weighted_sort.py:20(<genexpr>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
So if chaining only gains about a second over 1 million iterations, I think I prefer the sort(key=special_ids.__contains__) solution! I hope these two articles will help you get started on your adventures with handling special objects before the rest!