Pieces of Py #4 Using itertools.groupby

Posted on Sun 01 September 2019 in Python • 3 min read

During some of the exercises I did in my first round of #100DaysOfCode, I needed to group an iterable by a certain field in the data structure. I found out that there is a function called groupby in the standard-library itertools module that can be used for this.

In this article I present a simple example of how it works. Let's say I have a list of namedtuples representing employees and their roles, and I want to group this data by role.

from collections import namedtuple  


Employee = namedtuple('Employee', 'first_name last_name role')

employees = [Employee(first_name='Renae', last_name='Notley', role='Developer'),
 Employee(first_name='Prent', last_name='Geffcock', role='IT Pro'),
 Employee(first_name='Francyne', last_name='Maudsley', role='Support'),
 Employee(first_name='Ashil', last_name='Mudd', role='Sales Rep'),
 Employee(first_name='Jamie', last_name='Berendsen', role='Sales Rep'),
 Employee(first_name='Carolan', last_name='Grenshiels', role='Support'),
 Employee(first_name='Marthe', last_name='Uttley', role='Developer'),
 Employee(first_name='Camille', last_name='Gierardi', role='Developer'),
 Employee(first_name='Carny', last_name='Borrow', role='Developer'),
 Employee(first_name='Arabele', last_name='Twallin', role='Developer'),
 Employee(first_name='Lydie', last_name="O'Bradain", role='Sales Rep'),
 Employee(first_name='Alida', last_name='Knotton', role='Developer'),
 Employee(first_name='Shaughn', last_name='Brownlee', role='Sales Rep'),
 Employee(first_name='Janetta', last_name='Loudwell', role='IT Pro'),
 Employee(first_name='Gawain', last_name='Bertlin', role='IT Pro'),
 Employee(first_name='Gregorio', last_name='Jiroutka', role='IT Pro'),
 Employee(first_name='Ddene', last_name='Orsay', role='IT Pro'),
 Employee(first_name='Sophronia', last_name='Blencowe', role='IT Pro'),
 Employee(first_name='Sunny', last_name='Harrisson', role='Sales Rep'),
 Employee(first_name='Krissie', last_name='Scates', role='IT Pro')]

To use the groupby function I first need to sort the data on the field I want to group by. Otherwise the employees with one role can end up split across several groups, because groupby only groups consecutive items that share the same key.

A quote from the Python documentation regarding itertools.groupby:

Generally, the iterable needs to already be sorted on the same key function. The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function).
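To see why the sorting matters, here is a minimal sketch using a small hypothetical list of role names (made up for illustration, not the employee data above). On unsorted input, groupby starts a new group every time the value changes, so the same key shows up more than once. On sorted input there is exactly one group per key.

```python
from itertools import groupby

# Hypothetical sample data, for illustration only
roles = ['Developer', 'Support', 'Developer', 'Support', 'Support']

# Unsorted: a new group starts every time the value changes
unsorted_groups = [(key, list(group)) for key, group in groupby(roles)]
print(unsorted_groups)
# [('Developer', ['Developer']), ('Support', ['Support']),
#  ('Developer', ['Developer']), ('Support', ['Support', 'Support'])]

# Sorted first: exactly one group per distinct value
sorted_groups = [(key, list(group)) for key, group in groupby(sorted(roles))]
print(sorted_groups)
# [('Developer', ['Developer', 'Developer']),
#  ('Support', ['Support', 'Support', 'Support'])]
```

Building a dictionary from the unsorted groups would silently keep only the last group for each key, since later entries overwrite earlier ones.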

from itertools import groupby

# key function used for both sorting and grouping: the employee's role
def keyfunc(employee):
    return employee.role

# sort the list in place before grouping
employees.sort(key=keyfunc)

# use itertools.groupby to group on role
# groupby yields (key, group) pairs, where group is an iterator,
# which I use to build a dictionary with role as the key
# and the employees with that role as a list
employees_by_role = {
    role: list(group)
    for role, group in groupby(employees, keyfunc)
}

print(employees_by_role)
{'Developer': [Employee(first_name='Renae', last_name='Notley', role='Developer'), Employee(first_name='Marthe', last_name='Uttley', role='Developer'), Employee(first_name='Camille', last_name='Gierardi', role='Developer'), Employee(first_name='Carny', last_name='Borrow', role='Developer'), Employee(first_name='Arabele', last_name='Twallin', role='Developer'), Employee(first_name='Alida', last_name='Knotton', role='Developer')], 'IT Pro': [Employee(first_name='Prent', last_name='Geffcock', role='IT Pro'), Employee(first_name='Janetta', last_name='Loudwell', role='IT Pro'), Employee(first_name='Gawain', last_name='Bertlin', role='IT Pro'), Employee(first_name='Gregorio', last_name='Jiroutka', role='IT Pro'), Employee(first_name='Ddene', last_name='Orsay', role='IT Pro'), Employee(first_name='Sophronia', last_name='Blencowe', role='IT Pro'), Employee(first_name='Krissie', last_name='Scates', role='IT Pro')], 'Sales Rep': [Employee(first_name='Ashil', last_name='Mudd', role='Sales Rep'), Employee(first_name='Jamie', last_name='Berendsen', role='Sales Rep'), Employee(first_name='Lydie', last_name="O'Bradain", role='Sales Rep'), Employee(first_name='Shaughn', last_name='Brownlee', role='Sales Rep'), Employee(first_name='Sunny', last_name='Harrisson', role='Sales Rep')], 'Support': [Employee(first_name='Francyne', last_name='Maudsley', role='Support'), Employee(first_name='Carolan', last_name='Grenshiels', role='Support')]}

Or, to print it in a more readable way, I can use pprint like this.

from pprint import pprint as pp

pp(employees_by_role)
{'Developer': [Employee(first_name='Renae', last_name='Notley', role='Developer'),
               Employee(first_name='Marthe', last_name='Uttley', role='Developer'),
               Employee(first_name='Camille', last_name='Gierardi', role='Developer'),
               Employee(first_name='Carny', last_name='Borrow', role='Developer'),
               Employee(first_name='Arabele', last_name='Twallin', role='Developer'),
               Employee(first_name='Alida', last_name='Knotton', role='Developer')],
 'IT Pro': [Employee(first_name='Prent', last_name='Geffcock', role='IT Pro'),
            Employee(first_name='Janetta', last_name='Loudwell', role='IT Pro'),
            Employee(first_name='Gawain', last_name='Bertlin', role='IT Pro'),
            Employee(first_name='Gregorio', last_name='Jiroutka', role='IT Pro'),
            Employee(first_name='Ddene', last_name='Orsay', role='IT Pro'),
            Employee(first_name='Sophronia', last_name='Blencowe', role='IT Pro'),
            Employee(first_name='Krissie', last_name='Scates', role='IT Pro')],
 'Sales Rep': [Employee(first_name='Ashil', last_name='Mudd', role='Sales Rep'),
               Employee(first_name='Jamie', last_name='Berendsen', role='Sales Rep'),
               Employee(first_name='Lydie', last_name="O'Bradain", role='Sales Rep'),
               Employee(first_name='Shaughn', last_name='Brownlee', role='Sales Rep'),
               Employee(first_name='Sunny', last_name='Harrisson', role='Sales Rep')],
 'Support': [Employee(first_name='Francyne', last_name='Maudsley', role='Support'),
             Employee(first_name='Carolan', last_name='Grenshiels', role='Support')]}
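One more detail worth knowing: the groups that groupby yields are lazy iterators that share the underlying data with the groupby object, which is why each group is wrapped in list() when building the dictionary. Below is a minimal sketch, using hypothetical (name, role) tuples and a made-up helper called by_role instead of the full employee list. It counts the members per role and shows that a group comes back empty if I only read it after groupby has moved on.

```python
from itertools import groupby

# Hypothetical sample data, already sorted by role
people = [('Renae', 'Developer'), ('Marthe', 'Developer'), ('Prent', 'IT Pro')]


def by_role(person):
    # the role is the second element of each tuple
    return person[1]


# Consume each group while it is the current one
counts = {role: sum(1 for _ in group) for role, group in groupby(people, by_role)}
print(counts)  # {'Developer': 2, 'IT Pro': 1}

# Collecting the group iterators first and reading them later does not work:
# advancing groupby makes the previous group's items unavailable
stale_groups = [group for _, group in groupby(people, by_role)]
stale_contents = [list(group) for group in stale_groups]
print(stale_contents[0])  # [] (the 'Developer' group is already gone)
```

So if the items are needed later, it seems safest to store each group as a list (or another container) inside the loop or comprehension that iterates over groupby.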

Conclusion

In this article I have given a simple example of how to use itertools.groupby in Python. I also mentioned that the iterable to be grouped needs to be sorted on the grouping key before doing the groupby operation.

Resources

Let me know on Twitter if I can improve this article, or if you have other resources to help out with understanding this topic.