Python Data Structures

If you're using Python, you'll eventually have to handle data. This data may be your own data, your company's sales data, or maybe data that a recent experiment yielded. Raw data is rarely in a usable form, so you must impose some organizational structure before you can handle it. If you're on a team, the data structures you choose may make or break how well your teammates understand your code and your thinking process.

In this tutorial, we'll cover dictionaries, tuples and sets, along with some exercises to learn their syntax and usage.

Dictionaries: the mapping structure

Outside of programming, a typical dictionary is just a collection of words that you can look up to find their respective definitions.

Python dictionaries are similar, being a collection of key-value pairs. Keys are what you'd imagine to be the words of a regular dictionary, and values correspond to the definitions of these words. Dictionaries map, or associate, keys to values as Webster's Dictionary maps words to their definitions. You look up a key and get its value.

Dictionary keys must be immutable, which means they can't be changed after they are created (strictly speaking, Python requires keys to be hashable). Keys can be strings, integers, floats or even tuples of immutable items, which we'll cover later in the article. Mutable, or changeable, items such as lists cannot be used as keys. Flexibility in key-naming allows you to better document the contents of your dictionary for whoever may be reading your code.
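As a small illustration of the key rule, here is a sketch in which a tuple works as a key while a list is rejected. The city_by_coordinates dictionary and its rounded coordinates are made up for this example.

```python
# Tuples of immutable items are hashable, so they can serve as keys
city_by_coordinates = {(40.7, -74.0): "New York"}
city_by_coordinates[(51.5, -0.1)] = "London"

# A list is mutable, so Python refuses to use it as a key
try:
    city_by_coordinates[[48.9, 2.4]] = "Paris"
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(list_key_allowed)  # False
```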

Dictionary values are also extremely flexible. You can assign anything to a dictionary key, even other dictionaries!
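For instance, a dictionary whose values are themselves dictionaries could look like the sketch below. The movie_details name and its inner keys are hypothetical and only serve to show the nesting.

```python
# Hypothetical example: each value is itself a dictionary
movie_details = {
    "Moana": {"year": 2016, "genres": ["animated", "children"]},
    "Titanic": {"year": 1997, "genres": ["action", "romance"]},
}

# Chain lookups to reach into the inner dictionary
print(movie_details["Moana"]["year"])        # 2016
print(movie_details["Titanic"]["genres"][1])  # romance
```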

There are multiple ways to create a dictionary in Python.

  1. You can create a dictionary by surrounding key-value pairs with curly brackets.
  2. Or you can use the dict() constructor. Key-value pairs can be passed as a list of key-value tuples or as keyword arguments.
# Method 1:
>>> movie_genres = {
        "Titanic" : ["action", "romance"],
        "Moana" : "animated",
        "40 Year Old Virgin" : "comedy"
    }

# Method 2: 
>>> albums = dict([("Kendrick Lamar", "Damn"), ("Bruno Mars", "24K Magic"])
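The snippet above shows the list-of-pairs form of dict(). The keyword-argument form might look like the following sketch, where the ratings dictionary and its values are invented for illustration.

```python
# Keyword arguments become string keys, so each key must be a valid Python identifier
ratings = dict(moana=5, titanic=4)

print(ratings)           # {'moana': 5, 'titanic': 4}
print(ratings["moana"])  # 5
```

Because of the identifier restriction, a key such as "40 Year Old Virgin" cannot be passed this way and needs the curly-bracket or list-of-pairs form.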

Once your dictionary has been created, you can look up the keys in your dictionary to return the mapped value.

>>> movie_genres["Moana"]
"animated"

If you'd like to add key-value pairs to your dictionary, the process is simple. You can assign a key to a value within your dictionary, and it will update with the new key that you've added.

>>> movie_genres["Black Panther"] = "action"

Let's say that you want to combine dictionaries. Python provides a simple way to do this through the update() method. Please note that when both dictionaries contain the same key, the value from the second dictionary overwrites the value in the first.

# More movie genres that we'd like to add to our dictionary
more_movie_genres = {
    "The Post" : "historical",
    "Moana" : ["animated", "children"]
}

movie_genres.update(more_movie_genres)

>>> movie_genres["The Post"]
"historical"

>>> movie_genres["Moana"]
["animated", "children"]

Another characteristic of dictionaries is that their key-value pairs are not sorted. As of Python 3.7, a dictionary remembers the order in which its keys were inserted, but it does not arrange them alphabetically or numerically, and unlike lists, dictionaries have no sort() method. During your own analyses, you may find that you'd like to sort your dictionary to investigate its contents.

You can build a sorted copy of a dictionary using the built-in sorted() function and the dictionary's items() method, along with a special inline function called a lambda function.

# sorted() orders the (key, value) pairs by their first element, the dictionary key
sorted_movie_genre_list = sorted(movie_genres.items(), key=lambda x: x[0])

>>> sorted_movie_genre_list
[("40 Year Old Virgin", "comedy"), ("Black Panther", "action"), ("Moana", ["animated", "children"]), ("The Post", "historical"), ("Titanic", ["action", "romance"])]

>>> sorted_movie_genres = dict(sorted_movie_genre_list)

There's a lot going on here, so we'll dissect this code bit by bit.

  • movie_genres.items() returns a view of our original dictionary's key-value pairs, each as a tuple (more on this later).
  • The sorted() function can be told how to sort these pairs, and this is denoted by the key keyword argument.
  • The lambda function tells sorted() to sort by the first item of each tuple from movie_genres.items(). This creates a list of tuples that is sorted by the dictionary keys.
  • Finally, we'll convert this list of tuples back into a dictionary.
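The steps above can be tried on a smaller dictionary. This sketch uses a made-up genre_counts dictionary and also shows that changing the lambda from x[0] to x[1] sorts by value instead of by key.

```python
genre_counts = {"romance": 2, "action": 5, "comedy": 3}  # made-up data

# x[0] is the key of each (key, value) tuple, so this sorts alphabetically by key
by_key = sorted(genre_counts.items(), key=lambda x: x[0])
print(by_key)    # [('action', 5), ('comedy', 3), ('romance', 2)]

# x[1] is the value, so the same pattern sorts by count instead
by_value = sorted(genre_counts.items(), key=lambda x: x[1])
print(by_value)  # [('romance', 2), ('comedy', 3), ('action', 5)]

# dict() keeps the order of the pairs it receives (guaranteed since Python 3.7)
sorted_genre_counts = dict(by_key)
print(list(sorted_genre_counts))  # ['action', 'comedy', 'romance']
```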

Dictionaries are an invaluable tool in any Python programmer's arsenal for their flexibility and reading ease.

Tuples: the immutable structure

Recall that lists are ordered, mutable sequences of data elements. Tuples are similar to lists with one crucial distinction: they are immutable. You may add or remove elements from a list after making one, but once you create a tuple, it may not be changed. You create a tuple by writing comma-separated items, with or without surrounding parentheses.

# Method 1:
student_info = "Bob", 19, ["Psychology", "Chemistry", "Calculus"]

# Method 2:
other_student_info = (
    "Anny",
    22,
    ["Film Theory", "Advanced Cinematography", "Internship"]
)

# Note that trying to change the contents of a tuple results in an error
>>> student_info[0] = "Robert"
TypeError: 'tuple' object does not support item assignment

There is great symmetry between the syntax of lists and tuples. Tuples are indexed and sliced similarly to lists.
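A short sketch of that symmetry, reusing the student_info tuple from earlier: indexing, negative indexing and slicing all behave as they do for lists, and slicing a tuple returns another tuple.

```python
student_info = ("Bob", 19, ["Psychology", "Chemistry", "Calculus"])

print(student_info[0])   # Bob
print(student_info[-1])  # ['Psychology', 'Chemistry', 'Calculus']
print(student_info[:2])  # ('Bob', 19)

# Unpacking assigns each element to its own name
name, age, courses = student_info
print(age)               # 19
```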

There is a special distinction to keep in mind if you create a tuple with only one element. When you create this tuple you must include a trailing comma after the sole value.

single_item_tuple = ("The One",)
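To see why the trailing comma matters, compare the two assignments in this sketch. The not_a_tuple name is only for illustration.

```python
not_a_tuple = ("The One")         # parentheses alone just group the expression
single_item_tuple = ("The One",)  # the trailing comma makes it a tuple

print(type(not_a_tuple).__name__)        # str
print(type(single_item_tuple).__name__)  # tuple
print(len(single_item_tuple))            # 1
```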

Because tuples are immutable, they are useful in situations where you may want to enforce a rigid structure on your data or where you do not expect the data to change. Tuples are also useful in situations where you'd like to hold heterogeneous values within the same structure. That is to say, you'd like to store different data types in your structure. You can certainly put heterogeneous values into a list, but many list operations will only work if its values are homogeneous.
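Sorting is one example of a list operation that expects comparable values. In this sketch (the ages and mixed lists are invented), a list of integers sorts fine, while a list mixing a string and an integer raises a TypeError in Python 3.

```python
ages = [22, 19, 25]
print(sorted(ages))  # [19, 22, 25]

mixed = ["Bob", 19]
try:
    sorted(mixed)  # Python 3 cannot compare a str with an int
    mixed_sort_worked = True
except TypeError:
    mixed_sort_worked = False

print(mixed_sort_worked)  # False
```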

Sets: the distinct values structure

The name "set" comes from a branch of mathematics called set theory. Mathematical sets are collections of distinct objects, and Python sets are no different.

Both lists and sets are collections of objects, but duplicate objects are not allowed in sets. Sets can be created by surrounding a sequence of comma-separated values with curly brackets, or by using the set() function with a list.

# Method 1:
last_years_members = { 
    "Andrea",
    "Billy",
    "Christopher",
    "Denny"
}

# Method 2:
current_years_members = set(["Andrea", "Bonnie", "Christopher"])
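To see the no-duplicates rule in action, here is a sketch with a made-up guest_list that contains repeated names. It also shows that an empty set must be created with set(), since empty curly brackets create a dictionary.

```python
guest_list = ["Andrea", "Billy", "Andrea", "Denny", "Billy"]
unique_guests = set(guest_list)

print(len(guest_list))           # 5
print(len(unique_guests))        # 3
print("Billy" in unique_guests)  # True

empty_set = set()  # {} on its own creates an empty dictionary, not a set
print(type({}).__name__)         # dict
```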

Sets are useful for comparing membership of different types of groups. Python supports different set operations, such as detecting whether or not two sets have objects in common or seeing what objects are in one set but absent in another.

# What members were around for both years?
>>> last_years_members & current_years_members
{"Andrea", "Christopher"}

# Who are all the members from both years?
>>> last_years_members | current_years_members
{"Andrea", "Billy", "Christopher", "Denny", "Bonnie"}

The & and | characters you see are called operators. For sets, & computes the intersection, which works like a logical AND: it keeps only the members who belong to every set we ask about. The | (called a "pipe") computes the union, which works like a logical OR: it gathers every member who appears in at least one of the sets.

Notice that the first result above includes people who were members both last year and this year, while the second includes everyone who was a member in either year. There are other operators for more sophisticated set operations, so you can explore Python's official documentation on sets as you get more comfortable with their usage.
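Two of those other operators answer the earlier question of who is in one set but absent from another. The sketch below reuses the membership sets; the names departed, newcomers and changed are chosen for this example. Results are passed through sorted() only so the printed order is predictable.

```python
last_years_members = {"Andrea", "Billy", "Christopher", "Denny"}
current_years_members = {"Andrea", "Bonnie", "Christopher"}

# Difference (-): in the first set but not in the second
departed = last_years_members - current_years_members
newcomers = current_years_members - last_years_members

# Symmetric difference (^): in exactly one of the two sets
changed = last_years_members ^ current_years_members

print(sorted(departed))   # ['Billy', 'Denny']
print(sorted(newcomers))  # ['Bonnie']
print(sorted(changed))    # ['Billy', 'Bonnie', 'Denny']
```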

The del statement

Much of our talk so far has dealt with the creation of data structures. But what if we'd like to remove entries from our structures?

Enter the del statement.

The del statement allows you to delete variables and data you no longer need in your structures. The del statement actually removes the variable or data from the namespace, so that it can no longer be referenced. Contrast this to merely reassigning a dictionary key to None.

movie_genres["The Room"] = "worst movie ever"

>>> movie_genres["The Room"]
"worst movie ever"

# Maybe later we'll decide that we don't want The Room in our dictionary
del movie_genres["The Room"]

# Now if we try to look for it, it gives us an error
movie_genres["The Room"]
KeyError: 'The Room'
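The contrast with assigning None can be checked with the in operator. In this sketch, which rebuilds a small movie_genres dictionary so it runs on its own, the key survives a None assignment but disappears after del.

```python
movie_genres = {"Moana": "animated", "The Room": "worst movie ever"}

# Assigning None keeps the key in the dictionary
movie_genres["The Room"] = None
print("The Room" in movie_genres)  # True

# del removes the key and its value entirely
del movie_genres["The Room"]
print("The Room" in movie_genres)  # False
print(movie_genres)                # {'Moana': 'animated'}
```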

We now have learned how to create dictionaries, tuples, and sets. We've learned some of the characteristics of each data structure and how to create them in our code. Finally, we've learned how to delete items from our structures. With all this, let's work with a real data set and put our newfound knowledge to use!

Project: chocolate bar ratings

Imagine you are an intern at a small chocolate company. Your boss has received some ratings data on chocolate bars from around the world. Seeing your eager face, he gives the data to you and asks you to give him a summary of the data at the end of the day. We'll ask a few questions of this data set.

  • What countries are represented in this data set?
  • Where does the best chocolate come from?

For this project, we'll be working with Kaggle's Chocolate Bar Ratings data set. You may download the data set here. The data set should be named flavors_of_cacao.csv.

The data set is in CSV format, so we'll have to get it into a more usable form with the csv library. We'll read our data in as a list of lists.


CSVs, or comma-separated values files, are structured exactly as their name suggests: each line in the file is what would be a row in Microsoft Excel or Google Sheets. Commas separate the column values. The with statement performs the task of opening our file under the f variable in a safe way, since it will automatically close the file after we are done. We give the open() function the string containing the path to tell Python that we want to open the file at this exact location.

We use the reader function from the csv library (not a CSV file) to iterate over the f variable. The reader separates column values using the delimiter keyword argument, which indicates how it should split up each line. This whole statement is wrapped in list(), which creates a list out of all the row lists produced by csv.reader. That's a lot done for a mere two lines of code!
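The reading step described above might look like the following sketch. It is a reconstruction under assumptions, not the tutorial's original code: to stay runnable without the download, it reads from an in-memory stand-in (sample_file) with simplified column names, and only the A. Morin row is taken from the sample output in this tutorial.

```python
import csv
import io

# In-memory stand-in for flavors_of_cacao.csv so the sketch runs without the file
sample_file = io.StringIO(
    "Company,Bar Name,REF,Review Date,Cocoa Percent,"
    "Company Location,Rating,Bean Type,Broad Bean Origin\n"
    "A. Morin,Agua Grande,1876,2016,63%,France,3.75,,Sao Tome\n"
)

# With the real data you would write: with open("flavors_of_cacao.csv") as f:
with sample_file as f:
    chocolates = list(csv.reader(f, delimiter=","))

print(len(chocolates))   # 2
print(chocolates[1][5])  # France
```

Every value comes back as a string, including the rating, so numeric columns need converting before any arithmetic.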

Our chocolates variable is not immediately useful, and it certainly won't be useful to your boss. A list of lists is cumbersome to read and difficult to interpret. Let's look at the column names and first data row to get an idea of how the CSV file was organized.

chocolates[:2]
[['CompanyÂ\xa0\n(Maker-if known)',
  'Specific Bean Origin\nor Bar Name',
  'REF',
  'Review\nDate',
  'Cocoa\nPercent',
  'Company\nLocation',
  'Rating',
  'Bean\nType',
  'Broad Bean\nOrigin'],
 ['A. Morin',
  'Agua Grande',
  '1876',
  '2016',
  '63%',
  'France',
  '3.75',
  'Â\xa0',
  'Sao Tome']]

It seems that "Company Location" will tell us where these chocolates are coming from. "Company Location" is the 6th item, so we'll grab this item from each list. Then, we'll convert this list into a set to get an idea of how many countries are represented.
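One way to carry out this step is sketched below. It assumes chocolates is the list of lists read earlier; here a tiny stand-in is used in which only the A. Morin row comes from the data shown above and the other rows (Maker B, Maker C) are made up.

```python
# Stand-in for the list of lists read from the CSV file
chocolates = [
    ["Company", "Bar Name", "REF", "Review Date", "Cocoa Percent",
     "Company Location", "Rating", "Bean Type", "Broad Bean Origin"],
    ["A. Morin", "Agua Grande", "1876", "2016", "63%", "France", "3.75", "", "Sao Tome"],
    ["Maker B", "Bar B", "1", "2016", "70%", "Chile", "3.75", "", "Peru"],    # made up
    ["Maker C", "Bar C", "2", "2016", "70%", "France", "2.75", "", "Peru"],   # made up
]

# Skip the header row, then take the 6th item (index 5) of every row
countries = [row[5] for row in chocolates[1:]]

# Converting to a set drops the repeated country names
unique_countries = set(countries)
print(sorted(unique_countries))  # ['Chile', 'France']
```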


Now we can figure out how many countries are represented just by looking at the length of the set.

print(len(unique_countries))
60

That's a lot of countries to look at! It would take too much time to calculate everything by hand. However, we can use our newly learned data structures to keep everything organized by country and do our calculations systematically.

We'll use dictionaries to handle our organization.


chocolate_catalog is a two-tiered dictionary. The first "level" of keys are the countries, and their corresponding values are another tier of dictionaries. These dictionaries will hold the information for each country, so that we can reference chocolate_catalog for information in a readable, logical format.
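A two-tiered dictionary of this shape could be built as in the sketch below. The "Scores" key matches the output shown later in this tutorial; the two-country unique_countries set is a stand-in for the full set.

```python
unique_countries = {"France", "Chile"}  # stand-in for the full set of countries

chocolate_catalog = {}
for country in unique_countries:
    # Each country gets its own inner dictionary with a fresh, empty list
    chocolate_catalog[country] = {"Scores": []}

print(chocolate_catalog["France"])  # {'Scores': []}
```

Creating the inner dictionary inside the loop matters: every country needs its own list, otherwise all countries would share and append to the same one.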

Each chocolate bar in the data set has a rating, aptly contained in the "Rating" column. We'll want to distribute each score to the correct country.
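Distributing the scores might look like this sketch. It assumes the country sits at index 5 and the rating at index 6 of each row, as in the header shown earlier, and it uses shortened, made-up rows (marked with "..." placeholders for the unused columns) so it can run on its own.

```python
# Stand-in rows; only the country (index 5) and rating (index 6) are used here
chocolates = [
    ["Company", "Bar Name", "REF", "Review Date", "Cocoa Percent",
     "Company Location", "Rating", "Bean Type", "Broad Bean Origin"],
    ["...", "...", "...", "...", "...", "France", "3.75", "...", "..."],
    ["...", "...", "...", "...", "...", "Chile", "3.75", "...", "..."],
    ["...", "...", "...", "...", "...", "France", "2.75", "...", "..."],
]
chocolate_catalog = {"France": {"Scores": []}, "Chile": {"Scores": []}}

for row in chocolates[1:]:       # skip the header row
    country = row[5]
    rating = float(row[6])       # csv gives us strings, so convert to a number
    chocolate_catalog[country]["Scores"].append(rating)

print(chocolate_catalog["France"]["Scores"])  # [3.75, 2.75]
```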


Perfect! We started with a list of lists and have now associated all the ratings with their particular country. We are now prepared to figure out which country has the best chocolate! We'll add a new key-value pair into each country's dictionary to store information on the average value of the country's chocolate ratings.
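A sketch of that calculation follows. The "Average Rating" key and the (count, average) tuple layout match the output printed below; the scores in this small catalog are made up.

```python
chocolate_catalog = {
    "France": {"Scores": [3.75, 2.75]},  # made-up scores
    "Chile": {"Scores": [3.75, 3.75]},
}

for country, info in chocolate_catalog.items():
    scores = info["Scores"]
    # Store how many ratings there are alongside their average, as a tuple
    info["Average Rating"] = (len(scores), sum(scores) / len(scores))

print(chocolate_catalog["France"]["Average Rating"])  # (2, 3.25)
print(chocolate_catalog["Chile"]["Average Rating"])   # (2, 3.75)
```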


In the above code, we leveraged the structure of chocolate_catalog to systematically calculate the average score for each country. After storing all of the country's ratings into their respective dictionaries, we used iteration over the keys to ensure that the same calculation was performed for each country.

Let's look at a few countries!

print(chocolate_catalog["France"])
{'Scores': [3.75, 2.75, 3.0, 3.5, 3.5, 2.75, 3.5, 3.5, 3.75, 4.0, 2.75, 3.0, 3.25, 3.75, 2.75, 3.0, 3.25, 4.0, 3.25, 3.5, 4.0, 3.5, 3.75, 2.75, 2.75, 3.0, 2.5, 2.5, 3.0, 2.5, 2.75, 3.0, 2.75, 3.5, 4.0, 3.25, 3.75, 3.5, 3.25, 3.5, 3.5, 3.5, 3.75, 4.0, 4.0, 3.5, 3.25, 3.0, 2.75, 3.0, 3.0, 4.0, 2.5, 3.75, 4.0, 4.0, 4.0, 1.5, 3.0, 4.0, 3.5, 3.0, 2.0, 2.0, 3.0, 2.75, 3.5, 3.0, 3.75, 2.75, 3.25, 3.0, 3.25, 2.75, 2.75, 2.75, 3.25, 3.25, 3.5, 3.5, 3.25, 3.5, 3.0, 3.75, 3.5, 2.75, 2.75, 3.0, 3.25, 3.0, 3.5, 3.25, 3.75, 2.75, 2.75, 3.75, 3.5, 3.5, 2.0, 2.0, 3.0, 4.0, 4.0, 4.0, 2.0, 3.25, 3.25, 3.5, 2.0, 3.0, 3.0, 2.0, 3.5, 4.0, 3.0, 3.5, 3.5, 3.5, 3.75, 3.5, 3.5, 4.0, 3.0, 3.0, 3.25, 3.5, 4.0, 3.0, 3.0, 3.5, 3.25, 2.75, 4.0, 3.75, 4.0, 3.5, 3.5, 4.0, 3.25, 4.0, 3.75, 3.75, 3.5, 3.0, 3.75, 4.0, 1.5, 2.5, 2.75, 3.25, 3.0, 4.0, 2.5, 3.0, 3.5, 3.5], 'Average Rating': (156, 3.2516025641025643)}
print(chocolate_catalog["Chile"])
{'Scores': [3.75, 3.75], 'Average Rating': (2, 3.75)}

We've actually stored the average rating as a tuple. We assume our data won't change any time soon, so we can use a tuple to store how many ratings a country has along with the average rating. Some countries don't have much data, so just looking at averages may misrepresent a country's chocolate quality.

So, which country has the best chocolate? We'll iterate through our dictionary and filter out countries that have too few ratings.
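The filtering could be sketched as a small function. The best_country name, its min_ratings parameter and the placeholder countries are hypothetical, and the sketch assumes that a country with exactly the minimum number of ratings still qualifies.

```python
def best_country(catalog, min_ratings):
    """Return (country, average) for the highest average among countries
    with at least min_ratings ratings, or (None, 0.0) if none qualify."""
    best_name, best_average = None, 0.0
    for country, info in catalog.items():
        count, average = info["Average Rating"]  # unpack the stored tuple
        if count >= min_ratings and average > best_average:
            best_name, best_average = country, average
    return best_name, best_average


# Made-up catalog with placeholder countries
sample_catalog = {
    "Country A": {"Average Rating": (2, 3.75)},
    "Country B": {"Average Rating": (20, 3.4)},
    "Country C": {"Average Rating": (8, 3.6)},
}

print(best_country(sample_catalog, 5))   # ('Country C', 3.6)
print(best_country(sample_catalog, 15))  # ('Country B', 3.4)
```

Raising the threshold changes the winner here for the same reason it does in the real data: countries with only a handful of ratings drop out of the comparison.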


If we make five our minimum number of ratings, then Vietnam comes out on top. If we make 15 the minimum, then Brazil is the best. Thanks to the data structures that we learned, our data is now organized so that we and our boss can easily query the dictionary for useful information. Time to start stocking up on chocolate!

Summary

We've learned three important data structures in Python: dictionaries, tuples and sets.

[Figure: Parentheses and curly brackets!]

Each structure lends itself to particular situations that you may encounter as a Python programmer. Below are some common reasons you may need to use each structure:

[Figure: The Uses of Data Structures]