
An Introduction to Python for SEO Pros Using Spreadsheets

2019 far exceeded my expectations in terms of Python adoption within the SEO community.

As we start a new year, I hear more SEO professionals wanting to join in the fun but feeling frustrated by the initial learning curve, so I decided to write this introductory piece with the goal of getting more people involved and contributing.

Most SEO work involves working with spreadsheets, and you often have to redo the work manually when you handle multiple brands or repeat the same analysis over time.

When you implement the same workflow in Python, you can trivially reproduce the work or even automate the whole workflow.

We are going to learn Python basics while studying code John Mueller recently shared on Twitter that populates Google Sheets. We will modify his code to add a simple visualization.

Using the Wikipedia API, I pulled in some more fields for the curious :). Spreadsheet: https://t.co/OSBENEubgt – Colab code: https://t.co/sTAb1vk8N4

— 🍌 John 🍌 (@JohnMu) January 3, 2020

Setting up the Python Environment

Similar to working with Excel or Google Sheets, you have two primary options when working with Python.

You can install and run Python on your local computer, or you can run it in the cloud using Google Colab or Jupyter notebooks.

Let’s review each one.

Working with Python on Your Local Computer

I typically choose to work on my Mac when there is software that won’t run in the cloud, for example, when I need to automate a web browser.

You need to download three software packages:

Go to https://www.anaconda.com/distribution/ to download and install Python 3.7 for your operating system. Anaconda includes Python and most of the libraries that you need for data analysis.

This will take a while to complete.

Once done, search for the Anaconda Navigator and launch it.

Click to launch JupyterLab and it should open a new tab in your browser with a JupyterLab session.

Click on the big icon to start a Python 3 notebook and you are ready to start typing or copy/pasting code snippets.

You can think of this notebook as similar to a new Excel sheet.

The next step is optional.

Go to https://code.visualstudio.com/download and download and install Visual Studio Code for your computer.

I personally use Visual Studio Code when I need to write code in both Python and JavaScript, or when writing JavaScript on its own. You can also use it if you want to convert your notebook code into a command-line script.

It is easier to prototype in Jupyter notebooks and when you get everything to work, you can use Visual Studio Code to put everything together in a script or app that others can use from the command line.

Make sure to install the Python extension for VSC. You can find it here.

Visual Studio Code has built-in support for Jupyter Notebooks.

You can create one by typing the key combination Command+Shift+P and selecting the option "Python Jupyter Notebook".

Working with Python in the Cloud

I do most of my Python work on Google Colab notebooks so this is my preferred option.

Go to https://colab.research.google.com/ and you can skip the download and installation steps.

Click on the option to start a new Python 3 notebook and you will have the equivalent of a new Google Sheet.

Learning the basics of Python & Pandas

Mueller shared a Colab notebook that pulls data from Wikipedia and populates a Google Sheet with that data.

Professional programmers need to learn the ins and outs of a programming language, and that can take a lot of time and effort.

For SEO practitioners, I think a simpler approach that involves studying and adapting existing code could work better. Please share your feedback if you try this and see if I am right.

We are going to cover most of the same basics you learn in typical Python programming tutorials, but with a practical context in mind.

Let’s start by saving Mueller’s notebook to your Google Drive.

After you click the link, select File > Save a copy in Drive.

Here is the example Google sheet with the output of the notebook.

Overall Workflow

Mueller wants to get topic ideas that perform better in mobile compared to desktop.

What kind of content is more useful on mobile vs via desktop? Wikipedia to the rescue! Apparently, celebrity / entertainment & medical content rules mobile.https://t.co/lvEdYmNPB2 … also, Pomeranians?

He learned that celebrity, entertainment, and medical content does best on mobile.

Let’s read through the code and comments to get a high-level overview of how he figured this out.

We have several pieces to the puzzle.

Python Building Blocks

Let’s learn some common Python building blocks while we review how Mueller’s code retrieves values to populate a couple of fields: the PageId and Description.

We have two Python functions to retrieve the fields. Python functions are like functions in Google Sheets, but you define their behavior in any way you want. They take input, process it, and return an output.

Here is the PageId we get when we call get_PageId("Avengers: Endgame"): 44254295.

Here is the Description we get when we call get_description(pageId): "2019 superhero film produced by Marvel Studios".

Anything after the # symbol is considered a Python comment and is ignored. You use comments to document the intention of the code.

Let's step through, line by line, the get_PageId function to learn how it gets the ID of the article whose title we pass in.
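Here is a minimal sketch of what a function like get_PageId looks like, pieced together from the steps described below; the exact code in Mueller's notebook may differ slightly.

import urllib.parse
import requests
import json

def get_PageId(title):
    # Build the Wikipedia API query parameters for the article title
    q = {"action": "query", "format": "json", "prop": "info", "titles": title}
    url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(q)
    # Call the API and parse the JSON text into a Python dictionary
    response = requests.get(url)
    result = json.loads(response.text)
    # The page id is the only key under "pages"
    return list(result["query"]["pages"].keys())[0]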

q is a dictionary. It holds key-value pairs. If you look up the value of "action", you get "query" and so on. For example, you'd perform such a lookup using q["action"].

"action" is a string. It represents textual information.

"titles": title maps the "titles" key to the title that we passed as input to the function. All keys and values are hardcoded and explicit, except for the last one. This is what the dictionary looks like after we execute this function.
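Assuming we called the function with "Avengers: Endgame", the dictionary would look roughly like this (the "format" and "prop" entries are my assumption, based on the fields in the API response further down):

{"action": "query", "format": "json", "prop": "info", "titles": "Avengers: Endgame"}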

In the next line, we have the code that converts this dictionary into the request URL.

Here we have the module function urllib.parse.urlencode. Module functions are just like Google Sheets functions that provide standard functionality.

Before we call module or library functions, we need to import the module that contains them.

This import at the top of the notebook does that: import urllib.parse.

Let’s clarify the call and see the output we get.

You can find detailed documentation on the urlencode module function here. Its job is to convert a dictionary of URL parameters into a query string. A query string is the part of the URL after the question mark.

This is the output we get after we run it.
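Assuming the dictionary above, the call and its output would look something like this:

urllib.parse.urlencode(q)
# 'action=query&format=json&prop=info&titles=Avengers%3A+Endgame'

Note that the colon in the title gets percent-encoded and the space becomes a plus sign.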

This is what our URL definition line looks like after we add the result of urlencode.
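A reconstruction of that line (not copied verbatim from the notebook) would be:

url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(q)
# 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=info&titles=Avengers%3A+Endgame'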

The + sign here concatenates the strings to form one.

This resulting string is the API request the notebook sends to Wikipedia.

In the next line of code, we open the dynamically generated URL.

requests.get is a Python third-party module function. You need to install third-party libraries using pip.

You can run command-line scripts and tools from a notebook by prepending them with !

The code after ! is not Python code. It is Unix shell code. This article provides a comprehensive list of the most common shell commands.
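For example, a notebook cell like the one below installs the requests library (in Colab and Anaconda it usually comes preinstalled, so you only need this if the import fails):

!pip install requests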

After you install the third-party module, you need to import it like you do with standard libraries.

Here is what the translated call looks like.
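With our example title, the call is equivalent to something like this:

response = requests.get("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=info&titles=Avengers%3A+Endgame")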

You can open this request in the browser and see the API response from Wikipedia. The function call allows us to do this without manually opening a web browser.

The result from the requests.get call gets stored in the Python variable response.

This is what the result looks like.

{"batchcomplete": "",
"query": {"pages": {"44254295": {"contentmodel": "wikitext",
"lastrevid": 933501003,
"length": 177114,
"ns": 0,
"pageid": 44254295,
"pagelanguage": "en",
"pagelanguagedir": "ltr",
"pagelanguagehtmlcode": "en",
"title": "Avengers: Endgame",
"touched": "2020-01-03T17:13:02Z"}}}}

You can think of this complex data structure as a dictionary where some values include other dictionaries and so forth.

The next line of code slices and dices this data structure to extract the PageId.
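Assuming the parsed response is stored in a variable called result (for example, result = json.loads(response.text)), that line likely looks like this:

pageId = list(result["query"]["pages"].keys())[0]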

Let’s step through it to see how it gets it.

When we look up the value for the key “query”, we get a smaller dictionary.

{"pages": {"44254295": {"contentmodel": "wikitext",
"lastrevid": 933501003,
"length": 177114,
"ns": 0,
"pageid": 44254295,
"pagelanguage": "en",
"pagelanguagedir": "ltr",
"pagelanguagehtmlcode": "en",
"title": "Avengers: Endgame",
"touched": "2020-01-03T17:13:02Z"}}}

Then, we look up the value of “pages” in this smaller dictionary.

We get an even smaller one. We are drilling down on the big response data structure.

{"44254295": {"contentmodel": "wikitext",
"lastrevid": 933501003,
"length": 177114,
"ns": 0,
"pageid": 44254295,
"pagelanguage": "en",
"pagelanguagedir": "ltr",
"pagelanguagehtmlcode": "en",
"title": "Avengers: Endgame",
"touched": "2020-01-03T17:13:02Z"}}

The PageId is available in two places in this slice of the data structure: as the only key, or as a value in the nested dictionary.

John made the most sensible choice, which is to use the key to avoid further exploration.

The result of this call is a Python dictionary view of the keys. You can learn more about dictionary views in this article.

We have what we are looking for, but not in the right format.

In the next step, we convert the dictionary view into a list.

This is what the conversion looks like.
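Here is a small sketch of both steps, assuming the parsed response is stored in result:

pages = result["query"]["pages"]
pages.keys()         # dict_keys(['44254295']), a dictionary view
list(pages.keys())   # ['44254295'], now a regular Python list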

Python lists are like rows in a Google sheet. They generally contain multiple values separated by commas, but in this case, there is only one.

Finally, we extract the only element that we care about from the list. The first one.

The first element in Python lists starts at index 0.

Here is the final result.
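Indexing into that list gives us the PageId as a string:

list(pages.keys())[0]   # '44254295'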

As this is an identifier, it is better to keep it as a string, but if we needed a number to perform arithmetic operations, we would do another transformation.

In this case, we get an integer.

The main difference between strings and integers is the type of operations that you can perform with them. As you saw before, we can use the + operator to concatenate two strings, but if we used the same operator on two numbers, it would add them together.
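A quick illustration of the difference, using the PageId from our example:

"442" + "54295"    # '44254295' (string concatenation)
442 + 54295        # 54737 (integer addition)
int("44254295")    # 44254295, an integer we could do arithmetic with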

As a side note, I should mention jq, a cool command-line tool that allows you to slice and dice JSON responses directly from curl calls (another awesome command-line tool). curl allows you to do the equivalent of what we are doing with the requests module here, but with limitations.

So far we've learned how to create functions and use data types that allow us to extract and filter data from third-party sites (Wikipedia in our case).

Let’s call the next function in John’s notebook to learn another important building block: flow control structures.

This is what the API URL looks like. You can try it in the browser.
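Judging by the fields in the response below, the URL is probably along these lines (the exact parameters in the notebook may differ):

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=pageterms&pageids=44254295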

Here is what the response looks like.

{"ns": 0,
"pageid": 44254295,
"terms": {"alias": ["Avengers Endgame", "Avengers End Game", "Avengers 4"],
"description": ["2019 superhero film produced by Marvel Studios"],
"label": ["Avengers: Endgame"]},
"title": "Avengers: Endgame"}

This is the code we will step through to understand control flow in Python.
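Here is a sketch of that check, assuming the page record shown above is stored in a variable named rs (the variable name is my assumption):

if "terms" in rs and "description" in rs["terms"]:
    description = rs["terms"]["description"][0]
else:
    description = ""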

This part checks if the response structure (above) includes a key named "terms". It uses the Python if … else control flow statement. Control flow statements are the algorithmic building blocks of programs in most languages, including Python.

If this check is successful, we look up the value of that key with rs["terms"].

We expect the result to be another dictionary, and we check it to see if there is a key named "description".

If both checks are successful, then we extract and store the description value.

We expect the final value to be a Python list, and we only want the first element as we did before.

The and logical operator combines both checks into one where both need to be true for it to be true.

If the check is false, the description is an empty string.

Populating Google Sheets from Python

With a solid understanding of Python basic building blocks, now we can focus on the most exciting part of Mueller’s notebook: automatically populating Google Sheets with the values we are pulling from Wikipedia.

Let’s step through some interesting parts of this function.

The functionality to update Google Sheets is covered by a third-party module.

We need to install it and import it before we can use it.

Mueller chose to convert the sheet into a pandas data frame and, as he mentions in the comments, it was not strictly necessary, but it gives us the opportunity to learn a little bit of pandas too.

At the end of every helper function that fills a column, we have a call to the update_spreadsheet_rows function.

We are passing the relevant columns and the function that will get the corresponding values.

When you pass the name of a function as a parameter in Python, you are not passing data but code for the receiving function to execute. This is not something that, as far as I know, you can do in a spreadsheet.
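Here is a hypothetical, stripped-down illustration of the idea; the notebook's update_spreadsheet_rows helper is more elaborate, but the principle of passing a function without calling it is the same:

def fill_column(fieldName, functionToCall):
    # functionToCall is executed here, once per title in the sheet
    return [functionToCall(title) for title in df["Title"]]

page_ids = fill_column("PageId", get_PageId)  # note: get_PageId is passed without parentheses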

The first thing we want to know is which column we need to update. When we run that lookup, we get 7, which is the column position of the PageId in the sheet (starting with 1).
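One plausible way to get that position with pandas (the notebook's exact code may differ):

columnNr = df.columns.get_loc("PageId") + 1   # get_loc is zero-based, so we add 1 to get 7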

Further down in the function, we have another control flow statement, the for loop. For loops allow you to iterate over elements that represent collections, for example, lists and dictionaries.

In our case above, we are iterating over a dictionary where the index variable will hold the key, and the row variable will hold the value.

To be more precise, we are iterating over a dictionary view; a dictionary view is like a read-only and faster copy of the dictionary, which is perfect for iteration.

<generator object DataFrame.iterrows at 0x7faddb99f728>

When you print iterrows, you don't actually get the values, but a generator object.

Iterators are objects that produce data on demand; they require less memory and perform faster than accessing collections manually.

This is an example iteration of the for loop. I printed the index and row values.
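You can reproduce a similar printout with a small loop of your own (this snippet is mine, not from the notebook):

for index, row in df.iterrows():
    print(index)
    print(row)
    break  # stop after the first row, just to inspect one iteration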

forceUpdate is a boolean value which defaults to False. Booleans can only be True or False.

row["PageId"] is empty initially, so not row["PageId"] is true and the next line will execute. The or operator allows the next line to execute for subsequent runs only when the flag forceUpdate is true.

This is the code that calls our custom function to get the page ids.
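A sketch of how the check and the call likely fit together inside the loop; the exact variable names are assumptions:

if forceUpdate or not row["PageId"]:
    result = get_PageId(row["Title"])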

The result value for the example iteration is 39728003

When you review the function carefully, you will notice that we use df, which is not defined inside the function. The code that defines it is at the beginning of the notebook.

The code uses the third-party module pandas to create a data frame from the Google Sheet rows. I recommend reading the 10 Minutes to pandas guide to get familiar with it. It is a very powerful data manipulation library.
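The data frame creation probably looks something like this, assuming worksheetRows holds the raw sheet rows and the first row contains the column names:

import pandas as pd

df = pd.DataFrame(worksheetRows[1:], columns=worksheetRows[0])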

Finally, let's see how we update the Google Sheet.

This code can be translated into a call with concrete values for the row, column, and cell value.

This is the code that updates the Google sheet. The variable worksheet is also not defined in the update_spreadsheet_rows function, but you can find it at the beginning of the notebook.
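The update is likely a gspread update_cell call along these lines; the +2 offset is my assumption (one to skip the header row and one to convert the zero-based index to one-based sheet rows):

worksheet.update_cell(index + 2, columnNr, result)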

I left this setup code for last because it is more complicated than the previous code. However, it is the first thing you need to execute in the notebook.

First, we import the third-party module gspread and complete an OAuth authentication in Chrome to get access to Google Sheets.

We manipulate the Google Sheet with the worksheet variable, and we use the worksheetRows variable to create the pandas DataFrame.
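Here is a sketch of a typical Colab plus gspread setup that matches this description; the sheet name is a placeholder and the notebook's actual code may differ:

from google.colab import auth
import gspread
from google.auth import default

# Authenticate the Colab session and authorize gspread
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# Open the spreadsheet and read all rows into a list of lists
worksheet = gc.open("Wikipedia Example Sheet").sheet1
worksheetRows = worksheet.get_all_values()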

Visualizing from Python

Now we get to your homework.

I wrote code to partially reproduce John’s pivot table and plot a simple bar chart.

Your job is to add this code to your copy of the notebook and add print(variable_name) statements to understand what I am doing. This is how I analyzed John's code.

Here is the code.
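Here is a sketch of what that homework code could look like; the column names below are placeholders that you will need to match to the actual sheet:

import pandas as pd
import matplotlib.pyplot as plt

# Sheet values arrive as strings, so convert the numeric columns first
# (Views_Mobile and Views_Desktop are placeholder names)
df["Views_Mobile"] = pd.to_numeric(df["Views_Mobile"])
df["Views_Desktop"] = pd.to_numeric(df["Views_Desktop"])

# Group rows by description and compare mobile vs. desktop views
pivot = df.pivot_table(index="Description",
                       values=["Views_Mobile", "Views_Desktop"],
                       aggfunc="sum")

# Plot the top ten as a horizontal bar chart
pivot.sort_values(by="Views_Mobile", ascending=False).head(10).plot(kind="barh")
plt.show()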

Resources to Learn More

If you got this far and want to learn more, I recommend you follow the links I included in the article and practice the code snippets in this guide.

At the end of most of my columns, I share interesting Python projects from the SEO community. Please check out the ones that interest you and consider studying them as we did here.

But, even better, see how you might be able to add something simple but valuable that you can share back!

