Mailchimp Data Preparation & Modeling for Email Campaign Optimization

In this article we’ll use the information we can derive from the structure of the email campaign data coming from Mailchimp, to craft new variables or features that hopefully will help us with our email marketing campaign optimization. We will attempt to achieve the following:

  • Figure out a way of identifying the gender of our user
  • Segment the e-mails of our users in different categories, for example if an e-mail address is personal or a business address.
  • Segment the events of our recipients into different periods of the day.

We should be able to do the above with the information we get from our Mailchimp email campaign data together with some additional data sources we might need, as we will see later on this post. Together with the process of deriving this implicit information from our email marketing data, we are also going to “clean” it, so for example if for one of our users we fail completely to come up with a gender we might remove the entry completely from the data set.

Explore the email campaign data model we’ll be using

To achieve the tasks we defined previously, we need to work with two of the resources that Mailchimp has as a platform and that are exposed through its API. The first resource is the List Member. It is quite interesting to notice that Mailchimp does not have a high level resource for a user, instead our users are represented as list members. This makes sense as Mailchimp is all about email campaigns that are performed on a list of recipients and thus its datamodel is optimized with this assumption in mind. The list member contains the following attributes we will need.

  • email_address: self-explanatory
  • merge_fields: Now this is not that obvious why we need it. Under this attribute which is an Object exist two attributes that are not documented on the documentation of the List Member resource. The FNAME and LNAME attributes. In these two the first name and last name respectively of our user are stored. We will need these to figure out the gender of our user.

The second resource is Email Activity Report. Under this resource we find a list of events that each member list has performed, related to a specific email campaign. So for example we can figure out who and when opened an email that was send in a campaign, if the recipient clicked on a url or how many times the mail was opened.

Each Email Activity Report is organized around the List Members that participated in the email campaign. For each member a list of events is kept called activity. Each event has the following structure:

  • action: the type of the event, e.g. open.
  • Timestamp: when the event was triggered
  • IP: the IP from which our user triggered the event

As we can see we need just a small fraction of the campaign data Mailchimp keeps to support our analysis. What we need to consider at this point though, is that although the email data is available through the Mailchimp API, it cannot be used directly for analytic purposes. The reason is that the Mailchimp API delivers complex hierarchical objects in JSON as responses while the algorithms we would like to use assume tabular data. Of course someone might consider the process of bringing the data into the right structure as part of data preparation, but having the campaign data in a database instead of polling the Mailchimp API every time we need to perform our analysis has a number of benefits.

  • Pulling data from an API is time consuming, especially if we have a considerable amount of data we want to pull from it. So by syncing Mailchimp with a database or data warehouse guarantees that we will always have the data we need available and up to date.
  • The campaign data we’ll get from the database will already be in a tabular form.
  • We can perform queries on the database and fetch only part of the data based on what kind of analysis we want to perform. In this way we’ll be able to perform our analysis faster and with a lower cost.

For this guide we will assume that the data are stored in a relational database already, having the following structure which is the one that Blendo delivers.

  • The table mailchimp_list_members which contains the information of all the members we have together with a list id. The column names that are of interest to us are the following:
    • mergefieldsfname which contains the first name
    • mergefieldslname which contains the last name
    • email_address which contains the email address of the user
  • The table mailchimp_report_email_activity which contains the information of the “list members” activity for all our email campaigns. The columns that are of interest to us are the following:
    • email address
    • action
    • timestamp The database is completely in sync with the Mailchimp platform, so all the email campaign data that Mailchimp has for our account will be replicated in the database that we will use. The above fields are the minimum required ones we need to perform the tasks we defined at the beginning.

Pull email campaign data from a database with Python

We have our campaign data on a PostgreSQL database but any other database will do, and we also know the data we want to pull out. For this tutorial we will be using Python and Pandas but for other choices and more information on how you can pull data from PostgreSQL and Amazon Redshift among others you can check this post.

Based on the Mailchimp data model, the data we need lives in two different tables, we have two choices here the first one is to pull data from each table into one DataFrame and join using pandas or we can perform the join directly on the database and fill just one dataframe. We will go with the second option, mainly because it will much more efficient to perform the join on the database, especially if we plan to pull a large amount of data. Additionally we will ask the database to return back only the columns we are interested in and thus make it much easier to work with the campaign data using Python and pandas.

So, executing a query on a database using SQLAlchemy and Python pandas is a quite straight forward task:

All it takes to pull our data from the database is to create an engine providing the correct credentials for our database, define our query which in this case we also perform a join on our tables and select specific fields to be returned. Finally we are just one command away from having our data loaded into a pandas DataFrame by using the read_sql_query command.

Gender identification

Now to the more interesting stuff, we got our data into a DataFrame and now we need to infer somehow the gender of each user. Our email campaign data from Mailchimp contains the fields, last_name and first_name, we’ll use this information to infer the gender of our recipient. To do that we’ll be using data from the United Census Bureau, which has made available two data sets that contain female and male names together with the following additional information for each one:

  • Frequency of occurrence for the name
  • Cumulative frequency of occurrence for the name
  • Rank

Our algorithm is quite simple, every first name that we have in our Mailchimp dataset will be matched against these two datasets. Then we will decide if it is a male or a female name and assign this as the gender of our recipient. There are some really strong assumptions here:

  • We assume that all our recipients are US citizens.
  • Our Mailchimp email campaign dataset is complete, in the sense that for every recipient at least the first name exists.

We will continue with this assumptions and later we will see how we can work without them.
The first thing that we have to do is to load these two additional datasets as DataFrames.

As the files do not include a header with column names, we also instruct Pandas how to name the columns. Then we need to do some really trivial pre-processing and cleaning on the campaign data. As you might have noticed, there’s a small difference between how we loaded the female and male names. It seems that, although not very popular, there are girls in the US named NA which is a reserved term from Pandas for NaN values. So we had to explicitly overwrite the default NaN values to load all the rows for the female names.

Then we clean a bit our data.

Two things happen here, first we make sure that we do not have any entries where the name is actually empty and then we remove any row that has a NaN value. These two actions are unnecessary because of the assumptions we made at the beginning but it doesn’t do any harm to perform them anyway.
Now we are ready to perform the actual gender identification for each recipient. Our algorithm works as following:

  • Do a similarity check of the name against all the female and male names and sort the results in a descending order.
  • Keep the first entries of the results. In this way we keep the most similar female and male match.
  • Compare the two and if the male similarity is greater, return male and if the female similarity is greater return female.
  • If the two similarities are equal, then check the frequencies for each name and keep the one that is greater.

It is common to have the same name given to both male and female children, but there’s a big difference on their frequency. For example the name ALEX can be found both as a male and female but the first has a frequency of 0.115 while the latter a frequency of 0.002.

To measure the similarity between the names we used the Difflib library of Python and of course you are encouraged to use other methods and metrics for calculating the similarity.

The above function implements the algorithm we just described. With it, generating the gender for each recipient is a matter of one line of code,

Where we apply the function to each recipient and we join the resulting column to our DataFrame.

Back to our assumptions. Although US is a multi-cultural country, it’s almost certain that you will end up with some names that are not included in this list. Especially if you are operating an Internet company. One way to overcome this issue is by introducing the nationality as another variable and find more authority files like the one we found from the US Census Bureau but from other countries.For the second assumption, one possible solution is to process the e-mail address. In most cases people are using the full names inside their email accounts so it is quite possible to find the name from it and fill any missing values.

The above possible solutions are not guaranteed to work. The best way to enhance the quality of your campaign data is by introducing additional data sources that will be joined in the same database, in this case data coming from your CRM. The profiles you will have for your customers will be much more complete on your CRM.

Email type identification

The second task that we want to perform on our campaign data is to figure out if our recipient is using a business e-mail address or not. The approach here is similar to what we did with the genders but much simpler. We will again using an authority set with e-mail addresses that we know that are personal in most cases, like Gmail accounts. Then all we have to do is check if the domain of the e-mail address of our recipient is found in this authority set.

Event time binning

In this final task, we want to identify when our recipient interacted with our e-mails. If it happened during work time, before or after. Of course you are free to refine the definitions of these time intervals or even add more, for example you might want to see if the mail was opened during typical work break hours. But before we implement this task, we need to extract the hour from the timestamp column. After we have a new column named hours which contains the hour of the day (assuming a 24h day) we can substitute it with the period we have defined by applying to it the occuredAt function.

One thing that we might want to consider here is the actual time that the event happened. There are three different “times” involved. The first one is our time which corresponds to the exact time we sent the email, the second is the time that Mailchimp tracks which might differ from our time and then of course is the time of the recipient who interacted with the email. If we want to be precise with our analysis we need to take into consideration the time zone of the recipient and again the best way to handle this is by incorporating data coming from another source and more specifically your CRM where you should keep track of where your customer operates.


That’s it, we now have calculated all the variables that we need in order to perform our analysis on the events we collect from Mailchimp. Of course there’s space for a lot of improvement, the methodologies we followed here are quite trivial but if we have our email campaign data readily available on a database and if we can sync to it more data sources like our CRM in addition to Mailchimp, the possibilities are endless. If you want the code in one place you can check here and of course leave me your comments, questions and suggestions if you have any.