Importing Google Analytics Data into Your Data Warehouse with Pentaho

Many companies that sell applications and games for desktop and mobile users build a data warehouse so they can create dashboards that show information about user patterns in these applications. By including web funnel data from your website into the data warehouse, you can go a step futher and get complete picture of your user acquisition and retention.


Data Source

Our data source is Google Analytics. To access the data, we'll need access to the profiles via a username and password.

For the purposes of this project, I recommend creating a new Google Analytics account for this project and giving it read-only access to the specific profiles you want to retrieve. Since you'll need to enter this account's password in another piece of software, it's generally a better practice never to use your own accounts.

ETL Software

To get the data into your data warehouse, you will need software that can extract the data, transform it into the necessary format, then load it into your data warehouse. Such software is called Extract, Transform, and Load (ETL) software. For this project, I'll be using Pentaho Data Integration.

The ETL software will need to be able to connect to your data warehouse. You will also need the ability to create new tables and insert/update data into these tables.

Retrieving & Storing Data with Pentaho

Pentaho comes with a free connector that you can use to query data from Google Analytics. It uses the same syntax as the Google Analytics Core Reporting API to define filters.

Before you query Google Analytics, you'll need to grab the following information with the ETL:

  • A date range
  • The filters for each filter step you want

On top of that, we will hard-code the user authentication information, the dimension (time), and metric (visitors).

Job: Master

master Pentaho job for Google Analytics ETL

The main thing you should ensure is that the Get Web Funnel Data job is set to execute for every input row on the "Advanced" properties tab. This will cause it to execute for every row that is returned in the previous transformation where we look up the web funnel steps.

Beyond that, this job houses little direct logic. The notification emails are helpful if you schedule your ETL to run daily so you know whether or not it succeeded.

Transformation: Lookup Web Funnel Steps

Pentaho transformation for looking up funnel steps for Google Analytics

To make the ETL more dynamic, it is helpful to store information about each funnel step in a database table. This will allow you to query for a new step in the future simply by adding a new row to this table. The table schema should look something like this:

  • funnel_step_id
    • auto-incrementing sequence (primary key)
    • Calling it "funnel_step_id" instead of "id" will make things easier for software that automatically links related tables together
  • name
    • String representation of the funnel step for reports.
    • Example: Landing Page Visitors
  • ga_filter
    • String representation of the filter that Google Analytics will use to execute the query for this funnel step.
    • Example: ga:eventAction==Start Registration

The "Copy Rows to result" output ensures the data is carried over into the next step of the ETL.

Job: Get Web Funnel Data

Pentaho job for querying Google Analytics

Like the master job, this one has little direct logic. It represents the loop that should execute for each funnel step that is returned from your earlier lookup.

Transformation: Set Variables for GA Query

Pentaho transformation for getting date and funnel information to set as global variables

This transformation is responsible for setting up the variables that you will need to feed into the actual query to Google Analytics.

Get rows from result

For this input, you should specify each of the table columns from your lookup table that you'll eventually set as an environment variable:

Pentaho dialog to get rows from previous result

Get Start and End Dates

Because Google Analytics sometimes doesn't update their data for 24 hours, you want to be sure you're fetching data for the past few days so you can update days that might have incomplete data. The easiest way to do that is to use the start of last week for your beginning date and yesterday for your end date. This will mean you're always retrieving between 7-14 days of data.

Pentaho dialog for getting system data

Scrub Date Fields

Pentaho's date format differs from the format you have to use to query Google Analytics. You can use JavaScript to transform the data into the proper format.

var dateFormat = "yyyy-MM-dd";

sYesterday = date2str(sYesterday, dateFormat);
sStartLastWeek = date2str(sStartLastWeek, dateFormat);

Afterward, make sure you set up the fields to update with the new values in the table below the JavaScript text area:

Pentaho dialog for modifying values with JavaScript

Set Global Variables

Now that you have all the variables you'll need for the query, make sure you set them with a scope of "Valid in the root job" so that other transformations and jobs have access to the them:

Pentaho dialog for setting environment variables

Transformation: Query GA and Store Results

Pentaho transformation for sending a query to Google Analytics and storing the results

At this point, the ETL has all the data it needs. All that is left is to execute the query and store the results.

Get Visitors

Enter your user credentials, select the appropriate Google Analytics profile, then enter all of the static and variable information as part of your query definition:

Pentaho dialog for getting environment variables

Google Analytics will return several fields that you do not need. To remove them, select them in the "Output Fields" section, right-click on them, and choose "Delete selected lines" from the menu.

Get funnel_step_id

At this stage, the date and visitor count is part of your result set. You'll need to add the funnel_step_id to the result set so we can link this result to the correct funnel step in your lookup table:

Pentaho dialog for getting environment variables

Insert / Update

Now that we have the date, funnel step, and visitor count in our result set, we can insert the data into our data warehouse (or update existing rows):

Pentaho dialog for inserting or updating to a database table

In the first table, make sure to specify both the date and the funnel_step_id as the keys to use to look up the values. If Pentaho finds an existing record for that date and funnel_step_id, it will simply update it with the new results rather than add a duplicate entry.

Next Steps

In one of my next articles, I will show how you can take the information you have just imported into your data warehouse and create funnel visualizations using QlikView. This will allow you to connect the dots between all of your funnels and allow you to visualize the end-to-end flow of your users.

comments powered by Disqus

If you liked this article, you may want to: