Graphing Get Ready
To help people get ready to enter for the Free eMetrics Pass, you did know about the contest right?
- One chart/graph whatever illustrating a cool web analytics challenge/opportunity
- 500 words or LESS describing that chart
- 500 words or LESS on why you should/want to go to eMetrics
- Emailed to firstname.lastname@example.org with the subject “Submittal” by February
Still not sure you want to go to eMetrics? My schedule for the conference is so busy I have calendared time to sleep. Seriously. It is going to be that awesome. See my previous post for some of the highlights, or just check out the Sentiment Analysis panel which I hear is going to be hella cool.
Before jumping into graphing, charting and creating flashy images to impress all comers, take a minute to understand the data.
OK, this might take more than just one minute but you will reap the rewards numerous times over.
Technical Note: The data are modeled data. So the visits have decimal remainders beyond the integers, since this is for a free ticket I didn’t want to do everything for you. That even
might make a nice addition to an entry, figuring out how to cut all those digits off.
Excel Analysis ToolPak
Start up Excel but don’t open the spreadsheet just yet, Daniel-san. For these exercises we are going to use the Excel Analysis ToolPak Add-In for Microsoft Excel. On the ribbon, or top menu, select the Data tab. Alllll the way to the right should be the option to use Data Analysis.
If you don’t have it, you should be able to install it in a snap. Select Tools, then Manage Add-Ins,
Depending on your version of Excel the option to manage Add-Ins is located either:
Click Tools on the menu bar. When you load the Analysis ToolPak, the Data Analysis command is added to the Tools menu.
Options Add-Ins Manage:Excel Add-Ins Select Analysis ToolPak OK
Microsoft provides instructions here. I would add that whatever sort of brownie points people get for answering questions in the MS Answers forums work very well. If you get stuck, post your question here and wait for someone to politely help you out.
Open your spreadsheet, open it up and you will see that there are lots and lots of rows and columns. The data are parted by day, from November 1, 2010 to December 31, 2010, and then every hour inside each day.
There are several segments included to make it more interesting, every segment is taken from Eric Peterson’s “Big Book of KPI’s” which is freely available for download so I will not cover them here.
Bounces, Not Rates
Amongst the first things that often need to be fixed for data analysis is the Bounce Rate. Why?
We need bounces, deriving bounces from the Bounce Rate and Visits is simple arithmetic. Add a column to the right of the Bounce Rate column for each segment, save a little time and select the formatting the same as the column to the right, an integer, and not the left, a percentage.
For each row, multiply the Visits of the segment by the Bounce Rate for the segment. Ta-da! Bounces.
Select the data, excluding Date and Hour for the time being, then select Data Analysis -> Summary Statistics. Double check that the range looks right, check that you have labels in your first row and that you want Summary Statistics.
At this point Excel will process the data, it may take a few seconds and your instance of Excel may even crash. If it crashes, consider separating the two monthly sets of data into separate sheets.
What you get back on a new sheet are a range of columns that should have confusing formatting with lots of numbers. We can fix that, in fact this can be scripted with VBA but that is outside the scope of the current project.
For each metric there are two columns, one describing each calculation and one with the values. Yes, it would have been more convenient to have only one description column for every metric.
That’s how Gnumeric does it, a Windows installer is available. Integrated Python and R consoles, every relevant Excel function and free?
Off We Go
Quickly Reviewing The Calculated Measurements
- Mean: the arithmetic average
- Standard Error: forget it for now
- Median: half the values are above it, half are below
- Mode: Most numerous observation
- Standard Deviation: A measure of the dispersion of the observed data from the mean
- The square root of the variance
- Higher SD means the data vary more during the observational period
- Sample Variance: A measure of how far numbers lie from the mean
- Kurtosis: We’ll call it the ‘pointiness’ of the data
- Skewness: We’ll call it the ‘leaningness’ of the data
- Range: The range of observations
- Minimum: The smallest observation
- Maximum: The largest observation
- Sum: Add up the observations
- Count: Hope you can get this one on your own!
Now you are empowered to delve into the data. Things to look are are comparing the calculated measurements across segments, such as the New Visitors. The kurtosis (pointiness) is 2.14 compared to 1.15 for Visits overall. How about that?
There is at least one more challenge to fix for the data set, the time parting is slightly interfering with data analysis. We’ll correct that flaw . . . tomorrow.
I have uploaded an updated Excel spreadsheet which would bring you up to date with the current post. It is available for download here.