Keep a Local Copy of Google Analytics With More Precise Timestamp #measure

HOWTO: GA on Your Server

How to Save and Process Google Analytics Data Locally

When Jason Thompson asked his question about more precise visitor times, as quoted by Avinash, I thought about setting a custom variable with a JavaScript function.

There is another alternative with an additional benefit: store a copy of your Google Analytics locally on your server.

If you do this, you not only have a backup copy of your Google Analytics data, you can also go back and reprocess that data, provided you have the technical capabilities.

Official Google Analytics Documentation

According to the official Google Analytics Documentation, Google Analytics does not track more precisely than to the hour.

That is, of course, not the whole story.

Google Analytics Hack

I highly recommend ‘Advanced Web Metrics with Google Analytics’ by Brian Clifton. He provides solid instructions and details for many implementations of Google Analytics. I’m going to use either his examples or modified versions of his examples.

Brian’s book provides an example for the traditional snippet, and the Google Code section of Google Analytics provides pretty clear instructions for the asynchronous one:

_gaq.push(['_setAccount', 'UA-XXXXX-X']);
_gaq.push(['_setLocalRemoteServerMode']); // also request __utm.gif from your own server

So all one would need to do is add a line of code with a call to _setLocalRemoteServerMode to the Google Analytics tracking code.

That’s a snap!

Server Log Files

The data should then be recorded in the server log file on your server, whether that server is Apache, Microsoft or otherwise.

Brian provides this example in Advanced Web Metrics for an Apache server:

- [03/Jan/2010:00:17:01 +0000] "GET /images/book-cover.jpg HTTP/1.1" 200 27095 "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv: Gecko/2009101601 Firefox/3.0.15 (.NET CLR 3.530729"

Hosting companies often trim log files after a week, or once they exceed a certain size, so make sure to set up a cron job to either move the data to another part of the server or process it outright.

Bash scripts to accomplish this task abound on the internet.
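If you would rather stay in Python than Bash, a minimal sketch of such an archiving step might look like the following. The function name `archive_log` and the idea of a date-stamped, gzip-compressed copy are my own assumptions for illustration; the paths you pass in depend entirely on your host.

```python
import gzip
import shutil
from datetime import date
from pathlib import Path


def archive_log(log_path, archive_dir, today=None):
    """Copy a server log into archive_dir as a date-stamped .gz file.

    Both paths are hypothetical; point them at wherever your host
    writes its logs and wherever you want the copies kept.
    """
    log_path = Path(log_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Name the copy after the original file plus the current date.
    stamp = (today or date.today()).isoformat()
    target = archive_dir / f"{log_path.name}.{stamp}.gz"

    # Stream the log into the compressed copy without loading it all in memory.
    with open(log_path, "rb") as source, gzip.open(target, "wb") as dest:
        shutil.copyfileobj(source, dest)
    return target
```

A cron entry could then run a small script that calls `archive_log` once a day, before the hosting company's trimming kicks in.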

Data Processing

Each record is a single line in the server log, and as you already know the log files can get rather large, so importing the raw data into Excel is not an option.

Individual needs combined with the numerous options available make it nearly impossible to cover the universe of possibilities, so consider an example where you want the date, time and page visited.

Depending on the flavor of RegEx your programming language uses, the exact pattern may vary, but a regular expression can grab this data from the server log file.
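As one possible illustration, here is a minimal Python sketch using the standard `re` module. The pattern and the `parse_line` name are my own assumptions for this example, not taken from Brian's book, and it assumes the bracketed Apache timestamp format shown in the sample record above.

```python
import re
from datetime import datetime

# Capture the bracketed timestamp and the requested path of an Apache log line.
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\]\s+"(?P<method>[A-Z]+)\s+(?P<path>\S+)\s+[^"]*"'
)


def parse_line(line):
    """Return (date, time, page) for one log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    # Example timestamp: 03/Jan/2010:00:17:01 +0000
    stamp = datetime.strptime(match.group("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
    return stamp.date().isoformat(), stamp.time().isoformat(), match.group("path")
```

Because the timestamp is parsed into a `datetime`, you keep the full second-level precision rather than the hourly buckets the reports show.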


Charting The Visits

I will post some code on how to export the data into an Excel compatible file later in the week; check back for that and some other illustrations with server log data.

Need help with your Google Analytics?

I am always looking to connect with hard working, smart people who have interesting projects.

Send me an email or give me a holler on Twitter.

Technorati Tags: Analytics, Google Analytics, Measure

Posted in Google Analytics

#Measure in 2010 – Figuring Out Measure Network Analysis

How to Understand This Chart

Circles for Individuals

Each person in the #measure in 2010 chart is represented by an oval, or circular, shape.

The size of the shape is determined by:

  • Number of Tweets tagged with #measure
  • Modified by variety of content
  • Reduced for re-tweets

See the previous blog post for more detailed background on the algorithm employed.

Once each person is sized by content, each arrow in or out of a person represents a message to or from that person.
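To make the arrows concrete, here is a minimal Python sketch of how messages could be turned into directed edges. It assumes a message is detected by an @mention in the tweet text, which is my simplification; the function name `message_edges` and the sample handles in the usage note are made up for illustration.

```python
import re

# An @mention that is not part of an email address or another word.
MENTION = re.compile(r"(?<!\w)@(\w+)")


def message_edges(tweets):
    """Return (sender, recipient) pairs, one per @mention in each tweet.

    `tweets` is an iterable of (sender, text) tuples. Handles are
    lowercased so that @Bob and @bob count as the same person.
    """
    edges = []
    for sender, text in tweets:
        sender = sender.lower()
        for recipient in MENTION.findall(text):
            recipient = recipient.lower()
            if recipient != sender:  # ignore self-mentions
                edges.append((sender, recipient))
    return edges
```

For example, `message_edges([("alice", "@bob nice post #measure")])` yields one arrow from `alice` to `bob`, which a graph library could then draw.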

Peter O’Neill

Take, for example, a zoomed-in view of Peter O’Neill. Peter is an excellent analyst based in London; if you are looking for someone in that area, consider using his services.

The shape is not very large, but the messages which Peter sent to “daverooney,” “rockergirrl” and “benjamingaines” are clearly labeled with arrows.

Lower Corner

In the lower right corner there is a collection of people who tweeted infrequently, but sent one message tagged with #measure to another person.

For whatever reason they aren’t engaged with the conversation going on in the center of the #measure network. I recognize several of the Twitter IDs and am very curious why this is the case.

Jason Thompson

Jason Thompson has an interesting array of messages coming and going from his profile which are tagged with #measure.

He has lots of interaction with the people who message him with the #measure hashtag.


That’s about all there is to it . . . sorry if you wish it were more complicated.

As far as image size goes, I could have excluded more people from the sample for 2010; as it is, I excluded anyone who used #measure only once.

WAA Spring Gala

I hope to start coloring in the attendees of the Spring Gala, and perhaps eMetrics SF overall, later this week.

Thomas Bosilevac, winner of the free eMetrics pass, has let me know he is going to attend. If he can make it, shouldn’t you?

Make sure to register for the WAA Spring Gala on the WAA site.

Technorati Tags: Graph Theory, Jason Thompson, Measure, Twitter

Posted in Analytics, Linguistics, Natural Language Processing, Social Media, Twitter

Explaining #Measure in 2010 – Limitations and Improvements of Twitter Content Valuation

We’re Only as Accurate as Our Least Accurate Measurement

Wax On, Wax Off

The Twitterati chart posted previously turned into something I didn’t quite expect, so I would like to take a moment to explain the whole situation to the sensei of the #measure world.

Web Analytics Association Spring Gala

My understanding is that all the prominent people involved are attending the WAA Spring Gala on March the 15th. Interested in catching up with us in person?

Grab your ticket on the Web Analytics Association site.


  • The data are limited to tweets tagged with “measure”
    • Tagged it with “measure” immediately followed by punctuation?
    • May not be there
    • Tagged it with “socialmedia”?
    • Not there
  • The data are further limited to those collected by Twapper Keeper

Working with Twitter, I know 100% data coverage can be . . . challenging.

Data Definition

Tweets appropriately tagged with “measure” and successfully captured by Twapper Keeper during the date range of January 6, 2010 to December 31, 2010 comprise the data set.

Data Cleansing

My initial analysis indicated that Twapper Keeper captured both “#measure” and “measure,” the hashtag and the keyword, so I removed those marked without the hashtag.
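A cleansing step like this can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the exact filter I ran: the `has_measure_hashtag` name is hypothetical, and the pattern treats the hashtag as valid even when punctuation follows it directly.

```python
import re

# "#measure" as a whole hashtag: not glued to a preceding word,
# and not the start of a longer tag such as "#measurement".
HASHTAG = re.compile(r"(?<!\w)#measure\b", re.IGNORECASE)


def has_measure_hashtag(text):
    """True if the tweet text carries the #measure hashtag itself."""
    return HASHTAG.search(text) is not None
```

Filtering the archive then comes down to keeping only the tweets for which `has_measure_hashtag` returns `True`.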


Twitter measurement tools give some value to being re-tweeted, which is certainly a measurement of something. Whether that something is popularity, utility or something else is, I believe, undecided at this point.

Take, for example, a couple of recent tweets by prominent members of the #measure community:

BeyWebAnalytics: Episode 40 – Shaking the baby with Evan Lapointe | Beyond Web Analytics! #measure

Cool! A new episode with Rudi, Gary, Adam and Evan Lapointe as a guest. This tweet was re-tweeted by two people.

EricTPeterson: Amazed at how @analysisxchange is taking off under Wendy’s guidance. If only there was an award I could nominate her for!

Re-tweeted a grand total of one time, by me, this tweet, which is complex for machines to interpret, adds Eric’s opinion on who you should vote for an award . . . without explicitly mentioning the award!

Re-Tweeting to Success

I’m not sure how re-tweets are taken into account. If they are scored on a linear scale, how does the value of the @BeyWebAnalytics tweet compare to the value of the tweet by @EricTPeterson? Is it double the value because two people re-tweeted it?

I like the @BeyWebAnalytics podcast; however, in terms of what I want to see in real time, I value @EricTPeterson’s tweet about the Analysis Exchange much higher than the @BeyWebAnalytics podcast tweet.

Finding a way to value tweets that are closer to real interactions, over tweets that are primarily marketing in nature, is what I am ultimately curious about.


Another interest is plagiarism, which Stéphane Hamel recently blogged about. Part of my comment on his post was about my perception that plagiarism on Twitter is increasing; in my empirical observations, the use of the ‘via’ indicator has decreased.

I have seen content, whether links to articles or tweet content directly, shared without attribution. This doesn’t happen often, but even infrequently this isn’t ok.

Valuation of Content

Figuring out how we, as a community, can value content on Twitter within the extreme limitation of 140 characters is a positive move forward. Whether Twitter influence stems directly from good Twitter content was not proven in my analysis, so as a good scientist I could probably have made a different claim.

The spiral of over-promotion has spread to what should be our addition to social media, you know, just one of the most important technological developments in the history of mankind.

Variety of Content

I used entropy as a measure of the variety of topics. Commonly used in information theory, it can be an effective measure of disorder in a container.

I made no qualifications, predictions or guarantees regarding the semantic profile of the content. I just made the claim that people talk about different topics, and that some people talk about different stuff more frequently.

Consistency in Self-Tagging

The big winner was @Ulyssez, and people were taken aback at how he could be the winner.

Simple: my guess is that he self-tagged with the “#measure” hashtag more consistently than anyone else in the community.

Development of Networks

I actually thought the more inflammatory portion of the chart (click through the image above for the high resolution copy) was the networks that developed.

The arrows leading in or out of a person are messages to or from that person.

  • The groups on the side of the chart, why are they on the side?
  • Do they realize that they are outside the core conversation?
  • The groups towards the center of the chart, what are they talking about?
  • Is there anything one of the outlier groups is talking about that the inside groups should listen to?

Valuing the Community

Eric T. Peterson and Jeff Katz have done an excellent job moving us in that direction with Twitalyzer; my hope is that companies such as Twitalyzer continue to evolve and succeed.

It is incumbent upon us, the community, to encourage the use of tools which are based on data and not on the awesomeness of a GUI.

The sooner we draw a demarcation between the companies we know provide quality product, and those that do not, the sooner we can move forward measuring all those things we’d like to.

Technorati Tags: Eric T. Peterson, Jeff Katz, Measure, Social Media, Tim Wilson, Twitalyzer, Twitter, Valuing Content

Posted in Analytics, Linguistics, Measure, Natural Language Processing, Social Media, Tools, Twitter

#Measuring in 2010 – Analyzing the Twitter Data of #Measure Twitterati

#Measure in 2010

Who influences the influencers?

#Measure Twitter Graph 2010

Click through the image for a high res picture . . . it is pretty big so it may take a second or two.

#Measure Twitterati

In preparation for publicizing the Web Analytics Association Spring Awards Gala I wanted to know who were the most influential members of the #measure Twitterverse. Fortunately, @minethatdata had great foresight to store the tweets tagged with #measure on Twapper Keeper.

@Minethatdata has been posting about the dynamics of the group for a while, so I took a different tack. That, and somehow improving upon the declaration that @MicheleHinojosa is the oxygen of the community seemed impossible.

Data Distribution

Tweets with the #measure hashtag are distributed as shown in this graph, with just a selection of users producing most of the content:

Content is King

I used a scoring algorithm to value an increased variety of content, specifically Entropy as defined by:
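The formula itself did not survive on this page. Assuming the standard Shannon entropy from information theory is what was meant, it can be written as follows, where p(x_i) is the relative frequency of the i-th distinct term and the base-2 logarithm is itself an assumption:

```latex
H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
```

A user who repeats the same few terms scores low, while a user whose terms are spread evenly over many topics scores high.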

This is applied to the content of all tweets from a user in 2010 tagged with #measure; entropy is a calculated metric which Google uses in part to value content on web pages.

Final Algorithm

The final algorithm turned out like this:

Value = Count of Tweets + Entropy – 0.5 * Count of Retweets

I subtract a bit for retweeting so that original content is valued slightly higher.
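For readers who want to try the scoring themselves, here is a minimal Python sketch of the formula above. It is not the code I ran: it assumes Shannon entropy over lowercased, whitespace-separated words, and it assumes a retweet is any tweet starting with "RT ". The names `shannon_entropy` and `user_value` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (base 2) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def user_value(tweets):
    """Value = count of tweets + entropy - 0.5 * count of retweets.

    `tweets` is a list of tweet texts from one user. Tokenizing on
    whitespace and spotting retweets by an "RT " prefix are assumptions.
    """
    tokens = [word.lower() for text in tweets for word in text.split()]
    retweets = sum(1 for text in tweets if text.startswith("RT "))
    return len(tweets) + shannon_entropy(tokens) - 0.5 * retweets
```

Note that with this formula the tweet count dominates for prolific users, since entropy grows only logarithmically with the number of distinct words.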


Interesting to me was the value of @ulyssez, who scores very high with this algorithm. Get him to re-tweet your stuff to gain followers!

Also interesting are the groups outside the main group of information exchange, sometimes consisting primarily of vendors. If they are tweeting to sell to the #measure crowd, 2010 was anything but a success in that effort.


I did not take into account clicks on links or how often tweets were retweeted, either of which would surely improve the utility.

Email Michael

Questions? Comments?

Interested in working with me?

Email me.

Technorati Tags: Graph Theory, Measure, Social Graph

Posted in Analytics, Python, Social Media, Twitter

eMetrics Pass Winner – Thomas Bosilevac of Mashable Metrics

Free eMetrics Pass

Wrapping up the Contest

In response to my profile in Emer Kirrane’s silly series of web analysts, Jim Sterne generously donated a free pass to eMetrics San Francisco for a worthy web analyst. The catch was that I would need to run the contest, judge it and so forth.

eMetrics San Francisco is the singular event in the #measure community which we should all attend, at the very least to meet in person those people we are exchanging tweets with.

A selection of analysts turned out excellent efforts; their performance gives me great hope for the future of our industry.

Winning Entry

I graded the entries on a scale, and it came down to a final two analysts who have different experience levels.
Thomas Bosilevac, the founder of Mashable Metrics, turned out a particularly noteworthy effort and analysis. What I found really interesting is that he took the admittedly unexciting sample data and added a layer of complexity: sales revenue and acquisition costs.

Utilizing the freely available Tableau Public, he built an interactive chart which you can view here:

To everyone who entered: I wish I could have given free passes to all of you, as you all put forth excellent efforts.


There were a few people who expressed a great deal of interest in the contest but couldn’t make the deadline; if you are in that group, email me and I will see if I can get you a discount code.

Spot The Error

Solving the Spot The Error option: the description of the graph said that the size was the product of pageviews and time on site. More accurately, it is the scaled product. The raw product returns something like this:
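The difference between the raw and the scaled product can be sketched in Python as below. The scaling actually used for the contest graph is not described here, so dividing by the largest raw product is purely my assumption for illustration, as are the `scaled_sizes` name and the `max_size` parameter.

```python
def scaled_sizes(pageviews, time_on_site, max_size=100.0):
    """Scale pageviews * time-on-site so the largest point gets max_size.

    Rescaling against the maximum is an assumed choice; any monotonic
    scaling would keep the ordering of the points intact.
    """
    raw = [views * seconds for views, seconds in zip(pageviews, time_on_site)]
    largest = max(raw, default=0)
    if largest == 0:
        return [0.0 for _ in raw]
    return [max_size * value / largest for value in raw]
```

With pageviews of 10 and 20 and times on site of 30 and 60 seconds, the raw products are 300 and 1200, while the scaled sizes come out as 25.0 and 100.0, which is far easier to plot.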

A further improvement could be made on the x axis with regard to the dates, as there weren’t 1400 dates in the range. This is a step in the right direction, although the actual dates would be even better:


This entire endeavor would not have been possible without Emer posting that profile, so I am pretty sure I owe her a drink at eMetrics.

I would like to thank Jim Sterne again for sponsoring the contest; in fact, we have already discussed the idea of another contest.

If you have feedback on how to make a free eMetrics pass contest even more exciting, please email me.

Technorati Tags: Emer Kirrane, eMetrics, Free Ticket, Jim Sterne, Thomas Bosilevac

Posted in Conferences, eMetrics