TwitterLinguistics : An Open Research Project

Data Scientists are Will-Doers

Not Can-Doers

My presentation at PyCon on “The Linguistics of Twitter” produced exactly the response I was looking for: smart, motivated people contacted me and asked how they could help.

Quick Recap

The slides from the presentation are available on Slideshare and the video will be available shortly.

In a nutshell, processing data from Twitter presents unique challenges due to:

  • 140 Character limit
  • API lookup range limiting
  • Dialectal English usage

Linguistic Challenges

Twitter users shorten words to fit within the character limit, and word selection has been shown to be regionally influenced.

This presents an opportunity to explore the use of regional American English dialects on Twitter in an effort to build out systems which could facilitate communication.

TwitterLinguistics Project

An oversimplified outline of the project is:

  • Collect geo-tagged tweets from Twitter
    • For the different dialect regions in America
    • Lots of tweets, think in the millions at least for each region
    • Post the data in a publicly accessible location
  • Process the regions with the Natural Language Toolkit
    • While this is starting out as a Python project, contributions in other programming languages are welcome
    • Build out regional corpora
  • English only for now
    • Including other languages would complicate an already very complicated challenge
  • Sit back and admire all the hard work we did ;)
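As a rough illustration of the “build out regional corpora” step, here is a minimal Python sketch. It sticks to the standard library instead of NLTK so it runs anywhere. The tokenizer, the region labels and the sample tweets are all made up for illustration; a real pipeline would presumably swap in NLTK's tokenizers and the collected geo-tagged tweets.

```python
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: lowercase word tokens, keeping apostrophes (e.g. "y'all").
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Split a tweet into lowercase tokens, dropping URLs, @mentions and #hashtags."""
    cleaned = re.sub(r"(https?://\S+|[@#]\w+)", " ", text.lower())
    return TOKEN_RE.findall(cleaned)


def build_regional_counts(tweets):
    """Build one word-frequency Counter per region from (region, text) pairs."""
    counts = defaultdict(Counter)
    for region, text in tweets:
        counts[region].update(tokenize(text))
    return counts


# Made-up sample data; the region labels are placeholders, not the project's regions.
sample = [
    ("south", "Y'all want a coke? http://example.com"),
    ("midwest", "Grabbing a pop with @friend"),
    ("south", "y'all come back now"),
]
counts = build_regional_counts(sample)
print(counts["south"].most_common(1))
```

Comparing the per-region counters is one simple way to start looking for words that are over-represented in a region.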

More Information


We definitely need help, and lots of it.

Whether you are a casual programmer, someone interested in exploring Natural Language Processing, or just an awesome individual, there is probably a way for you to contribute.

If you are just learning NLP in Python, I strongly recommend both the NLTK book from O’Reilly and ‘Python Text Processing with NLTK 2.0 Cookbook.’

If you have no programming knowledge but want to be actively involved in the project, those two books are still a good place to start.

There are also opportunities for publicity, data sharing and data validation which require less time on your end.

Email me to get on the mailing list while it's hot.


Web Analytics Association Spring Awards Gala Dress-Code Decoded Infographic #Measure

Not Sure What to Wear?

Dress-Code Decoder

Get ready for the inaugural Web Analytics Association Spring Awards Gala with this handy infographic breaking down who will wear what:

Some Tickets Still Available

Make sure to get your tickets at http://bit.lw/waa-gala because it looks like the event is going to sell out before eMetrics!


RStudio: Easy to Use Interface For R and Google Analytics Data #measure Part One: Setup

RStudio: FOSS Interface to R

The limitations of widely used business solutions for statistical programming have led many organizations previously opposed to FOSS solutions to deploy R.

With the release of RStudio, the bar to entry into the world of R has been lowered so dramatically that there is functionally no excuse, aside from laziness, to avoid R.

In Part One of this series we will set up RStudio. In Part Two we will pull data from the Google Analytics API and build some cool charts. All that in less than 30 minutes.

Your choice. Red pill or blue.

Install R

There are numerous exemplary instructions on how to install R on your computer; the highlights are:


On Linux, I would start with the GUI for your software manager, such as Synaptic on Debian/Ubuntu. Look under the “GNU R statistical programming” section.

Select, download, install, get going with your life.


Windows installers exist; a detailed FAQ for R on Windows is available on the CRAN website.

There is even a slick plug-in for Excel called RExcel, freely available on the RExcel website.


The FAQ for R on Mac is ‘rather incomplete’ (their words), but according to the documentation it can be installed on OS X 10.2+.


There is a company with the same name, so make sure you navigate to the correct RStudio site:

RStudio: Server or Desktop?

After choosing to download RStudio, you are presented with the option to download either the Desktop or the Server version. Select the appropriate version for your use; I’m guessing Desktop.

R on a server? What, you never heard of rApache?

Select RStudio Version

RStudio politely suggests the appropriate version of the software for you. Since I use Linux, I get to see Tux next to my version.

Install RStudio

After downloading RStudio, install it as appropriate for your OS. I used GDebi and it went so smoothly I forgot to even take a screen capture!

Starting RStudio

When you start RStudio, the utility of the program is pretty apparent. I love, no, I LOVE the integrated packages panel and documentation in the lower right.

R is fantastic. The one pain is that documentation can be a bit . . . scattered, depending on who maintains the package.

I know, it can all be found on Crantastic. RStudio saves me the time I used to spend clicking over to my browser bookmarks, finding the package on Crantastic and then finding the documentation.

Serenity now.

Load The Iris Data Set Via Web URL

Another slight time saver is the “Import Dataset” option in the panel above the documentation panel.

Select the option, and then “From Web URL…” and enter the URL of your data. In our test case we are using the Iris data set at the UCI Machine Learning Repository:

After you paste the URL into the dialog box, RStudio returns something like this:

I’m not sure that could get any easier.

Working in R

Once you approve the data by selecting “Import” you are returned to RStudio with the left hand side showing the data set above the interactive terminal.

There are numerous tutorials for R. Having read too many of them, I recommend “Using R for Data Analysis and Graphics: Introduction, Code and Commentary” by Dr. Maindonald of the Centre for Mathematics and Its Applications, Australian National University.

It is well written and very approachable. Start out with the Iris data set and move on to the more complex data sets available at UCI.


In the next post, in less time than you spent setting up RStudio, we will pull data from the Google Analytics API and generate some really slick charts.

If you have any questions, need help or are just curious about deploying R in your environment, feel free to send me an email.


Upcoming SF Bay Area #Measure Networking Events

March #Measure Networking Events

March is shaping up to be an extremely busy month for #measure networkers in the San Francisco Bay Area. Make sure to come to one, or all, of these events:

Web Analytics Wednesday

March 2, 2011


Webtrends and ObservePoint


6:00 pm – 8:00 pm


CityScape at the Hilton San Francisco Union Square

333 O’Farrell Street, San Francisco CA


Web Analytics Wednesday on a Tuesday

March 15, 2011


SAS Institute Inc.


6:00 pm – 7:30 pm


Marriott Marquis

55 Fourth Street, San Francisco CA


Web Analytics Association Spring Gala Awards Dinner

March 15, 2011

Web Analytics Association


Bing, ForeSee Results, UBC UCI, ObservePoint, Tealeaf


7:30 pm – ??:?? pm


Marriott Marquis

55 Fourth Street, San Francisco CA

Tickets Available:

All three are worthy events and you should make it to every one. If you aren’t sure, think about all the people who are going to the Omniture Summit as well and still making it to all of these events.

Me? I’m working in a trip to PyCon in between the events!


#Measure in 2010: Tracking Links Shared With Bitly and Others

API Enabling Open Data Discovery

The Application Programming Interface (API) provided by many URL shortening services could provide insights into what the #measure community is passing around and, more importantly, what they find interesting.

To prepare for this, I had to see which API keys I might need. Being very familiar with the Bitly API, I was pleasantly surprised to learn that Bitly still dominates the #measure shortlink market.


  • Pulled the #measure archive from Twapper Keeper
  • Searched for tweets with links
  • Tested the links against the Bitly API to see if they were a Bitly Pro link
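The first two steps can be sketched in a few lines of Python. This is only an illustration, not the script I ran: the sample tweets and the `brand.example` host are invented, and the last step (asking the Bitly API whether a host is a Bitly Pro domain) is left out because it needs an API key and a network call.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simple pattern for http/https links; good enough for a sketch, not a full URL parser.
URL_RE = re.compile(r"https?://[^\s]+")


def link_hosts(tweets):
    """Count how often each link host appears across an iterable of tweet texts."""
    hosts = Counter()
    for text in tweets:
        for url in URL_RE.findall(text):
            # Trim trailing punctuation that often sticks to links in tweets.
            host = urlparse(url.rstrip(".,)")).netloc.lower()
            if host:
                hosts[host] += 1
    return hosts


# Invented sample data standing in for the #measure archive.
sample_tweets = [
    "New post on funnels http://bit.ly/abc123 #measure",
    "Great read: http://brand.example/xyz789, via @someone #measure",
    "No link here #measure",
    "Slides http://bit.ly/def456 and video http://example.com/talk #measure",
]
hosts = link_hosts(sample_tweets)
print(hosts.most_common())
```

The resulting host counts are what you would then check, one host at a time, against the shortener's API.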

Bitly API Domination

Seemingly everyone is still using Bitly, and why not? It's a great service, and they offer the opportunity to brand your domain through Bitly Pro.

Yes, the Bitly links above include Bitly Pro links and the Bitly service.

Bitly Pro

If you acquire a short domain, you can use the Bitly backend to shorten links under that domain, which customers may come to identify with you.

The New York Times uses the host:

to brand their links. As an added bonus, if someone shortens a link for a Pro domain in Bitly, Bitly returns that shortened Pro domain.

Bonus branding!

Using Bitly Pro?

With the Bitly backend so easy to use, it should not be a surprise that numerous companies are doing this.

In 2010, links shared via Twitter and tagged with the #measure hashtag included these shortening services, which use Bitly Pro:


The API interface provides a better history of metrics than Bitly, so I was a little surprised that people had not switched over to use it more frequently.

Twitter is pushing its own shortener, which should give it some market position. However, given its minuscule coverage over 2010, I am curious when that might happen.

Until those shifts happen, Bitly still has a very strong presence.

Link Content

  • What are the links about?
  • How many times did people click on them?

Tomorrow I will post some results of the links shared and metrics available.

Leveraging APIs

Need help with, or just curious about, the wide range of possibilities APIs offer? Send me an email; chances are I already have an API key for the service in question.

There are so many APIs it is hard to pick favorites, but I do particularly recommend the SimpleGeo API, Rapleaf API, and, of course, Google Analytics and Omniture APIs.
