Almost 6,000 items are Tweeted per second, corresponding to about 500 million Tweets per day. Each Tweet contains indexable data. Aside from the obvious text content and the hashtag classification, you can get the Tweet creation timestamp, the location, the user profile information, and more.

In this tutorial, we will use Kibana's friendly UI to analyze this data using these fields (individually or collectively) and then to visualize the results.


For the sake of this exercise, we will index the Tweets for a particular hashtag, say #Marvel. For those who don't know, Marvel is an awesome Elasticsearch monitoring tool, but the vast majority of Tweets with this hashtag will be about the comic book heroes (ES isn't that popular.....yet).

We will filter Tweets on the basis of a mention of a certain Marvel superhero, say, Wolverine. And then, let's further break down the interest in Wolverine by country by adding a location filter and then viewing the geographical interest.

Before we move over to Kibana, there are some prerequisites that will make our experience with Kibana much more enriching, so let's get started.

Step 1: Cluster Provisioning

Of course, we are going to be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or check the bright blue "launch your cluster" button in the sidebar.

Once you have logged in using your Qbox account, you will be redirected to the dashboard section. From here, you can create the cluster. For the purposes of this exercise, the smallest single-node clusters will do. (If you need help with this, refer to Michael Lussier's post entitled Provisioning a Qbox Elasticsearch Cluster.)

You will want to refer to the screenshot below. Note that we will use Rackspace services for building our host machine.

(Screenshot: creating a cluster from the Qbox dashboard)

Press the Create button, and a page like the following will appear showing the details of the cluster.

(Screenshot: details of the newly created cluster)

In the screenshot above we can see a dropdown field called “Monitor plugin.” We select the option lmenezes/elasticsearch-kopf. (Kopf is a toolkit for easy interaction with Elasticsearch, and it is handy for visualizing our queries and responses.)

Step 2: Install and Enable Twitter River

We enable Twitter River from the plugins panel because it was designed exclusively for importing Twitter data to Elasticsearch instances.

  1. It should be noted that each Twitter River version targets a specific Elasticsearch version. Because we are using Elasticsearch 1.3.4, we select plugin version 2.3.0. (Check the above-linked Github repository for further information.)
  2. We then run the following command in the terminal: "bin/plugin -install elasticsearch/elasticsearch-river-twitter/2.3.0". Note that the last number is the plugin version that matches our Elasticsearch version. A quick way to confirm the installation is sketched just after this list.
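
If you want to confirm that the plugin actually installed, the plugin script can also list what is present on the node, and the node itself can report its loaded plugins. This is a minimal sketch; the exact flag spelling can vary slightly between 1.x releases, so treat it as a starting point:

# List plugins installed via the plugin script (Elasticsearch 1.x)
bin/plugin --list

# Or ask the running node which plugins it has loaded
curl -XGET 'http://9dae02428d83f160000.qbox.io:80/_nodes/plugins?pretty'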

You need to be authorized to take data from Twitter via its API. This part is easy:

  1. Log in to your Twitter account
  2. Go to https://dev.twitter.com/apps/
  3. Create a new Twitter application (here I give SearchTwitterTest as the name of the app).

After you successfully create the Twitter application, you get the following parameters in "Keys and Access Tokens":

  1. Consumer Key (API Key)
  2. Consumer Secret (API Secret)
  3. Access Token
  4. Access Token Secret

Now we are ready to create the Twitter data path (river) from the Twitter servers to your machine. Use the above four parameters (consumer key, consumer secret, access token, access token secret) in the following command and run it in the terminal.


curl -XPUT 'http://9dae02428d83f160000.qbox.io:80/_river/my_twitter_river/_meta' -d '{
  "type" : "twitter", 
    "twitter" : {
        "oauth" : {
            "consumer_key" : "*** YOUR Consumer key HERE ***",
            "consumer_secret" : "*** YOUR Consumer secret HERE ***",
            "access_token" : "*** YOUR Access Token HERE ***",
            "access_token_secret" : "*** YOUR Access Token Secret HERE ***"
        },
        "filter" : {
            "tracks" : "marvel,comics",
            "language" : "en"
        }
    },
    "index" : {
        "index" : "comics",
        "type" : "comics",
        "bulk_size" : 100,
        "flush_interval" : "5s"
    }
}'

A few notes of explanation:

  1. The URL in the curl command is the cluster endpoint you get when you have successfully created the cluster in step 1.
  2. "oauth" field -- This is the authentication part. Here we supply the keys and tokens listed above so the river can authenticate with Twitter.
  3. "filter" field -- In the curl command above you can see a filter with the sub-fields tracks and language. Tracks are the keywords used to select Tweets; here we have given two such tracks, "marvel" and "comics." Tweets that contain those keywords are directed toward our server, ready to be indexed. We also set a language filter of "en," which means only Tweets on the above topics written in English will be rivered to our index. There are many other supported filter fields, too, such as "user_lists," "follow," etc.
  4. "index" field -- Here we name our Elasticsearch index. I have named both the index and the type "comics." It is under this index that all the analytics and operations are performed by the Elasticsearch server. To verify that the river definition was stored, you can read it back as shown below.

At this point, if you need to check whether the plugin is working, you can tail the Elasticsearch log:

tail -f /var/log/elasticsearch/elasticsearch.log
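
Another quick sanity check is to ask the index how many Tweets it has accumulated; the count should climb as the river streams data in. A hedged sketch against the "comics" index created above:

curl -XGET 'http://9dae02428d83f160000.qbox.io:80/comics/_count?pretty'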

Step 3: Examine Tweets

After step 2, your Kopf is ready, and you can start receiving and analyzing the Tweets via your Qbox Elasticsearch node. You can access the Kopf UI at "http://localhost:9200/_plugin/kopf/#!/rest" (replace localhost:9200 with your cluster endpoint when using a hosted node).

Let's start filtering the Tweets. My query is designed to filter all the Tweets that contain location information. We paste the following query into the REQUEST box of Kopf.

{
    "query": {
        "query_string": {
            "query": "_exists_:location"
        }
    }
}
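
If you prefer the terminal to the Kopf UI, the same query can be sent straight to the search API. A minimal sketch, again assuming the "comics" index and the cluster endpoint used earlier:

curl -XPOST 'http://9dae02428d83f160000.qbox.io:80/comics/_search?pretty' -d '{
    "query": {
        "query_string": {
            "query": "_exists_:location"
        }
    }
}'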

Numerous responses are received. To examine the structure, let's take only one of them:

{
  "text": "Some weird kids want to do a graphic novel about my time with the company.\nI maced them in the face.\n#sdcc2015 #comics #AvengersAgeOfUltron",
  "created_at": "2015-01-11T07:54:47.000Z",
  "source": "<a href="%5C">Twitter for iPhone</a>",
  "truncated": false,
  "language": "en",
  "mention": [],
  "retweet_count": 0,
  "hashtag": [
    {
      "text": "sdcc2015",
      "start": 101,
      "end": 110
    },
    {
      "text": "comics",
      "start": 111,
      "end": 118
    },
    {
      "text": "AvengersAgeOfUltron",
      "start": 119,
      "end": 139
    }
  ],
  "location": {
    "lat": 33.686657,
    "lon": -117.674558
  },
  "place": {
    "id": "74a60733a8b5f7f9",
    "name": "Foothill Ranch",
    "type": "city",
    "full_name": "Foothill Ranch, CA",
    "street_address": null,
    "country": "United States",
    "country_code": "US",
    "url": "https://api.twitter.com/1.1/geo/id/74a60733a8b5f7f9.json"
  },
  "link": [],
  "user": {
    "id": 2873953509,
    "name": "Dr. Midnite",
    "screen_name": "darknetsurfer",
    "location": "Camp Freedom",
    "description": "Never, ever, worked at Reagan White House back in 86'.  Also, did not know LtCol North."
  }
}

In the following, we examine some of the key data fields:

  1. text -- The actual Tweet text written by the user.
  2. created_at -- The date and time at which the Tweet was posted.
  3. hashtag -- As you can see, there are three entries in the hashtag array. This means the user tagged this Tweet with three hashtags (#sdcc2015, #comics, and #AvengersAgeOfUltron).
  4. location -- This field gives the geographical origin of the Tweet as latitude and longitude.
  5. place -- While the location field gives the location as latitude and longitude, the "place" field gives the location in a form closer to a postal address (i.e., street name, city name, country name, etc.).
  6. user -- This field gives information about the user, including details such as the name and screen_name on Twitter, the user's stated location, and the profile description.
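
With these fields in hand, we can already sketch the kind of filtering described at the start: narrowing the stream to Tweets that mention a particular hero, such as Wolverine, and that carry location data. This is a minimal sketch against the "comics" index built above; the real slicing and dicing happens in Kibana in Part 2.

curl -XPOST 'http://9dae02428d83f160000.qbox.io:80/comics/_search?pretty' -d '{
    "query": {
        "query_string": {
            "query": "text:wolverine AND _exists_:location"
        }
    }
}'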


Conclusion

Now that we have learned how to stream the data of our interest from Twitter to Elasticsearch, we can learn how to visualize this aggregated data using Kibana. That will be the subject of Part 2 of this series.
