This article explains how to use Logstash to import CSV data into Elasticsearch, using the file input, CSV filter, and Elasticsearch output plugins. Importing a CSV with Logstash looks like a simple, straightforward task, but a few aspects of the process can quickly make it complicated. Along the way I'm going to cover some concepts that matter in this context, many of which will be useful for working with Logstash and Elasticsearch in general.

Example Docs

Before we go more in-depth on the topic (including Logstash's sincedb, which we'll come back to later), let's look at our example data. We are going to import CSV files containing information related to Bitcoin into Elasticsearch. We start by getting hold of our test data:

$ wget http://www.quandl.com/api/v1/datasets/BCHARTS/MTGOXUSD.csv

The data has the following structure: the first line contains the column headings, and the columns always appear in this order:

Date,Open,High,Low,Close,Volume (BTC),Volume (Currency),Weighted Price
2014-02-25,173.2,173.84343,101.62872,135.0,29886.7532397,3667985.39624,122.729470372
2014-02-24,314.99996,316.78999,131.72093,173.871,94594.0225893,17590531.0124,185.958166604
2014-02-23,260.70495,348.98,220.1,309.99971,38395.103758,11051773.3535,287.843299581
2014-02-22,111.0,290.52557,96.6345,255.53,71861.2880229,11632970.1851,161.88090285
2014-02-21,111.61995,160.0,91.5,111.4,82102.9295521,9798282.70207,119.341450488
...

Elasticsearch is very commonly used to store time-based data, so it helps if there is a timestamp or date in your CSV. We often say that Elasticsearch was made to store and search log files, and you can think of log files as simply data with a timestamp attached. Having a date column lets us visualize different aspects of the data over time. This is what our config file looks like:

input {
  file {
    path => "/home/timo/bitcoin-data/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
    # Date,Open,High,Low,Close,Volume (BTC),Volume (Currency),Weighted Price
    columns => ["Date","Open","High","Low","Close","Volume (BTC)","Volume (Currency)","Weighted Price"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "bitcoin-prices"
  }
  stdout {}
}
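The config above is all you strictly need, but note that the CSV filter leaves every column as a string and does nothing with the Date column. As an optional refinement, not part of the config above and only a sketch to adapt, you could extend the filter block with Logstash's standard date and mutate filters so that @timestamp comes from the Date column and the numeric columns are indexed as numbers:

filter {
  csv {
    separator => ","
    columns => ["Date","Open","High","Low","Close","Volume (BTC)","Volume (Currency)","Weighted Price"]
  }
  # Parse the CSV's Date column into @timestamp so Kibana's time filter uses it.
  date {
    match => ["Date", "yyyy-MM-dd"]
  }
  # Convert price/volume columns from strings to numbers so they can be aggregated.
  mutate {
    convert => {
      "Open"           => "float"
      "High"           => "float"
      "Low"            => "float"
      "Close"          => "float"
      "Weighted Price" => "float"
    }
  }
}

With something like that in place, Kibana can use @timestamp directly and aggregations such as averages on Close or Weighted Price become possible.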

Sincedb

You should now have a good idea of the structure of a Logstash configuration file, so let's focus on a couple of parts that are not obvious. The first is the line in the file input section that refers to "sincedb". Sincedb is the mechanism Logstash uses to keep track of where it last stopped reading a file before it crashed or was shut down. That is normally exactly what you want, but it gets in the way when you realize you need to change something in your configuration file and re-read files Logstash has already processed.

By pointing sincedb at /dev/null, you can read data into Logstash with the file input plugin, change your configuration file, and then read the same files again as if they had never been read before. Otherwise, Logstash would skip everything it had already seen, and your updated configuration would never run on those files. So how does sincedb work?

Sincedb is a file that Logstash creates by default, although you can configure Logstash to create it at a location of your choice. It follows a specific format. You tell the file input plugin where to write it by setting a path here:

input {
  file {
   path => "/somedirectory/file_to_read_from.txt"
...
    sincedb_path => "/place_to_write_to"
  }
}
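For reference, the sincedb file itself is plain text with one line per tracked file. The exact columns vary between Logstash versions, but in older versions each line is roughly the file's inode, the device's major and minor numbers, and the byte offset read so far (the values below are made up for illustration):

$ cat /place_to_write_to
262826 0 64768 143210
262827 0 64768 98700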

If we tell the Logstash file input plugin to write the sincedb file to /dev/null, everything written there is silently discarded, so no read positions are ever persisted and Logstash treats our files as unread on every run. Here is the relevant part of our configuration file for this example.

input {
  file {
    path => "/home/timo/bitcoin-data/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

We need to be running Elasticsearch as well as Kibana.

$ sudo service elasticsearch start
$ cd kibana-*
$ cd bin
$ ./kibana &

Let's run Logstash on our config. Contrary to popular belief, you don't need to run Logstash as root. Make sure you are in the right directory first: running cd with no arguments takes us home, and from there we cd into the directory we need.

$ cd
$ cd bitcoin-data/
$ /opt/logstash/bin/logstash -f ~/bitcoin-data/btc.conf

Now open up Kibana. I tend to run Kibana bound to localhost only. As mentioned in a previous article, you can use SSH local forwarding to open, in your desktop browser, something that is in fact only listening on localhost on a remote server:

$ ssh -f -N -L 5601:127.0.0.1:5601 timo@172.20.0.152

Now open http://localhost:5601/app/kibana in your browser. Go to Settings -> Indices -> Add New, enter the index name ("bitcoin-prices"), and select "Date", one of the fields from the CSV, as the time field. This lets Kibana place the data on a timeline based on the dates provided in the CSV file.

My workflow consists of running something in the terminal every now and then and switching between Kibana and Sense. Sense is useful for basic tasks such as listing indexes and creating new ones. You will now be able to make graphs from this data.
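For example, these standard Elasticsearch requests, run in Sense, list the indexes on the cluster and confirm that our documents actually arrived (the index name comes from our config):

GET _cat/indices?v

GET bitcoin-prices/_count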

New Data Set

Let's find a dataset with more interesting information in its fields and plot that data in Kibana. We'll make use of the export-history-per-address function on the website blockchain.info. Navigate to this URL:

logstash1.png#asset:1054

and click on Filter -> Export History:

logstash2.png#asset:1055

Look at the structure of the CSV file and the types of data that it contains. This will help you get an idea of what we can visualize in Kibana and also how to do it. I couldn’t figure out how to download this file without a browser on a headless server. Instead, I downloaded the file to my computer and used scp to copy the CSV file to my remote server, like so:

$ scp "Downloads/history-13-05-2016-13-06-2016 (2).csv" timo@172.20.0.203:/home/timo/bitcoin-data/

This copies the CSV file from my desktop computer's Downloads directory into ~/bitcoin-data on my remote server. I'm going to move this file to its own directory so we don't confuse it with the other files parsed earlier.

$ mkdir wallet-data
$ mv history-13-05-2016-13-06-2016\ \(2\).csv wallet-data/wallet.csv

Now we need to make a config to ingest the data via Logstash:

input {
  file {
    path => "/home/timo/bitcoin-data/wallet-data/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
    # date,description,moneyIn,moneyOut,tx
    # 2016-06-13 21:04:19,PAYMENT SENT,0,0.0045,9919eb32a8f4d554a6cac25029883d370a7e0c536fcf50d47b5d938c9d16862a
    columns => ["date","description","moneyIn","moneyOut","tx"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "wallet-address-index"
  }
  stdout {}
}

The config reads from the new directory we just created and writes the data to a new index named "wallet-address-index". Note how we list the CSV column names so that the data ends up in the proper fields. Let's set a mapping for the index before we run Logstash on the config. Run the following in Sense:

PUT wallet-address-index
{
    "mappings": {
      "logs": {
        "properties": {
          "@timestamp": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "@version": {
            "type": "string"
          },
          "date": {
            "type": "string"
          },
          "description": {
            "type": "string"
          },
          "host": {
            "type": "string",
            "index" : "not_analyzed"
          },
          "message": {
           "type": "string",
            "index" : "not_analyzed"
          },
          "moneyIn": {
            "type": "string",
            "index" : "not_analyzed"
            
          },
          "moneyOut": {
            "type": "string",
            "index" : "not_analyzed"
          },
          "path": {
            "type": "string",
            "index" : "not_analyzed"
          },
          "tx": {
            "type": "string",
            "index" : "not_analyzed"
          }
        }
      }
    }
}

The point of this mapping is to make certain fields not_analyzed, i.e. stored "raw" rather than broken into tokens. This allows us to aggregate on them and draw graphs from the data in Kibana.

logstash3.png#asset:1056
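As a quick illustration of what not_analyzed buys us, here is a standard terms aggregation (run in Sense) on the not_analyzed moneyOut field. Because the field is not analyzed, each bucket is the exact value from the CSV rather than tokenized fragments, which is the same kind of bucketing Kibana performs when it builds charts:

GET wallet-address-index/_search
{
  "size": 0,
  "aggs": {
    "money_out_values": {
      "terms": { "field": "moneyOut" }
    }
  }
}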

Now we can import the data using the Logstash config we just created. I put my config in a file named btc2.conf.

$ /opt/logstash/bin/logstash -f btc2.conf

logstash4-copy.png#asset:1057

Open up Kibana and configure the index we just wrote data to. Go to Settings -> Indices -> Add New and use the index name "wallet-address-index". Use "@timestamp" as the time field name, then click "Create". Remember, we aren't actually creating an index here; we created the index when we set our mapping with PUT in Sense.
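A side note on "@timestamp": with the Logstash config above, @timestamp records when each line was ingested, not the transaction time held in the CSV's "date" column. If you would rather chart by transaction time, a date filter along these lines could be added to that config's filter block before indexing (a sketch; the pattern matches the sample row in the config comment):

date {
  match => ["date", "yyyy-MM-dd HH:mm:ss"]
}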

logstash5-copy.png#asset:1058

Now make sure your mapping survived indexing the data. Once you've configured the index pattern, you should see the fields for the index. The fields without a tick in the "analyzed" column are the ones you set as not_analyzed in your mapping.

logstash6.png#asset:1059
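You can also double-check outside of Kibana with a standard mapping request in Sense; the response should show the not_analyzed settings exactly as we PUT them:

GET wallet-address-index/_mapping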

Now that we are satisfied with the mapping of the index we just loaded data into, we can create a graph from the data.

logstash7.png#asset:1060

Conclusion

If you are a Bitcoin, Litecoin, or Ethereum fan, then you should seriously consider playing around with Elasticsearch, Logstash, and Kibana. There is a vast amount of data on each of these cryptocurrencies freely available online, and you can use Elasticsearch to index and search it. Representing the data in a sensible way can also help you make better decisions based on it. We hope this tutorial was helpful. Questions/Comments? Drop us a line below.