Explorations in Unix
haskell unix statistics

Introduction

Few tools are more indispensable to my work than Unix. Manipulating data into different formats, performing transformations, and conducting exploratory data analysis (EDA) make up the lingua franca of data science.1 The coffers of Unix hold many simple tools, which by themselves are powerful, but when chained together facilitate complex data manipulations. Unix’s use of functional composition eliminates much of the tedious boilerplate of I/O and text parsing found in scripting languages. This design creates a simple and succinct interface for manipulating data and a foundation upon which custom tools can be built. Although languages like R and Python are invaluable for data analysis, I find Unix superior in many scenarios for quick and simple data cleaning, idea prototyping, and understanding data. This post is about how I use Unix for EDA.

  • Inspect
  • Reshape
  • Enumerate
  • Describe
  • Visualize
    Inspect

    Inspecting data is one of the first steps of exploratory data analysis. Quickly examining the structure of data provides inroads to uncovering patterns and understanding latent meaning. Data often arrives with rows in some meaningful order, so it can be instructive to inspect the header and the first few lines, as well as the last few lines, of a data file. This example displays the first and last 5 lines of a data file:

    $ (head -5; tail -5) < data
    

    I use this so frequently that I’ve created a shell function for the command; I tweeted about it a few months ago.
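    A rough sketch of what such a function might look like (the name i is an assumption, chosen to match the (i)nspect reference later in this post; the exact definition of my function is not reproduced here):

    # print the first and last five lines of a file
    i () { (head -n 5; tail -n 5) < "$1"; }

    $ i data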

    Sometimes it is beneficial to inspect the first few lines of many files. The above command can be used on multiple files by globbing; however, it can be difficult to tell which lines came from which file when examining the output in the console. A convenient way to generate more readable output is to pass the glob to head and pipe the result through cat:

    $ head -3 data* | cat
    

    The output displays each file name above its data.
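    For example, with two hypothetical files named data1 and data2, the output takes this general shape (head prints the ==> name <== separators whenever it is given more than one file; the contents shown are placeholders):

    ==> data1 <==
    (first three lines of data1)

    ==> data2 <==
    (first three lines of data2)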

    Reshape

    After inspecting many data files, it is sometimes necessary to aggregate data into one file. reshape (rs), laminate (lam), and paste are useful commands for this purpose. paste combines files column-wise: the contents of each file become an individual column, producing a data table in which the first row is the header and each column holds the data from one file. To combine three files, data1, data2, and data3, in this manner:

    $ paste data* > agg-data
    
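    As a small, hypothetical illustration, suppose data1 and data2 each contain a header line followed by values; paste joins them with tab delimiters:

    $ cat data1
    height
    1.7
    1.8
    $ cat data2
    weight
    65
    72
    $ paste data1 data2
    height	weight
    1.7	65
    1.8	72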

    Another use case for paste is with grouped data. Consider the following data file called datafile:

    # group1
    1.123
    2.123
    1.239
    # group2
    1.2e-10
    2.4e-08
    # group3
    3.8
    4.2
    

    paste can be used with awk or csplit to split the data into groups by pattern matching and then combine the groups column-wise. Using the above example, the data can be grouped at each line that starts with an octothorpe:

    $ awk '/#/{x="group"++i;}{print > x;}' datafile && paste group* > agg-data
    
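    Given the datafile above, awk writes three files (group1, group2, and group3, each beginning with its # line), and the resulting agg-data should look roughly like this, with paste padding the shorter columns with empty fields:

    # group1	# group2	# group3
    1.123	1.2e-10	3.8
    2.123	2.4e-08	4.2
    1.239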

    Enumerate

    Removing blank and duplicate lines from a data file is a common data cleaning operation. It’s also instructive to determine exactly how many lines of data are present initially and how many remain after the data has been cleaned. In Unix, data cleaning, enumeration, and cardinality checks can be performed in one command:

    $ wc -l < dat && awk '!x[$0]++' dat | sed '/^\s*$/d' | tee >(wc -l >&2) | sponge dat
    

    The first part of this command prints the number of lines in the file. In the second part, awk prints only the first occurrence of each line, removing duplicates from the dat file without needing to sort. The sed call removes the single blank line left behind after the duplicates have been removed. The second half of this command may be less familiar. The >(wc -l >&2) construct is a process substitution and works in both bash and zsh. tee reads from standard input and writes to standard output as is Unix convention, but it also forks the command pipeline, effectively allowing additional operations to be performed on the stream. The final tool in the chain is sponge from moreutils, which soaks up stdin and is used to write the cleaned data back to the same file. The total line count as well as the cardinality of the data set are printed to the screen.
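    For readability, here is the same pipeline broken into its stages (a commentary sketch, not a new command):

    wc -l < dat          # print the original line count
    awk '!x[$0]++' dat   # keep only the first occurrence of each line
    sed '/^\s*$/d'       # delete any blank lines that survive deduplication
    tee >(wc -l >&2)     # pass the stream through while counting lines on stderr
    sponge dat           # soak up stdin, then overwrite dat in place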

    Describe

    Given the bevy of Unix tools, it’s surprising that there are no simple utilities for calculating common descriptive statistics such as the arithmetic mean, median, mode, etc. awk can do these calculations, but I compute these values frequently enough to warrant designing a tool that can calculate a number of descriptive statistics all at once. One of the great strengths of Unix for EDA is that it is endlessly extensible because custom functions can be written in any language and integrated into a Unix pipeline via stdin/stdout.
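    As a minimal illustration of that extensibility, a hypothetical column-mean tool can be nothing more than a shell function wrapping awk, and it drops into a pipeline like any other filter:

    # hypothetical helper: read a single column on stdin, print its arithmetic mean
    mean () { awk '{ s += $1; n++ } END { if (n) print s / n }'; }

    $ seq 1 100 | mean
    50.5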

    Most of my custom EDA tools operate on data columns. Since Unix is adept at text processing, it’s very easy to manipulate any data into a column for use with my custom tools. As Alan Perlis said in his Epigrams on Programming:

    It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures.

    I write most of my tools in Haskell because it is a high-level language that excels at fast numerical computation and it is amenable to parallelization since the language proper is entirely pure. One of my EDA tools is called describe and is modeled after the describe function in SciPy and the fivenum and summary functions of R. You can find information about the project on my Projects page.

    I’ll use the iris data set as a demonstration of how I use describe. To see the structure of the data set, the (i)nspect function can be used. The first 5 and last 5 lines of the data set look like this:
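    The listing below assumes a standard CSV copy of Fisher’s iris data with a header row; the exact column names in iris-data may differ:

    sepal_length,sepal_width,petal_length,petal_width,species
    5.1,3.5,1.4,0.2,setosa
    4.9,3.0,1.4,0.2,setosa
    4.7,3.2,1.3,0.2,setosa
    4.6,3.1,1.5,0.2,setosa
    6.7,3.0,5.2,2.3,virginica
    6.3,2.5,5.0,1.9,virginica
    6.5,3.0,5.2,2.0,virginica
    6.2,3.4,5.4,2.3,virginica
    5.9,3.0,5.1,1.8,virginica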

    Let’s analyze some descriptive statistics for sepal_length in column one. describe operates on columns of numerical data, so it’s necessary to extract the sepal_length column and remove the header row prior to feeding the data to describe. The following command extracts the first column and feeds every row except the first (the header row) to describe:

    $ cut -f1 -d"," iris-data | tail -n +2 | describe
    

    describe then prints a set of summary statistics for the sepal_length column.2
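    One possible awk equivalent of the same extraction (though, as footnote 2 notes, cut is generally faster for this):

    $ awk -F, 'NR > 1 { print $1 }' iris-data | describe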

    To use describe on a single row of data rather than a column, the reshape utility (rs) can be used to transpose the row into a column. The second row of numerical data can be analyzed using:

    $ cut -f-4 -d"," iris-data | awk 'NR == 2' | tr "," " " | rs -T | describe
    
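    rs -T transposes its input, so a single row becomes a single column; for instance:

    $ printf "1 2 3\n" | rs -T
    1
    2
    3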

    Sometimes it’s also useful to examine multiple columns of data. To calculate descriptive statistics across all of the numerical data in the iris dataset:

    $ awk 'NR >= 2' iris-data | cut -f-4 -d"," | tr ',' '\n' | describe > iris-stats
    

    awk is first used to select every row except the header. The last column of the iris data contains plant species names, not numerical data, so cut is used to omit it. tr then translates the comma-delimited data to newlines, creating one giant column prior to calling describe.

    Visualize

    The final component of EDA in Unix is visualization. Basic visualization is a critical component of EDA and helps guide further, more sophisticated, downstream data analysis. I like to use ggplot and matplotlib for most of my static visualizations, but for exploratory analysis I typically just want to construct a simple scatter plot or histogram. For simple static plotting in Unix, the de facto tool is gnuplot. To examine sepal length as a function of petal length, gnuplot can be used at the end of a pipeline:

    $ cut -f1,3 -d"," iris-data | tail -n +2 | tr "," " " | gnuplot -p -e \
        'plot [0:10][0:14] "-" using 1:2 with points pointtype 5 pointsize 1.5'
    

    The plot isn’t aesthetically pleasing, but for making numerous quick plots, gnuplot gets the job done. In one line it’s possible to create a quick visualization to guide further data analysis steps.
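    For histograms, the same pipeline style works with gnuplot’s usual binning idiom; a sketch for sepal_length, with an arbitrarily chosen bin width:

    $ cut -f1 -d"," iris-data | tail -n +2 | gnuplot -p -e \
        'binwidth = 0.2; bin(x) = binwidth * floor(x / binwidth); plot "-" using (bin($1)):(1.0) smooth freq with boxes'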


    1. It would be remiss of me to write this post without mentioning the father of EDA, John Tukey. Tukey pioneered many of the ideas of EDA and was a strong proponent of calculating summary statistics and generating visualizations to guide downstream data analysis steps. Many of his ideas are covered in his excellent EDA book. As if this wasn’t enough, he also invented what is probably the most influential algorithm of all time. ↩

    2. cut is generally faster than awk in my experience. ↩