statistics data python


A few months ago I discussed my approach towards effective information processing. Filtering useful content from large amounts of data is perhaps the most difficult step in this process. Nowhere is the problem more pronounced than on the internet where there are 10 billion indexable pages. How can useful information be found without investing the time to manually examine thousands of web pages? To remedy this problem, I use a machine learning tool to find useful information for me. This post explains some of the mechanics behind my personal recommendation engine called Pinborg.

Pinborg is a mashup—part Pinboard URL aggregator and part machine learning cyborg. Pinborg takes a set of user provided URLs and recommends other URLs that will likely be useful to the user. I use Pinborg for two different applications—for general interests and for topic-specific research. If Pinborg is fed my entire bookmark collection it can find other URLs covering a broad range of topics that I will find interesting. I also use Pinborg for research by feeding it a given set of URLs focused on a narrow topic to find additional information about the same topic.

The output of Pinborg is an RSS feed of my top scoring recommendations:

I created Pinborg primarily as a playground to try out new machine learning, statistic, and data mining tools. A nice side-effect of this project is that I also get a tool that has a useful purpose. Pinborg is capable of finding things that search engines and commercial content recommender systems like Zite and Prismatic cannot. Part of Pinborg’s performance is attributed to a fantastic starting data set. In my previous post I highlighted why I like Pinboard for information mining:

Pinboard represents a human curated pre-filtered information data set containing the best Internet content updated in nearly real time. Without quality data even the most sophisticated recommendation techniques are ineffective. I’ve found Pinboard superior to social networks and news aggregation sites for finding information because Pinboard costs money and people only save things they themselves find valuable.

Another contributing factor to Pinborg’s performance is the use of matrix factorization (MF) techniques for generating URL recommendations. Many of the top algorithms from the Netflix competition also used latent factor models based on matrix factorization to recommend movies to Netflix customers. Matrix factorization is conceptually intuitive, displays high performance, and is useful in a large number of applications like image analysis, data compression, and gene expression analysis. In Pinborg, matrix factorization is used because of its high performance on sparse data with parts-based representations.

The Netflix competition also demonstrated that recommendation systems frequently perform best by employing ensemble methodologies, which combine numerous techniques to make recommendations to users. The winning Netflix team used over 100 predictors to recommend new movies to Netflix customers. The Pinborg engine also uses several methods to recommend URLs, but in this post I will focus on the application of non-negative matrix factorization (NMF).1 Below I provide an overview of NMF and some of the implementation I use in Pinborg.


In Pinboard there are users and bookmarks. Generating the top URL recommendations for a user can be formalized as a discrete matrix completion problem. Let users be denoted as $ u_{1},u_{2},…,u_{m} $ and bookmarks be denoted as $ b_{1},b_{2},…,b_{n} $. In Pinborg, this data is expressed as a matrix, $ V $ composed of users, $ U $ and bookmarks, $ B $.

$$ V \in {0,1}^{U \times B} $$

The matrix is constructed as a proxy for bookmark preference. Let a given user’s rating for a specific bookmark be denoted as $ v_{ub} $. If $ v_{ub} = 0 $, this represents an unknown—either a given user has seen a given bookmark and chosen not to save it or the user has simply never seen the bookmark. If a user has saved a given bookmark to Pinboard, let the initial value of $ v_{ub} = 1 $. Matrix factorization with discrete binary or boolean data is difficult, mathematically complex, and frequently shows poor predictive ability for unknown ratings in my experience. Therefore, a weight is applied to $ v_{ub} $ based on the popularity of the bookmark. The goal of NMF is to then use known $ v_{ub} $ values to learn user ratings where $ r_{ub} = 0 $.

To make recommendations, NMF is used to decompose $ V $ into two submatrices, $ M $ and $ N $, such that the product of the submatrices approximate the original data matrix $ V $. The factorization yields submatrices containing latent features, $ k $, which describe underlying attributes within the data. This decomposition allows $ v_{ub} $ to be approximated by calculating the inner product of the corresponding feature vectors:

$$ \hat{v_{ub}} = \sum_{k=1}^{k} m_{uk}n_{kb} $$

In Pinborg, latent features could describe attributes for a given URL, but the actual significance of the attributes are unknown. For example, users who like Apple products may share a bookmark about a git logging script from Brett Terpstra. The URL has nothing to do with Apple, however many Mac users follow Brett Terpstra’s blog because he’s a Mac developer. Users who have this bookmark may like Apple products for example.

To approximate $ V $ and minimize the error, $ \epsilon $ the system learns the model by fitting previously observed bookmarks using stochastic gradient descent, where $ \epsilon_{ub} $ is the associated prediction error of a given rating:

$$ m_{uk} := \alpha(\epsilon_{ub}n_{bk} - \beta m_{uk}) \ $$ $$ \;\;\;\;\;\;\;\;\;\;\;n_{bk} := \alpha(\epsilon_{ub}m_{uk} - \beta n_{bk}) \textrm{, where } \ $$ $$ \epsilon_{ub} = v_{ub} - \hat{v}_{ub} $$ Pinborg initializes $ M $ and $ N $ from a random distribution and then modifies the values within the submatrices accordingly to converge on a minimized error. The learning rate $ \alpha $ and regularization constant $ \beta $ are introduced to control overfitting and rate of convergence, respectively.


Pinborg uses Python and runs on an Amazon EC2 instance. Public user bookmarks are collected from Pinboard every few minutes and inserted into a MongoDB database on an Elastic Block Store. Using the PyMongo API, Pinborg captures JSON data from Pinboard and processes the URL, user name, and user tags for each bookmark. Each document in the database contains a unique URL. An example document from the database looks like this:

    "url" : "", 
    "users" : [
    "tags" :  [

When the database reached 100,000 URLs, Pinborg carries out matrix factorization to make URL recommendations to the user. The matrix factorization component of Pinborg is heavily memory-bound as the matrix of users and bookmarks is held in memory while the computations are performed. The data is constructed as a sparse matrix with several billion unique values using SciPy. These computations would be virtually impossible without SciPy, which provides speed, storage efficiency, and other tools that facilitate linear algebra. The sparse matrix is then converted to a NumPy array and NMF is carried out using the following method:2

import scipy as sp

    def nmf(self, V, M, N, K, iters=3000, error=0.02, alpha=0.005, beta=0.02):
        N = N.T
        for t in xrange(iters):
            it = sp.nditer(V, flags=['multi_index'])
            while not it.finished:
                if it[0] > 0:
                    e = 0
                    value, x = (it[0], it.multi_index)
                    eub = value -[x[0],:], M[:,x[1]])
                    e = e + pow(eub, 2)
                    for k in xrange(K):
                        M[x[0],k] += alpha * (eub * N[k,x[1]] - beta * M[x[0],k])
                        N[k,x[1]] += alpha * (eub * M[x[0],k] - beta * N[k,x[1]])
                        e = e + (beta/2) * (pow(M[x[0],k], 2) + pow(N[k,x[1]], 2))
                    if e < error: break
        return M, N.T

What constitutes good or bad performance in Pinborg is subjective. To evalute performance, I measure the number of my previously saved bookmarks that score in the top fifty recommendations returned by Pinborg. Using this simple metric, I’ve optimized the error threshold, learning rate, and regularization constant to yield the best recommendation performance with respect to computation time.

The final step of Pinborg is to generate an RSS feed of the top scoring recommendations. I use the Pinboard API for this task using python-pinboard, which provides an easy way to upload bookmarks to Pinboard. The method below reads a text file containing the top 20 URL recommendations and uploads the bookmarks to my Pinboard account.

import pinboard

def post_url(self, bm, title, pin_tags):
    """ Send new bookmarks to Pinboard
    d = "{0}".format(self.cur_date).split("-")
    bm_date = (int(d[0]), int(d[1]), int(d[2]))
    self.p.add(url = bm, 
    description = title,
    extended = "", 
    tags = ['Pinborg'], 
    date = bm_date)

Each uploaded URL is tagged with the Pinborg tag. Pinboard provided RSS feeds for individual user tags, so I simply subscribe to the Pinborg tag feed and all new recommendations made by Pinborg are avilable through this RSS subscription.

  1. For additional resources on matrix factorization see Lee and Seung’s paper, the Netflix summary, Albert Yeung’s overview, and the Tong et. al review. ↩

  2. There are a number of excellent machine learning libraries for Python. These libraries would likely perform better than my implemenation. I’ve choosen to write my own implementations for learning and experimentation purposes. ↩