Time Warp
statistics osx algorithm haskell python

Introduction

Time Machine has been an integral component of my backup system since its inception in OS X 10.5. Despite its utility, I dislike how backup snapshots are handled when Time Machine’s hard drive is full. To make efficient use of hard drive space, Time Machine keeps hourly backups for a single day, daily backups for a month, and finally weekly backups until the backup drive is full. Once full, Time Machine operates as a first-in, first-out (FIFO) queue—the oldest snapshot is dequeued to allow an incoming snapshot to be enqueued. This post is about my backup tool, Time Warp, and how I use it to modify Time Machine’s backup behavior using weighted reservoir sampling. I built Time Warp to preserve important backup snapshots and prevent Time Machine from deleting them.

Rationale

Time Machine’s backup behavior is incongruous with the way I work. I primarily use Time Machine for two purposes—to restore recently changed files or to recover files that were deleted some time ago. When a drive is full, Time Machine continuously deletes older backups to make room for newer backups. This behavior makes it impossible to restore files older than a few months using a normal-sized backup drive. Time Machine preferentially retains snapshots based on the assumption that a user is more likely to need to restore a recently changed file than an older file. Dequeuing by FIFO is undesirable because I find it equally likely that I may need to restore a file from 6 weeks ago or from 6 months ago.

A solution to this problem is to decrease the frequency of backups with a tool like TimeMachine Editor. Decreasing backup frequency allows older snapshots to be retained on a backup drive for a longer period of time, but it sacrifices the availability of recent hourly snapshots. This behavior is undesirable because having access to hourly backups is important to me. I sometimes alter a file only to realize a few hours later that I need to restore a previous version of the same file.1

To help illustrate how I view the relative importance of different Time Machine backups, I’ve graphed backup importance as a function of time:

In the above graph, weight is a representation of relative importance. Ignoring some details for the moment, the x-axis is the time in days before the current day. Each bar along the x-axis represents a single Time Machine backup snapshot and its associated weight. In this example, backups are numbered 1 through 21 above each bar. Backup 1 is yesterday’s backup and backup 21 is a backup from 29 days ago. A larger weight signifies that a snapshot should be preferentially preserved on the drive. Recent backups have higher weights because I want to preferentially keep them. Yesterday’s snapshot has the largest weight because it contains the most recently saved state of my data.

Backups at the edges of a gap in backup history also have increased weight. I sometimes travel or work on my laptop remotely, and I don’t always have daily backups of my work. The backup prior to a gap in backup history is important because it preserves a divergent state of my data. If a drive could only hold 21 backups, dequeuing by FIFO would delete backup 21, which is a very important backup to me. It would be much better to delete backup 10, for example, because backups 9 and 11 are likely to contain highly similar states.

Algorithm

Faced with a hard drive of finite size, concessions must be made in deciding which snapshots should be preserved or discarded. Time Machine produces a continuous data stream of backups, but only $k$ backups can be stored on a hard drive. This problem is an interesting application of stream sampling.

Selecting items from a stream is a well-known problem with a rich history of algorithmic study. The idea of stream sampling is to select items from a continuously flowing data source of unknown size and distribution in one pass. In many applications, the objective is to randomly sample data, such as measuring IP packets traveling across a network or counting the frequency of insect species in a river. For my use case, I needed to select backups based on different weights to determine which backups should remain on the disk. To tackle this problem, I chose an implementation of weighted reservoir sampling called A-ES, first described by Paul Efraimidis and colleagues in 2006.2

A-ES starts with a stream $S$ of unknown size and a reservoir $R$ of finite size $m$:

  1. The first $m$ items in $S$ are inserted into $R$.

  2. Once the reservoir is full, a key is generated for the $i$th datum in $R$ by calculating $u_i^{1/w_i}$, where $u_i$ is a uniform random variable, $\mathcal{U}(0,1)$, and $w_i$ is the associated weight of the $i$th datum. The datum with the minimum key in the reservoir is the threshold, $T$.

  3. For each incoming item $s_i$ in the stream, an associated key $k_i$ is generated. If $k_i$ is larger than $T$, the datum in $R$ corresponding to $T$ is replaced with $s_i$ and $k_i$ becomes the new $T$. Note that when all the weights in the stream are 1, this algorithm reduces to simple random sampling because each $k_i$ is a random variate drawn from the uniform distribution.

  4. The process repeats until the stream ends or the desired number of items have been processed.

As the number of items seen from the stream increases, the probability that a new datum from the stream will be incorporated into the reservoir decreases. In the case of uniform weights, the probability that any given item ends up in the reservoir is $m/n$, the reservoir size divided by the number of items seen. This is weighted reservoir sampling without replacement.
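The key formula also explains why heavier items are preferentially retained. For an item with weight $w$, the expected value of its key is:

$$ E\left[u^{1/w}\right] = \int_0^1 u^{1/w}\,du = \frac{w}{w+1} $$

An item with weight 1 has an expected key of $1/2$, while an item with weight 4 has an expected key of $4/5$. Items with larger weights generate larger keys on average, so they are less likely to hold the minimum key that gets evicted from the reservoir.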

Implementation

Here’s my implementation of the base A-ES algorithm in Haskell:

{- Seth Brown
GHC 7.4.2
http://www.drbunsen.org/time_warp
sethbrown AT drbunsen DOT ORG
Copyright 2013 Seth Brown. All rights reserved.
-}

import Data.List
import Data.Ord
import System.Random

main :: IO ()
main = do
    val <- randomIO :: IO Int
    -- an infinite stream of the integers 1, 2, .. with uniform weights
    let st = stream [1..] [1,1..] val
    -- draw a reservoir of 5 items from the first 100 stream items
    print . aes 5 . take 100 $ st

-- pair each value with a key u**(1/w), where u is a uniform random
-- variate on (0,1) and w is the value's associated weight
stream :: [a] -> [Float] -> Int -> [(a, Float)]
stream vals wts seed = zip vals . key $ rand wts seed
    where
        rand lst s = zip lst . randomRs (0.0, 1.0) . mkStdGen $ s
        keyFunc (wt, u) = u ** (1.0 / wt)
        key = map keyFunc

aes :: Eq a => Int -> [(a, Float)] -> [a]
aes n strm = map fst $ sample n strm
    where
        -- fill the reservoir with the first n keyed items, then replace
        -- the minimum-keyed item whenever an incoming key is larger
        sample m s
            | length s < m = s
            | otherwise = (go . splitAt m) s
                where
                    sortByKey = sortBy (comparing snd)
                    minByKey = minimumBy (comparing snd)
                    go (res, rest) = foldl' (\acc x ->
                        if (snd . minByKey) acc < snd x
                        then tail . sortByKey $ acc ++ [x]
                        else acc) res rest
My Haskell prototype uses simple constructs—mostly linked lists and tuples. The aes function consumes a stream of tuples containing a datum and its associated weight, then uses a strict left fold to build the reservoir. The code is unoptimized but still much faster than the optimized Python implementation presented below. Choosing a better data structure, such as a finger tree, would likely improve performance by eliminating the repeated need to find the minimum key and sort the reservoir.

My Python implementation of A-ES uses a priority queue in the form of a heap. Heaps are great data structures for this type of problem because most of the A-ES implementation involves popping the root element of the heap and then pushing the newest element from the stream onto the heap. Python even has a heapreplace function that combines these two actions into one convenient step. The end result is a very readable A-ES implementation compared to my Haskell version above:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Seth Brown
Python 2.7.4
http://www.drbunsen.org/time_warp
sethbrown AT drbunsen DOT ORG
Copyright 2013 Seth Brown. All rights reserved.
"""

from __future__ import print_function
import heapq
from itertools import repeat, izip
from random import random

def aes(strm, k=1):
    """ Weighted reservoir sampling without replacement (A-ES).

    See [Efraimidis et al.][1].

    strm = stream of (element, weight) tuples
    k = reservoir size
    rsv = reservoir, a min-heap of (key, element) tuples

    [1]: http://arxiv.org/pdf/1012.0256.pdf
    """
    rsv = []
    # generate a key for each item and fill the reservoir to k elements
    for n, (el, wi) in enumerate(strm):
        ki = random()**(1. / wi)
        if n < k:
            heapq.heappush(rsv, (ki, el))
        # once the reservoir is full, the heap root is the minimum
        # threshold, t. If ki is larger than t, pop t and push (ki, el)
        # onto the heap in one step.
        elif rsv and ki > rsv[0][0]:
            heapq.heapreplace(rsv, (ki, el))

    # yield the k elements with the largest keys: the reservoir sample
    for ki, el in heapq.nlargest(k, rsv):
        yield el

def main():
    """ Stream sampling with uniform weights
    """
    elems = xrange(1, 101)
    wts = repeat(1, 100)
    strm = izip(elems, wts)
    res_sample = aes(strm, k=5)
    print(*res_sample)

if __name__ == '__main__':
    main()
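As a quick sanity check of the weighted behavior, aes can be run repeatedly with non-uniform weights while tallying how often each element survives. This sketch (weighted_demo is a hypothetical helper; it assumes the aes function above is in scope) gives element 0 ten times the weight of its peers:

from collections import Counter

def weighted_demo(trials=1000):
    """ Tally how often each element is sampled when element 0
    carries 10x the weight of the others (illustrative sketch).
    """
    counts = Counter()
    for _ in xrange(trials):
        # element 0 has weight 10; elements 1-9 have weight 1
        strm = ((el, 10. if el == 0 else 1.) for el in xrange(10))
        counts.update(aes(strm, k=3))
    return counts

With uniform weights, each of the 10 elements would be selected in roughly 30% of trials ($k/n = 3/10$); element 0’s larger weight should push its selection rate far above that.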

Weight

To use the Python A-ES implementation in Time Warp, I needed to create a function to model weight. I wanted to add weight to backups created in the last few days and to any backup bordering a large gap in backup history (see the graph above). To model backup weight, I constructed the following function:

$$ w_i = f(\Delta t_i, \Delta g_i) = 100\varphi^{-\Delta t_i} + 100\ln(\Delta g_i) + 1 $$

For the $i$th datum, $w_i$ is calculated using two main terms. The first expression is an exponential decay function, where $\varphi$ is the golden ratio and $\Delta t_i$ is the time in days from the date the backup was created to the current date. This term adds weight to backups created in the last few days; as backups age, the decay term vanishes and the weight approaches its horizontal asymptote of 1.

The second term controls the weight for gaps. $\Delta g_i$ is the time in days from the $i$th backup to the next most recent backup. Most of my Time Machine backups form an unbroken history where I have backup data for each day or week. For these gap-less backups, weight is controlled by exponential decay alone: $\Delta g_i = 1$ and $\ln(1) = 0$, so the gap term vanishes when there is no gap in backup history.3
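To make this concrete, here is the weight function as a small Python sketch (the weight name and the closed-form golden ratio are my own glosses, not code from Time Warp itself):

from math import log, sqrt

PHI = (1. + sqrt(5.)) / 2.  # the golden ratio

def weight(delta_t, delta_g=1.):
    """ Weight of the i-th backup: exponential decay on the backup's
    age in days (delta_t) plus a logarithmic bonus when a gap of
    delta_g days separates it from the next most recent backup.
    """
    return 100. * PHI ** -delta_t + 100. * log(delta_g) + 1.

With no gap, yesterday’s backup ($\Delta t_i = 1$) gets a weight of about 63, decaying toward 1 as $\Delta t_i$ grows; a week-long gap adds roughly $100\ln(7) \approx 195$ to the backup at its edge.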

Time Warp

Time Warp is the name of my tool for manipulating Time Machine backups. Time Warp uses the A-ES algorithm described above and wraps Apple’s tmutil utility to selectively preserve Time Machine backups that are important to me. A sample Time Warp command looks like this:

By default, Time Warp always operates in safe-mode. If safe-mode had been turned off in the example above, Time Warp would have removed two Time Machine snapshots—one from February 2012 and one from November 2013.

safe-mode and other options in Time Warp are specified by a config file. The safe-warp.json config used above looks like this:

{
    "mode": "safe",
    "volume": "/Volumes/tm/Backups.backupdb/celery/",
    "threshold": 1024, 
    "log": "timewarp.log"
}

The config file has four parameters. Time Warp’s API is designed to always operate in safe-mode unless explicitly configured otherwise. I use safe-mode to experiment, debug, and set up Time Warp with a new Time Machine drive without fear of accidentally deleting backups. When configured with safe-mode, Time Warp prints the names of the snapshots that would have been deleted if Time Warp had actually been executed.4 Only when mode is set to live will Time Warp become active. In live-mode, Time Warp examines the size of a given Time Machine drive, and if the drive is close to being full, Time Warp uses the A-ES algorithm to preferentially select backups to remove.
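This safe-by-default behavior can be enforced when the config is loaded. A minimal sketch (load_config is a hypothetical helper, not Time Warp’s actual code):

import json

def load_config(path):
    """ Load a Time Warp config, falling back to safe-mode whenever
    the mode key is absent (hypothetical sketch).
    """
    with open(path) as fp:
        cfg = json.load(fp)
    cfg.setdefault("mode", "safe")  # never delete unless told to
    return cfg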

The volume and threshold parameters tell Time Warp where the Time Machine backup is located and when Time Warp should become active, respectively. In the default config safe-warp.json, the threshold is set to 1024 MB. This parameter tells Time Warp to become active when the free space on a Time Machine volume drops below 1024 MB and to select backups to remove before Time Machine can delete the oldest backup. Once the free space on the drive exceeds 1024 MB, Time Warp becomes inactive again.
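A sketch of that free-space check (below_threshold is a hypothetical helper using os.statvfs, not Time Warp’s actual implementation):

import os

def below_threshold(volume, threshold_mb):
    """ Return True when the volume's free space, in MB, drops below
    the configured threshold (hypothetical sketch).
    """
    st = os.statvfs(volume)
    free_mb = st.f_bavail * st.f_frsize / (1024. * 1024.)
    return free_mb < threshold_mb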

The final parameter, log, is optional. If log is included in the config file, Time Warp will create a log file at the specified destination. Each time Time Warp is invoked, the log file records the deleted snapshot’s name and the time that the backup was deleted.

The full code, including Time Warp’s launchd daemon and setup instructions, can be found at my GitHub account.


  1. This actually happened while writing this post. I decided to exclude the graph below and deleted it. I then changed my mind and needed to recover it through Time Machine. 

  2. Sadly, the original paper is behind a paywall. This review by the same author is actually a better resource. 

  3. I’ve been experimenting with adding an additional term that also takes into account the size of each backup. Time Machine does not create individual backups; rather, it uses a snapshot system. Time Machine snapshots do not have defined sizes, but you can see the approximate size of a given snapshot via the calculatedrift command in tmutil.

  4. Time Warp is probabilistic in nature. The number of backups and the names of the backups marked for deletion can differ each time the code is executed. safe-mode is designed to allow me to experiment with my code and make sure the algorithm is running correctly. Since Time Machine uses a snapshot system, backups do not have a defined size. In safe-mode, Time Warp estimates the size of each backup by using the mean snapshot size.