1. Storing Files on a Tape


algonotes · lecturenotes

Published: May 16, 2023

Acknowledgements

This write up is borrowed (with minor adaptations) from Chapter 4 of Jeff Erickson’s textbook on Algorithms. The interactive illustrations were generated with assistance from ChatGPT (the May 12 version).

The Problem

Suppose we have a set of $n$ files that we want to store on magnetic tape¹. In the future, users will want to read those files from the tape. Reading a file from tape isn't like reading a file from disk; first we have to fast-forward past all the other files, and that takes a significant amount of time. Let $L[1..n]$ be an array listing the lengths of each file; specifically, file $i$ has length $L[i]$. If the files are stored in order from 1 to $n$, then the cost of accessing the $k$th file is

$$\operatorname{cost}(k) = \sum_{i=1}^{k} L[i].$$

The cost reflects the fact that before we read file $k$ we must first scan past all the earlier files on the tape. If we assume for the moment that each file is equally likely to be accessed, then the expected cost of searching for a random file is

$$\mathrm{E}[\operatorname{cost}] = \sum_{k=1}^{n} \frac{\operatorname{cost}(k)}{n} = \frac{1}{n} \sum_{k=1}^{n} \sum_{i=1}^{k} L[i].$$

If we change the order of the files on the tape, we change the cost of accessing the files; some files become more expensive to read, but others become cheaper. Different file orders are likely to result in different expected costs. Specifically, let $\pi(i)$ denote the index of the file stored at position $i$ on the tape. Then the expected cost of the permutation $\pi$ is

$$\mathrm{E}[\operatorname{cost}(\pi)] = \frac{1}{n} \sum_{k=1}^{n} \sum_{i=1}^{k} L[\pi(i)].$$

Which order should we use if we want this expected cost to be as small as possible? Try it yourself in the demonstration below. The tape is the grey rectangle, and the files are the colored boxes below it. The widths of the boxes are proportional to the file lengths. Double-click a file outside the tape to push it onto the tape, and double-click a file on the tape to remove it. Can you match the optimal cost?

[Interactive demo: place the files (lengths 125, 100, 100, 75, 20) on the tape, one at a time, and watch the average cost; the target average cost is 205.00.]
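If you would rather experiment in code, here is a short Python sketch of the same exercise: it computes the expected cost of an ordering directly from the formula above, and brute-forces all orderings of the demo's file lengths to find the optimum.

```python
from itertools import permutations

def expected_cost(lengths):
    """Expected access cost when files are stored in the given order,
    assuming each file is requested uniformly at random."""
    prefix, total = 0, 0
    for length in lengths:
        prefix += length   # cost(k) is the k-th prefix sum of the lengths
        total += prefix
    return total / len(lengths)

# The file lengths from the demo above.
L = [125, 100, 100, 75, 20]

best = min(expected_cost(p) for p in permutations(L))
print(best)                      # 205.0, matching the demo's target
print(expected_cost(sorted(L)))  # 205.0: increasing order of length achieves it
```

Note that sorting by increasing length already matches the brute-force optimum on this instance, which is exactly the intuition the next paragraph develops.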

After some playing around, you may have come to the conclusion that you should sort the files by increasing length. Indeed, the length of the first file shows up the most frequently in the expression we are trying to optimize, while the length of the last file shows up just once: so it seems clear that the larger files should be pushed to the end.

However — and especially so with greedy approaches — intuition is a tricky beast. The only way to be sure that this order works is to actually prove that it works!

Greedy Works Out

Correctness of Greedy - I

Lemma. $\mathrm{E}[\operatorname{cost}(\pi)]$ is minimized when $L[\pi(i)] \leq L[\pi(i+1)]$ for all $i$.

Proof: Suppose $L[\pi(i)] > L[\pi(i+1)]$ for some index $i$. To simplify notation, let $a = \pi(i)$ and $b = \pi(i+1)$. If we swap files $a$ and $b$, then the cost of accessing $a$ increases by $L[b]$, and the cost of accessing $b$ decreases by $L[a]$. Overall, the swap changes the expected cost by $(L[b] - L[a])/n$. But this change is an improvement, because $L[b] < L[a]$. Thus, if the files are out of order, we can decrease the expected cost by swapping some mis-ordered pair of adjacent files.

Try dragging the borders below to see how changes in file sizes impact the average cost.

[Interactive demo: four files, each initially of length 105; drag the borders between them to change the lengths. Both the current and optimal costs start at 262.5.]

This is our first example of a correct greedy algorithm. To minimize the total expected cost of accessing the files, we put the file that is cheapest to access first, and then recursively write everything else; no backtracking, no dynamic programming, just make the best local choice and blindly plow ahead. If we use an efficient sorting algorithm, the running time is clearly $O(n \log n)$, plus the time required to actually write the files. To show that the greedy algorithm is actually correct, we proved that the output of any other algorithm can be improved by some sort of exchange.
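The exchange step in the proof is easy to see numerically. In the sketch below (file lengths borrowed from the earlier demo), swapping one mis-ordered adjacent pair changes the expected cost by exactly $(L[b] - L[a])/n$, which is negative:

```python
def expected_cost(lengths):
    # Average over k of the k-th prefix sum (the formula from above).
    prefix, total = 0, 0
    for x in lengths:
        prefix += x
        total += prefix
    return total / len(lengths)

order   = [20, 125, 75, 100, 100]   # files of length 125 and 75 are out of order
swapped = [20, 75, 125, 100, 100]   # exchange the mis-ordered adjacent pair

delta = expected_cost(swapped) - expected_cost(order)
print(delta)   # (L[b] - L[a]) / n = (75 - 125) / 5 = -10.0
```

Only the relative order of the swapped pair changes; every other file's access cost is untouched, which is why the net effect is this single signed term.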

Frequencies

Let’s generalize this idea further. Suppose we are also given an array $F[1..n]$ of access frequencies for each file; file $i$ will be accessed exactly $F[i]$ times over the lifetime of the tape. Now the total cost of accessing all the files on the tape is

$$\Sigma\operatorname{cost}(\pi) = \sum_{k=1}^{n}\left(F[\pi(k)] \cdot \sum_{i=1}^{k} L[\pi(i)]\right) = \sum_{k=1}^{n} \sum_{i=1}^{k} F[\pi(k)] \cdot L[\pi(i)].$$

As before, reordering the files can change this total cost. So what order should we use if we want the total cost to be as small as possible? (This question is similar in spirit to the optimal binary search tree problem, but the target data structure and the cost function are both different, so the algorithm must be different, too.)

We already proved that if all the frequencies are equal, we should sort the files by increasing size. If the frequencies are all different but the file lengths $L[i]$ are all equal, then intuitively, we should sort the files by decreasing access frequency, with the most-accessed file first.

Food For Thought

Prove this formally by adapting the proof from the previous discussion.
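A brute-force check is no substitute for the proof, but it is a useful sanity test while you work one out. The sketch below uses a small made-up instance with equal lengths and distinct frequencies, and confirms that the decreasing-frequency order matches the brute-force optimum:

```python
from itertools import permutations

def total_cost(files):
    """Sigma-cost: each file's frequency times the prefix sum of lengths up to it."""
    prefix, total = 0, 0
    for length, freq in files:
        prefix += length
        total += freq * prefix
    return total

# Equal lengths, distinct frequencies (a hypothetical instance for illustration).
files = [(10, 3), (10, 7), (10, 1), (10, 5)]

best = min(total_cost(p) for p in permutations(files))
greedy = total_cost(sorted(files, key=lambda f: -f[1]))  # most-accessed file first
print(best == greedy)   # True
```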

But what if the sizes and the frequencies both vary? In this case, we should sort the files by the ratio $L/F$.

Correctness of Greedy - II

Lemma. $\Sigma\operatorname{cost}(\pi)$ is minimized when $\frac{L[\pi(i)]}{F[\pi(i)]} \leq \frac{L[\pi(i+1)]}{F[\pi(i+1)]}$ for all $i$.

Proof: Suppose $L[\pi(i)]/F[\pi(i)] > L[\pi(i+1)]/F[\pi(i+1)]$ for some index $i$. To simplify notation, let $a = \pi(i)$ and $b = \pi(i+1)$. If we swap files $a$ and $b$, then the cost of each access to $a$ increases by $L[b]$, and the cost of each access to $b$ decreases by $L[a]$. Overall, the swap changes the total cost by $L[b]F[a] - L[a]F[b]$. But this change is an improvement, because

$$\frac{L[a]}{F[a]} > \frac{L[b]}{F[b]} \iff L[b]F[a] - L[a]F[b] < 0.$$

Thus, if any two adjacent files are out of order, we can improve the total cost by swapping them.
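As a final sanity check on the lemma, we can compare the ratio-sorted order against brute force on a small made-up instance where both lengths and frequencies vary. Exact rational arithmetic (via `fractions.Fraction`) avoids any floating-point wobble in the ratio comparisons:

```python
from itertools import permutations
from fractions import Fraction

def total_cost(files):
    """Sigma-cost of an ordering of (length, frequency) pairs."""
    prefix, total = 0, 0
    for length, freq in files:
        prefix += length
        total += freq * prefix
    return total

# A hypothetical instance with varying lengths and frequencies.
files = [(8, 2), (3, 5), (6, 1), (4, 4)]

best = min(total_cost(p) for p in permutations(files))
greedy_order = sorted(files, key=lambda f: Fraction(f[0], f[1]))  # increasing L/F
print(total_cost(greedy_order) == best)   # True
```

Since the ratios in this instance are all distinct, the lemma says the sorted order is the unique order no adjacent swap can improve, so it must coincide with the brute-force optimum.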

Footnotes

  1. Just in case you have not heard of them, here’s Wikipedia on magnetic tapes :)


© 2022 • Neeldhara Misra • Credits •


Corrections? Please leave a comment here or a PR in this repository, thanks!
