Picking a Random Line from a File

Problem

You want to choose a random line from a file, without loading the entire file into memory.

Solution

Iterate over the file, giving each line a chance to be the randomly selected one:

module Enumerable def random_line selected = nil each_with_index { |line, lineno| selected = line if rand < 1.0/lineno } return selected.chomp if selected end end #Create a file with 1000 lines open('random_line_test', 'w') do |f| 1000.times { |i| f.puts "Line #{i}" } end #Pick random lines from the file. f = open('random_line_test') f.random_line # => "Line 520" f.random_line # => nil f.rewind f.random_line # => "Line 727"

 

Discussion

The obvious solution reads the entire file into memory;

File.open('random_line_test') do |f| l = f.readlines l[rand(l.size)].chomp end # => "Line 708"

The recommended solution is just as fast, and only reads one line at a time into memory. However, once it's done, the file pointer has been set to the end of the file and you can't access the file anymore without calling File#rewind. If you want to pick a lot of random lines from a file, reading the entire file into memory might be preferable to iterating over it multiple times.

This recipe makes for a good command-line tool. The following code uses the special variable $., which holds the number of the line most recently read from a file:

$ ruby -e 'rand < 1.0/$. and line = $_ while gets; puts line.chomp if line'

The algorithm works because, although lines that come earlier in the file have a better chance of being selected initially, they also have more chances to be replaced by a later line. A proof by induction demonstrates that the algorithm gives equal weight to each line in the file.

The base case is a file of a single line, where it will obviously work: any value of Kernel#rand will be less than 1, so the first line will always be chosen.

Now for the inductive step. Assume that the algorithm works for a file of n lines: that is, each of the first n lines has a 1/n chance of being chosen. Then, add another line to the file and process the new line. The chance that line n+1 will become the randomly chosen line is 1/(n+1). The remaining probability, n/n+1, is the chance that one of the other n lines is the randomly chosen one.

Our inductive assumption was that each of the n original lines had an equal chance of being chosen, so this remaining n/n+1 probability must be distributed evenly across the n original lines. Given a line in the first n, what's it's chance of being the chosen one? It's just n/n+1 divided by n,or 1/n+1. Line n+1 and all earlier lines have a 1/n+1 chance of being chosen, so the choice is truly random.

See Also

Категории