Reading the Contents of a File

Problem

You want to read some or all of a file into memory.

Solution

Open the file with Kernel#open, and pass in a code block that does the actual reading. To read the entire file into a single string, use IO#read:

# Put some stuff into a file.
open('sample_file', 'w') do |f|
  f.write("This is line one.\nThis is line two.")
end

# Then read it back out.
open('sample_file') { |f| f.read }
# => "This is line one.\nThis is line two."

To read the file as an array of lines, use IO#readlines:

open('sample_file') { |f| f.readlines }
# => ["This is line one.\n", "This is line two."]

To iterate over each line in the file, use IO#each. This technique loads only one line into memory at a time:

open('sample_file').each { |x| p x }
# "This is line one.\n"
# "This is line two."
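The block form of open closes the file when the block returns, whereas the bare open('sample_file').each call leaves the handle open until it is garbage-collected. As a rough sketch of an alternative, assuming a reasonably modern Ruby, the class-level shortcuts File.read, File.readlines, and File.foreach open the file, read it, and close it for you. The temporary directory and the variable names here are illustrative only.

```ruby
require 'tmpdir'

# Write the sample file into a throwaway directory (illustrative path).
path = File.join(Dir.mktmpdir, 'sample_file')
File.write(path, "This is line one.\nThis is line two.")

whole = File.read(path)       # the entire file as one string
lines = File.readlines(path)  # an array of lines, newlines kept

# File.foreach yields one line at a time and closes the file afterwards.
seen = []
File.foreach(path) { |line| seen << line }

p whole
p lines
p seen
```

File.foreach is the closest match to the IO#each technique, since it also holds only one line in memory at a time.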

 

Discussion

How much of the file do you want to read into memory at once? Reading the entire file in one gulp uses memory equal to the size of the file, but you end up with a string, and you can use any of Ruby's string processing techniques on it.

The alternative is to process the file one chunk at a time. This uses only the memory needed to store one chunk, but it can be more difficult to work with, because any given chunk may be incomplete. To process a chunk, you may end up reading the next chunk, and the next. This code reads the first 50-byte chunk from a file, but it turns out not to be enough:

puts open('conclusion') { |f| f.read(50) }
# "I know who killed Mr. Lambert," said Joe. "It was
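One way to cope with an incomplete chunk is to keep reading until the piece you need has arrived. The following is a minimal sketch of that idea, not a general-purpose reader: read_until is a hypothetical helper, the sample text is made up, and a StringIO stands in for a real file.

```ruby
require 'stringio'

# Keep reading fixed-size chunks until the buffer contains the terminator
# or the input runs out. Returns the text up to and including the
# terminator. Anything read past the terminator is discarded here; a real
# reader would keep it for the next call.
def read_until(io, terminator, chunk_size = 50)
  buffer = String.new
  until buffer.include?(terminator)
    chunk = io.read(chunk_size)
    break if chunk.nil?  # IO#read(n) returns nil at end-of-file
    buffer << chunk
  end
  cut = buffer.index(terminator)
  cut ? buffer[0, cut + terminator.length] : buffer
end

io = StringIO.new('First sentence is long enough to span chunks. Second one.')
first = read_until(io, '.', 10)
puts first
```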

If a certain string always marks the end of a chunk, you can pass that string into IO#each to get one chunk at a time, as a series of strings. This lets you process each full chunk as a string, and it uses less memory than reading the entire file.

# Create a file…
open('end_separated_records', 'w') do |f|
  f << %{This is record one.
It spans multiple lines.ENDThis is record two.END}
end

# And read it back in.
open('end_separated_records') { |f| f.each('END') { |record| p record } }
# "This is record one.\nIt spans multiple lines.END"
# "This is record two.END"

You can also pass a delimiter string into IO#readlines to get the entire file split into an array by the delimiter string:

# Create a file…
open('pipe_separated_records', 'w') do |f|
  f << "This is record one.|This is record two.|This is record three."
end

# And read it back in.
open('pipe_separated_records') { |f| f.readlines('|') }
# => ["This is record one.|", "This is record two.|",
#     "This is record three."]

The newline character usually makes a good delimiter (many scripts process a file one line at a time), so by default, IO#each and IO#readlines split the file by line:

open('newline_separated_records', 'w') do |f|
  f.puts 'This is record one. It cannot span multiple lines.'
  f.puts 'This is record two.'
end

open('newline_separated_records') { |f| f.each { |x| p x } }
# "This is record one. It cannot span multiple lines.\n"
# "This is record two.\n"

The trouble with newlines is that different operating systems have different newline formats. Unix newlines look like "\n", while Windows newlines look like "\r\n", and the newlines for old (pre-OS X) Macintosh files look like "\r". A file uploaded to a web application might come from any of those systems, but IO#each and IO#readlines split files into lines on the record separator kept in the special variable $/, which defaults to "\n". What to do?

By passing "\n" into IO#each or IO#readlines, you can handle the newlines of files created on any recent operating system, since both Unix and Windows newlines end in "\n". If you need to handle all three types of newlines, the easiest way is to read the entire file at once and then split it up with a regular expression.

open('file_from_unknown_os') { |f| f.read.split(/\r?\n|\r(?!\n)/) }
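To see how such a regular expression treats each newline style, here is a small sketch that applies it to an in-memory string instead of a file. The sample string is invented for illustration and stands in for the result of f.read.

```ruby
# Four records separated by three different newline styles.
mixed = "unix line\nwindows line\r\nold mac line\rlast line"

# \r?\n matches Unix and Windows endings; \r(?!\n) matches a lone
# carriage return that is not the first half of a Windows ending.
lines = mixed.split(/\r?\n|\r(?!\n)/)
p lines
# => ["unix line", "windows line", "old mac line", "last line"]
```

Because the separators are consumed by split, the resulting strings need no further chomping.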

IO#each and IO#readlines don't strip the delimiter strings from the end of the lines. Assuming the delimiter strings aren't useful to you, you'll have to strip them manually.

To strip delimiter characters from the end of a line, use the String#chomp or String#chomp! methods. By default, these methods will remove the last character or set of characters that can be construed as a newline. However, they can be made to strip any other delimiter string from the end of a line.

"This line has a Unix/Mac OS X newline.\n".chomp
# => "This line has a Unix/Mac OS X newline."

"This line has a Windows newline.\r\n".chomp
# => "This line has a Windows newline."

"This line has an old-style Macintosh newline.\r".chomp
# => "This line has an old-style Macintosh newline."

"This string contains two newlines.\n\n".chomp
# => "This string contains two newlines.\n"

'This is record two.END'.chomp('END')
# => "This is record two."

'This string contains no newline.'.chomp
# => "This string contains no newline."

You can chomp the delimiters as IO#each yields each record, or you can chomp each line returned by IO#readlines:

open('pipe_separated_records') do |f|
  f.each('|') { |l| puts l.chomp('|') }
end
# This is record one.
# This is record two.
# This is record three.

lines = open('pipe_separated_records') { |f| f.readlines('|') }
# => ["This is record one.|", "This is record two.|",
#     "This is record three."]
lines.each { |l| l.chomp!('|') }
# => ["This is record one.", "This is record two.", "This is record three."]
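For ordinary newline-separated files, newer Rubies can do the chomping for you. This sketch assumes Ruby 2.4 or later, where the line-reading methods accept a chomp: true keyword. It sticks to the default newline separator, and the temporary path is illustrative.

```ruby
require 'tmpdir'

# Create a newline-separated file in a throwaway directory.
path = File.join(Dir.mktmpdir, 'newline_separated_records')
File.write(path, "This is record one.\nThis is record two.\n")

# chomp: true strips the trailing newline from each line as it is read.
stripped = File.readlines(path, chomp: true)

yielded = []
File.open(path) { |f| f.each_line(chomp: true) { |l| yielded << l } }

p stripped
p yielded
```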

You've got a problem if a file is too big to fit into memory, and there are no known delimiters, or if the records between the delimiters are themselves too big to fit in memory. You've got no choice but to read from the file in chunks of a certain number of bytes. This is also the best way to read binary files; see Recipe 6.17 for more.

Use IO#read to read a certain number of bytes, or IO#each_byte to iterate over the File one byte at a time. The following code uses IO#read to continuously read uniformly sized chunks until it reaches end-of-file:

class File
  def each_chunk(chunk_size=1024)
    yield read(chunk_size) until eof?
  end
end

open("pipe_separated_records") do |f|
  f.each_chunk(15) { |chunk| puts chunk }
end
# This is record
# one.|This is re
# cord two.|This
# is record three
# .
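If you would rather not reopen the File class, the same loop can be written as a standalone method that takes any IO-like object. This is only a sketch of that variation: each_chunk is a hypothetical top-level helper, and a StringIO holding the same text stands in for the file.

```ruby
require 'stringio'

# Yield successive chunks of at most chunk_size bytes from any IO-like
# object. IO#read(n) returns nil at end-of-file, which ends the loop.
def each_chunk(io, chunk_size = 1024)
  while (chunk = io.read(chunk_size))
    yield chunk
  end
end

io = StringIO.new('This is record one.|This is record two.|This is record three.')
chunks = []
each_chunk(io, 15) { |chunk| chunks << chunk }
p chunks
# => ["This is record ", "one.|This is re", "cord two.|This ", "is record three", "."]
```

Relying on the nil returned by read avoids a separate eof? check on every iteration, and the method works on sockets and StringIO objects as well as files.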

All of these methods are made available by the IO class, the superclass of File. You can use the same methods on Socket objects. You can also use each_line and each_byte on String objects (String#each was removed in Ruby 1.9 in favor of String#each_line), which in some cases can save you from having to create a StringIO object (see Recipe 6.15 for more on those beasts).
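As a brief illustration of that last point, the sketch below reads the same delimited text three ways: directly from the String, through a StringIO wrapper, and byte by byte. It assumes Ruby 1.9 or later, where the String method is named each_line, and the sample text is made up.

```ruby
require 'stringio'

text = 'This is record one.|This is record two.'

# String#each_line accepts a separator, just like IO#each.
from_string = []
text.each_line('|') { |record| from_string << record }

# StringIO wraps the same string in an IO-like object.
from_io = StringIO.new(text).readlines('|')

# String#each_byte yields integer byte values.
bytes = text.each_byte.first(4)

p from_string
p from_io
p bytes
```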

See Also
