Parsing Comma-Separated Data

Problem

You have a plain-text string in a comma-delimited format. You need to parse this string, either to build a data structure or to perform some operation on the data and write it back out.

Solution

The built-in csv library can parse most common character-delimited formats. The FasterCSV library, available as the fastercsv gem, improves on csvs performance and interface. Ill show you both, but I recommend fastercsv unless you can use any software at all outside the standard library.

CSV::Reader.parse and FasterCSV.parse work the same way: they accept a string or an open file as an argument, and yield each parsed row of the comma-delimited file as an array. The csv yields a Row object that acts like an array full of Column objects. FasterCSV just yields an array of strings.

require csv primary_colors = "red,green,blue red,yellow,blue" CSV::Reader.parse(primary_colors) { |row| row.each { |cell| puts cell }} # red # green # blue # red # yellow # blue require ubygems require faster_csv shakespeare = %{Sweet are the uses of adversity,As You Like It "We few, we happy few",Henry V "Seems, madam! nay it is; I know not ""seems.""",Hamlet} FasterCSV.parse(shakespeare) { |row| puts "#{row[0]} -- #{row[1]}"} # Sweet are the uses of adversity -- As You Like It # We few, we happy few -- Henry V # Seems, madam! nay it is; I know not "seems." -- Hamlet

Discussion

Comma-delimited formats are among the most basic portable file formats. Unfortunately, they e also among the least standardized. There are many different formats, and some are internally inconsistent.

FasterCSV and the csv library can parse every comma-delimited format, but they will parse common formats like the one used by Microsoft Excel, and they e your best tool for making sense of the myriad.

FasterCSV and csv both model a comma-delimited file as a nested array of strings. The csv librarys CSV class uses Row objects and Column objects instead of arrays and strings, but its the same idea. The terminology is from the spreadsheet worldunderstand-ably, since a CSV file is a common way of portably storing spreadsheet data.

The complications begin when the spreadsheet cells themselves contain commas or newlines. The standard way to handle this when exporting to comma-delimited format is to surround those cells with double quotes. Then the question becomes what to do with cells that contain double-quote characters. Both Ruby CSV libraries assume that double-quote characters are escaped by doubling, turning each " into "", as in the Hamlet quotation:

%{"Seems, madam! nay it is; I know not ""seems.""",Hamlet}

If you e certain that there are no commas or newlines embedded in your data, and thus no need for quote handling, you can use String#split to parse delimited records more quickly than csv. To output to this format, you can use Array#join:

def parse_delimited_naive(input, fieldsep=,, rowsep=" ") input.split(rowsep).inject([]) do |arr, line| arr << line.split(fieldsep) end end def join_delimited_naive(structure, fieldsep=,, rowsep=" ") rows = structure.inject([]) do |arr, parsed_line| arr << parsed_line.join(fieldsep) end rows.join(rowsep) end parse_delimited_naive("1,2,3,4 5,6,7,8") # => [["1", "2", "3", "4"], ["5", "6", "7", "8"]] join_delimited_naive(parse_delimited_naive("1,2,3,4 5,6,7,8")) # => "1,2,3,4 5,6,7,8" parse_delimited_naive(1;2;3;4|5;6;7;8, ;, |) # => [["1", "2", "3", "4"], ["5", "6", "7", "8"]] parse_delimited_naive(1,"2,3",4) # => [["1", ""2", "3"", "4"]]

This is not recommended unless you wrote all the relevant code yourself, or can manually inspect the code as well as the dataset. Just because you haven seen any quoted cells yet doesn mean there won be any in the future. When in doubt, use csv or fastercsv. Handwritten CSV generators and parsers are a leading cause of bad data.

To create a comma-delimited file, open an output file with CSV.open or FasterCSV.open, and append a series of arrays to the resulting file-like object. Every array you append will be converted to a comma-delimited row in the destination file.

data = [[1,2,3],[A,B,C],[do, e,mi]] writer = FasterCSV.open(first3.csv, w) data.each { |x| writer << x } writer.close puts open(first3.csv).read() # 1,2,3 # A,B,C # do,re,mi data = [] FasterCSV.foreach(first3.csv) { |row| data << row } data # => [["1", "2", "3"], ["A", "B", "C"], ["do", "re", "mi"]]

See Also

Категории