Parsing Not-Quite-Comma-Separated Data

2017-11-03 09:05:07

Problem

You need to parse a plain-text string or file that's in a format similar to comma-delimited format, but whose delimiters are strings other than commas and newlines.

Solution

When you call a CSV::Reader method, you can specify strings to act as a row separator (the string between each Row) and a field separator (the string between each Column). You can do the same with simulated keyword arguments passed into FasterCSV.parse. This should let you parse most formats similar to the comma-delimited format:

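    require 'csv'
    pipe_separated = "1|2ENDa|bEND"
    # Positional arguments: the field separator, then the row separator.
    CSV::Reader.parse(pipe_separated, '|', 'END') { |r| r.each { |c| puts c } }
    # 1
    # 2
    # a
    # b

    require 'rubygems'
    require 'faster_csv'
    # The same separators, passed as simulated keyword arguments.
    FasterCSV.parse(pipe_separated, :col_sep=>'|', :row_sep=>'END') do |r|
      r.each { |c| puts c }
    end
    # 1
    # 2
    # a
    # b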
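FasterCSV's code became the standard csv library in Ruby 1.9, and CSV::Reader was removed at that point. On a newer Ruby, the same parse might look like the sketch below. It assumes Ruby 1.9 or later; the variable names are only illustrative.

```ruby
require 'csv'

pipe_separated = "1|2ENDa|bEND"

# :col_sep and :row_sep play the same roles as in FasterCSV.parse.
rows = CSV.parse(pipe_separated, col_sep: '|', row_sep: 'END')

rows.each { |row| row.each { |cell| puts cell } }
# 1
# 2
# a
# b
```

Without a block, CSV.parse returns the rows as an array of arrays, which is convenient when the data is small enough to hold in memory.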

Discussion

Value-delimited formats tend to differ along three axes:

The field separator (usually a single comma)

The row separator (usually a single newline)

The quote character (usually a double quote)

Like Reader methods, Writer methods accept custom values for the field and row separators.

    data = [[1,2,3],['A','B','C'],['do','re','mi']]

    open('first3.csv', 'w') do |output|
      CSV::Writer.generate(output, ':', '-END-') do |writer|
        data.each { |x| writer << x }
      end
    end
    open('first3.csv') { |input| input.read() }
    # => "1:2:3-END-A:B:C-END-do:re:mi-END-"

    FasterCSV.open('first3.csv', 'w', :col_sep=>':', :row_sep=>'-END-') do |output|
      data.each { |x| output << x }
    end
    open('first3.csv') { |input| input.read() }
    # => "1:2:3-END-A:B:C-END-do:re:mi-END-"
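For comparison, here is a sketch of the same write using the csv library that ships with Ruby 1.9 and later. It builds the output in a string with CSV.generate rather than writing first3.csv, so that it has no side effects; that choice is an assumption made for the example.

```ruby
require 'csv'

data = [[1, 2, 3], ['A', 'B', 'C'], ['do', 're', 'mi']]

# CSV.generate yields a writer and returns everything written to it.
output = CSV.generate(col_sep: ':', row_sep: '-END-') do |csv|
  data.each { |row| csv << row }
end

puts output
# 1:2:3-END-A:B:C-END-do:re:mi-END-
```

To write to a file instead, CSV.open('first3.csv', 'w', col_sep: ':', row_sep: '-END-') takes the same options.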

It's rare that you'll need to override the quote character, and neither csv nor fastercsv will let you do it. Both libraries' quote characters are hardcoded to the double-quote character. If you need to parse a format that has a different quote character, the simplest thing to do is subclass FasterCSV and override its init_parsers method.

Change the regular expression assigned to @parsers[:csv_row], replacing all double quotes with the quote character you want. The most common alternate quote character is the single quote; to get that, you'd have an init_parsers method like this:

    class MyFasterCSV < FasterCSV
      def init_parsers(options)
        super
        # Same pattern as the stock parser, with ' in place of ".
        @parsers[:csv_row] =
          / \G(?:^|#{Regexp.escape(@col_sep)})     # anchor the match
            (?: '((?>[^']*)(?>''[^']*)*)'          # find quoted fields
                |                                  # ... or ...
                ([^'#{Regexp.escape(@col_sep)}]*)  # unquoted fields
            )/x
      end
    end

    MyFasterCSV.parse("1,'2,3',4") { |r| puts r }
    # 1
    # 2,3
    # 4
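The subclassing approach applies to the library versions this recipe was written against. The csv library in Ruby 1.9 and later accepts a :quote_char option, so on those versions a sketch like the following may be enough. It assumes a single-character quote, which is what that option expects.

```ruby
require 'csv'

single_quoted = "1,'2,3',4"

# :quote_char replaces the default double quote for this parse.
rows = CSV.parse(single_quoted, quote_char: "'")

rows.each { |row| puts row }
# 1
# 2,3
# 4
```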

Some value-delimited files are simply corrupt: they were generated by programs that didn't think to escape quote marks or to quote cells with embedded delimiters. Neither csv nor fastercsv can parse these files, because they're ambiguous or invalid.

    missing_quotes = %{20051002, Alice says, "I saw that!"}
    CSV::Reader.parse(missing_quotes) { |r| r.each { |c| puts c } }
    # CSV::IllegalFormatError: CSV::IllegalFormatError

    unescaped_quotes = %{20051002, "Alice says, "I saw that!""}
    FasterCSV.parse(unescaped_quotes) { |r| r.each { |c| puts c } }
    # FasterCSV::MalformedCSVError: Unclosed quoted field.

Your best strategy for dealing with this kind of file is to use regular expressions to massage the data into a form that fastercsv can parse, or to parse it with String#split and deal with any quoting problems afterwards. In either case, your code will have to work with the particular quirks of the data you're trying to parse.
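As one illustration of the String#split route, the sketch below repairs the missing_quotes line from the earlier example. It assumes a hypothetical layout of exactly two columns, a date followed by free text, and uses the csv library from Ruby 1.9 or later to re-emit the row. Real data will need rules of its own.

```ruby
require 'csv'

missing_quotes = %{20051002, Alice says, "I saw that!"}

# Split on the first comma only, so commas inside the text column survive.
date, text = missing_quotes.split(/,\s*/, 2)

# Re-emit the row as well-formed CSV; the library adds the quoting
# and doubles the embedded quote marks.
fixed = [date, text].to_csv

# The repaired line now parses back into the two intended fields.
parsed = CSV.parse_line(fixed)
```

The limit argument to split is what makes this work: it stops splitting once the known number of columns has been reached, leaving the messy text column intact.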
