Parsing URLs

Problem

You want to parse a string representation of a URL into a data structure that articulates the parts of the URL.

Solution

URI.parse transforms a string describing a URL into a URI object.[5] The parts of the URL can be determined by interrogating the URI object.

[5] The class name is URI, but I use both "URI" and "URL" because they are more or less interchangeable.

require 'uri'
URI.parse('https://www.example.com').scheme      # => "https"
URI.parse('http://www.example.com/').host        # => "www.example.com"
URI.parse('http://www.example.com:6060/').port   # => 6060
URI.parse('http://example.com/a/file.html').path # => "/a/file.html"

URI.split transforms a string into an array of URL parts. This is more efficient than URI.parse, but you have to know which parts correspond to which slots in the array:

URI.split('http://example.com/a/file.html')
# => ["http", nil, "example.com", nil, nil, "/a/file.html", nil, nil, nil]
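One way to avoid memorizing the slot order is to pair each slot with a name. The sketch below assumes the order documented for URI.split (scheme, userinfo, host, port, registry, path, opaque, query, fragment); the constant SPLIT_SLOTS and the helper split_to_hash are hypothetical names introduced only for illustration.

```ruby
require 'uri'

# Slot order documented for URI.split (SPLIT_SLOTS is a hypothetical name).
SPLIT_SLOTS = [:scheme, :userinfo, :host, :port, :registry,
               :path, :opaque, :query, :fragment].freeze

# Hypothetical helper: pair each slot name with the value URI.split returns.
def split_to_hash(url)
  SPLIT_SLOTS.zip(URI.split(url)).to_h
end

parts = split_to_hash('http://example.com/a/file.html?key=val#top')
puts parts[:host]
puts parts[:path]
```

This keeps the speed advantage of URI.split for the parsing step while letting the calling code refer to parts by name.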

Discussion

The URI module contains classes for five of the most popular URI schemes. Each one can store in a structured format the data that makes up a URI for that scheme. URI.parse creates an instance of the appropriate class for a particular URL's scheme.
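As a rough illustration of that dispatch, the following sketch parses a handful of sample URLs (all of them made-up example addresses) and records which class URI.parse chose for each one.

```ruby
require 'uri'

# Sample URLs, one per scheme; the addresses are placeholders.
samples = [
  'http://www.example.com/',
  'https://www.example.com/',
  'ftp://ftp.example.com/a/file.txt',
  'mailto:leonardr@example.com',
  'ldap://ldap.example.com',
  'tag:example.com,2006,my-tag' # no dedicated class for this scheme
]

# URI.parse picks the class from the scheme at the front of the string.
classes = samples.map { |s| URI.parse(s).class }

samples.zip(classes).each { |s, c| puts "#{c}: #{s}" }
```

A scheme without a dedicated class, such as the last sample, falls back to URI::Generic.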

Every URI can be decomposed into a set of components, joined by constant strings. For example, the components of an HTTP URI are the scheme ("http"), the hostname ("www.example.com"), and so on. Each URI scheme has its own components, and each of Ruby's URI classes stores the names of its components in an ordered array of symbols, called component:

URI::HTTP.component
# => [:scheme, :userinfo, :host, :port, :path, :query, :fragment]

URI::MailTo.component
# => [:scheme, :to, :headers]

Each of the components of a URI class has a corresponding accessor method, which you can call to get one component of a URI. You can also instantiate a URI class directly (rather than going through URI.parse) by calling its build method and passing in the component values as a hash keyed by the appropriate component symbols:

URI::HTTP.build(:host => 'example.com', :path => '/a/file.html',
                :fragment => 'section_3').to_s
# => "http://example.com/a/file.html#section_3"

The following debugging method iterates over the components handled by the scheme of a given URI object, and prints the corresponding values:

class URI::Generic
  def dump
    component.each do |m|
      puts "#{m}: #{send(m).inspect}"
    end
  end
end
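If you would rather not reopen URI::Generic, much the same walk over the component array can be done from outside the class. This is only a sketch: component_hash is a hypothetical helper name, and it returns the name/value pairs as a hash instead of printing them.

```ruby
require 'uri'

# Hypothetical helper: collect each component name and its value into a
# hash by calling the accessor method named after the component.
def component_hash(uri)
  uri.component.map { |name| [name, uri.public_send(name)] }.to_h
end

http_parts = component_hash(
  URI.parse('http://www.example.com:6060/a/file.html?key1=val1')
)
mail_parts = component_hash(URI.parse('mailto:leonardr@example.com'))

puts http_parts.inspect
puts mail_parts.inspect
```

Returning a hash makes the result easy to test or serialize, at the cost of the one-line convenience of calling dump on the object itself.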

URI::HTTP and URI::HTTPS are the most commonly encountered subclasses of URI, since most URIs are the URLs to web pages. Both classes provide the same interface.

url = 'http://leonardr:pw@www.subdomain.example.com:6060' +
      '/cgi-bin/mycgi.cgi?key1=val1#anchor'
URI.parse(url).dump
# scheme: "http"
# userinfo: "leonardr:pw"
# host: "www.subdomain.example.com"
# port: 6060
# path: "/cgi-bin/mycgi.cgi"
# query: "key1=val1"
# fragment: "anchor"
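To see the shared interface in practice, here is a small sketch that parses the same placeholder address under both schemes. The visible difference is the default port that each class fills in when the URL doesn't specify one.

```ruby
require 'uri'

plain  = URI.parse('http://www.example.com/cgi-bin/mycgi.cgi?key1=val1')
secure = URI.parse('https://www.example.com/cgi-bin/mycgi.cgi?key1=val1')

# Neither URL names a port, so each class supplies its own default.
puts plain.port
puts secure.port

# request_uri (path plus query) is available on both classes.
puts plain.request_uri
puts secure.request_uri

# Both classes report the same list of components.
puts URI::HTTPS.component == URI::HTTP.component
```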

A URI::FTP object represents an FTP server, or a path to a file on an FTP server. The typecode component indicates whether the file in question is text, binary, or a directory; it typically won't be known unless you create a URI::FTP object and specify one.

URI::parse('ftp://leonardr:password@ftp.example.com/a/file.txt').dump
# scheme: "ftp"
# userinfo: "leonardr:password"
# host: "ftp.example.com"
# port: 21
# path: "/a/file.txt"
# typecode: nil
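A typecode can also arrive as part of the URL itself. The sketch below assumes the ";type=" suffix convention for FTP URLs, where "a" means text (ASCII), "i" means binary (image), and "d" means a directory listing; the host and path are placeholders.

```ruby
require 'uri'

# The ";type=a" suffix marks the file as text.
text_file = URI.parse('ftp://ftp.example.com/a/file.txt;type=a')

# Without the suffix, the typecode stays unset.
no_type = URI.parse('ftp://ftp.example.com/a/file.txt')

puts text_file.typecode.inspect
puts no_type.typecode.inspect
```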

A URI::MailTo object represents an email address, or even an entire message to be sent to that address. In addition to its component array, this class provides a method (to_mailtext) that formats the URI as an email message.

uri = URI::parse('mailto:leonardr@example.com?Subject=Hello&body=Hi!')
uri.dump
# scheme: "mailto"
# to: "leonardr@example.com"
# headers: [["Subject", "Hello"], ["body", "Hi!"]]

puts uri.to_mailtext
# To: leonardr@example.com
# Subject: Hello
#
# Hi!

A URI::LDAP object contains a path to an LDAP server or a query against one:

URI::parse('ldap://ldap.example.com').dump
# scheme: "ldap"
# host: "ldap.example.com"
# port: 389
# dn: nil
# attributes: nil
# scope: nil
# filter: nil
# extensions: nil

URI::parse('ldap://ldap.example.com/o=Alice%20Exeter,c=US?extension').dump
# scheme: "ldap"
# host: "ldap.example.com"
# port: 389
# dn: "o=Alice%20Exeter,c=US"
# attributes: "extension"
# scope: nil
# filter: nil
# extensions: nil

The URI::Generic class, superclass of all of the above, is a catch-all class that holds URIs with other schemes, or with no scheme at all. It holds much the same components as URI::HTTP, although there's no guarantee that any of them will be non-nil for a given URI::Generic object.
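A short sketch of both cases follows: a relative reference with no scheme, and a URL whose scheme ("myscheme", invented here for illustration) has no dedicated class. Both come back as URI::Generic objects, with whatever components could not be determined left as nil.

```ruby
require 'uri'

# No scheme or host at all: just a path and a query.
relative = URI.parse('/a/file.html?key1=val1')

# "myscheme" is a made-up scheme with no class of its own.
other = URI.parse('myscheme://host.example.com/path')

puts relative.class
puts relative.scheme.inspect
puts relative.path

puts other.class
puts other.host
puts other.port.inspect
```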

URI::Generic also exposes two other components not used by any of its built-in subclasses. The first is opaque, which is the portion of a URL that couldn't be parsed (that is, everything after the scheme):

uri = URI.parse('tag:example.com,2006,my-tag')
uri.scheme # => "tag"
uri.opaque # => "example.com,2006,my-tag"

The second is registry, which is only used for URI schemes whose naming authority is registry-based instead of server-based. It's likely that you'll never need to use registry, since almost all URI schemes are server-based (for instance, HTTP, FTP, and LDAP all use DNS to designate a host).

To combine the components of a URI object into a string, simply call to_s:

uri = URI.parse('http://www.example.com/#anchor')
uri.port = 8080
uri.to_s # => "http://www.example.com:8080/#anchor"
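The other components have setter methods as well, so a URI object can be edited piece by piece before it is turned back into a string. A minimal sketch, using placeholder path and query values:

```ruby
require 'uri'

uri = URI.parse('http://www.example.com/#anchor')

# Replace the path, add a query string, and drop the fragment.
uri.path = '/docs/index.html'
uri.query = 'page=2'
uri.fragment = nil

result = uri.to_s
puts result
```

Because the port was left at the HTTP default, to_s omits it from the resulting string.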

See Also
