ASP.NET 4 Unleashed

Using HTML pattern matching, you can make any Web site act like an XML Web service. HTML pattern matching enables you to extract content from any document by specifying regular expressions that match content in the document. Matched content is exposed as values of Web service properties.

This technology has two broad applications. First, you can use it to access data from legacy systems. For example, if you have thousands of old customer invoices stored in an ancient pre-war mainframe in the back office, you can use HTML pattern matching to liberate the data. If a document is accessible over the network (regardless of whether the document is a plain text file, an XML document, or an HTML page), you can use HTML pattern matching to expose it.

Second, HTML pattern matching can be used for communicating with Web sites that haven't implemented a Web service. For example, suppose that you need to exchange product information over the Internet with a sister Web site. If the sister Web site displays a normal HTML page with a list of products, you can use HTML pattern matching to scrape the information from the page.

To implement HTML pattern matching, you need to write a WSDL document that describes the content you want to expose. After you create the WSDL document, you generate a proxy class for the Web service described by the WSDL file. You can then integrate the proxy class into an ASP.NET application in the same way as you would for any other Web service.

Creating the WSDL Document

A Web Services Description Language (WSDL) document is an XML document that describes, among other things, all the methods and properties accessible through a Web service. It provides the name and location of each method and property. It also specifies the data types of all the parameters that can be used with the methods and properties.

The hardest part of implementing HTML pattern matching is writing the WSDL document. You need to write this document to specify the text patterns that you want to match. Therefore, you need to spend a little time examining the structure of a WSDL document.

NOTE

You can find the current version of the WSDL document standard at the following address:

http://www.w3.org/TR/wsdl

A WSDL document contains the following five sections:

  • Types: Lists all the data types used when exchanging messages.

  • Messages: Lists the name and all the parts of each message. Input elements are listed separately from output elements. Each message part is associated with a data type.

  • PortTypes: Lists the operations that can be performed with messages.

  • Bindings: Associates the operations listed in PortTypes with a particular message format and protocol.

  • Services: Associates a binding with a particular location (URL).

Each section of a WSDL document builds on the elements of the preceding section. For example, you specify all the messages in the Messages section and combine the messages into operations in the PortTypes section.

To use HTML pattern matching, you add a regular expression pattern to the Bindings section. Any text matched by the regular expression is exposed as a property of the Web service.

NOTE

Regular expressions are described in Chapter 24, "Working with Collections and Strings."

The file in Listing 23.20 is a template for a simple WSDL document that you can use for HTML pattern matching. The values that you need to replace appear as placeholders: the yourdomain.com namespace, MessageName, PortTypeName, OperationName, BindingName, RelativeUrl, PropertyName, pattern, ServiceName, and Url.

Listing 23.20 Template.Wsdl

<?xml version="1.0"?>
<definitions
  xmlns:s="http://www.w3.org/2001/XMLSchema"
  xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
  xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
  xmlns:tm="http://microsoft.com/wsdl/mime/textMatching/"
  xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
  xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
  xmlns:s0="http://yourdomain.com/webservices"
  targetNamespace="http://yourdomain.com/webservices"
  xmlns="http://schemas.xmlsoap.org/wsdl/">
  <types/>
  <message name="MessageNameHttpGetIn" />
  <message name="MessageNameHttpGetOut" />
  <portType name="PortTypeNameHttpGet">
    <operation name="OperationName">
      <input message="s0:MessageNameHttpGetIn"/>
      <output message="s0:MessageNameHttpGetOut"/>
    </operation>
  </portType>
  <binding name="BindingNameHttpGet" type="s0:PortTypeNameHttpGet">
    <http:binding verb="GET"/>
    <operation name="OperationName">
      <http:operation location="RelativeUrl"/>
      <input>
        <http:urlEncoded/>
      </input>
      <output>
        <tm:text>
          <tm:match name='PropertyName' pattern='pattern' />
        </tm:text>
      </output>
    </operation>
  </binding>
  <service name="ServiceName">
    <port name="PortTypeNameHttpGet" binding="s0:BindingNameHttpGet">
      <http:address location="Url" />
    </port>
  </service>
</definitions>

The C# version of this code can be found on the CD-ROM.

The file in Listing 23.20 contains a template for a Web service that has a single method with no parameters. When you call the method, any text matching the regular expression pattern described by the <match> element can be retrieved.

Keep in mind that this is a simple WSDL template. You can create HTML pattern matching services that match multiple elements in a page and also create services with multiple input and output parameters.

Specifying the Regular Expression Pattern

The real work of pattern matching is performed in the <match> element in Listing 23.20. This element has the following attributes. (In the WSDL file itself, the attribute names begin with a lowercase letter, as in ignoreCase and repeats.)

  • Capture: An integer that represents the index of a match within a grouping

  • Group: An integer that represents the index of a particular capturing group

  • IgnoreCase: A Boolean value that specifies whether the match ignores case (a value of true performs a case-insensitive match)

  • Matches: An attribute that returns a MimeTextMatchCollection, which represents all the regular expression matches

  • Name: A string that represents the name of the property that exposes the matches

  • Pattern: A string that represents a regular expression pattern

  • Repeats: An integer that represents the number of matches to perform

  • Type: A string that is used when a match contains submatches

NOTE

The <match> element is represented by the MimeTextMatch class in the .NET Framework.
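As a rough illustration of what these attributes mean, the following console sketch applies the same ideas directly with the Regex class. It is only an interpretation: the mapping of repeats to a match limit and of group to a capturing-group index is an assumption based on the descriptions above, and the sample markup and the module name are hypothetical. It does not show how the framework itself is implemented.

```vb
Imports System
Imports System.Text.RegularExpressions

Module MatchAttributesSketch
    Sub Main()
        ' Hypothetical sample text; it does not come from any listing.
        Dim sampleHtml As String = "<LI>First</LI><li>Second</li><li>Third</li>"

        ' pattern and ignoreCase: the expression and its case handling.
        Dim rx As New Regex("<li>(.*?)</li>", RegexOptions.IgnoreCase)

        ' repeats (assumed meaning): stop after this many matches.
        Dim repeats As Integer = 2
        ' group (assumed meaning): index of the capturing group to return.
        Dim groupIndex As Integer = 1

        Dim m As Match = rx.Match(sampleHtml)
        Dim count As Integer = 0
        While m.Success AndAlso count < repeats
            Console.WriteLine(m.Groups(groupIndex).Value)
            m = m.NextMatch()
            count += 1
        End While
    End Sub
End Module
```

With these values the sketch prints First and Second. The uppercase <LI> tag still matches because case is ignored, and the third item is skipped because of the match limit.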

Creating a Simple HTML Pattern Matching Service

Now you can start with a simple HTML pattern matching service called TitlesService. This service will enable you to retrieve a list of book titles and prices from an HTML page.

A sample of the target page exposed by the Web service is displayed in Figure 23.5. This page is included on the CD-ROM with the name Titles.aspx. The Titles.aspx page displays different books from the Titles table, depending on the value of a query string variable named type.

Figure 23.5. Target page for HTML pattern matching.

The first step in creating the TitlesService is creating the necessary WSDL document. To do so, you can use the WSDL document in Listing 23.21.

Listing 23.21 Titles.Wsdl

<?xml version="1.0"?>
<definitions
  xmlns:s="http://www.w3.org/2001/XMLSchema"
  xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
  xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
  xmlns:tm="http://microsoft.com/wsdl/mime/textMatching/"
  xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
  xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
  xmlns:s0="http://yourdomain.com/webservices"
  targetNamespace="http://yourdomain.com/webservices"
  xmlns="http://schemas.xmlsoap.org/wsdl/">
  <types/>
  <message name="GetTitlesHttpGetIn" />
  <message name="GetTitlesHttpGetOut" />
  <portType name="BookTitlesHttpGet">
    <operation name="GetTitles">
      <input message="s0:GetTitlesHttpGetIn"/>
      <output message="s0:GetTitlesHttpGetOut"/>
    </operation>
  </portType>
  <binding name="BookTitlesHttpGet" type="s0:BookTitlesHttpGet">
    <http:binding verb="GET"/>
    <operation name="GetTitles">
      <http:operation location="/Titles.aspx?type=business"/>
      <input>
        <http:urlEncoded/>
      </input>
      <output>
        <tm:text>
          <tm:match name='Titles' pattern='&lt;li&gt;(.*?)- \$' ignoreCase='true' repeats='100' />
          <tm:match name='Prices' pattern='\$([\d\.]+)' ignoreCase='true' repeats='100' />
        </tm:text>
      </output>
    </operation>
  </binding>
  <service name="TitlesService">
    <port name="BookTitlesHttpGet" binding="s0:BookTitlesHttpGet">
      <http:address location="http://localhost" />
    </port>
  </service>
</definitions>

The C# version of this code can be found on the CD-ROM.

The WSDL file in Listing 23.21 documents the public interface of a Web service named TitlesService that has one method named GetTitles. It contains two match elements: one for book titles and one for book prices.

NOTE

The WSDL file in Listing 23.21 describes an XML Web service located at http://localhost/. Normally, you would specify the actual domain of the Web service.

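The regular expression for book titles looks like this (in the WSDL file, the angle brackets of the <li> tag are written as the XML entities &lt; and &gt;):

&lt;li&gt;(.*?)- \$

The regular expression for book prices looks like this:

\$([\d\.]+)

In both patterns the dollar sign is escaped as \$. An unescaped $ is an end-of-input anchor rather than a literal character, so without the backslash neither pattern could match a price.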

The Web service works with an HTML page located on the local host with the path /Titles.aspx?type=business. It always retrieves a page that displays a list of business titles.
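To see what the two patterns capture, you can try them outside the Web service with the Regex class. The following console sketch is only an illustration: the sample markup is a hypothetical stand-in for the output of Titles.aspx (assumed here to have the shape <li>Title - $Price), and the module name is made up. The dollar sign is written as \$ so that it is treated as a literal character.

```vb
Imports System
Imports System.Text.RegularExpressions

Module TitlesPatternSketch
    Sub Main()
        ' Hypothetical markup in the shape the patterns expect.
        Dim sampleHtml As String = _
            "<li>Sample Business Book - $19.99" & _
            "<li>Another Sample Book - $24.50"

        ' Same expressions as the match elements, with XML entities decoded.
        Dim rxTitles As New Regex("<li>(.*?)- \$", RegexOptions.IgnoreCase)
        Dim rxPrices As New Regex("\$([\d\.]+)", RegexOptions.IgnoreCase)

        Dim m As Match
        For Each m In rxTitles.Matches(sampleHtml)
            ' Group 1 holds the text between <li> and the hyphen.
            Console.WriteLine("TITLE: " & m.Groups(1).Value.Trim())
        Next
        For Each m In rxPrices.Matches(sampleHtml)
            ' Group 1 holds the digits and decimal point after the dollar sign.
            Console.WriteLine("PRICE: " & m.Groups(1).Value)
        Next
    End Sub
End Module
```

For this sample input the sketch prints the two titles followed by the prices 19.99 and 24.50. Each pattern is applied to the whole page independently, which is why the service exposes titles and prices as two separate arrays.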

You can build a proxy class from the WSDL file in Listing 23.21 with the following statements executed from a command prompt:

wsdl /l:vb Titles.wsdl
vbc /t:library /r:System.dll,System.Web.Services.dll,System.Xml.dll TitlesService.vb

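The first statement generates the source code for a proxy class in a file named TitlesService.vb, and the second statement compiles it into an assembly named TitlesService.dll. Before you can use the proxy class, you need to copy TitlesService.dll to your application's /bin directory.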

You then can use the page in Listing 23.22 to test the TitlesService Web service. The page displays the book titles and prices in a slightly different format than the format on the Titles.aspx page (compare Figure 23.5 with Figure 23.6).

Listing 23.22 TestTitlesService.aspx

<Script runat="Server">
Sub Page_Load
  Dim objTitlesService As TitlesService
  Dim objMatches As GetTitlesMatches
  Dim intCounter As Integer

  ' Create the proxy and allow up to 10 seconds for the request
  objTitlesService = New TitlesService
  objTitlesService.Timeout = 10000
  objMatches = objTitlesService.GetTitles()

  ' Display each scraped title followed by its price
  For intCounter = 0 To objMatches.Titles.Length - 1
    lblTitles.Text &= "<p>TITLE: " & objMatches.Titles( intCounter )
    lblTitles.Text &= "<br>PRICE: " & objMatches.Prices( intCounter )
  Next
End Sub
</Script>

<html>
<head><title>TestTitlesService.aspx</title></head>
<body>

<h2>Scraped Titles</h2>

<asp:Label
  id="lblTitles"
  Runat="Server" />

</body>
</html>

The C# version of this code can be found on the CD-ROM.

Figure 23.6. Results of screen scraping.

In the Page_Load subroutine in Listing 23.22, an instance of the TitlesService proxy class is created. Next, the GetTitles() method is called to retrieve the arrays of matched titles and prices. Finally, all the titles and prices are displayed within a For...Next loop.
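Because the titles and the prices are matched by two independent patterns, the two arrays are not guaranteed to have the same length if the target page contains an entry that only one pattern matches. The following console sketch shows one defensive way to pair them. The arrays are hypothetical stand-ins for objMatches.Titles and objMatches.Prices, and the module name is made up.

```vb
Imports System

Module PairTitlesSketch
    Sub Main()
        ' Hypothetical results standing in for objMatches.Titles and objMatches.Prices.
        Dim titles() As String = {"Book One", "Book Two", "Book Three"}
        Dim prices() As String = {"19.99", "24.50"}

        ' Loop only over the positions that exist in both arrays.
        Dim pairCount As Integer = Math.Min(titles.Length, prices.Length)
        For i As Integer = 0 To pairCount - 1
            Console.WriteLine("TITLE: " & titles(i) & "  PRICE: " & prices(i))
        Next
    End Sub
End Module
```

Here only the first two titles are printed, because no third price was matched. Whether to skip, log, or display an unpaired title is a design choice that depends on the page being scraped.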

Using Input Parameters with HTML Pattern Matching

One limitation of the HTML pattern matching service that you created earlier is that it always passes the same query string variable (type=business). Therefore, it always retrieves a list of business titles. However, you might want to pass different input parameters and retrieve different types of books. To do so, you need to modify the WSDL document. A modified WSDL document is contained in Listing 23.23.

Listing 23.23 TitlesInput.Wsdl

<?xml version="1.0"?>
<definitions
  xmlns:s="http://www.w3.org/2001/XMLSchema"
  xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
  xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
  xmlns:tm="http://microsoft.com/wsdl/mime/textMatching/"
  xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
  xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
  xmlns:s0="http://yourdomain.com/webservices"
  targetNamespace="http://yourdomain.com/webservices"
  xmlns="http://schemas.xmlsoap.org/wsdl/">
  <types/>
  <message name="GetTitlesInputHttpGetIn">
    <part name="type" type="s:string" />
  </message>
  <message name="GetTitlesInputHttpGetOut" />
  <portType name="BookTitlesHttpGet">
    <operation name="GetTitlesInput">
      <input message="s0:GetTitlesInputHttpGetIn"/>
      <output message="s0:GetTitlesInputHttpGetOut"/>
    </operation>
  </portType>
  <binding name="BookTitlesHttpGet" type="s0:BookTitlesHttpGet">
    <http:binding verb="GET"/>
    <operation name="GetTitlesInput">
      <http:operation location="/Titles.aspx"/>
      <input>
        <http:urlEncoded/>
      </input>
      <output>
        <tm:text>
          <tm:match name='Titles' pattern='&lt;li&gt;(.*?)- \$' ignoreCase='true' repeats='100' />
          <tm:match name='Prices' pattern='\$([\d\.]+)' ignoreCase='true' repeats='100' />
        </tm:text>
      </output>
    </operation>
  </binding>
  <service name="TitlesServiceInput">
    <port name="BookTitlesHttpGet" binding="s0:BookTitlesHttpGet">
      <http:address location="http://localhost/webservices" />
    </port>
  </service>
</definitions>

The C# version of this code can be found on the CD-ROM.

Notice that the first message in the file in Listing 23.23 contains a <part> element, which represents an input parameter named type. The type parameter is passed to Titles.aspx when the Web service is called.
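The <http:urlEncoded/> element in the binding indicates that the message parts are sent as name=value pairs in the query string of the GET request. The following console sketch shows roughly how such a request address could be composed from the service address, the operation location, and the type part. The BuildRequestUrl function is a hypothetical helper written for this illustration; it is not part of the generated proxy class, and the proxy may escape some characters differently.

```vb
Imports System

Module UrlEncodedSketch
    ' Hypothetical helper: joins the service address, the operation location,
    ' and a single URL-encoded name=value pair.
    Function BuildRequestUrl(ByVal address As String, ByVal location As String, _
                             ByVal partName As String, ByVal partValue As String) As String
        Return address & location & "?" & _
               Uri.EscapeDataString(partName) & "=" & Uri.EscapeDataString(partValue)
    End Function

    Sub Main()
        ' Values taken from Listing 23.23, plus sample book types.
        Console.WriteLine(BuildRequestUrl("http://localhost/webservices", _
                                          "/Titles.aspx", "type", "business"))
        Console.WriteLine(BuildRequestUrl("http://localhost/webservices", _
                                          "/Titles.aspx", "type", "popular comp"))
    End Sub
End Module
```

The first call prints http://localhost/webservices/Titles.aspx?type=business. In the second call the space in the value is escaped as %20, which shows why values entered by a user cannot simply be appended to the address.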

To create the proxy class for the Web service, you need to execute the following statements:

wsdl /l:vb TitlesInput.wsdl
vbc /t:library /r:System.dll,System.Web.Services.dll,System.Xml.dll TitlesServiceInput.vb

Finally, you need to copy the compiled proxy class, TitlesServiceInput.dll, to the application's /bin directory.

You can use the page in Listing 23.24 to test the new Web service.

Listing 23.24 TestTitlesServiceInput.aspx

<Script runat="Server">
Sub Button_Click( s As Object, e As EventArgs )
  Dim objTitlesService As TitlesServiceInput
  Dim objMatches As GetTitlesInputMatches
  Dim intCounter As Integer

  objTitlesService = New TitlesServiceInput
  objTitlesService.Timeout = 10000
  ' Pass the book type entered in the form as the type parameter
  objMatches = objTitlesService.GetTitlesInput( txtBookType.Text )

  ' Display each scraped title followed by its price
  For intCounter = 0 To objMatches.Titles.Length - 1
    lblTitles.Text &= "<p>TITLE: " & objMatches.Titles( intCounter )
    lblTitles.Text &= "<br>PRICE: " & objMatches.Prices( intCounter )
  Next
End Sub
</Script>

<html>
<head><title>TestTitlesServiceInput.aspx</title></head>
<body>

<h2>Scraped Titles</h2>

<form runat="Server">
<asp:TextBox
  id="txtBookType"
  Runat="Server" />
<asp:Button
  Text="Go!"
  OnClick="Button_Click"
  Runat="Server" />
<p>
<asp:Label
  id="lblTitles"
  EnableViewState="False"
  Runat="Server" />
</form>

</body>
</html>

The C# version of this code can be found on the CD-ROM.

The page in Listing 23.24 contains a form that enables you to enter a book type. When you enter a type and click Go!, only books of that type are retrieved through the Web service and displayed (see Figure 23.7).

Figure 23.7. Using an input parameter with HTML pattern matching.

Building the Six Degrees Web Service

Before leaving the subject of building HTML pattern-matching Web services, I want to provide you with a more complicated example. In this section, you'll build the Six Degrees Web service.

The Six Degrees Web service enables you to enter any Web address into an HTML form. After you enter an address, the Web service retrieves all the links on the page you indicated. The Web service chooses one of the links and follows it to a new page. This process continues until the Web service cannot find a new link or makes 25 hops.

Figure 23.8 illustrates what might happen when you enter the address for the home page of the Microsoft Web site.

Figure 23.8. Using the Six Degrees Web Service.

The WSDL document for the Six Degrees Web service is contained in Listing 23.25.

Listing 23.25 SixDegrees.Wsdl

<?xml version="1.0"?>
<definitions
  xmlns:s="http://www.w3.org/2001/XMLSchema"
  xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
  xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
  xmlns:tm="http://microsoft.com/wsdl/mime/textMatching/"
  xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
  xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
  xmlns:s0="http://yourdomain.com/webservices"
  targetNamespace="http://yourdomain.com/webservices"
  xmlns="http://schemas.xmlsoap.org/wsdl/">
  <types/>
  <message name="GetLinksHttpGetIn" />
  <message name="GetLinksHttpGetOut" />
  <portType name="SixDegreesHttpGet">
    <operation name="GetLinks">
      <input message="s0:GetLinksHttpGetIn"/>
      <output message="s0:GetLinksHttpGetOut"/>
    </operation>
  </portType>
  <binding name="SixDegreesHttpGet" type="s0:SixDegreesHttpGet">
    <http:binding verb="GET"/>
    <operation name="GetLinks">
      <http:operation location="/"/>
      <input>
        <http:urlEncoded/>
      </input>
      <output>
        <tm:text>
          <tm:match name='Links' pattern='"(http://.*?)["/]' repeats="10" />
        </tm:text>
      </output>
    </operation>
  </binding>
  <service name="SixDegrees">
    <port name="SixDegreesHttpGet" binding="s0:SixDegreesHttpGet">
      <http:address location="http://localhost" />
    </port>
  </service>
</definitions>

The C# version of this code can be found on the CD-ROM.

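The WSDL document in Listing 23.25 uses the regular expression pattern "(http://.*?)["/] to match an Internet address. (Notice that this regular expression captures only the protocol and host name of each link, because the match stops at the first slash or quotation mark that follows http://. As a result, it matches only the home page of a Web site.)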
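The effect of this pattern is easy to check with the Regex class. The following console sketch is an illustration only: the two anchor tags are hypothetical, www.example.com and www.example.org are placeholder hosts, and the module name is made up. In the Visual Basic string literal, each quotation mark in the pattern is doubled.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LinkPatternSketch
    Sub Main()
        ' Hypothetical anchor tags: one deep link and one link to a home page.
        Dim sampleHtml As String = _
            "<a href=""http://www.example.com/products/list.aspx"">Products</a> " & _
            "<a href=""http://www.example.org"">Home</a>"

        ' The pattern from Listing 23.25: "(http://.*?)["/]
        Dim rxLinks As New Regex("""(http://.*?)[""/]")

        Dim m As Match
        For Each m In rxLinks.Matches(sampleHtml)
            ' Group 1 ends at the first slash or quotation mark after the host name.
            Console.WriteLine(m.Groups(1).Value)
        Next
    End Sub
End Module
```

For this input the sketch prints http://www.example.com and http://www.example.org. The path /products/list.aspx is dropped from the first link, which is why following a link always lands on a site's home page.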

You can use the following statements to build a proxy class from the WSDL file:

wsdl.exe /l:vb SixDegrees.wsdl
vbc /t:library /r:System.dll,System.Web.Services.dll,System.Xml.dll SixDegrees.vb

After you execute these statements, you need to copy the compiled proxy class, SixDegrees.dll, to your application's /bin directory.

The ASP.NET page in Listing 23.26 uses the Six Degrees Web service to follow Web site links.

Listing 23.26 DisplaySixDegrees.aspx

<Script Runat="Server">
Dim objSixDegrees As New SixDegrees
Dim objMatches As New GetLinksMatches
Dim objRandom As New Random
Dim colHistory As New ArrayList

Sub Button_Click( s As Object, e As EventArgs )
  objSixDegrees.UserAgent = "SixDegrees"
  objSixDegrees.Timeout = 5000
  objSixDegrees.AllowAutoRedirect = True
  GetNextLink( txtUrl.Text )
  dgrdHistory.DataSource = colHistory
  dgrdHistory.DataBind()
End Sub

Sub GetNextLink( strCurrentLink As String )
  Dim strNextLink As String

  ' Record the current page and point the proxy at it
  colHistory.Add( strCurrentLink )
  objSixDegrees.Url = strCurrentLink
  Try
    objMatches = objSixDegrees.GetLinks()
  Catch ex As Exception
    colHistory.Add( ex.Message )
  End Try

  ' Follow one link, unless there are none or the 25-entry limit is reached
  If Not ( objMatches.Links Is Nothing ) And colHistory.Count < 25 Then
    ' Randomly start from either end of the list of links
    If objRandom.Next( 2 ) = 1 Then
      Array.Reverse( objMatches.Links )
    End If
    ' Follow the first link that does not point back to the current page
    For Each strNextLink In objMatches.Links
      If strNextLink.ToLower() <> strCurrentLink.ToLower() Then
        GetNextLink( strNextLink )
        Exit For
      End If
    Next
  End If
End Sub
</Script>

<html>
<head><title>DisplaySixDegrees.aspx</title></head>
<body>

<h2>Six Degrees of Separation</h2>

<form runat="Server">
<asp:TextBox
  id="txtUrl"
  Columns="40"
  Text="http://"
  Runat="Server" />
<asp:Button
  Text="Go!"
  OnClick="Button_Click"
  Runat="Server"/>
</form>

<p>
<asp:DataList
  id="dgrdHistory"
  CellPadding="6"
  Gridlines="Both"
  AlternatingItemStyle-BackColor="lightblue"
  Runat="Server">
<ItemTemplate>
  <%# Container.ItemIndex %> -
  <asp:HyperLink
    Text='<%# Container.DataItem %>'
    NavigateUrl='<%# Container.DataItem %>'
    Runat="Server" />
</ItemTemplate>
</asp:DataList>

</body>
</html>

The C# version of this code can be found on the CD-ROM.

The page in Listing 23.26 recursively calls a subroutine named GetNextLink that uses the Six Degrees Web service to get a list of all the links in a Web page. The subroutine chooses one of the links and calls itself, once again passing the link to the new Web site. The list of links retrieved by leaping from Web site to Web site is displayed in a DataList control.
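The same walk can also be written as a loop instead of a recursive subroutine. The following console sketch is an alternative formulation, not the approach used in the listing. To stay self-contained, it replaces the Web service with a hypothetical in-memory table of links, always takes the first usable link instead of choosing randomly, and uses placeholder addresses. The Walk function and its parameters are made up for this illustration.

```vb
Imports System
Imports System.Collections.Generic

Module SixDegreesWalkSketch
    ' Follows one link per page until a page has no new link
    ' or maxHops pages have been recorded.
    Function Walk(ByVal links As Dictionary(Of String, String()), _
                  ByVal startUrl As String, ByVal maxHops As Integer) As List(Of String)
        Dim history As New List(Of String)
        Dim current As String = startUrl

        While current IsNot Nothing AndAlso history.Count < maxHops
            history.Add(current)
            Dim nextUrl As String = Nothing
            If links.ContainsKey(current) Then
                For Each candidate As String In links(current)
                    ' Skip links back to the current page (case-insensitive comparison).
                    If String.Compare(candidate, current, True) <> 0 Then
                        nextUrl = candidate
                        Exit For
                    End If
                Next
            End If
            current = nextUrl
        End While

        Return history
    End Function

    Sub Main()
        ' Hypothetical link table standing in for calls to GetLinks().
        Dim links As New Dictionary(Of String, String())
        links.Add("http://a.example", New String() {"http://A.example", "http://b.example"})
        links.Add("http://b.example", New String() {"http://c.example"})

        For Each url As String In Walk(links, "http://a.example", 25)
            Console.WriteLine(url)
        Next
    End Sub
End Module
```

With this table the walk visits http://a.example, http://b.example, and http://c.example, and then stops because the last page has no links. A loop avoids deep call stacks and makes the stopping conditions explicit, at the cost of being less close to the original listing.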
