Oct 2 2011

SwimEventTimes W7P Development -- HTML Agility Pack

Category: � Administrator @ 08:13

When this app started out, a long time ago, I happened upon the HTML Agility Pack or HAP.  This tool was originally written for .Net and used XPath to search and define the elements that you needed from the original source HTML.  So when it comes time to SCRAPE data from a page, you'll need a tool that provides a robust and almost carefree nature about the HTML structure.  It does so much for you that without it I would have never attempted what I was thinking:

http://htmlagilitypack.codeplex.com/

Kudos to the authors, especially DarthObiwan, of this fantastic piece of work.

Basically it allows you to define what tags you want from a page and search and pull out that piece of the text.  But the twist here is that for HAP to work on the phone it had to work without Xpath since Xpath was not supported in the OS 7.0 release on the phone.

So what took the place of the Xpath is LINQ.  This added yet another dimension to my learning since I had never really used LINQ and had only recently started looking into using LINQ when I thought about converting a project from XSLT.  It takes time to make the mental switch from Xpath to LINQ but there was no other way.  Also, at that time, none of the code releases for HAP worked on the phone but by getting the source and following the comments of the authors on how to re-engineer the code I was able to get it compile and now it works like a charm.

So now let's talk about how we use HAP:

Here we have entire HTML loaded up in htmDoc.

//let's detemine what came back on the response

HtmlAgilityPack.HtmlDocument htmDoc = new HtmlAgilityPack.HtmlDocument();
htmDoc.LoadHtml(responseData);

Next you can start to look for specific items:

string pattern = @".*txtSearchLastName$"; 
var SearchPagenode = htmDoc.DocumentNode
                      .Descendants("input")
                      .FirstOrDefault(x => Regex.IsMatch(x.Id, pattern));


So now I can look at the element and get the id:

CTLID = SearchPagenode.Id;

Other things like pulling out <a> tags out of table contained in the last <tr>:

pattern = @".*dgPersonSearchResults$"; 
var links = htmDoc.DocumentNode.Descendants("table")
           .First(n => Regex.IsMatch(n.Id, pattern))
          .Elements("tr").Last() .Descendants("a")
          .Select(x => x.GetAttributeValue("href", "")).ToArray();


You can go crazy with HAP and as your LINQ gets better you can go further and refine these queries.

HAP provides the foundation and performs the grunt work required to interrogate/parse a HTML source.


Tags: , , , ,

Add comment