PS Tip: Parsing HTML from a local File or a String

If you are familiar with the Invoke-WebRequest cmdlet, you know that it can return parsed HTML for the requested URL (in Windows PowerShell; the ParsedHtml property is not available in PowerShell 6 and later). The DOM structure of this parsed HTML can be used to access the HTML elements of the web page, as in the example below.


$webRequest = Invoke-WebRequest "google.com"

$webRequest.ParsedHTML.getElementsByTagName("span") | % textContent


Problem

What if the HTML is saved in a local file, or held in a string? Is there a way to parse it from there?

Solution

The answer is yes.

Microsoft provides an HTML document class, which PowerShell can create as the COM object "HTMLFile". Its write() method writes HTML into the document, which can then be queried through the DOM (Document Object Model Level 2).


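Solution 1: From a string

# $content is assumed to already hold the HTML source as a string
$html = New-Object -ComObject "HTMLFile"

$html.IHTMLDocument2_write($content)

# List the text of every anchor (<a>) element
$html.all.tags("A") | % innerText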
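On some machines the IHTMLDocument2_write method is not exposed on the COM object (this is commonly reported when the MSHTML interop assembly that ships with Office is missing). The following is a minimal sketch of a fallback, not a definitive implementation: it tries IHTMLDocument2_write first and otherwise passes the string to write() as UTF-16 bytes. The sample markup in $content and its URLs are made up for illustration.

```powershell
# Hypothetical sample markup standing in for $content.
$content = '<html><body><a href="https://example.com/one">First</a> <a href="https://example.com/two">Second</a></body></html>'

$html = New-Object -ComObject "HTMLFile"
try {
    # Works when the typed MSHTML interface is exposed on the COM object.
    $html.IHTMLDocument2_write($content)
}
catch {
    # Fallback (assumption: the typed method is missing): call the
    # late-bound write() with the string encoded as UTF-16 bytes.
    $bytes = [System.Text.Encoding]::Unicode.GetBytes($content)
    $html.write($bytes)
}

# Collect the text of every anchor element.
$texts = $html.all.tags("A") | ForEach-Object innerText
$texts
```

Either branch leaves $html holding the same parsed document, so the rest of the script does not need to know which one ran.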

Solution 2: From a file

Similarly, we can parse an HTML document from a local HTML file.


$html = New-Object -ComObject "HTMLFile"

# -Raw reads the whole file as a single string instead of an array of lines
$html.IHTMLDocument2_write((Get-Content .\file.html -Raw))

$html.all.tags("A") | % innerText
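Once the file is parsed, the elements can be turned into ordinary PowerShell objects for further filtering or export. The sketch below is illustrative only: it first creates a small sample file (the file name, markup and URL are hypothetical), parses it with the same IHTMLDocument2_write call used above, and then emits one object per link with its text and target. It assumes that method is available on the machine.

```powershell
# Hypothetical sample file, created only so the sketch is self-contained.
$path = Join-Path $env:TEMP 'sample-links.html'
Set-Content -Path $path -Value '<html><body><a href="https://example.com/">Example</a></body></html>'

$html = New-Object -ComObject "HTMLFile"
$html.IHTMLDocument2_write((Get-Content -Path $path -Raw))

# Emit one object per anchor with its visible text and its href.
$links = $html.all.tags("A") | ForEach-Object {
    [pscustomobject]@{
        Text = $_.innerText
        Href = $_.href
    }
}
$links
```

Because $links is a plain collection of objects, it can be piped to cmdlets such as Where-Object or Export-Csv like any other PowerShell output.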

 

Note

The parsed HTML returned by Invoke-WebRequest is of the same type, HTMLDocumentClass:


$WR = Invoke-WebRequest "http://google.com"

$WR.ParsedHtml.GetType()

The type name shown in the output is HTMLDocumentClass.

 

 
