Parsing HTML to SQL using SQLDOM

Parsing HTML to SQL using SQLDOM

All Articles

Parsing HTML to SQL using SQLDOM

By David Rueter, 2014/12/26 (first published: 2012/04/16)

 

 

 

 

 

Microsoft SQL incorporates a good XML parser. Unfortunately, the XML parser is not very useful for parsing out HTML. Why would anyone want to parse HTML in SQL? There are three main reasons that I can think of:

  1. Import data contained in HTML directly to SQL tables
  2. Manipulate, analyze and/or normalize HTML data
  3. Render data stored in SQL directly to HTML at the server

I know that we usually think of HTML as being entirely separate from SQL. I’m not suggesting that parsing HTML in SQL is the right answer for every problem. Clearly the main point of HTML is to present information to client browsers. XML (or even JSON…but that is a different story) is a better way to get external data in and out of SQL.

But sometimes we do not have the luxury of defining the data formats we interact with. Sometimes we need to “screen scrape” data from web pages. Sometimes we need to render HTML and really don’t need or want to introduce a server-side scripting platform to accomplish what really amounts to some simple data transformation. Sometimes we have HTML or HTML fragments stored in our SQL tables that need some manipulation.

Parsing HTML into a DOM (document object model) made up of SQL tables can make sense for certain kinds of operations. This article will show how an open-source SQLDOM solution I created can be used to efficiently interact with HTML from SQL.

HTML is of course made up of data organized by tags. A very simple example would be:

<html>
<head>
<title>Sample HTML</title>
</head>
<body>
<div>Hello World</div>
</body>
</html>

Now this particular example could be parsed as XML; it happens to be a well-formed XML document. But HTML is often not well-formed XML. Some tags are “singleton” tags–that do not require end tags. Also, some tags, such as script tags, are allowed to contain characters that are not legal in XML. For example, maybe the word “world” is to be on a separate line. It would be legal HTML to write:

...
<div>
Hello <br> World
</div>
...

The <br> (for line break) does not have a closing tag. The XML parser will fail on this.

Suppose I want to represent this HTML document in a SQL table. Maybe I am only interested in getting the “Hello World” text, or maybe I want to capture and store the entire HTML document. Short of storing the whole thing as a varchar(MAX), SQL does not give us any good way of getting at the data. And of course, if we do store the entire document as a varchar(MAX), we still haven’t solved the problem of extracting just the elements we are interested in, nor of being able to modify certain elements.

I recently encountered precisely this need, and ended up writing an HTML parser and related DOM-manipulation procedures in T-SQL. I was pleased with the results, and though they might be useful to others. So I went ahead and released my SQLDOM as open source. You can find the project at http://sf.net/p/sqldom

“Project” is a bit of an overstatement. Really, this is a handful of temporary tables and temporary stored procedures. No modification to any user databases are required, so SQLDOM is minimally invasive. This makes it useful for both ad-hoc and production-related uses.

Here is how it works. From SSMS I simply execute the entire SQLDOM_core.sql script. This creates the needed temporary objects in tempdb, and prints out some instructions.

Then, I can do something like this:

EXEC #spactDOMLoad @HTML='<html>
<head>
<title>Sample HTML</title>
</head>
<body>
<div>
Hello <br> World
</div>
</body>
</html>'

That’s it: the HTML document is now parsed into SQL tables.

To retrieve the DOM information, I can do this:

EXEC #spgetDOM

 

This will return a resultset containing the nodes that were parsed:

DEID DocID Tag ID Name Class TextData

OpenTagStartPos

CloseTagEndPos

ParentDEID

HUID SortHUID DOMLevel
1 1 html NULL NULL NULL NULL 1 119 NULL 1 1 1
2 1 head NULL NULL NULL NULL 10 53 1 1.1 1.000001 2
3 1 title NULL NULL NULL NULL 18 43 2 1.1.1 000001.000001.000001 3
4 1 NULL NULL NULL NULL Sample HTML NULL NULL 3 1.1.1.1 000001.000001.000001.000001 4
5 1 body NULL NULL NULL NULL 57 110 1 1.2 1.000002 2
6 1 div NULL NULL NULL NULL 67 100 5 1.2.1 000001.000002.000001 3
7 1 NULL NULL NULL NULL Hello NULL NULL 6 1.2.1.1 000001.000002.000001.000001 4
8 1 b NULL NULL NULL NULL 82 53 6 1.2.1.2 000001.000002.000001.000002 4
9 1 NULL NULL NULL NULL World NULL NULL 8 1.2.1.2.1 000001.000002.000001.000002.000001 5

Let’s quickly explore these results:

DEID is the DOM Element ID. It is just a unique identity value for each row.

DocID will always be 1 when working with a single HTML document at a time. (SQLDOM supports multiple documents should the need arise.)

Tag is the HTML tag that was parsed out. A NULL tag indicates that the row is for a “text node”, or the text contained inside another HTML tag

ID, Name, and Class are the values of these special HTML elements, if they are set.

TextData is the text contained in a text node. (In the case of an HTML comment, the tag will be shown as !– and the TextData node of that row will contain the comment data)

OpenTagStartPos is the character position in the source HTML data of the < of the opening tag.

CloseTagEndPos is the character position in the source HTML data of the > of the closing tag.

ParentDEID is a self-referencing foreign key that points to the DEID of the row that contains the “parent” element.

HUID is the Hierarchical Unique ID. It is a simple outline-like dotted numbering system for the nodes.

SortHUID is the same HUID values, but each segment is padded with leading zeros so that is suitable for sorting

DOMLevel is the level of the element within the DOM. The root level element(s) will be 1, their children will be 2, and so on.

The actual DOM element data is stored in a temporary table #tblDOM. You can access it directly should you need to—but you may find it more convenient to utilize the SQLDOM stored procedures.

Note that any HTML attributes contained within tags other than the special attributes of ID, Name and Class are parsed too. These are stored in a temporary table named #tblDOMAttribs, and can be easily joined to the DOM data.

Additionally there is a temporary table #tblDOMStyles that contains the name-value pairs of CSS properties within the special style attribute. In other words, the complete style attribute data can be seen in #tblDOMAttribs, and this is then parsed further down to #tblDOMStyle

If you wanted to try out a “real” web page instead of just a simple “hello world” HTML document, you can do this:

DECLARE @HTML varchar(MAX)EXEC #sputilGetHTTP @URL ='http://www.google.com', @ResponseText = @HTML OUTPUT, @SuppressResultset =1EXEC #spactDOMLoad @HTML = @HTML

EXEC #spgetDOM

This will do the following:

  1. Retrieve the contents of the Google home page, and store it in the local @HTML variable
  2. Parse the contents of the @HTML variable to the DOM
  3. Output a resultset of the resulting nodes

 

Now what? Well, if you want to select only certain data from the DOM, you could create your own query. You can also use #spgetDOM, and pass in a value for the @Selector parameter. This accepts a JQuery-like CSS selector, such as ‘#myclass’, ‘.myid’, or ‘mytag’. The resultset that is returned will be filtered accordingly.

Note that currently only a single selector term can be passed in at a time. Future versions of SQLDOM will include more sophisticated @Selector support.

There is a lot more that SQLDOM can do in terms of DOM manipulation–but I’ll save that for a future part 2 of this article.

Before we end though, let’s quickly look at how we can take that DOM information and render it back out to HTML. To do so, once I have parsed HTML to the DOM via the #spactDOMLoad procedure I can simply do this:

EXEC #spgetDOMHTML @PrettyWhitespace =1, @PrintHTML =1

This will return:

<html>
<head>
<title>
Sample HTML
</title>
<body>
<div>
Hello
<br />
World
</div>
</body>
</head>
</html>

In other words, we can complete a round-trip:

  1. Parse HTML to SQL data
  2. Manipulate and/or store SQL data
  3. Render new HTML from SQL data

 

This technique is powerful: it allows you to take HTML, parse it to SQL data, and then perform updates on the SQL data… but then render updated HTML.

But even if your requirements don’t include outputing HTML, SQLDOM can come in handy for simply parsing HTML so that the contents can be stored as SQL data.

Feel free to download and use the SQLDOM source from http://sf.net/p/sqldom.  I’ll try to answer any questions you may have–either on the SQL Server Central discussion, or on the SourceForge forum.

Next time in Part 2 I will continue this article with information on how SQLDOM can be used in more advanced ways and an exploration of some different use cases.

Comments are closed.