HTML Scraping with Excel and VBA

Posted in hacks and kludges on Thursday, April 23 2015

Because I was a bad person in a past life, and I need to be punished, I work in a place where the one and only tool for doing calculations is Excel, and my data source is a cryptic enterprise intranet that uses ActiveX for everything and does not have an accessible API.

My current task was to take a table of data from our intranet and do some statistics on it. Naturally the intranet cannot output a CSV, nor does it have nice static URLs for me to point Excel's query table at. It has some generic .aspx page that needs buttons pressed and ActiveX controls run before it can generate an HTML table. That is the table I want in my spreadsheet.

I can easily jump through the hoops of getting to the table by using VBA from within Excel to take control of Internet Explorer:

Sub RefreshData()
' Opens the intranet and runs the request '

Dim ie As Object
Dim ieDoc As Object

'this opens internet explorer and points it at the url
Set ie = CreateObject("internetexplorer.application")
ie.Visible = True
ie.Navigate "http://local.intranet/request.aspx"

' wait until the page loads before doing anything
Do Until (ie.readyState = 4 And Not ie.Busy)
    DoEvents ' DoEvents releases the macro and lets Excel do other things while it waits
Loop

' look at the document returned and find the buttons that need pressing
Set ieDoc = ie.Document
Dim buttons As Object
Set buttons = ieDoc.getElementsByTagName("button")

Dim i As Long
Dim notClicked As Boolean

' loop through all the buttons until I find the one I want, then click it
i = 0
notClicked = True
Do While i < buttons.Length And notClicked
    If buttons(i).Title = "request" Then
        buttons(i).Click
        notClicked = False
    End If
    i = i + 1
Loop

' once more, wait while the page reloads
Do Until (ie.readyState = 4 And Not ie.Busy)
    DoEvents
Loop

' find the table I want to save
Dim tables As Object
Dim tbl As Object
Set tables = ieDoc.getElementsByTagName("TABLE")
Set tbl = tables(2)    ' the table I want happens to be the third one (the collection is zero-indexed)

' Do something with the table

End Sub

At this point I could step through the table row by row and populate a sheet:

Dim row As Object
Dim col As Object
Dim some_row As Range
Dim some_cell As Range
Set some_row = Sheets("Sheet1").Range("A1")    ' where I want to put the data
Set some_cell = some_row
For Each row in tbl.Rows
    For Each col in row.Cells
        some_cell.Value = col.innerText        ' set the cell value
        Set some_cell = some_cell.Offset(0,1)   ' move the cell right 1
    Next col
    Set some_row = some_row.Offset(1,0)         ' move the cell down 1
    Set some_cell = some_row
Next row

That is what I tried initially, and it sort of worked. I had two issues with it:

  1. Excel auto-formatted the contents of the cells and buggered up all the dates (oh, didn't you know? the local intranet uses its own custom date string format!). I will have to sort out which cells are dates and then parse them myself during the population phase.
  2. This literally took 5 minutes to run. During that time Excel sat there glumly filling in rows at a glacial pace. There are 1500 rows in the table I'm copying, and I guess going row by row does not scale, like at all.
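For the date issue, I'll end up writing something like the function below. To be clear, I'm not going to reveal the intranet's actual format here, so this sketch assumes a hypothetical "dd.mm.yyyy hh:mm" string; swap the Split logic to match whatever your source actually emits. Building the date with DateSerial and TimeSerial avoids CDate guessing the locale wrong.

```vba
' Hypothetical parser: assumes the source writes dates as "dd.mm.yyyy hh:mm".
' Adjust the Split delimiters and field order to match the real format.
Function ParseIntranetDate(s As String) As Date
    Dim parts() As String   ' date part and time part
    Dim dmy() As String     ' day, month, year
    Dim hm() As String      ' hour, minute
    parts = Split(Trim(s), " ")
    dmy = Split(parts(0), ".")
    hm = Split(parts(1), ":")
    ParseIntranetDate = DateSerial(CInt(dmy(2)), CInt(dmy(1)), CInt(dmy(0))) _
                      + TimeSerial(CInt(hm(0)), CInt(hm(1)), 0)
End Function
```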

BUT if you use a query table (what you get when you go through the menus to set up a website as a data source, for example) it only takes a fraction of a second to parse, and I can just set an option to not parse dates. Unfortunately there is no static URL I can point the query table at; recall that the table is buried behind some ActiveX obfuscation, and the intranet site updates without the URL changing, or using GET/POST, or anything convenient like that.

My terrible, ashamed solution was to keep all the code that finds the table and then, instead of walking through it, cache it to an HTML file and open that as a query table. Like I said, a stupid solution. Maybe I can fix it later so it checks how fresh the cache is and bypasses scraping the site if it is new enough, thereby turning a kludge into a feature.

' set the path to the cache
Dim pth As String
pth = ThisWorkbook.Path & "\_cache.htm"

' set up the file system object that will save the html
Dim fso As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Dim oFile As Object
Set oFile = fso.CreateTextFile(pth)
oFile.WriteLine tbl.outerHTML        ' saves the table, tags and all
oFile.Close

' clean up all those objects
Set oFile = Nothing
Set fso = Nothing


Set ie = Nothing
Set ieDoc = Nothing
Set tables = Nothing

' Parse the cached html table back into the spreadsheet

Dim qt As QueryTable
Dim ws As Worksheet
Set ws = Sheets("Sheet1")
Set qt = ws.QueryTables.Add(Connection:="URL;file://" & pth, Destination:=ws.Range("A1"))

qt.RefreshOnFileOpen = False
qt.FieldNames = True                 ' my data table has headers
qt.WebSelectionType = xlAllTables    ' there's only one table in the cache
qt.WebDisableDateRecognition = True  ' no date recognition please! I can do that later
qt.MaintainConnection = False
qt.RefreshStyle = xlOverwriteCells   ' if you don't set this it inserts cells to the left
qt.Refresh BackgroundQuery:=False
qt.Delete                            ' no reason to leave the connection around

Set qt = Nothing
Set ws = Nothing

With this set up, what used to take literal minutes now takes a fraction of a second.

I can't stress enough how important it is to set qt.RefreshStyle: otherwise, by default, every time you run this it inserts the new data to the left of the old. This way Excel overwrites, which is what I and all right-thinking individuals want.

I chained this together with some filtering and other routines and tied the whole mess to a button, so now the spreadsheet gives you a nice dashboard and you can click the refresh button to query the intranet for data. It works quite nicely even if it is sketchy.

What I would like to do, but can't find any documentation on how to do, is take the HTML table tbl and turn it directly into an array. This seems like an obvious thing I should be able to do, and yet my google-fu fails me here.
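For what it's worth, the brute-force version isn't too bad: walk the table's DOM rows into a 2D Variant array, then write the whole array to the sheet in a single Range assignment. It's still a row-by-row walk, just in memory rather than against the worksheet, which is where the slowness was. A sketch, untested against the real intranet table and assuming a rectangular table with no merged cells:

```vba
' Sketch: copy an HTML table element into a 2D Variant array.
' Assumes every row has the same number of cells (no merged cells).
Function TableToArray(tbl As Object) As Variant
    Dim arr() As Variant
    Dim r As Long, c As Long
    ReDim arr(1 To tbl.Rows.Length, 1 To tbl.Rows(0).Cells.Length)
    For r = 0 To tbl.Rows.Length - 1
        For c = 0 To tbl.Rows(r).Cells.Length - 1
            arr(r + 1, c + 1) = tbl.Rows(r).Cells(c).innerText
        Next c
    Next r
    TableToArray = arr
End Function

' Usage: one bulk write instead of thousands of single-cell writes
' Dim arr As Variant
' arr = TableToArray(tbl)
' Sheets("Sheet1").Range("A1").Resize(UBound(arr, 1), UBound(arr, 2)).Value = arr
```

If there is a one-liner DOM call that does the conversion directly, I haven't found it either; but the bulk Range assignment at the end is the part that actually buys the speed.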