Because I was a bad person in a past life and need to be punished, I work in a place where the one and only tool for doing calculations is Excel, and my data source is a cryptic enterprise intranet that uses ActiveX for everything and does not have an accessible API.
My current task was to take a table of data from our intranet and do some statistics on it. Naturally the intranet cannot output a CSV, nor does it have nice static URLs for me to point Excel's query table at. It has some generic .aspx page that needs buttons pressed and ActiveX controls run before it can generate an HTML table. This is the table I want in my spreadsheet.
I can easily jump through the hoops of getting to the table by using VBA from within Excel to take control of Internet Explorer:
    Sub RefreshData()
        ' Opens the intranet and runs the request
        Dim ie As Object
        Dim ieDoc As Object

        ' this opens Internet Explorer and points it at the url
        Set ie = CreateObject("internetexplorer.application")
        ie.Visible = True
        ie.Navigate "http://local.intranet/request.aspx"

        ' wait until the page loads before doing anything
        Do Until (ie.readyState = 4 And Not ie.Busy)
            DoEvents ' DoEvents releases the macro and lets Excel do other things while it waits
        Loop

        ' look at the document returned and find the buttons that need pressing
        Set ieDoc = ie.Document
        Dim buttons As Object
        Set buttons = ieDoc.getElementsByTagName("button")

        Dim i As Long
        Dim notClicked As Boolean

        ' loop through all the buttons until it finds the one I want, then click that
        i = 0
        notClicked = True
        Do While i < buttons.Length And notClicked
            If buttons(i).Title = "request" Then
                buttons(i).Click
                notClicked = False
            End If
            i = i + 1
        Loop

        ' once more wait while the page reloads
        Do Until (ie.readyState = 4 And Not ie.Busy)
            DoEvents
        Loop

        ' re-fetch the document in case the reload replaced it
        Set ieDoc = ie.Document

        ' find the table I want to save
        Dim tables As Object
        Dim tbl As Object
        Set tables = ieDoc.getElementsByTagName("TABLE")
        Set tbl = tables(2)

        ' Do something with the table
    End Sub
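The wait loop appears twice above, and as written it spins forever if the intranet never finishes loading. A minimal sketch of pulling it into a helper with a timeout follows. The name WaitForPage and the 30-second default are made up for illustration and are not part of the original macro.

```vba
' Hypothetical helper: waits until IE reports the page as loaded, or gives up
' after timeoutSeconds. Returns True if the page finished loading in time.
Function WaitForPage(ie As Object, Optional ByVal timeoutSeconds As Long = 30) As Boolean
    Dim deadline As Date
    deadline = DateAdd("s", timeoutSeconds, Now)

    ' readyState 4 means the document has completely loaded
    Do Until (ie.readyState = 4 And Not ie.Busy)
        DoEvents ' let Excel stay responsive while waiting
        If Now > deadline Then
            WaitForPage = False
            Exit Function
        End If
    Loop

    WaitForPage = True
End Function
```

Each Do Until loop in RefreshData could then become something like `If Not WaitForPage(ie) Then Exit Sub`, which at least fails visibly instead of hanging.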
At this point I could step through the table row by row and populate a sheet:
    Dim row As Object
    Dim col As Object
    Dim some_row As Range
    Dim some_cell As Range

    Set some_row = Sheets("Sheet1").Range("A1") ' where I want to put the data
    Set some_cell = some_row

    For Each row In tbl.Rows
        For Each col In row.Cells
            some_cell.Value = col.innerText      ' set the cell value
            Set some_cell = some_cell.Offset(0, 1) ' move the cell right 1
        Next col
        Set some_row = some_row.Offset(1, 0)     ' move down 1 row
        Set some_cell = some_row
    Next row
That is what I tried initially, and it sort of worked. I had two issues with it:
- Excel auto-formatted the contents of the cells and buggered up all the dates (oh, didn't you know? The local intranet uses its own custom date string format!). I would have to sort out which cells are dates and then parse them during the population phase.
- This literally took 5 minutes to run. During that time Excel sat there glumly filling in rows at a glacial pace. There are 1500 rows in the table I'm copying in, and I guess going row by row does not scale, like at all.
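For reference, here is a sketch of how the cell-by-cell loop might be softened on both counts. It sets each destination cell to the Text number format before writing, so Excel stores the string verbatim, and it switches off screen updating and automatic recalculation while the loop runs. The name CopyTableAsText is hypothetical, and this has not been timed against the intranet table, so whether it makes a real dent in the 5 minutes is an open question.

```vba
' Hypothetical variant of the loop: writes every cell as text and suppresses
' redraws/recalculation while it runs. tbl is the HTML table object found by
' RefreshData; topLeft is the first destination cell.
Sub CopyTableAsText(tbl As Object, topLeft As Range)
    Dim prevCalc As XlCalculation
    Dim errNum As Long, errDesc As String
    Dim r As Long, c As Long
    Dim htmlRow As Object

    prevCalc = Application.Calculation
    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual
    On Error GoTo CleanUp

    For r = 0 To tbl.Rows.Length - 1
        Set htmlRow = tbl.Rows(r)
        For c = 0 To htmlRow.Cells.Length - 1
            With topLeft.Offset(r, c)
                .NumberFormat = "@"                 ' Text: no date auto-parsing
                .Value = htmlRow.Cells(c).innerText
            End With
        Next c
    Next r

CleanUp:
    ' remember any error, restore Excel's settings, then re-raise it
    errNum = Err.Number
    errDesc = Err.Description
    Application.Calculation = prevCalc
    Application.ScreenUpdating = True
    If errNum <> 0 Then Err.Raise errNum, , errDesc
End Sub
```

It would be called as `CopyTableAsText tbl, Sheets("Sheet1").Range("A1")`. The dates arrive as plain strings, so parsing the custom format is still a separate job.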
BUT if you use a query table (what you get when you go through the menus to set up a website as a data source, for example) it only takes a fraction of a second to parse, and I can just set an option to not parse dates. Unfortunately there is no static URL I can point the query table at. Recall that the table is buried behind some ActiveX obfuscation, and the intranet site updates without the URL changing and without using GET/POST or anything convenient like that.
My terrible, shameful solution was to keep all the code that finds the table and then, instead of walking through it, cache it to an HTML file and open that as a query table. Like I said, stupid solution. Maybe I can fix it later so it checks how new the cache is and bypasses scraping the site if it is new enough, thereby turning a kludge into a feature.
    ' set the path to the cache
    Dim pth As String
    pth = ThisWorkbook.Path & "/_cache.htm"

    ' set up the file system object that will save the html
    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    Dim oFile As Object
    Set oFile = fso.CreateTextFile(pth)
    oFile.WriteLine tbl.outerHTML ' saves the table, tags and all
    oFile.Close

    ' clean up all those objects
    Set fso = Nothing
    Set oFile = Nothing
    ie.Quit
    Set ie = Nothing
    Set ieDoc = Nothing
    Set tables = Nothing

    ' Parse the cached html table back into the spreadsheet
    Dim qt As QueryTable
    Dim ws As Worksheet
    Set ws = Sheets("Sheet1")
    Set qt = ws.QueryTables.Add(Connection:="URL;file:\\" & pth, Destination:=ws.Range("A1"))
    qt.RefreshOnFileOpen = False
    qt.FieldNames = True                 ' my data table has headers
    qt.WebSelectionType = xlAllTables    ' there's only one table in the cache
    qt.WebDisableDateRecognition = True  ' no date recognition please! I can do that later
    qt.MaintainConnection = False
    qt.RefreshStyle = xlOverwriteCells   ' if you don't set this it inserts cells to the left
    qt.Refresh BackgroundQuery:=False
    qt.Delete                            ' no reason to leave the connection around

    Set qt = Nothing
    Set ws = Nothing
With this setup, what used to take literal minutes takes a fraction of a second.
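The "check how new the cache is" idea could look roughly like the sketch below. The function name CacheIsFresh and the maxAgeMinutes parameter are invented for illustration. It uses the same Scripting.FileSystemObject as the caching code, and treats a missing file as stale.

```vba
' Hypothetical helper: True if the cache file exists and was last modified
' no more than maxAgeMinutes ago.
Function CacheIsFresh(pth As String, ByVal maxAgeMinutes As Long) As Boolean
    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")

    CacheIsFresh = False
    If fso.FileExists(pth) Then
        ' DateDiff with "n" counts minutes between the two timestamps
        CacheIsFresh = (DateDiff("n", fso.GetFile(pth).DateLastModified, Now) <= maxAgeMinutes)
    End If

    Set fso = Nothing
End Function
```

The refresh routine could then wrap the Internet Explorer part in something like `If Not CacheIsFresh(pth, 15) Then ... End If` and always run the query table part. The 15 minutes is an arbitrary placeholder.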
I can't stress enough how important it is to set qt.RefreshStyle; otherwise, by default, every time you run this it inserts the new data to the left of the old. This way Excel overwrites, which is what I and all right-thinking individuals want.
I chained this together with some filtering and other routines and tied the whole mess to a button, so now the spreadsheet gives you a nice dashboard and you can click the refresh button to query the intranet for data. It works quite nicely even if it is sketchy.
What I would like to do, but I can't find any documentation on how to do, is take the HTML table tbl and turn it directly into an Array. This seems like an obvious thing I should be able to do, and yet my google-fu fails me here.
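I don't know of a built-in call that does this, so the sketch below is just a plain loop that copies each cell's innerText into a 2D Variant array. The names HtmlTableToArray and DumpTableViaArray are hypothetical. It ignores colspan and rowspan, and it still reads the DOM one cell at a time, so it may well be slower than the query table route. The gain is on the Excel side, where the whole array lands in the sheet in a single assignment.

```vba
' Hypothetical helper: copies an HTML table object into a 1-based 2D array.
' Returns Empty if the table has no rows or no cells. Short rows are padded
' with Empty; colspan/rowspan are not handled.
Function HtmlTableToArray(tbl As Object) As Variant
    Dim nRows As Long, nCols As Long
    Dim r As Long, c As Long
    Dim htmlRow As Object
    Dim result() As Variant

    nRows = tbl.Rows.Length

    ' the widest row decides how many columns the array needs
    For r = 0 To nRows - 1
        If tbl.Rows(r).Cells.Length > nCols Then nCols = tbl.Rows(r).Cells.Length
    Next r

    If nRows = 0 Or nCols = 0 Then
        HtmlTableToArray = Empty
        Exit Function
    End If

    ReDim result(1 To nRows, 1 To nCols)
    For r = 0 To nRows - 1
        Set htmlRow = tbl.Rows(r)
        For c = 0 To htmlRow.Cells.Length - 1
            result(r + 1, c + 1) = htmlRow.Cells(c).innerText
        Next c
    Next r

    HtmlTableToArray = result
End Function

' Example use: write the whole array to the sheet in one go, as text
Sub DumpTableViaArray(tbl As Object)
    Dim data As Variant
    data = HtmlTableToArray(tbl)
    If IsEmpty(data) Then Exit Sub

    With Sheets("Sheet1").Range("A1").Resize(UBound(data, 1), UBound(data, 2))
        .NumberFormat = "@"   ' keep the custom date strings untouched
        .Value = data
    End With
End Sub
```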