在VBA里实现的screen scripting

在VBA里写过了次爬虫,在其他语言里大概爬虫都很成熟了,姑且记一下.

为了储存HTMLDocument格式的object,先在reference里加入microsoft html object library.

Dim aWB As Control
Dim vFrame As HTMLDocument
Dim vBuffer Ass String
Dim vTimeLimit As Date
Dim aTimeOut As Integer
Dim vURL As String

在这里是用access里的web control作为object打开网站,也可以直接打开html的网站.

Set aWB = Me!WB # the web control object
aWB.Object.Silent = True #让 web control不要报错
vURL = “www.google.ca”
aWB.ControlSource = "=(""" & vURL & """)"
aWB.Refresh

screen scripting的时候需要等网页两次.
一次是网页本体是否有load出来.
第二次是网页是否有load出所有的object.

Public Sub WaitForBrowserReady(ByRef aWebObj As Object, Optional ByVal aTimeOut As Long = 10)
  On Error GoTo ERR_Handler:

  Dim vTimeLimit As Date
  vTimeLimit = DateAdd("s", aTimeOut, VBA.Now)

  Do While aWebObj.Busy Or aWebObj.ReadyState <> 4
    If DateDiff("s", vTimeLimit, VBA.Now) > 0 Then
      Err.Raise # Add Err message here
    End If
    DoEvents
  Loop

EXIT_CleanUp:
  Exit Sub
ERR_CleanUp:
  On Error GoTo 0
  Err.Raise # Add Err message here
ERR_Handler:
  # Store Err message
  Resume ERR_CleanUp:
End Sub

等待page是否有load完全部就比较tricky,需要去某个page寻找“最后打印”出来的object,然后测是否有读出来.

Public Sub WaitForPageReady(ByRef aWebObj As Object, ByVal aWebStr As String, Optional ByVal aTimeOut As Long = 10)
  Dim vFrame As HTMLDocument
  Dim vElement As Object
  Dim vBuffer As Variant
  Dim vTimeLimit As Date
  vTimeLimit = DateAdd("s", aTimeOut, VBA.Now)

  On Error Resume Next

  Select Case aWebStr
  Case "Page1" # depends on different kind of page, the "last load object" is different
    Do While vBuffer <> "theTarget"
      Set vFrame = aWeb.Object.Document
      Set vElement = vFrame.Document.GetElementsByClassName("some class name here")
      vBuffer = vElement.innerText # get the compare string
      If DateDiff("s", vTimeLimit, VBA.Now) > 0 Then
          Err.Raise # Add Err message here
        End If
        DoEvents
      Loop
  End Select

EXIT_CleanUp:
  On Error GoTo 0
  Err.Clear
  Exit Sub
ERR_CleanUp:
  On Error GoTo 0
  Err.Raise # Add Err message here
ERR_Handler:
  # Store Err message
  Resume ERR_CleanUp:
End Sub

在pass这两个function后就可以开始scripting网页上的内容了.
注意,在access 网页object的时候,除了class name,还有frame的区别.
需要在“可以读取的frame”内,access 相应的object,不然读不到.
辨别方式:在inspect,select element in the page to inspect看看是否有不同的frame.

在这里查看属于哪一个frame.

注意vFrame.Document.GetElementsByClassName -> 根据不同需求,可以用getElementById等其他function.
Mentor的代码是自己搓了个类似WaitForBrowserReady的Function.
如果一定时间内getElement function没有get到target的内容,也time out 给error然后退出.

总结是爬虫跟其他代码区别,它是一种对着动态来写码(?),脑子里一定要分好如果有什么condition,代码是否能有对应的select case能跑得到.
然后就finger crossed(不是).