0%

在VBA里实现的screen scripting

在VBA里写过了次爬虫,在其他语言里大概爬虫都很成熟了,姑且记一下.



为了储存HTMLDocument格式的object,先在reference里加入microsoft html object library.

1
2
3
4
5
6
Dim aWB As Control
Dim vFrame As HTMLDocument
Dim vBuffer Ass String
Dim vTimeLimit As Date
Dim aTimeOut As Integer
Dim vURL As String

在这里是用access里的web control作为object打开网站,也可以直接打开html的网站.

1
2
3
4
5
Set aWB = Me!WB # the web control object
aWB.Object.Silent = True #让 web control不要报错
vURL = “www.google.ca”
aWB.ControlSource = "=(""" & vURL & """)"
aWB.Refresh

screen scripting的时候需要等网页两次.
一次是网页本体是否有load出来.
第二次是网页是否有load出所有的object.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
Public Sub WaitForBrowserReady(ByRef aWebObj As Object, Optional ByVal aTimeOut As Long = 10)
On Error GoTo ERR_Handler:

Dim vTimeLimit As Date
vTimeLimit = DateAdd("s", aTimeOut, VBA.Now)

Do While aWebObj.Busy Or aWebObj.ReadyState <> 4
If DateDiff("s", vTimeLimit, VBA.Now) > 0 Then
Err.Raise # Add Err message here
End If
DoEvents
Loop

EXIT_CleanUp:
Exit Sub
ERR_CleanUp:
On Error GoTo 0
Err.Raise # Add Err message here
ERR_Handler:
# Store Err message
Resume ERR_CleanUp:
End Sub

等待page是否有load完全部就比较tricky,需要去某个page寻找“最后打印”出来的object,然后测是否有读出来.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
Public Sub WaitForPageReady(ByRef aWebObj As Object, ByVal aWebStr As String, Optional ByVal aTimeOut As Long = 10)
Dim vFrame As HTMLDocument
Dim vElement As Object
Dim vBuffer As Variant
Dim vTimeLimit As Date
vTimeLimit = DateAdd("s", aTimeOut, VBA.Now)

On Error Resume Next

Select Case aWebStr
Case "Page1" # depends on different kind of page, the "last load object" is different
Do While vBuffer <> "theTarget"
Set vFrame = aWeb.Object.Document
Set vElement = vFrame.Document.GetElementsByClassName("some class name here")
vBuffer = vElement.innerText # get the compare string
If DateDiff("s", vTimeLimit, VBA.Now) > 0 Then
Err.Raise # Add Err message here
End If
DoEvents
Loop
End Select

EXIT_CleanUp:
On Error GoTo 0
Err.Clear
Exit Sub
ERR_CleanUp:
On Error GoTo 0
Err.Raise # Add Err message here
ERR_Handler:
# Store Err message
Resume ERR_CleanUp:
End Sub

在pass这两个function后就可以开始scripting网页上的内容了.
注意,在access 网页object的时候,除了class name,还有frame的区别.
需要在“可以读取的frame”内,access 相应的object,不然读不到.
辨别方式:在inspect,select element in the page to inspect看看是否有不同的frame.

在这里查看属于哪一个frame.

注意vFrame.Document.GetElementsByClassName -> 根据不同需求,可以用getElementById等其他function.
Mentor的代码是自己搓了个类似WaitForBrowserReady的Function.
如果一定时间内getElement function没有get到target的内容,也time out 给error然后退出.



总结是爬虫跟其他代码区别,它是一种对着动态来写码(?),脑子里一定要分好如果有什么condition,代码是否能有对应的select case能跑得到.
然后就finger crossed(不是).