Parsing JavaScript-Generated Pages with Jsoup and HtmlUnit

Created: November-22, 2018

page.html - source code

<html>
<head>
    <script src="loadData.js"></script>
</head>
<body onLoad="loadData()">
    <div class="container">
        <table id="data" border="1">
            <tr>
                <th>col1</th>
                <th>col2</th>
            </tr>
        </table>
    </div>
</body>
</html>

loadData.js

    // append rows and cells to table#data in page.html
    function loadData() {
        var data = document.getElementById("data");
        for (var row = 0; row < 2; row++) {
            var tr = document.createElement("tr");
            for (var col = 0; col < 2; col++) {
                var td = document.createElement("td");
                // cell text is "<row>.<col>", e.g. "0.1"
                td.appendChild(document.createTextNode(row + "." + col));
                tr.appendChild(td);
            }
            data.appendChild(tr);
        }
    }

page.html when loaded in a browser

col1	col2
0.0	0.1
1.0	1.1
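
The four cell values above follow directly from the nested loop in loadData(). As a rough plain-Java mirror of that loop (no DOM or browser involved; the class name `ExpectedCells` and the method `expectedCellTexts` are hypothetical names chosen for illustration), the cell texts a script-aware parser should eventually find can be computed like this:

```java
import java.util.ArrayList;
import java.util.List;

public class ExpectedCells {

    // Mirrors the nested loop in loadData(): one text per cell, formatted "row.col".
    static List<String> expectedCellTexts(int rows, int cols) {
        List<String> texts = new ArrayList<>();
        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < cols; col++) {
                // int + String + int concatenates to e.g. "0.1", like the JavaScript version
                texts.add(row + "." + col);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // loadData() uses 2 rows and 2 columns
        for (String text : expectedCellTexts(2, 2)) {
            System.out.println(text);
        }
    }
}
```

A list like this can serve as the expected value when checking that a parser has actually seen the script-generated rows.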

Using Jsoup to parse page.html and extract the column data

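    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // load source from file
    Document doc = Jsoup.parse(new File("page.html"), "UTF-8");

    // iterate over each row, then over each cell in that row
    for (Element row : doc.select("table#data > tbody > tr")) {
        for (Element col : row.select("td")) {
            // print the cell text
            System.out.println(col.ownText());
        }
    }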

Output

(empty)

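What happened?

Jsoup parses the source code exactly as it is delivered by the server (or, in this example, loaded from a file). It does not execute client-side operations such as JavaScript or CSS-driven DOM manipulation. Because loadData() never runs, the rows and cells are never appended to the data table, so the selector finds no td elements.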
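
To make this concrete, here is a minimal plain-Java sketch that uses no Jsoup at all: it simply counts tags in the raw markup of page.html, which is all a static parser ever receives. The class name `StaticSourceCheck` and the helper `countOccurrences` are hypothetical, and the page source is embedded as a string constant purely so the example is self-contained.

```java
public class StaticSourceCheck {

    // The markup of page.html as stored on disk, before any script has run.
    static final String PAGE_SOURCE =
        "<html>\n" +
        "<head>\n" +
        "    <script src=\"loadData.js\"></script>\n" +
        "</head>\n" +
        "<body onLoad=\"loadData()\">\n" +
        "    <div class=\"container\">\n" +
        "        <table id=\"data\" border=\"1\">\n" +
        "            <tr>\n" +
        "                <th>col1</th>\n" +
        "                <th>col2</th>\n" +
        "            </tr>\n" +
        "        </table>\n" +
        "    </div>\n" +
        "</body>\n" +
        "</html>\n";

    // Counts non-overlapping occurrences of token in text (token must be non-empty).
    static int countOccurrences(String text, String token) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(token, from)) != -1) {
            count++;
            from += token.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // The header cells are part of the static source ...
        System.out.println("<th> tags: " + countOccurrences(PAGE_SOURCE, "<th>")); // prints 2
        // ... but the data cells only exist after loadData() has run in a browser.
        System.out.println("<td> tags: " + countOccurrences(PAGE_SOURCE, "<td>")); // prints 0
    }
}
```

Since the static source contains no td tags, any parser that does not execute scripts has nothing to select.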

How can the page be parsed as it is rendered in a browser?

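    // package names assume HtmlUnit 2.x; HtmlUnit 3.x uses org.htmlunit instead
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    // load the page using HtmlUnit, which executes its scripts
    WebClient webClient = new WebClient();
    HtmlPage myPage = webClient.getPage(new File("page.html").toURI().toURL());

    // serialize the script-modified DOM and re-parse it with Jsoup
    // (doc is the Document variable declared in the Jsoup-only snippet)
    doc = Jsoup.parse(myPage.asXml());

    // iterate over each row, then over each cell in that row
    for (Element row : doc.select("table#data > tbody > tr")) {
        for (Element col : row.select("td")) {
            // print the cell text
            System.out.println(col.ownText());
        }
    }

    // clean up resources
    webClient.close();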

Output

0.0
0.1
1.0
1.1