Original poster | Posted on 2008-11-9 13:24:17
Screen Scraping with XPath
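Screen scraping means extracting the information you need from a page's HTML. Because HTML is written for human users to read on a screen rather than for programs to consume directly, someone inevitably has to work out the extraction pattern by hand. So far I know of three approaches: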
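1. Regular expressions, of course. But they treat a page that carries rich hierarchical structure as a flat character stream, so the extraction patterns become as complicated as you would expect. Another drawback is that they are extremely sensitive to changes in the page's markup.
2. The second approach uses a tool such as NekoHTML to convert the HTML into well-formed XML and then extracts the data by walking the DOM tree. This makes full use of the page's inherent structure, but it is still tedious to code and is prone to NullPointerExceptions at run time.
3. The third approach, and the best I have found so far, is to use the XPath language to extract data from the XML DOM tree that corresponds to the HTML. No tree-traversal code is needed; a single, very compact string is enough. In addition, Firefox's DOM Inspector and XPather add-ons make it easy to find a robust XPath string. For example, to extract the titles of the articles on my blog, the following XPath expression, evaluated with some XPath library, pulls them out automatically: //*[@class='tit']/A/text()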
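To make the difference between approaches 2 and 3 concrete, here is a minimal sketch that runs the same title query twice: once by walking the DOM by hand and once with the XPath string above. It is not the code used on the real blog. The sample markup and the class and method names are made up for illustration, and it uses the JDK's built-in javax.xml.xpath on a small well-formed string rather than NekoHTML plus Xalan. The element names are written in uppercase because NekoHTML uppercases element names by default.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TitleExtractionSketch {
    // Hypothetical sample markup; uppercase names mimic NekoHTML's default output.
    static final String SAMPLE =
        "<HTML><BODY>"
        + "<DIV class='tit'><A href='/post/1'>First post</A></DIV>"
        + "<DIV class='nav'><A href='/home'>Home</A></DIV>"
        + "<DIV class='tit'><A href='/post/2'>Second post</A></DIV>"
        + "</BODY></HTML>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample markup should be well-formed", e);
        }
    }

    // Approach 2: find class='tit' elements and their A children by hand.
    static List<String> titlesByTraversal(Document dom) {
        List<String> titles = new ArrayList<>();
        NodeList all = dom.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (!"tit".equals(e.getAttribute("class"))) {
                continue;
            }
            for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE && "A".equals(c.getNodeName())) {
                    titles.add(c.getTextContent());
                }
            }
        }
        return titles;
    }

    // Approach 3: the same query expressed as a single XPath string.
    static List<String> titlesByXPath(Document dom) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[@class='tit']/A/text()", dom, XPathConstants.NODESET);
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                titles.add(nodes.item(i).getNodeValue());
            }
            return titles;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document dom = parse(SAMPLE);
        System.out.println(titlesByTraversal(dom));
        System.out.println(titlesByXPath(dom));
    }
}
```

Both methods return the two titles and skip the navigation link. If the page layout changes, the XPath version usually needs only a new string, while the traversal version needs new code.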
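Below is a simple program that extracts the links you want from a web page. It requires nekohtml.jar, xercesImpl.jar, xalan.jar, and other packages (the download step also uses Apache Commons HttpClient):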
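// Imports assume Commons HttpClient 3.x, NekoHTML, and Xalan on the classpath.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.transform.TransformerException;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.xpath.XPathAPI;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// The class name is arbitrary; it only wraps the listing so that it compiles.
public class XPathScraper {
    public static void main(String[] args) throws Exception {
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Please input the URL: ");
        String url = stdin.readLine();
        System.out.println("Please input the charset (default is GB2312): ");
        String charset = stdin.readLine();
        if (charset == null || charset.trim().length() == 0) {
            charset = "GB2312";
            System.out.println("Assuming default charset: GB2312");
        }

        HttpClient userAgent = new HttpClient();
        DOMParser domParser = new DOMParser();
        HttpMethod method = new GetMethod(url);
        try {
            System.out.println("Start downloading...");
            userAgent.executeMethod(method);
            System.out.println("Done!");
            if (method.getStatusCode() == HttpStatus.SC_OK) {
                InputSource source = new InputSource(method.getResponseBodyAsStream());
                source.setEncoding(charset);
                // NekoHTML does not implement XML namespace support;
                // see http://people.apache.org/~andyc/neko/doc/html/faq.html#hierarchy
                domParser.setFeature("http://xml.org/sax/features/namespaces", false);
                domParser.parse(source);
                Document dom = domParser.getDocument();
                if (dom == null) {
                    throw new IllegalStateException("The parser produced no document");
                }
                dom.normalize();
                Node rootNode = dom.getDocumentElement();
                System.out.println("The DOM tree has been produced successfully.");

                // Start with "n": `continue` in a do-while jumps to the condition,
                // so an invalid first XPath must not end the loop.
                String yesOrNo = "n";
                List<String> history = new ArrayList<String>();
                do {
                    if (history.size() > 0) {
                        System.out.println("============");
                        System.out.println("history:");
                        for (String item : history) {
                            System.out.println("\t" + item);
                        }
                        System.out.println("============");
                    }
                    System.out.println("Please input your XPath for Anchor:");
                    String xpathForAnchor = stdin.readLine();
                    if (xpathForAnchor == null) {
                        break; // end of input
                    }
                    history.add(xpathForAnchor);
                    NodeList anchorNodes;
                    try {
                        anchorNodes = XPathAPI.selectNodeList(rootNode, xpathForAnchor);
                    } catch (TransformerException e) {
                        System.out.println("Your xpath is not valid, sorry");
                        continue;
                    }
                    // Print the text content of every matched node.
                    for (int i = 0; i < anchorNodes.getLength(); i++) {
                        Node anchorNode = anchorNodes.item(i);
                        StringBuilder sb = new StringBuilder();
                        sb.append(String.format("<item>%n"));
                        sb.append(String.format("\t<text>%s</text>%n", anchorNode.getTextContent()));
                        sb.append(String.format("</item>%n"));
                        System.out.println(sb.toString());
                    }
                    System.out.println("Are you satisfied with the result (y/n)? ");
                    yesOrNo = stdin.readLine();
                    if (yesOrNo == null) {
                        break; // end of input
                    }
                } while (!yesOrNo.startsWith("y"));
            }
        } finally {
            method.releaseConnection();
        }
    }
}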
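The program above prints only the text content of each matched node. If the link target is wanted as well, the href attribute can be read from each matched anchor element. The sketch below shows one way to do that, under stated assumptions: the sample markup, the class name AnchorLinkSketch, and the method formatAnchors are all hypothetical, and it uses the JDK's javax.xml.xpath on a well-formed string instead of NekoHTML and Xalan so that it runs without extra jars.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class AnchorLinkSketch {
    // Hypothetical markup standing in for a page already converted to XML.
    static final String SAMPLE = "<HTML><BODY><UL>"
            + "<LI><A href='http://example.com/a'>Page A</A></LI>"
            + "<LI><A name='top'>No target</A></LI>"
            + "<LI><A href='/b.html'> Page B </A></LI>"
            + "</UL></BODY></HTML>";

    // Returns one <item> line per matched element that carries an href.
    static List<String> formatAnchors(String xml, String xpathForAnchor) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList anchors = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpathForAnchor, dom, XPathConstants.NODESET);
            List<String> items = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Node node = anchors.item(i);
                if (node.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // text and attribute nodes have no href of their own
                }
                Element anchor = (Element) node;
                if (!anchor.hasAttribute("href")) {
                    continue; // named anchors without a target are skipped
                }
                items.add(String.format("<item><text>%s</text><href>%s</href></item>",
                        anchor.getTextContent().trim(), anchor.getAttribute("href")));
            }
            return items;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + xpathForAnchor, e);
        }
    }

    public static void main(String[] args) {
        for (String item : formatAnchors(SAMPLE, "//A")) {
            System.out.println(item);
        }
    }
}
```

The same filtering could instead be pushed into the expression itself, for example //A[@href], which keeps the Java side even smaller. Note that the sketch does not escape special characters in the output.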