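jsoup is a Java library for working with HTML that I came across by chance. Its official site is http://jsoup.org/
1. Write a small trial program (the project only needs to reference jsoup-1.3.3.jar):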
    import java.io.File;
    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class Test {
        public static void main(String[] args) {
            Test t = new Test();
            t.parseFile();
        }

        public void parseString() {
            String html = "<html><head><title>blog</title></head>"
                    + "<body onload=\"test()\"><p>Parsed HTML into a doc.</p></body></html>";
            Document doc = Jsoup.parse(html);
            System.out.println(doc);
            Elements es = doc.body().getAllElements();
            System.out.println(es.attr("onload"));
            System.out.println(es.select("p"));
        }

        public void parseUrl() {
            try {
                Document doc = Jsoup.connect("http://www.baidu.com/").get();
                Elements hrefs = doc.select("a[href]");
                System.out.println(hrefs);
                System.out.println("------------------");
                System.out.println(hrefs.select("[href^=http]"));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public void parseFile() {
            try {
                File input = new File("input.html");
                Document doc = Jsoup.parse(input, "UTF-8");
                // Extract all the codes
                Elements codes = doc.body().select("td[title^=IA] > a[href^=javascript:view]");
                System.out.println(codes);
                System.out.println("------------------");
                System.out.println(codes.html());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
2. Output of parseString:
    <html>
     <head>
      <title>blog</title>
     </head>
     <body onload="test()">
      <p>Parsed HTML into a doc.</p>
     </body>
    </html>
    test()
    <p>Parsed HTML into a doc.</p>
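The `test()` line in the output comes from `es.attr("onload")`: called on an `Elements` collection, `attr` returns the value from the first matched element that has the attribute, and `getAllElements()` includes the body element itself. The following is a minimal sketch of that behavior; the class name, method names and the HTML string are assumptions made for illustration, modeled on the output above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class AttrLookupSketch {
    // Assumed input: only <body> carries an onload attribute.
    static final String HTML =
            "<html><head><title>blog</title></head>"
            + "<body onload=\"test()\"><p>Parsed HTML into a doc.</p></body></html>";

    // Returns the onload value of the first element under <body> (body included) that has one.
    static String firstOnload(String html) {
        Document doc = Jsoup.parse(html);
        Elements all = doc.body().getAllElements(); // body itself, then its descendants
        return all.attr("onload");                  // empty string if no element has it
    }

    // Returns the combined text of all <p> elements.
    static String paragraphText(String html) {
        return Jsoup.parse(html).select("p").text();
    }

    public static void main(String[] args) {
        System.out.println(firstOnload(HTML));
        System.out.println(paragraphText(HTML));
    }
}
```

If no element carries the attribute, `attr` should return an empty string rather than null, so the result can be printed without a null check.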
3. Output of parseUrl:
    <a href="/gaoji/preferences.html">设置</a>
    <a href="http://passport.baidu.com/?login&tpl=mn">登录</a>
    <a href="http://news.baidu.com">新 闻</a>
    <a href="http://tieba.baidu.com">贴 吧</a>
    <a href="http://zhidao.baidu.com">知 道</a>
    <a href="http://mp3.baidu.com">MP3</a>
    <a href="http://image.baidu.com">图 片</a>
    <a href="http://video.baidu.com">视 频</a>
    <a href="http://map.baidu.com">地 图</a>
    <a href="#" name="ime_hw">手写</a>
    <a href="#" name="ime_py">拼音</a>
    <a href="#" name="ime_cl">关闭</a>
    <a href="http://hi.baidu.com">空间</a>
    <a href="http://baike.baidu.com">百科</a>
    <a href="http://www.hao123.com">hao123</a>
    <a href="/more/">更多>></a>
    <a …="st" onclick="this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.baidu.com')" href="http://utility.baidu.com/traf/click.php?id=215&url=http://www.baidu.com">把百度设为主页</a>
    <a href="http://e.baidu.com/?refer=888">加入百度推广</a>
    <a href="http://top.baidu.com">搜索风云榜</a>
    <a href="http://home.baidu.com">关于百度</a>
    <a href="http://ir.baidu.com">About Baidu</a>
    <a href="/duty/">使用百度前必读</a>
    <a href="http://www.miibeian.gov.cn" target="_blank">京ICP证030173号</a>
    ------------------
    <a href="http://passport.baidu.com/?login&tpl=mn">登录</a>
    <a href="http://news.baidu.com">新 闻</a>
    <a href="http://tieba.baidu.com">贴 吧</a>
    <a href="http://zhidao.baidu.com">知 道</a>
    <a href="http://mp3.baidu.com">MP3</a>
    <a href="http://image.baidu.com">图 片</a>
    <a href="http://video.baidu.com">视 频</a>
    <a href="http://map.baidu.com">地 图</a>
    <a href="http://hi.baidu.com">空间</a>
    <a href="http://baike.baidu.com">百科</a>
    <a href="http://www.hao123.com">hao123</a>
    <a …="st" onclick="this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.baidu.com')" href="http://utility.baidu.com/traf/click.php?id=215&url=http://www.baidu.com">把百度设为主页</a>
    <a href="http://e.baidu.com/?refer=888">加入百度推广</a>
    <a href="http://top.baidu.com">搜索风云榜</a>
    <a href="http://home.baidu.com">关于百度</a>
    <a href="http://ir.baidu.com">About Baidu</a>
    <a href="http://www.miibeian.gov.cn" target="_blank">京ICP证030173号</a>
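The second half of the output shows that `[href^=http]` simply drops relative links such as `/more/`. If the goal is a list of usable URLs rather than a filtered subset, one option is to give jsoup a base URI and resolve each link with `absUrl("href")`. The sketch below assumes a hand-written HTML fragment and hypothetical class and method names; it parses a string instead of fetching the live page so that it runs offline.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinksSketch {
    // Resolves every href against baseUri, so relative links become absolute URLs.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            urls.add(a.absUrl("href"));
        }
        return urls;
    }

    // Keeps only links whose raw href starts with "http", as in the trial program.
    static List<String> httpLinks(String html) {
        List<String> urls = new ArrayList<>();
        for (Element a : Jsoup.parse(html).select("a[href^=http]")) {
            urls.add(a.attr("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed sample fragment, not the real Baidu page.
        String html = "<a href=\"/more/\">more</a>"
                + "<a href=\"http://news.baidu.com/\">news</a>";
        System.out.println(absoluteLinks(html, "http://www.baidu.com/"));
        System.out.println(httpLinks(html));
    }
}
```

When the document comes from `Jsoup.connect(...).get()`, the base URI is normally taken from the fetched URL, so `absUrl` can be used there without passing it explicitly.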
4. Output of parseFile:
    <a href="javascript:view('67530','67530','0');">IA100908-002</a>
    <a href="javascript:view('67529','67529','0');">IA100908-001</a>
    <a href="javascript:view('67544','67544','0');">IA100908-016</a>
    <a href="javascript:view('67364','67364','0');">IA100903-008</a>
    <a href="javascript:view('67363','67363','0');">IA100903-007</a>
    <a href="javascript:view('66104','66104','0');">IA100710-013</a>
    <a href="javascript:view('57916','57916','0');">IA100515-013</a>
    <a href="javascript:view('56962','56962','0');">IA100430-022</a>
    <a href="javascript:view('66958','66958','0');">IA100830-001</a>
    <a href="javascript:view('66319','66319','0');">IA100713-003</a>
    <a href="javascript:view('66317','66317','0');">IA100713-001</a>
    <a href="javascript:view('66321','66321','0');">IA100713-005</a>
    <a href="javascript:view('66967','66967','0');">IA100830-010</a>
    <a href="javascript:view('66999','66999','0');">IA100831-001</a>
    <a href="javascript:view('67377','67377','0');">IA100904-004</a>
    <a href="javascript:view('67378','67378','0');">IA100904-005</a>
    <a href="javascript:view('3271','3271','0');">IA080115-031</a>
    ------------------
    IA100908-002
    IA100908-001
    IA100908-016
    IA100903-008
    IA100903-007
    IA100710-013
    IA100515-013
    IA100430-022
    IA100830-001
    IA100713-003
    IA100713-001
    IA100713-005
    IA100830-010
    IA100831-001
    IA100904-004
    IA100904-005
    IA080115-031
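The two halves of this output correspond to the two print calls: printing an `Elements` object gives the outer HTML of each match, while `html()` gives only the inner HTML of each match, one per line. A minimal sketch of the difference, assuming a made-up two-link fragment (the class name, method name and link values are hypothetical):

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

class ElementsOutputSketch {
    // Hypothetical two-link fragment, shaped like the matches listed above.
    static final String HTML =
            "<a href=\"javascript:view('1','1','0');\">IA000000-001</a>"
            + "<a href=\"javascript:view('2','2','0');\">IA000000-002</a>";

    // Selects every anchor that has an href attribute.
    static Elements links(String html) {
        return Jsoup.parse(html).select("a[href]");
    }

    public static void main(String[] args) {
        Elements codes = links(HTML);
        System.out.println(codes);        // outer HTML: the full <a ...>...</a> of each match
        System.out.println("------------------");
        System.out.println(codes.html()); // inner HTML: only the content of each match
    }
}
```

For further processing, iterating over the elements and calling `text()` on each is usually more convenient than splitting the joined string that `html()` returns.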
One addition: the basic structure of input.html is shown in the figure:
Source: http://kinkding.iteye.com/blog/787465
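As a rough illustration of the kind of structure the selector `td[title^=IA] > a[href^=javascript:view]` expects, here is a minimal sketch using a hypothetical table fragment. The markup, the class name and the method name are assumptions and not the real input.html; the point is only to show that the `td` title must start with `IA` and the link must be a direct child of that cell.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CodeSelectorSketch {
    // Hypothetical fragment: only the first cell satisfies both conditions.
    static final String HTML = "<table><tr>"
            // title starts with IA and the link is a direct child: matches
            + "<td title=\"IA100908-002\"><a href=\"javascript:view('67530','67530','0');\">IA100908-002</a></td>"
            // title does not start with IA: skipped
            + "<td title=\"Remark\"><a href=\"javascript:view('67530','67530','0');\">details</a></td>"
            // link is wrapped in a span, so it is not a direct child of the td: skipped
            + "<td title=\"IA100908-001\"><span><a href=\"javascript:view('67529','67529','0');\">nested</a></span></td>"
            + "</tr></table>";

    // Returns the text of every link matched by the selector from the trial program.
    static List<String> codes(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element a : doc.body().select("td[title^=IA] > a[href^=javascript:view]")) {
            result.add(a.text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(codes(HTML));
    }
}
```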