caineng posted on 2009-6-25 15:35:11

Can the JavaScript pagination on Tianya forum content pages be scraped?

http://www.tianya.cn/techforum/content/402/9101.shtml


Content pages on the Tianya forum
use pagination links of the form javascript:GotoPage('3'). How can pages like this be scraped?

rq204 posted on 2009-6-25 16:32:39

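This page fetches its content with a POST request; it is not a static page. The POST request is shown below. In practice it is hard to scrape.
POST /techforum/content/402/9101.shtml HTTP/1.1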
Accept: image/gif, image/jpeg, image/pjpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-silverlight, application/x-shockwave-flash, */*
Referer: http://www.tianya.cn/techforum/content/402/9101.shtml
Accept-Language: zh-cn
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; QQDownload 538; .NET CLR 2.0.50727; CIBA; TheWorld)
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
Host: www.tianya.cn
Content-Length: 337
Connection: Keep-Alive
Pragma: no-cache
Cookie: vjuids=-2d58113e.12187936fc5.0.bb202becd88ce8; vjlast=1243520135,1243734076,13.1245918546.11; __guid=128146221; __ptime=1245918611203; __cid=29; user=w=helloworld&id=25154679; __utma=22245310.1228705546990093800.1244889428.1245168097.1245511573.7; __utmz=22245310.1244889428.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __gads=ID=341d553de46f8352:T=1244889498:S=ALNI_MZwwkdDRqVbqr5ICwkNi6UDv6mbow; right=web4=n&portal=n; temp4=rm=4E8FC9BAACF267E39209BC894C54A146; ASPSESSIONIDCCTDCSQD=GHLAOKHDOMFFMLIHENGLGDCP; ASPSESSIONIDACTCCSQD=GKLIJKHDHGEIFMILDIIMNBMM

rs_strTitle_aa=%5B%D0%A1%CB%B5%5D%D0%A3%D4%B0%B0%AE%C7%E9%D0%A1%CB%B5%3Cfont+color%3Dred%3E%3Cb%3E%A1%B6%CC%EC%C1%B9%BA%C3%B8%F6%C7%EF%A1%B7%C1%AC%D4%D8%D6%D0&intLogo=0&rs_permission=1&apn=83533%2C84717%2C85421%2C85986%2C86866%2C87593%2C88276%2C89155%2C89803%2C90485%2C91188%2C91654%2C92317%2C92835%2C93778%2C95389%2C99338%2C102560&pID=4
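
For anyone who wants to try replaying this request outside the scraper, here is a minimal Python sketch. It assumes that the pID field in the form body selects the page (the captured body has pID=4, but the thread does not confirm this) and that the other fields can be resent unchanged. The function name build_page_request is made up for illustration. Whether the server also needs the cookies from the capture is unknown.

```python
import re
import urllib.request

# URL taken from the captured request above.
THREAD_URL = "http://www.tianya.cn/techforum/content/402/9101.shtml"


def build_page_request(captured_body: str, page: int) -> urllib.request.Request:
    """Return a POST request that reuses a captured form body with a new pID.

    Assumption: pID is the page selector. The rest of the body is kept
    byte-for-byte, so the GBK percent-encoding of the title field survives.
    """
    new_body, count = re.subn(
        r"(^|&)pID=\d+",
        lambda m: f"{m.group(1)}pID={page}",
        captured_body,
    )
    if count != 1:
        raise ValueError("expected exactly one pID field in the captured body")

    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Referer": THREAD_URL,
    }
    # The body is already URL-encoded ASCII, so it only needs to become bytes.
    return urllib.request.Request(
        THREAD_URL,
        data=new_body.encode("ascii"),
        headers=headers,
        method="POST",
    )


# Usage (not run here): paste the full captured body, then send the request.
# req = build_page_request(captured_body, 3)
# html = urllib.request.urlopen(req).read()
```

Replacing only the pID value with a regular expression, instead of parsing and re-encoding the whole form, avoids having to guess the character encoding of the title field.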

998877 posted on 2009-7-17 09:35:00

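This is easy to solve: the 火车头 (LocoySpider) developers only need to add a "generate pagination URLs by pattern" feature to the tool. Unfortunately, it does not have that feature yet.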
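
As a rough illustration of what "generate pagination URLs by pattern" could mean, here is a small Python sketch. The {page} placeholder, the function name, and the example.com URL are all hypothetical; this is not how 火车头 works or will work.

```python
def generate_page_urls(template: str, first: int, last: int, step: int = 1) -> list[str]:
    """Fill a {page} placeholder with each page number from first to last."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return [template.format(page=n) for n in range(first, last + 1, step)]


# Hypothetical URL pattern, for illustration only.
urls = generate_page_urls("http://example.com/thread/9101_{page}.shtml", 1, 3)
```

This only helps when the page number appears in the URL itself. In the request captured earlier in this thread the page seems to be chosen by a field in the POST body, so the same idea would have to be applied to the form data instead.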