When scraping English sites, many characters are not captured correctly, and the articles come out incomplete.
For example, the original text reads: Remember that old television series “The Twilight Zone”? It featured sci-fi fantasies that metaphorically demonstrated people’s hopes, fears and despairs. The actors would be cruising along enjoying a somewhat normal life and then suddenly they entered “a new dimension” where everything was confusing and intimidating.
Scraped with the default encoding, it comes out like this:
Remember that old television series 揟he Twilight Zone? It featured sci-fi
fantasies that metaphorically demonstrated people抯 hopes, fears and despairs.
The actors would be cruising along enjoying a somewhat normal life and then
suddenly they entered 揳 new dimension?where everything was confusing and
intimidating.
Scraped with UTF-8 encoding, it comes out like this:
Remember that old television series The Twilight Zone? It featured sci-fi
fantasies that metaphorically demonstrated peoples hopes, fears and despairs.
The actors would be cruising along enjoying a somewhat normal life and then
suddenly they entered a new dimension where everything was confusing and
intimidating.
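Several kinds of punctuation in the text are not recognized correctly. Common symbols such as ' & " " either cut the article off or turn into garbled characters whenever they appear in an article.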
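One possible explanation, offered as a sketch and not a confirmed diagnosis: the page bytes are Windows-1252, where curly quotes and apostrophes are the single bytes 0x91 to 0x94. The code below assumes the scraper's "default" encoding is GBK (the thread does not say so). Under that assumption a byte such as 0x93 is read together with the letter after it as one two-byte Chinese character, while a UTF-8 decoder treats the same byte as invalid and drops it. The sample string and variable names are illustrative only.

```python
# Assumptions (not stated in the thread): the page is Windows-1252 encoded
# and the scraper's "default" encoding is GBK.
original = "series “The Twilight Zone”? people’s hopes"

# The bytes as they would arrive from the server.
raw = original.encode("cp1252")

# Decoding as GBK: each quote byte pairs up with the next byte when that
# byte is a valid GBK trail byte, producing one unrelated Chinese character.
as_gbk = raw.decode("gbk", errors="replace")

# Decoding as UTF-8 while skipping invalid bytes: the quotes simply vanish.
as_utf8 = raw.decode("utf-8", errors="ignore")

# Decoding with the encoding the bytes were actually written in.
as_cp1252 = raw.decode("cp1252")

print(as_gbk)
print(as_utf8)
print(as_cp1252)
```

If the assumption holds, the first two printed lines show the same two kinds of damage as the samples above, and only the third matches the original.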
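Could an expert tell me how to fix this?
Have you tried the replace function?
This is a real problem; hopefully 火车 can fix it.
Originally posted by netdream on 2007-3-17 14:22
Have you tried the replace function?
This is a real problem; hopefully 火车 can fix it.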
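Replacement cannot fully solve this problem. For example:
The normal text is series “The, but after scraping it becomes series 揟he. The two characters “T have turned into the single garbled character 揟, and there is no regular pattern to it!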
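A short sketch of why a fixed replacement table struggles here, again assuming the bytes are Windows-1252 and were decoded as GBK (an assumption, not something the thread confirms). The garbled character is built from the quote byte plus whatever letter follows it, so each quote-and-letter pair yields a different character. `garble` and `repair` are hypothetical helper names for illustration.

```python
def garble(text: str) -> str:
    """Simulate the assumed failure: Windows-1252 bytes decoded as GBK."""
    return text.encode("cp1252").decode("gbk", errors="replace")


def repair(garbled: str) -> str:
    """Undo the assumed failure by reversing the two steps.

    Raises UnicodeEncodeError or UnicodeDecodeError when the text holds
    characters that do not fit this round trip.
    """
    return garbled.encode("gbk").decode("cp1252")


# The same opening quote turns into a different character depending on
# the letter that follows it.
for following in "Ta":
    pair = "“" + following
    print(repr(pair), "->", repr(garble(pair)))

print(repr(repair("series 揟he")))
```

The reversal only works where the quote byte paired up cleanly with the next byte. Where the scraper already emitted a plain `?` (as in `dimension?where`), the original byte is gone and cannot be recovered this way, so decoding with the right encoding at scrape time is the more reliable route.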
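I hope this gets fixed, otherwise scraping English text is really frustrating.
It would be best to add a custom encoding option... I think some other scraper has this feature, 小蜜蜂 maybe?
Version 3.1 still has this problem. This is what happens when scraping pages with charset=iso-8859-1. It would be good to have a few more encoding choices.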
[ Last edited by ocpsys on 2007-8-27 15:55 ]
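For anyone post-processing pages outside the scraper, here is a minimal sketch of what a custom encoding choice could look like. It reads the charset declared in the page and, following the WHATWG Encoding Standard's label mapping, treats iso-8859-1 as Windows-1252, since pages declared that way often contain Windows-1252 curly quotes. `decode_page` is a hypothetical helper, not a feature of 火车 or any other scraper, and the 2048-byte search window is an arbitrary choice.

```python
import re

# Labels that browsers decode as Windows-1252 (WHATWG Encoding Standard).
WINDOWS_1252_LABELS = {"iso-8859-1", "iso8859-1", "latin1", "ascii", "us-ascii"}

# Finds charset=... in a <meta> tag near the top of the raw page bytes.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def decode_page(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw page bytes using the charset the page declares."""
    match = META_CHARSET.search(raw[:2048])
    label = match.group(1).decode("ascii").lower() if match else fallback
    codec = "cp1252" if label in WINDOWS_1252_LABELS else label
    try:
        return raw.decode(codec, errors="replace")
    except LookupError:
        # Python does not know the declared charset name; use the fallback.
        return raw.decode(fallback, errors="replace")


page = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
    b"<p>\x93The Twilight Zone\x94</p>"
)
print(decode_page(page))
```

Reading the declaration from the page itself avoids a single global default, which is what goes wrong when a scraper configured for Chinese sites is pointed at an English one.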