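jxnczy posted on 2007-3-17 09:07:03

When I scrape English sites, many characters are not captured correctly and the articles come out incomplete.

For example, the original text looks like this: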

Remember that old television series “The Twilight Zone”? It featured sci-fi fantasies that metaphorically demonstrated people’s hopes, fears and despairs. The actors would be cruising along enjoying a somewhat normal life and then suddenly they entered “a new dimension” where everything was confusing and intimidating.

Scraped with the "default" encoding, it comes out like this:

Remember that old television series 揟he Twilight Zone? It featured sci-fi
fantasies that metaphorically demonstrated people抯 hopes, fears and despairs.
The actors would be cruising along enjoying a somewhat normal life and then
suddenly they entered 揳 new dimension?where everything was confusing and
intimidating.
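
The garbling above is consistent with the page bytes being windows-1252 and the scraper's "default" encoding being GBK. Both are assumptions; the thread does not say which code page "default" maps to. Under that assumption, the byte for “ (0x93) is read as a GBK lead byte and swallows the letter that follows it. A minimal Python sketch that reproduces the effect (`misread_as_gbk` is a hypothetical helper name):

```python
# Sketch only: assumes the page bytes are windows-1252 and that the
# scraper's "default" encoding is GBK. Neither is confirmed in the thread.

def misread_as_gbk(text: str) -> str:
    """Encode text as windows-1252, then decode those same bytes as GBK."""
    return text.encode("cp1252").decode("gbk")

# The curly quote and the letter after it collapse into one Chinese character.
print(misread_as_gbk("series “The Twilight"))
# The curly apostrophe and the "s" after it collapse the same way.
print(misread_as_gbk("people’s hopes"))
```

A closing quote followed by a space or a question mark does not form a valid GBK pair, which may be why those spots show up as "?" in the sample.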

Scraped with UTF-8 encoding, it comes out like this:

Remember that old television series The Twilight Zone? It featured sci-fi
fantasies that metaphorically demonstrated peoples hopes, fears and despairs.
The actors would be cruising along enjoying a somewhat normal life and then
suddenly they entered a new dimension where everything was confusing and
intimidating.
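
The UTF-8 result, where the quotes and apostrophe simply vanish, fits the same picture: single bytes such as 0x92, 0x93 and 0x94 are not valid UTF-8 on their own, so a decoder that skips invalid bytes drops them. A small sketch, assuming the page is windows-1252 and assuming the scraper silently ignores bytes it cannot decode (`misread_as_utf8` is a hypothetical helper name):

```python
# Sketch only: assumes windows-1252 page bytes and a decoder that skips
# invalid input, modelled here with errors="ignore".

def misread_as_utf8(text: str) -> str:
    """Encode text as windows-1252, then decode as UTF-8, dropping bad bytes."""
    return text.encode("cp1252").decode("utf-8", errors="ignore")

# The curly quotes disappear; the plain ASCII text survives.
print(misread_as_utf8("series “The Twilight Zone”?"))
# The curly apostrophe disappears, leaving "peoples".
print(misread_as_utf8("people’s hopes"))
```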

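Several kinds of punctuation are not recognized correctly. When common symbols such as  '   &   " "  appear in an article, the article is either cut off or comes out garbled.

Could someone more experienced tell me how to solve this?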

netdream posted on 2007-3-17 14:22:12

How about trying the replace function?
This is a real problem. I hope 火车 can fix it.

jxnczy posted on 2007-3-17 19:38:48

Originally posted by netdream on 2007-3-17 14:22
How about trying the replace function?
This is a real problem. I hope 火车 can fix it.

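Replacement cannot fully solve this. For example,
the correct text is series “The, but after scraping it becomes series 揟he. The two characters “T have turned into the single garbled character 揟, and there is no pattern to it!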
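
If the garbling comes from windows-1252 bytes being decoded as GBK (an assumption, not something the thread confirms), the result does follow a pattern: the garbled character depends on which letter follows the quote, so a fixed replace list would need one entry per quote-and-letter pair. Reversing the decode is more general. A rough sketch, with `garble` and `repair` as hypothetical helper names:

```python
# Sketch only: assumes the garbling is windows-1252 bytes decoded as GBK.

def garble(text: str) -> str:
    """Simulate the mis-decode: windows-1252 bytes read as GBK."""
    return text.encode("cp1252").decode("gbk")

def repair(garbled: str) -> str:
    """Undo the mis-decode: recover the bytes via GBK, read them as windows-1252."""
    return garbled.encode("gbk").decode("cp1252")

# The same opening quote yields a different garbled character per next letter.
for next_letter in "Ta":
    pair = "“" + next_letter
    bad = garble(pair)
    print(repr(pair), "->", repr(bad), "->", repr(repair(bad)))
```

This only works where the original bytes survived. Spots that were already turned into "?" have lost the original byte and cannot be recovered this way.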


I hope this can be fixed. Otherwise scraping English text is really frustrating.

frogjy posted on 2007-6-20 18:33:26

It would be best to add a custom encoding option... I think some scraper already has this feature. 小蜜蜂, maybe?

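ocpsys posted on 2007-8-27 15:53:25

Version 3.1 still has this problem. It happens whenever you scrape a page with charset=iso-8859-1. It would be nice if more encoding choices were added.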
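
To illustrate what a per-page encoding choice could look like, here is a minimal Python sketch. It is not how 火车 works internally. `decode_page` and `LATIN1_LABELS` are hypothetical names, and the fallback to windows-1252 for unknown labels is an assumption. Browsers generally treat pages labelled iso-8859-1 as windows-1252, which is where the curly quotes live, so the sketch does the same.

```python
import codecs

# Labels that browsers commonly decode as windows-1252.
LATIN1_LABELS = {"iso-8859-1", "latin-1", "latin1", "us-ascii", "ascii"}

def decode_page(raw: bytes, declared_charset: str) -> str:
    """Decode raw page bytes using the charset the page declares."""
    label = declared_charset.strip().lower()
    if label in LATIN1_LABELS:
        label = "cp1252"
    try:
        codecs.lookup(label)       # check that Python knows this encoding
    except LookupError:
        label = "cp1252"           # assumed fallback for unknown labels
    # Undecodable bytes become U+FFFD instead of truncating the article.
    return raw.decode(label, errors="replace")

page_bytes = "series “The Twilight Zone”".encode("cp1252")
print(decode_page(page_bytes, "ISO-8859-1"))
```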

[ Last edited by ocpsys on 2007-8-27 15:55 ]