雷达智富

.NET Core HttpClient error: "The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set."

2024-10-12

I was building a crawler with .NET Core. Downloading a page with HttpClient returned an HttpResponseMessage with status 200, but calling ReadAsStringAsync threw: System.InvalidOperationException: The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set.

My code looked like this:

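// Url holds the address of the page to download.
using (HttpClient client = new HttpClient()) {
    var res = await client.GetAsync(Url);
    // Throws when the charset named in Content-Type is not a registered encoding.
    var str = await res.Content.ReadAsStringAsync();
}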

The error output:

System.InvalidOperationException: The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set.
 ---> System.ArgumentException: 'gb2312' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter 'name')
   at System.Text.EncodingTable.InternalGetCodePageFromName(String name)
   at System.Text.EncodingTable.GetCodePageFromName(String name)
   at System.Text.Encoding.GetEncoding(String name)
   at System.Net.Http.HttpContent.ReadBufferAsString(ArraySegment`1 buffer, HttpContentHeaders headers)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpContent.ReadBufferAsString(ArraySegment`1 buffer, HttpContentHeaders headers)
   at System.Net.Http.HttpContent.ReadBufferedContentAsString()
   at System.Net.Http.HttpContent.<>c.<ReadAsStringAsync>b__36_0(HttpContent s)
   at System.Net.Http.HttpContent.WaitAndReturnAsync[TState,TResult](Task waitTask, TState state, Func`2 returnFunc)
   at BlazorSpider.Pages.Spider.handleClickAsync() in D:\GitHub\BlazorSpider\BlazorSpider\Pages\Spider.razor:line 49
   at Microsoft.AspNetCore.Components.ComponentBase.CallStateHasChangedOnAsyncCompletion(Task task)
   at Microsoft.AspNetCore.Components.RenderTree.Renderer.GetErrorHandledTask(Task taskToHandle, ComponentState owningComponentState)

Some posts online say to install the System.Text.Encoding.CodePages package and then register its provider:

EncodingProvider provider = CodePagesEncodingProvider.Instance;
Encoding.RegisterProvider(provider);

With that, the exception did go away, but the page content came back garbled.

Later I noticed that the response headers included content-encoding: gzip. The body was gzip-compressed, so HttpClient had to be set up to decompress the response.

Create an HttpClientHandler that handles gzip decompression and pass it to HttpClient:

EncodingProvider provider = CodePagesEncodingProvider.Instance;
Encoding.RegisterProvider(provider);
var handler = new HttpClientHandler() { AutomaticDecompression = System.Net.DecompressionMethods.GZip };
using (HttpClient client = new HttpClient(handler)) {
    var res = await client.GetAsync(Url);
    var str = await res.Content.ReadAsStringAsync();
}

Now there is no error and the page content comes back correctly.
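To see why the undecompressed body looked like garbage, it can help to reproduce the situation offline. The following is a minimal sketch, not part of the original crawler: the class name GzipDemo and its helper methods are made up for illustration. It compresses some text with GZipStream, checks for the gzip magic bytes (0x1F 0x8B), and decompresses it again, which is roughly the work that AutomaticDecompression takes care of inside the handler.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical demo class; the names here are illustrative only.
public static class GzipDemo
{
    // Compresses a byte array into gzip format.
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // Disposing the GZipStream flushes the gzip footer into output.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(data, 0, data.Length);
            }
            // ToArray still works after the MemoryStream has been closed.
            return output.ToArray();
        }
    }

    // Decompresses gzip data back into the original bytes.
    public static byte[] Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }

    // A gzip stream always starts with the two magic bytes 0x1F 0x8B.
    public static bool LooksLikeGzip(byte[] data)
    {
        return data.Length >= 2 && data[0] == 0x1F && data[1] == 0x8B;
    }

    public static void Main()
    {
        byte[] body = Compress(Encoding.UTF8.GetBytes("<html>hello</html>"));

        // Decoding the compressed bytes directly yields unreadable text.
        Console.WriteLine(LooksLikeGzip(body));
        Console.WriteLine(Encoding.UTF8.GetString(body));

        // Decompressing first recovers the markup.
        Console.WriteLine(Encoding.UTF8.GetString(Decompress(body)));
    }
}
```

If a server might also send deflate-compressed responses, the handler flags can be combined, for example DecompressionMethods.GZip | DecompressionMethods.Deflate.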

So is the System.Text.Encoding.CodePages package really required? I tried going without it and reading the page content another way:

using (HttpClient client = new HttpClient(handler)) {
    var res = await client.GetAsync(Url);
    var arr = await res.Content.ReadAsByteArrayAsync();
    var str = Encoding.UTF8.GetString(arr);
}

This approach does not need the System.Text.Encoding.CodePages package and does not throw, but the Chinese text in the returned page is garbled.
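Garbling of this kind is what one would expect when the bytes are in a legacy Chinese encoding but are decoded as UTF-8. A small sketch of the effect, assuming for illustration that the page bytes are GB2312: the class name MojibakeDemo is hypothetical, and the four bytes are the GB2312 encoding of "中文" (中 = D6 D0, 文 = CE C4). None of them forms a valid UTF-8 sequence, so the decoder substitutes the replacement character U+FFFD.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names here are illustrative only.
public static class MojibakeDemo
{
    // "中文" encoded as GB2312: 中 = D6 D0, 文 = CE C4.
    public static readonly byte[] Gb2312Bytes = { 0xD6, 0xD0, 0xCE, 0xC4 };

    // Decodes the bytes with the wrong encoding, as the snippet above does.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        string wrong = DecodeAsUtf8(Gb2312Bytes);

        // The result contains U+FFFD replacement characters instead of 中文.
        Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0);
        Console.WriteLine(wrong == "\u4E2D\u6587");
    }
}
```

Once the bytes have been turned into replacement characters the original text cannot be recovered from the string, so the fix has to happen at decoding time, on the raw byte array.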

Checking the headers showed content-type: text/html; charset=gb2312, so I changed the decoding line to:

var str = Encoding.GetEncoding("gb2312").GetString(arr);

Not only did this fail to fix the garbled text, it also threw a new error:

System.ArgumentException: 'gb2312' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter 'name')
   at System.Text.EncodingTable.InternalGetCodePageFromName(String name)
   at System.Text.EncodingTable.GetCodePageFromName(String name)
   at System.Text.Encoding.GetEncoding(String name)
   at BlazorSpider.Pages.Spider.handleClickAsync() in D:\GitHub\BlazorSpider\BlazorSpider\Pages\Spider.razor:line 55
   at Microsoft.AspNetCore.Components.ComponentBase.CallStateHasChangedOnAsyncCompletion(Task task)
   at Microsoft.AspNetCore.Components.RenderTree.Renderer.GetErrorHandledTask(Task taskToHandle, ComponentState owningComponentState)

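In the end I found that the System.Text.Encoding.CodePages package is needed after all. Once it is installed and its provider is registered with Encoding.RegisterProvider as shown earlier, the code above, Encoding.GetEncoding("gb2312").GetString(arr), no longer throws, and the text is no longer garbled.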
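The effect of the registration can be checked in isolation. This is a rough sketch under a few assumptions: the class name EncodingCheck and the method IsSupported are made up for illustration, and the byte values are the GB2312 encoding of "中文". It probes Encoding.GetEncoding before and after registering CodePagesEncodingProvider, then decodes the bytes.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names here are illustrative only.
public static class EncodingCheck
{
    // Returns true if Encoding.GetEncoding can currently resolve the name.
    public static bool IsSupported(string name)
    {
        try
        {
            Encoding.GetEncoding(name);
            return true;
        }
        catch (ArgumentException)
        {
            // Thrown for names that no registered provider knows about.
            return false;
        }
    }

    public static void Main()
    {
        // On .NET Core this is expected to report that gb2312 is not yet available.
        Console.WriteLine(IsSupported("gb2312"));

        // Register once, for example at application startup.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Console.WriteLine(IsSupported("gb2312"));

        // "中文" encoded as GB2312: 中 = D6 D0, 文 = CE C4.
        byte[] bytes = { 0xD6, 0xD0, 0xCE, 0xC4 };
        Console.WriteLine(Encoding.GetEncoding("gb2312").GetString(bytes));
    }
}
```

Registration is process-wide, so it only needs to run once before the first call that looks up a code page encoding, including the lookup that ReadAsStringAsync performs internally.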

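System.Text.Encoding.CodePages is a NuGet package, not a namespace. It supplies CodePagesEncodingProvider (in the System.Text namespace), which makes code page encodings such as GB2312 available to Encoding.GetEncoding. Out of the box, .NET Core supports only a small set of encodings such as ASCII, UTF-8, UTF-16, and UTF-32; anything else has to be registered through this provider before it can be used to convert text between encodings.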
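Putting the pieces together, one possible shape for the download step is sketched below. This is not the code from the original project: the class PageFetcher and the methods ResolveEncoding and FetchAsync are hypothetical names, the URL is assumed to arrive as the first command-line argument, and falling back to UTF-8 for a missing or unknown charset is a design assumption rather than something the article prescribes.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper class; the names here are illustrative only.
public static class PageFetcher
{
    // Maps the charset from Content-Type to an Encoding.
    // Falls back to UTF-8 when the charset is missing or unknown (an assumption).
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers quote the value, e.g. charset="gb2312".
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Downloads the raw bytes and decodes them with the declared charset.
    public static async Task<string> FetchAsync(HttpClient client, string url)
    {
        using (var res = await client.GetAsync(url))
        {
            res.EnsureSuccessStatusCode();
            byte[] bytes = await res.Content.ReadAsByteArrayAsync();
            string charset = res.Content.Headers.ContentType?.CharSet;
            return ResolveEncoding(charset).GetString(bytes);
        }
    }

    public static async Task Main(string[] args)
    {
        // Make code page encodings such as gb2312 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip
        };
        using (var client = new HttpClient(handler))
        {
            // args[0] is assumed to be the URL of the page to fetch.
            if (args.Length > 0)
            {
                Console.WriteLine(await FetchAsync(client, args[0]));
            }
        }
    }
}
```

Reading bytes and decoding them explicitly keeps a bad charset header from turning into an exception, at the cost of possibly garbled text when the fallback guess is wrong. A page that declares its charset only in a meta tag would need extra handling that this sketch does not attempt.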
