Blog has been moved. Actual post url: http://aakinshin.net/en/blog/dotnet/mono-utf8-conversions/.
This post is a logical continuation of the Jon Skeet's blog post “When is a string not a string?”. Jon showed very interesting things about behavior of ill-formed Unicode strings in .NET. I wondered about how similar examples will work on Mono. And I have got very interesting results.
Experiment 1: Compilation
Let's take the Jon's code with a small modification. We will just add text
null check in DumpString
:
using System; using System.ComponentModel; using System.Text; using System.Linq; [Description(Value)] class Test { const string Value = "X\ud800Y"; static void Main() { var description = (DescriptionAttribute)typeof(Test). GetCustomAttributes(typeof(DescriptionAttribute), true)[0]; DumpString("Attribute", description.Description); DumpString("Constant", Value); } static void DumpString(string name, string text) { Console.Write("{0}: ", name); if (text != null) { var utf16 = text.Select(c => ((uint) c).ToString("x4")); Console.WriteLine(string.Join(" ", utf16)); } else Console.WriteLine("null"); } }
Let's compile the code with MS.NET (csc) and Mono (mcs). The resulting IL files will have one important distinction:
// MS.NET compiler .custom instance void class [System]System.ComponentModel.DescriptionAttribute::'.ctor'(string) = (01 00 05 58 ED A0 80 59 00 00 ) // ...X...Y.. // Mono compiler .custom instance void class [System]System.ComponentModel.DescriptionAttribute::'.ctor'(string) = (01 00 05 58 59 BF BD 00 00 00 ) // ...XY.....
The interesting fact 1: MS.NET and Mono transform original C# strings to UTF-8 IL strings in different ways. But both ways give non-valid UTF-8 strings (58 ED A0 80 59
and 58 59 BF BD 00
).
Experiment 2: Run
Ok, let's run it:
// MS.NET compiler / MS.NET runtime Attribute: 0058 fffd fffd 0059 Constant: 0058 d800 0059 // MS.NET compiler / Mono runtime Attribute: null Constant: 0058 d800 0059 // Mono compiler / MS.NET runtime Attribute: 0058 0059 fffd fffd 0000 Constant: 0058 d800 0059 // Mono compiler / Mono runtime Attribute: null Constant: 0058 d800 0059
The interesting fact 2: Mono runtime can't use our non-valid UTF-8 IL strings. Instead, Mono use null
.
Experiment 3: Manual UTF-8 to String conversion
Ok, but what if we create non-valid UTF-8 string in runtime? Let's check it! The code:
using System; using System.Text; using System.Linq; class Test { static void Main() { DumpString("(1)", Encoding.UTF8.GetString( new byte[] { 0x58, 0xED, 0xA0, 0x80, 0x59 })); DumpString("(2)", Encoding.UTF8.GetString( new byte[] { 0x58, 0x59, 0xBF, 0xBD, 0x00 })); } static void DumpString(string name, string text) { Console.Write("{0}: ", name); if (text != null) { var utf16 = text.Select(c => ((uint)c).ToString("x4")); Console.WriteLine(string.Join(" ", utf16)); } else Console.WriteLine("null"); } }
And the result:
// MS.NET runtime (1): 0058 fffd fffd 0059 (2): 0058 0059 fffd fffd 0000 // Mono runtime (1): 0058 fffd fffd fffd 0059 (2): 0058 0059 fffd fffd 0000
The interesting fact 3: MS.NET and Mono implement UTF-8 to String conversion in different ways. The ED A0 80
sequence transforms to FFDD FFDD
on MS.NET and to FFDD FFDD FFDD
on Mono.
Experiment 4: Manual String to UTF-8 conversion
Let's look to the reverse conversion (from String to UTF-8). The code:
var bytes = Encoding.UTF8.GetBytes("X\ud800Y"); Console.WriteLine(string.Join(" ", bytes.Select(b => b.ToString("x2"))));
And the result:
// MS.NET runtime 58 ef bf bd 59 // Mono runtime 58 59 bf bd 00
The interesting fact 4: MS.NET and Mono implement String to UTF-8 conversion in different ways too.
Experiment 5: Prohibition of ill-formed string
Also, Jon's has written about prohibition of ill-formed strings in some attributes. For example, the code
[DllImport(Value)] static extern void Foo();will not compile on csc or Roslyn. But it will be successfully compile on Mono!
Another example: the code
[Conditional(Value)] void Bar() {}will not compile on csc and Mono:
// MS.NET compiler error CS0647: Error emitting 'DllImportAttribute' attribute // Mono compiler error CS0633: The argument to the 'ConditionalAttribute' attribute must be a valid identifier
Conclusion
Encodings are hard.
Комментариев нет:
Отправить комментарий