java - How do I get the correct count of characters on .NET, Java and Sql Server? (read this in Google Chrome)

Question

Given this string

HELLO水

Legend: http://en.wikipedia.org/wiki/UTF-16

 is 4 bytes
水 is 2 bytes

A PostgreSQL database (UTF-8) returns the correct length of 7:

select length('HELLO水');

I noticed that both .NET and Java return 8:

Console.WriteLine("HELLO水".Length);

System.out.println("HELLO水".length());

And SQL Server returns 8 too:

SELECT LEN('HELLO水');

.NET, Java and SQL Server return the correct string length when none of the Unicode characters is variable-length. They all return 6 for:

  HELLO水

They return 7 for variable-length ones, which is incorrect:

  HELLO

.NET, Java and SQL Server use UTF-16. It seems that their implementation of counting the length of a UTF-16 string is broken. Or is this mandated by UTF-16? UTF-16 is variable-length, just like its cousin UTF-8. So why can UTF-16 (or is it the fault of .NET, Java, SQL Server and whatnot?) not count the length of a string correctly the way UTF-8 does?

Python returns a length of 12; I don't know how to interpret that. This might be another topic entirely, so I digress.

len("HELLO水")

The question is: how do I get the correct count of characters on .NET, Java and SQL Server? It will be difficult to implement the next Twitter if a function returns an incorrect character count.

If I may add, I was not able to post this using Firefox, so I posted this question in Google Chrome. Firefox cannot display variable-length Unicode characters.

score 4 · Accepted Answer

C# (and likely SQL Server and Java) returns the number of Char elements in the string, not the number of Unicode characters.

String.Length

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.

score 3 · Accepted Answer

For Java:

String s = "HELLO水";
System.out.println(s.codePointCount(0, s.length())); // 7
System.out.println(s.length()); // 8
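
The three numbers in the question (8, 7 and 12) can be reproduced side by side. This is only a minimal sketch: the class name `LengthKinds` and its helper methods are made up for illustration, and U+1D11E (written as the surrogate pair `\uD834\uDD1E`) is an assumed stand-in for the 4-byte character. The 12 that Python reported is consistent with a UTF-8 byte count, which is what `len()` returns for a byte string in Python 2.

```java
import java.nio.charset.StandardCharsets;

public class LengthKinds {
    // Number of UTF-16 code units: this is what String.length() reports.
    public static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points: a surrogate pair is counted once.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes when the string is encoded as UTF-8.
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Assumption: U+1D11E stands in for the character outside the
        // Basic Multilingual Plane; \u6C34 is the BMP character 水.
        String s = "HELLO\uD834\uDD1E\u6C34";
        System.out.println(utf16Units(s)); // 8  (5 + 2 + 1 code units)
        System.out.println(codePoints(s)); // 7  (5 + 1 + 1 code points)
        System.out.println(utf8Bytes(s));  // 12 (5 + 4 + 3 bytes)
    }
}
```

So UTF-16 itself is not broken: `length()` simply counts 16-bit code units, and a character outside the Basic Multilingual Plane takes two of them.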

score 0 · Accepted Answer

.Net: String.Length Property

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.

So we should use the StringInfo class to get the correct count of Unicode characters.

String s = "HELLO水";
Console.WriteLine (s);
Console.WriteLine ("Count of char: {0:d}", s.Length);

StringInfo info = new StringInfo (s);
Console.WriteLine ("Count of Unicode characters: {0:d}", info.LengthInTextElements);

The output:

HELLO水
Count of char: 8
Count of Unicode characters: 7
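
For comparison, a rough Java analogue of LengthInTextElements can be sketched with `java.text.BreakIterator`. The class name `TextElements`, the method `countTextElements` and the test character U+1D11E are assumptions made for this illustration, and the boundaries reported by `BreakIterator` are not guaranteed to match .NET's text elements for every input. The sketch also shows that a text element is not quite the same thing as a code point: a base letter followed by a combining mark counts as one element.

```java
import java.text.BreakIterator;

public class TextElements {
    // Counts user-perceived characters by walking character boundaries.
    public static int countTextElements(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        // After setText the iterator sits at the first boundary;
        // each successful next() steps over one text element.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumption: U+1D11E (a surrogate pair in UTF-16) is the test character.
        String s = "HELLO\uD834\uDD1E\u6C34";
        System.out.println(s.length());           // 8 code units
        System.out.println(countTextElements(s)); // 7 text elements

        // 'e' followed by COMBINING ACUTE ACCENT: two code points, one element.
        String accented = "e\u0301";
        System.out.println(accented.codePointCount(0, accented.length())); // 2
        System.out.println(countTextElements(accented));                   // 1
    }
}
```

Which count is the "correct" one depends on what is being limited: storage size, code points, or what a reader would call a character.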
