phpBB broken?

rjlittlefield · Post by **rjlittlefield** » Tue Oct 28, 2014 10:57 pm

Chris S. has observed that there's still a problem.

It turns out that despite their identical appearance, there's a striking difference in behavior between these two Unicode symbols:
C2 B5 MICRO SIGN
CE BC GREEK SMALL LETTER MU
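
For anyone who wants to check those byte pairs, here is a minimal sketch in Python (chosen only for illustration; the forum software itself is PHP). The helper name `utf8_hex` is made up for this example. It prints the UTF-8 encoding and the official Unicode name of each of the two characters.

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # U+00B5, the Latin-1 Supplement character
GREEK_MU = "\u03bc"    # U+03BC, from the Greek block

def utf8_hex(ch):
    """Return the UTF-8 bytes of ch as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

for ch in (MICRO_SIGN, GREEK_MU):
    # Prints e.g. "C2 B5 MICRO SIGN"
    print(utf8_hex(ch), unicodedata.name(ch))
```

The two characters look the same on screen, but they are different code points with different byte sequences, which is why the forum treats them differently.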

If I type alt-m using the international keyboard, or I copy/paste from a Word document into which I have done an Insert > Symbol > (Latin-1 Supplement) µ, or I copy from http://en.wikipedia.org/wiki/Micrometre, then I get the C2 B5 and everything works fine.

But if I copy/paste from a Google response for "mu symbol", then I get the CE BC and things go bad. Preview looks OK, but at the very least the mu symbol gets stored and retrieved as "?", and depending on exactly how the message is formatted, it may also produce a nasty error message "SQL Error : 1271 Illegal mix of collations for Operation 'IN' ".

Taking another look at the problem, I see from database backups that the database tables are now and always have been encoded using the MySQL default of Latin1.

The significance of this fact is that the "micro sign" character can be represented in Latin1, while the "Greek small letter mu" cannot.
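
A rough illustration of that difference, sketched in Python rather than in MySQL itself. It assumes Python's `latin-1` codec (ISO-8859-1) is a fair stand-in for the database's Latin1 for these two characters, and `fits_latin1` is a name invented for the example.

```python
MICRO_SIGN = "\u00b5"  # MICRO SIGN
GREEK_MU = "\u03bc"    # GREEK SMALL LETTER MU

def fits_latin1(text):
    """Return True if every character of text has an ISO-8859-1 encoding."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

print(fits_latin1(MICRO_SIGN))  # True: stored as the single byte B5
print(fits_latin1(GREEK_MU))    # False: no Latin1 byte exists for it

# With a lossy conversion, the unrepresentable character is replaced
# by a question mark, much like what the forum ends up storing.
print(GREEK_MU.encode("latin-1", errors="replace"))  # b'?'
```

The last line shows the same kind of "?" substitution described above, although the actual substitution on the forum happens somewhere in the PHP/MySQL path, not in Python.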

I assume from the trouble report that "Greek small letter mu" used to work, but I'm not quite sure how that happened.

I suspect it's tied to the version of PHP that was being used.

The PHP currently provided by our ISP is 5.4.33. The sticking point for the initially reported problem with blank previews was that the forum software contains a gazillion calls to a PHP library function named "htmlspecialchars", using a form of the call in which the character set encoding is not specified. Unfortunately, the authors of PHP have apparently never heard of backward compatibility, or maybe they just reject the concept, because according to the documentation for that function, "PHP 5.4 and 5.5 will use UTF-8 as the default. Earlier versions of PHP use ISO-8859-1." This is an important issue, because if characters are provided that are not valid in the (assumed) encoding, then htmlspecialchars returns an empty string. That's where the blank previews were coming from -- micro signs were getting encoded as single bytes of value 181, hex B5, which was resulting in character strings that made no sense in UTF-8.
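
To make the byte-level point concrete, here is a small Python sketch. It is only an analogy: it does not call PHP's htmlspecialchars, it just checks whether a byte string is valid UTF-8, which is the condition that matters here. The helper `is_valid_utf8` and the sample text are made up for the example.

```python
def is_valid_utf8(raw):
    """Return True if the byte string raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

sample = "5 \u00b5m"                     # "5 µm" written with the micro sign

as_latin1 = sample.encode("latin-1")     # micro sign becomes the single byte B5
as_utf8 = sample.encode("utf-8")         # micro sign becomes the pair C2 B5

print(as_latin1, is_valid_utf8(as_latin1))
print(as_utf8, is_valid_utf8(as_utf8))
```

A lone B5 byte is a UTF-8 continuation byte with no lead byte in front of it, so the Latin1-encoded string fails the check while the UTF-8 one passes.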

(Yes, it seems that htmlspecialchars is being called only on the Preview path, not on the Submit path, hence the initially observed ability to store and retrieve micro signs even though they would cause a blank preview. I agree completely that raised and rolling eyes are a natural reaction at this point, but that would be pointless railing at the elements. We have better things to do...)

Anyway, I'm thinking that in the past all input, output, storage, and processing were done in ISO-8859-1 (synonymous with Latin1), and somehow or other (maybe in the browser?) Greek small letter mu's were getting converted to micro signs because that was the closest equivalent in ISO-8859-1.

In contrast, at the moment we're doing storage in ISO-8859-1 (because that's the way the database is defined), but processing in UTF-8 (because that's what htmlspecialchars requires to avoid blank previews), and also input/output in UTF-8 (because the forum software contains no calls to encode/decode anything).

This all works fine, except for the annoying problem that now any Greek small letter mu's (and almost all other Unicode characters) cause "?" to be stored and may prompt the "Illegal mix of collations" error message.

I don't see any way to fix this remaining problem except by making a lot of mods to the forum software. But that would be insane, given all the other factors including that phpBB2 is thoroughly obsolete anyway.

So, having now thoroughly analyzed and documented the problem to the best of my abilities, I'm walking away from this one.

Use the micro sign, and all will be well. Go in peace...

--Rik

ChrisR · Post by **ChrisR** » Wed Oct 29, 2014 3:06 am

Alt 0181 gives µ
The 3BC version, entered as & # 9 5 6 ; without the spaces, gives ? if I use preview first, otherwise μ
"mu symbol" copied from a Google result gives ?.
This is new!! - the middle method always worked before today! I hadn't tried the last one before.
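
For reference, a short Python sketch of how those numbers relate, using only the standard `html` module. It shows that decimal 956 is hex 3BC (the Greek mu) and decimal 181 is hex B5 (the micro sign), and what the two numeric character references decode to.

```python
import html

# Decimal values used in the entity and the Alt code, shown in hex.
print(hex(956))  # 0x3bc
print(hex(181))  # 0xb5

# Numeric character references decode to the corresponding code points.
greek_mu = html.unescape("&#956;")
micro_sign = html.unescape("&#181;")

print(greek_mu == "\u03bc")    # True: GREEK SMALL LETTER MU
print(micro_sign == "\u00b5")  # True: MICRO SIGN
```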

--
The "nasty error message" "SQL Error : 1271 Illegal mix of collations for Operation 'IN'" I have seen erratically elsewhere on the forum, but only in the last month.

johan · Post by **johan** » Wed Oct 29, 2014 7:34 am

rjlittlefield wrote: I suspect it's tied to the version of PHP that was being used.

The PHP currently provided by our ISP is 5.4.33. The sticking point for the initially reported problem with blank previews was that the forum software contains a gazillion calls to a PHP library function named "htmlspecialchars", using a form of the call in which the character set encoding is not specified. Unfortunately, the authors of PHP have apparently never heard of backward compatibility, or maybe they just reject the concept

... the really interesting things will happen when our hosting companies update BOTH PHP and MySQL. Then we're all in trouble.

TheLostVertex · Post by **TheLostVertex** » Thu Oct 30, 2014 2:59 pm

rjlittlefield wrote: Anyway, I'm thinking that in the past all input, output, storage, and processing were done in ISO-8859-1 (synonymous with Latin1), and somehow or other (maybe in the browser?) Greek small letter mu's were getting converted to micro signs because that was the closest equivalent in ISO-8859-1.

In contrast, at the moment we're doing storage in ISO-8859-1 (because that's the way the database is defined), but processing in UTF-8 (because that's what htmlspecialchars requires to avoid blank previews), and also input/output in UTF-8 (because the forum software contains no calls to encode/decode anything).

Ah, I see. So my initial assessment of the situation was correct, but reversed (I was assuming the database was likely Unicode and was receiving ISO-8859-1 characters).

It is possible to convert the database from latin1 to UTF-8. In theory that should solve the problem. I am pretty sure every latin1 character can also be represented in UTF-8.
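
A quick sanity check of that last claim, sketched in Python. It assumes the stored bytes really are ISO-8859-1, and it says nothing about how the actual MySQL conversion would be carried out; `latin1_to_utf8` is a hypothetical helper for the example.

```python
def latin1_to_utf8(raw):
    """Re-encode ISO-8859-1 bytes as UTF-8 bytes."""
    return raw.decode("latin-1").encode("utf-8")

# Every possible single-byte value, 00 through FF.
all_bytes = bytes(range(256))
converted = latin1_to_utf8(all_bytes)

# Converting back recovers the original bytes, so nothing is lost.
round_trip = converted.decode("utf-8").encode("latin-1")
print(round_trip == all_bytes)

# The micro sign (B5 in latin1) becomes the two-byte sequence C2 B5.
print(latin1_to_utf8(b"\xb5"))
```

Every byte value survives the round trip, so existing posts should convert cleanly. Posts where a character has already been stored as "?" would stay that way, since the original character is gone.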

Then again I imagine you are pretty tired of messing with this issue.