Linux And World-Domination

“What Google has done for Linux, over the past few years, no other company has managed to pull off. By releasing two major platforms, both getting the most out of a Linux kernel, Google has put Linux in more hands than Canonical, Red Hat, SuSE, and any other company to have attempted to bring to life the Linux platform.” Yep. Jack Wallen is right. Canonical did a lot for GNU/Linux on the desktop and server, but that’s just a drop in the bucket compared to the hundreds of millions of people Google has introduced to the joys of Free Software, software you can run anywhere, examine, modify and distribute. Google did that by shipping hardware running the software and selling units. OEMs pay attention to that. Retailers pay attention to that. Consumers love it when businesses do what they do best and ignore the lock-step stupidity of Wintel. Now, anyone who has a better idea does not have to agree to the EULA or any other silliness from M$. Neither do they have to agree to use an Intel-compatible hair-drier for computation.

Further, Google combined the best features of cloud and thin-client technologies to bring users what they want: small, cheap computers that can do all the things the user wants to do. Cheap hardware combined with cheap software makes cheap personal computers, exactly what users want. If you spend six hours a day on Facebook, all you need is a browser and a platform to run it. Done! If you take thousands of photographs of your world and want to share them with the rest of the world, all you need is a device with a built-in camera, operating system and networking hardware. Done! If you want to know where on Earth you are and how to get where you want to be, all you need is a device that is GPS-aware and goes with you. Done! To Heck with Wintel, a burden to all mankind. Wintel is for slaves, not free people. Thanks, StatCounter.

Google has leveraged its search engine, on-line ad business and FLOSS to bring */Linux to the masses and it took only five years to replace that other OS as the dominant OS of personal computers. They proved that FLOSS is good business. They proved that ARM works for everyone. Thanks, Google! You finished off the monopoly.

See Google propels Linux to the top.

About Robert Pogson

I am a retired teacher in Canada. I taught in the subject areas where I have worked for almost forty years: maths, physics, chemistry and computers. I love hunting, fishing, picking berries and mushrooms, too.
This entry was posted in technology. Bookmark the permalink.

156 Responses to Linux And World-Domination

  1. DrLoser wrote, “let’s spell it as “EBCDIC,” shall we?”

    It’s so sad that I am old enough to know what that means without looking it up, Extended Binary Coded Decimal Interchange Code. I first encountered it on the IBM 26 keypunches in the olden days. Here’s a dictionary-entry:
    “[abbreviation, Extended Binary Coded Decimal Interchange Code] An alleged character set used on IBM {dinosaur}s. It exists in at least six mutually incompatible versions, all featuring such delights as non-contiguous letter sequences and the absence of several ASCII punctuation characters fairly important for modern computer languages (exactly which characters are absent varies according to which version of EBCDIC you’re looking at). IBM adapted EBCDIC from {punched card} code in the early 1960s and promulgated it as a customer-control tactic (see {connector conspiracy}), spurning the already established ASCII standard. Today, IBM claims to be an open-systems company, but IBM’s own description of the EBCDIC variants and how to convert between them is still internally classified top-secret, burn-before-reading. Hackers blanch at the very name of EBCDIC and consider it a manifestation of purest {evil}. See also {fear and loathing}.”

    I carried a magnetic tape around with me for ages with a lot of my programming from the olden days. It was EBCDIC. I donated it to a school’s computer-museum eventually. I don’t even remember which one, after I discovered FLOSS. There’s a lot less reason to use magnetic tapes and EBCDIC in the modern age. I’m surprised people anywhere still use it and I may never touch another tape drive.

    EEWWW! I think I need a shower.

  2. DrLoser says:

    Also: Many thanks, Robert, for putting up with this. I hope it was informative and useful. Signing off now:

    Mazel Tov!

  3. DrLoser says:

    “There is truly a 3 billion char mess caused by lack of EDCDIC support in unicode and the resulting odd mappings.”

    Just for once, let’s spell it as “EBCDIC,” shall we?

    Now we’ve corrected that unimportant detail, can we go on to correct your other unimportant detail, oiaohm?

    No, sorry, not three billion. Not a mess, in fact. But should such a mess ensue with code points (always possible), it would, as of 2012 standards, be a 1,114K mess. Not a three billion mess.

    I can live with that sort of mess, and though I might regret a “code page” for 160 characters … on the whole, I think there are more important things to worry about.

    Have you fed Fred the parakeet this morning? I think you should.

  4. DrLoser says:

    “That Exploit Guy when you are around IBM mainframes you will strike odd chars encoded. Because there is no other choice.”

    Fair enough. A single example, please. Give us an “odd char,” not even a code point (I suspect we can figure the code point out).

    Go on, oiaohm.

    Go on.

    Show, show us your paso doble.

  5. DrLoser says:

    I don’t claim to speak for TEG. I am old and wrinkly and I know horrible pointless stuff about ancient IBM code points. By comparison, TEG is a sprightly youngster who would probably get the digits in “3270” the wrong way round.

    But since you asked, Hamster, I can tell you exactly the ISO/Unicode code point that will meet your quoted need:

    (decimal) 20
    (hex) 0x14
    (old-fashioned) Ctrl-T
    (EBCDIC) DC4 device control 4
    (verbal EBCDIC) RES/ENP restore/enable presentation
    Or, to put it another way, U+0014.

    There now, that wasn’t so difficult, was it?
    Now, concerning the unique features of a “Posix UTF-8 standard,” enthusiastically supported in some yet to be discovered way by the Open Source Group and, naturally, in line with what the Chinese authorities were looking for when they started the definition of GB18030 … have you made any progress on these vital and important non-issues that nobody but you thinks is relevant?

    I mean, I appreciate the effort you’ve put in. A little thought and/or education wouldn’t hurt, though.

    160 extra characters for a code page in EBCDIC, pah!.

  6. DrLoser says:

    Since I’ve just made a mess of that, I won’t detain you all any longer. The “Posix/GNU/Linux” tool to validate UTF-8 is, of course, libiconv. Which everybody except oiaohm knows.

    I’ve built it on a Cygwin base (sorry: all I had) using gcc 4.8.2. The relevant GNU packages are:

    libiconv-1.14
    libgcrypt-1.6.1
    libgpg-error-1.11
    libassuan-2.11
    libksba-1.3.0
    pth-2.0.7
    gnupg-2.0.22

    If it doesn’t build for you, or if you need a link, just ask.

    Anyway, I now have the canonical iconv -f UTF-8 insert-input-file-here. Let’s try it with my test cases, shall we? We’ll begin with the double nibbles:
    0x00 and 0x7F are fine. Good.
    0x80 and 0xC0 fail: cannot convert. Good.
    0xF0, 0xF1 and 0xF5 fail: incomplete character or shift sequence. Good. I’d quibble about how this is different to 0xC0, but what the heck: correct and to standard.

    0x0000 passes ($? is 0). This is correct. It’s equivalent to 0x00 0x00.

    Let’s have a squiz at a couple of faulty null extensions, shall we?

    0xC000 and 0xC080: Computer says, “cannot convert”. Again correct, but very interesting. Where did this magical “Modified UTF-8” go when it came to Posix, I wonder?

    iconv won’t allow the BOMs (0xFFFE and 0xFEFF) either, but that’s perfectly fair and conformant with the standards. I’d expect an incoming stream to be passed through a sanitizer that strips this UTF-16 stuff off, assuming it was there in the first place.

    Now let’s try the bottom end of five-byte encoding and the top end of six-byte encoding:

    F880808080.txt:1:0: cannot convert
    FDDFDFDFDFDF.txt:1:0: cannot convert

    Which leaves us with two interesting cases. And, remember, oiaohm, you didn’t find these. I did.

    I believe 0xF7BFBFBF is the UTF-8 encoding for U+1FFFFF. If you remember: you corrected me and it isn’t a legit code point. Apparently iconv has a slight glitch here.

    Let’s look at the 0x10FFFF boundary:

    0xF3BFBFBF passes. (Good)

    I may be wrong here, but 0xF4B0B0B0 and 0xF4BFBFBF are out of range when encoding Unicode into UTF-8 (ie they are beyond the maximum defined code point of 0x10FFFF).

    Anyhow, whether or not I’ve caught a discrepancy:

    No, oiaohm: “Posix” UTF-8, as represented by the universally used library (libiconv), demonstrates that RFC3629 is indeed the standard, and that you are hopelessly wrong.

    Don’t mess with us professionals. TEG and I are paid to look this stuff up. It’s nice to know that you have fun trying, but, honestly:

    Get an education first.

    * iconv: FDDFDFDFDFDF.txt:1:0: cannot convert
    * iconv: F880808080.txt:1:0: cannot convert

    Good so far. These correspond to the end of the six-byte range and the start of the five-byte range, respectively.
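    The same checks can be reproduced without libiconv. As a sketch (my addition, not part of DrLoser’s original test run), Python’s built-in strict UTF-8 decoder follows RFC 3629 and gives the same verdicts on these byte sequences:

```python
# A sketch of the same test cases run against Python's strict UTF-8
# decoder (which implements RFC 3629), rather than libiconv.

def valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8 per RFC 3629."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

cases = {
    b"\x00": True,                   # NUL, fine
    b"\x7f": True,                   # top of the ASCII range
    b"\x80": False,                  # stray continuation byte
    b"\xc0\x80": False,              # overlong NUL ("Modified UTF-8")
    b"\xff\xfe": False,              # UTF-16 BOM bytes, never valid UTF-8
    b"\xf3\xbf\xbf\xbf": True,       # U+FFFFF, inside the 0x10FFFF limit
    b"\xf4\xbf\xbf\xbf": False,      # beyond U+10FFFF
    b"\xf7\xbf\xbf\xbf": False,      # would be U+1FFFFF, not a code point
    b"\xf8\x80\x80\x80\x80": False,  # five-byte form, dropped by RFC 3629
}

for data, expected in cases.items():
    assert valid_utf8(data) == expected, data
```

    Note that Python, like iconv, rejects the overlong 0xC0 0x80 form, so “Modified UTF-8” gets no special treatment here either.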

  7. oiaohm says:

    That Exploit Guy the mapping table you just used to/from ascii destroyed the control chars EBCDIC. one to one relationship does not apply with EDCDIC to ascii. Its destructive conversion.

    Like please encode char 20 of EDCDIC RES ENP. Come on That Exploit Guy you are so smart tell me the encoding value in Unicode or ascii for it. The answer there is not one. This where you are screwed EDCDIC to unicode is dropping chars.

    That Exploit Guy you are googling you don’t know the EDCDIC problem. A lot of newbies dealing with this problem find the mapping tables and go great that solves it then after conversion to unicode and back the mainframe application plays up and they cannot work out why. The lost chars are important.

    Look up tables don’t cover the all important control chars. Yes you can converted the printed chars from EDCDIC to ASCI and back but that is not enough to make applications work correctly.

    You cannot us the PUA because that is already inside the encoding space IBM has used. What IBM encodings do is EDCDIC + Unicode in each of the encoding methods. Yes + every char of Unicode. Also mainframe applications will be using custom PUA chars in unicode. So use PUA to convert EDCDIC risk overlap. Or again how to have mainframe application do something stupid because in the process of encoding or decoding you have destroyed one of its chars.

    That Exploit Guy basically to deal with what IBM has done at a min the EDCDIC control chars need a formal page in unicode or you have to encode the extra information outside unicode. There are three places is either just above the end of unicode so between 10FFFF and 1FFFFF. Or well clear in UTF-8 5 and 6. 5 and 6 can save you some conversion.

    locale under Posix systems change what is acceptable. Like you can change locale to ascii only and it will call all chars not ascii in valid. The linux kernel will save and store utf-8 5 and 6 without issue. Libraries in userspace will be depending on locale if its acceptable.

    That Exploit Guy when you are around IBM mainframes you will strike odd chars encoded. Because there is no other choice.

  8. DrLoser says:

    Damn, I thought I’d deleted the rest of that Creative Commons stolen gibberish. Again:

    There are eight million stories in the naked city. Oiaohm has invented 0x10FFFF of them.

    Apparently there are other sites that are CSS-deficient; sigh.

  9. DrLoser says:

    “There are eight million stories in the naked city. Oiaohm has invented 0x10FFFF of them.”


  10. That Exploit Guy says:

    “That Exploit Guy you just stated the problem current unicode standard don’t tell you what you should do in a case you must deal with EBCDIC.”
    Putting aside that neither the Unicode Consortium nor ISO has any obligation to bend over backwards to just support the esoteric stuff pretty much only IBM mainframes use (and in case anyone feels the urge to argue with me over this – I am merely using free software advocate logic here), your so-called “problem” is a no-brainer.
    What you are looking for is called a “lookup table”, genius (http://shop.alterlinks.com/ascii-table/ascii-ebcdic-us.php). Don’t tell me you don’t know how to do this kind of elementary programming stuff. (Well, whom am I kidding? Of course you don’t!) All you have to do is to set up an array 256 elements in length. Make the index your (extended) ASCII. Then, assign to each index number the corresponding EBCDIC value, i.e. 0x00 for index 0, 0x01 for index 1, 0x02 for index 2, 0x03 for index 3, 0x37 for index 4, and so on. When you convert a file from ASCII to EBCDIC, simply use each byte you read from the file as the index and grab the corresponding EBCDIC value from the array. Conversely, if you want to convert a file from EBCDIC to ASCII, simply instead make EBCDIC your array index and assign to each index number the corresponding ASCII value.
    And that’s, in a nutshell, how conversion is done between ASCII and EBCDIC or between any two random sets of scalar values with a one-to-one relationship.
    Don’t call something a “problem” just because your pea brain doesn’t know how to solve it.
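    The lookup-table approach described above can be sketched in a few lines. This is my illustration, not TEG’s code; it fills in only space, digits and uppercase letters from EBCDIC code page 037, where a real table has all 256 entries:

```python
# A partial sketch of the lookup-table conversion described above.
# Only space, digits and uppercase letters from EBCDIC code page 037
# are filled in; a complete table has all 256 entries.

ASCII_TO_EBCDIC = [None] * 256
ASCII_TO_EBCDIC[0x20] = 0x40                  # space
for i in range(10):                           # '0'..'9' -> 0xF0..0xF9
    ASCII_TO_EBCDIC[0x30 + i] = 0xF0 + i
for i, ch in enumerate("ABCDEFGHI"):          # 'A'..'I' -> 0xC1..0xC9
    ASCII_TO_EBCDIC[ord(ch)] = 0xC1 + i
for i, ch in enumerate("JKLMNOPQR"):          # 'J'..'R' -> 0xD1..0xD9
    ASCII_TO_EBCDIC[ord(ch)] = 0xD1 + i
for i, ch in enumerate("STUVWXYZ"):           # 'S'..'Z' -> 0xE2..0xE9
    ASCII_TO_EBCDIC[ord(ch)] = 0xE2 + i

# Inverting the forward table gives the reverse one -- the one-to-one
# relationship between the two scalar sets that TEG describes.
EBCDIC_TO_ASCII = {e: a for a, e in enumerate(ASCII_TO_EBCDIC) if e is not None}

def ascii_to_ebcdic(data: bytes) -> bytes:
    return bytes(ASCII_TO_EBCDIC[b] for b in data)

def ebcdic_to_ascii(data: bytes) -> bytes:
    return bytes(EBCDIC_TO_ASCII[b] for b in data)

assert ebcdic_to_ascii(ascii_to_ebcdic(b"HELLO WORLD 123")) == b"HELLO WORLD 123"
```

    Python in fact ships the full 256-entry table as its cp037 codec, so `"HELLO".encode("cp037")` does the complete job.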
    “Current unicode and iso standard still states UTF-8 as 6 long not 4 as you claimed.”
    Again, Table 3-7 and Table 3 of the respective standards have already proved your above statement completely false. The least you could do would be to read them before making up any more of this kind of ridiculous nonsense.
    “If you are seeing 5 and 6 UTF-8 you have custom chars using some non unicode encoding standard.”
    Again, Unicode has this particular provision for custom characters known as “PUA”. We discussed this back when you were gibbering about two billion Chinese characters (GB18030-2005 has 76556; see http://baike.baidu.com/view/889058.htm).
    “Mind you UTF-8-Mod not the only evil unable to convert non destructively created by IBM using only Unicode standards”
    Again, read before you cite. UTR #16 tells you as-a-matter-of-factly that each step of the conversion from Unicode code point to UTF-EBCDIC sequence is reversible and one-to-one. There is nothing “unable to convert non destructively” (sic) involved.
    “The error on your part is presuming that a UTF-8 6 long is completely invalid.”
    Again, UTF-EBCDIC has fundamentally nothing to do with UTF-8 aside from their superficially similar construct. Your argument is akin to claiming that a person has the authority over all branches of the US military as long as he looks like Barack Obama.
    What a load of rubbish.
    “EDCDIC sources can contain the complete Unicode chars + extras.”
    No. Just no.
    Had you paid attention to even just one paragraph of the document, you would have learned that there is nothing UTF-EBCDIC covers that is not within U+0000 to U+10FFFF inclusive. You would have also noticed that “neither UTF-EBCDIC nor its intermediate form called UTF-8-Mod… are intended to be used in open interchange environments”.
    I know you are desperately attempting to create an illusion of intellectual superiority, but this is just downright lazy.
    Next idiot, please.

  11. DrLoser says:

    Yer a Gentlemann and a Scholar, Robert.

    To save you dragging the original out of the bin:

    Here we go, oiaohm: I hold in my hands (figuratively) the standard “Posix” tool to verify UTF-8 encodings. You, of course, already know what this tool is; I don’t want to deter you from proving your various points by using a “Posix” tool.

    Here are some representative possible encodings to help you start testing. Feel free to add more.

    0x00; 0x7f; 0x80; 0xC0; 0xC1; 0xF0; 0xF1; 0xF5; 0x0000; 0xC000; 0xC080; 0xFEFF; 0xFFFE; 0xF08282AC; 0xF5BFBFBF; 0xF7BFBFBF; 0xF880808080; 0xFDDFDFDFDFDF.

    The results may surprise you. (Some of them certainly surprised me.)
    ——-
    And I swear blind that I have GNU/Linux evidence that “Posix UTF-8” is basically RFC3629 UTF-8 … with maybe an oddity or two, but that’s what examining the code is all about.

    I built the entire thing from scratch, using nothing but GNU configure/make/make install packages, btw, which essentially places me as a Downstream packager in this discussion.

  12. DrLoser says:

    Robert, a request: would it be OK if you dug my previous “Posix challenge” post out of the troll-dump?

    I fully appreciate the reasoning behind your message filter, but that particular post was part one of two. (It includes a set of about 15 encodings which may or may not be legal in “Posix UTF-8”).

    In the second post, I am hoping to prove that there is essentially no such thing as “Posix UTF-8” and that the Open Group, as everybody bar oiaohm would correctly suspect, follows the same official standard as everybody else. Which is to say RFC3629.

    To whet your appetite, I intend to prove this armed with nothing but GNU/Linux tools. I’m even hoping to describe (very briefly, three or four lines) how I built them from GNU packages.

    Thanks!

  13. DrLoser says:

    That argument would be far more compelling, oiaohm, if even a single one of the 160 basic EBCDIC characters did not correspond to a code point already assigned in Unicode/ISO 10646-1:12. I believe all of them are mapped to code points that can be encoded in fewer than three bytes in UTF-8.

    It would be compelling, but for the fact that the code points in GB18030 are closely aligned to the code points in ISO 10646-1:12. And to quote a wiki source,

    That used to be true, as of Unicode 4.0. There were in fact a small number of characters in GB 18030 that had not made it into Unicode (and ISO/IEC 10646). However, to avoid having to map characters to the PUA for support of GB18030, the missing characters were added as of Unicode 4.1, so of course, they are in Unicode 5.0 and later versions.

    You can find the characters in question in Annex C (p. 92) of GB 18030-2000. All now have regular Unicode characters. These can be found in the ranges: U+31C0..U+31CF (for CJK strokes) and U+9FA6..U+9FBB.

    Your argument might even be compelling, but for the fact that CCSID 1388 is an IBM “code page” (I define it loosely as such for current purposes) which is used almost entirely for internal purposes on DB2. Were I charged to interoperate with another platform, I would map CCSID 1388 to Unicode (or ISO 10646-1:12) and encode/decode it into UTF-8. With the normal limit of 0x10FFFF for code points. If necessary I would do what GB18030 used to do, and allow some spillage from the basic plane to the PUA.

    Come to think of it, that’s three totally uncompelling arguments you’ve presented. Well done: you’re speeding up! A small plea: could you address them one at a time, please? I don’t care and TEG doesn’t care, but other people find it tiring to read too much rubbish in one place.

  14. oiaohm says:

    That Exploit Guy you just stated the problem current unicode standard don’t tell you what you should do in a case you must deal with EBCDIC. Current unicode and iso standard still states UTF-8 as 6 long not 4 as you claimed. That is no error on their part. If you are seeing 5 and 6 UTF-8 you have custom chars using some non unicode encoding standard.

    CCSID 1388 (EBCDIC with GB18030 extensions)

    Mind you UTF-8-Mod not the only evil unable to convert non destructively created by IBM using only Unicode standards,

    The error on your part is presuming that a UTF-8 6 long is completely invalid.

    Unicode standard Annexes you cannot deal with EBCDIC data without being destructive or using mapping of unused chars. EDCDIC sources can contain the complete Unicode chars + extras. The extras are the problem you cannot use the PUA because the EDCDIC source might be using them.

    EDCDIC is one of the types sets that basically stuffs you sticking to unicode standard.

    That Exploit Guy the true fact of the matter there are extra chars you cannot map into standard unicode because where they come from is larger than the current day unicode.
    UTF-EDCDIC to UTF-8 non destructively is what leads to the horible solutions using 5 and 6 UTF-8. Note I said there are 2 independent point of views how this should be done.

    Reality the main unicode body dropped UTF-EDCDIC since the did not want to deal with the nightmare of it. This does not make the problem disappear.

    You could call 5 and 6 utf-8 disputed zones for what should be in there.

    I was remembering GB18030 as trouble. The trouble is in fact CCSID 1388. There are many charsets EDCDIC that include full Unicode + extras. This a very nice way to stuff you. Worse IBM is still making new ones because IBM mainframe OS’s are not dieing.

    There are a huge number of allocated chars outside the Unicode range dealing with the nightmare of EDCDIC. Everything would be made so much simpler if the 160 EDCDIC chars were allocated a codepage. I know it would mean some duplication with the ascii chars. There is truly a 3 billion char mess caused by lack of EDCDIC support in unicode and the resulting odd mappings. Only give from the ISO standards and unicode standards is that UTF-8 is not restricted to 4 chars.

  15. That Exploit Guy says:

    “That Exploit Guy Please note I have also cited the 2000 copy of ISO 10646-1 as well that is 1996 UTF-8.”
    Again, you are confusing “code point” with “encoding”. One is a scalar value representing a character and the other is a communication norm. Have you realised, even from your favourite source Wikipedia, “UTF” stands for “UCS Transformation Format” or “Unicode Transformation Format”?
    Also, claiming authority from older standards is just dumb: what you are in effect saying is that your information might be either part of the current standard or considered obsolete. It’s just not a very good way to defend your already shaky view on the subject.
    EBCDIC and ASCII are two distinctively different encoding schemes and are fundamentally incompatible with each other. Likewise, UTF-EBCDIC is an expansion of EBCDIC using a construct similar to UTF-8 but otherwise they have fundamentally nothing to do with each other.
    UTF-EBCDIC is not even part of Unicode Standard 6.3.0 or ISO 10646:2012. (A number of former UTRs are considered part of the current Unicode Standard, though UTR #16 is not one of them. See Chapter 3 under “Unicode Standard Annexes”.)
    As I said, such pathetic covering of lies with even more lies will only prolong your own humiliation. Isn’t it about time you stop trying to flim-flam your way out of this and start learning some good ol’ fashioned honesty?

  16. oiaohm says:

    UTF-8-Mod officially sanctions use of 7 byte and defines what it will be for UTF-8 if usage ever comes required. There is a dispute over the 8 byte I forgot about. One has FF as 8 byte and one has FF a 7byte with 1 bit in the first byte.

    Yes longer than 6 bytes in UTF-8 has been considered. I will admit a mistake I miss remembered the dispute over how encoding will be started after 8 byte when the correct is the dispute over how with UTF-8 starts at 7 byte UTF-8.

    http://www.unicode.org/reports/tr16/
    Reality the UTF-8 sub forms get really nasty.

    Ritchie and Pike designed the core of UTF-8. Others have come along and extended the thing. That is the problem. UTF-8-Mod nails down what the 7 byte form of UTF-8 will look like other than if FF equals a bit on the 7 level.

  17. oiaohm says:

    That Exploit Guy Please note I have also cited the 2000 copy of ISO 10646-1 as well that is 1996 UTF-8.

    Utf-8 5 and 6 do have chars to map just not to unicode. Ok not unicode as you know it.

    http://en.wikipedia.org/wiki/UTF-EBCDIC also time to start eating your words about no extra chars. General Unicode in fact does not cover everything.

    So there is a 2 extra code points UTF-8 5 byte and 6 byte for EBCDIC encoded. Legacy support without having conflict hell on newer systems. The two different code points exist because 2 different companies had 2 different ideas. Yes 2 zones that can hold the slightly bigger unicode numbers due to the extra chars of EDCDIC.

    EDCDIC is not ascii compatible. Note the different there are 160 starting chars not 128. All the unicode numbers are offset by that difference if something is using EDCDIC. Guess how process heavy it gets attempting to convert backwards and forwards for anything insane using EDCDIC.

    RFC 3629 does not include EDCDIC or any of the other legacy encodings that are not ascii. Yet RFC 2279 can include these legacy due to the free space RFC3629 and the Unicode change to stop at 21bits created. 5 and 6 byte UTF-8 exists for a reason other than just getting past 31 bits as the old unicode encoding requirements demanded. Microsoft and IBM decided to use an encoding that was shorter.
    http://www.unicode.org/reports/tr16/#Comparison

    UTF-8-Mod or UTF-EDCBC when you look at byte stream can look absolutely identical to UTF-8 except that the unicode conversion with the wrong version coming out as a complete garbled mess.
    http://www.unicode.org/reports/tr16/#Comparison
    Yep 2-6 byte identical lead in chars different contents.

    So there is a true headache the posix supporting platforms had to solve. EDCBC had to be placed somewhere. Somewhere that was not going to run into by Unicode. When unicode said we are stopping at 21 bits for UTF-16 compatible instead of the prior 31. It created a nice hole.

    Of course we hope that one day no more EDCBC applications exist. That Exploit Guy yes this is a full breach of Unicode rules that each char should only have one Unicode number. Each unicode char has 3 numbers on UTF-8.

    If you are on a posix system and you pretend EDCBC does not exist you can be rudely.

    Not everything is encoded by unicode standards. Yes UTF-EBCDIC to UTF-16 keeping all details is also impossible. UTF-EDCDIC with the new limit on Unicode chars can be shoved into upper end UTF-8 without issue.. The EDCDIC special chars don’t exist in general unicode. This is not the only time Unicode fails with OS special chars.

    Yes there might be free space in Unicode for printed chars but you you want to add extra OS control chars you cannot.

    Yes I agree that the chars in 5 and 6 byte UTF-8 should not appear on the internet it would be placing insane overhead on web browsers. Find it on a OS filesystem that has to talk to legacy systems on the other hand is not something to be surprised by.

  18. DrLoser says:

    “Guess where you find the 8 byte define UTF-8.”

    I don’t do guesses, oiaohm. I do proof, confessions of guilt, and repentance.

    Oh, and also links. In all this interminable thread, you have not once presented a verifiable link to anybody who uses a standards-sanctified form of UTF-8 that either

    a) permits more than six bytes to encode a code point or
    b) considers the concept of “fixed length UTF-8 encoding” anything other than a risible waste of everybody’s time. Not to say a contradiction of the basic design of UTF-8 (by Ritchie and Pike, may I remind you).

    Before going any further, it would behove you to provide a link to either of these claims.

  19. DrLoser says:

    Interesting you should mention the 1993 standards, oiaohm, because I’ve just searched through this thread and they’re mentioned in precisely three posts.

    Naturally, TEG’s latest post mentions them, because he can’t very well expose you as a fraud who has never mentioned the distinction without, well, mentioning the distinction.

    Naturally, you mentioned them, once. But not of your own accord. Guess who the third (earliest) person is?

    Me, that’s right. All you’re doing is to follow up on my coat-tails. And you’re so totally inadequate at this that you didn’t even notice my earlier typo (which would have made four mentions of the 1993 standards), when I accidentally called them “2003.”

    I was wrong. I was also wrong on the far limit of legitimate UTF-8 encoding, where I somehow missed the ISO requirement that it is 0x10FFFF.

    See? When I’m wrong, I admit that I’m wrong. This is a learning exercise for me.

    You should try this new-fangled “learning” business some time, oiaohm. It would possibly enrich your life beyond measure.

    Now, if you’ll excuse me, I’m currently in the process of demolishing your ludicrous “Posix” argument. You can either google for defensive links, or you can just wait for somebody else (TEG, me, anybody) to explain the bleeding obvious to you.

    Here’s a hint to keep everybody else on the thread interested, btw: my Posix comments will focus exclusively on GNU/Linux.

  20. That Exploit Guy says:

    “The alternative forms in fact still conform to specification.”
    Putting aside that that is, in fact, a proposed 1996 modification of ISO 10646-1:1993 and not part of ISO 10646:2012 (which you claim to cite from), you are still having trouble telling the difference between code points and encoding.
    This is not to mention that the supplementary planes defined in the current Unicode standard are, as we speak, mostly vacant and leave us with no reason to expand the current set of assigned code points beyond U+10FFFF.
    This is also not to mention you have yet to produce a single piece of evidence to support your utterly ridiculous claim that UTF-8 code units of 5 or 6 octets are in use despite there being no characters to map to.
    This is also not to mention your equally ridiculous claim of, effectively, several ways to encode one code point in UTF-8 despite the obvious non-conformity to at least three established standards and despite well-known security issues (which you claim OSX, Linux and BSD feature).
    This is not to mention you are still confusing “require” with “requite”.
    At this point, you are simply grasping at straws and alienating yourself from even those desperately wanting to be on your side of the argument. Keep pulling farcical claims out of your own behind, if you so insist, but by doing so you will only prolong the misery and humiliation that you have put yourself into.

  21. oiaohm says:

    RFC3629
    12. Changes from RFC 2279
    o Restricted the range of characters to 0000-10FFFF (the UTF-16
    accessible range).

    As I said, don’t trust particular people on standards. One party here incorrectly claims RFC3629 allows 0x1FFFFF. RFC3629 clearly states the last code point is U+10FFFF.
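    The 0x10FFFF ceiling is easy to check against any RFC 3629 decoder; a minimal Python sketch (assuming CPython’s built-in codec, which follows RFC 3629):

```python
# RFC 3629 restricts UTF-8 to the UTF-16 accessible range, ending at U+10FFFF.
# The last valid code point encodes to four bytes...
assert chr(0x10FFFF).encode('utf-8') == b'\xf4\x8f\xbf\xbf'

# ...while a 5-byte sequence (here, the old RFC 2279-style form of U+200000)
# is rejected outright: 0xF8 is not a valid leading byte any more.
try:
    b'\xf8\x88\x80\x80\x80'.decode('utf-8')
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

print('U+10FFFF ->', chr(0x10FFFF).encode('utf-8').hex())
```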

    http://www.ietf.org/rfc/rfc2279.txt Yes, RFC2279 is another 6-byte UTF-8. It also does not declare a starting range.

    And when you read RFC2279 you find the X/Open Joint Internationalization Group named as the source of UTF-8. In other words, this is what is http://www.opengroup.org/ today, the group that defines what POSIX is. Guess where you find the 8-byte define of UTF-8.

    The extra code points in UTF-8 come from POSIX standard production processes, up in the 5- and 6-byte forms of UTF-8. They are for backwards compatibility, so you don’t have to be doing complex transformations all the time. Yes, not all old Unix systems used ASCII charsets.

    It is very important to state which UTF-8 you are implementing. You have 3 major streams of UTF-8: ISO, RFC and POSIX.

    The master source of UTF-8, being the Open Group, does not declare a starting range for the code points in UTF-8. If you are on a POSIX system and expecting shortest-form-only UTF-8, you are kidding yourself.

    The reality here is that UTF-8 is defined by 3-4 standards bodies that don’t agree with each other. It is very important to be aware that they don’t agree with each other. Worse, the generations of standards within each of the standards groups defining UTF-8 do not agree with each other either.

    The Open Group define handles everything.

  23. oiaohm says:

    UCS code unit sequence that purports to be in a UCS encoding form which does not conform to the specification of that encoding form. (Section 4.33)
    The alternative forms in fact still conform to specification.
    http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
    As long as I declare my UTF-8 as the 1996 ISO implementation, I am fine to use alternative forms. I have been stating “as per ISO”. You just have not got which ISO.

    There was a big argument over Java using the 2-byte null in UTF-8; the reality is it’s valid in ISO. Not valid in later Unicode, but valid in early Unicode.

    Alternative forms are still encoded as per specification.

    http://www.unicode.org/reports/tr36/#UTF-8_Exploit Notice it depends on which version of Unicode you are implementing.

    ISO still does not declare non-shortest forms as forbidden.

    Unicode has declared extra limitations on top of ISO UTF-8. I cannot claim a fixed-length UTF-8 as Unicode.

    http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

    This is the problem: this is 1996, the original UTF-8 encoding. It declares a max for each code point but no min for each code point.

    That Exploit Guy, I can still be perfectly valid ISO UTF-8 using multi-byte forms. I told you to look at the 1993 and the 2012 standards. I should have told you to look at the 1996 pack.

    ISO allows you to pick the year of what you are implementing. This is the big mistake people have made with Unicode, attempting to alter later versions to patch over the so-called flaw.

    In ISO UTF-8 it’s perfectly acceptable to use longer forms. You should not pass the longer forms to non-ISO systems.

  24. DrLoser says:

    I’m going to summarize this for you, oiaohm. An encoding is an encoding (eg UTF-8 in the present case). One standardizes an encoding by a) defining a one-for-one bijective correspondence — this is obvious, but seems to be beyond you — and b) publishing it in a standard. In this case, RFC3629, accepted worldwide with only a single dissident voice.

    Character sets and code points are at the next level of abstraction. This is where we meet up with Unicode and ISO/IEC 10646:2012, which are as of now largely interchangeable saving details like boustrophedons. International standards for character sets and code points, however, depend upon standardized encodings. And for once this isn’t Microsoft’s “standard” (which was terrible, and which they persisted in calling “ANSI” even though it wasn’t). Nor is it IBM’s “standard.” It’s the entirely innocent and unremarkable UTF-8, as defined by two Unix gurus on the back of a napkin in 1992 and modified very little since.
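    The bijection point is easy to make concrete; a minimal Python sketch (CPython’s built-in codec, which implements RFC 3629, stands in for “a standardized encoding”):

```python
# A standardized encoding pairs each code point with exactly one byte
# sequence and vice versa: encoding then decoding is the identity map.
samples = [0x24, 0xA2, 0x20AC, 0x10348, 0x10FFFF]
encoded = [chr(cp).encode('utf-8') for cp in samples]

# No two code points share an encoding (injective on this sample)...
assert len(set(encoded)) == len(samples)
# ...and every encoding decodes back to the code point it came from.
assert [ord(b.decode('utf-8')) for b in encoded] == samples

print([e.hex() for e in encoded])  # ['24', 'c2a2', 'e282ac', 'f0908d88', 'f48fbfbf']
```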

    I don’t see what you hope to prove by your, pardon me, weird and unnatural assertions. But one small thing bothers me still:

    There are also code points in UTF-8 that are non unicode they are in 5 and 6 byte UTF-8. These are same between OS X, Linux, Unix and BSD.

    Really? Name a single one. I might moan about GNU/Linux occasionally (troll alert!), but the very last thing I’d accuse it of is an attempt to develop a closed, non-standardized communications system.

    In fact, I will defend GNU/Linux from your accusation. You are wrong. Prove otherwise.

    A couple of attempts to tidy up.

    Yes there is another reason why the RFC states 0x10FFFF as end because 0x1FFFFF will let some non Unicode in chars in from ISO UTF-8.

    Nope, try again. RFC3629 encodes up to 2^21, ie 0x1FFFFF.

    So yes it can be impossible to do a conversion from UTF-8 ISO to UTF-16 because UTF-16 does not have the chars or the code points.

    This is true, but not very interesting. UTF-8 is an encoding. UTF-16 is an encoding. The maximum value that UTF-16 can encode is 0x10FFFF (a value I notice you have erroneously assigned to UTF-8, but hey, this stuff is complicated).

    Now, the set of code points (in any standard) up to 0x10FFFF is good enough for most purposes, but if not?

    Encode the stuff in UTF-8 (RFC3629, I hate to keep repeating this) or even UTF-32. Both are standard.

    Finally, from RFC3629:

    Now the “Korean mess” (ISO/IEC 10646 amendment 5) is an incompatible change, in principle contradicting the appropriateness of a version independent MIME charset label as described above.

    Do not try to replicate the “Korean mess,” oiaohm. Better specification writers than you have tried, and they have repined.

    (Though probably not requited.)

  25. DrLoser says:

    Well, we try our best, ram. Sometimes just hiding under bridges gets a bit dull, y’know?
    :-;

  26. ram says:

    Well, at least serious programmers read and comment on this column. That, and a few Microsoft trolls 😉

  27. That Exploit Guy says:

    Love conquers all.
    Now, now. This isn’t an advice column for programmers, you know?

  28. That Exploit Guy says:

    That Exploit Guy ISO 10646 and the Unicode standard body does not forbid using non well formed UTF-8.
    Try again. Both standards consider encodings not listed in “Well-formed UTF-8 octet/byte sequences” ill-formed. ISO/IEC 10646:2012 (UCS), in particular, defines “ill-formed code unit sequence” as:

    UCS code unit sequence that purports to be in a UCS encoding form which does not conform to the specification of that encoding form. (Section 4.33)

    Similarly, Unicode Standard 6.2.0 states the following regarding ill-formed byte sequences:

    When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code unit sequences. (C9, Chapter 3)

    Therefore, despite your claim, neither standard considers your suggested way of encoding characters UTF-8 conformant. In other words, even if we completely ignore everything stated in RFC 3629, a three-byte sequence for a single character in the range between U+0000 and U+007F inclusive is still not a valid UTF-8 sequence.
    The extent to which your words contradict what is explicitly stated in the standards you claim to understand and follow is simply damning.

  29. oiaohm says:

    That Exploit Guy, find me a line in the Unicode or ISO standards saying you must only use well-formed UTF-8. The answer is there is no such line. This is the problem. One-encoding-only UTF-8 exists only in RFC 3629. ISO suggests you don’t use the alternative encodings but does not forbid them. So truly, at each length in ISO UTF-8 you have to presume the range goes from 0 to max, or you will make a buggy ISO decoder. The security aspects of ISO UTF-8 are harder.

    That Exploit Guy, yes, you had a coding error where you were turning an unsigned into a signed. And if you had not made that error, it would remove one error check.

    There are also code points in UTF-8 that are non-Unicode; they are in 5- and 6-byte UTF-8. These are the same between OS X, Linux, Unix and BSD. The ISO standard for UTF-8 supports UTF-8 that includes non-Unicode chars. Yes, UTF-8 with non-Unicode chars can be particularly unfriendly to web browsers. Yes, there is another reason why the RFC states 0x10FFFF as the end: 0x1FFFFF would let some non-Unicode chars in from ISO UTF-8. ISO UTF-8 does not state that UTF-8 only has to contain Unicode chars. So yes, it can be impossible to do a conversion from ISO UTF-8 to UTF-16, because UTF-16 does not have the chars or the code points.

    It pays to know which UTF-8 you are handling. ISO UTF-8 is a lot harder to handle. It pays to mark your code correctly.

    So you should have called yours RFC 3629; calling it just UTF-8 is badly wrong. For plain UTF-8 you should take either the ISO or the broadest define.

    UTF-8 is not what you thought it was.

  30. oiaohm says:

    That Exploit Guy, well-formed in the Unicode and ISO standards is only a recommendation, not mandatory.

    Dr Loser and That Exploit Guy: UTF-8 is also not Unicode-only. OS X, Linux, BSD and Unix have code points for different non-Unicode charsets that are in 5- and 6-byte UTF-8. The claim of no code points above 0x10FFFF is invalid; that only applies if the UTF-8 only contains Unicode.

  31. oiaohm says:

    That Exploit Guy, ISO 10646 and the Unicode standards body do not forbid using non-well-formed UTF-8. The example of well-formed UTF-8 in ISO 10646 and from the Unicode standards body is just a suggestion, not a mandatory requirement. RFC3629 has a mandatory requirement that a Unicode value should have only 1 encoding. This is the difference between RFC UTF-8 and ISO UTF-8.

    Dr Loser, RFC3629 is the 2003 one; it is not an ISO document. ISO 10646:2012 still lists the 6-byte UTF-8, and the other 2 making up the 8, as not in use at this stage. Also, ISO 10646:2012 does not state that Unicode in UTF-8 has to be encoded only 1 way. Yes, ISO-encoded UTF-8 is different.

    That Exploit Guy, yes, you had it as unsigned and signed unicode. In other words, a basic coding error.

    That Exploit Guy, there are non-Unicode code points above the Unicode ones in UTF-8. The Unicode consortium is not the only party that provides code points to UTF-8. This is why you have to know the difference between ISO and RFC documents when it comes to UTF-8. If the UTF-8 in a document is ISO, it might not be Unicode.

    http://www.iana.org/assignments/character-sets/character-sets.xhtml

    There are a stack of what are called vendor charsets.

    DrLoser and That Exploit Guy, you do find 5- and 6-byte UTF-8 on Unix, BSD, OS X and Linux filesystems. You use some non-Unicode charset, guess where it ends up mapped: stuff doing conversion to Unicode and back again.

    Sorry, Unicode is not the be-all and end-all when it comes to code points. There are code points above 21 bits. Some convert to UTF-32 without issue due to it being 31-bit.

    UTF-16 or UTF-32 can always be converted to UTF-8; UTF-8 may contain chars that cannot be converted to UTF-16 or UTF-32.

    UTF-8 is not Unicode-only.

  32. DrLoser says:

    Nothing wrong with unrequited checks, TEG.
    Love conquers all.

  33. That Exploit Guy says:

    You had unrequited checks in your source code to achieve your ends. Yes RFC 3629 conforming code
    If value is unsigned less than zero is also impossible.

    Is that so? Have a look at the code again (http://pastebin.com/qYyxPv64). The variable “unicode” in that context is a signed long (even though in the calling function “unicode” is an unsigned long).
    Sometimes it pays to be verbose.
    Also, for the purpose given (i.e. illustrating a point), the code had to be verbose.
    Not too shabby for some code I whipped up in an afternoon, huh?

  34. That Exploit Guy says:

    That Exploit Guy the issue here is there are two standards ISO UTF-8 and IETF limited form.
    ISO/IEC 10646:2012 (http://standards.iso.org/ittf/PubliclyAvailableStandards/c056921_ISO_IEC_10646_2012.zip), you say? I guess you have forgotten to look at “Table 3: Well-formed UTF-8 Octet sequences”. Have you perhaps taken time to look at the Unicode Consortium equivalent, “Table 3-7: Well-formed UTF-8 Byte Sequences” in Unicode Standard 6.2.0 (http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf), then?
    More than one way to encode a code point? Nope.
    Code points above U+10FFFF? Nope.
    6-byte wide, UTF-8 encoded character? Nope.
    It seems your words have once again contradicted trivially verifiable facts. What a shame!

  35. DrLoser says:

    Excuse my impertinence for butting in, Robert.
    Briefly, on the cascading if-else, oiaohm: there is nothing wrong with this. Ever heard of “layered software?” All the quoted code does is select between a one- and a four-byte character (the fourth byte being restricted, as per 2003). My choice would be a series of bitmasks; the choice here is a series of if-elses. Fundamentally, it’s the same choice. Both are O(1), should you care about such things.

    Now, on the following layer, you deal with “double value checks.” Do you know how UTF-8 deals with “double value checks?” You’ve been told often enough. Any code point has only one valid UTF-8 encoding.
    A correct UTF-8 aware encoder/decoder, like this one, will therefore call out the “double value” as invalid. And it will do so at the second layer.

    Passing on to your previous post. Can we please pick a normative version of the standards in question? RFC3629 is the present UTF-8 standard for the internet, and indeed for everything else as far as I can establish. Your link to a Cambridge site for ISO/IEC 10646:2012 actually points to ISO/IEC 10646:1993 (rev. 1996), which is not normative. Furthermore, nobody would dispute that the 1993 version allowed six-byte UTF-8 encodings (although not the fantastic extra-long schemes you have devised).

    To put this in perspective: you could be strict, like TEG, and conform to the 2003 standard. Your encoding and decoding would be limited to 3 3/4 byte UTF-8.
    Or you could be Postel-like, and accept the 1993 version. Your encoding and decoding would allow all 6 bytes.
    And you know what?
    If there was any purpose at all to allowing the 1993 interpretation, I’d say TEG was wrong. But he isn’t, because it’s just an encoding mechanism.
    There are no code points in any ISO or other standard (Unicode is a separate industry organisation, I should point out) that correspond to any encoding beyond 0x10FFFF.
    So, let’s say your Magic Decoder Wand comes up with something like 0x2FFFFF. What do you do? Guess? Because, standards-wise, that’s all you’ve got there.
    Two final observations.

    CO 80 Null is in fact used by Java.

    How cunning of you to find this on Wikipedia. Shame you didn’t read the rest:

    In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter. However it uses Modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constant strings in class files.

    For all that object serialization, JNI and native constant strings matter, Java could use Klingon as the encoding format. The outside world wouldn’t care. The outside world is what standards are for.
    Oh, and note the careful use of “Modified UTF-8.”
    Finally:

    Under ISO standard you are allowed to map what ever char does not have a matching Unicode char into UTF-8 5 and 6 byte long strings.

    Post 2003, no, you ain’t.
    But we live in a free world. Go ahead and encode whatever you like into “UTF-8 5 and 6 byte long strings.” Know what?
    They’ll be rejected at the other end during decoding. Without fail, assuming you’re talking to an ISO/IEC 10646:2012 compliant system.
    See, at the end of the day, it doesn’t matter what you or I or TEG thinks. When transmitting data via a standardized encoding, it matters what the other end of the wire thinks.

  36. oiaohm says:

    That Exploit Guy also you need to learn to code.

    if/else if: guess what, if the first if is true, the second if does not get run. So your first if checked whether something was under value X. You don’t, in the next if, need to check whether it is over value Y, because it has to be. You had unrequited checks in your source code to achieve your ends. Yes, RFC 3629 conforming code.

    If the value is unsigned, less than zero is also impossible.
    if (unicode < 0x80) {
        byte = 1;
    } else if (unicode < 0x800) {
        byte = 2;
    } else if (unicode <= 0xFFFF) {
        byte = 3;
    } else if (unicode <= 0x1FFFFF) {
        byte = 4;
    }
    That is for a 4-byte compact ISO UTF-8 encoder. Notice: no stack of && or double-value checks.
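    For what it’s worth, the same length classification can be written and exercised directly; a small Python sketch (the 0x1FFFFF ceiling mirrors the 4-byte form above, not RFC 3629’s 0x10FFFF cap):

```python
def utf8_length(cp):
    # Octets needed for a code point under the 4-byte table. Each ceiling is
    # checked alone: earlier branches already ruled out smaller values, so no
    # lower-bound tests are needed anywhere in the elif chain.
    if cp < 0x80:
        return 1
    elif cp < 0x800:
        return 2
    elif cp <= 0xFFFF:
        return 3
    elif cp <= 0x1FFFFF:
        return 4
    raise ValueError('code point out of 4-byte range')

# Matches the real encoder for every code point Python will encode:
for cp in (0x41, 0x3B1, 0x20AC, 0x10348):
    assert utf8_length(cp) == len(chr(cp).encode('utf-8'))
print([utf8_length(cp) for cp in (0x41, 0x3B1, 0x20AC, 0x10348)])  # [1, 2, 3, 4]
```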

    That Exploit Guy, you just showed me in your pastebin that you cannot code C to save yourself.

    Validating UTF-8 and UTF-16 on decode is another matter. Yes, it can be done way cleaner than you did.

    That Exploit Guy, that I have to view source to see what the link is, is just you being a pain. It’s not that I don’t know how to view the source of HTML; it’s just a waste of time to have to go read it.

  37. oiaohm says:

    That Exploit Guy, the issue here is there are two standards: ISO UTF-8 and the IETF limited form.

    The reality is, if your decoder and processing are built only to handle IETF, you may have security holes if ISO UTF-8 is ever dropped on them. C0 80 null is in fact used by Java.

    Java conforms to ISO UTF-8, not IETF UTF-8.

    ISO/IEC 10646:2012 still has UTF-8 as 6 bytes wide. This is the newest standard of UTF-8.

    That Exploit Guy, if you are not transmitting over the Internet, your UTF-8 should be done by the ISO standard, not the IETF one.

    Also, if you read into RFC3629 a little more you will find more issues.

    Section 8, the MIME registration section of RFC3629, is a very good read.

    That Exploit Guy, a known security hole is funny.

    The International Organization for Standardization (ISO) and the IETF have different ideas on what UTF-8 is.

    That Exploit Guy, your implementation may cause data destruction because you failed to label it correctly.

    Yes, trouble also comes because people are not aware of UTF8mb3, which comes before rfc3629, and UTF8mb4, which comes after.

    Yes, a UTF8mb3 decoder only supports up to the 3-byte encoding and rejects everything past that. UTF8mb4 supports the full range where rfc3629 does not. There is such a beast as UTF8mb6, which is ISO, and then there is UTF8mb8, which is full-range UTF-8 using the 2 reserved bytes.

    That Exploit Guy, of course I don’t recommend transmitting UTF8mb6 or UTF8mb8 as a webpage. Now, inside an ISO format like ODF, UTF8mb6 is acceptable.

    The big problem is people using web-limited UTF-8 inside ISO standard formats. Hello, expect to be burnt at some point. ISO reserves the right to open up UTF-8 to 8 bytes. It also reserves the right to open UTF-16 to more than 4 bytes.

    That Exploit Guy, also, the IETF has never said they will not allow larger. The agreement at ISO was to use out the 21 bits first in Unicode encoding, due to UTF-16 being the smallest. You are reading it as UTF-8 always stopping at 21 bits; this is not the case.

    You really need to read UTF-8 in Annex D of ISO/IEC 10646-1:2000 and ISO/IEC 10646:2012
    http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

    Under the ISO standard you are allowed to map whatever char does not have a matching Unicode char into 5- and 6-byte-long UTF-8 sequences.

    That Exploit Guy, UTF-8 is not restricted to only containing Unicode chars. Web transmission says you must restrict it. As I said, you have not implemented UTF-8; you implemented rfc3629.

    The rfc3629 limitation of UTF-8 has never been ratified by ISO.

    FE or FF in ISO are currently declared as not used, so they are to be rejected by current ISO decoders.

    The reality: by ISO, using 5- and 6-byte UTF-8 for non-Unicode chars is valid. The PUA area, for when all 21 bits fill up, will be in the bit ranges that 5- and 6-byte UTF-8 cover.

    Unicode has very long-term plans. Yes, at some point the current UTF-16 will be too small.

  38. That Exploit Guy says:

    That Exploit Guy if you posting insults insulting links to draw my attention back here again I will never answer you again
    Really? No one asked you to “answer” me, nor did I put forth any question whatsoever that would solicit such. I was simply laying bare your penchant for covering your lies with even more lies. Take that as entertainment for everyone else here at your expense.
    The reality is the ranges in UTF-8 are not how you had them. Spec gives encoding recommendations. You included unrequited checks.
    You seem not to have figured out how to right-click in your browser to view source yet. Don’t worry – here is the complete URL to the actual RFC to make up for your underwhelming intelligence:
    https://tools.ietf.org/html/rfc3629
    Notice that the RFC is considered an “Internet Standard”. Also notice that the RFC does not state that there “may be several ways to encode a character” or that there “should be only one way to encode a character”. Instead, it explicitly states:

    Determine the number of octets required from the character number and the first column of the table above. It is important to note that the rows of the table are mutually exclusive, i.e., there is only one valid way to encode a given character.

    Furthermore:

    Implementers of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
    A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but erroneously allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F (“/../”), yet permits the illegal octet sequence 2F C0 AE 2E 2F. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.
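    The overlong sequences quoted above are easy to exercise; a quick Python check (CPython’s decoder is strict, per RFC 3629):

```python
# C0 80 is the overlong two-byte form of NUL; a conformant decoder rejects it.
try:
    b'\xc0\x80'.decode('utf-8')
    accepted = True
except UnicodeDecodeError:
    accepted = False
assert not accepted

# The "/../" smuggling sequence 2F C0 AE 2E 2F from the RFC fails the same way:
# C0 AE is an overlong form of "." and must not be interpreted as one.
try:
    b'\x2f\xc0\xae\x2e\x2f'.decode('utf-8')
    accepted = True
except UnicodeDecodeError:
    accepted = False
assert not accepted

print('both overlong sequences rejected')
```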

    In other words, not only are you advocating the implementation of decoders that do not conform to an Internet Standard ratified by the IETF; you are in fact recommending the deliberate introduction of a known security hole into computer software. Well done!

  39. oiaohm says:

    That Exploit Guy, if you post insults and insulting links to draw my attention back here again, I will never answer you again.

    The reality is the ranges in UTF-8 are not how you had them. Spec gives encoding recommendations. You included unrequited checks.

    That Exploit Guy, the guarding against over-long forms in the decoder is in fact “accept and process” if your decoder is ISO.

    This is from RFC3629, 10. Security Considerations:

    “the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes.”

    If you had read that full document, which you have not, you would have found this. In fact, UTF-8 is not defined only by rfc3629. There are many standards defining what UTF-8 is. If your UTF-8 is rfc3629, chars are a max of 4 bytes; if your UTF-8 is ISO/IEC 10646, it is 6 bytes. If your UTF-8 is from another standards body, it is 8 bytes.

    That Exploit Guy, the reality is you have not implemented UTF-8. You have implemented the rfc3629 sub-form of UTF-8. I know it’s confusing when both are so similar. Yes, this would be like implementing UCS-2 and calling it UTF-16. Yes, you are using the wrong name.

    ISO/IEC 10646 does not say you have to guard against the different encoding forms. So a decoder supporting ISO/IEC 10646 expects 6 different forms of NULL.

    That Exploit Guy, your decoder/encoder is not ISO-standard UTF-8.

    This is a classic case of standards bodies not being standard.

    That Exploit Guy, basically you are incompetent: you never read the full document before implementing. Wikipedia shows UTF-8 as 6 bytes long because that is the ISO standard, and the end of the 4-byte group is U+1FFFFF.

    The RFC 3629 limitation is for web transport, not an implementation applied to internal database storage or file-system names.

    Yes, the traps for the unaware: ISO/IEC 10646 allows for fixed-length encoded UTF-8.

    The reason why someone might want to use fixed-length UTF-8: you cannot safely tell the difference between UTF-16 and UTF-32. In fact, UTF-32 is forbidden from Internet transport due to the fact that you cannot tell it apart from UTF-16. UTF-8 locked to 4 bytes takes the same space as UTF-32, is clearly identifiable as not UTF-16, and will decode with any ISO-standard UTF-8 decoder. Yes, it will fail in some RFC 3629 decoders but be accepted perfectly by other RFC3629 decoders.

    Calling UTF-8 variable-length alone means you have not read the ISO documents. UTF-8 is variable or fixed if your decoder is ISO.

  40. That Exploit Guy says:

    That Exploit Guy what you lose in conversion from GB18030 to unicode is regional accenting
    Sheer fantasy.
    Have you forgotten to consult your ever-reliable source (Wikipedia) on the subject? There is no accenting in written Chinese, mate. Accent notations are used only in pronunciation systems such as the Romanised Pinyin system and the Zhuyin system. This is not to mention that all Zhuyin notations have long since been assigned code points in the Unicode BMP.
    unicode >= 0x80 && unicode = 0x80 was not required because you should not have got to that point with a value less.
    A pack of lies.
    No such line exists in my code.
    The closest I have got there is: “else if (unicode >= 0x80 && unicode <= 0x7ff)”, and that corresponds to the 2nd iteration of the UTF-8 encoding scheme. Also, code points from U+0080 to U+07FF inclusive are currently assigned to a broad variety of characters such as accented Latin, Greek, Cyrillic, Hebrew and Arabic. There is no reason not to include this range in a UTF-16 to UTF-8 conversion program.
    There is no reason, however, to include code point values beyond 0x10FFFF because, as you say, they are “officially unused” in the Unicode standard.
    0x7ff look closer this is 0x111 1111 1111
    Utter nonsense.
    That’s “7FF”, not “FF”. Unlike you, when I said “FF”, I did mean “FF”.
    And the real range of 3 byte UTF-8 is 0 to U+FFFF
    Complete rubbish.
    One of the key design rationales behind UTF-8 was absolute compatibility with ASCII. This is why the UTF-8 encoding scheme uses only one octet to encode code points between U+0000 (Null) and U+007F (Delete) inclusive. Even you have admitted this very fact several posts earlier in your so-called “extended table”.
    Fool.
    When you have a choice of length you are meant to take the shortest it will fit into with UTF-8 but if you decoder receives something longer it is meant to accept it.
    A fool recounting his dreams.
    First of all, my little program is a UTF-16 to UTF-8 converter, so whichever part of your diseased mind tells you that it somehow “decodes” UTF-8 is simply wrong.
    Second of all, the decoder that you describe is in total violation of Section 3 of RFC 3629, which explicitly states that “there is only one valid way to encode a given character”. The same standard also classifies “overlong” encoding as a potential security risk that should be guarded against by the decoder (c.f. Section 3 and Section 10).
    Any more gibbering nonsense that you would like me to debunk?

  41. oiaohm says:

    That Exploit Guy, what you lose in conversion from GB18030 to Unicode is regional accenting. Yes, GB18030 converts to Unicode in a lossy way.

    GB18030 to UTF-8 or UTF-16 is more complex. The shorter chars in GB18030 are not based on Unicode numbers. They are a completely new assignment.

    Out of all the variable-width options, UTF-8 style is the largest.

    Of course, your code point encoder is presuming the code point will be inside Unicode, even though UTF-8 formally declared 5- and 6-byte forms, currently officially unused.

    There is a nice odd-ball way of generating the breakers.

    I now can see your issue.

    unicode >= 0x80 && unicode = 0x80 was not required because you should not have got to that point with a value less.

    0x7ff look closer this is 0x111 1111 1111

    Interesting, right? 110xxxxx 10xxxxxx: count the x. Remember, all x zero is valid and there are exactly 11 x values.

    So the real range of the 2-byte UTF-8 is 0 to 0x7ff.
    And the real range of 3-byte UTF-8 is 0 to U+FFFF.
    And the real range of 4-byte UTF-8 is 0 to U+1FFFFF. You checked for U+10FFFF; that is in fact wrong for encoding to UTF-8. This error says you have implemented rfc3629, not pure UTF-8. rfc3629 was to maintain compatibility with UTF-16; UTF-8 agreed to the limit.

    When you have a choice of length you are meant to take the shortest it will fit into with UTF-8, but if your decoder receives something longer it is meant to accept it.

    Each if in a UTF-8 encoder is only meant to check 1 value. Check the largest. There is absolutely no need, using if/else, to check the smallest, since smaller-than-max values still encode in a format readable by UTF-8 decoders.

    The fact you are checking for the smallest means you miss the fact that each UTF-8 code point range in fact starts from 0. Size optimisation says you don’t put smaller values in larger UTF-8.

    Basically, it is a big error to think there is a min-value check requirement with UTF-8. If you are checking that, you are wasting CPU time.

  42. That Exploit Guy says:

    That Exploit Guy the 9 location is the break point. UTF-8 can be extended to where self-syncing becomes more complex and that is 9 and more.
    Not with octets as the atomic unit, and most definitely not with your ridiculous scheme.
    That Exploit Guy china is allocating code pages that are not submitted to Unicode standard.
    On the Simplified Chinese side, GB18030, the official encoding of the PRC, guarantees interoperability with Unicode. On the Traditional Chinese side, Big5 (RoC) and the extended Big5-HKSCS (HK) are both as ancient as Bob’s behind and mostly disused in favour of Unicode. Whatever these “unsubmitted” characters might be, they would certainly be mappable to Unicode (in the PUA, no less) and certainly have no chance of reaching a number in the billions, let alone trillions (just imagine how big your software would have to be in order to store all those font faces).
    Next is not forget china is not following Unicode numberings.
    No joke. GB18030 isn’t part of the Unicode standard, so what makes you think it has to follow Unicode in anything except what it has guaranteed?
    Unicode is maped into a 4 byte value
    “FF” in hexadecimal is 255 in decimal or 11111111 in binary. How you managed to (deliriously) see 4 bytes there in a hexadecimal sequence of “XX XX” is beyond me.
    Besides, this is how you encode the same code points properly in UTF-8 (nice bit of plagiarism from Wikipedia, by the way):
    U+00DE (Þ) → C3 9E
    U+00DF (ß) → C3 9F
    U+00E0 (à) → C3 A0
    U+00E1 (á) → C3 A1
    U+00E2 (â) → C3 A2
    U+00E3 (ã) → C3 A3
    Encoding them in UTF-16 (little-endian) is even more straightforward:
    U+00DE (Þ) → DE 00
    U+00DF (ß) → DF 00
    U+00E0 (à) → E0 00
    U+00E1 (á) → E1 00
    U+00E2 (â) → E2 00
    U+00E3 (ã) → E3 00
    No 4-byte sequences whatsoever.
    The ability to use iteration 5 in UTF-8 to place the china encoding as kinda is can save a lot of pain
    No, it doesn’t.
    It is obvious that you haven’t noticed, but UTF-8, UTF-16 and GB18030 are all variable-width. This means the size of the octet sequence varies depending on the character you are encoding at the time. This is also why self-synchronising is a nice property to have (because you cannot predict where the next character is going to be by simply counting octets). All this stuff you made up about the 7th and the 8th iteration? Absolutely pointless.
    Also, here is a simple UTF-16 to UTF-8 converter I whipped up during my spare time. Notice the distinction between a code point (represented by a long) and encoding (represented by unsigned shorts and unsigned chars). Notice also the if-else-if clauses used for determining the proper octet number for each code point in unicode_to_utf8(). Compile the code with your favourite compiler (even GCC will do the job), try it with both BMP and SIP characters and observe the input and the output both in a UTF-8/16 compatible text editor and a Hex editor (provided you have already got the fonts to go with the former, of course). The text editor result should give everyone a good idea as to who is right here.
    Note: Simply use the command syntax “(compiled binary) (input text file) (output text file)” in case you are not good at reading C code.

  43. oiaohm says:

    That Exploit Guy, the 9-octet position is the break point: UTF-8 can be extended to where self-syncing becomes more complex, and that is at 9 octets and more. UTF-8 in the 9-octet form still has self-sync: 10vvvvvv or 0vvvvvvv become your markers; you know the start of a char happens after one of those two, and the first one is a char in its own right.

    The 9-octet form becomes lossy, losing a char here or there, and you have to start the file off with a char under 127.

    That Exploit Guy, the point of UTF-8 is that it is in fact so large that you should never hit its max.

    That Exploit Guy, China is allocating code pages that are not submitted to the Unicode standard. Next, do not forget China is not following Unicode numbering.
    U+00DE (Þ) → 81 30 89 37
    U+00DF (ß) → 81 30 89 38
    U+00E0 (à) → A8 A4
    U+00E1 (á) → A8 A2
    U+00E2 (â) → 81 30 89 39
    U+00E3 (ã) → 81 30 8A 30
    Yes, the first part is Unicode, the next part is China’s idea of encoding. Unicode is mapped into a 4-byte value, unless of course they have decided to assign another, shorter value to that char as well for optimisation.

    The ability to use iteration 5 in UTF-8 to place the China encoding more or less as-is can save a lot of pain in forwards and backwards conversion. GB18030 is an evil bit of work. Unlike UTF-8’s alternative encodings, GB18030 mandates the use of the alternative encoding over Unicode if an alternative exists. Yes, processing a GB18030 document you can hit chars you cannot convert to Unicode that are in fact Unicode chars, just because your alternative-encoding table is not up to date.

    That Exploit Guy, a space with no known chars that is ignored but not deleted is highly useful. UTF-8’s massive size allows it to deal with oddball events better by simply placing the oddball in spare space somewhere. The lack of UTF-8 in SQL Server and other things limits what can be done.

    UTF-8 is 8×8 using the same self-syncing method. At 9 bytes and beyond there is another self-syncing method that does work on UTF-8.

    Yes, for something designed quickly, UTF-8 is very well designed and majorly overkill. If you are implementing a database and don’t want future problems, use UTF-8.

  44. That Exploit Guy says:

    Let me do some maths for you, my gibbering friend.
    For the 1st iteration: 2^7 = 128
    For the 2nd iteration: 2^11 – 128 = 1,920
    For the 3rd iteration: 2^16 – 1,920 – 128 = 2^16 – 2^11 = 63,488
    For the 4th iteration: 2^21 – 63,488 – 1,920 – 128 = 2^21 – 2^16 = 2,031,616
    By this point, we have already reached 2,097,152 possible code points – a number far larger than the 1,114,112 (U+0000 to U+10FFFF inclusive) code point limit set by the Unicode standard. Even putting aside the fact that many, many code points of the supplementary planes are yet to be assigned to any characters, this is still 983,040 code points ahead of the standard requirement. Now…
    For the 5th iteration: 2^26 – 2,097,152 = 65,011,712
    For the 6th iteration: 2^31 – 2,097,152 – 65,011,712 = 2^31 – 2^26 = 2,080,374,784
    This is the entirety of the encoding scheme as defined by Thompson et al. At this point, we are already 2,146,369,536 code points ahead of what Unicode covers. If we were to encode all (non-existent) code points and put them in one file, the file would be about 95GB in size. Now, if we were to factor in your broken design, then…
    For the 7th iteration: 2^36 – 2,147,483,648 = 2^36 – 2^31 = 66,571,993,088
    For the 8th iteration: 2^42 – 2,147,483,648 – 66,571,993,088 = 2^42 – 2^36 = 4,329,327,034,368
    At this point, we have already gone beyond the display precision my brand-new, ten-digit scientific calculator can handle (yes, I am this old-fashioned), and our hypothetical all-code-points-in-one file is now ~32TB in size. With a space this large, you can encode not only all characters in all languages from both the past and the present, but also perhaps characters from all extra-terrestrial civilisations in the Milky Way. This is not to mention:
    1) We still haven’t factored in the fabled 9th iteration, which breaks the self-synchronising property of the entire encoding scheme.
    2) We are hardly close to exhausting even the current, measly space (by comparison) of 1,114,112 code points. Given this, there is simply no reason for one to employ the 5th iteration as defined by Thompson et. al., let alone the 7th iteration as defined by you.
    In other words, a 6+ octet UTF-8-like scheme is not only wholly unnecessary but also unlikely to be seen in the wild, as there is simply no known character currently associated with a 5-octet sequence or beyond.
    Any more made-up “implementations” you would like to discuss?

  45. oiaohm says:

    That Exploit Guy, number 4 is a typo. Yes, it should only be 4 bytes long; that is from where I copied the table to start off with.

    I hate doing tables in ASCII.

    Note it’s 64 bits long before UTF-8 ceases to be self-synchronising at an 8-bit level.

    8×8 is an oddball because the first byte does not contain a 0. Most people think UTF-8 ends at 8×7. My wording was horrible in one place.

    The design of UTF-8 is 8 bytes long; that is the reality, and that is when the self-sync breaks.

    Since UTF-8 was designed after UTF-32 and UTF-16, it is in fact designed to be larger than both of them.

  46. That Exploit Guy says:

    Let me jog your memory a little bit, my gibbering friend:
    forgot 8×8 was a odd ball.
    9 11111111 110vvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
    This is not just a simple case of miscounting. This is a clear indication of someone not understanding what “self-synchronising” means (but instead trying to make things up as he goes along).
    On top of that:
    4 001FFFFF 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    You do realise “001FFFFF” is a hexadecimal value, not binary, right? Even if you want to pass off someone else’s stuff as yours, at least try and understand what you are copying. Fortunately for you, Bob will always agree with you unquestioningly (at least in front of his perceived enemies), so that’s one listening ear you are guaranteed to have as long as this blog stays running.
    It’s a freak show.

  47. oiaohm says:

    That Exploit Guy, it’s not that I did not notice the 9-octet issue.

    There is a typo on 6, 7 and 8 that I missed.

    That Exploit Guy, the max and what Thompson stated for UTF-8 are two different things. The first byte has all 8 bits high, i.e. 11111111.
    Basically extend the table out
    1 0vvvvvvv
    2 110vvvvv 10vvvvvv
    3 1110vvvv 10vvvvvv 10vvvvvv
    4 001FFFFF 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    5 111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
    6 1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
    7 11111110 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
    8 11111111 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

    The level is the number of bytes in UTF-8. It’s self-synchronising until 8×8; 8×9 is when self-synchronising has to be broken.

    That Exploit Guy, no, you did not even correctly spot the error I made.

    That Exploit Guy, you keep on saying we are not close to running out of code points. China is not bothering to submit the code points in its own encoding to the Unicode standards body.

    With the extra space in UTF-8 it’s possible to embed the China encoding in a UTF-8 string whole.

  48. That Exploit Guy says:

    That Exploit Guy the max and what tomson stated with UTF-8 are two different things. First highs 8 bits ie 11111111
    Oh, boy, aren’t you an adorable try-hard?
    Notice that the second octet of the 9th iteration is the same as the first octet of the 1st iteration of your so-called “implemented utf-8”? Yep, by introducing this funky expansion, you have unknowingly broken the self-synchronising property of UTF-8. It’s still worth a good chuckle, though.
    This is not to mention we are hardly close to exhausting the available code points between U+0000 and U+10FFFF.

  49. oiaohm says:

    That Exploit Guy, the max and what Thompson stated for UTF-8 are two different things. The first byte has all 8 bits high, i.e. 11111111.

    Basically extend the table out
    1 0vvvvvvv

    2 110vvvvv 10vvvvvv

    3 1110vvvv 10vvvvvv 10vvvvvv

    4 001FFFFF 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv

    5 111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

    6 1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

    7 11111110 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

    8 11111111 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
    8×7, right?

    I forgot 8×8 was an oddball.
    9 11111111 110vvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

    This is the thing about the UTF-8 encoding rules: as the history documents, they were formulated not to limit UTF-8 length.

    8×7 and 8×8.

    The proposed FSS-UTF and the implemented UTF-8 are two different things.

    Alternative encoding means you are not obeying the bit-packing rules. Yes, this is one of the hazards of UTF-8: if you design your decoder on the presumption that everyone will obey the bit-packing rules, you can get gamed.

    DrL is not a person to trust on technical matters.

  50. That Exploit Guy says:

    Why those are coded in C and C++.
    Irrelevant. One simply does not null-terminate on an MS SQL server unless one is dealing with an MS SQL server that exists only in your head.
    That Exploit Guy there are failures using nvarchar as well in ms sql server with conversion errors.
    Again, fictional issues are not of my interest. Only real ones.
    nvarchar is lot larger in storage require utf-16.
    Quite the contrary, especially if we are talking about Chinese characters. Chinese needs only two bytes per character to encode in UTF-16 but three bytes per character in UTF-8.
    You are running a webpage you want utf-8. Lots and lots of things want utf-8.
    Conversion between UTF-8 and UTF-16 is a no-brainer and requires negligible resources for an up-to-date machine in 2014 to complete. Heck, convert the data however you see fit on the application server, if you must.
    Fun thing about utf-8 is alternative encoding allows you to make all chars exactly 64 bits long.
    No, this would be a total violation of the bit-packing rules of UTF-8 and would make your characters unreadable.
    Besides, the maximum number of octets for UTF-8, even by Thompson and co.’s (more comprehensive) original design, is 6. In other words, the largest number of bits possible for a UTF-8 character is 8 × 6 = 48, not 64.
    Also, the valid Unicode code points are between U+0000 and U+10FFFF inclusive. 5- and 6-byte UTF-8 sequences are simply unused.
    (Link courtesy of DrL)

  51. Grace has an API but I will just run it with a given input file and let the user decide what to do with the graphics. This way, my code can output multiple curves v time or range and I don’t have to do much more than put some axes labels and coordinates in a file. Grace is very flexible. I like it because it can do SVG very well and I don’t have to do any more study of the file-format. The thing “prints” to SVG, JPEG, PNG, PostScript, X11 and PNM. So, my code just starts up an instance and the user can go crazy with choices or just look at the chart and close the window when he’s done. With file-sharing, the limit, if any, on the number of graphs the user can be studying is huge. GEBC, on the other hand, was limited to just a few stored solutions and one or two plots. In the practice of ballistics, it is quite reasonable that a user would want to look at a few choices of calibres X choices of bullet-weight X choices of bullet-shape. With Grace, I can put out all those choices on three or dozens of charts depending on what the user wants. If all the user wants to do is look at a chart, he doesn’t need to know any more about Grace than closing the window when done. If the user wants to publish, which I do from time to time, he/she can generate fairly complex graphs labelled extensively in pretty fonts/styles and ship SVG so it scales to any size/shape. Why re-invent the wheel? There are plenty of reasons to re-invent GEBC, starting with bugs combined with lack of support for the code, and its truly limited use of modern technology to do the number-crunching.

    Grace has some drawbacks like no support for PDF in the Debian release, but overall it has far fewer bugs and more features than others such as jgraph. See Grace’s HomePage

  52. DrLoser says:

    Fair enough, Robert. I’d think that git is a bit of overkill (CVS would work well enough), but I agree, it’s a good way to learn a new tool.

    Any thoughts on why you picked Grace? It’s a little obscure. Does it have Pascal bindings? What graphics formats does it support? (Yes, I’ve looked it up. Humour me. You have a purpose in investing your time in this choice — I don’t.)

    I take it you’re going to be using GraceGTK, rather than qtGrace. I guess either one would do the job, but Qt is a bit heavyweight for your target, I would think. (Also probably too much C++, and I’m a C++ programmer and even I don’t feel entirely comfortable with it.)

    More power to your trigger finger, as it were!

  53. I’ve figured out what I will do with the graphic output. There is still 2 feet of snow in my yard so I have plenty of time before the ground thaws here. I have a couple of weeks of work to do. My current task is learning git so I can try out changes without breaking current working versions. That’s not a necessity, of course, but it’s useful in understanding what Linus and company do with the kernel that I build frequently. Target-shooting season also has to wait for the snow to go so I have plenty of time to give to programming.

    My plan for the graphics is to use Grace to take output from my code to render graphs. That allows the user to render the file in many formats and customize it from a nice GUI. I was thinking to generate SVG automatically but Grace is very flexible and already available on Debian. My code can insert into the intermediate file any default settings I need for my use. That leaves very little for me to do except to modify the internal data-structures and interpolation. I might produce a .deb package to give back to Debian. GEBC is not in Debian at the moment.

    I have been doing numerical graphing since the 1970s so I could output graphics without using Grace but it’s already there. This is one of the beauties of FLOSS. I don’t need to learn C or write tons of code to generate widely useful software. I also don’t need to translate GEBC’s graphics components to PASCAL.

  54. oiaohm says:

    That Exploit Guy, there are failures using nvarchar as well in MS SQL Server, with conversion errors. nvarchar is a lot larger in storage, requiring UTF-16. If you are running a webpage you want UTF-8. Lots and lots of things want UTF-8.

    The UTF-8 name is partly a double joke. The smallest UTF-8 char is 8 bits; the largest UTF-8 char is 64 bits, or 8×8. The fun thing about UTF-8 is that alternative encoding allows you to make all chars exactly 64 bits long. UTF-8 is compact and bulky in one: 1 decoder, 2 encoders.

    Leave the common languages and the Microsoft conversion defects come out to play. Mandarin has variants of the language that are not commonly used.

  55. oiaohm says:

    That Exploit Guy, null chars being correct is critical for database backends as well. Why? Those are coded in C and C++.

    RDBMSes are still written in C and C++.

    That Exploit Guy, basically you start finding some very creative bugs with SQL Server, like replication failures tracing back to alternative nulls.

    “Unprintable char” is incorrect. You end up feeding the database backend a char that cannot be processed.

    There are ways to DoS-attack MS SQL Server by sending it the right chars, due to inconsistent Unicode processing.

    This is the fun of MS SQL Server. If I pass UTF-8 or UTF-16 to a varchar, it will run it past its conversion engine. UTF-8 could be stored in varchar in MS SQL Server if Microsoft did not want to keep on converting to the local region setting.

    Basically you can drive the MS SQL conversion engines nuts. They are coded in C and C++.

  56. That Exploit Guy says:

    That Exploit Guy varchar should also store UTF-16 but null terminated.
    Obviously, you have read my convo with a fellow poster on a website that Bob has apparently programmatically forbidden everyone to mention here. (Or perhaps not – I am too busy to verify this.)
    You see, NULL termination is all fine and dandy except for this one bitsy, little problem – it’s only relevant to C. C, by design, uses the NULL character (0x00) to indicate the end of a string, and if you leave it out, there is a good chance you will end up with a buffer overflow issue, which may result in a security vulnerability in your application. On the other hand, NULL termination is an entirely irrelevant thing in the context of RDBMS, and what you will end up doing is feeding the DB with an unprintable character.
    Too bad, huh?

  57. That Exploit Guy says:

    I am a Pascal programmer.
    As if that matters.
    You could be a Haskell or a Pascal programmer, but when it comes to RDBMS, it’s all about SQL, relational algebra, ER, EER, etc. You either pick up a textbook on database theory and start learning about them, or keep making excuses like this and let people point and laugh at you silly.

  58. oiaohm says:

    That Exploit Guy, varchar should also store UTF-16, but null-terminated. MS SQL Server malfunctions due to not supporting the second NULL that Microsoft conversion allows.

    This is the funny part: Microsoft says they only implement stuff to spec, then go and implement stuff ahead of spec. The PUA in the Unicode standard is where the second null is. Microsoft implemented it and has failed to remove it from the conversion tools on Windows, so platforms like Linux and OS X still have to support its existence when handling UTF-16.

    That Exploit Guy, I said 6 nul values: 6 values in UTF-8 that should decode to exactly the same thing in UTF-32.

    That Exploit Guy, I could find the alternative null in UTF-8. In fact there are 6 nul values in UTF-8 at max implementation.
    0x00
    0xC0,0x80
    0xE0,0x80,0x80
    0xF0,0x80,0x80,0x80
    These 4 in UTF-8 are in fact all null; they are all different representations of zero. The next 2 that make up the 6 are UTF-8 at max length.

    That Exploit Guy, the reality is that processing can be hell. Null-terminated strings can be problematic.

    Using nvarchar is avoiding null processing.

    MS SQL also attempts automatic conversion when placing data into varchar, and explodes due to alternative null values.

    Yes, the fun of Unicode: UTF-8 and UTF-16 have more than one value for null. UTF-32 is sane, with only one value for null.

  59. DrLoser says:

    “I am a Pascal programmer,” you say.
    Are you? Are you really? You haven’t shown any evidence of being one.
    How’s that GEBC (graphics and all) conversion to Pascal going?
    I seem to recall that you were leaving it for a while. It appears to me that your “while” is up, dude.

  60. TEG wrote, “changing a data type to a correct one does not involve altering your data model.”

    I am a Pascal programmer. When I write database-structure, I mean structure, including type, position, etc.

  61. Truth says:

    Yeah!! Linux is the best OS.

  62. That Exploit Guy says:

    * is compatible

  63. That Exploit Guy says:

    That Exploit Guy changing data type to suite language is not suitable in so many cases it not funny.
    Here is the break down of what we have established in our previous discussion:
    1) nvarchar stores characters internally in UTF-16.
    2) MS SQL automatically takes care of the conversion from GB18030 to UTF-16 for you when you store a string in nvarchar.
    3) By design, the mapping between GB18030 and Unicode is bijective.
    Now, the only reasons to not use nvarchar in preference to varchar would be those as suggested by you. However, as we have already established, they are all pure fabrications on your part. Unless the laws of physics of this world are in fact controlled by your very imagination, I think it is safe to disregard your concerns as fairy tales.
    Any other questions?

  64. That Exploit Guy says:

    That Exploit Guy I could find the alternative null in utf8. In fact there are 6 Nul values in utf8 at max implementation.
    There is no such thing as a “max implementation” of UTF-8 or a UTF-8 character with 6 “nul”. Since UTF-8 by design has a variable octet number for each character and is compatible with ASCII, the null character (U+0000) is naturally mapped to 0x00 (1 byte).
    Utf-16 is 0x0000 and a extra null hidden on Plane 0xFF.
    There is no such thing as an “0xFF” “plane”. Unicode designations for code points and “planes” are encoding independent. The closest to what you describe is the Basic Multilingual Plane, which spans between U+0000 to U+FFFF inclusive. Code point U+00FF is assigned to “Latin Small Letter y with Diaeresis“.
    Planes above U+EFFFF (i.e. Planes 15 and 16) are PUA, by the way.

  65. oiaohm says:

    That Exploit Guy, I could find the alternative null in UTF-8. In fact there are 6 nul values in UTF-8 at max implementation.
    0x00
    0xC0,0x80
    These two are the most commonly used.
    Inside:
    0xE0,0x80,0x80
    0xF0,0x80,0x80,0x80

    UTF-16 is 0x0000, plus an extra null hidden on Plane 0xFF. Yes, the exact inverse: 0xFFFFFFFF. This is the 4-byte null of UTF-16. Or: why in hell did you do this to us, Microsoft? Some bright spark considered this a bright idea so that long strings of 4-byte UTF-16 could be processed without offset issues even if they included null in the middle. Mind you, it’s only Microsoft encoding that spits out 0xFFFFFFFF in UTF-16 as null, and only in some languages a lot.

    This is the problem, That Exploit Guy: Microsoft could make a lot of conversion errors disappear if they saned up sections of their converters, like doing what everyone else does and never using the UTF-16 4-byte null. Null does not have to equal zero.

    That Exploit Guy, you are using varchar and the like so you can perform a string query database-side and get results. Base64 has no place in a database. Using binary storage also effectively stuffs up SQL searching.

    That Exploit Guy, changing the data type to suit the language is not suitable in so many cases it’s not funny.

    I know about this site’s CSS limitations, so do you see me using the direct link option?

    Mouse over my links: they work while my post is the most current one with links. The CSS here blocks using alternative text for links straight up, and also blocks links when the entry gets old.

    Over time I am getting to a method for this site where more and more of my links work. If they don’t work, I have done them in a way that they can be copied.

  66. That Exploit Guy says:

    Someone kindly drew my attention to this comment by Bob, which has been altered since my last reply to it. (Perhaps from now on I should screencap Bob’s comments before replying?)
    so M$ wants folks to rewrite their software and use a different database-structure to use a different data-type just to use the Chinese language characters.
    Nah, Bob. Unless you have absolutely no idea what you are doing, changing a data type to a correct one does not involve altering your data model.
    I suppose you are well aware of this, right?

  67. That Exploit Guy says:

    *notion

  68. That Exploit Guy says:

    That Exploit Guy sorry there is a set of issues. Bad support for the second null in UTF-16 Microsoft own coding convert solutions using the second null in UTF16 then their own database product not supporting it.
    Modified UTF-8/CESU-8 have fundamentally nothing to do with UTF-16 proper. We have already gone over this many, many times.
    Also, guess how long “0xC0,0x80” is. (Note for the slow: it’s not 4 bytes).
    This is the basic inconstantly problem inside windows. Shows up once you stop using using languages that code in UCS-2. MS SQL Server is not the only MS product that has the error.
    Again, UCS-2 is a strict subset of UTF-16. We have gone over this also many, many times. If you still have trouble remembering such facts, I’ll start billing you next time for a consultancy fee.
    Have you not noticed receiving large emails comes with a huge overhead due to Base64.
    The data increases by one-third in size from its original form to Base64. Performance is still at worst linear (reading from the beginning to the end of the string), can be effectively reduced to constant by combining the conversion with other reading operations and is no worse than a cross join.
    If you say you must use Base64 in a database world you are saying lets cripple performance to hell.
    If. MS SQL does not belong to this “if”.
    For one thing, MS SQL supports binary and varbinary type, should you wish to take this ridiculous, SQL-unfriendly route.
    Also, so far you have suggested absolutely no case in which MS SQL cannot store properly encoded UTF-16 strings (also a ridiculous notice) outside of your pure fantasies.
    You came after me on both my english and this point That Exploit Guy.
    Those subjects are where all the hilarity lies. Reading you talking about them is like watching an idiot parading in the streets in his birthday suit: he thinks he is showing off the beautiful, new garb he has just bought, but all that everyone clearly sees is his naughty bits hanging out. Ew.
    By the way the way you are doing links is blocked from access.
    Grab the link from the page’s source, Dumbo. I thought you’d know better about this site’s broken CSS.
    (On second thought, whom am I kidding? 😉 )

  69. oiaohm says:

    That Exploit Guy, also, about that link of yours stating all the UTF-16 issues (http://www.utf8everywhere.org/): exactly what good reason is there for MS SQL Server not to provide a widechar type that is UTF-8?

    All those errors with UTF-16 cause hell. Switching the locale to UTF-8 under Windows also busts applications.

    In reality, Unicode in Windows is stuffed.

  70. DrLoser says:

    “Bad support for the second null?”
    I’m sorry, oiaohm, this is all too esoteric for me.
    Could you present us with a use case, or better still, a “Steps To Replicate?”
    You undoubtedly have one, maybe several. Don’t make us wait for the other shoe to drop.

  71. oiaohm says:

    That Exploit Guy, sorry, there is a set of issues: bad support for the second null in UTF-16, Microsoft’s own code-conversion solutions using the second null in UTF-16, and then their own database product not supporting it. This is the basic inconsistency problem inside Windows. It shows up once you stop using languages that code in UCS-2. MS SQL Server is not the only MS product that has the error.

    Base64 encoding is not a solution. Databases support binary blob options for a reason.

    Have you not noticed that receiving large emails comes with a huge overhead due to Base64? If you say you must use Base64 in a database, you are saying let’s cripple performance to hell; so bad you are better off using a different database.

    You came after me on both my English and this point, That Exploit Guy.

    By the way, the way you are doing links is blocked from access. There is a reason why I use full links. The reason I don’t bother referencing for you most of the time is that you don’t do accessible references.

  72. That Exploit Guy says:

    That Exploit Guy A utf-16 4 byte nul is not all zeros. That is the problem…
    Then you went on to talking some rubbish about modified UTF-8, which is completely irrelevant as far as UTF-16 proper or Unicode in general is concerned.
    Look, UTF-16 is not the greatest as far as character encoding standards are concerned, but fabricating a crisis where there is none does nothing more than hurt your own argument. In the worst case, Base64 is all you need to deal with character encoding that a DBMS doesn’t support natively. Heck, we have been doing this with e-mail pretty much since the beginning of time!
    Please, Pogson has already embarrassed himself once defending you over this, so why don’t you do yourself and the old fool a favour and stop telling these fantastic yet completely false tales already?

  73. oiaohm says:

    That Exploit Guy, a UTF-16 4-byte nul is not all zeros. That is the problem.

    In Modified UTF-8, the null character (U+0000) is encoded as 0xC0, 0x80.

    I can dig up the UTF-16 one as well. 0x0000 is the UCS-2 null. The 32-bit UTF-16 nul is a lot like the Modified UTF-8 nul. It is a mistake to think a null is always all zeros. This is why something processing UTF-16 or Modified UTF-8 that is not compatible can go completely nuts and miss the end of a string.

    UCS-2 processing does not support a UTF-16 surrogate-pair nul. If the UTF-16 surrogate-pair nul were all zeros, or the last half of the surrogate pair were zeros, nothing would bust. The reality is the surrogate-pair nul is not zeros.

    Basically, a string with a surrogate-pair nul is one of the tests of whether you have a working UTF-16 processing system.

    Traditional and Simplified writing were in fact designed for computers. They do not include regional things like the special chars for Tibetan and other sub-forms.

    As more and more regional areas of China are developed, more and more chars from regional dialects of Mandarin have to be added, including complete sets of Traditional and Simplified with minor extra markings and regional variation in meaning. Yes, how the language is written is not 100 percent constant across China.

    The numbers I am talking about are conservative. The Compatibility Encoding Scheme exists to avoid the nightmares that are coming.
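    For reference, the null encodings being argued about can be checked against standard codecs. A small Python sketch (Python ships no Modified UTF-8 codec, so Java’s 0xC0 0x80 pair is written out by hand):

```python
# Standard UTF-8 encodes U+0000 as a single zero byte (shortest form).
assert "\x00".encode("utf-8") == b"\x00"

# Java's Modified UTF-8 instead uses the overlong pair 0xC0 0x80 for
# U+0000, so encoded strings never contain an embedded zero byte.
modified_utf8_nul = b"\xc0\x80"
assert b"\x00" not in modified_utf8_nul

# Standard UTF-16 (big-endian, no BOM) encodes U+0000 as two zero
# bytes; BMP characters such as U+0000 get no surrogate-pair form.
assert "\x00".encode("utf-16-be") == b"\x00\x00"
```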

  74. That Exploit Guy says:

    CESU is another upgrade.
    The Compatibility Encoding Scheme for UTF-16 is not part of the Unicode standard, and there is no reason for a Unicode-compliant system to need to support it.
    Java added support for UTF-16 4 byte Null char.
    A UTF-16 character is either 16 or 32 bits long depending on whether the character is in the BMP. The Chinese block is in the BMP.
    Besides, you do realise “4 byte Null char” essentially means a zero repeated 32 times in binary, right? What makes you think this somehow represents a whole host of characters is simply beyond me, my Time Cube-advocating friend.
    MS SQL Server works find until you drop a 4 byte null on it something you get out of conversions of the large languages.
    Yeah, yeah, yeah… We get it. MS SQL is educated stupid. 4 corner Null bytes is God. Etcetera, etcetera.
    Basically you should not find UCS-2 processing engines any more.
    There is no more a “processing engine” than there is an API for you to handle Unicode characters in Java. Again, as the documentation tells you, a Character instance is simply an object representation of a group of 16 bits. To encode a so-called “supplementary character” (i.e. a non-UCS-2 character), you need two surrogate units, each 16 bits wide, or, in other words, two Character instances in Java. This is in adherence to the “surrogate pair” definition in Section 3.8 of the Unicode standard.
    That Exploit Guy china is mapping every single Chinese char that has ever been used.
    And why wouldn’t they? In any case, the prescribed mapping mechanism between Unicode and GB18030 has already accommodated this nicely. The only problem here is your impaired cognitive ability failing at grasping the standards involved.
    That is approx 3 billion chars to map.
    That is an entirely speculative claim not grounded in facts.
    Each Chinese person in china could have 2 chars they call their own.
    Ditto. It appears you are only saying this because you have realised just how big a 31-bit space in fact is.
    That Exploit Guy basically unicode was design without allowing on how complex Mandarin really is
    Mandarin is either what you eat or what you speak. The written characters, on the other hand, are either Traditional Chinese or Simplified Chinese depending on geopolitical factors (e.g. PRC vs. RoC).
    That’s usually how I explain the difference to someone unfamiliar with the language.
    once you add in regional dialects
    Big whoop. Supporting regional characters is only a matter of allocating supplementary blocks for them (e.g. HKSCS).
    Besides, Unihan takes care of the glyphic differences between Traditional Chinese and Simplified Chinese (as well as Kanji and Hanja), so that is also a non-issue as far as Unicode implementation is concerned.
    The private area in Unicode in fact does not have enough space to map all the chars required todo support the language in china properly.
    Again, that is completely speculative and not grounded in facts.
    Mapping to the PUA is usually used as an interim measure until the characters in question are allocated proper code points (again, see HKSCS). Even GB18030 at this point has only a handful of code points mapped to the PUA – hardly the hundreds of millions you are desperately claiming.
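    The surrogate-pair arithmetic from Section 3.8 of the standard takes only a few lines to demonstrate. A hand-rolled sketch in Python (the helper name is my own), cross-checked against the standard library codec:

```python
import struct

def to_surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into the
    UTF-16 high/low surrogate code units defined by the standard."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000               # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

# U+10400 (DESERET CAPITAL LETTER LONG I) -> D801 DC00
assert to_surrogate_pair(0x10400) == (0xD801, 0xDC00)

# Cross-check against the standard library's UTF-16 codec.
assert struct.pack(">2H", *to_surrogate_pair(0x10400)) == "\U00010400".encode("utf-16-be")
```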

  75. oiaohm says:

    That Exploit Guy, UTF-16 is an upgrade to UCS-2. It is not in fact piggybacked. Java 1.5’s implementation of UCS-2 did not reject non-UCS-2 chars. CESU is another upgrade. UTF-8 and UTF-32 on Linux, heck, even UTF-16 on Linux, do not destroy CESU. Yet UTF-16 processing on Windows has issues.

    Java added support for the UTF-16 4-byte Null char. This is the only super-critical addition needed to turn a UCS-2 implementation into a UTF-16 implementation. Guess what MS SQL Server is missing. MS SQL Server works fine until you drop a 4-byte null on it, something you get out of conversions of the large languages. Yes, a stupid convention in UTF-16: if most of the chars in a string are 4 bytes the nul is 4 bytes, and if most of the chars are 2 bytes the nul is 2 bytes, and the 2-byte nul is exactly the same as UCS-2. UTF-8 says use the shortest possible form, the exception being Java, which uses a non-standard UTF null char to mark data as Null. Of course the UTF-32 nul is a fixed size.

    Basically you should not find UCS-2 processing engines any more. If it is UCS-2, it should be processed by a UTF-16 engine. The extra UTF-16 chars are not valid UCS-2, so a valid UCS-2 item should not contain them. Yes, a UCS-2 string processes safely in a UTF-16 engine, but a UTF-16 string does not process safely in a UCS-2 engine.

    That Exploit Guy, China is mapping every single Chinese char that has ever been used. That is approx 3 billion chars to map. Each Chinese person in China could have 2 chars they call their own. In fact it is turning out to be critical. Regional meanings of particular words in China differ, and the different meaning is also backed by a slightly different symbol.

    That Exploit Guy, basically Unicode was designed without allowing for how complex Mandarin really is once you add in regional dialects and how much char space it in fact needs.

    The private area in Unicode in fact does not have enough space to map all the chars required to support the language in China properly.
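    One concrete kernel in this back-and-forth: a consumer that walks UTF-16 data in fixed 16-bit units, UCS-2 style, sees a supplementary character as two reserved surrogate code units rather than one character. A short Python sketch:

```python
import struct

# One supplementary character (U+20000, CJK Extension B) encodes to
# four UTF-16 bytes, i.e. one surrogate pair.
data = "\U00020000".encode("utf-16-be")
assert len(data) == 4

# A UCS-2-style reader consumes fixed 16-bit units...
units = struct.unpack(">2H", data)

# ...and sees two code units in the reserved surrogate range
# (0xD800-0xDFFF) instead of one character.
assert all(0xD800 <= u <= 0xDFFF for u in units)
assert units == (0xD840, 0xDC00)
```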

  76. That Exploit Guy says:

    That_Exploit_Guy the problem is current code point assignments by China without Unicode main body approval exceed the 31 bit limit.
    Mate, 2^31 is 2,147,483,648. If characters were people, even China would have fewer than that.

  77. That Exploit Guy says:

    That_Exploit_Guy UCS-2 in java has been replaced by UTF-16 then by CESU or Modified UTF-8.
    UTF-16 support in Java is piggybacked on the UCS-2 stuff from before version 1.5. This is well explained even in the base API documentation:

    The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

    China is not using the Private Use Area.
    No one does. That’s why it’s called “mapping”: it happens only on the machine where you do the GB-to-Unicode conversion. Boy, do I really need to explain something like this to you?

  78. oiaohm says:

    That_Exploit_Guy, UTF-8’s free space above Unicode provides a place to map the extra chars.

    That_Exploit_Guy, UCS-2 in Java has been replaced by UTF-16, then by CESU or Modified UTF-8. CESU is a horrible beast. With CESU and Modified UTF-8 it is possible to do 36-bit Unicode.

    That_Exploit_Guy, please find a current document saying Java is using UCS-2. You will find that your comment is badly out of date. Java 6 was not using UCS-2 by default. Java 6 was already CESU.

    CESU is a method that would allow GB18030 on NTFS. But this requires altering NTFS’s UTF-16 processing not to stuff up the CESU coding.

    That_Exploit_Guy, the problem is that current code point assignments by China without Unicode main body approval exceed the 31-bit limit. CESU and UTF-8 and UTF-32 are what China is designing for. CESU is a major extension to UTF-16.

    China is not using the Private Use Area.

    http://www.unicode.org/reports/tr26/tr26-2.html

    That_Exploit_Guy, the issues I am complaining about with Windows have existing solutions to work around them, if Microsoft will implement them.

    That_Exploit_Guy, the limits are on Wikipedia and in the official documentation of Unicode.

    That_Exploit_Guy, the last major thing using UCS-2 is MS SQL Server; everything else has migrated. The problem is there is a need to migrate off UTF-16 as well.

  79. dougman says:

    How does it feel being dominated by Linux daily?

  80. That_Exploit_Guy says:

    UTF-8 since 2005 stops at 32 bit limit matching UTF-32.
    If I need to buy a new desk because of you, I’ll make you pay the bill.
    UTF-32 is simply UTF-16 with a constant number of four octets per character.
    UTF-16 is simply a variable-length extension of UCS-2 with two additional octets.
    UCS-2 is simply a two-octet encoding that used to form part of the Unicode standard and was superseded by UTF-16.
    UCS-2 is notably used by the Java programming language, in which low-paid devs in China churn out code by the metric ton per second.
    GB 18030 is by design mappable bijectively and trivially to Unicode, and in the case that the Unicode Chinese block does not fully cover the assigned code points in GB 18030, the excess is simply mapped instead to the PUA. At the moment, only a handful of odd-ball characters are in this “excess” category, and they are a non-issue as long as you don’t happen to need the same PUA code points for other, non-Chinese characters.
    Also, no, Unicode, not UTF-16 or UTF-8, is the code point assignment standard. Switching everything from UTF-16/32 to UTF-8 simply won’t change how code points are mapped between Unicode and GB18030 even one iota.
    Understood?
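    The bijective-mapping point can be exercised with any codec library that ships GB 18030 tables. A quick Python round-trip sketch (sample characters chosen arbitrarily):

```python
# GB 18030 is defined to round-trip with Unicode: every code point
# gets a 1-, 2- or 4-byte GB 18030 sequence and maps back losslessly.
samples = [
    "中文",          # common BMP ideographs -> 2-byte sequences
    "\U00020000",    # CJK Extension B ideograph -> 4-byte sequence
    "€",             # non-CJK characters are covered too
]
for text in samples:
    raw = text.encode("gb18030")
    assert raw.decode("gb18030") == text    # lossless round-trip
```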

  81. That_Exploit_Guy says:

    That Exploit Guy UTF-8 and UTF-16 don’t have the same number of chars limit
    Which part of “restricted to have the same number of valid code points” do you not understand?
    Oh, I see… You are again regurgitating what I have just told you as though it was a never-mentioned discovery. Nice going, jackass.

  82. oiaohm says:

    That Exploit Guy, UTF-8 and UTF-16 don’t have the same char-count limit. UTF-8 has a 36-bit limit. UTF-16 has a 31-bit limit. In 2003 it was agreed that UTF-8 would stop at the same point as UTF-16; in 2005 this was changed. Since 2005 UTF-8 stops at a 32-bit limit matching UTF-32.

    That Exploit Guy, SQL Server 2008 still uses UCS-2 internally for varchar, not Unicode, even though it was superseded in 1996 by UTF-16. This is why it explodes.

    With all the extra chars from all the extra languages, 31 bits was not enough. 32 bits might not be enough either. UTF-8 only runs into trouble at 36 bits. Even 5+ years since the China char standard, the Chinese char extensions to Unicode are not formally ratified even though they are in use. Basically, how do you argue with what 5 billion people are doing?

    That Exploit Guy, the issue is that standards sometimes evolve by brute force. The UTF-8 extension and the UTF-32 extension are brute force.
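    For reference, the ceilings being argued over can be probed directly: since RFC 3629 (2003), UTF-8 is defined only up to U+10FFFF, the same limit UTF-16 reaches, and UTF-32 is restricted to the same range. A Python sketch:

```python
# The Unicode code space tops out at U+10FFFF; chr() enforces this.
assert ord(chr(0x10FFFF)) == 0x10FFFF
try:
    chr(0x110000)                      # one past the limit
    raised = False
except ValueError:
    raised = True
assert raised

# UTF-8 and UTF-32 encode the maximum code point, but no further --
# neither reaches characters that UTF-16 cannot also represent.
top = chr(0x10FFFF)
assert top.encode("utf-8").decode("utf-8") == top
assert top.encode("utf-32-be").decode("utf-32-be") == top
```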

  83. That Exploit Guy says:

    UCS-2 only supports the older GB2312 so when you start using the new/old chars added in GB18030 windows busts doing the conversion. Microsoft has programs that depend on getting UCS-2.
    *headdesk*
    UCS-2 and GB 2312 are both encoding standards. The word you are looking for is “mapping”, and, no, GB 2312 is not designed to have a simple or obvious way to map between itself and any Unicode encoding.

  84. That Exploit Guy says:

    UTF-8 and UTF-32 support all local char sets. UTF-16 and UCS-2 does not support all local char sets this is the problem hit with varchar in SQL server.
    What on Planet Earth are you babbling about, Mr. Gibbering Fruitcake? By standard, both UTF-8 and UTF-16 are restricted to have the same number of valid code points. If you somehow manage to store more characters in UTF-8 than in UTF-16, then the only explanation is that your implementation is non-standard.

  85. That Exploit Guy says:

    So, I create a database here and ship it over there and it won’t work with Chinese?
    *Sigh* The mapping between GB 18030 and Unicode is bijective, so what exactly is wrong with storing GB 18030 strings as Unicode? Are you suggesting that Unicode will somehow make Chinese strings suddenly become non-Chinese, Bob?

  86. oiaohm says:

    That Exploit Guy, the issue is that UCS-2 and UTF-16, true, don’t support GB18030 fully. UCS-2 only supports the older GB2312, so when you start using the new/old chars added in GB18030, Windows busts doing the conversion. Microsoft has programs that depend on getting UCS-2. UTF-16 only supports some of the new chars in GB18030, but not all. In fact Linux users do suffer from this on Samba, where particular files will disappear for MS Windows users or, at worst, MS Windows crashes.

    Interestingly enough, SQL Database, when it loads scripts, goes UTF-32, so it has no problem handling a GB18030 script, but since the storage in the database is locked to a limited Unicode you are stuffed using it properly. Every other database maker in existence did not choose UCS-2.

    Basically you have a mix of Unicodes inside Windows with different levels of compatibility, due to two Unicodes used by Microsoft that are not used by anyone other than Microsoft, so they don’t get updated to support the new standards.

    UTF-8 and UTF-32 support all local char sets. UTF-16 and UCS-2 do not support all local char sets; this is the problem hit with varchar in SQL Server.

    The result is a stack of random land mines all around Windows.

    By the way, there were new chars added to GB18030 in 2005. Linux supports the latest GB18030 release. Windows does not at all. Windows cannot convert the latest release of GB18030 without third-party software.

    That Exploit Guy, Microsoft thinks it is fine to claim GB18030 support when they only support GB18030-2000, not GB18030-2005. Then there is an extension in 2006 for Tibetan.

    That Exploit Guy, basically encoding conversion is a major weakness in Windows.

  87. TEG wrote, “the best way to store GB 18030 strings is to set the type as nvarchar”.

    So, I create a database here and ship it over there and it won’t work with Chinese? That’s broken software. PostgreSQL doesn’t have that problem. I get Chinese comments sometimes in my spambox and I use MySQL. It’s a damned good thing I don’t use M$’s products.

  88. That Exploit Guy says:

    Seems to me an error-message is an issue
    No, Bob. What Qingsong Yao tells you is that the best way to store GB 18030 strings is to set the type as nvarchar and let MS SQL take care of the Unicode↔GB mapping for you.

  89. TEG wrote, “Blah, blah, friggin’ blah.”

    The post he quotes contains this, “What if user has GB18030 encoding data/files/scripts, the answer is that SQL Server can recognize this GB18030 encoded input, and convert them into Unicode, so you will no issue with these files/scripts/data.”

    Yet the user reported, “When I try to insert different combination of GB18030 characters in to a varchar column, it gives me unique constraint violation error.”

    Seems to me an error-message is an issue… The advice? “the best way to store GB18030 characters is use SQL Server’s nvarchar type”, so M$ wants folks to rewrite their software and use a different database structure and a different data type just to use Chinese-language characters. That’s an obscene definition of support. Why not just use PostgreSQL from the start? What has all this to do with the death of XP? I’ve seen teachers who had French keyboards trying to use XP. What a joke M$ is: software that’s a patchwork covered over with layers of complexity. Meanwhile, GNU/Linux shipped anywhere in the world will work with anyone, on any hardware. One of the problems is M$’s EULA. They forbid buying a product in one part of the world and using it elsewhere. That’s on top of the stupid technology underneath. That’s what you get with an OS designed by sales con-men.

    EULA: “All legally licensed Microsoft products should contain an End-User License Agreement (EULA), which is your primary proof that you own a legally acquired product. However, it is also recommended that you keep the original user’s manual (or at least the cover and first page of the manual), the product disks, the Certificate of Authenticity, and your purchase receipt.”

    HAHAHA! ROFL! Give me Free Software any day. In most of the schools where I worked very few items from that list were available.

  90. That Exploit Guy says:

    gb2312 is out of date. The current China official is GB18030
    Good to know that you have access to Wikipedia (though so does everyone). Shame that you don’t seem to have a clue as to why these standards stick around despite the existence of Unicode.
    Read the XP bit. NTFS is UTF16 that does not in fact support GB18030 filenames.
    And Linux uses UTF-8. Big whoop.
    Linux can because GB18030 encodes into UTF-8 and UTF-32.
    And XP does the same between local encoding standards and UTF-16/UCS-2. Again, big whoop.
    Welcome to fun XP can have GB18030 enabled but its only part.
    Yet entire localised versions of XP exist despite your claim. Also, think Big5(-HKSCS), (Shift-)JIS, etc. Why do you think there is a need for an OS to support any of those even though the underlying platform doesn’t use or need any of those encodings internally? The answer lies in the very first thing you have failed to grasp.
    Yes GB18030 on XP and Windows 7 and Windows 8.1 is glitch central.
    Really? Pray tell what these alleged glitches are. I have Windows 8.1 right in front of me and a pair of hands to put those IMEs to work. All that remains now is you telling me where to look. (Oh, and the way I managed to find that Chinese XP installation guide might shock you a little!)
    Yes MS SQL Server does not support GB18030 at all yet Orcale, Mysql and Postgresql support it.
    Blah, blah, friggin’ blah.

  91. oiaohm says:

    That Exploit Guy, the default style manual for document production used by the Australian government (which most departments don’t use, having written their own, including Centrelink) covers creating telegrams on the cheap. Yes, as if you could send a telegram in 2002.

    Sorry, the English requirement on Centrelink staff is very broad: everything from Morse and telegram-style English to fully formal English using every grammar mark known to man. They have to be able to cope with the lot properly. By the way, Centrelink uses computer translation of Morse, so you must be able to type Morse-style. In particular areas it is very bad to be too heavy with grammar marks.

  92. oiaohm says:

    That Exploit Guy, really, you don’t know Centrelink at all. Under the disability support services section of Centrelink they do support Morse code to English. Yes, there is even a phone number you can ring and send Morse to Centrelink. It has been useful when I have had badly damaged phone lines due to storms: too much static for voice, but not enough to stop Morse.

  93. oiaohm says:

    That Exploit Guy, basically because I do deal with China, I do understand where the big bugbears are. Yes, even the so-called localised versions of Windows don’t work properly.

    Now, take the English version of Ubuntu, install it, then install the China parts, and everything works properly. Take the localised version of Ubuntu for China, install English, and everything still works properly. Install every language Ubuntu supports and everything still works properly. On Windows you can have IME fights to the point where you cannot input at all if you install the wrong languages.

    You have really walked into a weak area of Windows: international language support, mostly caused by Microsoft choosing to go its own way with UTF-16.

  94. That Exploit Guy says:

    That Exploit Guy centerlink does accept my form of english. If the form was acceptable to old telegrams it is in fact acceptable to centerlink.
    Fascinating. Tell me more about this Morse-Code-English-to-Centrelink business that you have got going. Do you use your psychic WiFi-bending ability to transmit your form? Or do you prefer to use the wires that you mend fences with?

  95. oiaohm says:

    That Exploit Guy, you love digging yourself into a deep hole. gb2312 is out of date. The current official standard in China is GB18030.

    http://en.wikipedia.org/wiki/GB18030

    Read the XP bit. NTFS is UTF-16, which does not in fact support GB18030 filenames. Linux can, because GB18030 encodes into UTF-8 and UTF-32. Welcome to the fun: XP can have GB18030 enabled, but only in part. Yes, GB18030 on XP, Windows 7 and Windows 8.1 is glitch central. Yes, MS SQL Server does not support GB18030 at all, yet Oracle, MySQL and PostgreSQL support it. What year was GB18030 released in China? The year 2000. So 14 years later Microsoft still does not support China’s official encoding.

    So how is something supported when Windows XP, 7 and 8.1 cannot handle the correct encoding?

    It’s funny, That Exploit Guy: you hear Microsoft idiots all the time saying that X encoding is not supported by Unicode, when all it is is that it is not supported by UTF-16. UTF-16 is Microsoft’s beast of Unicode that Microsoft fails to keep up to date.

    http://www.pinyinjoe.com/linux/ubuntu-10-chinese-setup.htm The IME in Linux is iBus. An IME is something Linux was missing, but that has not been true for over 6 years now.

  96. oiaohm says:

    That Exploit Guy, Centrelink does accept my form of English. If the form was acceptable for old telegrams, it is in fact acceptable to Centrelink. This is the problem, That Exploit Guy: you have no clue what is acceptable. The only grammar marks used in old telegrams were full stops and caps.

  97. That Exploit Guy says:

    So DrLoser drew my attention to this…
    That Exploit Guy yes installing particular languages in Windows XP is that far fun that you find places talking through the instructions recommend you install a decent operating system. Yes a decent operating system being something Linux.
    Oh, isn’t this the same guy with alleged 25 years of experience “dealing with [China]”? Hasn’t the poor bugger heard of non-Unicode encodings (e.g. GB 2312)? Or IMEs (e.g. Pinyin)? Those are what East Asian language support is for in the English version of XP.
    By the way, in East Asian-localised editions of Windows XP (e.g. Traditional Chinese), the East Asian language support option is installed by default (of course).
    Twerp.

  98. That Exploit Guy says:

    “you trolls” was all encompassing
    You don’t address a collective as “you” unless you feel it’s something that you are not a part of. That’s my point. Are you too impaired cognitively to understand the meaning of your own words, or are you trying to pull an Oiaohm on me?
    then perhaps you suffer from a narcissistic personality disorder and need a check-up from the neck-up.
    Far from it! To borrow Lennart Poettering’s famous saying, “I love everybody!”

  99. dougman says:

    “you trolls” was all-encompassing; if you feel like I was pointing at ONLY you, then perhaps you suffer from a narcissistic personality disorder and need a check-up from the neck-up.

    Since you brought it up, do you own/run a business? Obviously not; a 90-day guarantee shows faith in the service and product to the customer. It pulls people off the fence and invites them into a new world with Linux.

    Now that M$ is building free software for Linux, perhaps there is a slight chance you may come to grips with the blog entry of “Linux Domination”, as it sure won’t be NT in the long-term.

    Eh.

  100. That Exploit Guy says:

    In fact, let’s be honest, shall we? The REAL reason you trolls perpetuate the lies against Linux as you do is to maintain and protect your vested interest.
    Lies? Sheesh, don’t confuse me with that other guy, whose livelihood is built entirely on flim-flamming his customers with “90-day guarantee” infomercial deals!
    (Actually, I think he makes those fat-burning thingamajigs on TV look real by comparison. Don’t you agree?)

  101. That Exploit Guy says:

    Hmmmmm… funny: if one mentions Linux, you trolls say it’s a waste of time and money, but, eh, it’s the company’s business.
    “You trolls”? Funny, I thought we were all having great fun here. It seems you are more than happy to accept Oiaohm as a member here but not me. So, is that your way of discriminating against the not-crazy?
    Boy, you are making me feel both flattered and upset at the same time! How conflicting!

  102. dougman says:

    “How a business wants to run thing internally is its own business”

    Hmmmmm… funny: if one mentions Linux, you trolls say it’s a waste of time and money, but, eh, it’s the company’s business.

    In fact, let’s be honest, shall we? The REAL reason you trolls perpetuate the lies against Linux as you do is to maintain and protect your vested interest.

    Just think: a business could save $100K or more by adopting Linux in the near term. Linux is inevitable, and M$ will either die trying to fight it or drown in its own delusional hubris.

  103. That Exploit Guy says:

    @Gibbering Fruitcake
    That Exploit Guy the base book is 59.95 but each departments tweak/full replacement is not publicly printed.
    Nobody gives a toss about that. How an organisation wants to do things internally is its own business, but you will have to wait until hell freezes over to make other people follow your internal rules.
    Also, let’s face it: even friggin’ Centrelink won’t accept your “strictest” “English” on your application for unemployment benefits. Whom are you kidding here?
    I honestly don’t care whether it was a car crash or crystal meth that led you to such a state: I just think it has been fun listening to Grampa Simpson telling dubious war stories. So, why don’t you take a hint and leave the conversation at that?

  104. oiaohm says:

    http://eastasiastudent.net/study/windows-xp-east-asian-languages-support-no-cd
    That Exploit Guy, yes, installing particular languages in Windows XP is such fun that you find places walking through the instructions recommending you install a decent operating system instead. Yes, a decent operating system being something Linux.

  105. oiaohm says:

    That Exploit Guy, the base book is $59.95, but each department’s tweak/full replacement is not publicly printed. You should be asking why the last update was in 2002. The $59.95 style guide is now in fact wrong for what Australian Government departments produce today.

    You are the fool here, That Exploit Guy: exactly why would a style guide not be updated for 10+ years? The Australian government was the worst place to point to for a style guide/manual. The style manual is these days per department; only those dealing with the government departments see clearly that the loophole in what is written on the websites is now exploited to the extreme. Yes, the government in power in 2002 attempted to tidy up Australian government department document style and has failed almost completely.

    That Exploit Guy, swapping my words is also incorrect: there is a huge list of Ubuntu translations. The language column of the table on the translators page is the list of languages that Ubuntu has either full or partial translations for.
    https://translations.launchpad.net/ubuntu/precise/+translations

    Please note a partial Ubuntu translation is far more complete than an MS product translation in most cases.

    Translations exist for every one you claimed had none. Complete translations don’t exist for those running Windows either. Translating is an ongoing process.

  106. That Exploit Guy says:

    That Exploit Guy the Australian government style guide is worth less than toilet paper for how useful it is. The first 5 pages warning you its worthless is the most important part to read.
    The sheer gall you have to blatantly make up ludicrous stories about a publicly available publication is admirable, though wasted.
    To someone with no moral qualms about casually butchering a language at every turn, uniformity is of course unimportant, but when you have to deal with a multitude of organisations varying from public to private and from state to federal, you cannot afford to have everyone speaking their own micro-language. The legal and administrative nightmares you will end up facing simply outweigh any benefit you will gain from pushing your own rules in preference to just following the Style Manual. If you are talking about internal memos, sure, who cares, go crazy. But official communiqués? As the Australian saying goes, “Don’t come the raw prawn, mate.”

  107. That Exploit Guy says:

    Let’s milk all the comedy out of this while we still can, shall we?
    That Exploit Guy I have already provided one reference to lack of grammar marks.
    “Grammar marks”? This is like Tommy Wiseau in The Room: if you don’t know what a “fiancée” is, just call her “my future wife“.
    Marks of the Grammar!

  108. That Exploit Guy says:

    Basically you did not read completely what you used as a reference. You have the book that you buy that provides you with a so called over view of the style guide. Then each god darn department is free to add their own tweak or alteration or complete replacement to the style guide and provide it to you under NDA conditions.
    Bahawhaw… Even cab drivers can’t match you in the ability to generate such tall tales!
    Whenever you see the “Style Manual” mentioned in anything government-produced, it refers to the Style Manual as published by Wiley & Sons, Ltd. The Manual itself is not under any NDA and is available for purchase from the publisher at the low, low price of AU$59.95. Check the ISBN, you fool!

  109. That Exploit Guy says:

    There are a huge stack of Ubuntu translations.
    Translators, not translations, my little dear.
    Beg me, and a copy of the Macquarie Dictionary will be all yours to replace your old, imaginary one. It’s a bargain you will otherwise not see in your lifetime.
    Or do you prefer to win it by proving your argument? Go on, prove me wrong. (Actually, you might as well forget about that *chuckle*).

  110. oiaohm says:

    From your own link, That Exploit Guy:

    http://www.qld.gov.au/web/cue/module8/style-topics/
    Some agencies also have a supplementary in-house guide for department-specific terminology and conventions.
    Basically you did not read completely what you used as a reference. You have the book that you buy, which provides you with a so-called overview of the style guide. Then each goddarn department is free to add its own tweak, alteration or complete replacement to the style guide and provide it to you under NDA conditions.

    That Exploit Guy, the Australian government style guide is worth less than toilet paper for how useful it is. The first 5 pages, warning you it is worthless, are the most important part to read.

  111. oiaohm says:

    That Exploit Guy, there is no point referring to pages when the foreword of the book is all you have to read: the first 5 pages. There is no need to cite chapters or anything like that when it is the first 5 pages.

  112. oiaohm says:

    That Exploit Guy, there are a hell of a lot of examples in Windows translations of not-a-language: a mangled mess of two or more languages mixed together so as to pass for a language. Yes, poor translation is a real fault of Windows, as are applications failing due to API/ABI changes depending on the language pack applied to Windows.

    Linux/Unix language handling is many times more stable than Windows.

  113. That Exploit Guy says:

    That Exploit Guy its recommend for min grammar when documents will be going to many different departments inside the Australian Government.
    You still haven’t cited the relevant part of the Style Manual. Chapters, pages, subheadings, anything!
    How did you manage to get through high school not knowing how to cite properly? (Well, did you?)
    Did you miss the note you the page you pointed to that every Australian government department has its own tweak.
    Irrelevant, and not as hilarious as the many names your absurdly poor grammar goes by.
    Next wack job, please.

  114. oiaohm says:

    That Exploit Guy by the way, are you an idiot who cannot google, who just wanted to claim a random set of languages is not supported?

    There are a huge stack of Ubuntu translations.
    https://translations.launchpad.net/+groups/ubuntu-translators

    Does Ubuntu/Mint support, say, Bulgarian? yes
    Or Estonian? yes
    Or Hungarian? yes
    Or Latvian? Yes
    How about Lithuanian? Norwegian? Polish? Thai?(Yes, yes , yes and yes).

    Correction of That Exploit Guy super incompetence: everything he claimed as not supported is in fact supported by Ubuntu and Mint. As for Norwegian, there are in fact two languages in that country and Microsoft only thinks there is one. In fact Norwegian in Windows is a mix of both Norwegian languages. Ubuntu supports both properly. Yes, Norwegian is either Bokmål or Nynorsk, and Windows has a mixed-up mess translation to so-called Norwegian that is not a language. Norwegian is a people; Bokmål and Nynorsk are the Norwegian languages. Norwegian schools do use Ubuntu and Debian a lot for a very valid reason: Microsoft support for their country sucks.

    Ubuntu can install all its translations side by side without issue. Windows has conflicts between language packs.

  115. oiaohm says:

    That Exploit Guy I have already provided one reference to lack of grammar marks. Missing verbs and prepositions come from other places. What I am producing is pre style guide application.

  116. oiaohm says:

    That Exploit Guy its recommend for min grammar when documents will be going to many different departments inside the Australian Government. Did you miss the note you the page you pointed to that every Australian government department has its own tweak.

    So : and , or : and ; or , and , in that picked-out sentence would have been required depending on what department it is going to.

    What I am producing is before customisation for departments.

  117. That Exploit Guy says:

    Linux is on track to supplanting Windows in the long-term, the exodus from XP will enable its extended growth
    If you have to base your “growth” on a purely hypothetical palmtop revolution (the One True Cause since 1985!), then it’s not really much more of a growth than a mere fantasy, is it?
    Meanwhile in the real world

  118. That Exploit Guy says:

    I am fairly certainly that Linux has been translated in far more languages then M$ Windows has been.
    Really? Let’s pin the discussion down to something more tangible than “Linux”, shall we?
    Does Ubuntu/Mint support, say, Bulgarian? (No)
    Or Estonian? (No)
    Or Hungarian? (No)
    Or Latvian? (No)
    How about Lithuanian? Norwegian? Polish? Thai?(No, no, no and no).
    Well, even XP supports all of them. Crap basket, innit?

  119. That Exploit Guy says:

    Speaking of mangled languages, has Bob realised that the Style Sheets here are a terrible mess?

  120. That Exploit Guy says:

    *historical literature
    It seems I am beginning to turn into Oiaohm myself.
    … beedictionary…
    Did you drop your copy of the Macquarie Dictionary into the loo, my dear?
    If you kindly ask me, I’ll perhaps consider buying you a new one.
    … it [sic] a common error…
    Though obviously not as common as “old english [sic]” or “Morse Code style English“.
    Strictest, yet flawed. How fascinating!

  121. That Exploit Guy says:

    I do have a copy of that book. Apparently you don’t. Formal and Structured are in that book.
    Do point out which part of the Manual features this alleged “strictest” style “English” of yours, or, failing that, any government document or publication that employs it.
    OK, forget the Style Manual (since obviously you do not have a copy) – point me to historical literature in which your bizarre, mangled version of “English” (with marked absence of proper punctuations, verbs, prepositions, etc. where they should be) is featured. Oiaohm’s Gazette of Gobbledygook does not count, however.
    In fact, you may also forget history literature; the alleged style of English that you insist is otherwise known as aphasia – an impairment most common to people suffering from partial loss of brain functions due to head injuries (which also explain your oft very, very, delusional thoughts)…

  122. oiaohm says:

    http://www.finance.gov.au/policy-guides-procurement/publishing-information/style-manual/

    I do have a copy of that book. Apparently you don’t. Formal and Structured are in that book.

    http://www.beedictionary.com/common-errors/seam_vs_seem Yes seam and seem is a common error. Since it a common error I have not put much effort into forever fixing it.

  123. dougman says:

    I am fairly certainly that Linux has been translated in far more languages then M$ Windows has been.

    Linux is on track to supplanting Windows in the long-term, the exodus from XP will enable its extended growth.

  124. That Exploit Guy says:

    That Exploit Guy of course I would expect by your prior mangling of english not to know of the types.
    You mean the types that have no chance on earth to be in Australian government documents and publications? Of course I know. Even if I didn’t, I sure have learned it by now from the twerp that misspells “seems” as “seams“.

  125. oiaohm says:

    A sentence with two or more sub-clauses is valid. I use sub-clauses more than most people. Attempting to rewrite a sentence containing sub-clauses into independent sentences ends with a context disaster. Please note: independent sentences, this is key. It is possible to split it into two dependent sentences.

    kurkosdr did not know how to grammar sub-clause sentence.

    Google account sync between desktop and chrome os device results in: If users have adblock plus installed on desktop, it ends up on the chrome OS device.

    This is another way to grammar it. The idea that it can be broken into individual sentences results in error. Yes it can be broken in 2, but not by a full stop; a colon has to be used. Commas or a colon are valid. 1 long sentence or 2 dependent sentences, never 3 sentences. Anyone saying 3 needs to go back and learn English.

    A sentence must contain a main clause. Yes the first sub-clause is strong enough to be lifted to a dependent main clause. The second sub-clause is not that strong.

    Most of what I write has 3 to 4 ways of doing the grammar without changing its meaning. A semi-colon could also be used in place of the last comma. The problem here is my first line is valid, and every one of the other grammar mark combinations I have done so far is valid.

  126. oiaohm says:

    http://www.chompchomp.com/terms/fusedsentence.htm

    A lot of people don’t know what a fused sentence is. The one that kurkosdr corrected was not a fused sentence because there was only 1 main clause. Example information does not magically become a new main clause.

    That Exploit Guy I am in fact better at English than a lot of people give me credit for. Just because a person uses a rare form does not mean they are wrong.

  127. oiaohm says:

    That Exploit Guy of course I would expect by your prior mangling of english not to know of the types.

  128. That Exploit Guy says:

    wolfgang I think kurkosdr has seen it now. I am using Structured English in it strictest form. There is Formal English that is the exact inverse.
    Yeah, and this lady speaks feline.
    I am Mickey mouse.

  129. dougman says:

    Linux Mint trumps Windows!

  130. oiaohm says:

    wolfgang really you don’t know your cartoons. Yes Donald Duck has a doctorate in language; it was taught to him by Professor Ludwig Von Drake. Basically Donald Duck can speak and write perfectly well if he can keep his temper in check.

    Professor of English is something Donald Duck is. If I did spend some time I could dig up exactly what episode. It is like the fact that Wile E. Coyote can speak but chooses not to. Wolfgang yes Disney and cartoon worlds in general are full of contradictions.

  131. wolfgang says:

    …oiaohm now professor of English…

    who next? donald duck?

  132. oiaohm says:

    wolfgang I think kurkosdr has seen it now. I am using Structured English in it strictest form. There is Formal English that is the exact inverse. Most people write in the middle.
    It’s like the comma before & after “and”. In Formal you must use them. In Structured you don’t have to. Most technical manuals are more often Structured English because it takes fewer characters. The writing style of Robert Pogson is closer to Structured.

    The problem is most children at school are not in fact taught English; when it comes to grammar they are taught some local abomination of grammar. Then MS Word and other writing programs push another abomination of grammar. MS Word does not enforce Formal or Structured completely and does not allow you to insert a style guide either. Yes, a style guide basically lists where you will do Formal and where you will do Structured.

  133. wolfgang says:

    …kurkosdr professor of English…

    trick learned from oiaohm. leave out words not needed for point so can type more faster. make reader concentrate on meaning, too. more fun. can even go back and add different words to change meaning when found out wrong. win-win.

  134. dougman says:

    KUKU, do your homework! Obviously, you do not know anything. I back up everything I ever state on this website with written proof.

    MicroSh1t should be afraid: http://www.itworld.com/open-source/411261/microsoft-should-fear-android-desktop

    .xob ]X[ eht tih dna kcatta-traeh eht flesruoy evas ,stnemmoc eht daer ot srehtob ti fi ,ereh emoc ot uoy gnicrof si eno oN

  135. oiaohm says:

    kurkosdr to be correct, this is the internet; I don’t have to back up my claim every time because I might not visit the site again for months. It was a very simple google to find out how Adblock Plus killed statcounter, and a few more googles and you would have seen how it was killing other stats collecting methods.

    https://adblockplus.org/forum/viewtopic.php?t=6902 Yes, people running websites who run adblock plus want to see the stats.

    “Adblock Plus” on Firefox and Chrome use exactly the same default rules. The default rules block statcounter. So yes, firefox users are going to disappear from statcounter as well.

    https://support.mozilla.org/en-US/questions/958513 only recently has Firefox addon sync started working.

    kurkosdr since it was such a simple google I did not think I had to explain it. How web stats are blocked I have explained to you before kurkosdr as well. So why this time could you not do your homework, or do you have a brain that is highly forgetful?

    It’s really rude to ask a person to explain something repeatedly to you when you do have the means to find out the answer again yourself.
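    The blocking mechanism described in this comment (Adblock Plus ships a default filter list that stops the StatCounter script from ever loading) can be sketched roughly as follows. The `||domain^` rule syntax follows Adblock Plus filter conventions, but the matching logic here is a deliberately simplified illustration, not the extension’s actual engine:

```javascript
// Simplified sketch: a "||domain^" rule blocks any request whose host
// is that domain or one of its subdomains.
const filterList = ["||statcounter.com^", "||doubleclick.net^"];

function hostBlocked(url, filters) {
  const host = new URL(url).hostname;
  return filters.some((rule) => {
    const domain = rule.replace(/^\|\|/, "").replace(/\^$/, "");
    return host === domain || host.endsWith("." + domain);
  });
}

// The counter script never loads, so the visit is never recorded:
hostBlocked("https://www.statcounter.com/counter/counter.js", filterList); // true
// Ordinary page loads are unaffected:
hostBlocked("https://example.org/page.html", filterList); // false
```

    Because Chrome’s account sync propagates extensions and their settings, a user who installs such a filter on one machine effectively vanishes from URL-based counters on every synced machine, which is the distortion being argued about here.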

  136. kurkosdr says:

    Now THIS is structured writing. But when you make a claim, you have to back it up instead of telling people to do their homework. I don’t have to “do my homework” on every claim I see on the net.

  137. oiaohm says:

    kurkosdr I don’t have to describe how statcounter works.
    http://kirkstarr.wordpress.com/2007/08/25/how-to-foil-statcounter-and-retain-your-privacy/
    It’s well and truly documented all over the place. It’s just a fact you are commenting without doing your own homework. Kurkosdr, why do I have to do your homework for you?

    “Adblock Plus” include http://www.statcounter.com url as a blocked url so the javascript never runs so you are never counted by statcounter. “Adblock Plus” is the most common chrome browser installed advertisement blocking solution.

    kurkosdr sorry, how statcounter is defeated is common knowledge to anyone doing investigation into privacy. You also find a lot of other web stat collection tools are defeated the same way: block the URL.

    Google sync replication of settings and extensions means once users block themselves on one of their machines from a web stat collection, they disappear from that web stat on all the machines they are running chrome on. Firefox is also bringing the same feature.

    Web stats are highly worthless and will be highly misleading. They will over count Internet Explorer due to lack of blocking extensions.

  138. oiaohm says:

    kurkosdr I do go into the effort of structuring. Just there are limits to what I can think about.

    In fact in this case both of those commas were optional, kurkosdr. I do use Morse as well. So unless a comma is required to alter context it is optional. Full stops are mandatory. I do a lot of well formatted complex/compacted English.

    kurkosdr http://www.sffchronicles.co.uk/forum/32934-thoughts-on-comma-rule.html There is no universal rule on inserting or not inserting commas.

    This is your problem not mine, kurkosdr. In fact the sentence without commas is also correct. There is no defined national style for writing here.

    Yes kurkosdr the breaks were in fact for the reader, so you are not required by base English to place a comma at those locations. Yes, both of the 2 commas were optional.

    dougman had no problem reading it. Mostly because he would have mentally inserted commas as you are meant to. This is just reading English correctly, nothing more. What you failed to do, kurkosdr.

    If I was putting in no formatting effort there would not be correctly placed full-stops.

    Common incorrect correction methods add full stops instead of commas. This is invalid editing. Many schools put too much focus on inserting full stops and don’t cover inserting commas. Want a good book to open to see lack of commas? Open up the World Book Encyclopedia. Yes, a lot of sections in it use implied commas like I just did. My writing style is not invalid.

  139. kurkosdr says:

    “kurkosdr this is something people forget. With chrome does not matter if something is chrome OS or Chrome on Linux or Chrome on windows or Chrome on OS X or Chrome on Android. The sync attempts to make all identical without requiring any user effort. Default browser in Android is not Chrome.”

    Nope, still doesn’t make sense. Let’s teach you some writing skills. First of all, think about the claim you want to prove: ‘Adblock messes with web stats’. Secondly, think about the points you want to make in order to prove your claim, in other words, structure your writing into something other people can follow. Here is a proposed structure:

    1) Explain (with one or two sentences) how you think said web stats are made. Does StatCounter peek at the browser-string or os-string or do something else?

    2) Explain how Adblock or account sync or a combination of the two messes with the method you described in 1.
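    To make the two requested points concrete: counter services of this kind generally work by having an embedded script report each page view along with the browser’s user-agent string, which the service parses to attribute the visit to an OS. The patterns below are illustrative guesses at that classification, not StatCounter’s actual rules; the point is only that a visit whose reporting script is blocked is never classified at all, skewing the shares toward users without blocking extensions:

```javascript
// Illustrative user-agent classification of the kind a web-stat service
// might apply to each reported page view (not StatCounter's real rules).
function classifyOS(userAgent) {
  if (/Windows NT/.test(userAgent)) return "Windows";
  if (/CrOS/.test(userAgent)) return "Chrome OS"; // Chrome OS before generic Linux
  if (/Android/.test(userAgent)) return "Android"; // Android UAs also contain "Linux"
  if (/Mac OS X/.test(userAgent)) return "OS X";
  if (/Linux/.test(userAgent)) return "GNU/Linux";
  return "Other";
}

// A Chromebook visit only reaches this function if the counter script
// was allowed to load in the first place.
classifyOS("Mozilla/5.0 (X11; CrOS x86_64 4537.0.0) AppleWebKit/537.36"); // "Chrome OS"
```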

  140. kurkosdr says:

    “I warn people I do complex english”

    You don’t write complex english, you write brainfarts. Instead of going through the effort of structuring your writing into something that is readable, you just dump thoughts on the comment field as they come into your head.

  141. oiaohm says:

    https://support.google.com/chrome/answer/165139

    kurkosdr this is something people forget. With chrome does not matter if something is chrome OS or Chrome on Linux or Chrome on windows or Chrome on OS X or Chrome on Android. The sync attempts to make all identical without requiring any user effort. Default browser in Android is not Chrome. Google has removed a lot of Adblock software from the Google play store; every time, that has resulted in Android numbers spiking until users get some Adblock installed back.

  142. oiaohm says:

    kurkosdr adding full stops to my English will stuff you up. I do type every required full stop. If you think a place in my writing should have a full stop, it should have a comma.
    Let’s try to seperate this into sentences:
    How to screw up badly basically.

    Google account sync between desktop and chrome os device results in, if users have adblock plus installed on desktop, it ends up on the chrome OS device.
    That is the correct sentence. Yes it is only one sentence. I warn people I do complex english. Yes, the common correction methods will over-add full stops, making my english unreadable. This is not my fault; you under use commas, destroying the context of what I wrote.

    I never miss a full stop; I only miss commas and : and other minor breaking things.

  143. dougman says:

    KUKU,

    What’s the matter, could you not discern what he was informing you?

    One can install Adblock on Chrome or Chrome OS; this is the reason ChromeOS usage appears so low. With Android it is somewhat difficult to install.

  144. kurkosdr says:

    “Google account sync between desktop and chrome os device results in if users have addblock plus installed on desktop it ends up on the chrome OS device.”

    Let’s try to seperate this into sentences:

    [Google account sync between desktop and chrome os device results in] [if users have addblock plus installed on desktop], [it ends up on the chrome OS device.]

    Did anyone manage to make sense of this? What does “ends up on the chrome OS device” mean? Who ends up? Aka, what thing does “it” refer to?

    What does “end up” mean? “counted as” or “not counted as”?

    English hamster.

  145. oiaohm says:

    wolfgang the answer, from the Open Source point of view, is that companies like Samsung, Intel, AMD… should be raking in the bucks and making the OS.

    Why? They designed the silicon. Since they designed the silicon, if anything strange happens their staff do not require an NDA or other items to look under the hood and find out what is wrong.

    Reality is, compared to the hardware designers’ development teams, Microsoft does not have staff.

    kurkosdr there is an answer why chrome OS is so low. Google account sync between desktop and chrome os device results in if users have addblock plus installed on desktop it ends up on the chrome OS device. Adblock plus happens to kill off a lot of the statistics collection sites. It’s harder to install adblocking on android phones.

  146. wolfgang says:

    …Samsung…

    so all this open stuff exist so Samsung get rich? with 20K engineers, they should be able to do whole job themselves. that twice Microsoft staff.

  147. dougman says:

    These WinTard trolls that come here could be hit over the head with a Linux hammer and still state, “Nope, no Linux to be seen”.

    What they refuse to see is that M$ sees Android and ChromeOS as such a veritable threat that they are lowering prices. Isn’t competition wonderful? 🙂

    http://www.engadget.com/2014/02/21/microsoft-is-cutting-windows-pricing/

    FYI: Android and ChromeOS are both operating systems based on the Linux kernel and are both displacing MicroSh1t from the market.

    Deal with it.

  148. wolfgang wrote, “with Linux supposed to be free to study and change. who does that?”

    Uh, Samsung, who has hired ~20K programmers. Those guys tweak everything to work with all the kinds of hardware Samsung produces from refrigerators to smartphones.

  149. wolfgang wrote, “chrome os low”.

    That’s from a standing start a couple of years ago. M$ was selling fewer units in those days. ChromeOS has done well in schools where browsers and cloud applications do a lot of what needs doing: finding, displaying, modifying and creating information. Google first tried it in USA where it took a big bite out of notebook sales. In 2014 that will happen globally. Governments and businesses see a lot of advantages to Chromebooks in that they don’t need a lot of maintenance. That’s catching on.

  150. kurkosdr says:

    *April 2010

  151. kurkosdr says:

    Correction: Regarding the XP vs 7 thing, I had switched the chart to bar graph with shows 32% for XP and 36.5% for 7, but later noticed the chart is from April 2011 to today, sorry.

  152. kurkosdr says:

    Also, Windows 7 is only 4.5 points ahead of XP? WTF.

    @wolfgang English man, do you speak it?

  153. wolfgang says:

    …chrome os low…
    go into store selling chromebook and see that not connected to internet, so not doing much except showing error message. not good for sales.
    no matter what you sell or even give away, you have to have salesman to show off somehow. lot to show to get someone to buy chromebook and not enough money in deal to pay for time, so eager beaver like dougman buy right away, but then run out of customers and have to find more but only have price tag for bait. not enough, I think.

  154. wolfgang says:

    android not real Linux, but you say kernel same. no doubt that is true, but big whoop. who care? big deal with Linux supposed to be free to study and change. who does that? nobody. so not such big deal. can use windows same way and make own programs with free tools. tools better than android tools, too.

    you big fan of cheap. not everyone that way. people buy apple because shows them chic and classy, not nerdy. people feel sorry for kid with fake iphone or ipad, put picture on poster to send money.

  155. kurkosdr says:

    Why is ChromeOS hovering so low? ChromeOS computers are built for the web, and supposedly have 20% of the laptop market, and occupy the top three spots of Amazon Top 10 for laptop computers. Where are all those people that bought them? I expected ChromeOS to have at least 3%.

    My guess is that either Amazon sales are a small part of computer sales (aka people prefer to purchase from other online shops or from brick and mortar), or they were impulse purchases and sit collecting dust.
