Mini Beast: ARM Cortex A-72

Are we there yet? Nope. Not until 2016, according to ARM. “ARM’s new Cortex A-72 won’t ship until early 2016, according to ARM CPU group vice president of marketing Nandan Nayampally. But the chips should run at 2.5-GHz speeds.” It does take some time for the smartphone-makers to catch up with ARM. Just last year they introduced the A57 big.LITTLE system at 20nm. Next year, it’s this thing at 16nm…

It’s only for smartphones, eh? Why, it closely resembles the CPU currently in my Beast, except Beast blows 95W for similar performance. No. These CPUs will find more uses than smartphones. If they use too little power for a desktop, folks can just throw in extra chips… “History proved that the market doesn’t care what I think,” he said, as proof that ARM shouldn’t decide what products to design for. “So the lesson to be learned here is that we shouldn’t dictate what our partners do with our technology.”

See ARM launches Cortex A-72 platform, powering flagship smartphones in 2016.

About Robert Pogson

I am a retired teacher in Canada. I taught in the subject areas where I have worked for almost forty years: maths, physics, chemistry and computers. I love hunting, fishing, picking berries and mushrooms, too.
This entry was posted in technology. Bookmark the permalink.

196 Responses to Mini Beast: ARM Cortex A-72

  1. IBM guy says:

    Hello all,

    The IBM System/360 had 128 bit hexadecimal floating point right from the beginning. I do not know why you say it had 32 bits with 24 bits of precision. REAL*16 (128 bit), REAL*8 (64 bit) and REAL (32 bit) were all supported in FORTRAN G and H.

    Hexadecimal floating point is still supported on IBM mainframes today as nearly everything that used to be present still is. But they have also added IEEE-standard (binary) and decimal floating point hardware support for the youngsters.

  2. oiaohm says:

    OK, oops:
    1 clock your cpu clock speed.
    should be.
    1 lock your cpu clock speed.

  3. oiaohm says:

    The problem here is that DrLoser the moron Googled or Binged without understanding.

    Please note, DrLoser: clock_gettime and clock_getres share the same man page. It's not that I gave the wrong man page; DrLoser is too big an idiot to know they are one and the same. The only thing that changes is the top line,
    “clock_gettime(3) – Linux man page” versus “clock_getres(3) – Linux man page”.
    The reason they are absolutely identical is that they are links to the same file: clock_gettime is a symbolic link to clock_getres.

    If you look up how to use clock_gettime for proper benching, you will find you are meant to do the following.
    1 clock your cpu clock speed.
    2 don’t use virtual machines.

    Both stop the clock speed from changing, so the nanoseconds reported line up with clock cycles. PAPI inside a virtual machine will give you close-to-correct numbers. Yes, it will still be wrong, but not wildly wrong.

    clock_gettime is used as the POSIX-neutral way of benchmarking, but it's a pure pain in the butt to have to work out how to lock the clock speed to get correct numbers when you can go and use PAPI or equivalent, get correct numbers, and not care what the clock speed is doing.

  4. oiaohm says:

    DrLoser it is still the wrong clock.

    The RT cycle count is wrong. The perf cycle count, apart from the complete CPU resets that annoying SMM can do, is truly a per-process clock.

    CLOCK_PROCESS_CPUTIME_ID
    is still wrong: even that is not cycles, it's nanoseconds at best. And it's the POSIX definition of a process, so it has to have nasty code to sync between CPUs when multi-threaded.

    http://en.wikipedia.org/wiki/Time_Stamp_Counter
    This is not accessed raw by either CLOCK_THREAD_CPUTIME_ID or CLOCK_PROCESS_CPUTIME_ID. The problem is that with both of these the value is converted into nanoseconds before your code gets its mitts on it, and that conversion is affected by the CPU clock speed.

    POSIX defines that clock_gettime returns nanoseconds, and everything bar CLOCK_MONOTONIC is nanoseconds; CLOCK_MONOTONIC itself is left undefined.

    You will see incorrectly low values appear out of both of these. Why? When the scheduler preemptively switches to another task on the CPU, the data the MMU part of the CPU is transferring into L1, L2 and L3 does not get counted.

    The POSIX standard annoyingly does not include a clock that returns a proper clock-tick count.

    CLOCK_MONOTONIC is worse because NTP corrects it. Linux provides CLOCK_MONOTONIC_RAW, which is not part of the POSIX standard. At this point you are ready to kill POSIX, because the RT library basically does not work.

    http://en.wikipedia.org/wiki/Hardware_performance_counter
    Hardware performance counters exist for very good reasons. Perf accesses these.

    Notice that one of the counters is branch mis-prediction; then you also have cache misses and so on. So there is absolutely no reason to guess why X code is slow; in fact you can put exact evidence to the case.

    The only way to access the Time_Stamp_Counter raw values under Linux is perf_events.

    CPU timing was in microseconds, in case anybody was interested:

    long long elapsed = (etimer.tv_sec - stimer.tv_sec) * 1e6
    + (etimer.tv_nsec - stimer.tv_nsec) / 1e3;

    This is your own statement DrLoser.

    2) Neither Deaf Spy nor I nor anybody else was measuring anything in seconds.
    DrLoser, you can claim Deaf Spy was not. But you are measuring in something related to seconds that is directly affected by clock speed and virtual-machine preemption. Guess what a virtual machine does when it swaps away: it shows up as a lower effective clock speed. Your fast examples could simply be the passes where your code did not get preempted.

    Deaf Spy's measurement is also flawed because no performance counter is used to prove the cause, so you are left guessing.

    DrLoser, if you redo your code using perf_events via PAPI you will most likely see the times come out nice and constant, completely disagreeing with Deaf Spy's whole argument and showing that it is compiler-related just as much as CPU-related. The compiler can choose a different set of instructions to perform the same task, and that different set of instructions can avoid branch prediction completely.

    If you had done this from the start, it would have been worthwhile showing you how to push the L1 to make the microcode remove the branch-prediction code. To see these modifications you must measure things correctly.

  5. DrLoser says:

    Face it, Fifi. The L1 cache has absolutely nothing to do with the issue at hand. You are wasting your (worthless) time, and also everybody else’s.

    Sorry DrLoser when you can bench properly would have BLAH BLAH BLAH and cmov code ends up BLAH BLAH BLAH

    Problem is while you code is in seconds BLAH BLAH BLAH. Yes a bit of asm BLAH BLAH BLAH.

    Suspicious minds here may perhaps assume that I have excised the parts of oiaohm’s explanation as to how L1 cache affects predictive branches in the CPU’s pipelines, and replaced them with “BLAH BLAH BLAH.”

    Trust me.

    I didn’t.

  6. DrLoser says:

    http://linux.die.net/man/3/clock_gettime
    DrLoser, microseconds or nanoseconds makes bugger-all difference. If you are measuring in anything seconds-based, it's wrong.

    1) Wrong cite, Fifi. You should have used the more relevant clock_getres. And once again I am forced to point out that you are supposed to include the title of the goddamn thing as the text of the link. You don’t even do cites properly, do you?
    2) Neither Deaf Spy nor I nor anybody else was measuring anything in seconds.

    What is wrong with you? Why do you need to pollute the data provided by others with your needless ignorant gibberish?

    As CPUs adjust their clock speeds up and down, the number of clock cycles per second changes. So if you want to bench correctly you have to measure in clock cycles, which is independent of the different clock speeds the CPU can be running at.

    Which is why all modern clock-measurement avails itself of the onboard CPU registers — thus rendering everything else you have so far said on the subject nothing but abject fantasy.

    It’s possible that the Linux RT Posix API is not implemented in terms of this mechanism, even when specifying CLOCK_PROCESS_CPUTIME_ID, which is what I did. But that’s no concern of mine. I can only present numbers based on what other people claim they will trust. If you wish to disagree with the Linux RT Posix API implementation, I’m all ears.
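
    For reference, a minimal sketch of the sort of libRT harness being discussed (a reconstruction under my own assumptions, not the actual test code; the workload and iteration count are placeholders):

    /* Minimal CLOCK_PROCESS_CPUTIME_ID timing harness, for illustration only.
     * Link with -lrt on older glibc. The loop body is a placeholder workload. */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec start, stop;
        volatile long long sink = 0;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
        for (long i = 0; i < 100000000L; i++)
            sink += i & 1;                               /* placeholder workload */
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &stop);

        long long usec = (stop.tv_sec - start.tv_sec) * 1000000LL
                       + (stop.tv_nsec - start.tv_nsec) / 1000;
        printf("CPU time: %lld us (sink=%lld)\n", usec, sink);
        return 0;
    }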

    Did I need to look it up to know that what you pulled was incorrect? Not at all. Really, I should have just written that measuring in seconds of any form is wrong.

    Again with the “seconds?” This is pitiful, Fifi.

    Look, you don’t need to believe that you will learn anything from a cite before you check it out. (Otherwise nobody would bother with a single one of yours.) Just do it. And if it helps understand the problem, you didn’t waste your time, because you can now argue back on the terms presented.

    On the other hand, and with copious hints, it took you six whole days to understand the problem, didn’t it, Fifi?

    Six whole days that you could have avoided had you been intelligent enough to look up the cite.

  7. oiaohm says:

    http://linux.die.net/man/3/clock_gettime
    DrLoser, microseconds or nanoseconds makes bugger-all difference. If you are measuring in anything seconds-based, it's wrong. As CPUs adjust their clock speeds up and down, the number of clock cycles per second changes. So if you want to bench correctly you have to measure in clock cycles, which is independent of the different clock speeds the CPU can be running at.

    Did I need to look it up to know that what you pulled was incorrect? Not at all. Really, I should have just written that measuring in seconds of any form is wrong.

    DrLoser, rerun it using a clock that measures in clock cycles. You will see different results, including interesting strangeness when you do it over longer time frames.

    What, with a collection of random numbers as the input?

    BWAHAHAHAHA!
    Interestingly enough, GCC at -O3 builds code paths presuming the input values are random in the first place, unless it can calculate otherwise. So the performance boost from branch prediction basically does not happen.

    So BWAHAHAHA is nothing. If there is some issue, it's just the compiler.

    Face it, Fifi. The L1 cache has absolutely nothing to do with the issue at hand. You are wasting your (worthless) time, and also everybody else’s.
    Sorry DrLoser, when you can bench properly you would have got to running branch-predictor-affected code multiple times and noticing that the CPU takes strange performance boosts, so that branch-predicted and cmov code ends up performing absolutely the same under particular conditions.

    Problem is, while your code is measured in seconds and you are not using something that reports branch-prediction usage, you cannot see it. Yes, a bit of asm with a conditional jmp in it going through the processor without the branch-prediction success/failure count changing equals the CPU microcode rewriting the code.

    A big problem with Microsoft Windows is that nothing like perf_events is included by default.

  8. DrLoser says:

    Oh, and …

    Yes, using byte code and converting to native code on the final CPU does offer chances of serious performance gains. The problem is this has to be AOT, or the savings end up eaten up by building over and over again.

    What, with a collection of random numbers as the input?

    BWAHAHAHAHA!

  9. DrLoser says:

    And back to my challenge, which I note that you wiggled around as usual, oiaohm:

    Moving on from the original stripped down purely pedagogical example, let us consider a similar case in real life.

    I am going to keep one half of the conditional as simple as possible. Compare A to B, for some value of A and B, and if they are equal, you return 0.

    Now: if A is not equal to B, you need to perform a calculation (not merely an increment to the accumulator). Take, for example, the need to multiply A by B and return a result.

    Now, as I mentioned, it took me all of a single second to realise that, when I issued this challenge, it’s a pretty easy one to explain.

    But you didn’t do that, did you, oiaohm?

    Cut the babble and get down to the nitty-gritty. And once you’ve done that, we can proceed to an even more realistic scenario of pipeline busting.

  10. DrLoser says:

    DrLoser, GCC will do the L1 push but you have to tell it you want an old arch. I will admit the last time I was in this section of GCC was before the Pentium 4; this is why I was thinking of the old L1-bashing solution. It turns out that since then Intel has added a few instructions to make life nicer. But what is in L1 still affects branch prediction. This is why you see the extremes.

    And this is meant to be an explanation as to how the L1 cache (as I say, a dutiful but ignorant beast) affects either CPU branch prediction or code optimisation?

    Face it, Fifi. The L1 cache has absolutely nothing to do with the issue at hand. You are wasting your (worthless) time, and also everybody else’s.

  11. DrLoser says:

    DrLoser, I did not even waste one minute looking for that blog. It was that far off that I just ran from my old memory. I will admit my old memory was a little out of date, missing the addition of cmov and a few other vectorization solutions.

    Vectorization solutions?

    BWAHAHAHAHA!

  12. DrLoser says:

    DrLoser, even if you lock the test down to one CPU, measuring in milliseconds still will not work.

    “Milliseconds?”

    Whatever makes you think I was measuring in milliseconds, Fifi? You do, of course, know the unit measurement of the Linux real-time library? The one that I specifically mentioned I was using?

    Pay attention at the back of the class, there. Matron will be in shortly with your daily sedatives.

  13. DrLoser says:

    DrLoser, the CPU's SMM operations will smash your pipeline if they have a critical event.

    So will hitting your CPU with a jack-hammer, Fifi. Or siting it under the impact point of a meteorite.

    But we’re not talking about that. We’re talking about a simple pedagogical example of pipeline-busting which can be demonstrated and measured via a perf test and which has obvious software implications.

    You’re not really addressing any of those implications, are you? It took you six days and a mountain of hints before you could even figure out what the problem at issue was.

  14. DrLoser says:

    This here is why you must know how to run profile-guided optimization correctly. Neither the -O1 nor the -O3 code produced by gcc produces a branch event.

    No, “that there” was a completely unsubstantiated and frankly preposterous report that a simple change from

    leal/cmovl/addq
    to
    setl/addl/movzbl

    …produces a speed-up of 110x. I’ve heard some loony tunes from you on performance, oiaohm, but surely even you don’t believe this.

    Yet one block of code is 110 times faster on that CPU. Please note I said that CPU. You can have two i7s of the same model that behave completely differently. Why? Something in one chip is broken. The microcode loaded into the chip detects it and alters the x86-to-internal-RISC conversion code; sometimes the alteration is more effective, sometimes worse.

    Utterly bizarre: and downright silly. Show me a single person who will not notice a perf downgrade of 10x to 100x and return the box to the store, and I will perhaps believe you.

    Not everybody works on dumpster-dived machines of questionable hardware integrity, oiaohm.

    Yes, you can have the insane event where the CPU is broken in particular ways such that the branch-predicting code, even with failures, looks like a flat line and a cmov can look like a wave.

    Very poetic. I can see an Anglo-Saxon Haiku in this particular gibberish.

    No, Fifi, you cannot have that insane event. And please confine your insane events to yourself. Most of the rest of us just deal with everyday life and realistic scenarios and test rigs and measurement and so on and so forth.

  15. DrLoser says:

    This here is why you must know how to run profile-guided optimization correctly. Neither the -O1 nor the -O3 code produced by gcc produces a branch event.

    Considering that I ran my tests on -O0, and that they showed no degradation around the boundary=5 mark, that’s pretty much an irrelevancy, isn’t it?

    Particularly since I specified -O0 at the time.

  16. oiaohm says:

    http://blogs.msdn.com/b/oldnewthing/archive/2014/06/13/10533875.aspx#10534337

    This here is why you must know how to run profile-guided optimization correctly. Neither the -O1 nor the -O3 code produced by gcc produces a branch event.

    Yet one block of code is 110 times faster on that CPU. Please note I said that CPU. You can have two i7s of the same model that behave completely differently. Why? Something in one chip is broken. The microcode loaded into the chip detects it and alters the x86-to-internal-RISC conversion code; sometimes the alteration is more effective, sometimes worse.

    Yes, you can have the insane event where the CPU is broken in particular ways such that the branch-predicting code, even with failures, looks like a flat line and a cmov can look like a wave.

    Perf is in fact a method to detect when a CPU has failed parts and is drifting out of spec. -mtune=native with gcc on Linux means: build the code for, and bench the code on, the current CPU.

    The reality here is that x86 CPUs since the first Pentium don't in fact run x86 asm directly. Instead they run a translation engine from x86 to an internal code that is different. So the only way to know whether the code you have made performs or not is to benchmark it correctly. The problem is that this is CPU-dependent.

    Yes, using byte code and converting to native code on the final CPU does offer chances of serious performance gains. The problem is this has to be AOT, or the savings end up eaten up by building over and over again.

  17. oiaohm says:

    DrLoser, the CPU's SMM operations will smash your pipeline if they have a critical event. The only thing you can do to avoid that is not to be on the core SMM will want to take over. SMM at its worst will clear L1 and L2 as well as destroying your pipeline operation. You can kinda guess how badly that will throw your measurements around.

  18. oiaohm says:

    DrLoser, even if you lock the test down to one CPU, measuring in milliseconds still will not work.

    https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Developer_Guide/perf.html

    Run perf a bit, DrLoser. Learn that clock cycles and milliseconds don't align even on a single CPU core. CPU clocks are not that dependable.

    Yes, how many clock cycles something takes, accessible via perf_events, is a fairly constant figure.

    Normally, for most of this testing, the perf command is good enough. You run perf over the program of interest, then extract the data of interest from the perf results. Yes, perf includes where branch prediction failed. If you want a library to build it into your application, use PAPI on Linux. This gets you clock cycles.
    http://linux.die.net/man/3/papi_get_real_cyc

    libRT is POSIX-neutral and worthless. PAPI used to support Windows but has not worked there since XP.

    There is an Intel solution that works on Windows:
    https://software.intel.com/en-us/articles/intel-performance-counter-monitor#license
    Of course, this does not work correctly on all AMD or VIA processors.

    PAPI is just an API under Linux to access perf_events, which provides the perf clocks based on hardware counters inside the CPU.

    Basically it is simple and painless to do these performance checks on Linux using PAPI, no matter the CPU, as long as the CPU has performance counters that measure in clock cycles in the first place. Yes, PAPI cannot magically create hardware that does not exist. It's also very simple on anything Intel that is modern.
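
    As a sketch of how little code that takes (assuming libpapi is installed; the workload here is just a placeholder):

    /* Cycle-based timing via PAPI; compile with -lpapi. PAPI_get_real_cyc()
     * returns a running count of real CPU cycles, so the result is in cycles
     * rather than wall-clock units. Illustration only. */
    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI init failed\n");
            return 1;
        }

        volatile long long sink = 0;
        long long c0 = PAPI_get_real_cyc();
        for (long i = 0; i < 1000000L; i++)              /* placeholder workload */
            sink += i;
        long long c1 = PAPI_get_real_cyc();

        printf("elapsed cycles: %lld (sink=%lld)\n", c1 - c0, sink);
        return 0;
    }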

    DrLoser, GCC will do the L1 push but you have to tell it you want an old arch. I will admit the last time I was in this section of GCC was before the Pentium 4; this is why I was thinking of the old L1-bashing solution. It turns out that since then Intel has added a few instructions to make life nicer. But what is in L1 still affects branch prediction. This is why you see the extremes.

    DrLoser, I did not even waste one minute looking for that blog. It was that far off that I just ran from my old memory. I will admit my old memory was a little out of date, missing the addition of cmov and a few other vectorization solutions.

    Real clock-cycle measurement means you don't have to do stacked loops to make the thing long enough to measure in milliseconds. You can measure a single instruction if you wish. Also, perf and PAPI provide highly useful things like the number of branch-prediction failures and successes. Yes, some code will show absolutely no branch-prediction events.

    cmov is not branch prediction. cmp is also not branch prediction.

    In reality, you would have known what perf clocks were, and how to access them, if you were qualified to talk on this topic. Yes: via the perf command or PAPI. It gets worse; you are not going to like this.

    Since performance counters are per CPU core, you cannot use them in a dependable form from inside a virtual machine.

    Naturally, VirtualBox made this almost impossible to copy over.
    Here you mention VirtualBox. Sorry, DrLoser, you cannot do this. To run performance benchmarking correctly, Linux has to be directly installed. Virtual machines cause your clocks to shake a hell of a lot. Yes, when you start doing cycle counts you will notice that sections of code that should take exactly the same number of cycles, like 20 nop instructions, in fact don't, due to virtual-machine operations.

    In fact, even installed on real hardware, you can still pick up shakes coming from System Management Mode events. These are less common. The way to avoid System Management Mode events is to block your test code from touching core 0; NUMA/affinity controls allow you to limit which CPUs your code runs on.

    System Management Mode runs on core 0 in my system, and when it wants to run, it runs. 20 nops that have a System Management Mode event land on them could go from 2 cycles (because of optimization by the CPU) to about 4000 cycles. You can guess what kinds of fun System Management Mode can cause when you are attempting to run real-time code. taskset is one of your best-friend commands under Linux when you need stable real-time performance.

    Yes, when I was running my tests I avoided core 0.
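
    taskset is the command-line route; from inside the test program the same thing can be done with sched_setaffinity. A Linux-specific sketch; the choice of core 1 is just an assumption for illustration:

    /* Pin the benchmark off core 0 (the core described above as taking the
     * SMM hits on that machine). Linux-specific; core 1 is an arbitrary choice. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling process */
    }

    int main(void)
    {
        if (pin_to_core(1) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the timed loop here ... */
        puts("pinned to core 1");
        return 0;
    }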

    See, it's not that simple to perform these benches. You have to use something that measures cycles and understand what in your system is going to screw up the results.

  19. DrLoser says:

    Right here you talk about the clock. You are meant to use perf clock sources. Why perf clock cycles? Modern CPUs change their clock speed up and down.

    You never tire of bringing total irrelevancy into the discussion, do you, oiaohm?

    I’m just looking for a compiler option to lock the test down to a single CPU. That is all. You are, as usual, no help.

    Alternatively, I will accept an alternative library to the time-hallowed *nix libRT. You can talk about “perf clocks” all you like, but you are worthless unless you point me to a better alternative.

    Oh, and that “blog post comment” that you’re so proud of digging out? Doesn’t mean a thing, even though you spent six ever-loving days trying to dig it up, and even though your excuse was “it’s not worth my valuable time … and yet, suddenly it is!!!!”

    Count me among the people who are surprised by the optimizer not transforming it into the branchless version (assuming optimization was enabled). Even many years ago I noticed that

    if (y == 0)

    x = 10;

    else

    x = 30;

    is compiled into a branchless version by MSVC & other compilers.

    The compiled code varied a bit by compiler of course, but it was in the spirit of

    cmp eax, 0 ; assume y is in eax; x is in ebx
    seteq bl ; bl == (y == 0) ? 1 : 0
    sub ebx, 1 ; ebx == (y == 0) ? 0 : -1 (0xFF..F)
    and ebx, 20 ; ebx == (y == 0) ? 0 : 20
    add ebx, 10 ; ebx == (y == 0) ? 10 : 30

    GCC now uses a cmov instead.

    The pedagogical example is about how to bust a pipeline via a branch comparison, Fifi. It is not a challenge as to how to get around it.

    This particular ASM example, as with mine, shows how to evade the issue on a simple count of a set of random variables. It is not generalizable, which is why I ignored it for the purposes of the current discussion.

    And, fairly obviously, it has nothing at all to do with L1 cache, which is the interesting novelty that you came up with.

    Go on, do tell, Fifi. How do you solve a generalised branching issue that potentially busts the CPU pipeline via a “smart” L1 cache push?

    Your audience is ready and waiting.

  20. oiaohm says:

    if (array[i] < boundary) count++;
    count += (array[i] < boundary) ? 1 : 0;

    Here is the big catch: both these lines of code, if the compiler's optimiser is good, should perform absolutely identically.

    http://blogs.msdn.com/b/oldnewthing/archive/2014/06/13/10533875.aspx#10534095

    On gcc and llvm with the optimizer on you get the branch-less version without needing to make your code look horrible.
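
    Putting the two forms side by side makes that easy to check: compile with gcc -O3 -S and look for a conditional jump versus a setcc/cmov (or vectorised compare) sequence. A small sketch; exact output varies by compiler version and target:

    /* Both functions count the same thing; with the optimizer on, compilers
     * commonly emit branch-free code (setcc/cmov, or vectorised compares)
     * for both. Verify with `gcc -O3 -S`. */
    #include <stddef.h>

    long count_if(const int *array, size_t n, int boundary)
    {
        long count = 0;
        for (size_t i = 0; i < n; i++)
            if (array[i] < boundary)
                count++;
        return count;
    }

    long count_ternary(const int *array, size_t n, int boundary)
    {
        long count = 0;
        for (size_t i = 0; i < n; i++)
            count += (array[i] < boundary) ? 1 : 0;
        return count;
    }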

    DrLoser, the fact that it was a compiler fault was in the blog post comments you never read.

    http://blogs.msdn.com/b/oldnewthing/archive/2014/06/13/10533875.aspx#10534162

    Here is the really stupid one: add an else statement to the if statement and the Microsoft compiler makes the correct code without having to do the += stuff.

    http://mrpogson.com/2015/02/04/mini-beast-arm-cortex-a-72/#comment-243883

    Right here you talk about the clock. You are meant to use perf clock sources. Why perf clock cycles? Modern CPUs change their clock speed up and down. This changes the number of instructions they can process per millisecond. If you are measuring in milliseconds you are at the mercy of the CPU as to whether it will process the same number of instructions, which alters the amount of time the code takes to complete. Yes, even in real-time work there is very little reason to sync the RTC between cores. Dynamic CPU clock speeds ruin having predictable time.

    If you wish to sync the RTC you normally have to lock the CPU cores at full speed and calculate, by perf instruction counting, the drift between CPUs. That is if you seriously need real-time multi-core with a synced clock. And seriously want to waste power.

    DrLoser, a cmov does not go into the branch predictor. ARM also has instructions that don't trigger pipeline flushing. Yes, cmov code might be conditional, but it's not a branch.

    I was waiting to see how stupid you are.
    CMP is not a “conditional branch”; a conditional branch requires a jump instruction. No jump instruction, no branch.

    You can have a million CMP? AX,BX statements and you will see absolutely no effect from branch prediction.

    The reality is that CMP has nothing to do with branch-prediction performance issues.

    There is a big difference between a conditional branch and a conditional statement. Conditional statements don't cause performance-behaviour changes.

  21. DrLoser says:

    Incidentally, there’s a trivial answer to that question, as posed. It doesn’t even need googling for PDFs.

    But I’m willing to bet that Valentine Boy doesn’t get it.

  22. DrLoser says:

    So every bit of that fancy maths is a complete waste of time. The compiler really should have used a CMOV in the majority of cases where the code path is not somewhere near predictable, in most modern processors.

    * Fancy maths (ooh! lumme! Bit-shifting!)
    * Complete waste of time (ooh! lumme! Provable results!)
    * Reliance on the compiler (ooh! lumme! Conditional piled upon conditional!)
    * “Code path is not somewhere near predictable” (ooh! lumme!) “in most modern processors” (ooh! lumme!)

    I’m in awe, Fifi. You’ve actually outdone yourself. Four ludicrous statements in a single paragraph of gibberish! But you did, at least, manage to reference a single concrete fact.

    Yes, there is an instruction called CMOV on that architecture you affect to despise … in fact, as with most architectures, it’s a family of such. In this case, we can resolve the problem to something like:

    MOV DX,AX
    INC AX
    CMP DX,BX    ; BX being the item in the array
    CMOVA DX,AX

    There you go, a conditional branch built into the assembler instructions! Can’t say fairer than that. And no “fancy math methods” involved! (Although I do wonder quite what your antipathy to “fancy math methods” might be.)

    Now, an exercise for the reader — specifically the very, very slow reader who can’t look up the relevant PDF to save his life and who insists upon CMOV or the equivalent to rescue him:

    Moving on from the original stripped down purely pedagogical example, let us consider a similar case in real life.

    I am going to keep one half of the conditional as simple as possible. Compare A to B, for some value of A and B, and if they are equal, you return 0.

    Now: if A is not equal to B, you need to perform a calculation (not merely an increment to the accumulator). Take, for example, the need to multiply A by B and return a result.

    Heck, I’m not even talking about a function call. Just a simple multiplication of two registers, either in-place or to a third register.

    Trivial, obviously. No pipeline-busting required.
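
    One branch-free way of writing it, as a sketch (my own illustration, not anyone's benchmark code):

    /* Return 0 if a == b, otherwise a*b, without a conditional jump.
     * The mask trick below is one option; a compiler given the plain
     * ternary will often produce a cmov/setcc sequence on its own.
     * Illustration only. */
    #include <stdint.h>

    int64_t mul_or_zero(int64_t a, int64_t b)
    {
        uint64_t prod = (uint64_t)a * (uint64_t)b;   /* wraps instead of overflowing */
        uint64_t mask = -(uint64_t)(a != b);         /* all ones if a != b, else 0 */
        return (int64_t)(prod & mask);
    }

    int64_t mul_or_zero_plain(int64_t a, int64_t b)
    {
        return (a != b) ? a * b : 0;                 /* same result; compiler's choice of code */
    }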

  23. DrLoser says:

    Basically, Deaf Spy's example is completely incorrectly designed to test the branch predictor. So there is no need to go looking for the blog or PDF when the sample code is completely wrong.

    Actually, it’s perfectly designed to test a “naïve” branch predictor, Fifi. And that is its sole purpose. As I keep banging on about it, the example is purely pedagogical and demonstrates nothing more than the interesting observation that a naïve CPU pipeline is apt to cough up when faced with both unpredictable input (in this case, randomised) and a conditional op.

    That’s not a difficult concept to grasp, I think. Even though it took you six whole days of faffing around before you sort of grasped it.

    What you do afterwards is another question. Indeed, that question was the subject of my second multiple-choice question. Did you answer that?

    No matter. You can apply those despicable “math methods” you mentioned, which have the advantage of reliability across architectures. You can rely on “hot pipelines,” which are of questionable use if more than one conditional is in operation, and in any case have a cost in both circuit complexity and in latency.

    You can do any damn thing you please. But you cannot deny the existence of the underlying issue, Fifi.

    Well, this being Valentine’s Day, and us being on a promise, here’s the relevant article.

    Which you didn’t find in six or more days of trawling around, did you? Pathetic. It took me fifteen minutes.

    Substitute “it isn’t worth my time” with “Mommy, make the bad men go away. I can’t find it however hard I try” — and believe me, oiaohm, I am absolutely certain that you tried very, very hard …

    … and we’re just about back to the present, aren’t we, Fifi?

  24. DrLoser says:

    DrLoser, of course, being the idiot, you complain about Linux clocks being all over the place.

    I did? When?

  25. oiaohm says:

    DrLoser, of course, being the idiot, you complain about Linux clocks being all over the place. Are you not aware that Intel and AMD processors contain inside them the means to measure how many clock cycles a task takes to complete?

    Yes, another problem with the blog site you were using: the measure in a correctly performed test is clock cycles, nothing else.

    Also notice that a correct test uses tables of intentionally biased true/false states to produce a nice little graphic showing exactly where the branch predictor goes south. The ideal states are all-true and all-false, and the worst is almost always 50/50.

    Basically, Deaf Spy's example is completely incorrectly designed to test the branch predictor. So there is no need to go looking for the blog or PDF when the sample code is completely wrong.

  26. oiaohm says:

    https://github.com/xiadz/cmov
    There is in fact another way to handle unpredictable branches. Notice it only defaults to on at -O3 in gcc.

    If FF=1 you can rely on the result stored in the ZF. The L1 cache has arranged things such that your guess as to the next step in the pipeline, ie an increment of whatever (or not, depending upon the pipelined guess) is correct.

    If FF=0, you can’t. Insta-pipeline-bust.

    Rather useful, that.
    Branch prediction on x86 has three options, not two, you idiot DrLoser.

    A CMOV variation of a compare never causes a pipeline bust.

    A stack of CMOVs to set which results get exported back to memory: no pipeline-bust problem. Messing with the L1 cache by direct shoving will get the microcode to do the same as if you had used a CMOV solution in the first place.

    There is more than one way to skin the cat. Pipeline busting because of an IF costing performance is a failure to optimize the implementation of the if correctly.

    So every bit of that fancy maths is a complete waste of time. The compiler really should have used a CMOV in the majority of cases where the code path is not somewhere near predictable, in most modern processors. The only issue is old processors like the P4 that had a broken CMOV that took something like 10 cycles to complete; even so, in highly random cases this was still faster than branch-prediction failures.

    Deaf Spy and DrLoser, the reality is that the compiler can force branch prediction different ways by code layout and ASM selection.
    Branch prediction has three states:

    The branch is a false branch that will almost never run.
    The branch is a true branch that will almost always run.
    The branch is an unknown branch: optimize with cmov and the like to disable CPU branch prediction.

    Stupidly enough, there is less than a 3 percent performance gain from branch prediction when the guess is correct. The price for guessing incorrectly most of the time is massive; at worst it is 5x slower.

    Some other CPU types have instructions specifically for branch-prediction control, particularly to tell the CPU that this compare is unpredictable.
    LOL, then DrLoser asks this:

    Maybe, oiaohm. Maybe. Name one.
    Every single Intel processor since the P4. All current x86 processors. You had to build your code with -O3 to have gcc do it by default.

    Then you use __builtin_expect to tell the compiler which if statements are not to be handled by the generic means that are neutral to random-data effects.

    Yes, shock horror, right: the compiler controls branch prediction a hell of a lot. Yes, cmov exists to make it possible for the programmer/compiler to disable the CPU's run-time branch prediction.
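
    A minimal sketch of how __builtin_expect is normally wrapped (the likely/unlikely macro names are just the common convention, not part of any standard):

    /* GCC/Clang extension: the hint only influences branch layout and
     * static prediction, it does not change the result. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long count_rare(const int *array, long n, int boundary)
    {
        long count = 0;
        for (long i = 0; i < n; i++)
            if (unlikely(array[i] < boundary))       /* caller asserts this is rare */
                count++;
        return count;
    }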

    Yet for some reason people think they need maths to deal with branch-prediction errors. Smack the CPU on the side of the head with either L1 control or cmov and the problem disappears. L1 smacking was required for Pentium 3 CPUs and before.

    As I said, the PDF you have been demanding I look up has been absolutely bogus right back to the Pentium 1.

  27. DrLoser says:

    And a little more investigation suggests that I am wrong, and branch prediction does indeed coexist with CPU pipelining. (Told you I’d admit when I was wrong.)

    But, to quote:

    Other detriments are the following:
    * Predication complicates the hardware by adding levels of logic to critical paths and potentially degrades clock speed.
    * A predicated block includes cycles for all operations, so shorter paths may take longer and be penalized.

    Predication is most effective when paths are balanced or when the longest path is the most frequently executed, but determining such a path is very difficult at compile time, even in the presence of profiling information.

    Incidentally, the x86-64 architecture apparently does this (unlike ARM 64), so I’m at a loss to see where oiaohm gets his claims from.

    So, let’s summarise:

    1) You have a potential degradation of clock-speed. Well, that sounds spiffy.
    2) You’re stuffed if there is one short path that happens more frequently, and one long path that doesn’t. Well, that sounds spiffy.
    3) This actually works for the pedagogical example, because both paths are short. Hooray!
    4) Not only does it not work in general; you can’t even reliably predict when it will work. Not even if your compiler (or, more to the point, JITter) has “reliable profiling information”.

    So, it’s basically useless except in toy scenarios or else in massively parallel workloads like supercomputers. Anything in-between … you’re stuffed.

    And still no mention of magical intervention by the L1 cache. What a surprise.

  28. DrLoser says:

    That’s going to be quite a feat, really. I can’t think of a single architecture in the entire history of computers that has a nullable comparator instruction at its base.

    CMP? AX,BX

    … setting the ZF to 1 if AX==BX, or 0 otherwise. But, check out the FifiFlag!

    If FF=1 you can rely on the result stored in the ZF. The L1 cache has arranged things such that your guess as to the next step in the pipeline, ie an increment of whatever (or not, depending upon the pipelined guess) is correct.

    If FF=0, you can’t. Insta-pipeline-bust.

    Rather useful, that.

  29. DrLoser says:

    Having said that … (and I still confess a general ignorance) …

    Really, the branch-prediction control inside an Intel CPU is done in the strangest way possible.

    I don’t entirely see how a simple “you have just bust the pipeline” is a particularly strange way of doing branch prediction. In other words, it’s NOP. Seems quite basic and reasonable to me.

    And now we get to a contentious observation that will no doubt generate fountains of gibberish, sans relevant cites:

    Some other CPU types have instructions specifically for branch-prediction control, particularly to tell the CPU that this compare is unpredictable.

    Maybe, oiaohm. Maybe. Name one.

    Don’t just name the CPU. Name one or more specific “instructions for branch prediction control.”

    And, no, CAT AX,[Schroedinger] ain’t gonna cut it.

    This time, no gibberish, please. One manufacturer. One CPU. Any number of assembler instructions your little heart desires.

    Why it is classed as impossible by people who don't know Intel CPUs properly is the lack of understanding that you can control branch prediction by controlling the L1 cache.

  30. DrLoser says:

    Well, naming the issue (branch prediction) is the only correct point in the walls of text. Still good, provided that the Doctor explicitly allowed students to “Use a more sophisticated CPU pipelining architecture.

    I’m still revelling in the anticipated result of oiaohm picking up on an unambiguous statement of fact (“it’s pipeline-busting, dummy”) and then claiming that he knew that all along. Bit of a shame he never once mentioned it, really.

    He still doesn’t seem to have picked a choice from the second list, though. Presumably because he stands by the following extraordinary statement:

    DrLoser, the answer to your multiple choice is in fact none of the above. The error in performance only appears because branch prediction is biased to picking one path over the other. CPU branch prediction needs to be kicked in the head by L1 cache-fill instructions to make it choose the third option, which it wants to avoid.

    The bit I’ve highlighted is, in fact, true. It’s actually the essence of the problem. The other bit … and now I am going to have to admit my weakness in this area, because unlike oiaohm I will always confess when I am open to correction (and know it ahead of time) … sounds like gibberish to me.

    With a basic knowledge of how CPU pipelining works (Fetch, Decode, Execute, Access, Write-back) and how L1 cache works, I am at a loss to see how the latter can “kick” the former “in the head” without causing a pipeline flush. And, need I point out, a pipeline flush is precisely what you are trying to avoid here.

    Logically, the five classic pipeline steps above are a simple example of overlap: you can deal with five instructions at each clock tick, provided that each instruction is in one of those five microcode phases. I don’t see where L1 cache comes in here.

    Now, I can certainly see that a pipeline introduces dependency resolution issues. The canonical and trivial one is where one instruction in the pipeline is writing to a register and the other is reading from it. You could rely on the compiler to fix this up with “microcode instruction boundaries,” I suppose. You could also (and this is common practice, as far as I understand) introduce a “bubble” into the pipeline, which is basically an instruction boundary that prevents a fetch before a write-back.

    But I can’t see how L1 cache (which is, in all honesty, a fairly stupid and subservient beast) can help in either case.

    I would genuinely be interested if oiaohm can come up with an argument for this.

    Oh, and oiaohm? That “both paths are hot” thing? Again, I am entirely open to correction, but I believe you are confusing pipeline optimisation with cache optimisation.

    “Hot” code paths are amenable to L1 cache-line “over-eagerness,” because L1 cache intermediates between something very, very, slow (L2 cache) and something very, very fast (the CPU). It’s worth loading more stuff into L1, even though you know you are going to throw some of it away — because at an unpredictable point, owing to branching or other redirection, you’re going to have to bust the L1 cache in any case.

    It isn’t worth doing that to the CPU pipeline, because the CPU pipeline takes an input (measured in clock ticks) and turns it into an overlap (measured in clock ticks). You only need to bust a pipeline when you mis-guess a conditional. You only need to insert a bubble into the pipeline when you hit a “hazard,” ie an unavoidable inversion of dependencies in the five basic steps.
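
    A toy way to see the overlap (and what a dependency chain costs) without touching assembler is to time one long serial chain of additions against several independent accumulators; the independent version usually runs noticeably closer to the CPU's issue width. My own illustration, not part of any test rig discussed here:

    /* sum_serial forms a single dependency chain: every add waits on the
     * previous one. sum_split keeps four independent chains the pipeline
     * can overlap. Compile without -ffast-math so the compiler cannot
     * reassociate the serial version itself. */
    #include <stddef.h>

    double sum_serial(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    double sum_split(const double *a, size_t n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }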

    I am quite prepared to confess that I pretty much had to work this stuff out for myself, with only basic help from Wikipedia and a couple of (hooray!) PDFs.

    I may well be wrong. I doubt that oiaohm will be any less wrong, though.

  31. DrLoser says:

    DrLoser wrote, moving the goalposts again, “I was talking about numerical analysis.”

    I don’t quite see how I was moving the goalposts, Robert. Your post consisted, broadly, of two parts.

    In part (1), you asserted that loop-unrolling optimisations have been known for a very long time. In this case you were moving the goalposts. Deaf Spy’s pedagogical example is not amenable to loop-unrolling, because there are side-effects (my apologies: I called them “invariants,” which is precisely what they are not) inside the loop, to wit: the clock starting and stopping.

    oiaohm attempted to minimise the cost of these side-effects by unrolling the loop, which completely alters the behaviour of the code. Thus, your comment on loop unrolling, whilst interesting, is actually irrelevant.

    In part (2), you described your experience of using integer arithmetic rather than floating point as a way of increasing precision when mapping an input set to an output set. I did, in fact, go along with this, to some degree, although I assume I wasn’t very clear. Let me then here agree with you. In the cases where this works, it works very well indeed. Your case was one of those.

    All I’m saying is that, given a sufficient (and in real life quite typical) number of intermediate steps between the input set and the output set, numerical analysis comes into play. Now, you could apply your method to each step, but then you’re likely to run foul of the unit-conversion issue (which I labelled mantissa-exponent). Considering that we’ve managed to crash a landing craft onto the surface of Mars simply by confusing inches with centimetres, I don’t really recommend this. I would seriously prefer to work with floating point arithmetic alone and let the hardware do the hard lifting for me.

    And, as I pointed out, some GPU hardware does precisely that, with specialised instructions. Is the cost-benefit appropriate? I don’t know. I think I’d set two teams on the problem, one using your method and one using pure floating point, and see which one “wins.” My bet would be on unit consistency and the specialised hardware instructions, rather than on human ingenuity. But it would be an interesting experiment.

    And it certainly isn’t “moving the goalposts.”

  32. Deaf Spy says:

    Ring the bells, sing songs, open a barrel of beer, and break a hundred bottles of wine! Six days, 12 hints and 2 simplifications later Fifi saw the light, and managed to say the name of the phenomenon! Branch prediction! You must be proud of yourself, Fifi!

    Well, naming the issue (branch prediction) is the only correct point in the walls of text. Still good, provided that the Doctor explicitly allowed students to “Use a more sophisticated CPU pipelining architecture. (The student may wax lyrical about ARM at this point. Or about any other alternative.)”

  33. DrLoser wrote, moving the goalposts again, “I was talking about numerical analysis.”

    The numerical analysis was simple, we just mapped a slowly varying function to an integer space. The difference operations represented the electric field in the devices reasonably well. High accuracy everywhere was of little benefit because thousands of small terms were accumulated in the simulations and the fields themselves focussed the beam so that any error tended to be corrected. The predictions of the software and the behaviour of the ultimately designed devices were pretty well in line. The interior of the cyclotron, for instance, had nearly mirror-perfect vertical symmetry, so first-order errors cancelled. On devices with even more symmetry, we used analytical approaches. This was all digital. We tweaked the model until it predicted the performance we wanted then built the device according to the model. Those D-tips were beautiful, machined from low-oxygen high-purity copper. You can read a description of the method in this thesis, described from section 3 Electric Field Calculations for D- Ion Acceleration, page 23, and a reference to my work at [OH83]. I’ve looked all over the web and haven’t found a picture. It’s too bad. I was into photography at the time but I don’t remember ever taking that picture.

  34. oiaohm says:

    I hate HTML at times.
    1) array is less than boundary
    2) array is greater than or equal to boundary
    3) Result is not guessable, so flat-out generate as if both results will happen and throw away whichever one is wrong.

  35. oiaohm says:

    DrLoser, Intel avoided redesigning the complete arch and implemented a hack in the microcode. Originally, Intel branch prediction could only pick one path as the fast one. Intel CPUs still default to doing this unless you use the instructions to affect the L1 cache.

    Build a bigger cache.
    A bigger cache will not fix up branch-prediction errors. You control branch prediction by controlling the contents of the L1 cache. This is the Intel and AMD CPU choice.

    count += (((~(1 <> 31) + 1;

    This is in fact slower than what the microcode in the CPU will generate when it's told both paths are hot. x86 is horrible for the method you have to use to say, hey, both code paths are hot.

    if (array[i] < boundary) count0++;
    This is only slow if branch prediction has decided to pick one or the other result. The microcode can in fact choose to put both outcomes into the instruction pipeline.

    Branch prediction can choose between three outcomes:
    1) array is less than boundary
    2) array is greater than or equal to boundary
    3) Result is not guessable, so flat-out generate as if both results will happen and throw away whichever one is wrong. Yes, this is slightly longer in the pipeline than 1 or 2, but way shorter than Deaf Spy's replacement.

    Really, the branch-prediction control inside an Intel CPU is done in the strangest way possible. Some other CPU types have instructions specifically for branch-prediction control, particularly to tell the CPU that this compare is unpredictable.

    Why it is classed as impossible by people who don't know Intel CPUs properly is the lack of understanding that you can control branch prediction by controlling the L1 cache.

    DrLoser, the answer to your multiple choice is in fact none of the above. The error in performance only appears because branch prediction is biased to picking one path over the other. CPU branch prediction needs to be kicked in the head by L1 cache-fill instructions to make it choose the third option, which it wants to avoid.

    This is what I am getting at: you guys have the issue completely wrong.

  36. DrLoser says:

    There are several forms of optimization. You can do things to improve the accuracy of calculations or the speed, for instance. In the case I remember we improved both. On the old IBM S/360, FP was 32 bits with 24 bits of precision. Using integers allowed 32bits of precision. Further, integer operations took about 400ns whereas FP were several times slower.

    I wasn’t talking about optimisation, Robert. I was talking about numerical analysis.

    Unless you are simply mapping one set (quite possibly very large) onto another set via a transform and a possible reduce operation, then you’re going to be SOOL using integers in place of floating points on any architecture whatsoever; because all you’re doing is to represent the mantissa and the exponent as arbitrary — and in one horrible case I encountered five years ago, fixed sets of bits. You don’t actually gain any precision whatsoever.

    What you do gain is the very likely brittle result of failed unit conversion. Useful, that.

    I believe the latest GPUs work on a single “add and multiply” instruction precisely because it offers about two bits more precision on a double-precision number. Propagate this through a series of calculations (which is what we Computer Scientists call “numerical analysis”), and you get far superior results.
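
    The “add and multiply” instruction referred to is presumably fused multiply-add; in C it is reachable as fma() from <math.h>, which rounds once instead of twice. A small sketch of the difference:

    /* fma(a, b, c) computes a*b + c with a single rounding; a*b + c written
     * out rounds the product first and then the sum, losing the low bits.
     * Compile with -lm, and without FP contraction (-ffp-contract=off) so the
     * naive expression is not itself fused. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + ldexp(1.0, -27);   /* 1 + 2^-27, exactly representable */
        double naive = a * a - 1.0;         /* product rounded before the subtract */
        double fused = fma(a, a, -1.0);     /* keeps the low-order 2^-54 term */
        printf("naive: %.20e\nfused: %.20e\n", naive, fused);
        return 0;
    }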

    On the other hand, I have to admit, they didn’t do too badly in the 1970s. Fun times. No reason to accept advances in the last forty years or so, I suppose, just because they obviously make much more sense.

  37. DrLoser says:

    PS: the existence of these instructions and how compilers use them I did talk about on the ReactOS developers' list, and the very example you have been using was also brought up.

    Naturally, then, Fifi, you should find the multiple choice questions I just posed childishly easy to answer.

    You don’t even need cites. You just need to answer two trivial, multiple-choice questions. Hell, you could pick a total moron off the street and they’d get it right 8% of the time.

    We don’t need to go to the street for this one, though, do we, Fifi? We don’t even need to hang around dubious ill-lit lamp-posts.

    We’ve got you, my little darlin’.

    An honest challenge, then. You answer those two multiple choice questions.

    And, if Deaf Spy doesn’t do so before Valentine’s Day, I shall provide you with the relevant, documented, information.

    Sound fair?

  38. DrLoser says:

    oiaohm: Deaf Spy it is in fact cache linked.
    Deaf Spy: No, it is not.

    You remember that multiple-choice question that you mentioned earlier, Deaf Spy? I think we now have it nailed down.

    Question: Given this problem, posed in code, six days ago: which one of the following CPU issues is involved?

    1. Cache-Busting.
    2. Pipeline-Busting.
    3. Both of the above.

    Apparently Fifi has plumped for (1). As is my custom, I shall leave that choice dangling.

    Question: Given that the correct answer is (2) … oops, I gave it away, sorry, Deaf Spy, but you know Fifi, he’ll keep arguing for (1) anyway …
    … What solution would you propose?

    1. Redesign the entire Intel CPU pipelining architecture.
    2. Use a more sophisticated CPU pipelining architecture. (The student may wax lyrical about ARM at this point. Or about any other alternative.)
    3. Build a bigger cache. (The student need not concern his or herself as to whether an L1, L2 or L3 cache is appropriate.)
    4. Something else.

    I have a Debian Wheezy test-rig ready and waiting for your ineffable wisdom here, Fifi.

    Ready and waiting.

  39. Deaf Spy says:

    Deaf Spy it is in fact cache linked
    No, it is not.

    But it is always entertaining to see you banging your head at the altar of Stupidity. 🙂

  40. oiaohm says:

    Deaf Spy it is in fact cache linked. The MMX instructions' interference with cache behaviour in the CPU also messes with pipelining, so it messes with when cache-busting or pipeline-busting happens.

    Working out the correct information the CPU requires to operate sanely is not exactly simple.

    https://community.freescale.com/thread/316003

    Deaf Spy, sorry, you have had it wrong all along. Pipelining performance is directly linked to the cached data in the CPU. So pipeline busting is many times less pronounced if you don't set things up to intentionally trigger it.

    Any of these random performance differences a CPU can spit out can be triggered intentionally, dependably and repeatedly. So the one-off faster events can in fact be made to happen every time. The problem is that without putting the information in, you are depending on the CPU to guess the correct code paths and cache the right data based on its guesses.

    You are most likely thinking that it's instruction-path failures. Those caching instructions can push execution code into the CPU, undermining the instruction pipeline's wrong guesses about code flow. If you have intentionally told the CPU that both code paths are required in L1, the CPU then does not optimize down to a single one but weights both evenly. Yes, so both code paths are fast.

    What is in cache affects instruction-pipeline calculations.

  41. Deaf Spy says:

    The speed-up happens because what is in the CPU cache happens to be the most useful, so the CPU avoids having to waste time getting data into cache.
    No, it doesn't. The phenomenon observed is not related to the cache at all.

    The rest is just the usual irrelevant gibberish.

  42. oiaohm says:

    PS: the existence of these instructions and how compilers use them I did talk about on the ReactOS developers' list, and the very example you have been using was also brought up. Yes, Deaf Spy, you just brought up an argument that I had already smashed into nothingness.

    Sorry, Deaf Spy, you really don't understand what compilers are in fact up to. The reason optimization is so critical is that CPUs can act like complete morons without correct information.

  43. oiaohm says:

    Mind you, ARM CPUs and many other types of CPU also have these instructions for informing the CPU what data you will be requiring, in advance of actually running the instructions that require the data, giving the CPU time to pre-transfer it. Yes, using the cache-control instructions correctly can insanely boost the speed of code.

  44. oiaohm says:

    Deaf Spy, you keep on presuming I don't understand the problem at hand.

    The speed-up happens because what is in the CPU cache happens to be the most useful, so the CPU avoids having to waste time getting data into cache.

    The problem is there are many ways to skin the cat. Optimization is about picking the most effective way to skin the cat. Cache clearing by the CPU can also be slowed down by the CPU being informed of data that will be required by upcoming processing.

    You can depend on the CPU guessing what data to cache and getting it right, or you can include instructions in your program's asm to tell it what data it should cache.

    http://softpixel.com/~cwright/programming/simd/mmxext.php

    People forget about the MMX extensions in x86 CPUs from 1999; yes, they are still in modern-day x86. Yes, an x86 program can choose to directly control what is in the cache lines, so avoiding cache-line issues.

    The problem here is that the MMX instructions to control the cache lines are not emitted at all by gcc in -O0 mode.
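
    Those cache-control instructions (prefetchnta, prefetcht0 and friends) are also reachable from C through GCC's __builtin_prefetch, without writing any assembler. A sketch; the prefetch distance of 16 elements is a guess, and useful values have to be measured on the target CPU:

    /* Explicit software prefetch: second argument 0 = read, third argument 3
     * = keep in all cache levels. Whether this helps at all depends entirely
     * on the access pattern and the CPU's own prefetchers. */
    #include <stddef.h>

    long sum_with_prefetch(const long *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 3);
            sum += a[i];
        }
        return sum;
    }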

    A particular set of problems is solved by optimization, end of story. Without doing an optimisation pass you don't know which of these instructions to generate to pre-request stuff into cache and to tell the CPU's built-in caching algorithm to be smarter.

    Yes, you can either have a CPU with an idiot cache system, because it is uninformed because you did not run optimization, or a CPU with a fairly smart cache system, because you ran optimization and so added the right instructions to your code so the CPU does not have to act like an idiot.

  45. Deaf Spy says:

    Here’s a fairly big hint: what happens when the “overlapped instructions in a pipeline” become inefficient?

    After that, the only thing left is to give them a multiple-choice test to choose the single correct answer. 🙂

  46. Deaf Spy says:

    Both you and I understand the logic of doing them manually.

    Dear poor boy, you can’t even understand the problem at hand and create a benchmark, let alone optimize it.

  47. oiaohm says:

    Robert Pogson, to be truthful, a lot of people got sick of doing all those optimizations to the intermediate code manually, so they built them into modern-day compilers. Of course, some people like DrLoser are foolish enough to turn them off.

    Both you and I understand the logic of doing them manually.

  48. DrLoser wrote, of using integer arithmetic instead of FP in the old days, “the Numerical Analysis implications of that absurd micro-optimisation do not bear thinking about.”

    There are several forms of optimization. You can do things to improve the accuracy of calculations or the speed, for instance. In the case I remember we improved both. On the old IBM S/360, FP was 32 bits with 24 bits of precision. Using integers allowed 32bits of precision. Further, integer operations took about 400ns whereas FP were several times slower. I was doing 2-D and 3-D electric field calculations for the interior of a cyclotron and various electric beam focussing/deflection elements. Our machine had 3 queues, A, B, and C. By this means, I got my jobs into the A queue, first in line. They were low in I/O and by this optimization, the CPU time fit the cut off, 15s or so. Ah, the good old days of batch-processing… I wrote the software in Fortran H, dumped the intermediate code and then optimized in assembler. It was just a matter of putting the innermost loop in inline code and changing the array-subscripts to addresses/displacements. I did Successive Over-relaxation. Good fun solving Laplace’s equation, $latex \nabla^2 V = 0$. $latex V$ was the electric potential in free space. I defined the boundaries as negative values and skipped recalculating them. ISTR the array was 128x128X32, huge, eh? These days, we have so much more RAM and CPU power, I wouldn’t bother with such optimization, but in those days it was quite important, getting calculations done all day long instead of when there was some gap in the A-queue. Sometimes that did not happen until 10PM or later. I spent lots of long nights waiting for output… The real horror was the C queue which never got run before midnight and often not even then if the system engineers were tinkering…
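    For anyone curious, one sweep of SOR for Laplace's equation looks roughly like this in modern C (floating point here for clarity; my integer arithmetic and negative values flagging the boundaries were period tricks layered on the same idea):

    #define N     128
    #define OMEGA 1.8   /* over-relaxation factor, 1 < OMEGA < 2 */

    /* One successive over-relaxation sweep for Laplace's equation on a 2-D grid.
       Boundary points keep their fixed potential and are skipped. */
    void sor_sweep(double v[N][N], const int boundary[N][N])
    {
        for (int i = 1; i < N - 1; i++) {
            for (int j = 1; j < N - 1; j++) {
                if (boundary[i][j])
                    continue;
                double avg = 0.25 * (v[i - 1][j] + v[i + 1][j] +
                                     v[i][j - 1] + v[i][j + 1]);
                v[i][j] += OMEGA * (avg - v[i][j]);
            }
        }
    }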

  49. oiaohm says:

    DrLoser, it might be an idea to check this out. I was the one who instructed the ReactOS developers on how to build their compiler environment, and on how hand-optimizing C code can end up producing poorer code than letting the compiler's optimizers do their job. You happen to have walked into one of my areas of high skill.

  50. oiaohm says:

    DrLoser and Deaf Spy, are you aware that you can intentionally issue instructions that push data or instructions into the CPU's L-level caches in advance of need? Gcc does not insert these instructions unless you enable one of the optimizing levels, -Os, -O2 or -O3.

    The problem you are pointing to, where something gets magically faster at particular points, happens less commonly in arch-optimized code. Instead, high performance becomes the norm.

    https://software.intel.com/en-us/blogs/2012/09/26/gcc-x86-performance-hints

    If you tie a compiler's hands behind its back you can sometimes make what some would call defects appear. But when the defect is simply that arch-dependent code is not generated, that raises some serious questions. Are you just being academic and not caring about real-world issues?

    DrLoser, I am not even bothering to look for the PDF and never have. So far you are proving it is exactly what I suspected: something the optimizer deals with these days.

  51. oiaohm says:

    DrLoser, by the way, with different versions of GCC and LLVM the point where performance changes in non-optimized builds will be different. There is a very good reason not to use GCC or LLVM with optimization off: with optimization off, performance is almost totally unpredictable.

  52. oiaohm says:

    (I would purely love to see the “optimising compiler” that can do that on a provable and regular basis. Because such a thing does not exist.)
    DrLoser, LLVM and GCC both do it in profile-guided mode regularly. Once the compiler knows where a hot path is, it gets far more aggressive on those code paths, and it can also narrow the options massively.
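    Roughly, the gcc side of that workflow looks like this (a minimal sketch; the flags are the standard PGO ones, the toy loop is only an illustration):

    /* Build once to record a profile, run on representative input,
       then rebuild using the recorded profile:

           gcc -O2 -fprofile-generate pgo_demo.c -o pgo_demo
           ./pgo_demo
           gcc -O2 -fprofile-use pgo_demo.c -o pgo_demo

       With the profile in hand the compiler knows which branch is hot
       and can lay out, unroll or restructure that path more aggressively. */
    #include <stdio.h>

    int main(void)
    {
        long hits = 0;
        for (long i = 0; i < 100000000L; i++) {
            if ((i & 7) != 0)   /* taken 7 times in 8: the hot path */
                hits++;
        }
        printf("%ld\n", hits);
        return 0;
    }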

    DrLoser, the main reason for the focus on LLVM is that LLVM is used by the Linux .NET environment and gcc is not. It is not that LLVM can do a better job than GCC; the example is .NET, so I restricted myself to .NET tools.

    In this case, oiaohm is not respecting the invariants. The invariants are when the clock starts, and when it stops.
    Function-duplication removal. Without it you would have seen 10 starts and 10 ends in the unrolled form. If you have in fact profile-guide-optimized through those functions, you end up with my result. This is one reason why having library sources with debugging data is so good.

    http://www.gotw.ca/publications/optimizations.htm
    Interesting, right? Code optimization is different for single-core vs multi-core.

    (bear in mind, gcc compiled with -O0 and -std=c99 and no other frills)
    No Linux program is normally built with -O0, because that means no optimization at all. The usual optimization levels are -Os, -O2 and -O3; -O1 is so lightly used it hardly figures. Note that these -O flags are language-neutral under gcc, because every language gcc supports is converted to GIMPLE and then optimized. So what is said about gcc Fortran applies equally to C, C++ and every other language gcc supports.

    https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html

    LLVM also uses an intermediate representation, so the optimization flags have almost nothing to do with the source language.

    The argument that languages differ when it comes to optimization is bogus for a large number of real-world compilers.

    DrLoser, also go rebuild your examples in LLVM with optimization disabled; you will notice that the key stall points are different again. GCC and LLVM don't generate identical machine code.

    The reason Linux keeps clocks per CPU core is a practical one: locking.

    And once you’ve figured that out, how would you go about constructing a generalised version of that rather strange-looking bit-shift operation?
    This is the problem, DrLoser: you have only been able to demonstrate the fault with optimization disabled. In reality, no program running on Linux is normally built that way.

    concatenating the insides of loops
    This is one of the arch-specific optimizations gcc and LLVM perform to avoid cache-line issues.

    The big thing to remember here, DrLoser, is that Deaf Spy said number 5; number 5 holds for a Microsoft JIT environment, while for a non-optimizing gcc you found it was number 7. With LLVM you will find a different value again. In reality, the problem you are talking about is so compiler-dependent it isn't funny; it is just as much compiler-dependent as arch-dependent.

    Turn the optimizers on and watch what happens, DrLoser.

  53. DrLoser says:

    I don’t think that Deaf Spy will mind me pointing this out after so long, Robert, but (unlike oiaohm) you have actually hit the root cause of the problem on the head!

    At optimization level -o3 or above, DIGITAL Fortran 90 attempts to unroll certain innermost loops, minimizing the number of branches and grouping more instructions together to allow efficient overlapped instruction execution (instruction pipelining).

    Yes, it’s all about “efficient overlapped instruction execution (instruction pipelining).”

    The two of us have been playing coy all along. And, as I mentioned earlier, I’m not a fan of this “exercise for the student” rubbish.

    You, Robert, in this case the student, are essentially 90% there. It is indeed to do with the efficiency of overlapped (actually sequentially inserted, but no matter, just terminological inexactitude) instructions in a pipeline.

    Here’s a fairly big hint: what happens when the “overlapped instructions in a pipeline” become inefficient? Which is what is happening here.

    Once you’ve figured that out, under what circumstances do they become inefficient? (Deaf Spy’s original presentation practically gives this away.)

    And once you’ve figured that out, how would you go about constructing a generalised version of that rather strange-looking bit-shift operation?

    Extra clue on that last one: you can't. Or, if you can, a Fields Medal awaits you.

  54. DrLoser says:

    It's optimization like I just did that screwed people over when they attempted to make time-based random-number generators based on mass sampling of how fast loops and other things run, because the timing responses can end up bogus.

    Does the phrase “hoist by your own petard” mean anything to you, oiaohm?

    May I gently point out that, if you are not actually starting the clock at the right time, and stopping it at the right time, then your consequent results are going to …

    … “end up bogus?”

    Leave the tricky stuff to the professionals, Fifi. Once again — back to the lamp-post with you!

  55. DrLoser says:

    A question for oiaohm:

    Why on Earth do you seem to believe that LLVM would do a better job of this than gcc? Because it doesn’t, you know.

    Of course, it would help a tad if you actually understood the problem in the first place. No amount of sophisticated modern compilation tools are going to help you with that.

    Bit of a shame you haven’t even managed to make a single dent in the problem over five whole days of feverish PDF googling, isn’t it?

    (And let me remind you, it took me all of fifteen minutes or so. Plus which I now have a test rig for the problem, and all you have is blurry microwaved Gish-Galloping fantasy.)

    So, what about it, oiaohm? I’ve gone from total ignorance of the subject to something approaching mastery — including two minor complaints about Deaf Spy’s algorithm, plus a Debian Wheezy test rig — and you? You?

    You, oiaohm? You have nothing.

    Show us something.

  56. DrLoser says:

    We had to switch from floating-point to integer operations on the old machines to get any further speed-up.

    Please let me know the details of any nuclear device you went anywhere near, Robert.

    Because the Numerical Analysis implications of that absurd micro-optimisation do not bear thinking about.

  57. DrLoser says:

    At optimization level -o3 or above, DIGITAL Fortran 90 attempts to unroll certain innermost loops, minimizing the number of branches and grouping more instructions together to allow efficient overlapped instruction execution (instruction pipelining).

    Entirely irrelevant. Let me remind you of oiaohm’s silly little algorithm:

    for (int i = 0; i < array.Length; i++)
    {
        if (array[i] < 0) count0++;
        if (array[i] < 1) count1++;
        if (array[i] < 2) count2++;
        if (array[i] < 3) count3++;
        if (array[i] < 4) count4++;
        if (array[i] < 5) count5++;
        if (array[i] < 6) count6++;
        if (array[i] < 7) count7++;
        if (array[i] < 8) count8++;
        if (array[i] < 9) count9++;
        if (array[i] < 10) count10++;
    }

    That isn’t “unrolling a loop” at all, is it?

    It’s merely cheating by actually concatenating the insides of loops.

    You cannot “unroll a loop,” Robert, unless you respect the invariants.

    In this case, oiaohm is not respecting the invariants. The invariants are when the clock starts, and when it stops.

    Do you really have any interest in defending this obviously foolish proposition? Because, I assure you, your friendly local compiler will respect invariants. And it won’t bork the entire thing up, as oiaohm chose to do.
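    For the avoidance of doubt, genuine unrolling looks roughly like this (a sketch only; your compiler's actual output will differ):

    /* Unrolling replicates the body of ONE loop, with a remainder loop
       for the leftover iterations, and the invariants are untouched. */
    int count_below(const int *a, int n, int boundary)
    {
        int count = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4) {        /* main loop, unrolled by 4 */
            count += (a[i]     < boundary);
            count += (a[i + 1] < boundary);
            count += (a[i + 2] < boundary);
            count += (a[i + 3] < boundary);
        }
        for (; i < n; i++)                  /* remainder loop */
            count += (a[i] < boundary);
        return count;
    }

    Concatenating eleven different boundary tests into one loop body, by contrast, fuses eleven separate measurements into a single timed region, which is exactly the invariant problem described above.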

  58. DrLoser says:

    I can assure you, I run not that other OS on any computer.

    And what a truly wonderful statement of principle that is, Robert. An even better statement of principle would be to forswear even talking about its internals, its security problems, its tendency to enslave people via the EBIL EULA.

    I mean, you never filth yourself with Da Bloat, do you? Which unfortunately means that you have no first-hand experience.

    And, as a fully-accredited Scientist with a modest little side-line in Parboiling Frogs Coated With Free Yeast, you will, I am sure, appreciate the distinction between running your own experiments and merely listening to hearsay. Normal people, by the way, would phrase that as “I can assure you, I do not run …” Nice touch channelling Yoda there.

    Here are two propositions:

    1) You can buy a quad core Intel 2GB RAM 32 GB ROM for $109. Feel free to pave over the free Windows 8.1 with the distro of your choice. (I understand that most Linux distros support Intel hardware.) OR!
    2) You can buy an “odroid,” quad core ARM, 2GB RAM, no ROM that I can see, for $179. Maybe. When it ships.

    Never in a million years would you consider either, so I imagine that the 65% cost/benefit differential is of no consequence to you.

    And, since you have kindly let us all know about your interest in cleaning fluff out of a computing device via a Q-Tip, the fact that the cheaper of the two actually comes in a box is most probably a disadvantage.

    In general, I would accept the “wholesale volume discount” argument, Robert. But, in this case? $179 against $109?

    Yer havin’ a laugh, aintcha?

  59. DrLoser wrote, “I would purely love to see the “optimising compiler” that can do that on a provable and regular basis. Because such a thing does not exist.”

    ” At optimization level -o3 or above, DIGITAL Fortran 90 attempts to unroll certain innermost loops, minimizing the number of branches and grouping more instructions together to allow efficient overlapped instruction execution (instruction pipelining). The best candidates for loop unrolling are innermost loops with limited control flow.
    As more loops are unrolled, the average size of basic blocks increases. Loop unrolling generates multiple copies of the code for the loop body (loop code iterations) in a manner that allows efficient instruction pipelining.
    The loop body is replicated a certain number of times, substituting index expressions. An initialization loop might be created to align the first reference with the main series of loops. A remainder loop might be created for leftover work.”

    That’s old-time FORTRAN stuff. The number-crunchers loved it. We had to switch from floating-point to integer operations on the old machines to get any further speed-up.

  60. DrLoser says:

    Now all you have to do, Fifi, is to google “Herb Sutter PDF” and the appropriate other keyword. Say, cache-busting. Or pipeline-busting. I’m totes convinced that you’ll get there eventually.

    I’m also equally convinced that you won’t understand a single word of the PDF in question.

  61. DrLoser says:

    CPU timing was in microseconds, in case anybody was interested:

    long long elapsed = (etimer.tv_sec - stimer.tv_sec) * 1e6
    + (etimer.tv_nsec - stimer.tv_nsec) / 1e3;

    Naturally, VirtualBox made this almost impossible to copy over. Nasty cheapskate crap is it, My Precioussss….

  62. DrLoser says:

    Good-Oh. Here we go, then.

    Without much effort, you will notice that the Deaf Spy Arithmetical Solution appears to have been shifted down one row in each case. Sorry 'bout that. I'm sure I could fix it up, but I burned too much time fiddling with “community solutions” to the retarded and fragmented world of the Linux Real Time Clock.

    You’ll also note that, in both cases (bear in mind, gcc compiled with -O0 and -std=c99 and no other frills), there’s a weird hiccough. For “Sutter calcs” it occurs here on the eleventh boundary. For “Deaf Spy calcs” it occurs here on the seventh boundary.

    In both cases, the apparent CPU clock time dips by around 30-60%. Now, I’ve run this little experiment about ten times, and I usually see the same sort of strange hiccough … but not at the same boundary condition. I wonder why?

    Here’s a snippet from my code (the rest available on request, as I said):


    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &stimer);
    ...
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &etimer);

    You’d think that a real-time clock would be quite accurate when you specify “CPU time,” wouldn’t you? And in a sense, here it is, doing just that. I’d have to look at the compiled assembler to be sure, but it seems that the kernel has happily switched between cores, thus making the measurement fairly useless. I haven’t examined the compiled assembler, but I was minded to look this phenomenon up:

    If the CPUs in an SMP system have different clock sources then there is no way to maintain a correlation between the timer registers since each CPU will run at a slightly different frequency. If that is the case then clock_getcpuclockid(0) will return ENOENT to signify this condition. The two clocks will then only be useful if it can be ensured that a process stays on a certain CPU.

    Ignoring inter-CPU clock drift, which is irrelevant in this case … what this is telling me is that the Linux Real-Time Posix-Compliant Clock is basically useless if you’re dealing with anything other than a single core. What I have presented here is, apparently, a sort of vague “Kernel approximation to a CPU timer.”

    Which, sadly, is not of much use in the real world. Linux does me over again!

    There’s still hope, though. If anyone out there can provide me with a gcc compiler/linker to bind the damn test to a single core, I’m prepared to try again.
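    (For the record, the obvious candidate looks to be sched_setaffinity(); a rough sketch, assuming Linux and glibc, with taskset -c 0 ./bench as the lazy shell equivalent:)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to one core before starting the timed loop,
       so CLOCK_PROCESS_CPUTIME_ID is not sampled across different cores. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);   /* 0 = this process */
    }

    int main(void)
    {
        if (pin_to_core(0) != 0)
            perror("sched_setaffinity");
        /* ... clock_gettime(), the counting loop, clock_gettime() ... */
        return 0;
    }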

    Oh, and Fifi? Just to repeat: your little exercise in futility completely missed the point. Give me a better algorithm inside the central loop, and I promise to try it out.

  63. DrLoser says:

    I believe that oiaohm has got at least one thing right on this thread, although it looks like he tripped over it by accident. Particularly since, for reasons that only he can understand, he chose to do the reverse of loop unwinding, and actually concatenated the insides of the loop.

    (I would purely love to see the “optimising compiler” that can do that on a provable and regular basis. Because such a thing does not exist.)

    Which of course, as Deaf Spy notes, is utterly irrelevant to the exercise. As Wolfgang Pauli would say, it’s not even wrong.

    I’ve just spent an entire hour fighting Linux’ cretin implementation of real-time clocks (O for the days of Solaris, long since cannibalised by this dreary little kiddie toy), and I’ve finally fought it into shape. The C version is available on demand.

    Now, sadly, because oiaohm hasn’t actually addressed the problem at hand, I can’t include his “solution” in these statistics. (Built via gcc on Debian Wheezy, for those wot is interested.) But I can compare the Deaf Spy Arithmetical Solution to the Herb Sutter PDF solution:

    Sutter's version:
    -----------------
    counter is 0, cpu elapsed is 478
    counter is 10700, cpu elapsed is 628
    counter is 20200, cpu elapsed is 493
    counter is 30100, cpu elapsed is 518
    counter is 39800, cpu elapsed is 523
    counter is 49500, cpu elapsed is 569
    counter is 58400, cpu elapsed is 567
    counter is 68400, cpu elapsed is 525
    counter is 79000, cpu elapsed is 518
    counter is 89400, cpu elapsed is 467
    counter is 100000, cpu elapsed is 164

    Deaf Spy's version:
    -----------------
    counter is 10700, cpu elapsed is 510
    counter is 20200, cpu elapsed is 556
    counter is 30100, cpu elapsed is 450
    counter is 39800, cpu elapsed is 672
    counter is 49500, cpu elapsed is 551
    counter is 58400, cpu elapsed is 545
    counter is 68400, cpu elapsed is 515
    counter is 79000, cpu elapsed is 301
    counter is 89400, cpu elapsed is 552
    counter is 100000, cpu elapsed is 523
    counter is 100000, cpu elapsed is 513

    Let’s see if that passes through the WordPress HTML filter first …

  64. oldfart wrote, “That third world is just as likely to be running both windows and intel on x86 as you are”.

    I can assure you, I run not that other OS on any computer. There is an old Xbox that got dumped in my foyer somehow, but it’s not been fired up and I have no clue what software runs on it. It’s true we have a bunch of x86 but those are legacy systems. A new one hasn’t been bought in about five years and I’m pretty sure the next purchase will be an opportunity for ARM. I just wish they would die sooner so I could exercise that option… I have two Atoms, for instance. One is noticeably slower than the other but the CPUs are identical. It’s probably the storage. Neither would embarrass a modern ARMed machine. e.g. The last kernel I built on the slow one went on for more than 30 minutes. The Little Woman’s thin client is giving some problems but that is really sad hardware, ~400MHz, and from Via to boot. I’m sure an ARMed client thin or not would please her. Her smartphone, although burdened by BlackBerry, is plenty fast enough. Her major effort is with the browser, after all, but she works with images and small videos and spreadsheets, stuff LibreOffice and Gimp/ImageMagick handle quite nicely. We may fix her old machine if it’s just a PSU, but otherwise, an ARM would make her a great machine for her office. FLOSS, unlike stuff on that other OS is not locked in to x86.

  65. DrLoser says:

    Unbalanced parentheses there or that’s not the whole story.
    Thank you, Mr. Pogson. Forgot to escape the characters…

    I now await oiaohm’s critique of this implementation, which has at least two flaws that I can see.

    Over to you, Fifi!

  66. oldfart says:

    So, when governments/businesses adopt GNU/Linux on Intel they are utter fools but if the choice is between ARM and Intel, it’s a different story, according to oldfart. I don’t think so. These days folks know they have a choice and increasing numbers are choosing technology not blessed by Intel and oldfart.”

    Nor is the entire world populated by cheapskates like Robert Pogson. Even in the so-called third world, it is understood that you get what you pay for. That third world is just as likely to be running both windows and intel on x86 as you are, and I can assure you that the majority are not going to start doing their desktop computing on their smart phones, if only because smartphones are not designed and sold to perform as primary desktops.

  67. oldfart wrote, “In a business context its even simpler – None of your reasons have any validity,
    Again, end of story.”

    No, that’s not the end of story. Businesses can usually buy in bulk. The retail price to consumers this year is not the end of the story either. That will drop quite a bit when there are more choices like this in the market. e.g. early releases of new technology from Intel can be ~$1K for technology that ends up ~$100 a year or two later.

    So, when governments/businesses adopt GNU/Linux on Intel they are utter fools but if the choice is between ARM and Intel, it’s a different story, according to oldfart. I don’t think so. These days folks know they have a choice and increasing numbers are choosing technology not blessed by Intel and oldfart.

    It’s interesting to see whether that opinion, held by oldfart and perhaps indicative of USAian thought, holds elsewhere.
    Clearly, India has an opposite view, where GNU/Linux is being widely adopted in the workplace compared to use at home by consumers. Why would they adopt inferior technology at higher prices? [SARCASM]

  68. Deaf Spy says:

    Unbalanced parentheses there or that’s not the whole story.
    Thank you, Mr. Pogson. Forgot to escape the characters:

    count += (((~(1 << 31) + a - boundary) | a) >> 31) + 1;
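    In C it drops straight into the counting loop like this (a sketch; it leans on 32-bit two's-complement ints and a sign-propagating right shift):

    /* The shift by 31 smears the sign bit of ((INT_MAX + a - boundary) | a)
       across the word, giving -1 or 0, and the +1 turns that into 0 or 1.
       No comparison, no branch, nothing for the branch predictor to guess. */
    int count_arithmetic(const int *array, int n, int boundary)
    {
        int count = 0;
        for (int i = 0; i < n; i++) {
            int a = array[i];
            count += (((~(1 << 31) + a - boundary) | a) >> 31) + 1;
        }
        return count;
    }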

  69. oldfart says:

    “Let’s see, to escape the Wintel monopoly, to try out ARM on the desktop, to save the planet etc., stuff that’s hard to price. What price is freedom?”

    Actually it's quite easy to price. The ARM solution costs $70.00 more than the Windows solution. End of story. In a business context it's even simpler – none of your reasons have any validity.

    Again, end of story.

  70. Deaf Spy wrote, “count += (((~(1 31) + 1;”

    Unbalanced parentheses there or that’s not the whole story.

  71. Deaf Spy wrote, “why would a sensible human being spend $70 more for inferior hardware to run the same software (LO)?”

    Let’s see, to escape the Wintel monopoly, to try out ARM on the desktop, to save the planet etc., stuff that’s hard to price. What price is freedom?

    Further, the “inferiority” of the hardware may well be irrelevant. I’m sure it can keep up with type-type-typing just fine. Further, the “inferiority” of the hardware is just a conjecture. Show us the benchmarks…

  72. Deaf Spy says:

    I thought I’d take a stab at this one.

    Quite on the right path, doctor. Here is the solution:

    count += (((~(1 <> 31) + 1;

    No point to hide it from Ohio. He can’t make any use of it even if it hits him in the face a few times.

  73. Deaf Spy says:

    That’s why we chose GNU/Linux long ago.
    Good for you, Pogson, but the other “we”, i.e. all 98% others, chose Mac or Windows.

    But really, tell me, why would a sensible human being spend $70 more for inferior hardware to run the same software (LO)?

  74. Deaf Spy wrote, “comes in a box, with Windows license included.”

    Yeah, and a lot of malware, re-re-reboots, broken updates, restrictive EULA… Yes, we know. That’s why we chose GNU/Linux long ago.

  75. Deaf Spy says:

    You have loop nest and loop order optimizations and many more loop optimizations and order optimizations.
    Completely irrelevant, and therefore, incorrect. Try again.

  76. Deaf Spy says:

    It’s advertised for $179. That’s pretty good for its first year. By Christmas there will be a bunch more competitors on the market. This could be the year. All that in 5W. RIP Wintel.

    In the meanwhile, you can buy this:
    The top power of the power supply is 15W, meaning it works at much less. At the same time, it has much more CPU power, better connectivity, 32GB storage, comes in a box, with Windows license included.

    And, most important, it can run LibreOffice today. Not tomorrow, not next Christmas, not in two years, but today. Today, anyone can download LO, make a few clicks, and have it all running. No hacks, no pre-compilations, no crap.

    Today, and for 70 bucks less.

    No sensible person would pay more to get less.

    RIP Wintel
    Dream on.

  77. oiaohm says:

    It's optimization like I just did that screwed people over when they attempted to make time-based random-number generators based on mass sampling of how fast loops and other things run, because the timing responses can end up bogus.

    Not enough compilers are programmed to look for particular functions and given directions on what cheats they may pull.

  78. oiaohm says:

    Deaf Spy, there are ways of restricting the levels of optimization.

    1. The program should require the user to type in the values; using constant values allows the compiler's optimizer to go nuts.
    2. Even with typed-in values, that thing would have been reordered in a major way.

  79. oiaohm says:

    Deaf Spy and DrLoser both are idiots.

    Actually, there is an arithmetic refactoring which works even in both C# and Pascal. 🙂

    A friend of mine, a great mathematician with excessive knowledge in computer algorithms managed to devise it. I admit it was beyond my algebra abilities.
    Here Deaf Spy almost gets it.

    http://en.wikipedia.org/wiki/Loop_nest_optimization

    You have loop-nest and loop-order optimizations and many more loop and ordering optimizations. When you come to AOT solutions, the code you write and the code that runs are two very different things. Also add in auto-inlining options: inlining sees the sub-function disappear.
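    Hand-written, the simplest loop-nest optimization is cache blocking (tiling); something like this (sizes are only illustrative, and c[][] is assumed to start zeroed):

    #define N 512
    #define B 64    /* tile edge, chosen so a tile of each operand fits in cache */

    /* Blocked matrix multiply: each B x B tile of a and b is reused while it
       is still resident in cache instead of being streamed through repeatedly. */
    void matmul_tiled(const double a[N][N], const double b[N][N], double c[N][N])
    {
        for (int ii = 0; ii < N; ii += B)
            for (int kk = 0; kk < N; kk += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++) {
                            double aik = a[i][k];
                            for (int j = jj; j < jj + B; j++)
                                c[i][j] += aik * b[k][j];
                        }
    }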

    Let's say I tell the compiler it is single-core and optimize the heck out of it; what does the resulting program look like, starting from Deaf Spy's example?

    using System;
    using System.Diagnostics;

    class Program
    {
        static int arraySize = 10000;

        static void Main(string[] args)
        {
            int[] array = new int[arraySize];
            int count0 = 0;
            int count1 = 0;
            int count2 = 0;
            int count3 = 0;
            int count4 = 0;
            int count5 = 0;
            int count6 = 0;
            int count7 = 0;
            int count8 = 0;
            int count9 = 0;
            int count10 = 0;
            Random rand = new Random();
            for (int i = 0; i < arraySize; i++) array[i] = rand.Next(10);
            Stopwatch watch = new Stopwatch();

            watch.Restart();

            for (int i = 0; i < array.Length; i++)
            {
                if (array[i] < 0) count0++;
                if (array[i] < 1) count1++;
                if (array[i] < 2) count2++;
                if (array[i] < 3) count3++;
                if (array[i] < 4) count4++;
                if (array[i] < 5) count5++;
                if (array[i] < 6) count6++;
                if (array[i] < 7) count7++;
                if (array[i] < 8) count8++;
                if (array[i] < 9) count9++;
                if (array[i] < 10) count10++;
            }
            watch.Stop();
            Console.WriteLine("Count = {0}, Time = {1}", count0, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count1, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count2, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count3, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count4, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count5, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count6, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count7, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count8, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count9, watch.ElapsedTicks);
            Console.WriteLine("Count = {0}, Time = {1}", count10, watch.ElapsedTicks);
            Console.ReadLine();
        }
    }

    That is not fully optimized; 0 ends up faster still if you let the solver at it. This is what happens when you tell the compiler, with loop-nest optimization and loop refactoring, that you want a single-threaded version of the program.

    The reordered version is 100 percent CPU-cache friendly. A JIT cannot do this because this form of refactoring takes a lot of time. Even my example is only a primitive representation of what a good compiler will do to the code. The reality is that what you code and what the CPU runs are only roughly the same; most CPU issues are solved by reordering and refactoring the code to match the CPU. As in this example, you may use slightly more RAM.

  80. DrLoser says:

    “Simple, elegant, beautiful, and yet baffling.

    “I feel the need to dig deeper. In the meantime, would you care for a boiled frog sandwich? I saved 5¢s on the bread by growing the yeast in the frog’s vivarium!”

  81. DrLoser says:

    Robert Pogson, lacking imagination and drowning in skepticism, didn’t write:

    “Oh, I see what you’re getting at with that simple little pedagogical example of a program that poses the following apparent paradox, to whit: why should the distribution in question be symmetrical around the median boundary value?

    My observations and opinions about IT are based on 40 years of use in science and technology and lately, in education.

    “As a scientist, as a technologist, as an educator, I naturally find this a simple yet beautiful and instructive example of how complex systems behave in unexpected ways.

    “I have spent a very productive life encouraging my colleagues and my students to examine the simple, the beautiful, and the instructive. I am not yet prepared to abstain, myself.

    “This is a simple, a beautiful, and instructive example of a surprising algorithmic application that I do not understand. I have, of course, used Freedom One here, because the source code is available and unencumbered.

    “I now need to turn to my neighbours, who also believe in the Four Freedoms. Specifically, Freedom Two!

    “I turn to you, Neighbours, in the Libre Spirit of the Four Freedoms! Here, as my Gesture of Friendship, is a Cup of Free Yeast! I offer it in exchange for the answer to your question!”

    Well, good try, Robert, but yeast really isn’t free, you know. Admit that one simple fact, and either Deaf Spy or I will explain the trivial issue at stake.

    And, as a bonus, if you ask really nicely and stop calling people “twits,” you could actually score a coup for your site. Maybe Deaf Spy can be chivvied along into actually revealing his friend’s interesting bit of complex arithmetic.

    oldfart, lacking imagination and drowning in skepticism, wrote, “if you could demonstrate on an ARM based system that was built like a standard desktop running versions of a mix of real world applications compiled for ARM on a Linux desktop setup, then I might concede the point.”

    See the Debian repository. Even LibreOffice is built for ARM.

    See a demo on YouTube.

    This is getting really close, I think. It could use more RAM and storage and gigabit networking, but it’s certainly usable. I think USB3 allows pretty good I/O to storage. Haven’t used that. It’s advertised for $179. That’s pretty good for its first year. By Christmas there will be a bunch more competitors on the market. This could be the year. All that in 5W. RIP Wintel.

  83. DrLoser says:

    It’s a beautiful and elegant problem.

    I see this as a race between Robert and oiaohm, to see which one can dig the relevant PDF up first. (There are ample clues, multiply quoted, below.)

    My money is on Mr Pogson, because he has a genuine spirit of enquiry.

    All Fifi ever does is to dig up irrelevant cites, fail to read them, and splatter them into an imaginary wall of gibberish. But, to be fair, he might eventually stumble on the actual problem before Robert does.

    Let the race begin! Oh, and, Dougie? You’re obviously disqualified, through sheer technical and mathematical incompetence. Once you’ve passed your HSE, you’d trounce either one of the present competitors, I’m sure.

  84. DrLoser says:

    Actually, there is an arithmetic refactoring which works even in both C# and Pascal.

    One hopes that you will quote the relevant PDF for that one when the time comes, Deaf Spy.

    Never one to leave oiaohm floundering on his lonesome, I thought I’d take a stab at this one. It appears to be, um, tricky. On the (Intel) assembler level, I think you need to load CL with the currently indexed value in the array. You also need to have some sort of “magic number” loaded into DX, a different one for each boundary value. The most I can estimate about this “magic number” is that it’s going to be a bunch of zeros, followed by a bunch of ones, followed by a single zero (to cope with the boundary=0 case).

    At that point, I think, you use something like a shift-right with carry, thus setting up CF, and add the CF into the AX register (which of course was pre-zeroed using the fantastically pointless optimisation of an XOR AX, AX).

    An alternative approach would be to divide your sixty-four bit word length into 11 buckets of five bits each, load CL with the appropriate bit-shifting value for the number in question, shift right as before, mask down to the least significant bit as before, and accumulate.

    I don’t think either one of these approaches really fulfils the requirement, although either one of them will avoid the hem hem oiaohm stop dawdling will you? and just tell us what the actual problem is.

    Shifts are cheap. Arbitrarily looping in chunks of 32 (exercise for the reader: necessary in the second case) is also reliably cheap. Masking a register is cheap. And, that accumulator thing? It was necessary in the first place.

    And some sort of arithmetical solution like this (as your friend has demonstrated) is obviously and countably going to be cheaper than the consequences of The Symmetrical Problem Around Boundary Five That We Dare Not Name On This Site.

    Now, both of these solutions are non-optimal.

    But, interestingly enough for oiaohm, if you could make either one work … they would be embarrassingly parallel. Certainly more so than expecting Pink Unicorns and Magic Fairy Dust and LLVM to somehow descend onto the original quoted pedagogic example and somehow divine that, although it was bleedingly obviously written with a single CPU core in mind, the “solution” is to …

    … throw more cores at it.

    Just out of interest, oiaohm, could you in some way show us the (completely irrelevant to the problem at hand) demonstration program that shows how N cores, for some value of N, will solve the problem? I mean, it’s a pedagogical problem, designed to demonstrate either the issue of cache-busting or pipeline-busting. N cores doesn’t really make a difference.

    Nevertheless, I think your code snippet would be of considerable interest.

  85. oldfart says:

    I think ARM has arrived.

    The fact that an engineer with a custom prototyping board compiled a kernel faster than a mediocre AMD-based system means nothing to me. Now, if you could demonstrate an ARM-based system that was built like a standard desktop, running a mix of real-world applications compiled for ARM on a Linux desktop setup, then I might concede the point.

  86. oldfart wrote, “you will not know that until you run benchmarks on a linux host running ARM”.

    That’s false:

    1. We know the clockspeeds of ARM and their cache-sizes,
    2. we know the memory bandwidth of ARM and their memory-sizes, and
    3. Debian and others already build kernels and applications for ARM.

    So, while we may not have benchmarks of everything, I do know with certainty that the latest ARMed systems can easily keep up with my Beast idling as it does all day long with two users and hundreds of processes in 4gB RAM. I’ve also had experience with single-core 32-bit systems on AMD and Intel and know ARM has the power to keep desktop users happy, maybe not all of them all of the time but certainly most of us most of the time. That’s good enoughTM. It was only a few years ago that we knew ARM could keep up with PIIIish systems but we’re well beyond that now. Look at some benchmarks on ARM: Dual-core 1gHz kernel load/decompress, ~3s, with 1gHz CPU, 10s on 300MHz CPU. Those systems are orders of magnitude slower than the current CPUs, both CPU clock and memory bandwidth.
    Beast is only 3X faster than those clunkers (30X larger file, similar times):
    “file linux-3.18.4.tar.xz
    linux-3.18.4.tar.xz: XZ compressed data
    pogson@beast:~/Downloads/linux$ ls -l linux-3.18.4.tar.xz
    -rw-r----- 1 pogson pogson 80949368 Jan 27 10:45 linux-3.18.4.tar.xz
    pogson@beast:~/Downloads/linux$ time unxz linux-3.18.4.tar.xz

    real 0m13.539s
    user 0m8.288s
    sys 0m1.460s”

    Here’s a guy building linux kernels on one of the new Exynos 5422 boards:

    “make exynos_defconfig
    time make -j8 CC=gcc-4.9

    real 5m43.746s
    user 31m25.235s
    sys 4m11.130s
    So it takes less than 6 minutes to build the kernel on ODROID-XU3 Lite, about 4 times longer than on a powerful, but much more power hungry (300W) AMD FX8350 based computer.”

    Hint. That’s faster than Beast builds my AMD64 kernel which is about 7-30m depending on options.

    Here, I’ll build with the same command:
    make -j 8

    Kernel: arch/x86/boot/bzImage is ready (#1)
    real 8m7.309s
    user 23m5.656s
    sys 2m6.068s

    There is no reason to believe one of the new gadgets is not highly competitive with Beast which has Phenom II CPU @2.5gHz and DDR2 6400 RAM. Since Beast has no trouble keeping up with two users (except browsing with Chrome – FireFox is no problem) and hundreds of processes, I think ARM has arrived.

  87. oldfart says:

    “That’s why my Beast uses 95W when ARM can do the job in just a few watts.”

    But you will not know that until you run benchmarks on a linux host running ARM. Until you do so AND can get a desktop-style system built on ARM running ARM versions of your software, your assertion is more theory than fact.

  88. DrLoser says:

    P.S. I am a little sorry that Dr. Loser couldn't restrain himself from telling the results in public. It was such fun to see you hit walls blindly. Nevertheless, you keep hitting walls even with your eyes open.

    I considered hiding it, but I expected oiaohm’s floundering to be much, much funnier this way round. And so it came to pass.

    Besides, after a reasonable period (say 12 hours, which I think is what I allotted), this “an exercise left to the student” is a little tiresome. I think the occasional hint is in order, given that some attempt (however futile) has been made to address the issue.

  89. kurkosdr says:

    “The actual Nexus 6 did not show up until ca. 1/15/15.”

    No intention to offend, but why do you Americans like to write dates in “middle-endian” format?

    “Multi-cores will not help ARMs fighting Intel, neither will Moore’s Law, because Intel are already paces ahead there. Try something else.”

    Unfortunately, Intel is not a good match for Android. See, games have native code (which is why they're the only apps in Android which are fast), and that code has to be emulated if an x86 version of the game doesn't exist. And as most people know, emulation is no fun compared to having hardware implement the ISA directly.

    There was an article a while ago in The Register where Intel CPUs were benchmarked in games, and some games had performance issues compared to ARM while others just crashed. Now, I don't know if things have improved, but the point is, no serious OEM will make significant quantities of Android devices running on Intel when there are perfectly good Snapdragons available, because they don't want to take the risk of Intel's emulator screwing up games.

    Windows tablets is another matter of course.

  90. Deaf Spy wrote, “it became painfully clear you have no idea how modern CPUs work. “

    Pompous asses often have no leg to stand on, so they attack people who do. I've been programming since 1968 and have watched the progress of computing all this time. Everyone uses multiple cores for stuff that matters, because multiple cores permit higher throughput for the same power consumption, or lower power consumption, and faster servicing of more interrupts or processes. Moore's Law is what allows lower-power cores and more of them. ARM needs that to compete with Intel, and ARM is winning on price/performance. That's why Atoms exist. That's why my Beast uses 95W when ARM can do the job in just a few watts.

  91. Deaf Spy says:

    You can’t argue with the success of that plan but Moore’s Law and ARM have made it work
    Gee, I wonder how you can keep talking about Moore’s Law, when it became painfully clear you have no idea how modern CPUs work. Multi-cores will not help ARMs fighting Intel, neither will Moore’s Law, because Intel are already paces ahead there. Try something else.

  92. Deaf Spy says:

    For bonus points, there is a simple refactoring that makes the program entirely cache- and pipeline-friendly. Unfortunately it doesn’t work in Pascal, because the equivalent of a ternary operator is a statement rather than an expression

    Actually, there is an arithmetic refactoring which works even in both C# and Pascal. 🙂

    A friend of mine, a great mathematician with excessive knowledge in computer algorithms managed to devise it. I admit it was beyond my algebra abilities.

  93. Deaf Spy says:

    Ohio, try to stay focused, boy. You still haven't managed to put your finger on the issue I raised:

    In other words, why is this behaviour (on a defective, non-optimised compiler/CPU, if you wish to include that information) symmetrical around boundary=5

    P.S. I am a little sorry that Dr. Loser couldn't restrain himself from telling the results in public. It was such fun to see you hit walls blindly. Nevertheless, you keep hitting walls even with your eyes open.

  94. oiaohm says:

    Verizon custom-patches their ROMs after they get them from the ODM/OEM.

    So you have the OEM/ODM's delay from the Google release, then you have the carrier's delay, and each one can screw the thing up royally.

    Yes, there was a reason I bought a Nexus phone: I can bypass all the carrier crap and get a working install.

    “I would suggest, however, that you not be too surprised if, like Windows of old, any performance improvement is simply swallowed up by ISVs creating more elaborate programs for their customers.”
    oldfart, I have already commented on this and I am not repeating myself.
    http://mrpogson.com/2015/02/04/mini-beast-arm-cortex-a-72/#comment-242897

    The majority of applications on phones will lack complexity. Android applications have a habit of lacking complexity not because of a lack of CPU power, but because if you eat too much battery power the consumer will use something else.

    Android has very much an anti-bloat nature. Even so, one or two heavy applications do not bring down the performance of the whole device.

  95. oldfart says:

    “Samsung is at this moment starting to roll out Android 5.0.2 to phones. Every report, on old or new phones, is saying the same thing: Android 5.0 works.”

    Given every update I have gotten via Verizon, I would be surprised if they released one that didn't work. However, none of this technical trivia deals with my point, which was:

    “I would suggest, however, that you not be too surprised if, like Windows of old, any performance improvement is simply swallowed up by ISVs creating more elaborate programs for their customers.”

    I await your comments on this point, sir.

  96. oiaohm says:

    oldfart, the Nexus 5 is what I have, the generation before the Nexus 6. Android 5.0 is an update to it. I said released for Nexus devices; that includes the 5 and the 7 and many others.

    The majority of OEMs/ODMs are horrible at getting OS updates out. You buy a Nexus so you don't have to buy a new device when a new version of Android appears.

    There were Motorola phones on sale with Android 5.0 before the end of November 2014. That is the thing: the Nexus 6 was not the first phone on sale with Android 5.0, and you could have bought an Android 5.0 phone for Christmas last year as long as it was a Motorola.

    http://motorola-blog.blogspot.co.uk/2014/11/its-time-to-unwrap-android-5-lollipop.html

    Samsung is at this moment starting to roll out Android 5.0.2 to phones. Every report, on old or new phones, is saying the same thing: Android 5.0 works.

    DrLoser, as I have stated: post a Linux example of the problem; until then it's pointless. The example currently posted gets eaten alive by optimizers. So far you guys have played 'here is the example, guess the defect'. The problem is that, without knowing what defect I am supposed to demonstrate, it is not possible to disable the optimizers that would destroy its detection, or to rewrite the code in a style the optimizers cannot mess with.

    http://baptiste-wicht.com/publication_store/sampling_pgo.pdf

    LLVM and GCC do a hell of a lot, including calculating what will be required in the caches and emitting instructions to get that data into the caches in advance of when it is needed. What's more, they will slice loops up to fit cache sizes. So if this is meant to be an example of cache-line misses, I am sorry to say that to make the resulting gcc or LLVM code trigger it there are many specific code-reordering optimizations and solvers you would have to turn off.

    This is the difference between a JIT compiler and an AOT compiler. An AOT compiler can spend a lot more time optimizing, working out how functions need to be sliced up and reordered for best effect. If a JIT did this straight off the bat, the application would be too slow to start for the end user who wants it.

    Remember, Google's JIT is a quarter the speed of their AOT. The AOT gets to reorder the code to reduce cache misses and memory-transfer overloads.

    The multi-loop transformation is only one of many things compilers do for optimization.

    Java JIT applications get faster the longer they run, because optimizations to avoid CPU cache misses happen in a profile-guided way. Microsoft's .NET JIT is stupid: you don't see performance increase the longer the application runs, because it has no profile-guided optimization at all, and you don't get the compiler's static program-analysis data lined up with the CPU design.

    There are a huge number of examples of so-called CPU problems demonstrated with .NET code, yet you don't see those examples mirrored in Java or C or C++. Why? The compilers for the other languages are rather better.

  97. DrLoser says:

    Lets see how long it takes my samsung 5g to get updated.

    Wrong observation, oldfart. Let’s see how long it takes Dougie’s Samsung to get updated. Or Luvr’s Samsung to get updated. Or ram’s Samsung to get updated. Or even The Little Woman’s Samsung to get updated. (Unlike Robert, she has a functional need for it whilst earning money.)

    I suppose we could always wait for Fifi’s Samsung to get updated, but since it only exists in virtual space (the hard vacuum between his ears), we might have to wait a very long time.

    About the only commentator around here who clearly cares about such stuff is Kurks. And, sadly for the rest of us, he appears to have given up and decided to go for a Lumia.

    Ah, the follies of youth, eh, Robert?

  98. oldfart says:

    “oldfart, Android 5.0 is released; in fact it was released for Nexus devices on November 3, 2014.”

    Interesting. According to

    http://www.theinquirer.net/inquirer/news/2376049/nexus-6-release-date-price-and-specs

    The actual Nexus 6 did not show up until ca. 1/15/15. That's not really a very long time, eh?

    As for your having had 5.0.2 for a while, given your track record, sir, you will excuse me if I have my doubts about your veracity.

    Let's see how long it takes my Samsung 5g to get updated.

  99. Dr Loser says:

    Time for another episode in that popular series, Dr Loser answers the questions of biological analogies that you were too busy to ask!

    Tonight’s contestant, just to be fair, is a Mr Kurkos DR from Grecian-Land, United States of Europe. (My sponsors insisted on me putting it that way. Stupid Mid-Westerners.) Are Hippopotamuses Slow? Mr Kurkos opines as follows:

    even Nexuses, which are supposed to contain beefy hardware at the time of release, go from speedy jaguar to slow hippopotamus when the next update comes …

    True, there is unlikely to be a “Hippopotamus E-Type” available in the near future. I see this more as a failure of marketing than evolution, really. Owing to their defecatory tendencies, hippos are not an ideal branding opportunity.

    And you don’t see many of them on the shelves, because, well, they don’t really fit on shelves, and if you try, they will kill you. (More humans are killed by hippopotamuses every year in Africa than by crocs.)

    It’s a tragic story of Hippo branding failure, lack of Hippo shelf space, and naturally lack of Hippo salesmen. It’s a story that should resonate with the aficionados of the Linux Desktop, which suffers precisely the same issues. (Although, to be fair to Linux, it generally takes at least two months before it craps all over the place. On the other hand, the Hippo has a nifty way of twirling its tail when it does so.)

    The thing is, the top speed of a Hippo is ~30 mph, which isn’t too shabby. And a Hippo is a Beefy Beast — it packs a punch. (Or at least a bite.)

    May I be the first to suggest 2015 as:

    Year of the Hippo Desktop!

  100. DrLoser says:

    I’ll help you out further with another bit of your own, slightly speculative, information:

    Sorry 1 to 9 on mono on Linux run at the same reported speed when averaged over many runs.

    This comes down to the auto parallelizer in LLVM its kinda a little smarter.

    At some stage, I presume either I or Deaf Spy will need to explain the actual issue. But before we do, I would appreciate more information on the “auto parallelizer in LLVM.”

    Could you, in particular, explain how this auto-parallelization somehow produces near-identical results for boundary=0/10, boundary=1/9, etc?

    In other words, why is this behaviour (on a defective, non-optimised compiler/CPU, if you wish to include that information) symmetrical around boundary=5?

    Explain that, please. Because that is very precisely where the problem lies.

  101. DrLoser says:

    Really, how many incorrect arguments in a row are you TMR guys going for?

    I don’t really know. Is there a limit on this stuff? I gave up at eleven, which covers your original “attempts” to explain away Deaf Spy’s example … and then Kurks took over. I haven’t slogged through your responses to Kurks yet, but on previous form you’ve probably added another four.

    Fifteen incorrect arguments in a row? It might very well be your current record, oiaohm, although I think the threading discussion and particularly the Unicode discussions were in the ballpark. But don’t let us stop you from spurring yourself on to further, as yet inconceivable, levels of incorrectness.

    Here: I’ll help you with one of your earlier quotes that I missed:

    So without question 0 is the fastest because its basically doing nothing and running first before it MULTI THREADS the for loop.

    To make it easier for you, I have both high-lighted the especially incorrect part of your argument, and turned it into block capitals.

    Now, do you want to reappraise the relevant PDFs available on the Internet, and try once more to answer Deaf Spy’s question?

    Of course, it would help if you understood the problem in the first place …

  102. DrLoser says:

    DrLoser might as well give up asking what my jobs were they are always off topic questions that I will most likely never answer.

    Do you have an example that demos the fault that works on Linux. DrLoser???

    This is like the playground game of “I’ll show you mine, if you show me yours,” isn’t it, Fifi? Except that apparently you won’t show me yours.

    What a little tease you are.

  103. oiaohm says:

    Really, how many incorrect arguments in a row are you TMR guys going for?

  104. oiaohm says:

    oldfart, Android 5.0 is released; in fact it was released for Nexus devices on November 3, 2014. The performance gains are not being consumed quickly at all.

    Also, Windows has never delivered a 4x performance boost. The boost from changing Dalvik to ART is huge; application complexity has to go up by a heck of a lot to consume that.

    By the way, we are in fact up to Android 5.0.2. Those of us with Nexus devices have had 5.0 for quite a while.

    There is also a nasty downside that will slow down applications eating up ART's gains: application makers want to sell to as many devices as possible, and supporting only the latest release reduces possible sales.

    oldfart, basically you are resorting to the argument of a troll: let's pretend a released product is not released yet. Sorry, oldfart, in this case there is no basis for claiming that the gain will just be consumed.

    The issue with the animations was in fact lack of performance.

  105. oldfart says:

    “Sorry, Android 5.0 breaks the model of the newer version of Android being slower than the older version.”

    We can read too, sir. When it comes out we shall see. I would suggest, however, that you not be too surprised if, like Windows of old, any performance improvement is simply swallowed up by ISVs creating more elaborate programs for their customers.

  106. oiaohm says:

    kurkosdr
    http://www.ibtimes.co.uk/android-5-0-lollipop-how-improve-speed-performance-reduce-lags-extend-battery-life-l-speed-mod-1487063

    How long are you going to keep talking garbage? Android 5.0 Lollipop is 4 times faster than 4.4 in default mode, and it is possible to make Android 5.0 Lollipop faster still.

    Getting rid of Dalvik cured most of the animation problems.

    Sorry, Android 5.0 breaks the model of the newer version of Android being slower than the older version.

  107. kurkosdr says:

    Oh, and all the talk about ART doesn’t convince me. Remember “project butter” was supposed to make UI animations smooth.

    But project butter-y phones have choppy animations everywhere (not just Maps). The more google tries to fix Android, the worse it becomes.

    My next phone is going to be a Lumia, and you should do the same (or Ubuntu Phone? screw sanity)

  108. kurkosdr says:

    “My main gripes with Android were that it was needlessly slow and the licence.”

    If you found 2.3 slow, just wait for 4.x or, even worse, Lollipop.

    If you are still on 2.3, you haven't seen the unique ability of Android to literally devour Snapdragons/Exynoses in its full glory.

    oiaohm wrote, “dalvik is optional in Android 4.4 and newer and gone in Android 5.0. Dalvik has been replaced with ART in Android 5.0 Lollipop.”

    Hey! That’s good news. I haven’t been following Android that closely as I still use an ancient Galaxy S with 2.*… My main gripes with Android were that it was needlessly slow and the licence. I wonder if Google was feeling the heat from the imminence of GNU/Linux on smart thingies.

  110. oldfart, quoting oiaohm’s “Mind you, it’s the normal MS Troll argument to point at bugs in competing software and completely disregard the response time to those bugs”, wrote,
    “Because it is a distinction without a difference.”

    Oh, there’s a big difference. In the old days before M$ began to worry much about quality, folks were left hanging for years. Their only salvation was buying a new PC with a new licence for the software M$ shipped… More recently, the world stood by helplessly as millions of PCs were taken over by bad guys exploiting one of gazillions of holes in that other OS. These days, thanks to the complexity of M$’s software, they are an outlier in time to respond to reports made to them about security vulnerabilities. M$ has tied so many unrelated systems together that any change here to fix a vulnerability often has unintended consequences there and it takes M$ forever to get it right. Lately, they have released patches that really broke users’ systems. Perhaps oldfart thinks that doesn’t matter, but it does. Further, “7”, M$’s flagship in business, is waiting on death-row. Wanna bet M$ finds it increasingly difficult to fix vulnerabilities in “7” promptly?

    This factor is one of the most important reasons folks give for migrating to GNU/Linux. They can either fix things themselves, hire someone to do so or share in the warm glow of the benevolence of others who have fixed the vulnerabilities.
    “Enterprises consider Linux superior in technical prowess, security, and cost. In fact, 78 percent of enterprises feel Linux is more secure than most other operating systems, an important consideration in light of increasing scrutiny on the security of projects that support the world’s software infrastructure.

    The analysis in this report is focused on 262 respondents who work for organizations with sales of more than $500 million and/or 500+ employees. The majority (60 percent) identified themselves as IT/IS staff or developers and represented a wide range of industries. Users from the US and Canada make up 48 percent of the respondents, 26 percent are from Europe, and 11 percent are from Asia.”

  111. oiaohm says:

    oldfart, I have made excuses for Microsoft in the past. Sorry, it's not just about the ISVs I like or hate. If Microsoft were showing decent response times at the moment I would make the same excuse for Microsoft. Google is successfully doing 45 days; please note this is 15 days over Google's preferred target of 30 days. There are other vendors who do manage to pull off the 30 days, so if you were asking me to pick Google over one of the vendors who do 30 days, I would not. The reality is that Microsoft and Apple have a long way to go until they are best of breed. Google is not best of breed, but they are not worst either.

    kurkosdr, you say wonky; the problem here is insecure. Sometimes you have no option but to update and then sort out the wonky bits later.

    kurkosdr something interesting
    http://source.android.com/devices/tech/dalvik/
    Dalvik is optional in Android 4.4 and newer and gone in Android 5.0. Dalvik has been replaced with ART in Android 5.0 Lollipop.

    Interestingly enough, since ART produces native binaries it is in fact benchmarkable, and performance is way higher than Dalvik in Android 2.3. Of course there is a downside: installation takes longer. Yes, low performance of a new Android 5.0 Lollipop install can be caused by the applications you have just installed being converted to native code in the background.

    https://developer.android.com/about/versions/android-5.0.html#ManagedProvisioning

    Android 5.0 is give and take. The dropping of Dalvik means applications can in fact run faster; the price is that application installation is more costly. The introduction of managed profiles based on Knox ideas causes a performance hit.

    kurkosdr, the big question is how long you left the phone on Lollipop before you gave up. Was it in fact long enough for it to build the installed applications to native code?

    I guess you did an in-place upgrade without understanding that this can keep the phone busy for something like 8 hours building all installed applications to native code. Most who complain about Lollipop performance have been using the phone while it was still building applications. Yes, building in the background is a hell of a performance hit for a while; the advantage is that this only happens once per application install.

    The native-code mode also cures Android's slow starting of complex applications once they are installed and built to native code. The native code also allows Linux kernel memory management to perform more effectively.

    Yes, Android 5.0 should be faster than Android 2.3 if you let the ART compiler complete. In fact I do have a Nexus 5 and I can tell you it absolutely is, if you are willing to wait long enough. Sit the newly upgraded Nexus 5 on charge, disable power management and let it build.

    Yes, Google could have given out some upgrade warnings with Lollipop about the ART side effect.

  112. oldfart says:

    “Mind you, it’s the normal MS Troll argument to point at bugs in competing software and completely disregard the response time to those bugs.”

    Because it is a distinction without a difference.

  113. oldfart says:

    “Every software vendor releases buggy software at some point.”

    Exactly, so spare us the excuses for the ISV you like.

  114. kurkosdr says:

    PS: Putting apps in Google Play is an ugly workaround to the problem, because apps tend to be wonky when they run on an old Android base, and even slower (Maps is particularly horrible when running on an old base). Which is another plus for the OEMs. Your phone doesn’t become obsolete in one day, so you theoretically can’t complain because it still receives some updates. What happens is a slow ride to suckage, leading to replacement (yay!)

  115. kurkosdr says:

    “The whole idea of Android is inefficient except that it allowed a bunch of Java-programmers to jump in…”

    Android 2.3 ran Java (Dalvik VM) and had fewer performance issues. Google just doesn’t know how to make OSes, and for that reason they bloat their OS with each version release faster than Microsoft does with WP or even Windows.

    But hey, Moore’s law! Oh wait, it doesn’t make a new chip magically appear in my device. Android is an OEM’s dream. Direct planned obsolescence (no updates for you) and indirect planned obsolescence (we made your phone unbearably slow, even if it was a roaring Nexus), 2 in 1!

    No wonder OEMs love the damn thing.

  116. oiaohm says:

    Mind you, it’s the normal MS Troll argument to point at bugs in competing software and completely disregard the response time to those bugs.

    See Validation, Verification, and Testing of Computer Software, Issues 500-575, and many other books I could refer to. Unfortunately it's taught that there is no such thing as perfect software. This leads to the bug-riddled software we have today.

    The key metric is response time.

    Android not getting vendor updates is a major issue because that means slow response time. Google has taken more and more applications back into Google Play so that updates can be pushed out faster.

    It's sad, really: the downgrading of what open-source Android contains is linked to the OEMs'/ODMs' lack of willingness to provide clients with OS updates.

    The lesson to learn from Android, for anyone making a new OS, is to have an update system independent of the OEM/ODM.

  117. oiaohm says:

    oldfart, Microsoft also releases buggy software. I like the “all else is excuses” line.

    Every software vendor releases buggy software at some point.

  118. oldfart says:

    “Google yes might release buggy software…”

    Bottom line sir, Google is just as guilty as anyone of releasing “buggy” software. All else is excuses.

  119. oiaohm says:

    Really, nothing beats Windows 8.1 not wanting to have anything to do with working Microsoft USB keyboards and mice on Intel hardware.

    kurkosdr, is this not sad?

    Google has a 45-day policy to fix issues that they stick to. Remember, Microsoft cannot even get reported security faults fixed in 90 days. Apple also fails to meet the under-90-day requirement, so Apple is in the running with Microsoft for the worst software.

    Google, yes, might release buggy software, but at least a fix will come quickly.

    kurkosdr, WP had an equal issue with Bing Maps: when it was updated in 2012 it was causing crashes, and it took 7 months for Microsoft to fix.
    “If WP had the same problems as Android has, we would never hear Pogson stop whining about it, I guarantee it.”
    I think you need to take that back. There was not a single post by Pogson about it.

  120. kurkosdr says:

    “Google Maps stopped working after its most recent update.”

    Opposite situation here. The new “material design” Maps (to which I upgraded against my better judgement) used to lock up when I selected a result, requiring a force-quit. Sometimes, a second instance of Maps would pop up, then Maps would lock up. I worked around the problem by selecting a destination from the recommendations Maps makes as you type.

    With the latest Maps upgrade, no problems at all. The problem was probably caused by the mini-map that is presented when destination details are shown, with the two instances of Maps somehow causing a lock-up. In the latest version it was replaced with a JPEG image of the mini-map.

    Hooray for Google software! If WP had the same problems as Android has, we would never hear Pogson stop whining about it, I guarantee it.

    PS: Microsoft, hand-over your crown. The new king of rushed and buggy software is Google. Although both MS and Apple are serious contenders.

  121. luvr says:

    kurkosdr wrote, “I once made the mistake of upgrading Maps in my SGS3, and now it lags as if there is no tomorrow.”

    Sigh… For me, Google Maps stopped working after its most recent update. It simply crashes whenever I attempt to start it. This is on an ASUS Memo tablet. I removed all of its updates, but then it loses most of the features that made it particularly interesting to me.

    Yesterday, Firefox suddenly started to crash after an update. Shortly thereafter, a new update arrived and that problem, at least, was quickly solved.

    Some time ago, Google Hangouts suddenly crashed after an update. Another update, a week or so later, solved that problem, too.

    So, Android certainly comes with its own set of issues.

  122. oiaohm says:

    kurkosdr, KitKat vs Lollipop is not that simple. Lollipop adds the Knox security framework, and security-filtering frameworks annoyingly have overheads.

    Knox was required for access to particular markets. If you want to sell phones to the US government and many other governments, there is a list of required features. Worse, these features have to be stock or you don’t get on the certified list. Stupidly enough, they don’t require the same list of features for a laptop.

    kurkosdr, also remember ARM designs the Mali GPU; the newer high-res GPU versions consume less power running flat out than the older, lower-resolution Mali GPUs running at about 50 percent load. ARM Mali is not following the Nvidia and ATI/AMD model of just burning more power to increase performance.

  123. kurkosdr wrote, “please tell me how it’s Microsoft that’s making inefficient software”.

    The whole idea of Android is inefficient except that it allowed a bunch of Java-programmers to jump in… The cost of labour matters. GNU/Linux has been held up by the shortage of C-programmers but schools have been cranking out Java-programmers for a decade and businesses hired them… You can’t argue with the success of that plan but Moore’s Law and ARM have made it work just as Moore’s Law and Intel made that other OS acceptable to many for so long. That’s why I love GNU/Linux and FLOSS. The user can install just what he needs and even tweak it some to get the performance he needs from just about anything. That’s one of the reasons I hate systemd because it increases the dependencies of application-level software to the point that a “minimal system” doesn’t exist. I liked Debian for years because I could skip the tasksel stuff, all of it, get a really minimal system that booted in a few seconds even on old hardware and install just what I needed rather than what someone else wanted me to run. systemd is pushing GNU/Linux to become more like M$’s OS, bloated crapware, a tangled mess with all kinds of unnecessary dependencies, attack-surfaces, sluggishness out of control of the user…

  124. kurkosdr says:

    I won’t play the of = I won’t play the game of

  125. kurkosdr says:

    I used to be super-excited about news like these, but now I am not.

    Give ’em more power, they’ll waste it with more inefficient code. And that goes out to everyone: Apple, Microsoft, Google, Canonical. But Google is the biggest culprit, since even Nexuses, which are supposed to contain beefy hardware at the time of release, go from speedy jaguar to slow hippopotamus when the next update comes (for example, the Nexus 5 was roaring with KitKat; with Lollipop it’s just meh). Dear Pogson, please tell me how it’s Microsoft that’s making inefficient software.

    Sometimes, I wonder if there will ever be a time when we are going to have enough hardware. You know, millisecond waiting times from the moment you press a button to the moment the app loads, boot times measured in single-digit seconds, and so on.

    Meanwhile, I’ll just stay with my old phone and not upgrade to the latest versions of my apps, with the exception of the browser. I once made the mistake of upgrading Maps in my SGS3, and now it lags as if there is no tomorrow. Sorry, I won’t play the of more hardware, fatter software anymore. It’s only a net profit for the hardware manufacturers from this point on; any features you may want are already in the old phone (SGS3, LG G2, Xperia Z, you name it; if you own one of these, you probably have all the features you want).

    PS: At least the graphics processors are getting better. But it’s not worth it on a game.
    PPS: All the new GPU power will of course be wasted on rendering 1440p graphics, on a phone, because screw sanity.

  126. oiaohm says:

    DrLoser might as well give up asking what my jobs were; they are always off-topic questions that I will most likely never answer.

    Do you have an example that demonstrates the fault and works on Linux, DrLoser?

  127. DrLoser says:

    To be fair, oiaohm, you haven’t called me an idiot once in this thread. (Apart from that one time where you inadvertently claimed that I was a “moron.”)

    I appreciate that restraint. It almost makes this sound like a civilised conversation.

    A civilised conversation with a donkey.

  128. DrLoser says:

    So your last job in IT would have been as a compiler expert, Fifi?

  129. oiaohm says:

    DrLoser, the true answer is that it does not matter what it is if the fault does not display itself. So why do I have to bother finding out, when it is the people making compilers who have dealt with the problem?

    You are not giving me an example that shows a problem. That mostly means the write-up is old and outdated, because methods to deal with it were found later and have since been integrated into compilers.

  130. DrLoser says:

    And your last job in IT would have been, Fifi?

  131. oiaohm says:

    DrLoser, since you like dangling so much, maybe someone should hang you off the side of a building.

  132. DrLoser says:

    The reality is the Linux world does.

    I will absolutely 100% accept that argument. Every last Linux program, on every last Linux distro, is 100% capable of working around this trivial little hardware issue via the most astonishing and unbeatable combination of compiler optimisations and assembly-level JITters.

    No problem at all.

    Except … just this teeny, tiny little one, Fifi.

    What is the actual problem in the first place? Go search for that surprisingly elusive PDF!

    (Took me all of two minutes, as I say.)

  133. DrLoser says:

    Unless the example is meant to bring up compiler optimization, I would change the boundary for loop to start at 1.

    I’ll just leave that particular impertinence dangling.

  134. oiaohm says:

    DrLoser
    You are not supposed to trick your way around it with compiler optimisations.
    The reality is the Linux world does. It is the compiler that is meant to worry about arch issues. If you have to start worrying about arch issues when it comes to performance, other than threading, it's either a major hardware bug or a compiler not adapting to the CPU.

    The advantage of JITs and bytecode is meant to be correct CPU optimisation, so avoiding cache-line and pipeline problems.

    Non-local problems are more frequent and far more difficult.
    This is what a JIT runtime optimizer and profiling are meant to cure. So speeding up at a particular point can be the JIT optimizer kicking in.

    Native code having CPU problems is understandable, because it is most likely built for the wrong hardware.

    The fact that it is language-dependent also tells you it is more a compiler issue than much else.

  135. DrLoser says:

    So 10 is still not everything.

    “Well, it’s one louder, isn’t it? It’s not ten. You see, most blokes, you know, will be playing at ten. You’re on ten here, all the way up, all the way up, all the way up, you’re on ten on your guitar. Where can you go from there? Where?”

    Perhaps you have a point there, Fifi.

    Can you plug in an amp without blowing yourself up? Don’t try it in the bath-tub.

  136. DrLoser says:

    Come on, give the PDF; you will find out how much of that PDF is in fact compiler-dependent.

    Nope, the PDF in question is purely a description of how the hardware works. Deaf Spy’s trivial pedagogical example is merely a demonstration of the limits of that hardware.

    Some of this is even language dependent, Fifi: thus my addition of Pascal statements versus expressions. I wouldn’t expect you to understand anything as fundamental as that.

    Given the simplest possible example (and you have been given that), it is possible to argue that the compiler can optimise a local problem out. Well, whoop-de-doo. Non-local problems are more frequent and far more difficult. And you still don’t seem to realise that this is a pedagogical example. You are not supposed to trick your way around it with compiler optimisations.

    The whole point of the example is that it demonstrates the difficulty in dealing with cache-busting, pipeline-busting, or a combination of both.

    And if you can’t even guess at what the underlying hardware issue is, oiaohm, then what is the point of all those half-baked links you threw at us?

    None whatsoever. Unless you have a relevant hardware link on this one.

    Or, of course, the pathetically easy PDF that I found in two minutes flat.

  137. oiaohm says:

    May I politely suggest that you run up a monitor for “CPU load” whilst you examine the program in question? You might need two cores for this, to make it a fair deal.
    The issue was that I only had 8 cores available. I have just re-run it with 10 and they are now all showing the same time.

    I have processes bound to CPU cores on this system; those put effectively 100 percent load on those CPUs. The system I am sitting on has 16 cores, so 10 is still not everything.

    With the 4 cores I tried, 1 to 5 were the same and 6 to 9 were the same. I am not getting a solitary faster value at all. The difference is inside statistical error.

  138. DrLoser says:

    DrLoser, please go and read up on CAD or, to be correct, drafting. You will find a lot of things are not matrix maths. It's all the non-matrix-maths stuff that brings your simulations to a grinding halt.

    I appreciate your advice, Fifi, but we here have a commercial company to run. And, as part of my terms of employment, I absolutely insisted that I had a gibbon in a cage to offer me this sort of useful advice.

    Thinking back, I should have just accepted their initial offer of a parrot.

  139. oiaohm says:

    DrLoser, Robert Pogson and I are Linux users, so the example was for Windows users. Sorry, wrong example. This is the problem: we have many examples of so-called CPU problems that turn out to be compiler-related just as much as CPU-related.

    The compiler is meant to know CPU limits and attempt to avoid them.

    Come on, give the PDF; you will find out how much of that PDF is in fact compiler-dependent. Yes, a lot of classes teach this stuff. Linux students get confused because example after example does not add up: why is Linux making it work when on Windows it is going south? When you write the Linux answer you can be marked wrong and have to prove to the teacher that, oops, that is the answer when you are on Linux.

  140. DrLoser says:

    I think the reason 1 to 9 are the same, instead of 1 to 10 all being the same, is CPU load.

    You are, of course, entitled to your completely ignorant and uninformed opinion, Fifi.

    But … OK, I’ll give you that “CPU load” would be hardware-related, which is marginally better than your previous flailing around regarding LLVM.

    May I politely suggest that you run up a monitor for “CPU load” whilst you examine the program in question? You might need two cores for this, to make it a fair deal.

    But I think you’ll find that it has nothing at all to do with “CPU load.”

    And you still have no clue whatsoever, do you? Took me two minutes.

    You’re still on the never-ending clock, Fifi.

  141. DrLoser says:

    So it's a demonstration of a Microsoft .NET JIT issue.

    Could be. But that’s just brainless execution of somebody else’s smarts.

    Tell us, oiaohm, what specific problem is the Mono JITter addressing?

    Remember, this is a hardware issue. The JITter can only do so much, given a realistic program rather than a pedagogical simple example.

    And your ability to look PDFs up has really gone down the tubes, hasn’t it? Not a single one in evidence.

    Goodness knows why.

  142. DrLoser says:

    One more clue, Fifi, and this is a repeat of my software clue, as given earlier to Robert.

    It is, indeed, a hardware issue. But it can be solved (in this one simple pedagogical case, and not in general) by a simple refactoring via a ternary operator.

    But only in languages that support a ternary operator as an expression, rather than a statement.

    I really cannot make this more simple for you, other than directing you to the PDF in question.

    Sad, really.

  143. oiaohm says:

    I think the reason 1 to 9 are the same, instead of 1 to 10 all being the same, is CPU load.

  144. oiaohm says:

    DrLoser, he has not said what he is attempting to demonstrate; it is Deaf Spy who has to state whether what I am referring to is what he was meaning to demo.

    “cache-busting or pipe-line busting”
    OK, you must be referring to the Microsoft JIT. It does not have brains.

    Sorry, 1 to 9 on Mono on Linux run at the same reported speed when averaged over many runs.

    This comes down to the auto-parallelizer in LLVM; it's a little smarter.

    So it's a demonstration of a Microsoft .NET JIT issue.

  145. DrLoser says:

    Or, rather, symmetrical around “boundary=5.”

    Which any normal person would have seen by running the program and going …

    WOT?

    And then, like me, spending two minutes looking up the relevant PDF.

    I can cite that PDF, Fifi. Can you, after all of twelve hours’ effort?

  146. DrLoser says:

    So without question 0 is the fastest, because it's basically doing nothing and running first, before it multi-threads the for loop.

    On my machine, Fifi, 10 was the fastest. Now, interestingly, even though I know and fully understand the “secret special sauce…” I’m not quite sure why. Because, theoretically, this problem should have a symmetrical distribution around 5.

    I suspect, without analysing it any further, that there’s some sort of set-up cost involved with boundary=0 that is not present with boundary=10. Or, possibly, that the MMU, the cpu, the cache lines and the pipelines have gotten a bit of “giddy-up” from starting at 0 and going to 10.

    Perhaps I should reverse the sequence? That would be an interesting experiment, I think.

    But not nearly as interesting an experiment as watching you flailing around with your preposterous theories about “I needed to upload the latest version of LLVM.”

    Go on, please, Fifi. This is more fun than you can possibly imagine. And don’t forget: I let slip the important fact that this is symmetrical around “boundary=0.”

    Bit of a give-away that. Deaf Spy will never forgive me.

    Out of interest, however, I wonder how many of his CompSci 101 students figured it out inside the twelve hours or so that it’s taken you, Fifi?

  147. oiaohm says:

    DrLoser, please go and read up on CAD or, to be correct, drafting. You will find a lot of things are not matrix maths. It's all the non-matrix-maths stuff that brings your simulations to a grinding halt.

    Naturally, these could all be easily parallelised through a GPU. Except, it turns out, not. We have tried. There are several obstacles.
    The Bullet physics engine mostly says most of the stuff you are talking about is bull crap caused by people not knowing how to code it. You will find it got around almost 100 percent of the issues with those kinds of simulations. The Bullet physics engine is used to doing millions of point calculations, not a wimpy 10,000 points. Yes, each point in Bullet has more than 10 dimensions to worry about.

    I will give you that it's not easy making parallel formulas for some of these things.

  148. DrLoser says:

    Unless the example is meant to bring up compiler optimization, I would change the boundary for loop to start at 1.

    I believe I have already pointed out that this example, though there is a code optimisation to do with the difference between a statement and an expression, is clearly not intended to demonstrate anything to do with compiler optimisations, Fifi.

    And I’ve given you more clues than you should really need. It’s either to do with cache-busting or pipe-line busting, and it might very well be a combination of both.

    Now, stop guessing and go look up the relevant PDF.

    And no more drivelling, please. It’s not a demonstration of a compiler issue (I give you this for free). It’s a simple demonstration of how an embarrassingly parallelisable algorithm can go horribly wrong if you don’t understand how the underlying hardware works.

    3 … 2 … 1 … Run, Fifi, Run!

  149. oiaohm says:

    DrLoser, after updating my LLVM, which had got a bit old, it's no longer doing this nasty. It's not strcpy-related.

    Ulrich Drepper: find the strcpy modification he rejected. It's strlcpy, which is highly pointless because glibc already has strncpy. Drepper accepted MPX extensions into glibc; we are just waiting on the hardware.

    Drepper's issue was not running static analysis to find these issues for a very long time.

    Even with the new LLVM a lot disappears.

    for (int iterations = 0; iterations < 1000; iterations++)
    {
    }
    That disappears.
    This "for (int boundary = 0; boundary <= 10; boundary++)"
    is now "for (int boundary = 1; boundary <= 10; boundary++)"
    Yes there is extra code before the for loop being

    watch.Restart();
    watch.Stop();
    Console.WriteLine("Count = {0}, Time = {1}", 0, watch.ElapsedTicks);

    So without question 0 is the fastest, because it's basically doing nothing and running first, before it multi-threads the for loop.

    That is of course if you are using a working LLVM/mono.

    How this has happened is that LLVM has calculated the possible value range for the array contents as 0 to 10, so a boundary of 0 has to give a count of 0.

    Somewhere in the older LLVM version this went horribly wrong. The code you write and the code the compiler ends up leaving can be very different.

    Unless the example is meant to bring up compiler optimization, I would change the boundary for loop to start at 1.

  150. DrLoser says:

    DrLoser Yes I missed one thing:
    if (array[i] < boundary) count++;

    Absolutely irrelevant, Fifi. Prove any part of it to be relevant, should you be impertinent enough to try.

  151. DrLoser says:

    Console.WriteLine(“Count = {0}, Time = {1}”, Random.Next(10000), watch.ElapsedTicks);

    I think you’re onto something there, oiaohm. Leaving aside possible cache-busting and pipeline-busting issues, that looks like a 96 nanosecond MMU issue to me …

  152. DrLoser says:

    Not so absurd, really, when you consider massive matrix calculations or whatever.
    That is a really bad example. Insane bad example Moron bad example.

    Matrix calculations are something you use a GPU for. Why? Because a matrix problem turns out to be a stack of small independent maths problems.

    Really bad example. Really really bad example. Insane bad example. Moron bad example.

    Well, obviously, Fifi. Now, while you are spending your day job squirting bull semen up a heifer’s back end, it so happens that I am part of a fifty-man team building a CAD engine for structural engineering. I’ll save you the details, but it boils down to a static model (THANK GOD! NOT A FEEDBACK MODEL WITH A GOVERNOR!) involving several cycles through a set of roughly 10,000 result points for a wooden roof, each of which has about ten other orthogonal dimensions involving either SLS or ULS and various calculations.

    Naturally, these could all be easily parallelised through a GPU. Except, it turns out, not. We have tried. There are several obstacles.

    But, if you can take time out from squirting genetic material into a mammalian uterus, Fifi, I’m sure you can come up with some creative ideas.

    Maybe a completely pointless cite or two?

  153. DrLoser says:

    So when I saw the NOP between the timer points I thought that was where the modifications ended.

    Ulrich Drepper! Is that you, Mr Strcpy?

  154. DrLoser says:

    Oh boy, I misread the decompiler. Basically LLVM has done a nasty.

    Fair enough. Here’s an admission for you, oiaohm. I spent half of today tracking down a cock-up I made by checking in a debug version of three DLLs, rather than the overnight build. We all make these mistakes. None of us are perfect.

    But at least one of us tracked down the relevant PDF, didn’t we?

    And at least one of us ran the code through and had that “A-Ha!” moment, didn’t we?

    Clue, oiaohm: it’s back to the Turkey Baster for you.

  155. DrLoser says:

    DrLoser, basically Deaf Spy's code on Mono AOT does some really creative things.

    Apparently, so does brain damage.

    Remind me again, oiaohm. What particular issue does Deaf Spy’s example code demonstrate?

    You have a pass on the “forbidden cites” here if you feel that one would be necessary. I won’t hold it against you.

  156. oiaohm says:

    Console.WriteLine(“Count = {0}, Time = {1}”, Random.Next(10000), watch.ElapsedTicks);

    Oh boy, I misread the decompiler. Basically LLVM has done a nasty.

  157. oiaohm says:

    The random ends up in
    Console.WriteLine(“Count = {0}, Time = {1}”, count, watch.ElapsedTicks);
    That becomes
    Console.WriteLine(“Count = {0}, Time = {1}”, Random.Next(boundary), watch.ElapsedTicks);

    I did not know LLVM could transform things quite this far. So when I saw the NOP between the timer points I thought that was where the modifications ended.

  158. oiaohm says:

    DrLoser, yes, I missed one thing:
    if (array[i] < boundary) count++;

    Do compilers not cheat when they optimize? Remember, printf turns into puts and so on. You're right, it's not a NOP but a call to random for a value of (boundary). The code is in fact substituted. The build result does not have the array anymore, I just noticed. LLVM has been insanely busy.

    DrLoser, basically Deaf Spy's code on Mono AOT does some really creative things.

  159. DrLoser says:

    Perhaps there’s an angle where everyone would get IT for $free if they allow access to some cloudy grid for the idling cores… Is this the next business model, a cloud with no server-rooms?

    Simply put, Robert: no.

    A cloud with no server rooms and free yeast? Now you’re talking sense!

  160. DrLoser says:

    It’s not that you don’t have a clue what the problem here is, Fifi.

    It’s just that, with all that vaunted WallOfCites drivel that you persist in using for your Gish-Galloping, you can’t even look it up, can you?

    That’s truly pathetic. It’s proof of pointless ignorance, I believe. That task took me all of two minutes to find the relevant PDF, and a minute more to realise what it meant.

    And yes, five minutes more to run the sample C# and see the output.

    Apparently M$ C# is roughly 2.5 times slower than finding a PDF via Google.

    Except that, in either case, you have to know what you’re doing. Which, Fifi, you have yet to demonstrate.

  161. DrLoser says:

    And of course it’s being called within a loop. And the loop conditions are … let’s just check this … not especially invariant.

    BWAHAHAHAHAHA!

    How are the learner classes in prize bull insemination going, Peter?

    Clue: the Turkey Baster thing goes into the Heifer. Not the Bull.

  162. DrLoser says:

    It just NOPs anything that can be solved out to a static result that does not in fact modify anything.

    Unfortunately for you, Fifi, this is a pure function that takes an array of randomised integers and returns a result.

    Guess where LLVM notices that the result is modified?

    No NOPs for you!

  163. DrLoser says:

    DrLoser, LLVM running over that program will NOP that complete code section.

    Naturally you will have no problem at all quoting the output from LLVM.

  164. DrLoser says:

    Btw, here is a very little nice example that I give to my students sometimes (courtesy of Raymond Chen). It is in C#, but I am sure you will have no trouble working it out.

    Just a guess, Deaf Spy.

    CS 101?

  165. ram says:

    Deaf Spy said: “I don’t see Intel throwing cores. AMD did try, and failed to compete at both desktop, and on server. Their 16-core Bulldozer had its ass handed back to it by 8-core Xeon.”

    AMD failed since they didn’t supply any motherboards for those chips that ran Linux. The same time they introduced Bulldozer they also tried to force the big anti-feature UEFI. Intel, on the other hand, does supply motherboards for their chips, AND they support Linux.

    Since high performance computing, and cluster computing, is completely dominated by Linux, AMD’s lack of Linux support may prove to be fatal to AMD.

  166. oiaohm says:

    Gibberish. No compiler yet known would elide all of that code to a NOP.
    DrLoser, LLVM running over that program will NOP that complete code section. So don't say no compiler will, either. Yes, Mono will do it in its AOT mode. It is why it makes some .NET programs run insanely fast. It just NOPs anything that can be solved out to a static result that does not in fact modify anything.

  167. oiaohm says:

    DrLoser, the reality is: how often are you running a single task?

    http://www.linux-kvm.org/page/Multiqueue

    Most of what you are calling gibberish here is correct.

    Not so absurd, really, when you consider massive matrix calculations or whatever.
    That is a really bad example. Insane bad example Moron bad example.

    Matrix calculations are something you use a GPU for. Why? Because a matrix problem turns out to be a stack of small independent maths problems. Most matrix calculations turn out to be embarrassingly parallel algorithms, and this is why you use a GPU for them.

  168. DrLoser says:

    OK. A little help with Robert and the “boundary” issue here. Because I know that Robert is capable of programming it, and will be just as astonished as I was at the results.

    And I know that Fifi is completely incapable of looking up the relevant PDF. Why? Because Fifi hasn’t. (It only took me two minutes, from scratch.) Here we go then.

    Deaf Spy, what version of .NET with what optimization settings?

    Not actually relevant. This is a demonstration of … well, I’ll let you guess, Fifi. Either cache busting or pipeline busting.

    This is not exactly an answer you are going to expect. Because the complete code example is predictable, compiler optimization can reduce the code between watch.Restart() and watch.Stop() to only count=boundary.

    Exactly an answer we would expect. Because it is utterly and embarrassingly wrong. It’s really nothing to do with the compiler … although, if you look up the relevant PDFs, you’ll find an argument for compiler optimisation.

    But it’s not the optimisation you are suggesting, Fifi.

    You didn’t even bother to compile and run the code, did you, Fifi? So much for the Four Freedoms.

    Heck, it can get worse: the code between watch.Restart() and watch.Stop() can be made non-existent.

    Gibberish. No compiler yet known would elide all of that code to a NOP.

    A completely baseless and uninformed proposition, Fifi.

    Oh, and btw, your completely baseless and uninformed proposition has nothing at all to do with the example behaviour at hand.

    Seriously, Fifi. You didn’t even compile and run it, did you? I DID.

    Compiler modifications to code can be quite extreme. Sorry, your example lacks a few key things to prevent compiler rewriting.

    Name one, Fifi.

    Run the program up. (Easily translated into your language of choice.) And do us the favour of explaining the underlying issue.

    Deaf Spy knows what the underlying issue is.
    I know what the underlying issue is.
    I suspect that ram knows what the underlying issue is.
    And, if Robert has rewritten the sample in Pascal and run it, then I believe Mr Pogson knows what the underlying issue is.

    You don’t have a clue as to the underlying issue, do you, Fifi?

  169. DrLoser says:

    I knew there was a problem with DrLoser's and Deaf Spy's ideas. I just was not remembering the newer law.

    The only trouble with that little observation, Fifi, is that you quoted Gustafson’s Law earlier on. Good to know that your memory is not as defective as your limited capacity for reasoning.

    Yes, a single problem's performance may not be increased massively. The reality is: how often is a computer really dealing with a single task?

    Gibberish.

    It is insanely rare in fact for a computer to have a single task.

    A Reductio ad Absurdum, Fifi: let’s assume the computer has a single task. Not so absurd, really, when you consider massive matrix calculations or whatever. Now. Tell us how this single task can be parallelised with 100% efficiency. (Which it cannot.) And tell us how this scales up to more than a single task.

    An embarrassingly parallel workload is the most common computer workload. Yes, 2 users running 2 applications on the same server is almost always an embarrassingly parallel workload.

    Gibberish. Without any other details, that amounts to two serial processes running on two separate cores. Which is not an embarrassingly parallel algorithm, as provided by Robert’s excellent and informative example.

    Linux kernel drivers turn out to be able to be an embarrassingly parallel workload. Writing to many discs turns out to be an embarrassingly parallel workload.

    Embarrassing gibberish, Fifi. You’re out-doing yourself here.

    Sending network traffic on modern network cards turns out to be an embarrassingly parallel workload.

    Even more pathetic, Fifi. No. Simply put, no.

    And for the rest, and I sincerely hope that Deaf Spy will kindly accept me stealing his thunder … another post, I think.

  170. oiaohm says:

    http://en.wikipedia.org/wiki/Gustafson%27s_law

    I knew there was a problem with DrLoser's and Deaf Spy's ideas. I just was not remembering the newer law.

    Yes, a single problem's performance may not be increased massively. The reality is: how often is a computer really dealing with a single task? It is insanely rare in fact for a computer to have a single task.
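
    For concreteness, here is a rough worked comparison of the two laws; the parallel fraction p = 0.9 and N = 16 cores are only illustrative numbers, not anything measured:

    Amdahl (fixed problem size): speed-up = 1 / ((1 - p) + p/N) = 1 / (0.1 + 0.05625) = 6.4
    Gustafson (problem grows with the machine): speed-up = (1 - p) + p*N = 0.1 + 14.4 = 14.5

    Same hardware, very different answers, because the two laws ask different questions.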

    An embarrassingly parallel workload is the most common computer workload. Yes, 2 users running 2 applications on the same server is almost always an embarrassingly parallel workload.

    Linux kernel drivers turn out to be able to be an embarrassingly parallel workload. Writing to many discs turns out to be an embarrassingly parallel workload. Sending network traffic on modern network cards turns out to be an embarrassingly parallel workload.

    Deaf Spy, what version of .NET with what optimization settings? This is not exactly an answer you are going to expect. Because the complete code example is predictable, compiler optimization can reduce the code between watch.Restart() and watch.Stop() to only count=boundary. So the real-world result can show no difference no matter the value of boundary. Heck, it can get worse: the code between watch.Restart() and watch.Stop() can be made non-existent. Compiler modifications to code can be quite extreme. Sorry, your example lacks a few key things to prevent compiler rewriting.

  171. DrLoser says:

    uh, yes, I do. Take some application that works on a large array in RAM, say, “solver”. Break it up into 4 parts that work on 1/4 of the array each, say solver1,2,3,4. Turn them loose.

    That’s an example of an Embarrassingly Parallel problem, Robert. You don’t see many of those in real life. In fact, I’ve never seen a single one.

    But let’s just consider this theoretical “solver,” shall we?

    1) Where did the input come from? Because it sure wasn’t deposited into RAM by Pink Unicorns. Oops, back to the sequential part of Amdahl’s Law.
    2) How do you marshal the results? It’s been a while, but back in the oughties you probably heard of Google’s “revolutionary” Map/Reduce algorithm.

    Not especially revolutionary, really, since it’s been a known technique for parallelisation since, ooh, I dunno, say the 1960s. But in the Functional/Lisp world, not the *nix world. The genius of Larry Page and Sergei Brin was to take this concept (which hitherto had only existed on a single machine) and use it across a vast, redundantly parallel, network of servers.

    And if memory serves, they patented it. I know how you adore software patents, Robert.

    Anyway, long story short, that’s how you do all massively distributed parallel data-mining these days, including Hadoop. (Which is a piss-poor relative of what Bing uses. I have seen both. I am in a position to judge.)

    tl;dr.

    For a relatively insignificant part of the workflow, you can employ an embarrassingly parallel algorithm. But making any use of the output of “solver” typically involves far more technical chops than you possess, Robert.

    And to be fair, I don’t possess them either.

  172. oldfart wrote, “When I started up my Windows 10 beta VM the activity was spread evenly across all CPUs/cores.”

    Hmmm… That’s the behaviour I’ve had on Beast for years.
    %Cpu0 : 1.3 us, 0.0 sy, 0.0 ni, 98.0 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st
    %Cpu1 : 1.7 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu2 : 1.0 us, 0.7 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu3 : 1.0 us, 0.7 sy, 0.0 ni, 90.6 id, 7.7 wa, 0.0 hi, 0.0 si, 0.0 st

    If you run a VM on Beast, it looks the same, except busier:
    %Cpu0 : 6.6 us, 1.7 sy, 0.0 ni, 86.1 id, 5.6 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu1 : 13.9 us, 15.2 sy, 0.0 ni, 67.7 id, 3.3 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu2 : 3.3 us, 1.3 sy, 0.0 ni, 47.7 id, 47.3 wa, 0.0 hi, 0.3 si, 0.0 st
    %Cpu3 : 4.3 us, 1.0 sy, 0.0 ni, 81.9 id, 12.7 wa, 0.0 hi, 0.0 si, 0.0 st

    Is that not desirable? I think it is.

  173. DrLoser says:

    There’s a reason they do that and it’s not because it doesn’t give more throughput. People have workloads that love these things.

    Indeed. Bing, for example. They use dual-processor blades with these monsters in most (it’s been a year and a bit, probably all) of their data centers.

    But most people don’t have workloads like this, Robert. Most people, as Deaf Spy points out, have workloads that do not readily translate to the simple matter of “adding cores.”

    And by “most,” I mean 99.9% of people. Granted, the other 0.1% gobble these things up by the thousand, which is why they are profitable.

    But it’s not a market that the ARM fabs are aiming for (ram will have some valuable input on the niche, so I’ll defer to him). Mostly because they’d stand a very high chance of going broke if they did.

  174. DrLoser says:

    it probably doesn’t make any significant difference what value boundary has.

    Apparently you googled up the wrong PDF, Robert, because you have actually got that answer completely arse about face. Try rewriting the program in Pascal and you’ll see what I consider to be a very surprising result.

    For bonus points, there is a simple refactoring that makes the program entirely cache- and pipeline-friendly. Unfortunately it doesn’t work in Pascal, because the equivalent of a ternary operator is a statement rather than an expression. More bonus points for knowing why this makes a significant difference.
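
    As a minimal sketch of that refactoring, in the C# of the example quoted further down the thread (whether the JIT really turns this into a branch-free conditional move is up to the JIT, so treat it as an illustration only):

    // Hypothetical drop-in replacement for CountThem in Deaf Spy's listing below.
    // The ternary expression feeds an add instead of guarding an if statement,
    // so the count no longer depends on a data-dependent branch in the source.
    static int CountThemBranchless(int[] array, int boundary)
    {
        int count = 0;
        for (int i = 0; i < array.Length; i++)
        {
            count += (array[i] < boundary) ? 1 : 0;
        }
        return count;
    }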

    And don’t forget, this is a pedagogical example. Deaf Spy has made it as simple (and in your face) as possible, but no simpler.

    For even more bonus points, imagine writing a thousand-line long program for, say, ballistics calculation that doesn’t fall foul of the same basic issue.

  175. oldfart says:

    “So? Most applications don’t have much use for more and it’s about TDP.”

    Interesting. I just pulled up Task Manager on my Windows station and checked the performance tab. Interestingly enough, all of the 8 cores (2 CPUs x 4 cores per CPU) are registering activity. When I started up my Windows 10 beta VM the activity was spread evenly across all CPUs/cores.

    Care to comment Robert Pogson?

  176. Deaf Spy wrote, ” Most Intel chips are still 2 and 4 cores only.”

    So? Most applications don’t have much use for more and it’s about TDP. If the package will only handle ~100W, you can double the number of cores but you have to slow them down for no benefit beyond a few cores per package. You’ve answered your own challenge. If you do put in more cores for little benefit and the cost goes up, people won’t buy them.

    Further, Intel makes some monstrous chips: 15 cores at 2.8 GHz for a 130 W TDP. There’s a reason they do that and it’s not because it doesn’t give more throughput. People have workloads that love these things.

  177. Deaf Spy wrote, “can you tell at what value for boundary the code runs fastest, and why?”

    It probably depends on the hardware. On one of the ancient minicomputers with no cache and no pipeline, 0 would be the answer because there would be less work done for that value, but in modern machines, with caches and pipelines, it probably doesn’t make any significant difference what value boundary has. This application’s code would fit entirely in cache and on some machines, the arrays could as well. Depending on the precision of the clock, the time may not be accurately measured in any event.
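
    As a quick check of that last point, .NET's Stopwatch will report what resolution it actually has; a minimal probe in the C# of Deaf Spy's example (the class name TimerCheck is mine):

    using System;
    using System.Diagnostics;

    class TimerCheck
    {
        static void Main()
        {
            // Stopwatch exposes whether it is backed by a high-resolution counter
            // and how many ticks that counter produces per second.
            Console.WriteLine("High resolution: {0}", Stopwatch.IsHighResolution);
            Console.WriteLine("Ticks per second: {0}", Stopwatch.Frequency);
            Console.WriteLine("Nanoseconds per tick: {0}", 1e9 / Stopwatch.Frequency);
        }
    }

    If a tick is coarse relative to the loop being timed, the reported differences are mostly noise.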

  178. DrLoser says:

    Well, I tried to figure that one out, Deaf Spy, and failed miserably. Luckily it only took me two minutes to find the relevant PDF.

    Both the search and the resulting excuse from oiaohm should make for entertaining reading…

  179. Deaf Spy says:

    Btw, here is a very little nice example that I give to my students sometimes (courtesy of Raymond Chen). It is in C#, but I am sure you will have no trouble working it out.


    using System;
    using System.Diagnostics;

    class Program
    {
        static int arraySize = 10000;

        // Count how many elements fall below the boundary value.
        static int CountThem(int[] array, int boundary)
        {
            int count = 0;
            for (int i = 0; i < array.Length; i++)
            {
                if (array[i] < boundary) count++;
            }
            return count;
        }

        static void Main(string[] args)
        {
            // Fill the array with random values in the range 0..9.
            int[] array = new int[arraySize];
            Random rand = new Random();
            for (int i = 0; i < arraySize; i++) array[i] = rand.Next(10);

            // Time 1000 counting passes for each boundary value from 0 to 10.
            Stopwatch watch = new Stopwatch();
            for (int boundary = 0; boundary <= 10; boundary++)
            {
                watch.Restart();
                int count = 0;
                for (int iterations = 0; iterations < 1000; iterations++)
                {
                    count = CountThem(array, boundary);
                }
                watch.Stop();
                Console.WriteLine("Count = {0}, Time = {1}", count, watch.ElapsedTicks);
            }
            Console.ReadLine();
        }
    }

    Now, can you tell at what value for boundary the code runs fastest, and why?

    QED, eh?

  180. Deaf Spy says:

    …and Intel and AMD and ARM don’t know how to make CPUs have more throughput by replicating cores and SMP doesn’t exist and networking doesn’t exist and … [SARCASM]

    Let’s see. Intel’s first desktop quad-core chip, the Q6xx series, came about in 2007. Fast-forward 7 years. Most Intel chips are still 2 and 4 cores only. Only the extreme editions of the i7 have more cores, which are not only damn expensive, but also used for specialized purposes.

    I don’t see Intel throwing cores. AMD did try, and failed to compete at both desktop, and on server. Their 16-core Bulldozer had its ass handed back to it by 8-core Xeon.

    Sarcasm, eh?

  181. Deaf Spy says:

    QED

    Certainly not so, Pogson. Yours is a nice theoretical exposition, and a very superficial one, I am afraid. You fail to take into consideration a number of things:
    1. Overhead for splitting the array, esp. in NUMA-based systems.
    2. Overhead for allocating workers even with a working and pre-configured threadpool.
    3. Overhead for a barrier that will wait for all the workers to finish.
    4. Overhead for merging the results.
    5. Overhead of unexpected non-linear optimizations at CPU level, which you are very likely to ignore.
    6. If your N is small enough to fit in the cache, you are guaranteed to have much more “fun” than you anticipate. Most likely the overhead of 1..5 will eat all your advantage.

    And I am not going too deep.

    You can say QED only when you present your code and benchmark results. Now, you are merely daydreaming.

  182. oiaohm says:

    ram, ARM chips are a lot more tricky than just turning parts on and off. ARM chips integrate some ideas from FPGAs as well, so circuits inside the chip can change their function; this is part of ARMv8-A and is responsible for the 64-bit design requiring less silicon than its 32-bit relations.

    Something interesting: due to how the turn-off/turn-on and modification operations are performed, it costs no extra processing time. Lower power usage is lower power usage.

  183. ram wrote, “ARM achieves a lot of its power efficiency by turning parts of the chip off when not in use. Is that a useful feature for desktops?”

    Well… you could run off a battery during power-failures or put the thing in a smaller package. Other than that, I have to agree Moore’s Law has reached some limit of usefulness in desktop energy conservation. You can, of course, stick in more cores if the TDP is reduced. I used to have an idling CPU with 1 core. Now I have 4. Are 6, 8, or 12 really needed? I think going to SSD is all the speedup most of us need for the next decade unless artificial intelligence matures. That stuff can soak up any amount of computing power. Perhaps there’s an angle where everyone would get IT for $free if they allow access to some cloudy grid for the idling cores… Is this the next business model, a cloud with no server-rooms?

  184. ram says:

    No doubt the new ARM chips are powerful and good on power consumption. What is not clear is which architecture is the winner for machines that are mostly on. ARM achieves a lot of its power efficiency by turning parts of the chip off when not in use. Is that a useful feature for desktops? Perhaps in some offices. Is it a useful feature for workstations? Probably not so much. How about servers? Yeah, maybe some with uneven use profiles.

    Still not clear to me if there is an overall “winner” or the market just fragments into specialized niches. In any event, most of those niches are going to be running Linux or some other related FLOSS operating system, whatever it might be called by then.

  185. oldfart says:

    “QED”

    Maybe. The devil is always in the details of implementation Robert Pogson. We look forward to your committing what you have outlined to actual code and producing the results.

  186. DrLoser wrote, “you, Robert, have no clue whatsoever how to write a multi-threaded Pascal program that not only takes advantage of all four AMD cores, but also results in a reputable industry-standard 2.5 multiplier for the average Amdahl-era program.”

    uh, yes, I do. Take some application that works on a large array in RAM, say, “solver”. Break it up into 4 parts that work on 1/4 of the array each, say solver1,2,3,4. Turn them loose. That assumes of course that there aren’t random fetches all over the array for each process but there are many instances of that, say, an application that computes some weighted sum: solver1 computes the sum of N/4 terms, solver2 another N/4 and so on. Finally have “solver” take the four terms and add. Run each process on a different processor or core. Since this operation is quite compute-intensive, likely the bottleneck will be in the memory interface but if N is small enough to fit in the cache we can have a lot more fun. If you need everything going full-bore, copy the quarters to different systems. QED
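
    A minimal sketch of that partitioning, in the C# used elsewhere in this thread; the chunk count of 4, the use of Task and the names are my own choices for illustration, not a prescription:

    using System;
    using System.Threading.Tasks;

    class Solver
    {
        // Sum a large array by handing one slice to each of several tasks,
        // then adding the partial sums (the merge step).
        static double ParallelSum(double[] data, int chunks)
        {
            var tasks = new Task<double>[chunks];
            int chunkSize = data.Length / chunks;
            for (int c = 0; c < chunks; c++)
            {
                int start = c * chunkSize;
                // The last chunk picks up any remainder.
                int end = (c == chunks - 1) ? data.Length : start + chunkSize;
                tasks[c] = Task.Run(() =>
                {
                    double partial = 0;
                    for (int i = start; i < end; i++) partial += data[i];
                    return partial;
                });
            }
            Task.WaitAll(tasks);   // wait for all workers (the barrier)
            double total = 0;
            foreach (var t in tasks) total += t.Result;   // merge
            return total;
        }

        static void Main()
        {
            var rand = new Random();
            var data = new double[1000000];
            for (int i = 0; i < data.Length; i++) data[i] = rand.NextDouble();
            Console.WriteLine(ParallelSum(data, 4));
        }
    }

    Whether this beats a plain loop depends on the overheads Deaf Spy lists above; on a small, cache-resident array it often will not.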

  187. DrLoser says:

    There were four of them 🙂

    I stand pointlessly corrected.

    In re which, I have an apology to make to Robert. I hope I’ve got the Photobucket bit right (it worked a year ago), but if not the words will have to suffice.

    Yes, Robert, all five of you had a legitimate issue with the packaging of Microsoft Office 2007, as confirmed by anil. Me, I’d have used nail scissors or a razor blade; but, not good packaging.

    In passing, I will note that Anil, like me, is what you would call “a Microsoft Troll.” (I know Anil, and I don’t think he will object.) One of the principles of us contrarians on your site, Mr Pogson, is that we will admit when we are wrong.

    I was wrong, regarding the packaging. I admit it. I have one further admission to make, also prompted by Anil:

    I’m guessing that will fail. Image here.

    See that glowing multi-coloured ball at the top left hand side, Robert? That’s where File/Print is found. It’s where File/Print was on earlier versions of Office. It’s nothing if not discoverable.

    Nevertheless, I apologise. Shortly afterwards, people complained, and Microsoft reinstated the File tab. Apparently you were not the only one confused.

    It all ended happily ever after, though. You, Robert, got to muck around with your favourite Linux distribution.

    And 80+% of the desktop world got back to clicking File/Print for the last eight years, precisely as they had done for the last twenty or so.

    Following Anil’s prescription, however, I humbly apologise for having inaccurate fun at your more accurate expense.

  188. luvr says:

    Dr. Loser said, “How many pointless ellipses was that? Five?”

    If you can predict the future as accurately as you can count, then you will be proven wrong. There were four of them 🙂

  189. DrLoser says:

    And Luvr is the early entrant in the stakes for The Greatest Number Of Pointless Ellipses, 2015:

    Even though it’s far from clear whether you are the one who deserves the congratulations. Like I said, time will tell.

    How many pointless ellipses was that? Five? I guess we’ll have to tally them up against oiaohm’s tendency to start a sentence with “Yes,” which is equally unconvincing.

    Time will tell indeed, Luvr. It’s been pretty convincing, Linux desktop wise, over the last twenty years.

    There’s no good reason to avoid rational argument in favour of abject fatalism, though, as far as I can see.

  190. DrLoser says:

    I’m pretty sure that networking “exists,” Robert.

    For once, I’m not going to explain the latency issues to you and to oiaohm.

    Why don’t the both of you make them magically vanish, for the benefit of your other readers?

    Just to make this easy, let’s posit the minimum networked system of two devices (mobile phones will do, since in 2016 the top end $500 devices will be built on Cortex A-72s) separated by five feet on a 100Gb LAN.

    Your application (your choice of application) is sitting on one device. Let’s say it hits a processing limit on the four local cores, and needs two more cores for full parallelisation. I specify two, because obviously the other two (remote) cores could reasonably be expected to take up the slack.

    Could you, perhaps, sketch out a believable architecture for this little toy system?
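
    Just to be sporting, here is one back-of-envelope number to get you both started. Every latency figure in it is an assumed ballpark value, not a measurement:

    /* How many 2.5 GHz clock cycles does one trip cost?
       All latency figures below are assumed ballpark values. */
    #include <stdio.h>

    int main(void)
    {
        double clock_hz = 2.5e9;   /* Cortex A-72 class core               */
        double l1_s     = 1.6e-9;  /* assumed L1 cache hit (~4 cycles)     */
        double dram_s   = 80e-9;   /* assumed DRAM access                  */
        double lan_s    = 50e-6;   /* assumed, optimistic, LAN round trip  */

        printf("L1 hit : %9.0f cycles\n", l1_s   * clock_hz);
        printf("DRAM   : %9.0f cycles\n", dram_s * clock_hz);
        printf("LAN RTT: %9.0f cycles\n", lan_s  * clock_hz);
        return 0;
    }

    In clock cycles, those two remote cores are tens of thousands of times further away than the local cache. That is the hole any believable architecture has to climb out of.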

  191. luvr says:

    “Misguided” why? Because you don’t believe that ARM might take over the “desktop” (or whatever what we call the “desktop” today will morph into tomorrow)?

    Do you have a crystal ball, maybe? I’m sure Robert doesn’t, but I have no idea whether he is the “misguided” one, or you are. In any case, time will tell. I suggest, therefore, that we let it speak, and return to the subject once it has spoken. Then, if you turn out to have been right all along, you can repeat your statement about how “misguided” Robert was. Otherwise, you will have to accept that you were the “misguided” one, won’t you?

    (Though I’m sure that you will, then, cry foul over how unjust history has treated you… And, of course, you are now going to say that you’re confident that you won’t have to… All I want to say is, “That the best may win”… To which you will obviously want to reply, “Thank you”… Even though it’s far from clear whether you are the one who deserves the congratulations. Like I said, time will tell.)

  192. DrLoser says:

    I believe that Fifi is pinning his (vacant) hopes on NUMA, btw. But, whatever, Robert.

    If you want to pick SMP, that’s fine. Any ideas about the latency? The cache lines? The effects on … well, basically any other part of the system that isn’t a simplistic little four-core 2.5 GHz ARM chip?
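
    Since the cache-line question usually gets waved away, here is a toy anyone can run. The 64-byte pad is an assumption about the line size, and the exact timings will vary by machine:

    /* False sharing in one picture: two threads bumping counters that live on
       the same cache line fight over it; padding them apart removes the fight.
       The 64-byte pad is an assumed line size. Build with: cc -O2 -pthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L

    static struct { volatile long a, b; } packed;  /* a and b share a line   */
    static struct { volatile long a; char pad[64]; volatile long b; } padded;

    static void *packed_a(void *x) { for (long i = 0; i < ITERS; i++) packed.a++; return x; }
    static void *packed_b(void *x) { for (long i = 0; i < ITERS; i++) packed.b++; return x; }
    static void *padded_a(void *x) { for (long i = 0; i < ITERS; i++) padded.a++; return x; }
    static void *padded_b(void *x) { for (long i = 0; i < ITERS; i++) padded.b++; return x; }

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void race(void *(*fa)(void *), void *(*fb)(void *), const char *label)
    {
        pthread_t ta, tb;
        double t0 = seconds();
        pthread_create(&ta, NULL, fa, NULL);
        pthread_create(&tb, NULL, fb, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        printf("%-26s %.2f s\n", label, seconds() - t0);
    }

    int main(void)
    {
        race(packed_a, packed_b, "adjacent counters:");
        race(padded_a, padded_b, "padded to separate lines:");
        return 0;
    }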

    And after that, we talk Herb Sutter and Joe Duffy and possibly others, all of whom have extensively documented how incredibly hard doing this stuff in software is.

    Never mind, Robert. You don’t need to worry about the software, any more than you have to worry about Debian packaging. Somebody else will do it for you, at roughly the market price of yeast in Utopia.

    Sooner or later.

  193. DrLoser says:

    …and Intel and AMD and ARM don’t know how to make CPUs have more throughput by replicating cores and SMP doesn’t exist and networking doesn’t exist and … [SARCASM]

    And you, Robert, have no clue whatsoever how to write a multi-threaded Pascal program that not only takes advantage of all four AMD cores, but also results in a reputable industry-standard 2.5 multiplier for the average Amdahl-era program.
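
    To put a number on it, here is the Amdahl arithmetic. The parallel fraction p = 0.8 is simply the value that reproduces that 2.5x figure on four cores; it is an assumption, not a measurement:

    /* Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n). */
    #include <stdio.h>

    static double amdahl(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }

    int main(void)
    {
        double p = 0.8;  /* assumed parallel fraction */
        printf("4 local cores      : %.2fx\n", amdahl(p, 4));       /* 2.50x            */
        printf("4 local + 2 remote : %.2fx\n", amdahl(p, 6));       /* 3.00x, zero-cost */
        printf("unlimited cores    : %.2fx\n", amdahl(p, 1000000)); /* ~5x ceiling      */
        return 0;
    }

    Even with a free, zero-latency network, those two extra remote cores buy half a multiplier; with a real interconnect they buy rather less.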

    No offence to either you, Robert, or to ARM. Or even to Pascal.

    The road to parallel multi-core programming is littered with failure. For once, you’re not a voice alone, crying in the wilderness.

    You’re right up there with all the other tech press savants who can predict a solution to a very, very difficult problem without having a clue how to get there.

    As Deaf Spy says, your previous (very extensive) column on the subject provided ample evidence to prove this.

    What to do, what to do? Let’s throw more chips at the problem.

    Well, I suppose I might just call that a grotesquely uninformed point of view, but I am of the Whig persuasion. Man (and IT) is infinitely perfectible!

    And the great thing about this, Robert, is that if you want to make an underpowered multi-core design like the Cortex A-72 (fine for phones, rubbish for desktops, let alone servers) achieve Beast-Hood! … well, all you have to do is to write the algorithms.

    And the even better thing is that all the full-time industry experts who have expended ten years of their lives doing this sort of thing will bow down in obeisance!

    And you know what obeisance means, Robert?

    Free Yeast For Life!

  194. Robert Pogson says:

    Deaf Spy, truly being deaf/blind/etc., wrote, “We already destroyed this misguided proposition of yours on this forum.”

    …and Intel and AMD and ARM don’t know how to make CPUs have more throughput by replicating cores and SMP doesn’t exist and networking doesn’t exist and … [SARCASM]

  195. Deaf Spy says:

    If they use too little power for a desktop, folks can just throw in extra chips

    You never learn, do you? We already destroyed this misguided proposition of yours on this forum. You failed to provide a single proof to support it, and little Fifi was so miserable that it hurt.
