Standby for the ARMed Server Wars

Dell has been quietly grinding out a design for a rackmounted server using ARMed CPUs. You can get over 2000 CPU cores in a rack with a ton of RAM and storage for a few watts per core. HP, Calxeda and others are also working on this. The OS, of course, will be GNU/Linux, likely Ubuntu GNU/Linux, because of software already ported to ARM.

Dell is still only shipping to select customers so prices and general availability may be a ways off, but someone has to test them under fire…

I expect some time in 2012 this will be a viable option for all the big guys. I expect prices will fall when volume production ramps up.

See “Dell ARMs up for hyperscale servers” • The Register.

About Robert Pogson

I am a retired teacher in Canada. I taught in the subject areas where I have worked for almost forty years: maths, physics, chemistry and computers. I love hunting, fishing, picking berries and mushrooms, too.

13 Responses to Standby for the ARMed Server Wars

  1. oiaohm says:

    http://www.calxeda.com/products/energycards/quadnode

    Phenom, the big thing is that each of the 4 core sets boots solo: “MicroSD Card support via four (4) slots – one slot dedicated per SoC to enable diskless system designs by booting from microSD card”
    So you are not required to share memory between the 4 core sets, or even share processes.

    The fabric under these ARM chips even enables you to isolate processing, so if you wanted to you could break the box up into 288 individual machines. If one of those machines locks up, the rest would not be affected.

    288 not-that-fast machines, 1.1 to 1.4 GHz. Now that is good enough for light websites with dedicated hosting, without using virtualization.

    Virtualization is an overhead; it’s not free. Physical hardware splitting is fairly free, and you don’t suffer from “oops, this person did something stupid and crashed the hypervisor”, taking down 20 other people’s sites.

    Another interesting thing is clusters. Size matters. The longer the interconnects, the worse the performance. x86, due to heat and other things, still cannot be done small.

    http://en.wikipedia.org/wiki/File:AMD_Bulldozer_block_diagram_%288_core_CPU%29.PNG

    Phenom, have a close look at the Bulldozer block diagram. The Xeon is traditional: 1 FPU per core. The ARM chips are traditional as well: 1 FPU per core.

    The Bulldozer has 1 FPU per 2 cores. Next, the Xeon has an instruction converter per core. ARM does not have an instruction converter at all, and Bulldozer has an instruction converter shared between 2 cores. A lot of code in the Linux kernel scheduler had to be altered to cope with what the Bulldozer did.

    So yes, Anandtech was testing Bulldozer with non-optimised stuff. The optimising has to be at OS level. Xeons don’t beat Bulldozers on Linux. Those extra cores of the Bulldozer get used with better process management. On Windows, yes, because it’s not Bulldozer-optimised.

    Bulldozer is not an example of more cores, because 4 floating point cores in Bulldozer vs 4 floating point cores in a 4 core Xeon is of course going to be about the same if the work load is split correctly. Bulldozer has 8 integer cores.

    Yes, Bulldozers do bring a process management nightmare, since the process management now has to know which programs contain floating point so it can schedule things correctly and you don’t bottleneck at the floating point processor.
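
To illustrate the scheduling problem described above, here is a minimal Python sketch. It is not the Linux scheduler's actual algorithm. It assumes a hypothetical chip with 4 modules, each holding 2 integer cores and 1 shared FPU, and made-up thread names; `Thread` and `place_threads` are illustrative names only. It spreads floating point threads across modules first, then fills the remaining cores with integer-only threads.

```python
from collections import namedtuple

# Hypothetical thread description: a name and whether it is floating point heavy.
Thread = namedtuple("Thread", ["name", "uses_fpu"])


def place_threads(threads, modules=4, cores_per_module=2):
    """Toy placement: keep FPU-heavy threads from sharing a module where possible.

    Each module is modelled as `cores_per_module` integer cores plus one FPU.
    Returns (placement, fp_count): thread names per module and FP threads per module.
    """
    if len(threads) > modules * cores_per_module:
        raise ValueError("more threads than cores")

    placement = [[] for _ in range(modules)]
    fp_count = [0] * modules

    # Place floating point threads first, on the module with the fewest FP threads.
    for t in (t for t in threads if t.uses_fpu):
        candidates = [m for m in range(modules) if len(placement[m]) < cores_per_module]
        target = min(candidates, key=lambda m: (fp_count[m], len(placement[m])))
        placement[target].append(t.name)
        fp_count[target] += 1

    # Integer-only threads fill whatever cores are left, emptiest module first.
    for t in (t for t in threads if not t.uses_fpu):
        candidates = [m for m in range(modules) if len(placement[m]) < cores_per_module]
        target = min(candidates, key=lambda m: len(placement[m]))
        placement[target].append(t.name)

    return placement, fp_count


threads = [Thread(f"fp{i}", True) for i in range(4)] + [Thread(f"int{i}", False) for i in range(4)]
placement, fp_count = place_threads(threads)
print(placement)
print(max(fp_count))
```

In this toy case every module ends up with one floating point thread and one integer thread, so no FPU is contended. A scheduler that ignored the `uses_fpu` flag could just as easily put two floating point threads on the same module.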

    So a Bulldozer is a pain-in-the-ass chip. You cannot compare it to a 16 core Energycard. In fact, even though the EnergyCore ECX-1000s are lower clocked, for floating point a single Energycard runs rings around an AMD Bulldozer chip with a non-optimised setup. There is no bottlenecking, since each core in an ECX-1000 has its own floating point processor, so this is 16 floating point processors vs 4. The AMD chip is not running fast enough to beat that.

    So the Energycards are no snail. In processing power, each Energycard matches up to a standard x86 processor.

    “memory is unified. RAM and storage are one for all cores”

    Really, look at Energycards as a cluster-in-a-box solution: a very small box containing a very big cluster. If what you are processing will run well on a cluster, like web servers or read-only file servers, then the ARM box rocks.

    Databases: the ARM box will cause them pain. Video games: it is going to cause pain.

    ARM servers will have their place, just like clusters of old had their place. It’s when you need many-thread processing.

    Physical hard drives only spin so fast, and network connections are also only so fast.

  2. oldman says:

    “each ARMed CPU in some of these designs has its own RAM and NIC.”

    The problem, Pog, is that once an atomic calculation is done on one CPU it has to be shipped somewhere. NIC interconnects are too slow for this function. That is why HPC uses technologies like InfiniBand to link together the computer farms. But even InfiniBand scales only so far.

    This is why Calxeda is putting crossbar interconnects on silicon and Intel is purchasing Cray’s interconnect technology. They all know that the current simple-minded ARM SoC chip designs just don’t cut it for this case.

  3. Phenom wrote, “You see, it is not trivial to split a task between too many cores for a very simple reason – memory is unified. RAM and storage are one for all cores. Unless you split your data into CPU-private storages, you can’t benefit of many cores endlessly.”

    You are completely ignoring the fact that each ARMed CPU in some of these designs has its own RAM and NIC. There is no bottleneck around the cores in a rack. Load the data into that RAM and everything runs faster. You ask all the cores who has fact, “X”, and the one with the answer responds. Load an individual process onto each CPU and they all crunch in parallel. It all depends on the suitability of the task. There are many large and important tasks where this technology works.
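
The “ask all the cores who has fact X” idea can be sketched as a hash-partitioned lookup. This is a single-process Python illustration under assumed names (`Node`, `build_cluster` and `ask_all` are hypothetical); on real hardware each node would be a separate SoC with its own RAM, reached over the network rather than by a function call.

```python
import zlib


class Node:
    """Stands in for one ARM SoC with its own private RAM (a dict here)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.ram = {}

    def lookup(self, key):
        # Each node searches only its own memory.
        return self.ram.get(key)


def build_cluster(facts, node_count):
    """Spread facts over nodes by a stable hash of the key."""
    nodes = [Node(i) for i in range(node_count)]
    for key, value in facts.items():
        owner = zlib.crc32(key.encode("utf-8")) % node_count
        nodes[owner].ram[key] = value
    return nodes


def ask_all(nodes, key):
    """Broadcast the question; only a node holding the key answers."""
    answers = [(n.node_id, n.lookup(key)) for n in nodes]
    return [(node_id, value) for node_id, value in answers if value is not None]


facts = {"X": "the answer", "Y": "another fact", "Z": "a third fact"}
nodes = build_cluster(facts, node_count=4)
hits = ask_all(nodes, "X")
print(hits)
```

Because every fact lives on exactly one node, the lookups do not contend for a shared memory bus. The cost moves to the broadcast and the reply, which is the interconnect concern oldman raises.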

  4. Phenom says:

    “but there certainly are loads for which more cores is the best solution”

    “Any task that can be split into multiple processes readily will run faster on more cores”

    Pogson, the tests of Anandtech proved that wrong. A few more cores can help, but a lot more cores do not.

    You see, it is not trivial to split a task between too many cores for a very simple reason – memory is unified. RAM and storage are one for all cores. Unless you split your data into CPU-private storages, you can’t benefit of many cores endlessly.
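
Phenom's point is close to what Amdahl's law formalises (an editorial framing; the thread does not name it): if only a fraction p of the work can run in parallel, the speedup on n cores is 1 / ((1 - p) + p / n). The sketch below uses a hypothetical p of 0.9 purely for illustration; `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


# Hypothetical workload: 90% of the work parallelises, 10% is serialised
# (for example on shared memory or storage).
for cores in (1, 4, 16, 64, 1024):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With these assumed numbers the speedup stays below 10 however many cores are added. A fully independent workload (p = 1) scales linearly, which is the case Pogson describes.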

    To make things worse, there are technologies like AMD’s NUMA, which sometimes make things better, sometimes worse.

    Again, read the article. Virtualization is in theory one of the tasks that is best served by many cores. In reality, Xeons (fewer cores) wipe the floor with Bulldozers (more cores).

  5. oiaohm says:

    Phenom
    “In other words, ARM can’t compare with x86 when it comes to performance on servers.”

    It’s like GPU vs CPU. A GPU is technically slower per core than a CPU, but the GPU can process more parallel threads.

    ARM server vs x86 server is very much like GPU vs CPU. ARM units might be slower, but more parallel cores with proper isolation from each other equals a more resilient system.

    x86 and ARM will take out different segments. For general web service, more cores will be better.

  6. Phenom wrote, “which proves that multiple cores cannot compensate the pure MIPS power”.

    No one is claiming that more cores is the best solution for every load, but there certainly are loads for which more cores is the best solution. Otherwise, we would all run single core CPUs.

    The same realities which caused multi-core CPUs to happen apply to multiple ARMed CPUs in boxes. Any task that can be split into multiple processes readily will run faster on more cores. There are all kinds of tasks like file-serving, and searching/sorting/databasery which can benefit from more cores. With ARM one can have more cores in a smaller space and cheaper.

  7. Phenom says:

    Spam filter fails again.

  8. Phenom says:

    Pogson, in theory, you are right. Practice, however, does not happen to follow theory.

    I gave you a rather sophisticated article with a real-life experiment, which proves that multiple cores cannot compensate the pure MIPS power. Even the virtualization tests, where many cores should shine, show that more cores do little good.

    Now, you can tell me that Anandtech are Intel-paid shills and liars, of course. However, I trust their experience and the high level of expertise in their articles much more than your theoretical blabbering.

  9. Phenom wrote, “ARM can’t compare with x86 when it comes to performance on servers.”

    Performance = MIPS * cores. The idea is to pack a ton of lower-performing cores into a small space. You bet ARM competes on performance. I have seen 2U occupied by a single core of x86 yet ARM can plug dozens of cores into that same space. Cooling of ARM cores is much easier because there is little heat released per unit volume. That’s a consequence essentially of a tiny instruction-set requiring few transistors to implement. It’s the same feature that allows a tiny chip to run a smart-thingy. It may be true that a single transaction takes longer on an ARMed server but it’s not true that an x86 server can do more transactions per second than an ARMed server given the phenomenon of parallel processing. It’s the same idea as multiple cores in the x86 world. One has to slow the clock-speed a bit to prevent over-heating but more does get done by multiple cores than single cores.
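
The rule of thumb “Performance = MIPS * cores” and the latency caveat in the same comment can be put into a small calculation. All figures below are hypothetical placeholders, not measurements of any ARM or x86 part, and the model assumes perfectly independent transactions; `Server` and its methods are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    mips_per_core: float  # hypothetical per-core rate, millions of instructions per second
    cores: int

    def aggregate_mips(self):
        # "Performance = MIPS * cores", valid only if the work splits perfectly.
        return self.mips_per_core * self.cores

    def latency_seconds(self, instructions_millions):
        # One transaction runs on one core, so latency depends on per-core speed only.
        return instructions_millions / self.mips_per_core

    def transactions_per_second(self, instructions_millions):
        return self.aggregate_mips() / instructions_millions


# Placeholder numbers chosen only to show the trade-off.
x86 = Server("x86 box", mips_per_core=10_000.0, cores=8)
arm = Server("ARM box", mips_per_core=2_500.0, cores=64)

work = 50.0  # millions of instructions per transaction (hypothetical)
for s in (x86, arm):
    print(s.name, s.latency_seconds(work), s.transactions_per_second(work))
```

With these made-up figures the ARM box takes four times longer per transaction yet completes twice as many per second. Whether real workloads split that cleanly is exactly what the thread is arguing about.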

  10. Phenom says:

    “Arm might be just the filesystem server to the other servers if nothing else.”

    In other words, ARM can’t compare with x86 when it comes to performance on servers. Pogson will have to wait for another decade at least. 😉

  11. oiaohm says:

    Phenom, ARM chips have two advantages.

    1152 ARM cores in a 4U server. Or divide that by 4 for the number of CPU chips. That density is not going to happen with x86 heat generation.

    If you want a picture of it, search for HP Moonshot.

    Those processors are 1.1 to 1.4 GHz per core, which is basically fast enough for web serving in most cases. Also, due to the hardware virtualisation, you can give people their own dedicated CPU cores.

    http://www.calxeda.com/products/energycards/quadnode

    That is what the node cards look like. 16 cores per card. Fully hardware-splittable into 4 cores with 4 GB of RAM and a dedicated hard drive. Or usable as a 16 core processor with 16 GB of RAM. Yes, what that plugs into makes it possible to link tons of those cards into one huge mother system, like HP Moonshot does. Due to its design, a card can be used as a dedicated RAID controller for 16+ drives.

    So if you go wiring nuts, an HP Moonshot can be connected to the same number of hard drives as CPU cores, 1152 of them. Which basically means one 4U Moonshot box, and say bye-bye to the rest of the rack and a few racks either side, which all go to storage drives.

    So you are not talking lightweight hardware here. Arm might be just the filesystem server to the other servers if nothing else.

  12. kozmcrae says:

    “Just look at what more, but less powerful cores make on servers:”

    Less powerful but more power hungry.

  13. Phenom says:

    Poor chances, Pogson. Just look at what more, but less powerful cores make on servers:
    http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper

    ARM is even weaker than that.
