Monthly Archives: February 2014

2GB/core is the HPC Gold Standard … But I Know I Need 48GB/node.

I got some e-mail after the previous blog (http://dell.to/144sqai) on 2GB/core recommendations for HPC compute nodes. It turns out that some of you know the memory capacity requirements of your workloads and it is currently 48GB per (2-socket) compute node. Kudos for determining the minimum amount of memory required!

But configuring to the minimum required memory assumes that less memory is “better”: it costs less money and has fewer potential downsides. More on that later.

Continuing, the logic goes that 48GB/node is 24GB/socket on a 2-socket node. And since there are four (4) memory channels per socket on an Intel SandyBridge-EP processor (E5-2600) and one would like to maximize the memory bandwidth, one needs 4 x 6 GB DIMMs to achieve the required 24GB per socket. But, alas, there is no such thing as a 6 GB DIMM.

Hence, a 4 GB DIMM and a 2 GB DIMM are used on each memory channel. Several of you shared this configuration data with me. This does many things correctly:

  1. Complies with my previous Rule #1: Always populate all memory channels with the same number of DIMMs. (That is, on all processors use the same DIMMs Per Channel or DPC). Check.
  2. Complies with my previous Rule #2: Always use identical DIMMs across a memory bank. Check.
  3. Does not use 3 DPC, which would negatively affect memory performance. Check.
  4. Meets the known memory capacity requirements. 4 GB plus 2 GB is 6 GB. 6GB per memory channel is 24GB/socket and the required 48GB/node. Check.

Therefore, the memory configuration is balanced and a good one, technically speaking.

However, let’s dig deeper and take into account a few other things. One is my previous Rule #3: Always use 1 DPC (if possible to meet the required memory capacity). The others are to consider today’s price and tomorrow’s requirements.

As stated in the previous blog, I like to create the “best” memory configuration for a given compute node and then see if the memory/core capacity is sufficient. In other words, in high performance computing take memory performance into account (first) in addition to the age-old capacity requirements. And as usual, price comes into play. In this 48GB/node case, the price is indeed a driving factor.

To be consistent with the previous blog, we’ll use the same memory sizes and prices, based upon the Dell R620, a general purpose, workhorse, 1U, rack-mounted, 2-socket, Intel SandyBridge-EP (E5-2600) compute node platform. Below is that same snapshot of the memory options and their prices taken on 12-July-2013.

Here’s the layout of a 48GB/node configuration using 4 GB DIMMs and 2GB DIMMs. Also, in the figure is the total memory price for that configuration.

Here’s an alternate layout using 8 GB DIMMs. Also, in the figure is the total memory price for this configuration.
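Since the figures (and their prices) are not reproduced here, here is a minimal sketch of the capacity arithmetic for the two layouts. The helper name and the per-channel encoding are mine, purely for illustration:

```python
CHANNELS_PER_SOCKET = 4   # Intel SandyBridge-EP (E5-2600)
SOCKETS_PER_NODE = 2

def node_capacity_gb(dimm_gbs_per_channel):
    """Total node capacity when every channel carries the same set of DIMMs."""
    return sum(dimm_gbs_per_channel) * CHANNELS_PER_SOCKET * SOCKETS_PER_NODE

print(node_capacity_gb([4, 2]))  # 48 -> one 4GB + one 2GB DIMM per channel (2 DPC)
print(node_capacity_gb([8]))     # 64 -> one 8GB DIMM per channel (1 DPC)
```

Either layout satisfies Rules #1 and #2; only the 8GB layout also satisfies Rule #3 (1 DPC).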

Here are the key features of the second configuration:

  • More than the 48GB capacity required
  • Less $$$ (per node; consider this ~$300 savings times the total number of nodes)
  • Fewer parts to potentially fail (in fact, half as many parts to fail)
  • Fewer types of spare DIMM parts to stock
  • Easier correct replacement of failed DIMMs
  • More available memory slots for future expansion
  • “Future proof”

“Future proof? What does he mean by that?” Did you notice the memory per core in the figures above? The 48GB/node configuration using 4GB DIMMs and 2GB DIMMs is 3GB/core for today’s mainstream 8-core processor. The 48GB/node specification may in fact be tied to the GB/core and the core count per processor. Today’s node may need 48GB, but a node with more cores may need more memory.

We know from several public places (e.g., http://www.sqlskills.com/blogs/glenn/intel-xeon-e5-2600-v2-series-processors-ivy-bridge-ep-in-q3-2013/ ) that the follow-on to the Intel SandyBridge-EP processor (E5-2600), codenamed Ivy Bridge-EP, will officially be called the Intel Xeon E5-2600 v2. The mainstream v2 processor will feature ten (10) cores, compared to today’s 8 cores. With this future processor, the alternate memory configuration above using 8 x 8GB provides a total of 64GB/node. This 64GB/node on a 2-socket node with 20 cores is 3.2GB/core, still exceeding the 3GB/core of the 48GB node today.
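The GB/core arithmetic in the last two paragraphs is easy to check. A tiny sketch (the function name is mine):

```python
def gb_per_core(node_gb, sockets=2, cores_per_socket=8):
    """Memory per core for a multi-socket compute node."""
    return node_gb / (sockets * cores_per_socket)

print(gb_per_core(48, cores_per_socket=8))    # 3.0 -- today's 8-core E5-2600
print(gb_per_core(48, cores_per_socket=10))   # 2.4 -- the same 48GB on a 10-core v2 part
print(gb_per_core(64, cores_per_socket=10))   # 3.2 -- the 8 x 8GB layout on a 10-core v2 part
```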

If you have comments or can contribute additional information, please feel free to do so. Thanks.

–Mark R. Fernandez, Ph.D.

26-Aug-2013

(original posting: http://dell.to/16EjfPl )

2GB/core is the HPC Gold Standard. Not Anymore!

For as long as I can remember, the recommendation, and sometimes the requirement, for memory on a High Performance Computing (HPC) compute node has been 2GB/core. Note that this is a capacity-only requirement. Often the price of this memory would dominate the price of the compute node. I personally have struggled to meet this 2GB/core requirement within the budget; or worse, within the projected budget, since memory prices are like the Dow Jones Industrial Average and go up and down almost daily. But things have been changing consistently and rapidly these past few years. And our tendency to stick to that old memory standard needs to change also.

I first noticed the trend in November of 2011, about a year and a half ago, while preparing to attend Supercomputing in Seattle, WA. The purchase price of a 2-GB DIMM was less than the price of a 1-GB DIMM. In mid-2012, the 1-GB DIMM entered End-Of-Life (EOL) at Dell. That is, the 1-GB DIMM was no longer even offered or available. If one wanted a memory capacity based upon 1-GB DIMMs, well, one got twice the memory at the same speed or faster, and for a lower price. Being from Louisiana, we call that lagniappe!

To further complicate things, the Intel SandyBridge-EP processor (E5-2600) was introduced in early 2012. It featured 4 memory channels per socket vs. 3 memory channels for the previous generations of Nehalem/Westmere-EP (5500/5600). Based upon the core count of the processor selected, going with a hard-wired GB/core requirement without regard to the number of memory channels can lead to some very bad memory configurations in terms of memory performance. The icing on this cake of confusion was the introduction of Intel’s Flex Memory, which allows an almost unlimited number of what I will call “unbalanced” configurations. [For this blog, I will focus on making optimal balanced memory recommendations. More on the pros and cons of unbalanced and near-balanced memory configurations can be found in the links below.]

Today in HPC we, in general, highly desire balanced memory configurations since they provide the best performance. The thumb rules for a balanced memory configuration with the maximum performance are simple:

  1. Always populate all memory channels with the same number of DIMMs (That is, on all processors use the same DIMMs Per Channel or DPC)
  2. Always use identical DIMMs across a memory bank.
  3. Always use 1 DPC (if possible to meet the required memory capacity)

By now you may have discovered that I am proposing turning things upside-down: create the “best” memory configuration for a given compute node and then see if the memory/core capacity is sufficient. In other words, in high performance computing take memory performance into account (first) in addition to the age-old capacity requirements. And as usual, price comes into play.

To make a memory recommendation, I first access Dell’s R620 webpage to get the current list prices of memory [ http://www.dell.com/us/business/p/poweredge-r620/pd ]. The Dell R620 is a general purpose, workhorse, 1U, rack-mounted, 2-socket, Intel SandyBridge-EP (E5-2600) compute node platform. Below is a snapshot of the memory options and their prices taken on 12-July-2013.

Notice that 1-GB DIMMs are not available as explained above. But also notice that the trend has continued! The 4-GB DIMM is now less expensive than the 2-GB one. I see another EOL coming…

Additionally, the $/GB leader is now the 16-GB DIMM as shown in the table and figures below.

This $/GB leader opens up a lot of possibilities, but for now let’s remain focused on the 2GB/core standard and address these other possibilities in a future blog.

Now suppose we use the least expensive DIMM and populate a Westmere (5600) or Sandy Bridge (E5-2600) socket for optimal performance. In the figures below, I have followed the population thumb rules previously presented for balanced memory configurations. A socket is depicted with its corresponding memory channels. All the memory channels have been populated identically. There is 1 DPC in use. The corresponding GB/core for each possible core count of the possible processors to be placed in the socket is listed below each figure.

In all cases the 2GB/core capacity is met or exceeded in a configuration set up for optimal performance. This configuration uses the minimum amount of the lowest cost DIMMs to achieve the maximum memory performance. Additionally, this represents the lowest number of parts to fail and provides for memory capacity expansion if needed. This is today’s sweet spot memory configuration for HPC. The GB/core capacity is a side effect. Oh, and the sweet spot changes.
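The figures themselves are not reproduced here, but the arithmetic behind them is easy to sketch. A minimal example, assuming 4GB DIMMs at 1 DPC and some common core-count options for each socket type (the exact core counts shown in the original figures may differ):

```python
DIMM_GB = 4  # the least expensive DIMM in the price snapshot above

# 3 channels: a 5600-series socket; 4 channels: a Sandy Bridge-EP (E5-2600) socket.
for channels, core_options in ((3, (4, 6)), (4, (4, 6, 8))):
    socket_gb = DIMM_GB * channels        # 1 DPC on every channel
    for cores in core_options:
        print(f"{channels} channels, {cores} cores: {socket_gb / cores:.2f} GB/core")
```

Every combination lands at or above 2GB/core.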

In contrast, one might naively pick the better $/GB DIMMs, partially populate the memory channels, and exploit Flex Memory to address the exact capacity requirements. Some options are shown below, separating the core counts and addressing them independently.

Yes, these will work, as in they will function and not generate any errors, thanks to Flex Memory. Yes, they exactly meet the 2GB/core capacity-only requirement. Yes, they are also less expensive. And they will perform terribly. Please don’t do this! (unless you really know your memory bandwidth requirements).

Note that in all cases, one or more of the four memory channels are unused. This leaves memory bandwidth on the table. For an overview of memory performance see David Morse’s blog and see John Beckett’s whitepaper for much deeper details:

http://en.community.dell.com/techcenter/b/techcenter/archive/2012/07/26/memory-performance-guidelines-for-dell-poweredge-12th-generation-servers.aspx

http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/12g-memory-performance-guide.pdf

Especially notice Figure 29 of the whitepaper which indicates a potential 50% hit in memory bandwidth for unbalanced configurations, an understandable amount when using half the available memory channels.

So, there you have it. The “best” memory configuration today meeting or exceeding 2GB/core uses the 4GB DIMM and follows the maximum-performance configuration thumb rules. The GB/core capacity is a side effect, but it beneficially meets or exceeds the common 2GB/core requirement. Blindly meeting GB/core requirements with whatever DIMM sizes drive the price down, without regard to performance, is ill-advised.

If you have an interest in additional general information about memory, memory types, etc., this is a good place to start:

http://www.dell.com/poweredge/memory/

How will all this change with Intel Xeon IvyBridge (E5-2600 v2), coming in just a few months? Well, faster memory will be available, but probably at a premium price. There will be more cores per socket available. And memory prices continue to fluctuate. I’ll get back to you… 😉

If you have comments or can contribute additional information, please feel free to do so. Thanks.

–Mark R. Fernandez, Ph.D.

24-Jul-2013

(original posting: http://dell.to/144sqai )

Phinally, the Phull Phi Phamily is Announced!

Intel has announced the full line-up of the now officially named “Intel® Xeon® Phi™ coprocessor x100 family”. Whew! What a mouthful. I call it Phi for short. And we in HPC have been waiting a long time for Larrabee, uh, MIC, er, Knights Corner, I mean Phi to be announced and available to help advance our research.

I am very excited to be able to phinally talk more openly about this accelerator for HPC. In a previous blog, I briefly described the already available 5110 model of the Phi coprocessor and how to compute its peak theoretical performance.

Phi…, Nodes, Sockets, Cores and FLOPS, Oh, My!
http://dell.to/YjFuN0

STAMPEDE, Texas Advanced Computing Center (TACC)

I also shared that the Dell TACC Stampede system used an early-access, special edition Phi called the SE10. Stampede, which was ranked #7 on the November 2012 Top500 list, now moves up to #6 with the release of the June 2013 Top500 list (www.Top500.org). Congrats to Tommy Minyard and the folks at TACC for the improved number.

The production version of the Phi SE10 used in Stampede is called the 7120 and features a bit more performance than the special edition SE10 version. The 7120 was announced at the 2013 International Supercomputing Conference (ISC’13 http://www.isc-events.com/isc13/), along with other details about the rest of the Phi models.

For those who don’t have the time to read another blog or don’t want to spend the effort to do the math, here’s a summary of the peak performance of the three Phi models announced (plus the SE10 special edition, listed for reference):

  • 3120: 1.00 TFLOPS (57 cores/Phi * 1.1 GHz/core * 16 GFLOPs/GHz = 1,003.2 GFLOPS)
  • 5110: 1.01 TFLOPS (60 cores/Phi * 1.053 GHz/core * 16 GFLOPs/GHz = 1,010.88 GFLOPS)
  • SE10: 1.07 TFLOPS (61 cores/Phi * 1.1 GHz/core * 16 GFLOPs/GHz = 1,073.60 GFLOPS) (Note: not available)
  • 7120: 1.20 TFLOPS (61 cores/Phi * 1.238 GHz/core * 16 GFLOPs/GHz = 1,208.28 GFLOPS)
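If you want to reproduce these numbers, here is a minimal sketch of the same arithmetic (core counts and clocks as listed above; 16 FLOPs per clock per Phi core):

```python
# Peak theoretical GFLOPS = cores * GHz * 16 FLOPs per clock for each Phi core.
PHI_MODELS = {
    "3120": (57, 1.100),
    "5110": (60, 1.053),
    "SE10": (61, 1.100),
    "7120": (61, 1.238),
}

for model, (cores, ghz) in PHI_MODELS.items():
    print(f"{model}: {cores * ghz * 16:.2f} GFLOPS")
```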

So, what does all this mean and how does it help HPC and Research Computing? In short, we now have another 3 arrows in our quiver to attack the wide range of important problems that we face.

How has the presence of Phi already affected HPC and Research Computing? Well, the #1 system on the June 2013 Top500 list is using 48,000 Xeon Phi coprocessors. Yes, 48 thousand. See the www.Top500.org list for more details. Of note is the fact that both TACC’s Stampede with 6,400 Phi coprocessors and the #1 system with 48,000 Phi coprocessors are operating at about 60% efficiency. That’s a consistent number across a wide range of coprocessor counts.

If you have not yet had a chance to experiment with Phi, then, as usual, I recommend a platform better suited to test and development than the production platforms deployed at TACC, for example. As such, Dell also announced at ISC’13 support for Phi in the PowerEdge R720 and T620, both of which are excellent development platforms for both GPUs and Phi coprocessors. For more information about installing and configuring a Phi, see this posting:

Deploying and Configuring the Intel Xeon Phi Coprocessor in a HPC Solution
http://dell.to/14GtFRv

When deploying larger quantities of Phi or GPU cards, the production platform used by TACC’s Stampede, the C8220x, is an option.

To get you going on the software side with Phi, be sure to read and bookmark these:

Additionally, on the software side, if you are already using Intel’s Cluster Studio XE (http://software.intel.com/en-us/intel-cluster-studio-xe), support for Phi is included.

What does the future hold? Personally, comparing and contrasting the performance of Phi coprocessors and GPUs is still on my list for a future blog. Now that Phi is announced, I may be able to get to that sooner!

Secondly, there is an upcoming whitepaper from Saeed Iqbal, Shawn Gao, and Kevin Tubbs from Dell’s HPC Engineering Team. They present a performance analysis of the 7120 Phi in the R720. Preliminary results indicate about a 6X speedup and 2X the energy efficiency compared to Xeon CPUs on LINPACK. I’ll possibly update this blog with that link and definitely tweet about it as soon as it is available.

Finally, Intel also revealed that the next-gen Phi is code-named Knights Landing and will be available not only as a PCIe card, as it is today, but also as a “host processor” installed directly in a motherboard socket. They also shared that the memory bandwidth will be improved. This might help with the efficiency mentioned previously.

CPUs, GPUs, Coprocessors and soon, “host processors”. Interesting times ahead. I’ll be following those developments and sharing critical information as it becomes available.

If you have comments or can contribute additional information, please feel free to do so. Thanks.

–Mark R. Fernandez, Ph.D.

18-Jun-2013

(original posting: http://dell.to/100hBJ5 )

A Billion Here. A Billion There. Next Thing You Know, You’ve Got 4.3 Billion. And It Turns Out That’s Not Nearly Enough!

A #Dell colleague, Dave Keller (@DaveKatDell), alerted me to a YouTube video featuring Vint Cerf, a founding father of the Internet and current Chief Internet Evangelist at Google.

http://del.ly/6042XmFp

In that video, Vint Cerf explains that devices connected to the Internet are given Internet addresses, much like phones are given phone numbers. That address, known as an IP address, is usually represented as a grouping of 4 numbers separated by dots, such as 192.168.0.1. This is the default IP address of many Netgear routers, such as might be used in your home.

Each of those numbers in Version 4 of the Internet Protocol (IPv4) can be a number from 0 to 255. This means there are 256 choices for each of the 4 numbers.

256 * 256 * 256 * 256 = 4,294,967,296

So, there are about 4.3 billion addresses available for devices to connect to the Internet. In 1980, there were only about 227 million people in the entire United States. 4.3 billion sounds like plenty!

But how many do you use today? Cell phone? Laptop or Tablet? Home computer? Work computer? Home Internet router? TV?

I just named 6 possible ones. Without going into private networks, etc., I think it is safe to say that when you are connected to the Internet, you are using an IP address.

OK. So what’s the big deal? China has over a billion people. India has over a billion people. And according to Vint Cerf in that same video, there are over 5 billion mobile devices in the world today. According to Government Technology (http://del.ly/6046XoZ8), in 2020 there will be 50 billion Internet-enabled devices in the world. To put that number in perspective, that equates to more than 6 connected devices per person. Oops!

But don’t worry. Internet Protocol Version 6 (IPv6) is rolling out. China is actually taking a lead in this. Imagine why.

http://www.engadget.com/2013/03/11/chinas-new-internet-backbone-detailed-for-the-public/

http://www.zdnet.com/blog/china/china-vows-to-accelerate-ipv6-move-to-reverse-current-disadvantages/454

4.3 billion sounded big. Just how big is IPv6? Almost too big to explain or to even comprehend. It is well over one trillion times as large as IPv4. Or, put another way, with IPv6 those 4.3 billion addresses available from IPv4 could be handed to each and every person alive. I can almost understand and appreciate that. But in fact, it’s much larger: over a trillion-trillion-trillion total addresses. Or, for the nerds out there, a number with roughly a third as many digits as a googol.
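For anyone who wants to check those comparisons, a few lines of Python (arbitrary-precision integers make this easy):

```python
ipv4 = 256 ** 4   # 4,294,967,296 addresses (2**32)
ipv6 = 2 ** 128   # roughly 3.4 x 10**38 addresses

print(ipv4)
print(ipv6 // ipv4)              # ~7.9 x 10**28: well over a trillion times IPv4
print(ipv6 > (10 ** 12) ** 3)    # True: more than a trillion-trillion-trillion addresses
```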

And according to Paul Gil over at About.com “These trillions of new IPv6 addresses will meet the internet demand for the foreseeable future.” I certainly hope that is an understatement!

http://netforbeginners.about.com/od/i/f/ipv4_vs_ipv6.htm

If you have comments or can contribute additional information, please feel free to do so. Thanks.

–Mark R. Fernandez, Ph.D.

22-May-2013

(original posting: http://dell.to/185JpR2 )

Phi …, Nodes, Sockets, Cores and FLOPS, Oh, My!

Intel’s Xeon Phi coprocessor is the subject of this latest blog, which has evolved into a bit of a series of blogs. In the most recent post, I described how to compute the peak theoretical floating point performance of a system hosting GPUs. That one followed a general purpose description of computing peak floating point performance of systems containing CPUs such as Xeon processors from Intel.

GPUs: http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_gpu_computing/archive/2013/03/14/gpus-nodes-sockets-cores-and-flops-oh-my.aspx

CPUs: http://www.delltechcenter.com/page/Nodes,+Sockets,+Cores+and+FLOPS,+Oh,+My#fbid=TkQxC6Vb2Bi

Intel’s Xeon Phi coprocessor, unlike GPUs, is based upon the same architecture as the Intel Xeon CPU, such as Westmere, Nehalem, Sandy Bridge and the upcoming Ivy Bridge. As such, computing the peak theoretical floating point performance is similar to computing it for CPUs as described in the first blog.

There are several public references available that indicate that the currently shipping Intel Xeon Phi model 5110 contains 60 cores and that they operate at 1.053-GHz. Unlike GPUs, where not all cores are available for double precision floating point math, all 60 cores of Intel’s Xeon Phi are available, so computing the peak double precision floating point performance is as straightforward as it was for regular Intel CPUs.

All 60 of these cores can perform double precision floating point math at a rate of 16 flops/clock. Yes, sixteen (16) flops/clock! The AVX in Xeon Phi is one generation ahead of the AVX in general purpose Intel processors. (Comparing and contrasting these 60 cores at 16 flops/clock to a GPU’s 1000-ish cores at 2 flops/clock may be the subject of a future blog.)

Here’s the peak theoretical floating point math for an Intel Xeon Phi 5110:

GFLOPS = 60 cores/Phi * 1.053 GHz/core * 16 GFLOPs/GHz

GFLOPS = 1,010.88

I have seen this appear as “over a teraFLOP” and as 1,011 GFLOPS.

Additionally, the TACC Stampede system uses a special edition of the Intel Xeon Phi called the SE10. It features 61 cores operating at 1.1-GHz.

GFLOPS = 61 cores/Phi * 1.1 GHz/core * 16 GFLOPs/GHz

GFLOPS = 1,073.6
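If it helps, here is the same arithmetic wrapped in a tiny helper (the function name is mine, purely illustrative):

```python
def phi_gflops(cores, ghz, flops_per_clock=16):
    """Peak theoretical GFLOPS for an Intel Xeon Phi coprocessor."""
    return cores * ghz * flops_per_clock

print(round(phi_gflops(60, 1.053), 2))  # 1010.88 -- the 5110
print(round(phi_gflops(61, 1.1), 2))    # 1073.6  -- the SE10
```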

For additional information about TACC and the Stampede systems see:

http://www.tacc.utexas.edu/news/press-releases/2011/stampede

http://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide

Hope that helps. As future Intel Xeon Phi models are released in the coming months, this same type of computation should remain valid for computing their peak performance. For a system, compute the CPU performance of the host node as described in the previous blog. Compute the Intel Xeon Phi performance as described here. The total system performance is the sum of these.

Remember that this is the peak theoretical floating point performance. This is the theoretical performance you are guaranteed to never see. Expect to see more about real-world performance using Intel Xeon Phi as the Centres of Competence come up to speed:

http://www.dell.com/Learn/uk/en/ukcorp1/secure/2012-09-27-dell-hpc-computing-centre-cambridge?c=uk&l=en&s=corp

If you have comments or can contribute additional information, please feel free to do so. Thanks.

–Mark R. Fernandez, Ph.D.

30-Apr-2013

(original posting: http://dell.to/YjFuN0 )

 

GPUs…, Nodes, Sockets, Cores and FLOPS, Oh, My!

In a previous post, I described how to compute the peak theoretical floating point performance of a potential system.

http://www.delltechcenter.com/page/Nodes,+Sockets,+Cores+and+FLOPS,+Oh,+My#fbid=TkQxC6Vb2Bi

In that post, I alluded to GPUs coming into the mix: “When might you need MHz these days, you ask? Think GPU speeds.” Well, that time has come! The nVidia GTC conference is soon (www.gputechconf.com) and systems are now regularly shipping with GPUs such as the nVidia K20 and K20x which operate at MHz frequencies.

There are several references available that indicate that the new nVidia K20 contains 2,496 cores. And the operating frequency is also available. Do not attempt to use these 2 pieces of data to compute a peak theoretical floating point performance number as described in the previous blog.

The K20 does indeed contain 2,496 cores, but not all are available for double precision floating point math. These cores are arranged into what are called Streaming Multiprocessor (SM) units. SM units in a GPGPU on an nVidia card are analogous to CPUs in sockets on a motherboard. Each SM does indeed contain 192 cores, all of which are available for single precision floating point math. But unlike most CPUs, all GPU cores are not available for double precision floating point math. On the nVidia K20 SM, 64 cores can perform double precision floating point math at a rate of 2 flops/clock.

There are 13 SM units in the K20, operating at a 706 MHz frequency. Here is the use of MHz referenced in the previous blog: 706 MHz is 0.706 GHz. Note that 13 SMs * 192 cores per SM is the quoted 2,496 cores total. Also note in the math below that the 64 double precision core count is used, not the 192 (single precision) core count quoted.

Here’s the peak theoretical floating point math for a K20:

GFLOPS = 13 SM/K20 * 64 cores/SM * 0.706 GHz/core * 2 GFLOPs/GHz

GFLOPS = 1,174.784

I have seen this appear as 1.17 TFLOPS or 1,175 GFLOPS.

Additionally, the nVidia K20x contains an additional SM unit for a total of 14 SM units and it operates at a slightly higher frequency of 732 MHz or 0.732 GHz.

Here’s the peak theoretical floating point math for a K20x:

GFLOPS = 14 SM/K20x * 64 cores/SM * 0.732 GHz/core * 2 GFLOPs/GHz

GFLOPS = 1,311.744

I have seen this appear as 1.31 TFLOPS or 1,312 GFLOPS.
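As with the CPU math, this is easy to wrap in a small helper (the function name is mine); note that only the 64 double precision cores per SM enter the calculation:

```python
def kepler_dp_gflops(sm_count, ghz, dp_cores_per_sm=64, flops_per_clock=2):
    """Peak theoretical double precision GFLOPS for a Kepler-class GPU."""
    return sm_count * dp_cores_per_sm * ghz * flops_per_clock

print(round(kepler_dp_gflops(13, 0.706), 3))  # 1174.784 -- K20
print(round(kepler_dp_gflops(14, 0.732), 3))  # 1311.744 -- K20x
```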

Hope that helps. Compute the CPU performance as described in the previous blog. Compute the GPU performance as described here. The total system performance is the sum of these.

Remember that this is the peak theoretical floating point performance. Since it is theoretical, it is the performance you are guaranteed to never see! But we also already have a few blogs posted about real-world performance using GPUs:

Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations: http://dell.to/JsWqWT

ANSYS Mechanical Simulations with the M2090 GPU on the Dell R720: http://dell.to/JT79KF

Faster Molecular Dynamics with GPUs: http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_gpu_computing/archive/2012/08/07/faster-molecular-dynamics-with-gpus.aspx

Accelerating High Performance Linpack (HPL) with GPUs: http://en.community.dell.com/techcenter/high-performance-computing/b/hpc_gpu_computing/archive/2012/08/07/accelerating-high-performance-linpack-hpl-with-gpus.aspx

If you have comments or can contribute additional information, please feel free to do so. Thanks.

–Mark R. Fernandez, Ph.D.

14-Mar-2013

(original posting: http://dell.to/12TrP1g)

Nodes, Sockets, Cores and FLOPS, Oh, My!

Recently, a fellow blogger here at HPCatDell, Dr. Jeff Layton, has been running a series on PetaFLOPS for the Common Man. In that series, he writes that in the November 2009 Top500 list there are actually two systems that achieve above one PetaFLOPS in sustained performance on the Top500 benchmark. However, there are five systems that have a theoretical performance above one PetaFLOPS.

Working in HPC, I am often asked how to compute the theoretical performance of a potential system.

Although this may be elementary for many of you, I thought I would take the time to document this and, maybe, in a subsequent writing explain the sustained performance numbers that go into the Top500 rankings – and why there is such a difference between theoretical and sustained.

To properly compute the theoretical performance of a system, we need to agree upon some common terms, or a taxonomy, if you will, of HPC compute components. Then, we simply do a dimensional analysis as we did in high school.

In the past, a chassis contained a single node. This chassis was a desktop computer or a tower version or a deskside unit or a rack-mounted pizza box server, etc. Within that thing you bought was a single node. A single node contained a single processor. A processor contained a single (CPU) core and fit into a single socket. But times change…

With recent “systems,” we can have a single chassis containing multiple nodes. And those nodes contain multiple sockets. And the processors in those sockets contain multiple (CPU) cores.

Therefore, let’s define a few terms.

1. A “chassis” houses one or more nodes.

2. A node contains one or more sockets.

3. A socket holds one processor.

4. A processor contains one or more (CPU) cores.

5. The cores perform FLOPS.

The “chassis” is that thing that houses one or more compute nodes. Note that the chassis may be a rack-mounted pizza box, or a blade enclosure or entire rack computer, which accepts plug-in compute nodes. One must buy one or more of these in order to have a computer system. Nonetheless, I call the piece of hardware that is a unit that houses compute nodes a chassis.

Nodes, usually printed circuit boards of some type, are manufactured with (empty) sockets. There is not, in general, a node board for each available processor. The node boards are built to accommodate a family of processors. Depending upon your needs, your desires, or your budget, you select a specific processor to go into that socket. Today, within the same processor family, you can select between differing core counts, a wide range of frequencies and vastly differing memory cache structures.

Also note that the “thing” that Intel and AMD and other microprocessor companies sell is a processor. One cannot buy anything smaller than a processor. And they call it a processor with preceding adjectives, e.g., the ABC dual-core processor, or the XYZ quad-core processor.

Finally, the cores within the processor perform the actual mathematical computations. One sequence of these mathematical operations involves the exclusive use of floating point numbers and is called a FLOP or FLoating-point OPeration. The plural of FLOP is FLOPs, with a small “s,” like many things when made plural.

In general, a core can do a certain number of FLOPs or FLoating-point OPerations every time its internal clock ticks. These clock ticks are called cycles and measured in Hertz (Hz). Most microprocessors today can do four (4) FLOPs per clock cycle, that is, 4 FLOPs per Hz. Thus, depending upon the Hz frequency of the processor’s internal clock, the floating point operations per second or FLOPS can be calculated. Note the large “S” in FLOPS.

The internal clock speed of the core is known. It’s that GHz rating typical of today’s processor. For example, a 2.5-GHz processor ticks 2.5 billion times per second (Giga ~ billion). Therefore, a 2.5-GHz processor ticking 2.5 billion times per second and capable of performing 4 FLOPs each tick is rated with a theoretical performance of 10 billion FLOPs per second or 10 GFLOPS.

That’s probably more than anyone needs to know about the details of counting mathematical operations done by microprocessors. Fortunately, the final formula for computing theoretical performance of a system is quite simple and straightforward.

Here is a full and complete sample formula using dimensional analysis:

GFLOPS = #chassis * #nodes/chassis * #sockets/node * #cores/socket * GHz/core * FLOPs/cycle

Note that the use of a GHz processor yields GFLOPS of theoretical performance. Divide GFLOPS by 1000 to get TeraFLOPS or TFLOPS.

Likewise, MHz clocks used in the formula will yield MFLOPS, if you need that number. Similarly divide MFLOPS by 1000 to get GFLOPS. When might you need MHz these days, you ask? Think GPU speeds.

Note that for multi-rack systems, the formula may be improved by adding the number of chassis per rack as the first term.
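To make the dimensional analysis concrete, here is the formula wrapped in a hypothetical helper function, first for the single 2.5-GHz core from the example above and then for a made-up small cluster:

```python
def system_gflops(chassis, nodes_per_chassis, sockets_per_node,
                  cores_per_socket, ghz_per_core, flops_per_cycle=4):
    """GFLOPS = chassis * nodes/chassis * sockets/node * cores/socket * GHz * FLOPs/cycle."""
    return (chassis * nodes_per_chassis * sockets_per_node *
            cores_per_socket * ghz_per_core * flops_per_cycle)

# The single 2.5-GHz core from the example above, at 4 FLOPs per cycle: 10 GFLOPS.
print(system_gflops(1, 1, 1, 1, 2.5))

# A made-up small cluster: 10 chassis, 4 nodes each, 2 sockets per node,
# 8 cores per socket, 2.5 GHz -> 6,400 GFLOPS, or 6.4 TFLOPS.
print(system_gflops(10, 4, 2, 8, 2.5) / 1000, "TFLOPS")
```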

Hope this helps.

— Dr. Mark R. Fernandez, Ph.D.

(original posting: http://dell.to/1gGPKYu )