Design and Development
Modern, complex software development environments (IDEs), interpreters and compilers make maximum use of available memory and processing resources. The Intel Nehalem chipset and processors can provide over 100GB of memory and up to 12 simultaneous threads of execution. For highly multi-threaded applications and development environments with large in-memory data sets and arrays, it is the platform of choice. With reduced code-compile-execute-test cycle times, the platform also gives more leverage to aggressive Agile, Extreme Programming and Rapid Development methodologies. It supports a potential step change in productivity and code quality. More processing power frees up valuable creative software engineering resource to focus on innovation and the uniquely value-creating aspects of the discipline.

The following processor and main memory intensive benchmarks illustrate how the Nehalem platform's memory bandwidth and multiple execution threads improve performance (all results are for a single CPU only, for comparison):
- Fritzmark - score over 35
- WPrime - under 4s
- SiSoft Sandra - Dhrystone ALU iSSE4.2, 161 GIPS, 48 MIPS
- SiSoft Sandra - Whetstone FPU iSSE3, 144 GFLOPS, 42 MFLOPS
- SiSoft Sandra - Memory Bandwidth (iSSE2), 37GB/s
- Linpack - 95 GFLOPS
- Cinebench - OpenGL 2314 CB-GFX, CPU Rendering 29776 CB-CPU
- SPECint_rate_base2006 - 255
- SPECfp_rate_base2006 - 204
At a minimum this represents a 30-40% improvement over all other platforms, and in some cases a performance improvement of over 100% (see SPEC results table for full results).

Possibly Nehalem's greatest advantage is that, with greater miniaturisation (down to 32nm fabrication in the next generation), it helps single-threaded or serially characterised workloads by offering clock speeds of as much as 5GHz or more. Our experience is that the Nehalem architecture offers improvements even at clock speeds matching those of legacy architectures (i.e. clock for clock it is able to perform more work). However, it also has significantly greater performance headroom, and with our unique designs and Cryo Boost process we are able to more than double the CPU work rate. Typically, customers with this sort of application or program will see a several-fold improvement in execution times over previous platforms, delivering real, immediate cost benefits.

Testing and Debugging
The massive amount of processing, storage and bandwidth available means that it is finally possible to simulate production environments on a single host with many virtual machines and, at the same time, execute a meaningful workload. With many virtual machines installed on the single host, each one can have its own dedicated core (or virtual core), simulating the equivalent of up to 24 powerful single-core servers in a server farm. Our relatively modest workstations are capable of meeting the workload challenge of a small data centre server farm. This means you can replicate real-world task metrics and synthetic loads with a single powerful workstation. You can emulate the concurrency of several thousand users with load-generating or injector software. Even the SAS RAID storage arrays can get close to Enterprise-level data access and transfer rates.
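The load-injection idea mentioned above can be sketched in a few lines of Python. This is a minimal illustration only, not a real injector product: `handle_request` is a hypothetical stand-in for a call to the system under test (here it simply sleeps for about a millisecond), and the function names, user counts and pool size are assumptions chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(user_id: int, seq: int) -> int:
    """Hypothetical stand-in for one request to the system under test."""
    time.sleep(0.001)  # pretend the server took roughly 1 ms to respond
    return user_id * 1000 + seq


def simulate_user(user_id: int, requests_per_user: int) -> list:
    """Issue requests one after another and record each latency in seconds."""
    latencies = []
    for seq in range(requests_per_user):
        start = time.perf_counter()
        handle_request(user_id, seq)
        latencies.append(time.perf_counter() - start)
    return latencies


def run_load(users: int, requests_per_user: int, workers: int) -> dict:
    """Run all simulated users on a thread pool and summarise the results."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_user = list(
            pool.map(lambda u: simulate_user(u, requests_per_user), range(users))
        )
    elapsed = time.perf_counter() - started
    latencies = sorted(lat for user in per_user for lat in user)
    return {
        "requests": len(latencies),
        "elapsed_s": elapsed,
        "throughput_rps": len(latencies) / elapsed,
        "median_latency_s": latencies[len(latencies) // 2],
    }


if __name__ == "__main__":
    # 200 simulated users, 5 requests each, at most 50 active at once.
    print(run_load(users=200, requests_per_user=5, workers=50))
```

In a real test the stand-in function would be replaced by a network call to the virtual machines under load, and the number of simulated users would be scaled up until the host's cores or storage become the bottleneck.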
Typically, an eight-drive solid state storage array can achieve transfer rates close to 2GB/s, with random access times under 0.1ms, while storing 2TB of data. In an Enterprise environment, a fully managed fibre channel SAN (Storage Area Network) costing several hundred times more would be humbled by this performance and would require specialist, expensive management tools and resources.

Massively Concurrent / Parallel Processing
It is now possible for all C language based software engineers to harness not only the power of the CPU but also the GPU. While a CPU runs at clock speeds of several GHz and offers six or fewer cores, the GPU runs at a more modest clock speed of 600MHz or so but offers several hundred cores. Reminiscent of the RISC vs CISC processing paradigms, the GPU is suited to very specific RISC-type workloads where a core stream of simple instructions can execute many complex mathematical or logical operations concurrently. This is what characterises the requirements of graphics and video processing, and hence the GPU has been honed to be formidable in this area.

The chipset and CPU are still required to 'host' traditional x86-based instructions, provide the work to the GPUs and perform complex instruction sets outside the scope of the GPU. Hence a balance has to be struck between the performance of the CPU and GPU to ensure the best possible performance is realised. For the right kinds of workloads, though, the GPU transforms the execution time required to complete the task.

Take an H.264 video encode, for example: the introduction of High Definition video (1080p) has hugely increased the workload required to encode each frame, increasing the time taken to encode footage. nVidia provide a platform SDK (Software Development Kit) for their GPUs, known as CUDA, that allows you to package a C-based program up for execution on the GPU across many of its cores simultaneously.
Adobe added CUDA support to video encoding in CS4 (Creative Suite 4) and reduced a six-hour encode task to 40 minutes. Pinnacle and others are now also adding CUDA support.
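To put the encode figures quoted above in perspective, the arithmetic can be worked through in a short Python sketch. Only the two durations (six hours and 40 minutes) come from the text; the function and constant names are illustrative assumptions.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """Return how many times faster the task became."""
    if before_minutes <= 0 or after_minutes <= 0:
        raise ValueError("durations must be positive")
    return before_minutes / after_minutes


def time_saved_percent(before_minutes: float, after_minutes: float) -> float:
    """Return the reduction in elapsed time as a percentage of the original."""
    return (1 - after_minutes / before_minutes) * 100


BEFORE_MIN = 6 * 60  # the six-hour encode, in minutes
AFTER_MIN = 40       # the same encode with GPU support, in minutes

if __name__ == "__main__":
    print(f"Speedup: {speedup(BEFORE_MIN, AFTER_MIN):.1f}x")
    print(f"Minutes saved per encode: {BEFORE_MIN - AFTER_MIN}")
    print(f"Time reduction: {time_saved_percent(BEFORE_MIN, AFTER_MIN):.1f}%")
```

On these figures the encode runs nine times faster, saving 320 minutes per job, or roughly 89% of the original elapsed time.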
