Biostar Racing P1: cold exhaust

We were not the first to notice that compact computers like the Intel Compute Stick fall short on performance. So when we got acquainted with a similar device from Biostar, our expectations were not the most optimistic. Like the lower-end stick computers, the Racing P1 runs on one of the weakest processors in the Atom Z8000 family. Still, the x5-Z8350 chip chosen by Biostar is at least one step up in performance from its lower-end sibling. Let's try to evaluate the performance of this platform, which, thanks to Biostar's efforts, is no longer a stick but, truth be told, is not a laptop either.

Figure 1. The front panel fits USB 3.0 / 2.0 connectors, an SD card slot, backlight pins, a headphone jack and the power button

The tools used are NCRB (NUMA CPU and RAM Benchmarks) for Win64 and JavaCPUID, a cross-platform processor identification utility.

CPU

The CPUID instruction confirms that the Biostar Racing P1 platform is built on the Intel Atom x5-Z8350 processor. Its nominal frequency is 1.44 GHz, which does not prevent it from legitimately boosting to 1.92 GHz when necessary. Even a brief acquaintance with the platform reveals a paradox: operation in the range from 1.44 to 1.92 GHz is the rule rather than the exception.

Figure 2. Factory specifications of the Intel Atom x5-Z8350

The x5-Z8350 decides whether to run at the minimum or nominal clock frequency or to engage Turbo mode by analyzing the load and the operating temperature. The SDP (Scenario Design Power) envelope defines the typical power consumption of the device. The control mechanisms assess the situation on their own and, under a "light" load, reduce the power consumption of the chip. Whether Turbo mode can be engaged depends on temperature, so the results of summer and winter tests may differ. In short, the Racing P1 also "changes its tires" for the season.

Going beyond the scope of the study, we note that in afterburner (Turbo) mode the system draws up to 7 W from the ~220 V AC mains. In cruise mode the Racing P1 draws about half of that, and sleep mode requires just over 2 W of AC power (consumption was monitored with an ordinary household wattmeter).

Figure 3. CPUID report on Intel Atom x5-Z8350 functionality

The Intel Atom x5-Z8350 works with data at most 128 bits wide. The modern AVX 256/512 extensions are not supported. This means that our measuring tool will be the set of 128-bit SSE vector instructions, and the objects of measurement will be the cache memory and dynamic RAM.

Figure 4. System information window and test mode selection in the NCRB utility: the menu on the left lists instruction sets, including the extensions supported by the processor.

An important digression is appropriate here: in general, the maximum operand width does not guarantee maximum performance. For example, a number of AMD processors for sockets up to and including AM2 execute two 64-bit loads with classic MOV instructions faster than a single 128-bit SSE load with a MOVAPD instruction. With this in mind, we verified experimentally before the measurements began that using SSE is indeed the most productive scenario for the Atom x5-Z8350.

L1 cache

Usually the cache size is a power of two, and at the first level manufacturers tend to split it evenly between instructions and data. The x5-Z8350 architecture follows neither of these conventions: each of its four cores has 32 kilobytes of instruction cache and 24 kilobytes of data cache.

Figure 5. Cache level classification

A number of sources quote the per-core cache size multiplied by the number of cores, which gives more impressive figures: 128 KB of instruction cache and 96 KB of data cache. The official product page is traditionally silent about the L1 cache, at least at the time of this writing.
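The arithmetic behind both sets of figures can be checked with a short sketch. It only uses the per-core sizes and core count quoted in the text; the function names are illustrative and do not come from any tool mentioned in the article.

```python
# Per-core L1 sizes and core count as quoted in the article.
CORES = 4
L1_INSTRUCTION_KB = 32
L1_DATA_KB = 24


def aggregate_kb(per_core_kb: int, cores: int = CORES) -> int:
    """Per-core cache size multiplied by the number of cores."""
    return per_core_kb * cores


def is_power_of_two(value: int) -> bool:
    """True if value is a positive power of two (1, 2, 4, 8, ...)."""
    return value > 0 and (value & (value - 1)) == 0


if __name__ == "__main__":
    print("L1I total:", aggregate_kb(L1_INSTRUCTION_KB), "KB")
    print("L1D total:", aggregate_kb(L1_DATA_KB), "KB")
    print("24 KB is a power of two:", is_power_of_two(L1_DATA_KB))
```

Keep in mind that the aggregate numbers are only a way of presenting the specification: a single core still sees just its own 32 KB and 24 KB.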

Note that a zero-level cache (similar to the L1 Trace Cache), which stores decoded instructions and improves the efficiency of short loops, is not reported by the CPUID instruction. Verifying its presence and analyzing its performance deserve a separate publication.

Theory and Practice: cache performance

The speed of the cache memory is measured by repeatedly reading or writing a block that is smaller than the cache level under investigation, so that data accesses result in cache hits. In effect, the target object (L1 cache, L2 cache or DRAM) is selected by the size of the data block being processed.
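As a rough illustration of how the block size selects the object under test, here is a minimal sketch. It assumes the sizes given in this article for one x5-Z8350 core (24 KB of L1 data cache, 1 MB of L2 available to a core); the function name and the handling of the exact boundary values are assumptions made for the example, not the logic of NCRB.

```python
KB = 1024
MB = 1024 * KB

# Sizes taken from the article for a single x5-Z8350 core.
L1_DATA_SIZE = 24 * KB   # per-core L1 data cache
L2_SIZE = 1 * MB         # L2 available to a core within its two-core group


def target_level(block_size: int) -> str:
    """Return the memory level a block of this size is expected to exercise.

    Idealized model: a block that fits in a level hits in that level.
    Real curves are blurred near the boundaries.
    """
    if block_size <= 0:
        raise ValueError("block size must be positive")
    if block_size <= L1_DATA_SIZE:
        return "L1"
    if block_size <= L2_SIZE:
        return "L2"
    return "DRAM"


if __name__ == "__main__":
    for size in (16 * KB, 512 * KB, 4 * MB):
        print(f"{size // KB:5d} KB -> {target_level(size)}")
```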

Having chosen the object under test, we turn to the operation at the level of machine instructions. In our experiment we use an unrolled loop of sixteen MOVAPD (SSE2) instructions, each of which transfers a 128-bit operand between memory and one of the XMM registers. As a result, all 16 registers, XMM0 ... XMM15, are fully loaded in one iteration of the loop.

For completeness, we note that the MOVAPD instruction can also transfer data between two XMM registers, but in our case register-to-register operations would say nothing about the performance of memory objects. Maximum performance is ensured by the alignment requirement of the MOVAPD instruction: the operand address must be a multiple of 16 bytes (128 bits).
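The alignment rule and the amount of data moved per loop iteration reduce to simple integer arithmetic. The sketch below is only an illustration of that arithmetic in Python; the helper names are hypothetical, and the real benchmark does this work in machine code.

```python
XMM_REGISTERS = 16    # XMM0 ... XMM15, one MOVAPD per register
OPERAND_BYTES = 16    # 128-bit operand


def is_movapd_aligned(address: int) -> bool:
    """MOVAPD memory operands must sit on a 16-byte boundary."""
    return address % OPERAND_BYTES == 0


def align_up(address: int) -> int:
    """Round an address up to the next 16-byte boundary."""
    return (address + OPERAND_BYTES - 1) & ~(OPERAND_BYTES - 1)


def bytes_per_iteration() -> int:
    """Data moved by one pass of the unrolled sixteen-instruction loop."""
    return XMM_REGISTERS * OPERAND_BYTES


if __name__ == "__main__":
    print(hex(align_up(0x1001)), is_movapd_aligned(0x1008))
    print("bytes per iteration:", bytes_per_iteration())
```

With 256 bytes per iteration, a 24 KB block takes 96 iterations of the unrolled loop to traverse once.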

L1 cache benchmarks

As long as the block being read or written is smaller than the L1 cache (block size is on the X axis of the graph), the transfer rate is high. As soon as the block exceeds the limits of L1, cache misses occur and the speed drops. Obviously, the informative part for evaluating performance is the upper "step" on the left side of the graph.

Figure 6. Read speed as a function of data block size; vicinity of X = L1 size

The maximum speed in megabytes per second (MBPS) corresponds to the minimum number of clocks per instruction (CPI, Clocks Per Instruction) and is about 30 GBPS.

Figure 7. Write speed as a function of data block size; vicinity of X = L1 size

As can be seen from the graphs, the inflection point for L1 reads corresponds to the theoretical value of 24 kilobytes. For writes, the caching policy applied in this processor produces an "early drop" in speed, which will be the subject of a separate study. For now we can note that this policy does not favor record write speeds, although in some cases it keeps L1 from being clogged with unnecessary data.

The results show the speed achieved by a single Atom x5-Z8350 core. A number of tests, AIDA64 in particular, report the aggregate performance of all cores instead.

Let us make a small theoretical estimate of the peak bandwidth. For the CPU under investigation, the clock frequency in Turbo mode is 1920 MHz, and one 128-bit (16-byte) operand is transferred per clock:

1920 MHz * 16 bytes = 30720 MB/s (about 30 gigabytes per second)
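The same estimate can be written as a tiny sketch, which also makes it easy to see what the ceiling would be at the nominal frequency. It assumes, as the text does, exactly one 16-byte transfer per core clock; the function name is illustrative.

```python
def peak_bandwidth_mbps(core_mhz: float, bytes_per_clock: int = 16) -> float:
    """Theoretical peak in MB/s (decimal): MHz * bytes moved per clock."""
    return core_mhz * bytes_per_clock


if __name__ == "__main__":
    for label, mhz in (("Turbo", 1920), ("nominal", 1440)):
        print(f"{label:8s} {mhz} MHz -> {peak_bandwidth_mbps(mhz):.0f} MB/s")
```

At the nominal 1440 MHz the same formula gives 23040 MB/s, so a measured result near 30 GB/s is only reachable with the core running in Turbo mode.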

The TSC (Time Stamp Counter) is used as the source of reference time intervals. Since the processor core and the TSC are generally not clocked at the same frequency, the number of TSC clocks per instruction is a fractional value.

Let us verify that the processor is operating in Turbo mode, using the frequencies given in the documentation. One clock cycle at the boosted core frequency of 1920 MHz lasts approximately 0.521 nanoseconds. One cycle at the nominal frequency of 1440 MHz, at which the Time Stamp Counter register runs, lasts approximately 0.694 nanoseconds. For instructions that execute at a rate of one per core clock, the theoretical number of TSC cycles per instruction (CPI) should be

0.521 / 0.694 = 0.750

The measured Minimum CPI values displayed by the utility, in the range 0.759 ... 0.767, are quite close to this value.
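The check above can be reproduced, and also inverted to see what bandwidth the measured CPI values imply. The sketch assumes the TSC ticks at the nominal 1440 MHz and that each instruction moves 16 bytes, as stated in the text; the function names are illustrative, and the derived bandwidth figures are a calculation, not additional measurements.

```python
TSC_MHZ = 1440.0      # Time Stamp Counter runs at the nominal frequency
TURBO_MHZ = 1920.0    # core frequency in Turbo mode
OPERAND_BYTES = 16    # one MOVAPD moves 128 bits


def expected_tsc_cpi(core_mhz: float, tsc_mhz: float = TSC_MHZ) -> float:
    """TSC clocks per instruction when one instruction retires per core clock."""
    return tsc_mhz / core_mhz


def bandwidth_from_cpi(tsc_cpi: float, tsc_mhz: float = TSC_MHZ) -> float:
    """Bandwidth in MB/s implied by a CPI value expressed in TSC clocks."""
    return OPERAND_BYTES * tsc_mhz / tsc_cpi


if __name__ == "__main__":
    print(f"expected CPI in Turbo: {expected_tsc_cpi(TURBO_MHZ):.3f}")
    for cpi in (0.759, 0.767):
        print(f"CPI {cpi} -> {bandwidth_from_cpi(cpi):.0f} MB/s")
```

Under these assumptions the measured range corresponds to slightly more than 30000 MB/s, a little below the 30720 MB/s ceiling. A CPI near 1.0 would instead indicate a core running at the nominal frequency.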

L2 cache

The four cores of the processor under study are divided into two groups of two cores each. The total size of the L2 cache is 2 MB, split equally between the groups. The conclusion is obvious: each core has access to 1 megabyte of L2 cache, which it shares with its neighbor in the group.
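This layout can be modeled in a few lines. The sketch assumes that cores are numbered 0 to 3 and that consecutive numbers belong to the same group; that numbering is hypothetical and chosen only for illustration, since the article does not specify how the cores are enumerated.

```python
TOTAL_L2_KB = 2048     # 2 MB of L2 in total
CORES = 4
CORES_PER_GROUP = 2


def l2_per_group_kb() -> int:
    """L2 capacity available to each two-core group."""
    groups = CORES // CORES_PER_GROUP
    return TOTAL_L2_KB // groups


def l2_group_of(core: int) -> int:
    """Group index of a core (assumed numbering: cores 0-1 and 2-3)."""
    if not 0 <= core < CORES:
        raise ValueError("core index out of range")
    return core // CORES_PER_GROUP


def shares_l2(core_a: int, core_b: int) -> bool:
    """True if the two cores use the same 1 MB L2 slice."""
    return l2_group_of(core_a) == l2_group_of(core_b)


if __name__ == "__main__":
    print("L2 per group:", l2_per_group_kb(), "KB")
    print("cores 0 and 1 share L2:", shares_l2(0, 1))
    print("cores 1 and 2 share L2:", shares_l2(1, 2))
```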

L2 cache benchmarks

The L2 cache speed is the middle "step", which appears when the double inequality 24 KB < X < 1 MB holds, that is, when the data block being processed no longer fits in L1 but still fits in L2.

Figure 8. Read speed as a function of data block size; vicinity of X = L2 size

As can be seen from the graphs, the drop in speed caused by exhausting L2 occurs once the 1 MB limit is exceeded. No ability to "borrow" cache from the neighboring group, which would move the speed drop out to the 2 MB point, was detected.

Figure 9. Write speed as a function of data block size; vicinity of X = L2 size

The L2 cache write performance is close to its read performance: 12 vs. 11.5 GBPS. The theoretical background of this result will be considered in a following publication.

Source: https://habr.com/ru/post/413857/
