Biostar Racing P1: from simple to complex

In a miniature computer like the Biostar Racing P1, every megahertz counts. Compactness and low power consumption dictate this. The Intel Atom x5-Z8350 processor is right at home here, and there is no reason to expect performance records from it, especially given the poor write performance of its L1 cache.

Nevertheless, this device, which is "no longer a stick, but not yet a laptop", will still find its buyers. The key to this is the four cores of the seemingly unpretentious CPU. Is it worth pinning hopes on them?

In the previous article, we analyzed the results of cache tests performed in a single thread, which give an idea of the "isolated" performance of an individual core. What about an overall assessment of the multi-core processor? To find out, we set the Use parallel operations checkbox in the NCRB utility and perform a similar series of measurements.

Figure 1. Selecting the multi-threaded platform test scenario in the NCRB utility

Multi-threaded L1 cache test


In the Intel Atom x5-Z8350 processor, the first-level cache is a private resource of each of the four cores. This means that when processing a data block smaller than the L1 size (24 kilobytes in our example), each core uses its own cache memory and there is almost no contention for access, so we can expect performance to grow in proportion to the number of cores. The cores simply have nothing to fight over, which describes this measurement scenario quite accurately.

Figure 2. Read speed versus data block size with 4 processor cores running simultaneously; region around X = L1 size

Factors that work against this expectation include a lower ceiling for dynamic frequency boosting under the given power and thermal limits, as well as the limited share of processor time that the operating system allocates to an application in a multitasking environment.

Recall that peak performance in the single-threaded test (see “Biostar Racing P1: cold exhaust”) was just over 30 GBPS. With 4 cores we get about 107 GBPS, which is fairly close to the theoretical value of 120 GBPS.
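
The comparison with the theoretical value can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `scaling_efficiency` is made up for illustration, and the inputs are the rounded figures quoted above (the single-threaded result was "just over 30", so the real efficiency is slightly lower).

```python
def scaling_efficiency(single_gbps, multi_gbps, cores):
    """Return (ideal multi-core throughput, fraction of it actually reached)."""
    ideal = single_gbps * cores          # perfect linear scaling
    return ideal, multi_gbps / ideal

# Rounded L1 read figures from the text: ~30 GBPS on one core, ~107 GBPS on four.
ideal_gbps, efficiency = scaling_efficiency(30, 107, 4)
print(f"ideal: {ideal_gbps} GBPS, reached: {efficiency:.1%} of ideal")
```

With these inputs the four cores reach roughly nine tenths of the ideal figure, which is what "fairly close" means here.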

Figure 3. Write speed versus data block size with 4 processor cores running simultaneously; region around X = L1 size

When studying L1, the important part of the graph is the left side, which corresponds to blocks of up to 24 KB. Here we see two performance regions: a fast one for small transactions (more than 105 GBPS) and a slow one for data that is larger than 6.4 KB but still fits into the L1 cache. The first is clear: as in the read test, it is close to 120 GBPS, four times the single-core value. But why does writing to L1 once again show a performance drop? We can only guess.

Probably Intel's engineers, designing an economical version of the processor, shifted the focus of data caching from L1 to L2. Instruction caching at the first level is still effective, and the Atom x5-Z8350 does fine there. With resources scarce, the processor avoids spending static memory lavishly on data streams and relies more on the capabilities of the second cache level.

This brings to mind the common approach to building a load profile for real-time transaction processing. The generally accepted standard is a read-to-write ratio of 70% to 30%. The volume available for "fast" writes roughly corresponds to the share of the L1 cache that remains after reads take theirs. Can we assume on this basis that Intel is targeting Atom processors specifically at streaming workloads, for example media content?
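
The rough correspondence can be checked numerically. The sketch below assumes that the 70/30 ratio is applied directly to the 24 KB L1 size; that is this illustration's reading of the argument, not a documented Intel design rule. The 6.4 KB boundary is the value read from the write graph.

```python
L1_KB = 24.0          # L1 data cache size per core, from the text
WRITE_SHARE = 0.30    # write share in the 70/30 load profile
FAST_WRITE_KB = 6.4   # observed end of the fast write region

# Space a 30% write share would occupy in L1 (assumption of this sketch).
write_budget_kb = L1_KB * WRITE_SHARE
# Share of L1 that the observed fast write region actually covers.
observed_share = FAST_WRITE_KB / L1_KB

print(f"30% of L1: {write_budget_kb:.1f} KB")
print(f"fast write region: {FAST_WRITE_KB} KB = {observed_share:.1%} of L1")
```

The two numbers are of the same order (about 7 KB against 6.4 KB), which supports the word "roughly" and nothing stronger.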

Clearly, the processor's restraint in write caching pays off when the data just written is not accessed again: caching "unnecessary" data clogs the memory and displaces "necessary" data from it. Writes to memory performed while unpacking media content are, at first glance, exactly the kind of operation that is not worth caching. Conversely, when previously written data is accessed again, skipping the cache costs performance.

Multi-threaded L2 cache test


The second-level cache, with a total size of 2 megabytes, is divided into two equal parts of 1 MB, each of which serves a group of two cores. This means that in the multi-threaded test each core gets 512 kilobytes of L2 cache, as opposed to 1 megabyte in the single-threaded test. Consequently, on the plot of block processing speed versus block size, the inflection point should be expected near X = 512 KB rather than X = 1024 KB, as was the case in the single-threaded test (see “Biostar Racing P1: cold exhaust”). These topological features of the L2 cache also affect how the access speed scales.
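
The per-core L2 share follows directly from the topology described here. A minimal sketch, assuming the cache of a two-core group is split evenly between the threads active in that group (the helper name `l2_share_kb` is hypothetical):

```python
L2_TOTAL_KB = 2048   # total L2 size, from the text
CORE_GROUPS = 2      # two groups of two cores, each with its own 1 MB

def l2_share_kb(active_threads_in_group):
    """L2 capacity available to one thread, assuming an even split."""
    group_kb = L2_TOTAL_KB // CORE_GROUPS
    return group_kb // active_threads_in_group

single_thread_kb = l2_share_kb(1)   # one thread has the whole group cache
multi_thread_kb = l2_share_kb(2)    # both cores of the group are busy

print(f"expected inflection, single-threaded: X = {single_thread_kb} KB")
print(f"expected inflection, multi-threaded:  X = {multi_thread_kb} KB")
```

In practice the split between two competing cores need not be exactly even, so the inflection on a measured graph can be smeared around this value.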

Figure 4. Read speed versus data block size with 4 processor cores running simultaneously; region around X = L2 size

L2 performance is characterized by the section of the graph that satisfies the double inequality 24 KB < X < 512 KB, which corresponds to a data block that no longer fits into L1 but still fits into L2.

Figure 5. Write speed versus data block size with 4 processor cores running simultaneously; region around X = L2 size

Recall that the L2 read speed in the single-threaded test is about 11.5 GBPS. With scaling it reaches about 39 GBPS. Not bad! The L2 write speed in the single-threaded test is about 12 GBPS. With scaling it reaches about 31 GBPS.
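
To put these L2 numbers side by side, the speedup relative to a single core can be computed for both operations. This is a sketch over the approximate figures quoted above; the dictionary layout and variable names are illustrative only.

```python
CORES = 4

# Approximate L2 throughput in GBPS from the text: (single-threaded, 4 threads).
l2_results = {
    "read": (11.5, 39.0),
    "write": (12.0, 31.0),
}

speedups = {}
for operation, (single, multi) in l2_results.items():
    speedups[operation] = multi / single   # ideal value would be CORES
    print(f"L2 {operation}: x{speedups[operation]:.2f} of a possible x{CORES}")
```

Reads scale noticeably better than writes, and both fall further short of the fourfold ideal than the L1 read test does, which is consistent with each pair of cores sharing one L2 block.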

In lieu of a conclusion


We can state that the platform under study shows a good level of multi-threaded performance. The architecture of the Intel Atom x5-Z8350 processor, with its private L1 caches and partially shared L2, is reflected in the benchmark results, as expected.

Figure 6. CPU utilization monitoring in Windows 10: the moment the core load rises to 100 percent corresponds to the start of the test

When the multi-threaded test is running, the load on each of the four processor cores rises to 100 percent. What happens to the temperature and power consumption?

Figure 7. Temperature and power consumption monitoring with the AIDA64 utility

The result was obtained with the popular information and diagnostic utility AIDA64 about 20 minutes after the start of the multi-threaded NCRB test.

Important warning


If you try to repeat these experiments on your own computer, back up your data first and check the effectiveness of the processor cooling system, the reliability of the power supply, and the Vcore voltage regulator. A stress test can damage an overclocked or unstable system. Best of all, experiment on company-issued equipment.

Source: https://habr.com/ru/post/415451/

