- ago
Is it a lot of work to introduce two parameters that control the memory allocation and the number of threads WL uses?
Can this be done already and I simply don't know where?

Thanks for a quick reply.
Solved
25 Answers

- ago
#1
In short yes, and no.
- ago
#2
You could install WL in a virtual machine with a RAM limit in its configuration.
MIH7
- ago
#3
Thanks to both of you for the feedback.

@Eugene
Yes, that's possible. This isn't the first time you've mentioned using VMs in the context of WL.
Could it be that you use WL this way a lot? What are the main benefits for you of setting up WL in a VM?
Is it because you develop the software, or is it simply a matter of taste?

@Both
Is there a reason why the number of threads is limited?
I hoped I could speed things up by using more cores/threads, depending on the task WL is working on.
- ago
#4
It just isn’t something anyone’s requested before. We use C#’s built-in parallelization support when appropriate, which abstracts away these lower-level details.
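As a hedged illustration (not WL's actual code): in .NET, the degree of parallelism can be capped by passing `ParallelOptions.MaxDegreeOfParallelism` to `Parallel.For`/`Parallel.ForEach`. The same idea in a minimal Python sketch, where the thread cap is exposed as a user-facing parameter (the "backtest" below is a dummy workload):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def run_backtest(symbol):
    # Placeholder for one independent backtest run (dummy workload).
    return symbol, sum(ord(c) for c in symbol)

def run_all(symbols, max_workers=None):
    # Default to the machine's core count; a user-facing setting could
    # lower this to leave headroom for other work.
    max_workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_backtest, symbols))

print(run_all(["AAPL", "MSFT", "SPY"], max_workers=2))
```

This is exactly the kind of knob the original question asks for: one integer that bounds how many tasks run concurrently.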
MIH7
- ago
#5
Running individual backtests is not an issue, after all.

I was thinking of the development tools like the Strategy Evolver and the other evolver tools. Even computationally intensive tools like Monte Carlo analysis could complete more runs in the same amount of time. So there are lots of neat little places where more computing power is useful.

In the end, it can also make a difference whether you have to wait 4 hours instead of 16 or 32.

Now, I don't want to belabor this topic. I was just curious, because I see high potential in the software's many small applications.

Thanks for your feedback.
- ago
#6
You’re 100% right, and we already run the tools you mentioned in a parallel mode to take advantage of multiple cores.
Best Answer
- ago
#7
QUOTE:
Is there a reason why the number of threads is limited?

Yes, there is. Even if one has a Xeon processor with 8 cores, he may only want to use 3 or 4 of them; otherwise, he runs out of on-chip L3 processor cache. Once that happens, adding more cores will only slow processing down because your cache hit ratio falls dramatically.

For WL, you want to purchase a processor chip with as much on-chip cache as possible. The number of cores you have isn't that important, and using too many cores will kill your cache hit ratio.
MIH7
- ago
#8
Well, I can't estimate how WL utilizes the hardware resources.

I can only say that I run and program high-performance software like chess engines. This kind of software tests the limits of the hardware in any case. The interaction of processor, cache and RAM depends, of course, on the application.

A workstation like a Threadripper with 16 or 32 cores should have no problems using more than 8 of them. Such a system is also balanced so that the individual components can be fully utilized; the whole machine is designed for high-performance work.

I'm sure I could save a lot of time if the upper limit were raised. The tools WL offers practically demand it.
- ago
#9
QUOTE:
a Threadripper with 16 or 32 cores should have no problems using more than 8 of them. Such a system is also balanced so that the individual components can be fully utilized.

That generalization is not true. Rather, it's a function of the application and its resource requirements. WL requires a great deal of memory, and if 3 or 4 cores consume all the available cache for the problem, then adding more cores will just reduce the hit ratio of the on-chip cache.

Each cache level (L1, L2, L3) is roughly 5-7 times faster than the level below it, so whenever there's a cache miss, the speed penalty is a factor of 5 to 7.

Intel offers several configurations of processor chips. You want to use an i7 or better (a Xeon perhaps) so you maximize on-chip cache size at the expense of the number of cores. One processor chip configuration does not fit all applications! Who told you it could?
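The effect of the hit ratio on average access cost can be sketched as a back-of-envelope calculation. The latency numbers are purely illustrative, using the 5-7x miss factor mentioned above:

```python
# Average cost per access with a single cache level.
# miss_penalty ~ the 5-7x factor mentioned above (illustrative only).
def avg_access_time(hit_ratio, hit_cost=1.0, miss_penalty=6.0):
    return hit_ratio * hit_cost + (1.0 - hit_ratio) * miss_penalty

for h in (0.95, 0.80, 0.50):
    print(f"hit ratio {h:.0%}: {avg_access_time(h):.2f}x base cost")
```

Even a drop from a 95% to an 80% hit ratio roughly doubles the effective access cost in this model, which is the mechanism behind the "more cores can be slower" argument.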

---
What would be worthwhile is to recognize that different WL users have different problem sizes (memory footprints) and different hardware configurations, so having a way for individuals to adjust for these differences may be worthwhile. There's a feature request along these lines. Did you vote for it? https://www.wealth-lab.com/Discussion/Symbol-by-Symbol-Optimization-improve-speed-6998
MIH7
- ago
#10
Hi superticker, I quote myself:

QUOTE:

The interaction of processor, cache and RAM depends, of course, on the application.


QUOTE:

Well, I can't estimate how WL utilizes the hardware resources.


Those two points express very well that we do not disagree.
This feature request is simply about making it configurable.

It is simply not correct to generalize that using more cores will not increase performance. That depends on how the software uses its algorithms and resources. In addition, the tools offered can differ greatly even within these requirements.

To judge whether it helps or not, you need to know the software internals. Do you know the source code?
If you don't, only the development team can make definitive statements in advance.

However, it is also certain that additional resources, such as extra cores, will certainly not lead to an increase in performance if they are not utilized.

Well, you may be right, but this can only be answered concretely.

Have a nice day!

Edit:
Out of curiosity, can you refer to a model that is listed here?
https://www.cpubenchmark.net/high_end_cpus.html
MIH7
- ago
#11
@superticker

Another comment. Chess engines also have very high cache pressure, or do you really think it is lower? In general, this type of application can utilize all hardware resources at 100% simultaneously. All levels, from processor, cache, RAM and cores down to fast disk accesses, are exhausted.

The limitation is usually caused by the application. However, I do not think that WL has reached its limits in the range of 8 cores. If that were the case, the software's resource usage could definitely be optimized. I also don't think WL has higher or special hardware/memory requirements compared to other applications (that I use).

However, I believe Glitch that the possible adjustments are limited with "C#'s built-in parallelization support", which I can't judge, though. And optimizing the thread and memory management individually for each tool (e.g. Evolver, MC analysis ...) is probably too complex and would complicate the code base.
- ago
#12
QUOTE:
the possible adjustments are limited with "C#'s built-in parallelization support"

There's no software limit on how many threads the OS (Windows) can support (within reason, of course). There are only hardware limits. For WL, it's no secret that the memory requirement is tremendous for each task, so a couple of tasks consume all the hardware memory resources. And there's no way around that--unfortunately. (I wish there were.)

Some problems, like neural networks, have numerous computation "nodes," but each node uses little memory. Applications like this do lend themselves to massively parallel computation. There was a feature request to support this type of configuration (a CUDA-core GPU accelerator) for NeuralLab, but it was rejected because few WL users have this kind of massively parallel hardware. For neural networks, though, such a configuration would speed things up significantly.

You sound like someone interested in parallel hardware accelerators. Why don't you take a look at the CUDA core programming model? https://www.techcenturion.com/nvidia-cuda-cores/
MIH7
- ago
#13
The physical limit is given by the hardware.

The limitation in the software is how the resources are used.
That said, it is not about what a software could do, but about what it does.

Especially in the context of chess AI, I have been working for years with neural networks, genetic algorithms, Monte Carlo algorithms, optimization methods, machine learning and classical AI. From high-level optimization and the development of better algorithms down to assembler optimization, everything was there at one time or another.

Since AlphaZero, Leela Chess Zero and Stockfish, NNs are also used in high-performance chess software and optimized accordingly. You can take a look at the design of this software; it's all open source. Especially in this area, graphics cards have of course become highly interesting in recent years, because they combine the mentioned approaches with MC methods.

This area is simply all about optimizing software and algorithms for the best possible use of the hardware resources.

So, it's quite simple. If a piece of software cannot use more than 8 cores effectively because of its memory management, then the limitation is caused by the software (its implementation).

So no, WL is not exceptional in terms of memory requirements or compute-intensive applications, not even close.

The fact that a modern workstation cannot be utilized to full capacity is certainly not in the nature of the available applications. It is the resource management in the software itself.

The memory load simply must not lead to additional cores reducing overall performance (which is what you think would happen).

QUOTE:

... so a couple of tasks consume all the hardware memory resources. And there's no way around that--unfortunately. (I wish there were.) ...


Of course, you (the developers) can change that. It's just a question of how much effort it takes and what benefit it brings. I'm sure the backlog is full and new ideas are added daily. At some point, it is simply a question of priorities.

By the way, what is your background? You at least seem to be interested in hardware, or in its interaction with software.

@WL team: Please don't be confused by this discussion. I have already accepted that this feature request is currently not an option. It's just about talking to superticker.
- ago
#14
I ran an experiment yesterday while running a WL8 Build 9 Optimization of a strategy.

I used Windows 10 Resource Monitor to control the number of threads in the first test.
The second test was to just let the Optimization run without any control on the number of threads.
I used the Windows 10 stopwatch function to measure the time required to complete each Optimization.
The same code and security symbol were used for each run.

The Optimization had 7840 permutations and covered a two-week time frame on 5-minute bars for a single security symbol.

For the first test I cancelled the Optimization when 140 threads showed in Windows 10 Resource Monitor.
I then restarted the Optimization when the thread count dropped below 80.

The second test was to just let the Optimization run without interference.

The time to complete the first test Optimization was 49 minutes and 57 seconds.
The time to complete the second test Optimization was 2 hours, 58 minutes and 34 seconds.
MIH7
- ago
#15
Hi. I'm not sure what conclusion you draw from your test.

The following points stand out. If the number of logical threads (140) is greater than the number of physical cores/hyperthreads, performance will decrease. That is a problem. It may be reinforced if memory is allocated per thread and in total exceeds the physically available memory; that leads to swapping, a real performance killer.
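The swapping point can be made concrete with a small sketch: cap the worker count by both the core count and the per-task memory footprint, so that total allocation never exceeds physical RAM. The memory figures below are hypothetical:

```python
import os

# Sketch: choose a worker count bounded by both cores and RAM, so that
# total per-task memory never exceeds physical memory (which would swap).
# The memory figures used below are hypothetical.
def safe_worker_count(ram_bytes, per_task_bytes, cores=None):
    cores = cores or os.cpu_count() or 1
    fits_in_ram = max(1, ram_bytes // per_task_bytes)
    return min(cores, fits_in_ram)

# 32 GB of RAM, a hypothetical 6 GB per optimization task, 16 cores:
print(safe_worker_count(32 * 2**30, 6 * 2**30, cores=16))  # 5
```

With those numbers only 5 workers fit in RAM, so launching all 16 would force the OS to swap, which is the performance killer described above.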

You speak of 7840 permutations and probably mean 7840 combinations. As far as I remember, the arrangement of the parameters to be optimized does not play a role.

Here is an example from my field that I looked at for comparison. Classical evaluation modules in chess programming use several hundred or thousand individual parameters. A single piece-square table with 64 values, where the range of each value can be e.g. 4K (-2K to 2K), thus has 4K^64 states. So I'm only talking about 64 values, not the thousands.

In this context, there are optimization methods based on the gradient descent algorithm in different variants (stochastic, Adam, AdaGrad ...). Now compare the numbers.

Back to the 7840 combinations: to find an optimum, these combinations have to be compared based on the available data. Let's say 20K bars are used as a sample. Of course, we do not forget the data used for indicators and so on.

Back to the 4K^64 combinations: I personally use 120 million chess positions as a database to tune tables like this. It can be much more complex. Here, too, there is associated data, like the indicators at the bars, and the data needed to represent a chess position is a lot bigger than for a bar.

Approximating a global (not merely local) optimum takes minutes on a single core.
There are no speed issues caused by the algorithm, memory, or computing power.
In my estimation, the WL optimization problem described could be solved in less than a minute.

WL does not need to be that performant, but this does suggest that the algorithms used are not optimal, which is OK for this kind of application.
In addition, there is the noticeable behavior of the memory resources and the limitation on cores. In the long run this should be improved.

When working with C#, which is higher-level than C, there may be libraries that improve the algorithms. But working with available or built-in solutions brings a different kind of limitation on the other side. I am sure that resource management will remain a topic the more the optimization and evolver methods come into play.
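The contrast drawn above, between scanning every combination and gradient-based tuning, can be sketched in a few lines. The objective here is a toy one-parameter function, not a WL metric:

```python
# Toy objective with its minimum at x = 3 (stand-in for a fitness metric).
def objective(x):
    return (x - 3.0) ** 2

def gradient_descent(x0, lr=0.1, steps=100):
    # Walk downhill using the analytic derivative of the toy objective.
    x = x0
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)
        x -= lr * grad
    return x

# Exhaustive scan over a coarse grid vs. a short gradient run:
grid_best = min((objective(x / 10), x / 10) for x in range(0, 100))[1]
gd_best = gradient_descent(x0=0.0)
print(grid_best, round(gd_best, 4))
```

The grid scan needs one evaluation per candidate (and its resolution is fixed in advance), while gradient descent converges in a handful of cheap steps; with thousands of parameters the gap becomes astronomical, which is the point of the chess-tuning comparison. Gradient methods do assume a reasonably smooth objective, which not every trading metric provides.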
- ago
#16
Two conclusions:
1. Manually controlling the number of threads reported in Windows Resource Monitor resulted in an Optimization execution time of less than one third of the execution time of just letting the Optimization process run.
2. An indication that the execution time of the Optimization process might be substantially reduced by programmatic control rather than using a brute force manual control like I did.
- ago
#17
QUOTE:
2. An indication that the execution time of the Optimization process might be substantially reduced by programmatic control rather than using a brute force manual control like I did.

One needs to limit the number of threads (or cores); otherwise, one overwhelms the processor's caching system, which results in big speed penalties. One needs to understand how the cache controller works internally to appreciate why. This is a hardware limitation based on a particular application's resource requirements.

I tried to explain why this is a problem with the processor (in layman's terms), but I clearly failed in my efforts. And recognizing that--I'm done commenting.
MIH7
- ago
#18
Hello superticker.

Thanks for coming back again.

QUOTE:

One needs to limit the number of threads (or cores); otherwise, one overwhelms the processor's caching system, which results in big speed penalties. One needs to understand how the cache controller works internally to appreciate why. This is a hardware limitation based on a particular application's resource requirements.


Please explain in more detail.

1. What is the particular part of the application that is so unique?
2. Did you read posts #15 and #13? If yes, what makes WL so different in its requirements, or what is wrong with the software comparisons?
3. The cache controller is only one link in the chain. My guess is that it isn't the only bottleneck in the WL context.

You obviously know what you're talking about. I would not test performance the way colotrader did, but I wouldn't stop a discussion over what he thinks and did.


MIH7
- ago
#19
@superticker, here is a concrete example out of countless others (still recommending to read posts #15 and #13).
I'm still not sure whether you noticed that different people posted in this discussion.

Example

A hashtable, also called a transposition table, is too large to fit in any cache, and accesses to DRAM are extremely expensive in comparison. (We agree on that.)

If a hash entry has two slots of the same datatype, you might want to access both slots. If you probed the two slots with two separate accesses, this would double the cost of the DRAM access.

So a software solution is to align the entry size to 64 bytes (each slot 32 bytes). Because both slots then sit in the same memory block (cache line), the second probe is effectively free: only one fetch is needed.

Another point is how entries are accessed. The index of an entry is computed from a hash number that corresponds to the data. This can be done via a modulo operation, but it has been reported that using a modulo operation instead of ANDing can cost up to 30% in speed. So a software solution is to make the table size a power of two, enabling the use of ANDing.
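Both tricks can be sketched in a few lines (illustrative, not WL code). With a power-of-two size, `h % SIZE` equals `h & (SIZE - 1)`, so a cheap AND replaces the modulo; and packing two slots into one cache-line-sized entry means the second slot comes along with the first fetch:

```python
SIZE = 1 << 16          # power-of-two number of entries
MASK = SIZE - 1

# Two slots per entry, standing in for a 64-byte entry with two
# 32-byte slots that share one cache line.
table = [[None, None] for _ in range(SIZE)]

def probe(h):
    entry = table[h & MASK]   # one lookup brings in both slots
    return entry[0], entry[1]

# AND and modulo agree for any hash value when SIZE is a power of two:
for h in (0, 123456789, 2**63 - 1):
    assert h & MASK == h % SIZE
```

In Python the two index computations cost about the same; in C or C++ the AND avoids an integer division, which is where the reported ~30% comes from.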

Summary

In general terms, the example shows that it is a question of the algorithm and the implementation, but also a combination of memory and processor operations. The performance improvements provided by a better software implementation are the key. Regardless of the hardware and the available cache, the improvements will prove useful on any machine, even with different hardware configurations.

I'm sure you can map this onto the WL software. I'm convinced that with an improved implementation, reducing the number of threads is not necessary. Conversely, your assessment implies that the implementation is not good. Having to use fewer threads would be a limitation on further progress, not a solution.

QUOTE:

One needs to limit the number of threads (or cores); otherwise, one overwhelms the function of the processor's caching system, ...


A different point, in layman's terms ...

If multiple threads run on the same core and share its cache, it may not be a problem at all, because the same cache locations are accessed very often. The problem is unnecessary memory accesses that overwrite the cache again and again. Shared data can increase the hit rate ... which is a matter of the application's implementation.

... you need to look at the details of how the threads are used. Your judgment about the solution is too vague.

The cache should enable CPU cores to work faster despite the latency of main-memory access. In this sense, it is a contradiction to reduce the number of CPUs. One should not limit the number of CPUs, but free up resources through better data management.
- ago
#20
Just to be clear, the "threads" I am referring to are those in Windows Resource Monitor;
there is a CPU tab with a column heading "Threads" in Windows Resource Monitor.
I believe one can monitor this in C#, and I think that controlling the number of these "threads" can be used to speed up the optimization process in WL8.
- ago
#21
QUOTE:
1. What is the particular part of the application that is so unique?
2. Did you read posts #15 and #13? If yes, what makes WL so different in its requirements, or what is wrong with the software comparisons?
3. The cache controller is only one link in the chain. My guess is that it isn't the only bottleneck in the WL context.

1) Already answered above. Each WL task (and core) requires a significant amount of memory to execute, and all tasks compete for the same cache memory system, creating a bottleneck. Now if we were talking about an N-CUBE supercomputer, this wouldn't be a problem, because each processor on the N-CUBE has independent memory resources.

When I (or you) buy a motherboard for a workstation, it comes without a processor. That lets me pick a processor chip that maximizes cache size while minimizing the number of cores taking up (limited) chip real estate. For running the WL application, that's what everyone should be doing.

2) Those issues aren't going to affect the hardware limitations, which are the main concerns during WL optimization.

3) Yes, the caching system is only one link in the chain. But for multicore execution of the optimizer, it's the weakest link, which makes it the rate-limiting bottleneck. As I said, if you were running WL on an N-CUBE supercomputer (which does not have a shared caching system), this wouldn't be an issue.

For non-multicore execution (i.e. not doing parallel optimization), the "other factors" you mention would play a greater role. Caching shouldn't be a bottleneck for single-core execution of WL.
MIH7
- ago
#22
QUOTE:

1) Already answered above. Each WL task (and core) requires a significant amount of memory to execute, and all tasks compete for the same cache memory system, creating a bottleneck.


What you call a significant amount of memory, I call dust in the wind.

I have now pointed out several times that there are applications whose memory, cache and processor requirements are many times higher than those of WL. These applications simply utilize the resources much better, regardless of whether they run single-core or multithreaded. Do you think that's wrong?

Of course, you can focus on the cache when selecting hardware. Nobody disputes that. But in precisely that case, cache-optimized software becomes even more efficient. The two are not mutually exclusive! Do you think they are?

Unfortunately, you didn't respond to the specific examples. In my last example I showed clearly that pressure on the cache can be reduced through specific improvements in the software. You still claim that this is not possible. Why?

You haven't given a concrete example showing why memory management in WL can't be improved. Why?

The reason different memory (cache) levels exist is to better utilize processor performance and reduce the delay of data requests to RAM. Right?

To make the best use of the cache, the data management should be designed to keep the most frequently accessed data in the cache. Right? This can be influenced by the implementation. Right?

Depending on the cache architecture, private and shared caches can have different positive and negative effects. Shared data can increase the hit rate. Right? This too can be a matter of implementation. Right?

My impression is that you are quite familiar with hardware and what it does in this context. At the same time, I think you are not really familiar with software (especially at the implementation level) that places really high demands on this topic and on the hardware. I may be wrong, so I am still interested in the background from which you make your assessment.

Anyway, thanks a lot to discuss this.
- ago
#23
QUOTE:
What you call a significant amount of memory, I call dust in the wind.

And this is why I stopped commenting. Have it your way.
MIH7
- ago
#24
You do not need to comment. I asked you a bunch of questions (#22, #19) and there is not a single answer from your side. I gave a comparison to a different field of science where applications process at least the same amount of data far more performantly (and ideas for how and why this can work). There is not one substantive statement from your side that goes beyond generalities. No details, no examples, no differentiation.

Why don't you simply prove me wrong by going through the points?

Honestly, that would be interesting. I have no problem being wrong if I get the chance to learn something.
Maybe some readers are interested too.

For example, what is "much data" for you? (Then we would be able to talk about numbers.)
I fear there won't be an answer to this concrete question either, unfortunately, as usual.

I tried to ask how much experience you have programming similar applications.
Why don't you give a simple answer? Maybe you can report from your experience. No answer so far, as usual.

You simply did not provide any content! Not even references to the examples, questions and statements.
When I think about it, that alone is a statement!
MIH7
- ago
#25
Memory access as a bottleneck

Hardware design of CPUs currently focuses heavily on optimizing caches, prefetching, pipelines and concurrency. By some estimates, about 85% of a CPU's resources go to caches, and up to 99% of its time to storing and moving data!

There are different ways to deal with this already during the development of a software. I have collected some points for the interested reader/developer. The concepts are independent of the cache architecture or the hardware in general; they can have a positive effect at all hardware performance levels.

Main concepts for cache-friendly code

CODE:
• Principle of locality
  ◦ Temporal locality: the likelihood that the same location is accessed again within a short time
  ◦ Spatial locality: placing related data close to each other
  ◦ Time: do the operations on the data in one run
• Cache blocking technique (high-performance computing)
  ◦ Rearrange data access to pull subsets (blocks) of data into cache and operate on each block, avoiding repeated fetches from main memory
• False sharing
  ◦ Use private or threadprivate data
  ◦ Pad data structures so that each thread's data resides on a different cache line
  ◦ Modify data structures so there is less sharing of data among the threads
• Thrashing: PFF (page fault frequency) algorithm
• Various coding techniques
  ◦ Exploit the implicit structure of data
  ◦ Avoid unpredictable branches (they hinder prefetching)
  ◦ Use contiguous memory
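The cache-blocking item above can be sketched as follows. Python only illustrates the traversal order; the payoff appears in compiled code, where memory access dominates. Both functions compute the same column sums, but the blocked version visits the matrix in small tiles that stay cache-resident:

```python
def column_sums_naive(matrix):
    n = len(matrix)
    sums = [0] * n
    for col in range(n):           # column-major walk: strided accesses
        for row in range(n):
            sums[col] += matrix[row][col]
    return sums

def column_sums_blocked(matrix, block=64):
    n = len(matrix)
    sums = [0] * n
    for r0 in range(0, n, block):  # visit the matrix tile by tile
        for c0 in range(0, n, block):
            for row in range(r0, min(r0 + block, n)):
                for col in range(c0, min(c0 + block, n)):
                    sums[col] += matrix[row][col]
    return sums

m = [[row * 100 + col for col in range(100)] for row in range(100)]
assert column_sums_naive(m) == column_sums_blocked(m)
```

The results are identical; only the access order (and thus the cache behavior) changes, which is exactly the point of the list above: same algorithm, better locality.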


High latency is not caused exclusively by the hardware used. How efficiently the hardware is utilized depends to a large extent on the software.

The requirements of WL in terms of data throughput or latency are certainly not exceptionally high. One should also be aware that high memory consumption does not mean the memory is used efficiently.

The proposed solution of procuring hardware (superticker's) is not justified in my opinion, at least as long as the software's memory management does not change. Things may get a bit faster, but the software can easily squander the potential performance you paid for. For systems that are already fully utilized, a purchase can make sense, even following superticker's recommendations.

Only when a software solution is sensitive to the issue (its potential practically exhausted) and one wants even more performance does it make sense to invest in specialized hardware. A good sign of this is when the computer can fully utilize all hardware resources at the same time.

However, I am convinced that the WL team will, at the appropriate time, make improvements in this regard, step by step.
