Good afternoon.
I recently invested in a new motherboard and a new processor, an Intel Xeon E5-2696 v3.
However, I see that during backtest optimization, Wealth-Lab does not use 100% of the processor's capacity.
Utilization is around 30 to 40% of the total.
Is there anything that can be done to improve Wealth-Lab's processing speed?
What optimizer are you using? This is what I see when I optimize using Exhaustive:
I use the same mode, as can be seen below.
I'm not sure then; it must be something to do with your hardware. As you can see in my screenshot, it's using all of the cores in my experience.
I'll do some research and see if there's anything we might do. I see you have more cores, so there is a clue.
Okay, thanks for the feedback.
Also, during the week I will put the second Xeon E5-2696 v3 processor to work on this board I purchased.
The power supply I currently have only feeds one processor.
As soon as I have the second processor working, I will send a screenshot of the processing and the processors' utilization.
The first screenshot I sent was from a 20-year test with approximately 1,500 stocks.
With smaller databases, with fewer stocks, utilization is much higher.
@tfritzen - I've personally done extensive testing on this (comparing i5/i7/Xeon processors and their performance).
In my experience, what you are seeing is "expected behaviour". Expected... not optimal... and it is the result of several factors:
1) Thread Management
2) Data access (size of data, including the number of bars, complexity of indicators, number of parameters, and number of historic trades produced)
3) Buffer/on-chip memory sizes
4) Amount of shared data between sets
5) Operating system settings for load balancing of other applications
6) Affinity (the cores actively assigned for WL7 to utilize)
to name a few...
Essentially, what you are seeing is the CPU waiting for information (and therefore not calculating 100% of the time). The more information required to complete a backtest, the more wait time you will see on the utilization visualiser in the form of reduced % usage.
If you watch your CPUs (logical cores) on your Xeon machine, you may even find some processor cores completely idle during a long backtest (S&P 500 over 30 years, daily) while others are running at 30-40%. Be aware that initially all cores may fire up, but over time you might find that, due to shared data access, some cores become idle for long periods.
Your CPU has three levels of on-chip memory and prioritizes their utilization based on the data that is shared between tasks. Data required for all (or most) tasks is stored on the fastest of the three levels, the least shared on the slowest. If shared data is too large for the on-chip memory, it is allocated to RAM, then to the swap/page file on your HD if needed.
You also have the impact of other software, etc. If you run other critical applications, you can manually set CPU affinity to "lock out" those applications from the cores assigned to WL7. This is helpful if the secondary applications' tasks are also information intensive.
Without getting further into all the specifics: short runs with fewer bars (and fewer parameters) are more efficient (utilize more of the CPU) than longer runs. Short runs may consume 100% of CPU time at overclocked speeds, while longer runs with more parameters may sit as low as 20-25% at your processor's benchmark speeds. Consider the difference in read time and memory allocation between a 5-year/50-stock backtest versus 30 years/500 stocks.
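To make the "CPU waiting for data" point concrete, here is a small stand-alone C# sketch (not WL7 code, just an illustration of the memory hierarchy): it sums the same array twice, once with a cache-friendly stride and once with a large stride. Both runs do the same number of additions; the extra time in the second run is almost entirely the CPU waiting on memory.

```csharp
using System;
using System.Diagnostics;

class CacheWaitDemo
{
    static void Main()
    {
        // 64 million doubles (~512 MB): far larger than any on-chip cache.
        var data = new double[64_000_000];
        for (int i = 0; i < data.Length; i++) data[i] = 1.0;

        Console.WriteLine($"stride 1:    {Time(data, 1)} ms");    // mostly sequential, cache-friendly
        Console.WriteLine($"stride 4096: {Time(data, 4096)} ms"); // same work, far more cache misses
    }

    // Sums every element exactly once, but visits them in jumps of 'stride'.
    static long Time(double[] data, int stride)
    {
        double sum = 0;
        var sw = Stopwatch.StartNew();
        for (int start = 0; start < stride; start++)
            for (long i = start; i < data.Length; i += stride)
                sum += data[i];
        sw.Stop();
        if (sum < 0) Console.WriteLine(sum); // never true; just keeps the JIT from discarding the loop
        return sw.ElapsedMilliseconds;
    }
}
```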
This is not an error in the design of Wealth-Lab's optimizer, but rather a lack of personalization to your specific hardware configuration.
After spending many hours studying WL7's behavior during optimization, I decided to write my own exhaustive optimizer extension, based on WL7's native Exhaustive Optimizer, with some tweaks specific to my machine. It appears to significantly outperform the native version on long runs with numerous optimizable parameters, while equaling (if not outperforming) it on shorter runs.
I hope this helps you better understand what you are seeing. I know I was initially perplexed by the seemingly low usage. But now that I understand it better, I've been able to develop a solution that better suits my needs.
You may find my thread here helpful:
https://www.wealth-lab.com/Discussion/Expected-Resource-Usage-Memory-CPU-Tweaks-Tips-Code-Management-Ongoing-Discussion-7143
QUOTE:
Your CPU has three levels of on-chip memory and prioritizes their utilization based on the data that is shared between tasks. Data required for all (or most) tasks is stored on the fastest of the three levels, the least shared on the slowest. If shared data is too large for the on-chip memory, it is allocated to RAM, then to the swap/page file on your HD if needed.
Exactly! So reduce your memory footprint so your problem can fit entirely in the on-chip cache of the processor. Then you'll get full utilization. Right now, your memory footprint is too big.
Also, using fewer cores so they are not fighting over on-chip cache memory may help. What you need is more on-chip cache, not more cores.
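To put a rough number on "memory footprint" (the figures below are assumptions for illustration, not measurements of WL7's internal storage), here is a back-of-envelope estimate in C#:

```csharp
using System;

class FootprintEstimate
{
    static void Main()
    {
        // Illustrative assumptions only; WL7's actual per-bar storage will differ.
        int symbols       = 1500;      // size of the DataSet
        int barsPerSymbol = 20 * 252;  // ~20 years of daily bars
        int seriesPerBar  = 6;         // open/high/low/close/volume/date
        int bytesPerValue = 8;         // double precision

        long bytes = (long)symbols * barsPerSymbol * seriesPerBar * bytesPerValue;
        Console.WriteLine($"~{bytes / (1024.0 * 1024):F0} MB of raw bar data");
        // Prints ~346 MB -- nearly an order of magnitude larger than the roughly 45 MB
        // of L3 cache per E5-2696 v3 socket, so the cores inevitably wait on RAM.
    }
}
```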
QUOTE:
Also, using fewer cores so they are not fighting over on-chip cache memory may help.
Yes, the term for this in my post above is "Affinity" (item 6).
However, in your personal circumstances you may find that it is important to set this for the WL7 executable AND for other applications running at the same time. This is most common in the gaming world, where gamers are also recording/streaming for a YouTube channel. The affinity (assigned CPUs) for the screen recording/streaming software is set to X logical cores (say 1-4 on an i7 octa-core), while the affinity for the game being played might be set to logical processors 5-8. This way the OS is not time-slicing the screen recording/streaming software and the game software on the same CPUs; instead, each is dedicated to its own set of logical cores. The result is less in-game lag and a smoother recording/streaming viewer experience.
Perhaps (though I have not tested this), in the case of dual multi-core Xeon processors, you might like to assign WL7 to the logical cores of ONE processor (1-16), with other applications assigned to the logical cores of the second processor (17-32). This may also positively impact the on-chip memory utilization.
In Windows 10/11, the affinity setting is accessible by right-clicking the executable's process in Windows Task Manager.
If you would like to learn more, I suggest you start with the way your OS manages its processor time allocation to tasks (aka time-slicing). Each OS does this a little differently, but they all follow the same principles. When you understand more about how the hardware is assigned work, you will better understand why WL7 and other software act the way they do at times. You will also be better able to create code solutions for the challenges specific to your use of WL7.
QUOTE:
In Windows 10/11, the affinity setting is accessible by right-clicking the executable's process in Windows Task Manager.
Also, affinity can be set via command line:
https://stackoverflow.com/questions/7759948/set-affinity-with-start-affinity-command-on-windows-7#7760105
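For completeness, affinity can also be set from code. A minimal Windows-only sketch using the standard System.Diagnostics API (the mask and the executable path are examples only):

```csharp
using System;
using System.Diagnostics;

class AffinityDemo
{
    static void Main()
    {
        // Pin the current process to logical cores 0-3 (mask 0b1111 = 0xF). Windows-only;
        // on a dual-socket machine the mask applies within the process's processor group.
        var self = Process.GetCurrentProcess();
        self.ProcessorAffinity = (IntPtr)0xF;
        Console.WriteLine($"Affinity mask: 0x{(long)self.ProcessorAffinity:X}");

        // The same idea from a command prompt (per the Stack Overflow link above);
        // the executable path is illustrative, not WL7's actual install path:
        //   start "" /affinity F "C:\Path\To\WealthLab7.exe"
    }
}
```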
OK,
Over the weekend I will be able to test with the second Xeon processor on, so there will be 72 logical processors and 64 GB of RAM in total.
I will test with and without defined affinity for Wealth-Lab.
I'll send the results as soon as I run the tests.
Note: I am using Google Translator.
Google Translator is getting good!
Good afternoon.
As mentioned, I installed the second processor in my machine.
So my machine now has two Xeon E5-2696 v3 processors and 64 GB of DDR4 registered ECC RAM.
I noticed that in Windows 10 Pro the processing stays on only one processor. After researching this, I saw that Windows 11 automatically uses all available processors.
So I updated my system to Windows 11 and performed new tests.
I confirmed that, in Windows 11, the system does use all processors. But depending on the size of the database and the number of assets, the processing uses more or fewer CPU resources.
Attached is an image of a run using a 7-year dataset with 250 assets, for 95,000 tests.
Okay, so now let's make your memory footprint smaller so it all fits in the on-chip cache. Reduce your problem size from 7 years to 2 or 3 years, then tell us what your CPU utilization is. The number of assets (stocks) shouldn't matter; the number of cores used will. Try reducing the number of cores involved as well. Too many cores are fighting over the same on-chip cache, which is crushing your CPU utilization.
As you reduce your memory footprint, you should reach high CPU utilization.
While I agree with @superticker in principle, the issue runs a little deeper than just the dataset fitting in the on-chip memory.
Items to consider:
1) The three levels of on-chip memory are progressively SLOWER. Optimal speed may or may not be reached by shared data fitting across all three levels, or only in L1 (the fastest).
2) The number of other CPU tasks sharing the chip time. Remember that CPU multitasking is achieved by time-slicing. Wealth-Lab is not the only application competing for time: antivirus, indexing, other OS processes, as well as other installed software are all requesting time. If tasks competing for chip time have significantly different data requirements, you will see slower performance as data is swapped in and out of the on-chip cache. This is where setting the affinity may be helpful, if you also lock out all other programs.
Note: While the Windows 11 OS might use all available cores for a smoother user experience, it could potentially reduce your WL7 chip utilization due to this competing time-slicing. This is theoretical only; I have not tested Win 11 nor researched how Win 11 manages time-slicing, so I cannot comment from an experience perspective.
3) The total number of active CPU tasks will impact the time allocation of each slice. Therefore LOTS of tasks will result in smaller time-slices being allocated, which in turn results in more L1-L2-L3-RAM-paging cache swapping (and more processor idle time). This is inherent in the time-slicing concept, whose ultimate goal is keeping all tasks progressing to give the user the "illusion" of simultaneous processing. You can set WL7 to a higher priority, but this can also introduce undesired side effects (a sketch of setting priority from code appears at the end of this post). However, there are tools like Process Lasso (https://bitsum.com/) that allow you to set thresholds for individual applications' CPU utilization to keep your system running optimally.
4) The length of time each task needs to complete. Similar to the above, CPU time is finite, therefore the length of the task will impact CPU utilization, as longer tasks are broken into shorter runs on the CPU. If the task is short, it completes and returns swiftly; if it's long, it may be waiting for many other tasks to complete before its next "section" is run. This has the effect of extending the run time of the task. Generally a task is assigned to a single logical CPU rather than spread across CPUs, so the task may sit waiting for considerable time depending on that logical core's load.
The impact of the scenario above would be evident in a spooling of 1,000 long-running tasks compared with 10,000 small tasks. Even if the total time required for each "set" was similar, the latter would likely complete first in a typical environment. In WL7 we see evidence of this with smaller datasets (number of bars) and larger symbol-set combinations. Not only is the data more likely to fit entirely in the on-chip memory, but each processing task completes in a minimal number of time-slices, resulting in the task finishing with minimal swapping between chip cache, RAM, and paging file.
In my research into this specifically for WL7, I have found that 100% utilization is a significant challenge on large data. There are ways to improve speeds significantly. However, this requires a different resource management approach that is a little counter-intuitive to the .NET ideology and factors in all the aspects of resource management that I have mentioned in my posts.
I am hoping to produce a new optimizer in the near future that combats many of these issues, resulting in superior overall speed. Preliminary tests of an exhaustive optimizer are showing very encouraging results <-- more details on that later.
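Regarding the priority point in item 3, here is a minimal sketch of nudging a process's priority from .NET (use with care, as noted above; "WealthLab7" is an assumed process name, not necessarily the real one):

```csharp
using System;
using System.Diagnostics;

class PriorityDemo
{
    static void Main()
    {
        // "WealthLab7" is an assumed process name; check Task Manager for the real one.
        foreach (var p in Process.GetProcessesByName("WealthLab7"))
        {
            // AboveNormal is usually enough; High/RealTime can starve the rest of the system.
            p.PriorityClass = ProcessPriorityClass.AboveNormal;
            Console.WriteLine($"{p.ProcessName} ({p.Id}) -> {p.PriorityClass}");
        }
    }
}
```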
All of the considerations above are important from a performance perspective, but not all are important for CPU utilization.
QUOTE:
1) The three levels of on-chip memory are progressively SLOWER. Optimal speed may or may not be reached by shared data fitting across all three levels, or only in L1 (the fastest).
So you could have 100% processor utilization at 0.2 GHz or 2.5 GHz; which one depends on the Principle of Locality in your strategy design. Do you write really tight code? But whether or not you get 100% processor utilization depends on fitting your entire problem in the L3 cache. The L2 and L1 caches only play a role in determining the "effective" CPU clock speed, and that's a performance issue, not a CPU utilization issue.
Let's not get performance considerations mixed up with CPU utilization issues. They are different.
Correct the CPU utilization problem first by shrinking the memory footprint of your problem. Use about 4 cores per Xeon chip and simulate over 3 years. Once you have acceptable CPU utilization, then address the other "performance items" mentioned in the list above. They are important too.
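For anyone wanting to experiment, this is roughly what "use about 4 cores" looks like in plain .NET terms. It is an illustrative sketch only, not how WL7's optimizer is actually implemented, and RunOneBacktest is a hypothetical placeholder:

```csharp
using System;
using System.Threading.Tasks;

class LimitedParallelism
{
    static void Main()
    {
        // Cap the parallel loop at 4 worker threads regardless of how many cores exist.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        // Stand-in for running one backtest per parameter combination.
        Parallel.For(0, 1000, options, i => RunOneBacktest(i));
        Console.WriteLine("done");
    }

    // Hypothetical placeholder workload; a real run would evaluate a strategy here.
    static void RunOneBacktest(int parameterIndex)
    {
        double x = 0;
        for (int k = 0; k < 100_000; k++) x += Math.Sqrt(k + parameterIndex);
    }
}
```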
Apologies to all for my long-winded posts if I've misunderstood the intent behind the original post.
QUOTE:
Is there anything that can be done to improve Wealth-Lab's processing speed?
While the title of the thread specifies CPU utilization, the question asked is about increasing speed during optimizations (performance). Therefore, in the context of this thread, I interpreted CPU utilization as a common go-to, albeit incomplete, measurement/indicator of performance.
I agree, the two are not synonymous and are best not confused.
I think the overall goal here is to find good optimization results as fast as possible.
There are two paths to get there:
1. Use a faster machine. (That was the main focus of this discussion).
2. Use a better optimization algorithm.
There is this old wisdom from computer science: A better algorithm will always outperform a better machine.
For the problem at hand, I'd suggest replacing the "Exhaustive Search" optimizer with the "SMAC" optimizer.
The latter will find good optimization results in a fraction of the time used by "Exhaustive" and will therefore be (much) faster - even on a slower machine.
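A quick back-of-envelope illustration of why (the step counts and the SMAC-style budget below are made-up numbers for the example):

```csharp
using System;

class SearchSpaceSize
{
    static void Main()
    {
        // Example: 4 optimizable parameters, each stepped through 20 values.
        int[] stepsPerParameter = { 20, 20, 20, 20 };

        long exhaustiveRuns = 1;
        foreach (int steps in stepsPerParameter)
            exhaustiveRuns *= steps;

        int guidedBudget = 500; // assumed evaluation budget for a guided optimizer such as SMAC

        Console.WriteLine($"Exhaustive grid:       {exhaustiveRuns:N0} backtests"); // 160,000
        Console.WriteLine($"Guided (fixed budget): ~{guidedBudget:N0} backtests");
        Console.WriteLine($"Roughly {exhaustiveRuns / (double)guidedBudget:F0}x fewer runs");
    }
}
```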
QUOTE:
A better algorithm will always outperform a better machine.
For the problem at hand ... replace "Exhaustive Search" optimizer by "SMAC" optimizer.
I totally agree.
There's also a Principle of Locality issue to get the most out of the L1 and L2 caches. Are you declaring vectors (TimeSeries variables) in MyStrategy as "common/global" when you could be declaring them in Initialize instead to make them "local"? But this has nothing to do with CPU utilization, which is what we are discussing here. If you want to talk about Principle of Locality (L1 and L2 cache efficiency), start a new thread and we'll dig in there.
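Schematically, the difference looks like the sketch below. The class and method names follow the WL7 strategy pattern, but treat the exact API calls as assumptions written from memory; the point is the scoping, not the specific indicator.

```csharp
// Sketch only: the namespaces, base class, and indicator call follow the WL7 strategy
// pattern but are written from memory -- verify against the current API before use.
using WealthLab.Backtest;
using WealthLab.Core;
using WealthLab.Indicators;

public class LocalityExample : UserStrategyBase
{
    // "Common/global" style: a class-level field lives for the whole run,
    // whether or not every method actually needs it.
    // private TimeSeries smaField;

    public override void Initialize(BarHistory bars)
    {
        // "Local" style: the series is created and scoped here, per symbol;
        // nothing outside Initialize holds a reference to it.
        TimeSeries smaLocal = SMA.Series(bars.Close, 20);
        // ... use smaLocal for setup work ...
    }

    public override void Execute(BarHistory bars, int idx)
    {
        // trading rules go here; promote a series to a field only if Execute truly needs it
    }
}
```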
Agreed... the brute-force approach of the Exhaustive optimizer is not the fastest way to achieve the end goal of finding optimal system parameters. There are other methods that "optimize" the optimization process.
QUOTE:
There is this old wisdom from computer science: A better algorithm will always outperform a better machine.
I couldn't agree more! A huge percentage of Exhaustive runs are nowhere close.
Changing the optimization method does, in fact, increase CPU utilization.
But I believe the problem is in Windows, which cannot optimize processing across 2 CPUs.
In Task Manager it is only possible to select affinity for CPU 0 or CPU 1.
For this reason, most of the time the processing is only at 50%.
QUOTE:
Changing the optimization method does, in fact, increase CPU utilization.
The CPU utilization is dependent on the implementation. And different methods have vastly different implementations, so this is no surprise.
QUOTE:
... Windows that cannot optimize processing on 2 CPUs.
In the task manager it is only possible to select affinity with CPU 0 or CPU 1.
Well that's not good. So for Windows 11, it's not worth the money to buy a motherboard with more than one CPU for sharing across a single application (such as WL). I'll remember that.
What I would do then is pare down the number of cores used to 3 or 4 on one CPU to improve your on-chip hit ratio on the L3 cache for that chip. And I would just forget about the other processor chip because its CPU utilization will always be poor since all its cores will be fighting over the same L3 cache.
And reduce the size of your problem from 7 years to 2 or 3 years so it all fits into L3 cache. That should help a great deal.
--
Off topic, but if you converted all the WL TimeSeries collections to single precision arrays, that would give you a major speed boost as well. The reason is twofold.
1) Your data would be half the size with single precision arrays, so that effectively doubles your L3 cache utilization.
2) Arrays execute much more efficiently than C# collections (such as a WL TimeSeries) because they are contiguous and faster to index into (no address-indirect addressing in computer engineering terms). That's why all numerical analysis packages (e.g. Math.Net) use arrays and not collections. But WL must use collections because one can't vary the length of an array in C#, and WL needs storage with variable length for each new bar over the day.
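As a stand-alone illustration of both points (generic C# with a List<double> standing in for a TimeSeries; this is not WL code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SinglePrecisionDemo
{
    static void Main()
    {
        // Stand-in for a price series held as a collection of doubles.
        List<double> series = Enumerable.Range(0, 5_000_000).Select(i => (double)i).ToList();

        // Point 1: single precision halves the footprint (8 bytes -> 4 bytes per value),
        // so roughly twice as much data fits in the same L3 cache.
        float[] packed = new float[series.Count];
        for (int i = 0; i < series.Count; i++) packed[i] = (float)series[i];

        // Point 2: a tight loop over a contiguous float[] indexes directly into memory,
        // the access pattern numeric libraries such as Math.NET are built around.
        double sum = 0;
        for (int i = 0; i < packed.Length; i++) sum += packed[i];

        Console.WriteLine($"doubles: {series.Count * 8L / 1_048_576} MB, " +
                          $"floats: {packed.Length * 4L / 1_048_576} MB, sum = {sum:E3}");
    }
}
```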
Thanks for the analysis. And happy computing to you.
QUOTE:
In the task manager it is only possible to select affinity with CPU 0 or CPU 1.
Check that your processor is supported by Windows 11:
https://docs.microsoft.com/en-us/windows-hardware/design/minimum/supported/windows-11-supported-intel-processors
If it's not supported, perhaps that is why you have limited affinity options.
Based on your screen grab it looks like all the logical cores are running.
QUOTE:
Changing the optimization method does, in fact, increase CPU utilization.
I agree with @superticker. This is not a surprise. A different method of optimization will have an entirely different processor "fingerprint".
Be careful not to confuse CPU utilization with the speed of optimizer processing (if that is your goal). A lower thread count may result in higher, oscillating CPU utilization but slower overall optimizer speed, compared to a higher thread count with stable CPU utilization. Below are examples from my personal tests. The lower, more stable CPU utilization performs better over the entire duration of the optimization.
Exhaustive Optimizer Comparison Test:
I have 128 vCPUs and they are not getting used.
Okay, I'll post all related questions here
It was using all 32 CPUs with the build 20 update. However, after I moved to larger Amazon AWS EC2 instances in order to test it with more vCPUs, it stopped recognizing them, and it stopped using all of the vCPUs.
When I ran your code test to see how many vCPUs WL recognized, it recognized all of them.
I'm not sure how much more we can do, because we don't have a similar machine and we're really just doing the most basic kind of parallel processing that the .NET framework offers. It sounds like there's some system-specific issue going on in this case, unfortunately.
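For reference, the "most basic kind of parallel processing" in .NET is along these lines. This is a generic sketch, not WL7's actual source, and SimulateBacktest is just a placeholder workload; Environment.ProcessorCount is also the kind of check the core-count test mentioned above presumably relies on.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class BasicParallelSketch
{
    static void Main()
    {
        // How many logical processors the runtime sees; worth confirming on big cloud
        // instances before blaming the application layer.
        Console.WriteLine($"Logical processors visible to .NET: {Environment.ProcessorCount}");

        // The "basic" pattern: hand the Task Parallel Library a list of work items
        // and let it spread them across the available cores.
        int[] parameterSets = Enumerable.Range(0, 200).ToArray();
        Parallel.ForEach(parameterSets, p => SimulateBacktest(p));
    }

    // Placeholder workload standing in for one optimization run.
    static void SimulateBacktest(int run)
    {
        double x = 0;
        for (int k = 0; k < 500_000; k++) x += Math.Sqrt(k + run);
    }
}
```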
QUOTE:
after I moved to larger Amazon AWS EC2 instances in order to test it with more vCPUs, it stopped recognizing them,
This sounds like a system level question for Amazon. Let us know what you find out from Amazon. There might be some limitation Amazon places on the size of the working set (memory per process) of each parallel process. But clearly there's some kind of limitation Amazon is placing on these processes. We just don't know what that is.