Optimization with Extended Scorecard never reaches 100% CPU usage
Author: kazuna
Creation Date: 3/3/2019 5:55 PM

kazuna

#1
This is with the Extended Scorecard, and the optimization has a scalability issue.

I open 20 strategy windows and optimize them all at once.

CPU utilization peaks out at around 70~80% and never reaches 100%.
Aren't you holding a global semaphore or spinlock for accessing the data set?

[Hardware configuration]
8 cores / 16 hyperthreads CPU
64GB memory

[Optimization parameters]
Scorecard: Extended Scorecard
Scale: 1 Minute
Data Range: 1 Year
Position Size: SetShareSize
Optimization Method: Exhaustive
Runs Required: 20000

Eugene

#2
Is this a problem? I don't see any issue here.

kazuna

#3
Yes, it is a problem.

As I upgraded from 4 cores to 6 cores, and from 6 cores to 8 cores, CPU utilization got lower and lower.
There must be some bottleneck in the code that is preventing all the threads from running.

Note that I have 20 optimizations running at once, so every CPU thread should run without being blocked by another thread.

Eugene

#4
It's just speculation, but there may be some lock for accessing the same data simultaneously, as you assume. I guess back when WL5 was architected, being able to run 20 optimizations wasn't even considered. Call this a limitation.

¯\_(ツ)_/¯

kazuna

#5
If it is something to do with accessing the data, you may consider having a worker thread prepare the data for the next run while the main thread is executing the optimization. That would prevent the main thread from stalling while waiting for the data for the next run.
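Something along these lines is what I have in mind. It's only a rough producer/consumer sketch in C#; LoadDataForRun and ExecuteRun are placeholders standing in for Wealth-Lab's internals, not its real API.

CODE:
// Rough sketch of the prefetch idea. LoadDataForRun and ExecuteRun are
// placeholders, not Wealth-Lab's actual internals.
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PrefetchSketch
{
    static double[] LoadDataForRun(int run) => new double[100_000]; // stand-in: load bars/parameters
    static void ExecuteRun(double[] data) { /* stand-in: one optimization run (pure CPU work) */ }

    static void Main()
    {
        const int totalRuns = 20000;
        // Bounded queue: at most two prepared data sets wait in memory at any time.
        var queue = new BlockingCollection<double[]>(boundedCapacity: 2);

        // Worker thread prepares the data for upcoming runs ahead of the optimizer.
        var producer = Task.Run(() =>
        {
            for (int run = 0; run < totalRuns; run++)
                queue.Add(LoadDataForRun(run));
            queue.CompleteAdding();
        });

        // Main thread only consumes, so it never stalls waiting for data.
        foreach (var data in queue.GetConsumingEnumerable())
            ExecuteRun(data);

        producer.Wait();
    }
}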

In any case, more and more CPU cores are becoming available, and this scalability issue will become a real pain very soon.

Eugene

#6
Interesting. This is something to investigate if Fidelity considers a next round of development. Thanks.

superticker

#7
QUOTE:
As I upgraded from 4 cores to 6 cores, and from 6 cores to 8 cores, CPU utilization got lower and lower.
And that's to be expected. It's a hardware problem with bus bandwidth between the processor chip and the external RAM (DIMM) memory chips on the motherboard. You can add more and more cores, but your "processor bus" bandwidth (i.e. front-side bus speed, 333MHz?) between the processor chip and the motherboard isn't increasing. The bandwidth between these two is your bottleneck.

So how do you fix this? Well, the first thing is to buy a processor chip with the biggest L3 (and L2) cache available on chip so you don't have off-chip cache misses. So you should be using an i7-level Intel processor in your system. You might even consider a Xeon-based server with significantly more on-chip cache, but you'll want to put it in a machine room because all the fans needed to cool a Xeon processor are very noisy.

The next thing to do is reduce your memory footprint. Try to make your program as small as possible. Delete all the DataSeries arrays you're not using. Use the principle of locality to localize all your array accesses and reduce off-chip cache misses. Perhaps even switch from double precision to single precision on cached DataSeries to save memory space. If all those cached DataSeries create off-chip cache misses, you'll have a problem with off-chip bandwidth bottlenecks.
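For illustration only (this is not Wealth-Lab's DataSeries API), here is roughly what caching a derived series in single precision instead of double precision buys you:

CODE:
// Illustration only -- not Wealth-Lab's DataSeries API. A derived series
// cached in single precision takes half the memory of the same series in
// double precision, so more of it stays in on-chip cache.
using System;

class PrecisionSketch
{
    static void Main()
    {
        const int bars = 98_280;              // roughly one year of 1-minute bars
        double[] closes = new double[bars];   // source data: 8 bytes per bar

        const int period = 20;
        float[] cachedSma = new float[bars];  // cached indicator: 4 bytes per bar
        double sum = 0;
        for (int i = 0; i < bars; i++)
        {
            sum += closes[i];
            if (i >= period) sum -= closes[i - period];
            cachedSma[i] = (float)(sum / Math.Min(i + 1, period));
        }

        Console.WriteLine($"double cache: {bars * 8 / 1024} KB, float cache: {bars * 4 / 1024} KB");
    }
}

Half the cache bytes per bar means roughly twice as much of your cached data can stay on-chip.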

And happy computing to you!

kazuna

#8
I don't think it's a memory issue, because with a smaller data range, say 1 month instead of 1 year, CPU utilization gets even lower.
This means it's not a memory locality issue; the overhead of the bottleneck actually gets worse with less data.

Also, there is no difference in CPU usage between optimizing the same symbol and optimizing different symbols.
If it were a memory locality issue, there should be a difference.

If you can't believe it, I think you've got to duplicate the problem. It's easy.

superticker

#9
QUOTE:
I don't think it's a memory issue, because with a smaller data range, say 1 month instead of 1 year, CPU utilization gets even lower.
Just to clarify, we are talking about a front-side bus bandwidth bottleneck, so please don't call it a memory issue. But you're correct: if there's an L3 cache miss on the processor chip, then the bus bottleneck becomes a memory issue because you're now stepping down from processor speed (4GHz) to front-side bus speed (333MHz).

Generally, I agree with you. If you reduce the Data Range, cache misses would be less of a problem per optimization. But by doing so, you may have more parallel optimizations competing for the same front-side bus bandwidth, which "could" make the bus bottleneck worse. I just can't be sure without monitoring the execution of each optimization thread. This scenario is too complex to say for sure; I don't know.

What I can say is that if you reduce the load on a "resource-strangled" processor/system, you will get better performance in every case. So try to determine whether running two or three parallel optimizations gives you more throughput than running four or five. Find the sweet spot of your system.
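If you want to put numbers on it, a rough C# sketch like the one below will show where throughput stops improving as you add parallel work. The workload here is only a stand-in for an optimization run, not Wealth-Lab code.

CODE:
// Rough benchmark sketch: time the same batch of dummy work at several degrees
// of parallelism and compare throughput. OneRun is a stand-in for an
// optimization run, not Wealth-Lab code.
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class SweetSpot
{
    static void OneRun()
    {
        double x = 0;
        for (int i = 0; i < 5_000_000; i++) x += Math.Sqrt(i);
    }

    static void Main()
    {
        const int runs = 200;
        foreach (int degree in new[] { 1, 2, 3, 4, 5, 8, 16 })
        {
            var sw = Stopwatch.StartNew();
            Parallel.For(0, runs,
                new ParallelOptions { MaxDegreeOfParallelism = degree },
                _ => OneRun());
            Console.WriteLine($"{degree,2} parallel: {runs / sw.Elapsed.TotalSeconds:F1} runs/sec");
        }
    }
}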

QUOTE:
Also there is no difference in the CPU usage when optimizing the same symbol and optimizing the different symbol.
I can't imagine why optimizing one symbol would be any different from optimizing another. Please explain why you think there would be a difference between symbols under any circumstances.

kazuna

#10
I just found one correction to the original post; I think I mixed it up with another issue.

CPU utilization peaks out at around 20~33% and never reaches 100%.

kazuna

#11
QUOTE:
So try to determine if running two or three parallel optimizations gives you more throughput than running four or five. Find the sweet spot of your system.
When I optimize for 20000 runs on 1-minute scale data, the CPU usage on 8 cores / 16 hyperthreads is as follows:

1 optimization: 13%
2 optimizations: 16%
4 optimizations: 19%
8 optimizations: 24%
16 optimizations and more: 33%

So the CPU usage peaks out at around 20~33%.

QUOTE:
I can't imagine why optimizing one symbol would be any different than optimizing another.
It was just an experiment to see if reducing the data set improves memory locality, but it made no difference.

Eugene

#12
Now these numbers sound more realistic. Architected in 2006 during the .NET 2.0 days, Wealth-Lab is not optimized for parallel optimizations (no pun intended).

superticker

#13
QUOTE:
1 optimization: 13%
2 optimizations: 16%
4 optimizations: 19%
8 optimizations: 24%
16 optimizations and more: 33%

So the CPU usage peaks out at around 20~33%.
Thank you very much for posting these benchmarks. They are very interesting.

Since you mentioned (in another thread) that you're running a Core i9-9900K, I did a little research on its benchmark performance. What I found is very interesting. A couple of benchmarking articles compared this 8-core processor with its 6-core cousins and found that it doesn't scale. Bottom line: adding the 7th and 8th cores really doesn't improve performance that much in any application. Practically speaking, the 7th and 8th cores aren't worth the extra power dissipation, and they certainly aren't worth the extra price. If I were laying out the chip, I would have dropped the 7th and 8th cores and replaced them with more cache memory. This processor has 16MB of cache now.

Why doesn't the number of cores scale better? I don't know. But the 64GB of main memory is a source of contention between the 8 cores. I would have made this memory block dual-ported (it doesn't take much additional logic to dual-port memory) so that two cache misses could be serviced by the memory block simultaneously. But if you're trying to service three cache misses simultaneously, then one core is going to have to wait. And that's an inherent source of contention.

If you do find a benchmarking article that finds an application where the 7th and 8th cores scale better, please let me know about it.

kazuna

#14
You can scale the Core i9-9900K to utilize all 8 cores.
The key is TDP: you have to increase it from 95W (default) to 120~140W, and then all 8 cores run at full speed (4.7GHz).
I do some video encoding and transcoding; the i9-9900K performs very well, much faster than the 8700K (6 cores) and the 7700K (4 cores).

superticker

#15
QUOTE:
I do some video encoding and transcoding; the i9-9900K performs very well
Perhaps the video encoding and transcoding workloads require a smaller and tighter memory footprint, so you can effectively fit 8 problems (for 8 cores) entirely in the i9-9900K's 16MB cache. Signal processing (which is what encoding is) can be a very tight problem, which is great for cache hits and this processor.

Some of the benchmarks I was discussing above (Post #13) were for gaming and Adobe Premiere. I'm not sure why they targeted those apps, but those applications would have large memory footprints. Wealth-Lab would also have a large memory footprint, but that largely depends on your trading strategy and Data Range settings.

The biggest plus with the i9-9900K is its 64GB of main memory, which is very fast.

kazuna

#16
Finally, I was able to push my 8 cores / 16 hyperthreads to 100% usage while optimizing.
Because WLP peaks out at around 25%, I hypothesized that 4 instances of WLP would push the CPU to 100%.
I installed 3 virtual machines and ran 4 WLPs with 6 optimizations each; the CPU finally reached 100%.

It took 13 hours to finish them all.
That is incredibly fast for a single machine, given that the total was 480000 optimization runs on 1 year of intraday data.

The drawbacks are that (1) you need a lot of memory and (2) you need multiple Windows licenses.

Eugene

#17
I had considered suggesting a way to reach the same goal without multiple Windows licenses but discarded it as awkward. But now that I see you're fine with what seems an even more complicated solution, I'll chime in. The idea is to assign a CPU core to each running instance of WLP:

1. Create multiple Windows usernames on the same PC (no need for multiple licenses, just the Win+L multi-login feature),

2. Start a copy of Wealth-Lab in each one,
2.1. Copy your data and Strategy files to each user account's AppData folder

3. Change the CPU affinity of every running WLP process so that each instance is assigned its own CPU core.

This should do the trick. I can even imagine a scripting solution for semi-automation of #3 e.g. Change affinity of process with windows script
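For example, a minimal C# sketch of #3 could look like the following. The process name "WealthLabPro" is an assumption (check Task Manager for the exact name), and it has to run elevated to modify processes that belong to the other logged-in users.

CODE:
// Rough sketch of step 3. The process name "WealthLabPro" is an assumption;
// run elevated so the affinity of processes in other user sessions can be changed.
using System;
using System.Diagnostics;

class PinWlpInstances
{
    static void Main()
    {
        Process[] instances = Process.GetProcessesByName("WealthLabPro");
        for (int i = 0; i < instances.Length; i++)
        {
            // Affinity is a bitmask: core 0 = 0x1, core 1 = 0x2, core 2 = 0x4, ...
            instances[i].ProcessorAffinity = (IntPtr)(1L << i);
            Console.WriteLine($"PID {instances[i].Id} pinned to core {i}");
        }
    }
}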

P.S. As for Windows licenses for the VMs, you might avoid this requirement by obtaining a preconfigured Windows VM of your choice straight from the Microsoft website. Get a new copy when it expires.

Carova

#18
Hi Eugene!

I had noticed the same problem that kazuna addressed, but I was just (foolishly) accepting it. I have a couple of questions since I plan to try your suggestion.

1. I assume the reason for your suggestion in item #1 is to permit quick toggling between the various user accounts without having to log into each one. Correct?

2. Since all of the data for WLP is stored in a user's directory, is there any way to have all users employ a single user's data directory?

Thanks!
Vince

Eugene

#19
Hi Vince,

1. Correct. The multiple optimizations should run concurrently.

2. This is not a supported use case. I would advise against attempting to work around it.

Carova

#20
Thanks!

Vince

superticker

#21
QUOTE:
Finally, I was able to push my 8 cores / 16 hyperthreads to 100% usage while optimizing.
Because WLP peaks out at around 25%, I hypothesized that 4 instances of WLP would push the CPU to 100%.
I installed 3 virtual machines and ran 4 WLPs with 6 optimizations each; the CPU finally reached 100%.
Since you were able to achieve 100% processor utilization, the problem may not be a cache hit issue at all (as I suggested above), but solely an issue with WL parallelism. Interesting.

When I run optimizations on my workstation, I get about 22% CPU utilization for WL. That's doing just one optimization, though, and other stuff may be running on my system at the same time. Well, it would certainly be nice to improve the optimization parallelism of WL, even if I had to rewrite my stuff.

What also bothers me is why Windows only grants 1GB of working set (physical memory) to WL when you have 64GB of physical memory available. That has to be a Windows bug. The only reason I can think of to limit the working set to 1GB is that the .NET garbage collector would take too long to make a single pass through a larger working set. It might be worthwhile to force the WL working set higher, but I would only do this for optimization.
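If anyone wants to experiment, here is a rough sketch of raising the working-set limits through System.Diagnostics.Process. The process name "WealthLabPro" and the 512 MB / 4 GB figures are just assumptions, and Windows treats these limits as hints that it can still override under memory pressure.

CODE:
// Rough sketch of raising a process's working-set limits. Process name and
// sizes are assumptions; run elevated, 64-bit process assumed, and Windows
// may still trim the working set under memory pressure.
using System;
using System.Diagnostics;

class RaiseWorkingSet
{
    static void Main()
    {
        foreach (Process p in Process.GetProcessesByName("WealthLabPro"))
        {
            p.MinWorkingSet = (IntPtr)(512L * 1024 * 1024);      // 512 MB floor
            p.MaxWorkingSet = (IntPtr)(4L * 1024 * 1024 * 1024); // 4 GB ceiling
            Console.WriteLine($"PID {p.Id}: working set now {p.WorkingSet64 / (1024 * 1024)} MB");
        }
    }
}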

Carova

#22
All,

I decided to try Eugene's approach to getting better CPU utilization with WLP.

My System:
i9-9900K processor (no overclock)
32GB of DDR4 2666MHz RAM
1TB M.2 SSD drive

I created 4 accounts and have an independent instance of WLP running in each account. Each instance of WLP has 2 copies of a processor-bound optimization running.

The best result with all 4 instances of WLP was 81% utilization; the worst case was 76%.

Running 8 simultaneous optimizations in a single account provides only 46% utilization, so Eugene's suggestion is a substantial ~70% improvement. I suspect that I will need to set up 8 separate accounts with a single optimization each to get to that mythical 100% figure! ;)

Vince

Eugene

#23
Nice to know that.

I've added more details to the Wiki FAQ, which already touched on this technique.

Panache

#24
I'm running i7's and can easily hit 100% CPU utilization all the time running one instance of Wealth-Lab Pro and 8 instances of my strategy (which is not necessarily a good thing in terms of efficiency). However, I use my own hand-rolled optimizer. https://www.wealth-lab.com/Forum/Posts/WLP-becomes-unresponsible-when-running-several-optimizations-with-MS123-Scorecard-39606

I haven't tested the multiple accounts technique, but I was running into problems with the Wealth-Lab Pro optimizer "skipping" various parameter combinations when running multiple optimizations simultaneously. https://www.wealth-lab.com/Forum/Posts/Optimization-randomly-showing-0-trades-for-some-combinations-of-strategy-parameters-39507

Eugene

#25
QUOTE:
but I was running into problems with the Wealth-Lab Pro optimizer "skipping" various parameter combinations when running multiple optimizations simultaneously.

We encountered a thread safety issue with one of the backtester classes when trying to speed up Monte Carlo Lab by making it multi-threaded, and this forced us to call off the change. Maybe the problem you were running into had something to do with this. But running multiple WLP instances should not be affected, by design. Note that the instances must not concurrently update/alter the data and/or WL configuration files, to avoid issues.

Domintia-Carlos

#26
Hi!

I recommend checking out our BTUtils for Wealth-Lab toolset, which dramatically speeds up backtests and optimizations in Wealth-Lab.

It implements a multithreaded simulator and runs simulations and optimizations in parallel, executing each symbol and/or each optimization step in an independent thread.

Carlos Pérez
https://www.domintia.com

kazuna

#27
If you are using multiple Windows logins as Eugene suggested in #17, do not install this latest Windows update:
2020-09 Cumulative Update for Windows 10 (KB4574727)

It will break multiple Windows logins.

Eugene

#28
🤦

kazuna

#29
The bug has now been rolled into the October update.

Do not install this Windows update or it will break multiple Windows logins:
2020-10 Cumulative Update for Windows 10 (KB4577671)

Eugene

#30
As Dion said, WL7 will support multi-core optimizations. Also, you can start multiple WL7 instances. I remember having tested multiple Bulk Updates in 3 instances just for fun, and they didn't crash. ;)

For now these Windows 10 bugs can be annoying. I'm not sure about the latest releases, but in earlier builds it is quite possible to disable Windows updates altogether. Basically, what one has to do is:

1) Withdraw the Write/Full Access rights of TrustedInstaller for the UsoClient task (or whatever task MSFT has now).
2) Make the Windows Update service (wuauserv) start with Guest privileges instead of System.

These effectively stop it from writing to the disk ;)

3) And of course apply all anti-telemetry and anti-update registry tweaks using your script or tool of choice (they are countless). Like Shut Up Win 10 or Privacy.sexy, for example.

kazuna

#31
The bug has now been rolled into the November update.

Do not install this Windows update or it will break multiple Windows logins:
2020-11 Cumulative Update for Windows 10 (KB4586786)

I'm running Windows 10 1909 and have been hitting this problem ever since the September update.
If you are running a later Windows 10 build, 2004 (20H1) or 20H2, and are not hitting this problem, I would like to know.