WL8-optimization-compared-to-WL6

kazuna8

( 42.83% )

2024-09-07T10:09:54Z

I am migrating my WL6 strategy code to WL8 and have found that optimization runs noticeably slower in WL8 than in WL6.

Even with the very simple sample code attached, WL8 is clearly slower than WL6.

Once I run my real strategy code, the results get even worse.

I see the CPU being utilized at 60% to 80% most of the time, so it's not a utilization issue.

[Simple Sample Code]

WL8
Exhaustive (non-Parallel): 54:00 // using 2 cores 2 threads
Exhaustive: 8:30 // using 24 cores 32 threads

WL6
Exhaustive (non-Parallel): 14:00 // using 2 cores 2 threads

[My Real Strategy]

WL8
Exhaustive (non-Parallel): 3:12:00 // using 2 cores 2 threads
Exhaustive: 56:00 // using 24 cores 32 threads

WL6
Exhaustive (non-Parallel): 23:30 // using 2 cores 2 threads

// Simple Sample Code (WL6)

CODE:
namespace WealthLab.Strategies
{
   public class MyStrategy4 : WealthScript
   {
      private StrategyParameter Param1;
      private StrategyParameter Param2;
      private StrategyParameter Param3;

      public MyStrategy4()
      {
         Param1 = CreateParameter("Param1", 20.0, 18.0, 24.0, 0.1);
         Param2 = CreateParameter("Param2", 0.97, 0.80, 1.00, 0.01);
         Param3 = CreateParameter("Param3", 0.0, -0.2, 0.2, 0.01);
      }

      protected override void Execute()
      {
      }
   }
}

// Simple Sample Code (WL8)

CODE:
using WealthLab.Backtest;
using WealthLab.Core;

namespace WealthScript14
{
   public class MyStrategy : UserStrategyBase
   {
      public MyStrategy() : base()
      {
         AddParameter("Param1", ParameterType.Double, 20.0, 18.0, 24.0, 0.1);
         AddParameter("Param2", ParameterType.Double, 0.97, 0.80, 1.00, 0.01);
         AddParameter("Param3", ParameterType.Double, 0.0, -0.2, 0.2, 0.01);
      }

      // Called once per symbol before the bar-by-bar processing starts.
      public override void Initialize(BarHistory bars)
      {
      }

      // Called once for every bar; intentionally empty in this test.
      public override void Execute(BarHistory bars, int idx)
      {
      }
   }
}

21 Replies

Glitch8

( 10.10% )

2024-09-07T17:25:31Z

Sorry, but you're making an unfair comparison here. Your WL6 code doesn't do anything in the Execute method, which isn't a real-world case. In WL8, the Execute method always gets called for every bar, so a valid comparison requires at least a WL6 Strategy like the one below.

In my test (after disabling the last parameter and optimizing only the first two for time's sake), WL8's Exhaustive (parallel) optimizer completed in about one and a half minutes, while WL6 took about 15 minutes.

CODE:

namespace WealthLab.Strategies
{
   public class MyStrategy4 : WealthScript
   {
      private StrategyParameter Param1;
      private StrategyParameter Param2;
      private StrategyParameter Param3;

      public MyStrategy4()
      {
         Param1 = CreateParameter("Param1", 20.0, 18.0, 24.0, 0.1);
         Param2 = CreateParameter("Param2", 0.97, 0.80, 1.00, 0.01);
         Param3 = CreateParameter("Param3", 0.0, -0.2, 0.2, 0.01);
      }

      protected override void Execute()
      {
         for (int n = 0; n < Bars.Count; n++)
         {
            ExecuteBar(Bars, n);
         }
      }

      private void ExecuteBar(Bars bar, int idx)
      {
      }
   }
}

kazuna8

( 42.83% )

2024-09-07T22:13:01Z

I changed my Simple Sample Code as such and ran it on WL6.
The result is the same. It's still 14 minutes.

That makes sense, because the cost of the loop and the function call is negligible on modern systems and in modern programming languages.

The simple sample code is not a real-world case, but it's enough to demonstrate WL8's inefficiency, especially in the non-parallel comparison, and it shows that there is some overhead in WL8's design.

As you can see, I'm also comparing the real-world case that was summarized in the [My Real Strategy].

QUOTE:
WL8
Exhaustive (non-Parallel): 3:12:00 // using 2 cores 2 threads
Exhaustive: 56:00 // using 24 cores 32 threads

WL6
Exhaustive (non-Parallel): 23:30 // using 2 cores 2 threads

That's the real code I'm trading every day and I'm making real money from it.

I carefully moved all the one-time code from Execute() to Initialize() when I migrated my WL6 code to WL8, so there is no inefficient code in my WL8 strategy.

QUOTE:
In WL8, the Execute method always gets called for every bar

This could be where the inefficiency is coming from.

In fact, I never did such an inefficient thing in my WL6 code because leaving the function at every bar will waste some local variables which would have been shared among the bars if you did the loop within the function.

If the task is dispatched at every single bar, the dispatch overhead accumulates with every per-bar call.

I would suggest you consider providing a batch method like WL6's Execute() method, say:

CODE:
   public override void BatchExecute(BarHistory bars, int BarCount)

I do understand why WL8's Execute() method gets called at every bar so that you can backtest multiple symbols at the same bar.

Unfortunately, strategies like mine which don't need executing multiple symbols at the same bar run much faster on WL6's design than WL8's design.
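
As a rough way to reason about the two calling styles discussed above, here is a minimal, self-contained sketch that compares a host calling a virtual method once per bar against a single loop inside one method. The types `PerBarStrategy`, `SumPerBar`, and `DispatchSketch` are hypothetical stand-ins and are not WealthLab classes; the sketch measures only plain method-call overhead and says nothing about what WL6 or WL8 actually do internally.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for a "called once per bar" strategy; NOT a WealthLab type.
public abstract class PerBarStrategy
{
    public abstract void Execute(double[] closes, int idx);
}

public sealed class SumPerBar : PerBarStrategy
{
    // Class-level state survives between the per-bar calls.
    public double Total;

    public override void Execute(double[] closes, int idx)
    {
        Total += closes[idx];
    }
}

public static class DispatchSketch
{
    // Per-bar style: the host owns the loop and calls the strategy for each bar.
    public static double RunPerBar(double[] closes)
    {
        var impl = new SumPerBar();
        PerBarStrategy strategy = impl;
        for (int i = 0; i < closes.Length; i++)
        {
            strategy.Execute(closes, i);
        }
        return impl.Total;
    }

    // Batch style: the strategy owns the loop and keeps its state in a local.
    public static double RunBatch(double[] closes)
    {
        double total = 0;
        for (int i = 0; i < closes.Length; i++)
        {
            total += closes[i];
        }
        return total;
    }

    public static void Main()
    {
        // Synthetic data; the bar count is an arbitrary assumption.
        var closes = new double[6000];
        for (int i = 0; i < closes.Length; i++) closes[i] = 100 + i * 0.01;

        const int runs = 10000;

        var sw = Stopwatch.StartNew();
        double a = 0;
        for (int r = 0; r < runs; r++) a = RunPerBar(closes);
        sw.Stop();
        long perBarMs = sw.ElapsedMilliseconds;

        sw.Restart();
        double b = 0;
        for (int r = 0; r < runs; r++) b = RunBatch(closes);
        sw.Stop();

        Console.WriteLine($"per-bar: {perBarMs} ms, batch: {sw.ElapsedMilliseconds} ms, same result: {a == b}");
    }
}
```

Timings from a micro-benchmark like this vary by machine and JIT settings, so it is only useful for estimating how large the pure call overhead can be relative to the per-run times reported in this thread.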

Glitch8

( 10.10% )

2024-09-07T23:12:47Z

>>In fact, I never did such an inefficient thing in my WL6 code because leaving the function at every bar will waste some local variables which would have been shared among the bars if you did the loop within the function.<<

You can use class level variables instead of method local variables so it’s not an issue.

I don’t have your particular strategies and can’t see how you converted the code but like I said above it took about 15 minutes for WL6 to do that optimization and less than two minutes in WL8. Looks like a big improvement to me instead of a step backwards.

kazuna8

( 42.83% )

2024-09-08T00:46:47Z

QUOTE:
You can use class level variables instead of method local variables so it’s not an issue.

Yes, pretty much all of my variables are class-level rather than local, unless a variable holds a one-time value used only within the function.
That example was just meant to illustrate the possible inefficiency of a call-per-bar design.

QUOTE:
I said above it took about 15 minutes for WL6 to do that optimization and less than two minutes in WL8.

But that's not an apples-to-apples comparison.

If you compare WL6 and WL8 both using the same non-parallel optimization, that will reveal the underlying overhead and inefficiency in WL8's design.

QUOTE:
[Simple Sample Code]

WL8
Exhaustive (non-Parallel): 54:00 // using 2 cores 2 threads

WL6
Exhaustive (non-Parallel): 14:00 // using 2 cores 2 threads

Glitch8

( 10.10% )

2024-09-08T01:12:51Z

Sorry, still can't confirm. On my machine WL8 using the Exhaustive non-parallel completed in 4 minutes as opposed to 15 in WL6.

superticker8

2024-09-08T01:27:40Z

QUOTE:
I do understand why WL8's Execute() method gets called at every bar so that you can backtest multiple symbols at the same bar.

Unfortunately, strategies like mine which don't need executing multiple symbols at the same bar run much faster on WL6's design than WL8's design.

I agree with you. By the Principle of Locality, you get more processor L3 cache hits if execution is done with time as the fastest-moving variable rather than symbols (over bars). There's no argument there.

But the PreExecute{block}, which can compare metrics across stocks (symbols) in a dataset, is by far the most powerful feature of WL8. But that most powerful feature comes with a cost.

The solution is to pick your processor chip carefully for your workstation. (They sell motherboards without processor chips because you need to carefully select the right one!) You want the processor chip with the largest L3 on-chip cache possible. And having more than 4 cores is probably a waste because anything more than 4 processor cores will consume all that on-chip cache memory.

The other thing to do is avoid caching indicators with the .Series method that have their parameters manipulated by the optimizer. Use the "new" operator instead to declare these indicators; otherwise, you will be caching every possible parameter combination the optimizer wants to throw at it.

CODE:
         SMA sma = SMA.Series(bars.Close, 10); //yes, cache constant parameter indicators
         SMA sma = SMA.Series(bars.Close, Parameters[1].AsInt); //No, no!  Do NOT cache; wastes memory
         SMA sma = new SMA(bars.Close, Parameters[1].AsInt); //yes, avoid caching optimizable indicators

Now if you're not optimizing parameters in an indicator, then caching that indicator with the .Series method is probably a good idea because it will speed you up somewhat. So write your code accordingly.
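
To illustrate why caching an indicator whose parameter is being optimized can hold on to a lot of memory, here is a minimal, self-contained sketch. The `SeriesCacheSketch` class, its dictionary keyed by period, and the sweep from 5 to 200 are all assumptions made for illustration; this is not how WL8's cache is actually implemented, only a model of "one stored series per distinct parameter value".

```csharp
using System;
using System.Collections.Generic;

public static class SeriesCacheSketch
{
    // Hypothetical cache keyed by period; NOT WealthLab's actual implementation.
    private static readonly Dictionary<int, double[]> Cache = new Dictionary<int, double[]>();

    public static int CachedCount => Cache.Count;

    // Plain simple moving average using a rolling sum; NaN until enough bars exist.
    public static double[] ComputeSma(double[] values, int period)
    {
        var result = new double[values.Length];
        double sum = 0;
        for (int i = 0; i < values.Length; i++)
        {
            sum += values[i];
            if (i >= period) sum -= values[i - period];
            result[i] = i >= period - 1 ? sum / period : double.NaN;
        }
        return result;
    }

    // Analogous to a cached lookup: one array is kept alive per distinct period.
    public static double[] GetCached(double[] values, int period)
    {
        if (!Cache.TryGetValue(period, out var series))
        {
            series = ComputeSma(values, period);
            Cache[period] = series;
        }
        return series;
    }

    public static void Main()
    {
        var closes = new double[5000];
        for (int i = 0; i < closes.Length; i++) closes[i] = 100 + i % 50;

        // Pretend an optimizer sweeps the period over many values.
        for (int period = 5; period <= 200; period++)
        {
            GetCached(closes, period);
        }
        Console.WriteLine($"Arrays held in cache: {CachedCount}");

        // An uncached result can be reclaimed by the garbage collector once it is unused.
        double[] temp = ComputeSma(closes, 20);
        Console.WriteLine($"Uncached SMA(20) last value: {temp[temp.Length - 1]:F2}");
    }
}
```

In this model every distinct period leaves a full-length array in the cache, whereas the uncached call produces a temporary that can be collected after the run, which is the trade-off the advice above is pointing at.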

I suppose a blog article discussing Principle of Locality, cache hits, and memory management would be worthwhile. Happy engineering to you.

kazuna8

( 42.83% )

2024-09-08T01:54:24Z

QUOTE:
Sorry, still can't confirm. On my machine WL8 using the Exhaustive non-parallel completed in 4 minutes as opposed to 15 in WL6.

Interesting...

I tested in single symbol mode against QQQ with Daily scale and All Data range.
But the data set doesn't seem to matter much for performance, as far as I have tested with other configurations.

I have an Intel i9-13900K, and I see two threads running at 60% ~ 100% utilization at 5.2GHz when WL8 is optimizing with the Exhaustive (non-Parallel) method.

It still takes 55 minutes on this machine, however.

kazuna8

( 42.83% )

2024-09-08T02:12:13Z

QUOTE:
But the PreExecute{block}, which can compare metrics across stocks (symbols) in a dataset, is by far the most powerful feature of WL8.

I see. PreExecute() would be a motivation behind the call-per-bar design in WL8.

QUOTE:
Now if you're not optimizing parameters in an indicator, then caching the indicator with the .Series method is probably a good idea because it will speed you up somewhat. So write your code accordingly.

Yes, I spent some time on those caching ideas. My results so far are pretty much in line with your suggestions.

superticker8

2024-09-08T02:24:15Z

QUOTE:
PreExecute() would be a motivation behind the call-per-bar design in WL8.

You should be using PreExecute() in every production strategy you have because you want to prioritize trading stocks with better chances of making money (at that "bar" moment) over others.

kazuna8

( 42.83% )

2024-09-08T02:41:54Z

#10

QUOTE:
You should be using PreExecute() in every production strategy you have because you want to prioritize trading stocks with better chances of making money (at that "bar" moment) over others.

Thank you for your suggestion but my trade style doesn't need anything like that. Mine is so damn simple.

kazuna8

( 42.83% )

2024-09-08T21:59:10Z

#11

QUOTE:
In my test (after disabling the last parameter and optimizing only the first two for time's sake), WL8's Exhaustive (parallel) optimizer completed in about one and a half minutes, while WL6 took about 15 minutes.

If I do the same (disabling the last parameter and optimizing only the first two) on WL6 on my machine, WL6's Exhaustive (non-Parallel) optimizer completes in 17 seconds!
That makes sense: the last parameter multiplies the number of runs by 41, so if three parameters take 14 minutes, two parameters should take about 14 * 60 / 41 ≈ 20 seconds.

You said WL8's Exhaustive (parallel) optimizer completes it in one and a half minutes, which is 90 seconds.

I don't know how many cores and threads you have on your computer, but WL8 running in parallel takes more than 5x as long as WL6 running non-parallel.

These results also prove that WL8 optimizer runs significantly slower than WL6 optimizer.
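
For anyone who wants to check the arithmetic, here is a small self-contained sketch that derives the run counts from the three parameter ranges in the sample code and converts the reported timings into a per-run cost. It assumes that an exhaustive optimization visits every combination and includes both the start and stop value of each range; the class and method names are made up for this example.

```csharp
using System;

public static class RunCountEstimate
{
    // Number of values visited for one parameter,
    // ASSUMING both the start and the stop value are included.
    public static int CountValues(double start, double stop, double step)
    {
        // Rounding guards against floating-point error in the division.
        return (int)Math.Round((stop - start) / step) + 1;
    }

    public static void Main()
    {
        int p1 = CountValues(18.0, 24.0, 0.1);   // Param1
        int p2 = CountValues(0.80, 1.00, 0.01);  // Param2
        int p3 = CountValues(-0.2, 0.2, 0.01);   // Param3

        long threeParamRuns = (long)p1 * p2 * p3;
        long twoParamRuns = (long)p1 * p2;

        // Non-parallel timings reported in this thread for the simple sample code.
        double wl6Seconds = 14 * 60;
        double wl8Seconds = 54 * 60;

        Console.WriteLine($"Values per parameter: {p1}, {p2}, {p3}");
        Console.WriteLine($"Runs with three parameters: {threeParamRuns}");
        Console.WriteLine($"Runs with two parameters: {twoParamRuns}");
        Console.WriteLine($"WL6 ms per run: {1000.0 * wl6Seconds / threeParamRuns:F1}");
        Console.WriteLine($"WL8 ms per run: {1000.0 * wl8Seconds / threeParamRuns:F1}");
        Console.WriteLine($"Estimated WL6 two-parameter time (s): {wl6Seconds / p3:F1}");
    }
}
```

Under those assumptions the ranges give 61, 21, and 41 values, i.e. 52,521 runs with all three parameters and 1,281 runs with the first two, which puts the reported non-parallel timings at roughly 16 ms per run on WL6 and roughly 62 ms per run on WL8.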

Glitch8

( 10.10% )

2024-09-09T02:51:14Z

#12

I really hate to keep arguing with you on this, but I don't want to have misinformation on the forum here. Here's a real-world test, running with the EXACT SAME strategy and EXACT SAME data. A simple RSI overbought/oversold building block strategy.

I did the screen cap when WL8 had a few seconds remaining, because it doesn't show the time when it's complete, but the results are:

WL6: 6 minutes 47 sec
WL8: 1 minute 50 sec

So WL8's optimizer is at least 3 times faster than WL6 on this real-world test. Now, you seem convinced that WL6's optimizer is superior, and I've learned that it's impossible to change the mind of someone who's really convinced of something, but I thought it was important to set the record straight here.

kazuna8

( 42.83% )

2024-09-09T03:13:18Z

#13

Would you mind sharing your WL6 and WL8 code?
I want to understand where the difference is coming from.

By the way, isn't your WL6 test running too slowly?
651 runs take 6 min 47 sec?
That means each run takes more than 600ms.

superticker8

2024-09-09T03:57:01Z

#14

QUOTE:
I want to understand where the difference is coming from.

The difference is likely coming from the hardware: Different processor, L3 cache size, memory size, memory access time, ... number of processor cores.

Of course, the strategy code and preferences could be different too. With code, of course you want to compare apples to apples.

Cone8

( 5.57% )

2024-09-09T06:38:14Z

#15

I wonder if there's another factor that's slowing down WL8's optimizer - like having multiple Event Providers enabled. That should only be a factor during the initial data load, but it could add a lot of time to any backtest. If you have any Event Providers checked, let us know which ones.

kazuna8

( 42.83% )

2024-09-09T07:01:23Z

#16

QUOTE:
If you have any Event Providers checked, let us know which ones.

My WL8 is pretty much at its default settings; I haven't changed much except some chart settings, which have nothing to do with backtesting.

I have the WealthData event provider checked, with Dividend and Split checked. I think this is the default setting.

Looking at Backtest Preferences, there is nothing enabled in there.

kazuna8

( 42.83% )

2024-09-09T07:28:52Z

#17

By the way, it's very possible that my strategy is so simple that it just exposes the fundamental overhead in the way tasks are dispatched and executed on WL8 in order to maximize parallelism.

I guess WL6 has no such mechanism and simply executes everything on the same thread, so there is no overhead whatsoever.

If the task is as small as my strategy code, I guess the overhead becomes non-negligible.

Those who prefer complicated data sets and combinations can take advantage of the modern design employed in WL8.

Cone8

( 5.57% )

2024-09-09T08:16:07Z

#18

Something else that hasn't been considered above is:

1. Data Range + Interval => number of bars processed
(kazuna posted above, "I tested in single symbol mode against QQQ with Daily scale and All Data range.")

2. Quantity of Positions created in the backtest

Why don't you post a simple equivalent strategy and a picture of the Strategy settings so we can all test the same thing?

kazuna8

( 42.83% )

2024-09-09T08:44:41Z

#19

The [Simple Sample Code] illustrates the problem for me.
If the [Simple Sample Code] doesn't do well, my [My Real Strategy] won't do well.

The strategy setting is very simple: QQQ with Daily scale and All Data range.

QUOTE:
[Simple Sample Code]

WL8
Exhaustive (non-Parallel): 54:00 // using 2 cores 2 threads
Exhaustive: 8:30 // using 24 cores 32 threads

WL6
Exhaustive (non-Parallel): 14:00 // using 2 cores 2 threads

Cone8

( 5.57% )

2024-09-09T08:58:41Z

#20

Glitch already explained that's not a valid test because it doesn't do anything in WL6.
But at least it answers the Position quantity question => 0.

kazuna8

( 42.83% )

2024-09-09T09:06:37Z

#21

Yes, that's the whole point of doing nothing.
If it doesn't do anything, it can illustrate the overhead.

My strategy is relatively simple, and its optimization times are the simple sample code's times scaled up by the factors below.

QUOTE:
[Simple vs Real]

WL8 Exhaustive (non-Parallel): 3.5x
WL8 Exhaustive (Parallel): 6.6x

WL6 Exhaustive (non-Parallel): 1.7x

QUOTE:
But at least it answers the Position quantity question => 0.

Position sizing settings are all default:
WL6: Fixed Dollar, 100000
WL8: Fixed Value, 100000 (Starting Capital), 5000 (Amount), 1.00 (Margin Factor)

These settings don't seem to be contributing to the performance, however.

Anyway, I think I get it. I guess my strategy is just too simple for WL8, and WL6 is more than enough for it. I kinda felt that when I was porting my WL6 code to WL8.

No more investigation is needed on this topic.
Thank you for looking at it and I'm sorry for wasting your time.
