MIH8
- ago
Loading large amounts of data takes a lot of time.

Example: 101 symbols / NASDAQ 100 / 147,289,916 lines of data.

If a strategy loads the data with a data range constraint, it takes about 10 minutes before the strategy starts running. But if there is no restriction and "All data" is loaded, it is 10x faster.

For this reason I would like to ask the WL team whether the concept could be changed here: always load the complete data and suppress signals that fall outside the data range.

The use case is especially relevant when working on the strategy logic and having to restart the strategy several times. Waiting one minute instead of ten would make it practical to use large amounts of data while developing the logic.
1
732
21 Replies

Glitch8
 ( 12.08% )
- ago
#1
I don't think we want to always just load all of the data. That would open up a whole new series of complaints (why am I running out of memory all of a sudden?!?) But we should be able to optimize this and make it work faster when loading a data range. Is there any chance you might be able to share one or two of these large files with us via email (support@wealth-lab.com)?
0
MIH8
- ago
#2
Hi Glitch.

Similar to my proposal for the debugging scenario, you can try the following.
The OHLC data can be duplicated while generating dates with an increment of any size you wish (just a simple idea).
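
For illustration, here is a rough sketch of what I mean (a standalone C# program, not WL code; the file names and CSV layout are just placeholders):

CODE:
// Sketch: replicate one day of 1-minute OHLC bars across many years to
// build a large ASCII test file without needing real market data.
using System;
using System.IO;

class MakeTestData
{
    static void Main()
    {
        // one trading day of bars, each line "open,high,low,close,volume"
        string[] template = File.ReadAllLines("template_1min_day.csv");
        using (var w = new StreamWriter("bigsymbol.csv"))
        {
            DateTime day = new DateTime(2006, 1, 2);
            for (int d = 0; d < 15 * 365; d++)   // roughly 15 years of calendar days
            {
                if (day.DayOfWeek != DayOfWeek.Saturday && day.DayOfWeek != DayOfWeek.Sunday)
                {
                    int minute = 0;
                    foreach (string line in template)
                    {
                        // shift the template bars onto the current day, starting at 09:30
                        DateTime t = day.AddHours(9.5).AddMinutes(minute++);
                        w.WriteLine($"{t:yyyy-MM-dd HH:mm},{line}");
                    }
                }
                day = day.AddDays(1);
            }
        }
    }
}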

And I don't think you would run into a "new" problem; any user can select "All Data" right now.
If your memory handling doesn't catch that, the problem you describe can already occur at any moment.

I'm not asking for this right now, but of course it would be reasonable to make the memory budget configurable by the user. At the very least the software should not use more than the available RAM. I made a feature request for this some time ago, or at least we discussed it.
0
Glitch8
 ( 12.08% )
- ago
#3
I disagree; we'd absolutely run into new issues if data loading always loaded ALL DATA as opposed to only what's needed.
1
MIH8
- ago
#4
OK, you have more insight on this point.
I'm curious what improvement awaits us here.

The use case I described is extremely annoying.

Thank you, I'm looking forward to it.
0
MIH8
- ago
#5
Come to think of it, that is not the most important part of the idea anyway.
If it is possible to suppress signals that fall outside the data range, the user can decide how much data to load.
0
Glitch8
 ( 12.08% )
- ago
#6
Just want to clarify, each symbol has 150 million lines of data?
0
MIH8
- ago
#7
No, that is the total. Sorry to keep you waiting for an hour; I was absorbed in the other topic.

You can see the lines per symbol in the "main thread" (Permanent conversion from ASCII to binary data) if you are curious.
0
MIH8
- ago
#8
You can close this feature request.

Of course, a general improvement would still be nice.

At least for people who code, the idea can be implemented in the strategy with one line: simply return from Execute if the current date is not in the self-defined range. With this little change in the code, you can load "All Data" more than 10x faster while still getting the result for the desired range.

CODE:
public override void Execute(BarHistory bars, int idx)
{
    // loading "All Data" but with evaluation of a smaller time period
    // initialization of signal_range can be done once;
    // the initialization in the routine serves as an illustration
    if (LARGEDATA)
    {
        DateTime signal_range = new DateTime(2021, 07, 15, 0, 0, 0);
        if (DateTime.Compare(bars.DateTimes[idx], signal_range) < 0)
            return;
    }
    // ... more code
}


Doing it like this is very helpful for the use case in the description; large data can be accessed much more quickly this way.
0
Glitch8
 ( 12.08% )
- ago
#9
Still, we want to optimize the process; obviously something suboptimal is going on in the loading of these large files when a date range is specified.
0
MIH8
- ago
#10
Any optimization from you is appreciated. Thank you.
0
MIH8
- ago
#11
The above hack has disadvantages that must be taken into account.

It is important to note that the Metrics Report is affected.
In particular, calculations that average over time, such as APR, won't be correct.
You can easily see what I mean in the monthly returns, where 0.00% months are then shown.

Of course it is still useful. Many metrics, such as profit, stay the same. You can still do a "slow load" to get a clean Metrics Report; you just have to be aware that there are differences.
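
To make the distortion concrete, here is a rough standalone calculation (the numbers are invented and the formula is only a simplified annualization, not necessarily WL's exact metric):

CODE:
// Sketch: the same total profit annualized over the loaded span vs. the
// span actually traded. Skipping the early bars dilutes time-based metrics.
using System;

class AprIllustration
{
    static void Main()
    {
        double profitPct   = 50.0;   // total profit over the traded period (invented)
        double loadedYears = 15.0;   // span covered by "All Data"
        double tradedYears = 5.0;    // span actually evaluated by the strategy

        double aprLoaded = Math.Pow(1 + profitPct / 100.0, 1.0 / loadedYears) - 1;
        double aprTraded = Math.Pow(1 + profitPct / 100.0, 1.0 / tradedYears) - 1;

        Console.WriteLine($"APR over loaded span: {aprLoaded:P2}");  // about 2.74%
        Console.WriteLine($"APR over traded span: {aprTraded:P2}");  // about 8.45%
    }
}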
0
MIH8
- ago
#12
Is there a time frame for when the corrections/improvements are planned?
0
Glitch8
 ( 12.08% )
- ago
#13
Currently no.
0
MIH8
- ago
#14
Have you at least figured out what the problem is?

0
Glitch8
 ( 12.08% )
- ago
#15
Not yet, MIH. We have it logged as an issue but we’ve been working on other issues as well.
0
vk8
 ( 57.24% )
- ago
#16
@Glitch
Were you able to get the data?
0
Glitch8
 ( 12.08% )
- ago
#17
@vk, we don't need any data to work the issue; we can generate some data on our own. It will just take a bit more time and effort.
0
Glitch8
 ( 12.08% )
- ago
#18
Hi MIH, I created an ASCII DataSet of 100 symbols with 1 million bars of data each symbol, but I'm not seeing a difference in data load time when I use all data as opposed to a date range. Is there a particular date range that you're using that's causing the issue that I can try?
0
MIH8
- ago
#19
Data scale: 1 Minute
Data range: Most Recent Years
Number of years: 5
Enabled Filter Pre/Post market

The data I was talking about spans more than 15 years.
The OHLC data is not equally distributed.

Now, after 29 minutes of waiting, loading "All Data" has finished. At this point not even the workaround works in an acceptable time.

By the way, why this caching approach? A permanent conversion into binary format (with an append option) would be more effective. It looks like the data is cached again whenever the data range is changed.
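
As an illustration of what I mean by appending (plain .NET file I/O, not anything WL provides; the file name and record layout are only assumptions):

CODE:
// Sketch: append newly arrived bars to a flat binary cache file instead of
// rebuilding the whole cache. Record layout is an assumption for illustration:
// DateTime ticks (8 bytes) followed by OHLC and volume as doubles.
using System;
using System.IO;

static class BinaryBarCache
{
    public static void AppendBar(string path, DateTime time,
        double open, double high, double low, double close, double volume)
    {
        using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write))
        using (var w = new BinaryWriter(fs))
        {
            w.Write(time.Ticks);
            w.Write(open);
            w.Write(high);
            w.Write(low);
            w.Write(close);
            w.Write(volume);
        }
    }
}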

I almost don't dare to say it, but an import and export function for ASCII data is not rocket science and should be part of the basic functionality. Yes, an export function for the data I have downloaded from my provider, which belongs to me and which I (also) want to access in a readable, generally usable format.
0
- ago
#20
QUOTE:
By the way. Why this caching stuff, a permanent conversion into binary format would be more effective (with appending option).

Wrong. You do not account for data corrections that may happen (such as split adjustments being applied or duplicate lines being wiped out).

QUOTE:
It looks like the data is cached again when the data range will be changed.

If you change the bar scale to a non-cached time frame (from say 2 Second to 13 Second) then yes, of course the data will be cached again - once.
0
MIH8
- ago
#21
QUOTE:
Wrong. You do not account for data corrections that may happen (such as split adjustments being applied or duplicate lines being wiped out).


This is only true if you work with adjusted data, which is not necessarily the case. Still, you make a valid point.

QUOTE:

If you change the bar scale to a non-cached time frame (from say 2 Second to 13 Second) then yes, of course the data will be cached again - once.


Do you see a difference between "scale" and "range"?

0
