TL;DR: Simple information about the structure of a model can be used to determine the amount of data needed for a modeling and prediction task.
Audience: data scientists, modelers, managers of modeling projects, and anyone who plans data gathering.
Read time: 10 minutes.
All projects that use data for modeling and prediction face the question of how much data is enough. This is an important question that could make or break such a project. The easy answer is: as much as possible, the more the merrier. One idea behind this answer is the Law of Large Numbers, which states that as a probabilistic process is repeated the relative frequencies of its outcomes better approximate their true probabilities. But, as usual in life, things are not always that simple: many projects do not have the luxury of abundant data.
Practical Limitations
Before we dive into a quantitative approach for answering the question, which is the main topic of this post, let’s consider a few practical limitations on the amount of data that can be gathered.
- Scale limitations: It may become too difficult to gather the data. For example, consider DNA sequencing, which is used for gathering genetic data. This process involves a chemical reaction that requires reagents (special materials), a DNA sequencing machine, and trained personnel. If any of these is limited in supply, as has been the case in many countries during the coronavirus pandemic, scaling of the process will be limited.
- Cost limitations: It may become too expensive to gather the data. If storage for historical data is paid for by its size and the time of storage, then the cost of storage grows pretty fast over time. To get a sense of just how fast, let’s consider an example. Suppose storage of 1 month’s worth of data over a period of 1 month costs $10. After 2 months, the first month’s data would have been stored for 2 months while the second month’s data would have been stored for 1 month, so the cumulative cost of storage would be $30. Similarly, after 1 year the cost would be $780, and after 5 years it would be $18,300 (a short sketch reproducing these numbers appears after this list).
- Value limitations: It may become too unappealing to gather the data. The Law of Large Numbers also tells us that as more and more data is gathered the approximation improves less and less. An example of this is estimating the average height of a population by averaging samples. The first 100 samples would improve the estimation much more than the next 100 samples.
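To make the cost example concrete, here is a minimal sketch reproducing the numbers above; the cumulative cost after n months is $10 × n(n+1)/2, and the helper name storage_cost is ours, not from any standard library.
>>> def storage_cost(months, monthly_rate=10):
...     # cumulative cost when each month adds one month's worth of data and
...     # every stored month is billed monthly_rate for each month it is kept
...     return monthly_rate * months * (months + 1) // 2
...
>>> [storage_cost(m) for m in (2, 12, 60)]
[30, 780, 18300]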
A Quantitative Approach
But, ignoring practical considerations, how can we estimate the right amount of data to gather for a particular task? We’ll consider this question from a systems perspective, which views the model as a system with inputs and outputs. Understanding these inputs and outputs is key. The analysis will make a couple of assumptions:
- Stationarity, i.e. that the distribution of the data does not change during the period of gathering. If it does change, the data gathered during any single period of stability may not be enough.
- Finiteness, i.e. that the set of system states is finite (and so discrete).
Let’s start by considering the number of possible outputs. A simple yet illuminating case for analyzing model outputs, known as explained variables, is that of a die. Suppose we are modeling a six-faced die that may be unfair (i.e. the probability of getting a particular face may not be 1/6) in an unknown way. How many times should we roll the die to get a good estimate of its face probabilities? We first need to define what good means here: we’d like a probability of at least 0.99 of observing the least probable face at least once. Then, we can calculate the required number of rolls given the minimum face probability p0 we’d like to be able to distinguish from 0. Here is a table of calculated examples followed by the Python code used to generate it.
Minimum face probability | Required number of rolls |
---|---|
1/6 | 26 |
1/8 | 35 |
1/12 | 53 |
1/16 | 72 |
1/20 | 90 |
>>> def rolls(p, q, iters=1000):
...     # smallest number of rolls after which a face of probability p has been
...     # seen at least once with probability greater than q
...     for i in range(iters):
...         if (1.0 - ((1.0 - p) ** i)) > q:
...             return i
...
>>> f = [6,8,12,16,20]
>>> r = [rolls(1.0/f[i],0.99) for i in range(len(f))]
>>> r
[26, 35, 53, 72, 90]
In this range of parameters, we see the required number of rolls is about 4.3 to 4.5 times 1/p0, the reciprocal of the minimum face probability, i.e. an approximately linear connection. This is expected: the condition 1 - (1 - p0)^n > 0.99 gives n > log(0.01)/log(1 - p0), which for small p0 is roughly log(100)/p0 ≈ 4.6/p0.
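As a sanity check, the table can also be reproduced from this closed form. Here is a minimal sketch (the helper name closed_form_rolls is ours, and it reuses the list f from the session above); using ceil is valid here since none of the ratios happen to be exact integers.
>>> from math import ceil, log
>>> def closed_form_rolls(p, q=0.99):
...     # smallest n with 1 - (1 - p) ** n > q, i.e. n > log(1 - q) / log(1 - p)
...     return ceil(log(1.0 - q) / log(1.0 - p))
...
>>> [closed_form_rolls(1.0 / fi) for fi in f]
[26, 35, 53, 72, 90]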
Next, let’s consider the number of inputs. A simple case for analyzing modeling and prediction inputs, known as explaining variables, is that of multiple dice. Suppose there are two dice we can tell apart, each a six-faced die that may be unfair in a different unknown way; moreover, each time, the die to be rolled is chosen by the flip of a two-faced coin that may be unfair in an unknown way as well. How many times should we flip and roll to get a good estimate of the face probabilities of each die? Using a similar definition of a good estimate, we have a required number of rolls r1 and r2 for each die given their minimum face probabilities p1 and p2. Then, we can find a sufficient number of coin flips given the coin’s minimum face probability p0, as follows. Let r be the required number of rolls of the less likely die. We gather the flips in r blocks and aim, in each block, to observe at least one flip that selects that die. To ensure this happens with probability 0.99 across all r block events, each block event must have probability b0 such that pow(b0, r) = 0.99, i.e. b0 = exp(log(0.99)/r). Each block therefore needs rolls(p0, b0) flips, so a sufficient number of flips is r * rolls(p0, b0). In the examples below, the coin’s minimum face probability equals that of the dice, so r is the number of rolls from the previous table. Here is a table of calculated examples followed by the Python code used to generate it.
Minimum dice face probability | Block event probability | Minimum coin face probability | Sufficient number of flips |
---|---|---|---|
1/6 | exp(log(0.99)/26) | 1/6 | 1144 |
1/8 | exp(log(0.99)/35) | 1/8 | 2170 |
1/12 | exp(log(0.99)/53) | 1/12 | 5247 |
1/16 | exp(log(0.99)/72) | 1/16 | 9936 |
1/20 | exp(log(0.99)/90) | 1/20 | 16020 |
>>> from math import exp, log
>>> x = [exp(log(0.99) / i) for i in r]
>>> x
[0.9996135233223779, 0.9997128887713624, 0.9998103890000704, 0.999860421743986, 0.9998883358365346]
>>> # sufficient flips: (flips per block, at confidence x[i]) times (number of blocks)
>>> r2 = [rolls(1.0/f[i],x[i]) * rolls(1.0/f[i],0.99) for i in range(len(f))]
>>> r2
[1144, 2170, 5247, 9936, 16020]
In this range of parameters, we see the sufficient number of flips is about 1.7 to 2 times the square of the required number of rolls from the previous table, i.e. an approximately quadratic connection.
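As a rough check of sufficiency, we can also simulate the setup directly. The sketch below (the helper name rare_face_seen is ours, and “good estimate” is read narrowly as observing the rarer die’s rarest face at least once) estimates that probability for the first row of the table; since the block construction is conservative, the observed fraction should come out well above 0.99.
>>> import random
>>> def rare_face_seen(flips, p_coin=1.0/6, p_face=1.0/6):
...     # one experiment: in each flip the rarer die is selected with probability
...     # p_coin; when selected, its rarest face comes up with probability p_face;
...     # succeed if that face is observed at least once over all flips
...     return any(random.random() < p_coin and random.random() < p_face
...                for _ in range(flips))
...
>>> # fraction of successful experiments out of 1000, using the 1144 flips from
>>> # the first table row; expected to be well above 0.99
>>> sum(rare_face_seen(1144) for _ in range(1000)) / 1000.0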
Finally, we consider the adaptive decision of when we may stop gathering data. That is, we now want to decide when to stop based on the data we observe as we gather it, rather than upfront before we start gathering. In particular, we do not want to estimate a minimum face probability upfront, perhaps because we do not know how. For this decision, we can use a simple rule: stop gathering once each output has been observed at least a small number n of times for each input, and call the amount of data gathered at that point G. For n=1 we can use the above analysis to find the minimum face probability p (that can be distinguished from 0) corresponding to G. The analysis can be enhanced to find p for higher values of n.
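For n=1, this amounts to inverting the earlier calculation, i.e. solving 1 - (1 - p)^G ≥ 0.99 for p. Here is a minimal sketch (the helper name min_distinguishable_prob is ours).
>>> def min_distinguishable_prob(G, q=0.99):
...     # smallest face probability p for which G observations suffice to see
...     # the face at least once with probability at least q
...     return 1.0 - (1.0 - q) ** (1.0 / G)
...
>>> min_distinguishable_prob(26)  # roughly 0.162, just below 1/6, matching the first table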
Conclusions
Sometimes there are practical limitations to gathering ever more data for modeling and prediction projects, and one needs to decide how much data is enough. We presented a simple quantitative approach for answering this question for stationary finite systems, one that accounts for:
- The number of possible inputs, and outputs, of the system.
- The minimum probability of a possible input, and output, that should be distinguished from 0.
- The success probability of such distinguishing.
The simple examples we analyzed showed a connection between the reciprocal of the minimum probability distinguishable from 0 and the required amount of data: when considering only the outputs, the connection is approximately linear; when considering the inputs as well, it is approximately quadratic.