You are a data scientist working for a global travel company. A company has 10 million customers, but only 10,000 of them bring the most revenue (we will call them "valuable" clients and other clients are considered "regular").
Information about each customer is stored in one row in the database (thus, there are 10 million rows in the database). In order to increase reliability of the database, the data is not stored on a single server, but instead it is equally split across 5 different servers (horizontal sharding is applied meaning that for each customer information is stored on one server).
Company leadership wants to estimate risks associated with a potential equipment failure resulting in complete loss of data from the failed server. You are asked to compute a probability of loosing at least one valuable client after one server is crashed.
As we have 10 million records split equally across 5 servers, each server stores 10 million / 5 = 2 million records. As we don’t know which server breaks first, and data being split randomly, without loss of generality we may assume that server #1 will break first.
Let x denote the number of valuable clients lost after a server crash. We are asked to compute P(x ≥ 1). Because of mutual exclusivity, this probability can be expressed as a sum of probabilities:
P(x ≥ 1) = P(x = 1) + P(x = 2) + ... + P(x = 10,000)
As we can loose any number of valuable clients from 0 to 10,000 we have:
P(x = 0) + P(x = 1) + P(x = 2) + ... + P(x = 10,000) = 1
This is equivalent to:
P(x = 0) + P(x ≥ 1) = 1
Thus:
P(x ≥ 1) = 1 - P(x = 0)
We can answer the question by simply calculating the value of P(x = 0), instead of computing 10,000 values for P(x = 1), P(x = 2), ..., P(x = 10,000).
So, we "know" that server #1 will crash and that it contains 2 million records. Thus, in order to compute a value of P(x = 0) we need to compute:
- A: number of ways in which we can select 2 million customers out of regular customers (so that no single valuable customer is selected)
- B: number of ways in which we can select 2 million out of all our customers
- Calculate ratio: P(x = 0) = A / B
We don't care about the order in which customers' records populate database, but instead we are only interested in the final set of customers. Thus, we can use the following formula for computing a number of combinations in which r objects can be selected from a set of n objects where repetition is not allowed (as there is just one record for each customer):
Exclamation mark "!" in the formula above denotes factorial (r! = 1 * 2 * ... * (r - 1) * r).
In both cases (A and B): r = 2 million, because this is the number of records stored in each server. However, in case A: n = (10 million - 10,000) = 9,990,000 because here we account for only regular clients. But, in case B: n = 10 million, because we account for all possible combinations.
So now we need to just plug those numbers to our formula:
In the interview setting you definitely won't be asked to compute the final value of this formula. But if you are very curious, this value is much smaller than 0.01. Thus, P(x ≥ 1) = 1 - P(x = 0) > 0.99 and hence chances of loosing at least one valuable client are greater than 99%.
Tip: when you see words "probability of ... at least ..." in the problem formulation, think of "probability of ... No/Zero/Nobody ..." first, and then subtract it from 1 to get the answer to the question.