status 1377 people subscribed

DS Interview Daily

Prepare to your Data Science interview by sharpening your skills in such crucial topics as Core ML, SQL and Probability with daily email containing interview-like problems asked by the top tech companies.

facebook amazon apple netflix google

Sample Problems

Core ML

Core ML

What is the difference between Random Forest (RF) and Gradient Boosted Trees (GBT)?

Key Differences:

  • Overfitting: GBT are more prone to overfitting than RF, because in RF each tree is built with a random sample of the data which makes RF models more robust. Pruning can be used to handle a problem of overfitting in GBT.
  • Parallelization: In RF decision trees are trained independently from each other. This can significantly speed up the overall computation time due to parallelization. In GBT however, trees are built sequentially - one at a time. Thus, GBT are generally slower to train.
  • Final Prediction: In RF final prediction is computed as a simple average (or majority vote for classification problems) of the outputs of all trees.

Bonus Interview Points! Similarities:

  • Ensemble: Both RF and GBT are ensemble techniques which combine the outputs of individual trees to calculate the final prediction.
  • Usage: RF and GBT can be used for solving both regression and classification problems.

SQL

SQL

Let's say you work for a supermarket chain company. The company leadership has noticed, that stores generated historically high profit in 2019, and decided to award top-3 stores which sold the most items that year in each city. For cities with less than 3 stores only top-1 store should be awarded. You are asked to extract stores that should be awarded.

Sample Input Tables:

Daily Sales Table Stores Table2
SQL Problem Solution
    Notes:
  • Line 5: we explicitly specified ds and s, but they could be omitted because the column names on which tables are joined are different.
  • strftime('%Y', date) is a SQLite command. For other database management systems syntax might be different (e.g. SELECT EXTRACT(YEAR FROM date) in MySQL and PostgreSQL).
  • DENSE_RANK() is used instead of RANK() because there might be multiple stores in one city with same number of annual sales.

Probability

Probability Concepts

You are a data scientist working for a global travel company. A company has 10 million customers, but only 10,000 of them bring the most revenue (we will call them "valuable" clients and other clients are considered "regular").

Information about each customer is stored in one row in the database (thus, there are 10 million rows in the database). In order to increase reliability of the database, the data is not stored on a single server, but instead it is equally split across 5 different servers (horizontal sharding is applied meaning that for each customer information is stored on one server).

Company leadership wants to estimate risks associated with a potential equipment failure resulting in complete loss of data from the failed server. You are asked to compute a probability of loosing at least one valuable client after one server is crashed.

As we have 10 million records split equally across 5 servers, each server stores 10 million / 5 = 2 million records. As we don’t know which server breaks first, and data being split randomly, without loss of generality we may assume that server #1 will break first.

Let x denote the number of valuable clients lost after a server crash. We are asked to compute P(x ≥ 1). Because of mutual exclusivity, this probability can be expressed as a sum of probabilities:

P(x ≥ 1) = P(x = 1) + P(x = 2) + ... + P(x = 10,000)

As we can loose any number of valuable clients from 0 to 10,000 we have:

P(x = 0) + P(x = 1) + P(x = 2) + ... + P(x = 10,000) = 1

This is equivalent to:

P(x = 0) + P(x ≥ 1) = 1

Thus:

P(x ≥ 1) = 1 - P(x = 0)

We can answer the question by simply calculating the value of P(x = 0), instead of computing 10,000 values for P(x = 1), P(x = 2), ..., P(x = 10,000).

So, we "know" that server #1 will crash and that it contains 2 million records. Thus, in order to compute a value of P(x = 0) we need to compute:

  • A: number of ways in which we can select 2 million customers out of regular customers (so that no single valuable customer is selected)
  • B: number of ways in which we can select 2 million out of all our customers
  • Calculate ratio: P(x = 0) = A / B

We don't care about the order in which customers' records populate database, but instead we are only interested in the final set of customers. Thus, we can use the following formula for computing a number of combinations in which r objects can be selected from a set of n objects where repetition is not allowed (as there is just one record for each customer):

Combinations Formula

Exclamation mark "!" in the formula above denotes factorial (r! = 1 * 2 * ... * (r - 1) * r).

In both cases (A and B): r = 2 million, because this is the number of records stored in each server. However, in case A: n = (10 million - 10,000) = 9,990,000 because here we account for only regular clients. But, in case B: n = 10 million, because we account for all possible combinations.

So now we need to just plug those numbers to our formula:

A Calculation A Calculation A Calculation

In the interview setting you definitely won't be asked to compute the final value of this formula. But if you are very curious, this value is much smaller than 0.01. Thus, P(x ≥ 1) = 1 - P(x = 0) > 0.99 and hence chances of loosing at least one valuable client are greater than 99%.

Tip: when you see words "probability of ... at least ..." in the problem formulation, think of "probability of ... No/Zero/Nobody ..." first, and then subtract it from 1 to get the answer to the question.

FAQ

  • After you subscribe, you will receive an email from ds@dsinterviewdaily.com every day containing a problem in one of the following disciplines: Core ML, SQL, and Probability. Depending on the time of subscription, you will receive first email either same day or next day.

  • Please check spam folder in your email and look for ds@dsinterviewdaily.com. If you find it, please mark it as a non-spam email, so that it appears properly in your inbox. If our messages did not reach you, please contact us at ds@dsinterviewdaily.com.

  • You can unsubscribe any time by clicking the "Unsubscribe" link at the end of any email.

  • For any questions, suggestions, or concerns, please reach out to us at ds@dsinterviewdaily.com.