Data
I use the demo dataset provided by Robyn and follow the same data-preparation steps so that both approaches share the same baseline. For the comparison with Robyn's solutions, I ran the demo.R file that ships with the Robyn package, without changing any settings.
The dataset consists of 208 weeks of revenue and includes:
- 5 media spend channels: tv_S, ooh_S, print_S, facebook_S, search_S
- 2 media channels that also have exposure information (impressions, clicks): facebook_I, search_clicks_P
- Organic media without spend: newsletter
- Control variables: events, holidays, competitor sales (competitor_sales_B)
The modeling window is 92 weeks, from 2016-11-21 to 2018-08-20. To keep the code generic, I define two index variables that refer to this time window:
START_ANALYSIS_INDEX = 52   # first week of the modeling window (2016-11-21)
END_ANALYSIS_INDEX = 144    # end of the window, exclusive when slicing (last week: 2018-08-20)
Data Preparation
We have to apply two preparation steps, as described in Robyn's documentation:
Robyn leverages Prophet, Facebook’s open source ML library for time series forecasting. We use Prophet to automatically decompose the trend, seasonality and holidays effects directly from the response as input variables for further modeling. This capability often increases model fit and reduces autoregressive patterns in residuals.
In addition, we can convert categorical variables such as events into numeric ones using Prophet's decomposition.
The second preparation step:
When using exposure variables (impressions, clicks, GRPs, etc.) instead of spend, Robyn fits a nonlinear model with the Michaelis-Menten function between exposure and spend to establish the spend-exposure relationship
The second step will allow us to convert exposure into spend for the final share of spend vs. share of effect comparison.
Let’s load the data first:
import pandas as pd
import numpy as np

data = pd.read_csv("./data/data_raw_2015-11-23__2019-11-11.csv", parse_dates = ["DATE"])
data.columns = [c.lower() if c in ["DATE"] else c for c in data.columns]
data
The holidays data comes in a separate file originally shipped with Prophet. It has daily granularity, so it must first be aggregated to a weekly level. Robyn uses German holidays in its demo.
holidays = pd.read_csv("./data/prophet_holidays_daily.csv", parse_dates = ["ds"])
holidays["begin_week"] = holidays["ds"].dt.to_period('W-SUN').dt.start_time

# combine holidays that fall in the same week into one holiday
holidays_weekly = holidays.groupby(["begin_week", "country", "year"], as_index = False).agg({'holiday': '#'.join}).rename(columns = {'begin_week': 'ds'})
holidays_weekly_de = holidays_weekly.query("(country == 'DE')").copy()
holidays_weekly_de
Prophet Decomposition
Now we are ready to fit Prophet on our data, including the holidays and the categorical events variable, with yearly seasonality enabled. Note that we apply the decomposition to all available data, not only to the modeling window.
from prophet import Prophet  # "from fbprophet import Prophet" on older Prophet versions

prophet_data = data.rename(columns = {'revenue': 'y', 'date': 'ds'})
# add the categorical events variable into Prophet as dummy-encoded extra regressors
prophet_data = pd.concat([prophet_data, pd.get_dummies(prophet_data["events"], drop_first = True, prefix = "events")], axis = 1)

prophet = Prophet(yearly_seasonality = True, holidays = holidays_weekly_de)
prophet.add_regressor(name = "events_event2")
prophet.add_regressor(name = "events_na")
prophet.fit(prophet_data[["ds", "y", "events_event2", "events_na"]])
prophet_predict = prophet.predict(prophet_data[["ds", "y", "events_event2", "events_na"]])
Let’s extract the seasonality, trend, holidays, and events components:
prophet_columns = [col for col in prophet_predict.columns if (col.endswith("upper") == False) & (col.endswith("lower") == False)]
events_numeric = prophet_predict[prophet_columns].filter(like = "events_").sum(axis = 1)

final_data = data.copy()
final_data["trend"] = prophet_predict["trend"]
final_data["season"] = prophet_predict["yearly"]
final_data["holiday"] = prophet_predict["holidays"]
final_data["events"] = (events_numeric - np.min(events_numeric)).values
Spend-Exposure Estimation
In this step, we estimate the nonlinear spend-exposure relationship using the Michaelis-Menten function.
The spend-to-exposure function is defined as follows:
exposure = V_max · spend / (K_m + spend)

where V_max and K_m are the two parameters we have to estimate. They will later be used to derive the reverse, exposure-to-spend relationship. The parameters are estimated on the modeling window.
from scipy import optimize

# define the Michaelis-Menten spend-to-exposure function
spend_to_exposure_menten_func = lambda spend, V_max, K_m: V_max * spend / (K_m + spend)

media_exposures = ["facebook_I", "search_clicks_P"]
media_spends = ["facebook_S", "search_S"]

media_spend_exposure_df = pd.DataFrame()
for (media_exposure, media_spend) in zip(media_exposures, media_spends):
    # initial guesses for the curve fit
    V_max = final_data[media_exposure].values[START_ANALYSIS_INDEX : END_ANALYSIS_INDEX].max()
    K_m = V_max / 2
    spend = final_data[media_spend].values[START_ANALYSIS_INDEX : END_ANALYSIS_INDEX]
    exposure = final_data[media_exposure].values[START_ANALYSIS_INDEX : END_ANALYSIS_INDEX]
    best_values, _ = optimize.curve_fit(f = spend_to_exposure_menten_func, xdata = spend, ydata = exposure, p0 = [V_max, K_m])
    media_spend_exposure_df = pd.concat([media_spend_exposure_df, pd.DataFrame({'spend': [media_spend], 'exposure': [media_exposure], 'V_max': [best_values[0]], 'K_m': [best_values[1]]})]).reset_index(drop = True)

media_spend_exposure_df
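The fitted V_max and K_m will later let us convert exposure back into spend for the share of spend vs. share of effect comparison. A minimal sketch of that inverse, obtained by solving the Michaelis-Menten equation for spend (the function and variable names here are illustrative, not from the original code):

# exposure = V_max * spend / (K_m + spend)  =>  spend = K_m * exposure / (V_max - exposure)
exposure_to_spend_menten_func = lambda exposure, V_max, K_m: K_m * exposure / (V_max - exposure)

# illustrative usage: convert facebook impressions back into estimated spend
# note: only valid for exposure values below the fitted V_max
row = media_spend_exposure_df.query("exposure == 'facebook_I'").iloc[0]
estimated_facebook_spend = exposure_to_spend_menten_func(
    final_data["facebook_I"].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX],
    row["V_max"],
    row["K_m"],
)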
Bayesian Modeling
Now we are ready for modeling. We model our response variable (revenue) as an additive linear regression, which can be described by the following equation:
revenue = b_0 + Σ_m b_m · saturation(adstock(x_m)) + Σ_c b_c · x_c + ϵ

where b_0 corresponds to the baseline revenue, the b_m coefficients correspond to the media variables transformed through the adstock and saturation functions, the b_c coefficients correspond to the control variables, and ϵ is noise. All of these coefficients, the noise, and the parameters of the adstock and saturation functions have to be estimated by the model.
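For reference, here is a minimal NumPy sketch of the two transformations the equation refers to: a geometric adstock (carryover) and a Hill saturation curve. This illustrates the standard definitions and is not the author's Theano/PyMC3 implementation (adstock_geometric_theano_pymc3, saturation_hill_pymc3) used in the model code below.

import numpy as np

def adstock_geometric(x, theta):
    # geometric carryover: each week keeps a fraction theta of the previous week's adstocked value
    x_decayed = np.zeros_like(x, dtype = float)
    x_decayed[0] = x[0]
    for i in range(1, len(x)):
        x_decayed[i] = x[i] + theta * x_decayed[i - 1]
    return x_decayed

def saturation_hill(x, alpha, gamma):
    # Hill saturation: gamma is translated into an inflexion point on the scale of x
    gamma_trans = gamma * (np.max(x) - np.min(x)) + np.min(x)
    return x ** alpha / (x ** alpha + gamma_trans ** alpha)

In the model below, the adstock is applied to the full history while the saturation is applied only to the modeling window, as described in the modeling steps that follow.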
The first decision we have to make prior to modeling is how to normalize the dependent and independent variables. Normalizing the independent variables makes the approach generalize to other data sources, because the same priors can be reused for most of them; it is also very difficult to set sensible priors on non-normalized data. Therefore I apply 0-1 normalization to the independent variables and experiment with two different normalizations of the response variable:
- Scaling by 100K
- 0–1 Normalization
Scaling by 100K
The original revenue ranges from 672,250 to 3,827,520, so scaling it by 100K gives a range of 6.72-38.27, which makes it easier to experiment with priors.
First, I define my input variables:
data = final_data

transform_variables = ["trend", "season", "holiday", "competitor_sales_B", "events", "tv_S", "ooh_S", "print_S", "facebook_I", "search_clicks_P", "newsletter"]
delay_channels = ["tv_S", "ooh_S", "print_S", "facebook_I", "search_clicks_P", "newsletter"]
media_channels = ["tv_S", "ooh_S", "print_S", "facebook_I", "search_clicks_P"]
control_variables = ["trend", "season", "holiday", "competitor_sales_B", "events"]
target = "revenue"
Then I normalize the independent variables using MinMaxScaler and scale the dependent variable by 100K:
from sklearn.preprocessing import MinMaxScaler

data_transformed = data.copy()
numerical_encoder_dict = {}
for feature in transform_variables:
    scaler = MinMaxScaler()
    original = data[feature].values.reshape(-1, 1)
    transformed = scaler.fit_transform(original)
    data_transformed[feature] = transformed
    numerical_encoder_dict[feature] = scaler

dependent_transformation = None
original = data[target].values
data_transformed[target] = original / 100_000
The modeling part consists of several steps:
- I first iterate through the media (delay) channels and define the priors for the adstock and saturation transformations. I experimented with different priors but finally used those described in the Bayesian Methods paper
- I apply the adstock transformation on all available data so that the carryover effect can build up from the historical data
- I apply saturation transformation on the modeling window defined by START_ANALYSIS_INDEX and END_ANALYSIS_INDEX
- I force the regression coefficients for delay channels to be positive using HalfNormal distribution
- Next, I iterate through the rest of the variables without any constraints on the sign of the coefficients
- I define the prior for the intercept as a normal distribution centered at the average revenue. Following Robyn's methodology, the intercept is constrained to be positive.
import pymc3 as pm

response_mean = []
with pm.Model() as model_2:
    # media (delay) channels: adstock + saturation transforms with positive coefficients
    for channel_name in delay_channels:
        print(f"Delay Channels: Adding {channel_name}")

        x = data_transformed[channel_name].values

        adstock_param = pm.Beta(f"{channel_name}_adstock", 3, 3)
        saturation_gamma = pm.Beta(f"{channel_name}_gamma", 2, 2)
        saturation_alpha = pm.Gamma(f"{channel_name}_alpha", 3, 1)

        # adstock on all available data, saturation only on the modeling window
        x_new = adstock_geometric_theano_pymc3(x, adstock_param)
        x_new_sliced = x_new[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]
        saturation_tensor = saturation_hill_pymc3(x_new_sliced, saturation_alpha, saturation_gamma)

        channel_b = pm.HalfNormal(f"{channel_name}_media_coef", sd = 3)
        response_mean.append(saturation_tensor * channel_b)

    # control variables: unconstrained coefficients, modeling window only
    for control_var in control_variables:
        print(f"Control Variables: Adding {control_var}")

        x = data_transformed[control_var].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]

        control_beta = pm.Normal(f"{control_var}_control_coef", sd = 3)
        control_x = control_beta * x
        response_mean.append(control_x)

    intercept = pm.Normal("intercept", np.mean(data_transformed[target].values), sd = 3)
    #intercept = pm.HalfNormal("intercept", sd = 3)

    sigma = pm.HalfNormal("sigma", 4)

    likelihood = pm.Normal("outcome", mu = intercept + sum(response_mean), sd = sigma, observed = data_transformed[target].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX])
I generate samples from the prior distribution to check whether my choice of priors is reasonable:
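A minimal sketch of how this can be done with PyMC3's prior predictive sampling (the prior_checks name is illustrative, not from the original code):

with model_2:
    # draw samples of all model variables and the outcome from the priors only
    prior_checks = pm.sample_prior_predictive(samples = 1000)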
And plot the prior distributions:
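For example, the sampled prior outcome can be compared against the scaled observed revenue (a matplotlib sketch; the author's actual plots may differ):

import matplotlib.pyplot as plt

# distribution of the prior predictive outcome vs. the scaled observed revenue
fig, ax = plt.subplots(figsize = (10, 4))
ax.hist(prior_checks["outcome"].flatten(), bins = 50, alpha = 0.5, label = "prior predictive outcome")
ax.hist(data_transformed[target].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX], bins = 20, alpha = 0.5, label = "observed (scaled) revenue")
ax.legend()
plt.show()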