Synthetic Control Method
Overview
The synthetic control method is due to Abadie and Gardeazabal [AG03] (also see Abadie, Diamond and Hainmueller [ADH07] [ADH15]). This method constructs a weighted combination of the control units that most resembles the selected characteristics of the treated unit in a time period prior to the treatment time. This so-constructed “synthetic control unit” can then be compared with the treated unit to investigate the causal effect of the treatment.
Details
In particular, this method constructs a vector of non-negative weights \(w = (w_1, w_2, \dots, w_k)\), where \(k\) is the number of control units and the weights sum to 1, that minimizes
\[\|x_1 - X_0 w^T\|_V\]
where
\(\|A\|_V=\sqrt{A^TVA}\), where \(V\) is a diagonal matrix with non-negative entries that captures the relationship between the outcome variable and the predictors,
\(X_0\) is a matrix of the values for the control units of the chosen statistic for the chosen predictors over the selected (pre-intervention) time-period (each column corresponds to a control),
\(x_1\) is a (column) vector of the corresponding values for the treated unit.
The matrix \(V\) can be supplied; otherwise it is part of the optimization problem: it is obtained by choosing \(V\) so that the resulting weights \(w\) minimize the quantity
\[\|z_1 - Z_0 w^T\|^2\]
where
\(Z_0\) is a matrix of the values of the outcome variable for the control units over the (pre-intervention) time-period (each column corresponds to a control),
\(z_1\) is a (column) vector of the corresponding values for the treated unit.
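The two nested problems above can be sketched with NumPy and SciPy. This is a minimal illustration under stated assumptions, not the package's implementation: the helper names (normalise, inner_weights, outer_loss), the SLSQP inner solver, the normalisation of the diagonal of \(V\) and the toy data are all hypothetical choices made for this example.

```python
import numpy as np
from scipy.optimize import minimize


def normalise(v_raw):
    """Map an unconstrained vector to a non-negative diagonal that sums to 1."""
    v = np.abs(v_raw)
    return v / v.sum()


def inner_weights(X0, x1, v):
    """Solve min_w ||x1 - X0 w||_V subject to w >= 0 and sum(w) = 1, V = diag(v)."""
    k = X0.shape[1]

    def loss(w):
        resid = x1 - X0 @ w
        # Squared V-norm; it has the same minimiser as the norm itself.
        return float(resid @ (v * resid))

    result = minimize(
        loss,
        x0=np.full(k, 1.0 / k),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x


def outer_loss(v_raw, X0, x1, Z0, z1):
    """Squared pre-intervention outcome error for the weights implied by V."""
    w = inner_weights(X0, x1, normalise(v_raw))
    return float(np.sum((z1 - Z0 @ w) ** 2))


# Toy data (assumption): 5 predictors, 4 control units, 10 pre-intervention periods.
rng = np.random.default_rng(0)
m, k, n = 5, 4, 10
true_w = np.array([0.6, 0.4, 0.0, 0.0])
X0 = rng.normal(size=(m, k))
Z0 = rng.normal(size=(n, k))
x1 = X0 @ true_w + 0.1 * rng.normal(size=m)
z1 = Z0 @ true_w + 0.1 * rng.normal(size=n)

# Outer problem: search over the diagonal of V with a derivative-free method.
v_start = np.full(m, 1.0 / m)
result = minimize(
    outer_loss,
    v_start,
    args=(X0, x1, Z0, z1),
    method="Nelder-Mead",
    options={"maxiter": 200},
)
v_opt = normalise(result.x)
w_opt = inner_weights(X0, x1, v_opt)
print("weights:", np.round(w_opt, 3))
```

The inner solver is called once per candidate \(V\), which is why the outer search uses a method that does not need a Jacobian.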
The Synth class
The Synth class implements the synthetic control method. The expected way to use the class is to first create a Dataprep object that defines the study data and then use it as input to a Synth object. See the examples folder of the repository for examples illustrating usage.
The implementation is based on the same method in the R Synth package and aims to produce results that can be reconciled with that package.
- class pysyncon.Synth
Implementation of the synthetic control method due to Abadie & Gardeazabal [AG03].
- att(time_period: Iterable | Series | dict, Z0: DataFrame | None = None, Z1: Series | None = None) → dict[str, float]
Computes the average treatment effect on the treated unit (ATT) and its standard error over the chosen time-period.
- Parameters:
time_period (Iterable | pandas.Series | dict) – Time period to compute the ATT over.
Z0 (pandas.DataFrame, shape (n, c), optional) – The matrix of the time series of the outcome variable for the control units. If no dataprep is set, then this must be supplied along with Z1, by default None.
Z1 (pandas.Series, shape (n, 1), optional) – The matrix of the time series of the outcome variable for the treated unit. If no dataprep is set, then this must be supplied along with Z0, by default None.
- Returns:
A dictionary with the ATT value and the standard error to the ATT.
- Return type:
dict
- Raises:
ValueError – If there is no weight matrix available
ValueError – If there is no Dataprep object set or (Z0, Z1) is not supplied
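As a rough illustration of what such a computation involves, the sketch below averages the gap between the treated unit and the synthetic control over a chosen period. The function name att_sketch, the dictionary keys and the standard-error formula (sample standard deviation of the gaps divided by the square root of the number of periods) are assumptions made for this example; the text above does not state the exact formula used by the class.

```python
import numpy as np
import pandas as pd


def att_sketch(Z0: pd.DataFrame, Z1: pd.Series, w: np.ndarray, time_period) -> dict:
    """Mean gap (treated minus synthetic) over time_period, with an assumed standard error."""
    synthetic = pd.Series(Z0.to_numpy() @ w, index=Z0.index)
    gaps = (Z1 - synthetic).loc[list(time_period)]
    att = float(gaps.mean())
    # Assumption: standard error of the mean gap, using the sample standard deviation.
    se = float(gaps.std(ddof=1) / np.sqrt(len(gaps)))
    return {"att": att, "se": se}


# Toy data (assumption): two control units observed from 2000 to 2005.
years = range(2000, 2006)
Z0 = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [3, 4, 5, 6, 7, 8]}, index=years, dtype=float)
Z1 = pd.Series([2, 3, 4, 6, 9, 10], index=years, dtype=float, name="treated")
w = np.array([0.5, 0.5])

result = att_sketch(Z0, Z1, w, time_period=range(2003, 2006))
print(result)
```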
- confidence_interval(alpha: float, time_periods: list, tol: float, pre_periods: list | None = None, dataprep: Dataprep | None = None, X0: DataFrame | None = None, X1: Series | None = None, Z0: DataFrame | None = None, Z1: Series | None = None, custom_V: ndarray | None = None, optim_method: Literal['Nelder-Mead', 'Powell', 'CG', 'BFGS', 'L-BFGS-B', 'TNC', 'COBYLA', 'trust-constr'] | None = None, optim_initial: Literal['equal', 'ols'] | None = None, optim_options: dict | None = None, method: Literal['conformal'] = 'conformal', max_iter: int = 50, step_sz: float | None = None, step_sz_div: float = 20.0, verbose: bool = True) → DataFrame
Confidence intervals obtained from test-inversion, where the p-values are obtained by adjusted refits of the data following Chernozhukov et al. [VCZ21].
- Parameters:
alpha (float) – The required significance level, e.g. alpha = 0.05 will yield a confidence level of 100 * (1 - alpha) = 95%.
time_periods (list) – The time-periods to calculate confidence intervals for.
tol (float) – The tolerance (accuracy) required when calculating the lower/upper cut-off point of the confidence interval. The search will try to reach this tolerance level but will not exceed max_iter iterations in doing so.
pre_periods (Optional[list], optional) – The time-periods to use for the optimization when refitting the data with the adjusted outcomes, optional.
dataprep (Optional[Dataprep], optional) – Dataprep object defining the study data, if this is not supplied then either self.dataprep must be set or else (X0, X1, Z0, Z1) must all be supplied, by default None.
X0 (pd.DataFrame, shape (m, c), optional) – Matrix with each column corresponding to a control unit and each row a covariate, if this is not supplied then either dataprep must be supplied or self.dataprep must be set, by default None.
X1 (pandas.Series, shape (m, 1), optional) – Column vector giving the covariate values for the treated unit, if this is not supplied then either dataprep must be supplied or self.dataprep must be set, by default None.
Z0 (pandas.DataFrame, shape (n, c), optional) – A matrix of the time series of the outcome variable with each column corresponding to a control unit and the rows are the time steps; the columns correspond with the columns of X0, if this is not supplied then either dataprep must be supplied or self.dataprep must be set, by default None.
Z1 (pandas.Series, shape (n, 1), optional) – Column vector giving the outcome variable values over time for the treated unit, if this is not supplied then either dataprep must be supplied or self.dataprep must be set, by default None.
custom_V (numpy.ndarray, shape (c, c), optional) – Provide a V matrix (using the notation of the Abadie, Diamond & Hainmueller paper), the optimisation problem will only then be solved for the weight matrix W. This is the same argument as in the fit method, by default None.
optim_method (str, optional) –
Optimisation method to use for the outer optimisation, can be any of the valid options for scipy minimize that do not require a jacobian matrix, namely
’Nelder-Mead’
’Powell’
’CG’
’BFGS’
’L-BFGS-B’
’TNC’
’COBYLA’
’trust-constr’
This is the same argument as in the fit method, by default ‘Nelder-Mead’.
optim_initial (str, optional) –
Starting value for the outer optimisation, possible starting values are
’equal’, where the weights are all equal,
’ols’, which uses a starting value obtained for fitting a regression.
This is the same argument as in the fit method, by default ‘equal’.
optim_options (dict, optional) – Options to provide to the outer part of the optimisation, valid options are any option that can be provided to scipy minimize for the given optimisation method. This is the same argument as in the fit method, by default {‘maxiter’: 1000}.
method (str, optional) – The type of method to use when computing the confidence intervals, currently only conformal inference (conformal) is implemented, by default “conformal”.
max_iter (int, optional) – Maximum number of times to re-fit the data when trying to locate the lower/upper cut-off point and when binary searching for the cut-off point, by default 50.
step_sz (Optional[float], optional) – Step size to use when searching for an interval that contains the lower or upper cut-off point of the confidence interval, by default None.
step_sz_div (float, optional) – Alternative way to define step size: it is the fraction that defines step-size in terms of the standard deviation of the att, i.e. if step_sz_div=20.0 then the step size used will be (att +/- 2.5 * std(att)) / 20.0, by default 20.0.
verbose (bool, optional) – Print output, by default True.
- Returns:
A pandas.DataFrame indexed by time_periods, with 3 columns: value, the calculated treatment effect; lower_ci, the lower end of the confidence interval; upper_ci, the upper end of the confidence interval.
- Return type:
pd.DataFrame
- Raises:
ValueError – if no Dataprep object is supplied, self.dataprep is not set and (X0, X1, Z0, Z1) are not all supplied.
TypeError – if (\(X1\), \(Z1\)) are not of type pandas.Series.
ValueError – if dataprep is not set and pre-periods is not set.
ValueError – if an invalid option for method is given, currently only conformal is supported.
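The roles of tol, max_iter and step_sz in a test-inversion search can be illustrated with a generic bracket-and-bisect routine. This is only a sketch under stated assumptions: upper_cutoff and toy_p_value are hypothetical names, and the toy p-value function stands in for the p-values that the method obtains by refitting the data with adjusted outcomes.

```python
from typing import Callable


def upper_cutoff(
    p_value: Callable[[float], float],
    start: float,
    alpha: float,
    step_sz: float,
    tol: float,
    max_iter: int = 50,
) -> float:
    """Approximate the largest candidate effect whose p-value is still >= alpha."""
    lo, hi = start, start + step_sz
    n_iter = 0
    # Phase 1: step outwards until the p-value drops below alpha.
    while p_value(hi) >= alpha:
        lo, hi = hi, hi + step_sz
        n_iter += 1
        if n_iter >= max_iter:
            raise RuntimeError("no bracketing interval found within max_iter steps")
    # Phase 2: bisect the bracket [lo, hi] until it is narrower than tol.
    while hi - lo > tol and n_iter < max_iter:
        mid = 0.5 * (lo + hi)
        if p_value(mid) >= alpha:
            lo = mid
        else:
            hi = mid
        n_iter += 1
    return lo


def toy_p_value(theta: float) -> float:
    """Hypothetical p-value: 1 at theta = 2, falling linearly to 0 at a distance of 3."""
    return max(0.0, 1.0 - abs(theta - 2.0) / 3.0)


upper = upper_cutoff(toy_p_value, start=2.0, alpha=0.05, step_sz=0.5, tol=0.01)
print(upper)
```

For this toy function the exact cut-off is 4.85, and the search stops once the bracket around it is narrower than tol. The lower cut-off would be found the same way with a negative step.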
- fit(dataprep: Dataprep | None = None, X0: DataFrame | None = None, X1: Series | None = None, Z0: DataFrame | None = None, Z1: Series | None = None, custom_V: ndarray | None = None, optim_method: Literal['Nelder-Mead', 'Powell', 'CG', 'BFGS', 'L-BFGS-B', 'TNC', 'COBYLA', 'trust-constr'] = 'Nelder-Mead', optim_initial: Literal['equal', 'ols'] = 'equal', optim_options: dict = {'maxiter': 1000}) → None
Fit the model/calculate the weights. Either a Dataprep object should be provided or otherwise matrices (\(X_0\), \(X_1\), \(Z_0\), \(Z_1\)) should be provided (using the notation of Abadie & Gardeazabal [AG03]).
- Parameters:
dataprep (Dataprep, optional) – Dataprep object containing data to model, by default None.
X0 (pd.DataFrame, shape (m, c), optional) – Matrix with each column corresponding to a control unit and each row a covariate, by default None.
X1 (pandas.Series, shape (m, 1), optional) – Column vector giving the covariate values for the treated unit, by default None.
Z0 (pandas.DataFrame, shape (n, c), optional) – A matrix of the time series of the outcome variable with each column corresponding to a control unit and the rows are the time steps; the columns correspond with the columns of X0, by default None.
Z1 (pandas.Series, shape (n, 1), optional) – Column vector giving the outcome variable values over time for the treated unit, by default None.
custom_V (numpy.ndarray, shape (c, c), optional) – Provide a V matrix (using the notation of the Abadie, Diamond & Hainmueller paper), the optimisation problem will only then be solved for the weight matrix W, by default None.
optim_method (str, optional) –
Optimisation method to use for the outer optimisation, can be any of the valid options for scipy minimize that do not require a jacobian matrix, namely
’Nelder-Mead’
’Powell’
’CG’
’BFGS’
’L-BFGS-B’
’TNC’
’COBYLA’
’trust-constr’
By default ‘Nelder-Mead’.
optim_initial (str, optional) –
Starting value for the outer optimisation, possible starting values are
’equal’, where the weights are all equal,
’ols’, which uses a starting value obtained for fitting a regression.
By default ‘equal’.
optim_options (dict, optional) – Options to provide to the outer part of the optimisation, valid options are any option that can be provided to scipy minimize for the given optimisation method, by default {‘maxiter’: 1000}.
- Returns:
None
- Return type:
NoneType
- Raises:
ValueError – if neither a Dataprep object nor all of (\(X_0\), \(X_1\), \(Z_0\), \(Z_1\)) are supplied.
TypeError – if (\(X1\), \(Z1\)) are not of type pandas.Series.
ValueError – if optim_initial=ols and there is collinearity in the data.
ValueError – if optim_initial is not one of ‘equal’ or ‘ols’.
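A minimal sketch of calling fit with the matrices supplied directly, rather than through a Dataprep object, might look as follows. The toy data, unit names and the name given to the treated series are assumptions made for this example, and the import is guarded so the data-building part runs even where pysyncon is not installed.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): 3 predictors, 4 control units, 8 time steps.
rng = np.random.default_rng(0)
controls = ["A", "B", "C", "D"]
X0 = pd.DataFrame(rng.normal(10, 2, size=(3, 4)), index=["p1", "p2", "p3"], columns=controls)
Z0 = pd.DataFrame(rng.normal(50, 5, size=(8, 4)), index=range(2000, 2008), columns=controls)

# The treated unit is built as a known mix of controls A and B.
true_w = np.array([0.7, 0.3, 0.0, 0.0])
X1 = pd.Series(X0.to_numpy() @ true_w, index=X0.index, name="treated")
Z1 = pd.Series(Z0.to_numpy() @ true_w, index=Z0.index, name="treated")

try:
    from pysyncon import Synth
except ImportError:  # pysyncon is optional for this sketch
    Synth = None

weights = None
if Synth is not None:
    synth = Synth()
    # X1 and Z1 must be pandas.Series, otherwise fit raises a TypeError.
    synth.fit(X0=X0, X1=X1, Z0=Z0, Z1=Z1, optim_method="Nelder-Mead", optim_initial="equal")
    weights = synth.weights(round=3)
    print(weights)
```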
- gaps_plot(time_period: Iterable | Series | dict | None = None, treatment_time: int | None = None, grid: bool = True, Z0: DataFrame | None = None, Z1: Series | None = None) → None
Plots the gap between the treated unit and the synthetic unit over time.
- Parameters:
time_period (Iterable | pandas.Series | dict, optional) – Time range to plot, if none is supplied then the time range used is the time period over which the optimisation happens, by default None
treatment_time (int, optional) – If supplied, plot a vertical line at the time period in which the treatment occurred, by default None
grid (bool, optional) – Whether or not to plot a grid, by default True
Z0 (pandas.DataFrame, shape (n, c), optional) – The matrix of the time series of the outcome variable for the control units. If no dataprep is set, then this must be supplied along with Z1, by default None.
Z1 (pandas.Series, shape (n, 1), optional) – The matrix of the time series of the outcome variable for the treated unit. If no dataprep is set, then this must be supplied along with Z0, by default None.
- Raises:
ValueError – If there is no weight matrix available
ValueError – If there is no Dataprep object set or (Z0, Z1) is not supplied
- mae(Z0: DataFrame | None = None, Z1: Series | None = None) → float
Returns the mean absolute error in the fit of the synthetic control versus the treated unit over the optimization time-period.
- Parameters:
Z0 (pandas.DataFrame, shape (n, c), optional) – The matrix of the time series of the outcome variable for the control units. If no dataprep is set, then this must be supplied along with Z1, by default None.
Z1 (pandas.Series, shape (n, 1), optional) – The matrix of the time series of the outcome variable for the treated unit. If no dataprep is set, then this must be supplied along with Z0, by default None.
- Returns:
Mean absolute error
- Return type:
float
- Raises:
ValueError – If the fit method has not been run (no weights available).
ValueError – If there is no Dataprep object set or (Z0, Z1) is not supplied
- mape(Z0: DataFrame | None = None, Z1: Series | None = None) → float
Returns the mean absolute percentage error in the fit of the synthetic control versus the treated unit over the optimization time-period.
- Parameters:
Z0 (pandas.DataFrame, shape (n, c), optional) – The matrix of the time series of the outcome variable for the control units. If no dataprep is set, then this must be supplied along with Z1, by default None.
Z1 (pandas.Series, shape (n, 1), optional) – The matrix of the time series of the outcome variable for the treated unit. If no dataprep is set, then this must be supplied along with Z0, by default None.
- Returns:
Mean absolute percentage error
- Return type:
float
- Raises:
ValueError – If the fit method has not been run (no weights available).
ValueError – If there is no Dataprep object set or (Z0, Z1) is not supplied
- mspe(Z0: DataFrame | None = None, Z1: Series | None = None) → float
Returns the mean square prediction error in the fit of the synthetic control versus the treated unit over the optimization time-period.
- Parameters:
Z0 (pandas.DataFrame, shape (n, c), optional) – The matrix of the time series of the outcome variable for the control units. If no dataprep is set, then this must be supplied along with Z1, by default None.
Z1 (pandas.Series, shape (n, 1), optional) – The matrix of the time series of the outcome variable for the treated unit. If no dataprep is set, then this must be supplied along with Z0, by default None.
- Returns:
Mean square prediction Error
- Return type:
float
- Raises:
ValueError – If the fit method has not been run (no weights available).
ValueError – If there is no Dataprep object set or (Z0, Z1) is not supplied
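The three fit-quality measures (mae, mape and mspe) can be sketched with NumPy as below. The function name fit_errors and the toy data are hypothetical, and the MAPE definition used here (mean of the absolute gap divided by the treated outcome) is an assumption, since the exact formula is not stated above.

```python
import numpy as np


def fit_errors(Z0: np.ndarray, z1: np.ndarray, w: np.ndarray) -> dict:
    """Error measures for the synthetic control Z0 @ w against the treated series z1."""
    gaps = z1 - Z0 @ w
    return {
        "mspe": float(np.mean(gaps**2)),          # mean square prediction error
        "mae": float(np.mean(np.abs(gaps))),      # mean absolute error
        "mape": float(np.mean(np.abs(gaps / z1))),  # assumed definition, needs z1 != 0
    }


# Toy data (assumption): 3 time steps (rows) and 2 control units (columns).
Z0 = np.array([[1.0, 3.0], [2.0, 4.0], [4.0, 4.0]])
z1 = np.array([2.0, 4.0, 5.0])
w = np.array([0.5, 0.5])

errors = fit_errors(Z0, z1, w)
print(errors)
```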
- path_plot(time_period: Iterable | Series | dict | None = None, treatment_time: int | None = None, grid: bool = True, Z0: DataFrame | None = None, Z1: Series | None = None) → None
Plot the outcome variable over time for the treated unit and the synthetic control.
- Parameters:
time_period (Iterable | pandas.Series | dict, optional) – Time range to plot, if none is supplied then the time range used is the time period over which the optimisation happens, by default None
treatment_time (int, optional) – If supplied, plot a vertical line at the time period in which the treatment occurred, by default None
grid (bool, optional) – Whether or not to plot a grid, by default True
Z0 (pandas.DataFrame, shape (n, c), optional) – The matrix of the time series of the outcome variable for the control units. If no dataprep is set, then this must be supplied along with Z1, by default None.
Z1 (pandas.Series, shape (n, 1), optional) – The matrix of the time series of the outcome variable for the treated unit. If no dataprep is set, then this must be supplied along with Z0, by default None.
- Raises:
ValueError – If there is no weight matrix available
ValueError – If there is no Dataprep object set or (Z0, Z1) is not supplied
- summary(round: int = 3, X0: DataFrame | None = None, X1: Series | None = None) → DataFrame
Generates a pandas.DataFrame with summary data. In particular, it will show the values of the V matrix for each predictor, then the next column will show the mean value of each predictor over the time period time_predictors_prior for the treated unit and the synthetic unit and finally there will be a column ‘sample mean’ that shows the mean value of each predictor over the time period time_predictors_prior across all the control units, i.e. this will be the same as a synthetic control where all the weights are equal.
- Parameters:
round (int, optional) – Round the numbers to given number of places, by default 3
X0 (pd.DataFrame, shape (n_cov, n_controls), optional) – Matrix with each column corresponding to a control unit and each row is a covariate. If no dataprep is set, then this must be supplied along with X1, by default None.
X1 (pandas.Series, shape (n_cov, 1), optional) – Column vector giving the covariate values for the treated unit. If no dataprep is set, then this must be supplied along with X0, by default None.
- Returns:
Summary data.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If there is no V matrix available
ValueError – If there is no Dataprep object set or (X0, X1) is not supplied
ValueError – If there is no weight matrix available
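The layout of such a summary table can be sketched with pandas as follows. The function name summary_sketch, the column labels other than ‘sample mean’, and the toy values are assumptions made for this example, not the method's actual output.

```python
import numpy as np
import pandas as pd


def summary_sketch(X0: pd.DataFrame, X1: pd.Series, w: np.ndarray, v: np.ndarray, ndigits: int = 3) -> pd.DataFrame:
    """One row per predictor: V entry, treated value, synthetic value, control mean."""
    table = pd.DataFrame(
        {
            "V": v,
            "treated": X1,
            "synthetic": X0.to_numpy() @ w,   # weighted combination of the controls
            "sample mean": X0.mean(axis=1),   # equal-weight average of the controls
        },
        index=X0.index,
    )
    return table.round(ndigits)


# Toy data (assumption): two predictors, two control units.
X0 = pd.DataFrame({"A": [10.0, 1.0], "B": [20.0, 3.0]}, index=["gdp", "pop"])
X1 = pd.Series([18.0, 2.4], index=["gdp", "pop"], name="treated")
w = np.array([0.25, 0.75])
v = np.array([0.7, 0.3])

table = summary_sketch(X0, X1, w, v)
print(table)
```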
- weights(round: int = 3, threshold: float | None = None) → Series
Return a pandas.Series of the weights for each control unit.
- Parameters:
round (int, optional) – Round the weights to given number of places, by default 3
threshold (float, optional) – If supplied, will only show weights above this value, by default None
- Returns:
The weights computed
- Return type:
pandas.Series
- Raises:
ValueError – If there is no weight matrix available
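The effect of the round and threshold arguments can be sketched on a plain pandas.Series. The function name show_weights and the toy weights are hypothetical, and the strict comparison is an assumption based on the phrase "above this value".

```python
from typing import Optional

import pandas as pd


def show_weights(w: pd.Series, ndigits: int = 3, threshold: Optional[float] = None) -> pd.Series:
    """Drop weights that are not above threshold (if given), then round the rest."""
    if threshold is not None:
        w = w[w > threshold]
    return w.round(ndigits)


# Toy weights (assumption) for three control units.
raw = pd.Series([0.61234, 0.38716, 0.0005], index=["A", "B", "C"], name="weights")
shown = show_weights(raw, ndigits=3, threshold=0.01)
print(shown)
```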