Dataprep class

This class and its API are based on the similarly named function in the R Synth package.

The Dataprep class defines all the information necessary for a synthetic control study. It takes as arguments: a pandas.DataFrame foo containing the panel data; a list of predictors; optional special predictors; the statistical operation to apply to the predictors over the selected time frame; the dependent variable; the column containing the unit labels; the column containing the time period; the labels denoting the control units; the label denoting the treated unit; the time period over which to carry out the optimisation; and the time period over which to apply the statistical operation to the predictors. See below for further details about each argument, and see the examples folder of the repository for how this class is set up in three real research contexts.

The principal difference between the function signature here and the one in the R Synth package is that the two arguments unit.variable and unit.names.variable in that package are consolidated here into a single argument, unit_variable, since having both is unnecessary.

class pysyncon.Dataprep(foo: pd.DataFrame, predictors: Axes, predictors_op: PredictorsOp_t, dependent: Any, unit_variable: Any, time_variable: Any, treatment_identifier: Any | list | tuple, controls_identifier: list | tuple, time_predictors_prior: IsinArg_t, time_optimize_ssr: IsinArg_t, special_predictors: Iterable[SpecialPredictor_t] | None = None)

Helper class that takes in the panel data and all the information needed to describe the study setup. It is used to automatically generate the matrices needed for the optimisation methods, the plots of the results, etc.
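As a rough illustration of how the arguments in the signature fit together, the sketch below builds a toy long-format panel and a matching set of keyword arguments. The column names (region, year, gdp, invest), the unit labels and all the numbers are invented for this example and do not come from the library or its examples folder.

```python
import pandas as pd

# Toy long-format panel: one row per (unit, time-step) pair.
# Every name and number below is made up for illustration.
foo = pd.DataFrame(
    {
        "region": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
        "year": [2000, 2001, 2002, 2003] * 3,
        "gdp": [1.0, 1.1, 1.2, 1.0, 2.0, 2.1, 2.2, 2.3, 3.0, 3.1, 3.2, 3.3],
        "invest": [0.20, 0.21, 0.22, 0.23, 0.30, 0.31, 0.32, 0.33,
                   0.40, 0.41, 0.42, 0.43],
    }
)

# Keyword arguments named after the parameters in the signature above.
study = dict(
    foo=foo,
    predictors=["invest"],
    predictors_op="mean",
    dependent="gdp",
    unit_variable="region",
    time_variable="year",
    treatment_identifier="A",
    controls_identifier=["B", "C"],
    time_predictors_prior=range(2000, 2002),
    time_optimize_ssr=range(2000, 2003),
    # Each special predictor is a (column, time-range, operator) triple.
    special_predictors=[("gdp", range(2001, 2003), "mean")],
)
```

With pysyncon installed, these keyword arguments could then be passed to the class as Dataprep(**study).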

Parameters:
  • foo (pandas.DataFrame) – A pandas DataFrame containing the panel data where the columns are predictor/outcome variables and each row is a time-step for some unit

  • predictors (Axes) – The columns of foo to use as predictors

  • predictors_op ("mean" | "std" | "median" | "sum" | "count" | "max" | "min" | "var") – The statistical operation to apply to the predictors; it is applied over the time range given by time_predictors_prior

  • dependent (Any) – The column of foo to use as the dependent variable

  • unit_variable (Any) – The column of foo that contains the unit labels

  • time_variable (Any) – The column of foo that contains the time period

  • treatment_identifier (Any) – The unit label that denotes the treated unit

  • controls_identifier (Iterable) – The unit labels denoting the control units

  • time_predictors_prior (Iterable) – The time range over which to apply the statistical operation to the predictors (see predictors_op argument)

  • time_optimize_ssr (Iterable) – The time range over which the loss function should be minimised

  • special_predictors (Iterable[SpecialPredictor_t], optional) –

    An iterable of special predictors, which are additional predictors that are aggregated over a custom time period using an indicated statistical operator. Each special predictor is a triple of:

    • column: the column of foo containing the predictor to use,

    • time-range: the time range to apply operator over - it should have the same type as time_predictors_prior or time_optimize_ssr

    • operator: the statistical operator to apply to column - it should have the same type as predictors_op

    by default None
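To make the roles of predictors_op, time_predictors_prior and the special-predictor triples more concrete, here is a minimal pandas sketch of the kind of per-unit aggregation the descriptions above imply. It is not the library's internal implementation; the helper aggregate and the toy data are hypothetical.

```python
import pandas as pd

# Toy panel; names and values are invented for illustration.
foo = pd.DataFrame(
    {
        "region": ["A", "A", "A", "B", "B", "B"],
        "year": [2000, 2001, 2002, 2000, 2001, 2002],
        "gdp": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "invest": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    }
)


def aggregate(foo, columns, time_range, op, unit_variable, time_variable):
    """Hypothetical helper: apply `op` to `columns` for each unit,
    using only the rows whose time value lies in `time_range`."""
    in_range = foo[time_variable].isin(time_range)
    return foo.loc[in_range].groupby(unit_variable)[columns].agg(op)


# Ordinary predictors share one operator (predictors_op) and one
# time range (time_predictors_prior).
ordinary = aggregate(foo, ["invest"], range(2000, 2002), "mean", "region", "year")

# A special predictor carries its own (column, time-range, operator) triple.
column, time_range, op = ("gdp", [2001, 2002], "max")
special = aggregate(foo, [column], time_range, op, "region", "year")

print(ordinary)  # invest: A -> 15.0, B -> 45.0
print(special)   # gdp:    A -> 3.0,  B -> 6.0
```

The point of the sketch is that a special predictor differs from an ordinary one only in carrying its own time range and operator rather than sharing time_predictors_prior and predictors_op.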

Raises:
  • TypeError – if foo is not of type pandas.DataFrame

  • ValueError – if any of the predictors is not a column of foo

  • ValueError – if predictors_op is not one of “mean”, “std”, “median”, “sum”, “count”, “max”, “min” or “var”.

  • ValueError – if dependent is not a column of foo

  • ValueError – if unit_variable is not a column of foo

  • ValueError – if time_variable is not a column of foo

  • ValueError – if treatment_identifier is not present in foo[unit_variable]

  • TypeError – if controls_identifier is not of type Iterable

  • ValueError – if treatment_identifier is in the list of controls

  • ValueError – if any of the controls is not in foo[unit_variable]

  • ValueError – if any element of special_predictors is not an Iterable of length 3

  • ValueError – if a predictor in an element of special_predictors is not a column of foo

  • ValueError – if one of the operators in an element of special_predictors is not one of “mean”, “std”, “median”, “sum”, “count”, “max”, “min” or “var”.
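The conditions listed above can be checked before constructing the object, which may help when debugging a study setup. The following is a hedged sketch of such a pre-flight check covering a subset of those conditions; check_study is a hypothetical helper, not part of pysyncon, its error messages are invented, and it assumes treatment_identifier is a single label.

```python
import pandas as pd

ALLOWED_OPS = {"mean", "std", "median", "sum", "count", "max", "min", "var"}


def check_study(foo, predictors, predictors_op, dependent, unit_variable,
                time_variable, treatment_identifier, controls_identifier):
    """Hypothetical pre-flight check mirroring some of the conditions above.
    Assumes a single treated unit label."""
    if not isinstance(foo, pd.DataFrame):
        raise TypeError("foo must be a pandas.DataFrame")
    for column in [*predictors, dependent, unit_variable, time_variable]:
        if column not in foo.columns:
            raise ValueError(f"{column!r} is not a column of foo")
    if predictors_op not in ALLOWED_OPS:
        raise ValueError(f"predictors_op must be one of {sorted(ALLOWED_OPS)}")
    units = set(foo[unit_variable])
    if treatment_identifier not in units:
        raise ValueError("treatment_identifier is not among the unit labels")
    if treatment_identifier in controls_identifier:
        raise ValueError("treatment_identifier must not be one of the controls")
    missing = [c for c in controls_identifier if c not in units]
    if missing:
        raise ValueError(f"controls not among the unit labels: {missing}")


# Toy data, invented for illustration.
foo = pd.DataFrame(
    {"region": ["A", "B", "C"], "year": [2000, 2000, 2000], "gdp": [1.0, 2.0, 3.0]}
)

# A consistent setup passes silently.
check_study(foo, ["gdp"], "mean", "gdp", "region", "year", "A", ["B", "C"])

# Listing the treated unit among the controls is rejected.
try:
    check_study(foo, ["gdp"], "mean", "gdp", "region", "year", "A", ["A", "B"])
except ValueError as err:
    print(f"rejected: {err}")
```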