Dataprep class
This class and its API are based on the similarly named function in the R Synth package.
The Dataprep class defines all the information necessary for the synthetic
control study. It takes as arguments a pandas.DataFrame foo containing
the panel data, a list of predictors, special predictors, the statistical operation to
apply to the predictors over the selected time frame, the dependent variable,
the column denoting the unit labels, the labels denoting the control units,
the label denoting the treated unit, the time period to carry out the optimisation
procedure over and the time period to apply the statistical operation to the
predictors. See below for further details about each individual argument, and also see
the examples folder of the repository to see how this class is set up in three real
research contexts.
The principal difference between the function signature here and the one in
the R Synth package is that, whereas that package has two arguments, unit.variable
and unit.names.variable, this package consolidates them into a single argument
unit_variable, since it is unnecessary to have both here.
- class pysyncon.Dataprep(foo: pd.DataFrame, predictors: Axes, predictors_op: PredictorsOp_t, dependent: Any, unit_variable: Any, time_variable: Any, treatment_identifier: Any | list | tuple, controls_identifier: list | tuple, time_predictors_prior: IsinArg_t, time_optimize_ssr: IsinArg_t, special_predictors: Iterable[SpecialPredictor_t] | None = None)
Helper class that takes in the panel data and all necessary information needed to describe the study setup. It is used to automatically generate the matrices needed for the optimisation methods, plots of the results etc.
- Parameters:
foo (pandas.DataFrame) – A pandas DataFrame containing the panel data where the columns are predictor/outcome variables and each row is a time-step for some unit
predictors (Axes) – The columns of foo to use as predictors
predictors_op ("mean" | "std" | "median" | "sum" | "count" | "max" | "min" | "var") – The statistical operation to use on the predictors - the time range that the operation is applied to is time_predictors_prior
dependent (Any) – The column of foo to use as the dependent variable
unit_variable (Any) – The column of foo that contains the unit labels
time_variable (Any) – The column of foo that contains the time period
treatment_identifier (Any) – The unit label that denotes the treated unit
controls_identifier (Iterable) – The unit labels denoting the control units
time_predictors_prior (Iterable) – The time range over which to apply the statistical operation to the predictors (see predictors_op argument)
time_optimize_ssr (Iterable) – The time range over which the loss function should be minimised
special_predictors (Iterable[SpecialPredictor_t], optional) – An iterable of special predictors, which are additional predictors that are aggregated over a custom time period using an indicated statistical operator. In particular, a special predictor consists of a triple of:
  - column: the column of foo containing the predictor to use,
  - time-range: the time range to apply operator over - it should have the same type as time_predictors_prior or time_optimize_ssr,
  - operator: the statistical operator to apply to column - it should have the same type as predictors_op.
  By default None.
- Raises:
TypeError – if foo is not of type pandas.DataFrame
ValueError – if any element of predictors is not a column of foo
ValueError – if predictors_op is not one of “mean”, “std”, “median”, “sum”, “count”, “max”, “min” or “var”
ValueError – if dependent is not a column of foo
ValueError – if unit_variable is not a column of foo
ValueError – if time_variable is not a column of foo
ValueError – if treatment_identifier is not present in foo[unit_variable]
TypeError – if controls_identifier is not of type Iterable
ValueError – if treatment_identifier is in the list of controls
ValueError – if any of the controls is not in foo[unit_variable]
ValueError – if any element of special_predictors is not an Iterable of length 3
ValueError – if a predictor in an element of special_predictors is not a column of foo
ValueError – if one of the operators in an element of special_predictors is not one of “mean”, “std”, “median”, “sum”, “count”, “max”, “min” or “var”
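To make the meaning of a special predictor triple and its documented error conditions more concrete, here is a small standard-library sketch. It is not the library's implementation: the function apply_special_predictor, the list-of-dicts data layout and the toy values are all hypothetical, and the mapping of operator names to functions from the statistics module is an assumption made for illustration.

```python
from statistics import mean, median, stdev, variance

# Operator names from the documentation, mapped to standard-library
# functions. This mapping is an assumption made for the sketch.
ALLOWED_OPS = {
    "mean": mean, "std": stdev, "median": median, "sum": sum,
    "count": len, "max": max, "min": min, "var": variance,
}


def apply_special_predictor(rows, unit_column, time_column, unit, special):
    """Validate a (column, time-range, operator) triple, then evaluate it for one unit.

    Hypothetical helper: `rows` is a list of dicts, one per (unit, time) pair.
    """
    special = tuple(special)
    if len(special) != 3:
        raise ValueError("a special predictor must have exactly 3 elements")
    column, time_range, operator = special
    if not rows or column not in rows[0]:
        raise ValueError(f"{column!r} is not a column of the data")
    if operator not in ALLOWED_OPS:
        raise ValueError(f"{operator!r} is not a supported operator")
    # Keep only this unit's values that fall inside the custom time range.
    values = [
        row[column]
        for row in rows
        if row[unit_column] == unit and row[time_column] in time_range
    ]
    return ALLOWED_OPS[operator](values)


# Hypothetical toy panel.
rows = [
    {"region": "A", "year": 2000, "gdp": 1.0},
    {"region": "A", "year": 2001, "gdp": 2.0},
    {"region": "A", "year": 2002, "gdp": 6.0},
    {"region": "B", "year": 2000, "gdp": 4.0},
    {"region": "B", "year": 2001, "gdp": 5.0},
]

# Mean of gdp for unit "A" over 2000 and 2001.
value = apply_special_predictor(
    rows, "region", "year", "A", ("gdp", range(2000, 2002), "mean")
)
print(value)  # 1.5
```

Passing a pair such as ("gdp", "mean"), an unknown column, or an operator name outside the documented list makes this sketch raise ValueError, mirroring the three special_predictors conditions listed above.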