SDTL Cut Functions
November 13, 2020
George Alter
Stata, R, and Python include a function called “cut” that is used to convert a numerical variable into a set of categories. In each of these languages the “cut” function operates in two different modes, and there are actually three different modes. For this reason, SDTL has three different cut functions. The Python Pandas qcut function provides one mode overlapping with Stata as well as an additional mode.
cut_list( ) – uses a list of break-points to divide cases into groups (Stata egen, R, Python)
cut_range( ) – divides rows into k groups by dividing the range of the reference variable into groups of equal widths (R, Python)
cut_freq ( ) – divides rows into k groups with equal numbers of rows in each (Stata egen, Python qcut)
cut_quant( ) – divides rows into groups based on a list of quantile boundaries (Python qcut)
cut_list( ) |
isRequired |
defaultValue |
|
EXP1 |
VariableSymbolExpression for variable used in assigning rows to groups |
true |
|
EXP2 |
ValueListExpression – list of break-points |
true |
|
EXP3 |
“Right” if intervals are closed on the right; “Left” if intervals are closed on the left |
false |
“Right” |
EXP4 |
Boolean “True” if EXP3=”Right” and lowest group is inclusive on the left. See R example |
false |
|
EXP5 |
Type of coding: “Left” – Minimum value of group “Int_code” – Integer codes are assigned to each group Note: Labels assigned to groups are defined in a separate SDTL SetValueLabels command |
false |
“Left” |
Stata: egen cut1 = cut(varX), at( -999, 3, 9, 999) |
||
cut_list( ) |
EXP1 |
varX |
EXP2 |
-999, 3, 9, 999 |
|
EXP3 |
“Left” |
|
EXP4 |
Null |
|
EXP5 |
“Left” |
Stata: egen cut1 = cut(varX), at( -999, 3, 9, 999) icodes |
||
cut_list( ) |
EXP1 |
varX |
EXP2 |
-999, 3, 9, 999 |
|
EXP3 |
“Left” |
|
EXP4 |
Null |
|
EXP5 |
“Int_code” |
R: cut(cdata$varX, c( -999, 3, 9, 999), include_lowest=TRUE ) Note: The new variable in R is a factor except when “labels=FALSE” which results in an integer vector of level codes. The resulting variable type should be declared with an SDTL SetDataType command. R assigns the range of each group as a value label, which should be assigned in a SetValueLabels command. |
||
cut_list( ) |
EXP1 |
varX |
EXP2 |
-999, 3, 9, 999 |
|
EXP3 |
“Right” |
|
EXP4 |
“True” |
|
EXP5 |
“Int_code” |
|
{“$type”: “SetValueLabels”, “command”: “setValueLabels”, “variables”: [ {“$type”: “VariableSymbolExpression”, “variableName”: “cut1” } ], “labels”: [ {“value”: “1”, “label”: “(-999,3]”}, {“value”: “2”, “label”: “(3.9]”}, {“value”: “3”, “label”: “(9,999]”} ] }, |
R: cut(cdata$varX, c( -999, 3, 9, 999), right=FALSE ) Note: The new variable in R is a factor except when “labels=FALSE” which results in an integer vector of level codes. The resulting variable type should be declared with an SDTL SetDataType command. |
||
cut_list( ) |
EXP1 |
varX |
EXP2 |
-999, 3, 9, 999 |
|
EXP3 |
“Left” |
|
EXP4 |
null |
|
EXP5 |
“Int_code” |
|
{“$type”: “SetValueLabels”, “command”: “setValueLabels”, “variables”: [ {“$type”: “VariableSymbolExpression”, “variableName”: “cut1” } ], “labels”: [ {“value”: “1”, “label”: “[-999,3)”}, {“value”: “2”, “label”: “[3.9)”}, {“value”: “3”, “label”: “[9,999)”} ] }, |
Python (pandas): pd.cut(cdata.varX, [-999, 3, 9, 999], include_lowest=True) Note: Pandas assigns the range of each group as a value label. Use the SDTL SetValueLabels command for value labels. |
||
cut_list( ) |
EXP1 |
varX |
EXP2 |
-999, 3, 9, 999 |
|
EXP3 |
“Right” |
|
EXP4 |
“True” |
|
EXP5 |
“Int_code” |
|
{“$type”: “SetValueLabels”, “command”: “setValueLabels”, “variables”: [ {“$type”: “VariableSymbolExpression”, “variableName”: “cut1d” } ], “labels”: [ {“value”: “0”, “label”: “(-999.001, 3.0]”}, {“value”: “1”, “label”: “(3.0, 9.0]”}, {“value”: “2”, “label”: “(9.0, 999.0]”} ] }, |
Python (pandas): pd.cut(cdata.varX, [-999, 3, 9, 999], right=False, labels=[‘low’, ‘med’, ‘hi’] ) Note: Use the SDTL SetValueLabels command for value labels. |
||
cut_list( ) |
EXP1 |
varX |
EXP2 |
-999, 3, 9, 999 |
|
EXP3 |
“Left” |
|
EXP4 |
null |
|
EXP5 |
“Int_code” |
|
{“$type”: “SetValueLabels”, “command”: “setValueLabels”, “variables”: [ {“$type”: “VariableSymbolExpression”, “variableName”: “cut1b” } ], “labels”: [ {“value”: “0”, “label”: “low”}, {“value”: “1”, “label”: “med”}, {“value”: “2”, “label”: “hi”} ] }, |
cut_range( ) |
isRequired |
defaultValue |
|
EXP1 |
VariableSymbolExpression for variable used in assigning rows to groups |
true |
|
EXP2 |
Number of groups to create by dividing the range of EXP1 into equal segments |
true |
|
EXP3 |
“Right” if intervals are closed on the right; “Left” if intervals are closed on the left |
false |
“Right” |
EXP4 |
Boolean “True” if EXP3=”Right” and lowest group is inclusive on the left |
false |
False |
R: cut(cdata$varX, 3, include.lowest = TRUE, dig.lab=4) Note: Output of the cut function is a factor. Default value labels are boundaries of the group, e.g. “(3.83, 6.67)”, which should be set with the SetValueLabels command if the values are known. “dig.lab” sets the number of decimal places shown in the group label. Since “dig.lab” affects labels not data, it affects the SetValueLabels command when labels are known. |
||
cut_range( ) |
EXP1 |
varX |
EXP2 |
3 |
|
EXP3 |
“Right” |
|
EXP4 |
True |
R: cut(cdata$varX, 3 , right=FALSE, labels=c(‘low’, ‘med’, ‘hi’)) |
||
cut_range( ) |
EXP1 |
varX |
EXP2 |
3 |
|
EXP3 |
“Left” |
|
EXP4 |
null |
|
{“$type”: “SetValueLabels”, “command”: “setValueLabels”, “variables”: [ {“$type”: “VariableSymbolExpression”, “variableName”: “cut2” } ], “labels”: [ {“value”: “1”, “label”: “low”}, {“value”: “2”, “label”: “med”}, {“value”: “3”, “label”: “hi”} ] }, |
Python (pandas): pd.cut(cdata.varX, 3, precision=4) Note: Default value labels are boundaries of the group, e.g. “(0.992, 3.833]”, which should be set with the SetValueLabels command if they are known. “precision” sets the number of decimal places shown in the group label. Since “precision” affects labels not data, it affects the SetValueLabels command when labels are known. |
||
cut_range( ) |
EXP1 |
varX |
EXP2 |
3 |
|
EXP3 |
“Right” |
|
EXP4 |
null |
Python (pandas): pd.cut(cdata.varX, 3, labels=False) Note: “labels=False” assigns only integer indicators of the bins, instead of labels. |
||
cut_range( ) |
EXP1 |
varX |
EXP2 |
3 |
|
EXP3 |
“Right” |
|
EXP4 |
null |
cut_freq ( ) |
isRequired |
defaultValue |
|
EXP1 |
VariableSymbolExpression for variable used in assigning rows to groups |
true |
|
EXP2 |
Number of groups with equal numbers of rows in each group. Rows are assigned to groups by sorting on EXP1 |
true |
|
EXP3 |
“Right” if intervals are closed on the right; “Left” if intervals are closed on the left |
false |
“Right” |
EXP4 |
Boolean “True” if EXP3=”Right” and lowest group is inclusive on the left |
false |
False |
Stata: egen cut2 = cut(varX), group(3) |
||
cut_freq ( ) |
EXP1 |
varX |
EXP2 |
3 |
|
EXP3 |
“Left” |
|
EXP4 |
null |
Stata: egen cut2a = cut(varX), group(3) label Note: “label” creates value labels with the left-hand ends of the groups, such as “3.5-”. Labels should be set with the SetValueLabels command. Since the boundaries of the groups depend on the data, a SetValueLabels is not possible. |
||
cut_freq ( ) |
EXP1 |
varX |
EXP2 |
3 |
|
EXP3 |
“Left” |
|
EXP4 |
null |
Python (pandas): pd.qcut(cdata.varX, 3) |
||
Note: The values of cut points depend upon the data. |
||
cut_freq ( ) |
EXP1 |
varX |
EXP2 |
3 |
|
EXP3 |
“Right” |
|
EXP4 |
null |
cut_quant( ) |
isRequired |
defaultValue |
|
EXP1 |
VariableSymbolExpression for variable used in assigning rows to groups |
true |
|
EXP2 |
ValueListExpression with boundaries of groups defined by quantiles, e.g. [0, .25, .75, 1] |
true |
|
EXP3 |
“Right” if intervals are closed on the right; “Left” if intervals are closed on the left |
false |
“Right” |
Python (pandas): pd.qcut(cdata.varX, 3) |
||
cut_quant( ) |
EXP1 |
varX |
EXP2 |
[0, .4, .8, 1] |
|
EXP3 |
“Right” |
References
Stata: https://www.stata.com/help.cgi?egen
R: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cut
Python:
pandas.cut https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
pandas.qcut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html