SDTL Cut Functions

November 13, 2020

George Alter

Stata, R, and Python (pandas) each include a function called “cut” that converts a numerical variable into a set of categories. In each of these languages the “cut” function operates in two different modes, but the languages do not share the same two modes, so there are three different modes in all. The Python pandas qcut function provides one mode overlapping with Stata as well as an additional mode. For this reason, SDTL has four different cut functions:

cut_list( ) – uses a list of break-points to divide cases into groups (Stata egen, R, Python)

cut_range( ) – divides rows into k groups by splitting the range of the reference variable into intervals of equal width (R, Python)

cut_freq ( ) – divides rows into k groups with equal numbers of rows in each (Stata egen, Python qcut)

cut_quant( ) – divides rows into groups based on a list of quantile boundaries (Python qcut)

cut_list( )   isRequired defaultValue
EXP1 VariableSymbolExpression for variable used in assigning rows to groups true  
EXP2 ValueListExpression – list of break-points true  
EXP3 “Right” if intervals are closed on the right; “Left” if intervals are closed on the left false “Right”
EXP4 Boolean “True” if EXP3=“Right” and lowest group is inclusive on the left. See R example false  
EXP5 Type of coding: “Left” (minimum value of group) or “Int_code” (integer codes are assigned to each group). Note: Labels assigned to groups are defined in a separate SDTL SetValueLabels command. false “Left”
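
The parameters in the table above can be read as the following pure-Python sketch. It is an illustration only, not part of SDTL: the function name and keyword names are hypothetical stand-ins for EXP1 to EXP5, integer codes are assumed to start at 1 (as in the R example; the pandas examples number groups from 0), and values outside the break-points are assumed to become missing (None).

```python
from bisect import bisect_left, bisect_right

def cut_list(values, breaks, closed="Right", include_lowest=False, coding="Left"):
    """Assign each value to a group defined by sorted break-points.

    closed:         "Right" for (a, b] intervals, "Left" for [a, b)    (EXP3)
    include_lowest: with closed="Right", also accept v == breaks[0]    (EXP4)
    coding:         "Left" returns the lower bound of the group,
                    "Int_code" returns a 1-based integer code          (EXP5)
    """
    result = []
    for v in values:
        if closed == "Right":
            idx = bisect_left(breaks, v)       # count of breaks strictly below v
            if include_lowest and v == breaks[0]:
                idx = 1                        # lowest break joins the first group
        else:
            idx = bisect_right(breaks, v)      # count of breaks at or below v
        if idx < 1 or idx > len(breaks) - 1:
            result.append(None)                # assumption: outside all groups
        elif coding == "Int_code":
            result.append(idx)
        else:
            result.append(breaks[idx - 1])     # minimum value of the group
    return result

breaks = [-999, 3, 9, 999]
print(cut_list([2, 3, 9, 500], breaks, closed="Left"))
print(cut_list([-999, 3, 3.5], breaks, include_lowest=True, coding="Int_code"))
```

The sketch does not handle missing input values; a real implementation would need a rule for them.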
Stata: egen cut1 = cut(varX), at( -999, 3, 9, 999)    
cut_list( ) EXP1 varX
EXP2 -999, 3, 9, 999
EXP3 “Left”
EXP4 Null
EXP5 “Left”
Stata: egen cut1 = cut(varX), at( -999, 3, 9, 999) icodes    
cut_list( ) EXP1 varX
  EXP2 -999, 3, 9, 999
  EXP3 “Left”
  EXP4 Null
  EXP5 “Int_code”

R: cut(cdata$varX, c( -999, 3, 9, 999), include.lowest=TRUE )

Note: The new variable in R is a factor except when “labels=FALSE” which results in an integer vector of level codes. The resulting variable type should be declared with an SDTL SetDataType command.

R assigns the range of each group as a value label, which should be assigned in a SetValueLabels command.

   
cut_list( ) EXP1 varX
  EXP2 -999, 3, 9, 999
  EXP3 “Right”
  EXP4 “True”
  EXP5 “Int_code”

{"$type": "SetValueLabels",
 "command": "setValueLabels",
 "variables": [
   {"$type": "VariableSymbolExpression",
    "variableName": "cut1"}
 ],
 "labels": [
   {"value": "1", "label": "[-999,3]"},
   {"value": "2", "label": "(3,9]"},
   {"value": "3", "label": "(9,999]"}
 ]
},

   

R: cut(cdata$varX, c( -999, 3, 9, 999), right=FALSE )

Note: The new variable in R is a factor except when “labels=FALSE” which results in an integer vector of level codes. The resulting variable type should be declared with an SDTL SetDataType command.

   
cut_list( ) EXP1 varX
  EXP2 -999, 3, 9, 999
  EXP3 “Left”
  EXP4 null
  EXP5 “Int_code”

{"$type": "SetValueLabels",
 "command": "setValueLabels",
 "variables": [
   {"$type": "VariableSymbolExpression",
    "variableName": "cut1"}
 ],
 "labels": [
   {"value": "1", "label": "[-999,3)"},
   {"value": "2", "label": "[3,9)"},
   {"value": "3", "label": "[9,999)"}
 ]
},
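
When the break-points are known, the SetValueLabels command can be generated rather than typed by hand. The following is a minimal sketch, assuming R-style range labels; the JSON field names are copied from the fragment above, while the helper name and its arguments are hypothetical.

```python
import json

def value_labels_command(variable_name, breaks, closed="Right", first_code=1):
    """Build a SetValueLabels dictionary with one range label per group."""
    labels = []
    for i in range(len(breaks) - 1):
        lo, hi = breaks[i], breaks[i + 1]
        # "(a,b]" for right-closed groups, "[a,b)" for left-closed groups
        text = f"({lo},{hi}]" if closed == "Right" else f"[{lo},{hi})"
        labels.append({"value": str(first_code + i), "label": text})
    return {
        "$type": "SetValueLabels",
        "command": "setValueLabels",
        "variables": [
            {"$type": "VariableSymbolExpression", "variableName": variable_name}
        ],
        "labels": labels,
    }

command = value_labels_command("cut1", [-999, 3, 9, 999], closed="Left")
print(json.dumps(command, indent=2))
```

Passing first_code=0 would give the zero-based codes used in the pandas examples. The sketch does not reproduce the closed bracket that R prints on the first group when include.lowest=TRUE.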

   

Python (pandas): pd.cut(cdata.varX, [-999, 3, 9, 999], include_lowest=True)

Note: Pandas assigns the range of each group as a value label. Use the SDTL SetValueLabels command for value labels.

   
cut_list( ) EXP1 varX
  EXP2 -999, 3, 9, 999
  EXP3 “Right”
  EXP4 “True”
  EXP5 “Int_code”

{"$type": "SetValueLabels",
 "command": "setValueLabels",
 "variables": [
   {"$type": "VariableSymbolExpression",
    "variableName": "cut1d"}
 ],
 "labels": [
   {"value": "0", "label": "(-999.001, 3.0]"},
   {"value": "1", "label": "(3.0, 9.0]"},
   {"value": "2", "label": "(9.0, 999.0]"}
 ]
},

   

Python (pandas): pd.cut(cdata.varX, [-999, 3, 9, 999], right=False, labels=['low', 'med', 'hi'])

Note: Use the SDTL SetValueLabels command for value labels.

   
cut_list( ) EXP1 varX
  EXP2 -999, 3, 9, 999
  EXP3 “Left”
  EXP4 null
  EXP5 “Int_code”

{"$type": "SetValueLabels",
 "command": "setValueLabels",
 "variables": [
   {"$type": "VariableSymbolExpression",
    "variableName": "cut1b"}
 ],
 "labels": [
   {"value": "0", "label": "low"},
   {"value": "1", "label": "med"},
   {"value": "2", "label": "hi"}
 ]
},

   
cut_range( )   isRequired defaultValue
EXP1 VariableSymbolExpression for variable used in assigning rows to groups true  
EXP2 Number of groups to create by dividing the range of EXP1 into equal segments true  
EXP3 “Right” if intervals are closed on the right; “Left” if intervals are closed on the left false “Right”
EXP4 Boolean “True” if EXP3=“Right” and lowest group is inclusive on the left false False
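
As an illustration of the equal-width mode, the sketch below derives the break-points from the observed range and then assigns 1-based group codes. It is not a definitive implementation: the function name is hypothetical, EXP4 is not modeled, and the outermost edge is widened by 0.1% of the range (an assumption made here so that the extreme value is not dropped from an open interval end).

```python
from bisect import bisect_left, bisect_right

def cut_range(values, k, closed="Right"):
    """Split the range of values into k equal-width groups.

    Returns (breaks, codes), where codes are 1-based group numbers.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    breaks = [lo + i * width for i in range(k)] + [hi]
    pad = (hi - lo) * 0.001                # assumed padding, for illustration
    if closed == "Right":
        breaks[0] -= pad                   # keeps the minimum inside (a, b]
        codes = [bisect_left(breaks, v) for v in values]
    else:
        breaks[-1] += pad                  # keeps the maximum inside [a, b)
        codes = [bisect_right(breaks, v) for v in values]
    return breaks, codes

breaks, codes = cut_range([0, 1, 2, 3, 4, 5, 6], 3)
print(breaks)
print(codes)
```

Because the break-points come from the data, the same call on a different dataset yields different group boundaries. The sketch assumes the values are not all equal.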

R: cut(cdata$varX, 3, include.lowest = TRUE, dig.lab=4)

Note: Output of the cut function is a factor.

Default value labels are the boundaries of the group, e.g. “(3.83,6.67]”, which should be set with the SetValueLabels command if the values are known.

“dig.lab” sets the number of digits used to format the break values shown in the group label. Since “dig.lab” affects labels, not data, it affects only the SetValueLabels command when labels are known.

   
cut_range( ) EXP1 varX
  EXP2 3
  EXP3 “Right”
  EXP4 True
R: cut(cdata$varX, 3, right=FALSE, labels=c('low', 'med', 'hi'))
cut_range( ) EXP1 varX
  EXP2 3
  EXP3 “Left”
  EXP4 null

{"$type": "SetValueLabels",
 "command": "setValueLabels",
 "variables": [
   {"$type": "VariableSymbolExpression",
    "variableName": "cut2"}
 ],
 "labels": [
   {"value": "1", "label": "low"},
   {"value": "2", "label": "med"},
   {"value": "3", "label": "hi"}
 ]
},

   

Python (pandas): pd.cut(cdata.varX, 3, precision=4)

Note:

Default value labels are boundaries of the group, e.g. “(0.992, 3.833]”, which should be set with the SetValueLabels command if they are known.

“precision” sets the number of decimal places shown in the group label. Since “precision” affects labels not data, it affects the SetValueLabels command when labels are known.

   
cut_range( ) EXP1 varX
  EXP2 3
  EXP3 “Right”
  EXP4 null

Python (pandas): pd.cut(cdata.varX, 3, labels=False)

Note: “labels=False” assigns only integer indicators of the bins, instead of labels.

   
cut_range( ) EXP1 varX
  EXP2 3
  EXP3 “Right”
  EXP4 null
cut_freq ( )   isRequired defaultValue
EXP1 VariableSymbolExpression for variable used in assigning rows to groups true  
EXP2 Number of groups with equal numbers of rows in each group. Rows are assigned to groups by sorting on EXP1 true  
EXP3 “Right” if intervals are closed on the right; “Left” if intervals are closed on the left false “Right”
EXP4 Boolean “True” if EXP3=“Right” and lowest group is inclusive on the left false False
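
The idea of assigning rows to equal-sized groups by sorting on EXP1 can be sketched with ranks alone. This is a simplified illustration with a hypothetical function name: it ignores EXP3 and EXP4, and it splits tied values across groups by row order, whereas Stata and pandas derive cut points from the data and may therefore treat ties differently.

```python
def cut_freq(values, k):
    """Assign 1-based group codes so that each group holds about len(values)/k rows."""
    n = len(values)
    # Row positions ordered from the smallest value to the largest
    order = sorted(range(n), key=lambda i: values[i])
    codes = [None] * n
    for rank, row in enumerate(order):
        codes[row] = rank * k // n + 1     # ranks 0..n-1 mapped onto groups 1..k
    return codes

print(cut_freq([50, 10, 40, 20, 60, 30], 3))
```

When the number of rows is not a multiple of k, the group sizes differ by at most one row.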
Stata: egen cut2 = cut(varX), group(3)    
cut_freq ( ) EXP1 varX
EXP2 3
EXP3 “Left”
EXP4 null

Stata: egen cut2a = cut(varX), group(3) label

Note: “label” creates value labels with the left-hand ends of the groups, such as “3.5-”. Value labels are normally set with the SetValueLabels command, but because the boundaries of these groups depend on the data, a SetValueLabels command cannot be written unless the boundaries are known.

   
cut_freq ( ) EXP1 varX
  EXP2 3
  EXP3 “Left”
  EXP4 null
Python (pandas): pd.qcut(cdata.varX, 3)    
Note: The values of cut points depend upon the data.    
cut_freq ( ) EXP1 varX
EXP2 3
EXP3 “Right”
EXP4 null
cut_quant( )   isRequired defaultValue
EXP1 VariableSymbolExpression for variable used in assigning rows to groups true  
EXP2 ValueListExpression with boundaries of groups defined by quantiles, e.g. [0, .25, .75, 1] true  
EXP3 “Right” if intervals are closed on the right; “Left” if intervals are closed on the left false “Right”
Python (pandas): pd.qcut(cdata.varX, [0, .4, .8, 1])
cut_quant( ) EXP1 varX
EXP2 [0, .4, .8, 1]
EXP3 “Right”
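
To show how a list of quantiles turns into groups, the sketch below first converts each quantile into a data value and then assigns right-closed groups. It rests on several assumptions: the function names are hypothetical, quantiles are computed by linear interpolation between sorted values, the quantile list runs from 0 to 1, and the minimum value is kept in the first group.

```python
from bisect import bisect_left

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] + frac * (sorted_vals[upper] - sorted_vals[lower])

def cut_quant(values, quantiles):
    """Return (breaks, codes) for right-closed groups bounded by the given quantiles."""
    ordered = sorted(values)
    breaks = [quantile(ordered, q) for q in quantiles]
    codes = []
    for v in values:
        idx = bisect_left(breaks, v)       # count of breaks strictly below v
        codes.append(max(idx, 1))          # the minimum equals breaks[0]; keep it in group 1
    return breaks, codes

breaks, codes = cut_quant(list(range(11)), [0, .4, .8, 1])
print(breaks)
print(codes)
```

With the quantile list [0, .4, .8, 1] the groups hold roughly 40%, 40%, and 20% of the rows, which is what distinguishes cut_quant from the equal-sized groups of cut_freq. Repeated break values, which can occur with heavily tied data, are not handled here.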