SDTL Cut Functions

November 13, 2020

George Alter

Stata, R, and Python each include a function called “cut” that converts a numerical variable into a set of categories. In each of these languages the “cut” function operates in two modes, but the modes are not the same in every language: across the three languages there are three distinct modes, and SDTL has a separate cut function for each. The Python Pandas qcut function provides one mode that overlaps with Stata as well as one additional mode, which SDTL covers with a fourth function.

cut_list( ) – uses a list of break-points to divide cases into groups (Stata egen, R, Python)

cut_range( ) – divides rows into k groups by dividing the range of the reference variable into groups of equal widths (R, Python)

cut_freq ( ) – divides rows into k groups with equal numbers of rows in each (Stata egen, Python qcut)

cut_quant( ) – divides rows into groups based on a list of quantile boundaries (Python qcut)

cut_list( )		isRequired	defaultValue
EXP1	VariableSymbolExpression for variable used in assigning rows to groups	true


EXP2	ValueListExpression – list of break-points	true

EXP3	“Right” if intervals are closed on the right; “Left” if intervals are closed on the left	false	“Right”


EXP4	Boolean “True” if EXP3=“Right” and lowest group is inclusive on the left. See R example	false


EXP5	Type of coding: “Left” – minimum value of the group; “Int_code” – integer codes are assigned to each group. Note: Labels assigned to groups are defined in a separate SDTL SetValueLabels command	false	“Left”

Stata: egen cut1 = cut(varX), at( -999, 3, 9, 999)
cut_list( )	EXP1	varX
	EXP2	-999, 3, 9, 999
	EXP3	“Left”
	EXP4	Null
	EXP5	“Left”

Stata: egen cut1 = cut(varX), at( -999, 3, 9, 999) icodes
cut_list( )	EXP1	varX
	EXP2	-999, 3, 9, 999
	EXP3	“Left”
	EXP4	Null
	EXP5	“Int_code”

R: cut(cdata$varX, c( -999, 3, 9, 999), include.lowest=TRUE ) Note: The new variable in R is a factor except when “labels=FALSE”, which results in an integer vector of level codes. The resulting variable type should be declared with an SDTL SetDataType command. R assigns the range of each group as a value label, which should be assigned in a SetValueLabels command.
cut_list( )	EXP1	varX
	EXP2	-999, 3, 9, 999
	EXP3	“Right”
	EXP4	“True”
	EXP5	“Int_code”
{"$type": "SetValueLabels", "command": "setValueLabels", "variables": [ {"$type": "VariableSymbolExpression", "variableName": "cut1" } ], "labels": [ {"value": "1", "label": "[-999,3]"}, {"value": "2", "label": "(3,9]"}, {"value": "3", "label": "(9,999]"} ] },

R: cut(cdata$varX, c( -999, 3, 9, 999), right=FALSE ) Note: The new variable in R is a factor except when “labels=FALSE” which results in an integer vector of level codes. The resulting variable type should be declared with an SDTL SetDataType command.
cut_list( )	EXP1	varX
	EXP2	-999, 3, 9, 999
	EXP3	“Left”
	EXP4	null
	EXP5	“Int_code”
{"$type": "SetValueLabels", "command": "setValueLabels", "variables": [ {"$type": "VariableSymbolExpression", "variableName": "cut1" } ], "labels": [ {"value": "1", "label": "[-999,3)"}, {"value": "2", "label": "[3,9)"}, {"value": "3", "label": "[9,999)"} ] },
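
Because the interval labels follow mechanically from the break-points, a SetValueLabels command like the one above could be generated rather than typed. The sketch below is only an illustration under stated assumptions: interval_labels and set_value_labels are hypothetical helpers (not part of SDTL or R), the label text imitates R's default format for the simple case without include.lowest, and first_code=1 assumes R's 1-based level codes.

```python
import json


def interval_labels(breaks, closed="Right"):
    """Build one label per group from a sorted list of break-points.

    Hypothetical helper: imitates R-style labels such as "(3,9]" or "[3,9)".
    """
    labels = []
    for left, right in zip(breaks, breaks[1:]):
        if closed == "Right":
            labels.append(f"({left},{right}]")
        else:
            labels.append(f"[{left},{right})")
    return labels


def set_value_labels(variable_name, labels, first_code=1):
    """Assemble a SetValueLabels command shaped like the JSON shown above.

    first_code is an assumption: 1 for R-style codes, 0 for pandas-style codes.
    """
    return {
        "$type": "SetValueLabels",
        "command": "setValueLabels",
        "variables": [
            {"$type": "VariableSymbolExpression", "variableName": variable_name}
        ],
        "labels": [
            {"value": str(first_code + i), "label": label}
            for i, label in enumerate(labels)
        ],
    }


# Left-closed groups, as in the R example with right=FALSE.
command = set_value_labels(
    "cut1", interval_labels([-999, 3, 9, 999], closed="Left")
)
print(json.dumps(command, indent=2))
```

Passing first_code=0 would give the 0-based values used in the pandas examples.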

Python (pandas): pd.cut(cdata.varX, [-999, 3, 9, 999], include_lowest=True) Note: Pandas assigns the range of each group as a value label. Use the SDTL SetValueLabels command for value labels.
cut_list( )	EXP1	varX
	EXP2	-999, 3, 9, 999
	EXP3	“Right”
	EXP4	“True”
	EXP5	“Int_code”
{"$type": "SetValueLabels", "command": "setValueLabels", "variables": [ {"$type": "VariableSymbolExpression", "variableName": "cut1d" } ], "labels": [ {"value": "0", "label": "(-999.001, 3.0]"}, {"value": "1", "label": "(3.0, 9.0]"}, {"value": "2", "label": "(9.0, 999.0]"} ] },

Python (pandas): pd.cut(cdata.varX, [-999, 3, 9, 999], right=False, labels=['low', 'med', 'hi'] ) Note: Use the SDTL SetValueLabels command for value labels.
cut_list( )	EXP1	varX
	EXP2	-999, 3, 9, 999
	EXP3	“Left”
	EXP4	null
	EXP5	“Int_code”
{"$type": "SetValueLabels", "command": "setValueLabels", "variables": [ {"$type": "VariableSymbolExpression", "variableName": "cut1b" } ], "labels": [ {"value": "0", "label": "low"}, {"value": "1", "label": "med"}, {"value": "2", "label": "hi"} ] },
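
The way EXP3, EXP4, and EXP5 interact in cut_list( ) can be illustrated with a minimal pure-Python sketch. It is not part of SDTL or of any of the packages above: the function name, the keyword names, and the first_code argument are hypothetical, chosen only to mirror the parameter table. Returning None for values that fall outside the break-points is also an assumption about how missing values are represented.

```python
from bisect import bisect_left, bisect_right


def cut_list(values, breaks, closed="Right", include_lowest=False,
             coding="Int_code", first_code=0):
    """Assign each value to a group defined by sorted break-points.

    closed         -- "Right" for (a, b] groups, "Left" for [a, b) groups (EXP3)
    include_lowest -- with closed="Right", also accept the lowest break (EXP4)
    coding         -- "Left" returns the group's lower bound,
                      "Int_code" returns an integer code (EXP5)
    first_code     -- hypothetical: 0 for Stata/pandas-style codes, 1 for R
    """
    n_groups = len(breaks) - 1
    result = []
    for x in values:
        if x is None:
            result.append(None)
            continue
        if closed == "Right":
            # Group i covers (breaks[i], breaks[i + 1]].
            idx = bisect_left(breaks, x) - 1
            if include_lowest and x == breaks[0]:
                idx = 0
        else:
            # Group i covers [breaks[i], breaks[i + 1]).
            idx = bisect_right(breaks, x) - 1
        if idx < 0 or idx >= n_groups:
            result.append(None)  # outside every group
        elif coding == "Left":
            result.append(breaks[idx])
        else:
            result.append(idx + first_code)
    return result


breaks = [-999, 3, 9, 999]
sample = [-999, 0, 3, 9, 999]

# Left-closed groups coded by lower bound, as in the first Stata example.
print(cut_list(sample, breaks, closed="Left", coding="Left"))
# -> [-999, -999, 3, 9, None]

# Right-closed groups with the lowest break included and 1-based codes.
print(cut_list(sample, breaks, include_lowest=True, first_code=1))
# -> [1, 1, 1, 2, 3]
```

The sample shows why EXP4 matters: with right-closed groups and no include_lowest, the value -999 would not belong to any group.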

cut_range( )		isRequired	defaultValue
EXP1	VariableSymbolExpression for variable used in assigning rows to groups	true
EXP2	Number of groups to create by dividing the range of EXP1 into equal segments	true
EXP3	“Right” if intervals are closed on the right; “Left” if intervals are closed on the left	false	“Right”
EXP4	Boolean “True” if EXP3=“Right” and lowest group is inclusive on the left	false	False

R: cut(cdata$varX, 3, include.lowest = TRUE, dig.lab=4) Note: Output of the cut function is a factor. Default value labels are the boundaries of each group, e.g. “(3.83,6.67]”, which should be set with the SetValueLabels command if the values are known. “dig.lab” sets the number of digits used when formatting the boundaries in the group labels. Since “dig.lab” affects labels, not data, it matters only for the SetValueLabels command when labels are known.
cut_range( )	EXP1	varX
	EXP2	3
	EXP3	“Right”
	EXP4	True

R: cut(cdata$varX, 3, right=FALSE, labels=c('low', 'med', 'hi'))
cut_range( )	EXP1	varX
	EXP2	3
	EXP3	“Left”
	EXP4	null
{"$type": "SetValueLabels", "command": "setValueLabels", "variables": [ {"$type": "VariableSymbolExpression", "variableName": "cut2" } ], "labels": [ {"value": "1", "label": "low"}, {"value": "2", "label": "med"}, {"value": "3", "label": "hi"} ] },

Python (pandas): pd.cut(cdata.varX, 3, precision=4) Note: Default value labels are boundaries of the group, e.g. “(0.992, 3.833]”, which should be set with the SetValueLabels command if they are known. “precision” sets the number of decimal places shown in the group label. Since “precision” affects labels not data, it affects the SetValueLabels command when labels are known.
cut_range( )	EXP1	varX
	EXP2	3
	EXP3	“Right”
	EXP4	null

Python (pandas): pd.cut(cdata.varX, 3, labels=False) Note: “labels=False” assigns only integer indicators of the bins, instead of labels.
cut_range( )	EXP1	varX
	EXP2	3
	EXP3	“Right”
	EXP4	null
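
The equal-width division behind cut_range( ) can be sketched in a few lines of plain Python. This is an illustration only, not the R or pandas implementation: the function name and return shape are hypothetical, codes are 0-based by assumption, and the small widening of one outer edge (0.1% of the range) is a simplification of how those packages keep the extreme value inside a group.

```python
def cut_range(values, k, closed="Right"):
    """Split the range of `values` into k equal-width groups.

    Returns (edges, codes), where codes are 0-based group numbers and
    None marks a missing input value.
    """
    present = [x for x in values if x is not None]
    lo, hi = min(present), max(present)
    if hi == lo:
        raise ValueError("cannot split a range of zero width")
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k)] + [hi]

    # Assumption: widen one outer edge slightly so the extreme value
    # that would otherwise sit on an open boundary still gets a group.
    pad = (hi - lo) * 0.001
    if closed == "Right":
        edges[0] -= pad
    else:
        edges[-1] += pad

    codes = []
    for x in values:
        if x is None:
            codes.append(None)
            continue
        for i in range(k):
            left, right = edges[i], edges[i + 1]
            if closed == "Right":
                inside = left < x <= right
            else:
                inside = left <= x < right
            if inside:
                codes.append(i)
                break
        else:
            codes.append(None)
    return edges, codes


edges, codes = cut_range(list(range(1, 11)), 3)
print(codes)  # -> [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]

left_edges, left_codes = cut_range(list(range(1, 11)), 3, closed="Left")
print(left_codes)  # -> [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

The two runs differ only in where the boundary values 4 and 7 land, which is the effect of EXP3. Note that equal widths do not imply equal group sizes.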

cut_freq ( )		isRequired	defaultValue
EXP1	VariableSymbolExpression for variable used in assigning rows to groups	true
EXP2	Number of groups with equal numbers of rows in each group. Rows are assigned to groups by sorting on EXP1	true
EXP3	“Right” if intervals are closed on the right; “Left” if intervals are closed on the left	false	“Right”
EXP4	Boolean “True” if EXP3=“Right” and lowest group is inclusive on the left	false	False

Stata: egen cut2 = cut(varX), group(3)
cut_freq ( )	EXP1	varX
	EXP2	3
	EXP3	“Left”
	EXP4	null

Stata: egen cut2a = cut(varX), group(3) label Note: “label” creates value labels with the left-hand ends of the groups, such as “3.5-”. Labels would normally be set with the SetValueLabels command, but since the boundaries of the groups depend on the data, a SetValueLabels command is not possible unless those boundaries are known.
cut_freq ( )	EXP1	varX
	EXP2	3
	EXP3	“Left”
	EXP4	null

Python (pandas): pd.qcut(cdata.varX, 3)
Note: The values of cut points depend upon the data.
cut_freq ( )	EXP1	varX
	EXP2	3
	EXP3	“Right”
	EXP4	null
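
To make the idea of equal-frequency groups concrete, the sketch below derives cut points from the data and then counts the rows in each group. It is a simplified illustration, not the Stata or pandas algorithm: the function names are hypothetical, boundaries are placed by linear interpolation between sorted values (an assumption; the packages may place them differently, especially when there are ties), groups are right-closed, and codes are 0-based.

```python
from collections import Counter


def quantile(ordered, q):
    """Linear-interpolated quantile of an already sorted list (assumption)."""
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def cut_freq(values, k):
    """Divide rows into k groups holding roughly equal numbers of rows.

    Returns (edges, codes); codes are 0-based and follow the input order.
    """
    ordered = sorted(values)
    edges = [quantile(ordered, i / k) for i in range(k + 1)]
    codes = []
    for x in values:
        code = 0  # the minimum value is placed in the first group
        for i in range(1, k):
            if x > edges[i]:  # right-closed: a value equal to an edge stays below it
                code = i
        codes.append(code)
    return edges, codes


rows = [7, 1, 4, 9, 2, 6, 3, 8, 5]
edges, codes = cut_freq(rows, 3)
print(codes)           # -> [2, 0, 1, 2, 0, 1, 0, 2, 1]
print(Counter(codes))  # three rows in each of the three groups
```

The interior cut points here fall near 3.67 and 6.33, values that appear nowhere in the command itself, which is why the group boundaries cannot be written into an SDTL script in advance.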

cut_quant( )		isRequired	defaultValue
EXP1	VariableSymbolExpression for variable used in assigning rows to groups	true
EXP2	ValueListExpression with boundaries of groups defined by quantiles, e.g. [0, .25, .75, 1]	true
EXP3	“Right” if intervals are closed on the right; “Left” if intervals are closed on the left	false	“Right”

Python (pandas): pd.qcut(cdata.varX, [0, .4, .8, 1])
cut_quant( )	EXP1	varX
	EXP2	[0, .4, .8, 1]
	EXP3	“Right”
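
A list of quantile boundaries such as [0, .4, .8, 1] translates into data-dependent cut points, and rows are then assigned as in cut_list( ). The sketch below shows that two-step process in plain Python. It is an illustration under assumptions rather than the pandas implementation: the function names are hypothetical, quantiles are computed by linear interpolation between sorted values, groups are right-closed with the minimum included in the first group, and codes are 0-based.

```python
from bisect import bisect_left


def quantile(ordered, q):
    """Linear-interpolated quantile of an already sorted list (assumption)."""
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def cut_quant(values, quantiles):
    """Group rows using boundaries given as quantiles of the data.

    Returns (edges, codes); codes are 0-based and follow the input order.
    """
    ordered = sorted(values)
    # Step 1: turn each requested quantile into a cut point in data units.
    edges = [quantile(ordered, q) for q in quantiles]
    # Step 2: assign rows to right-closed groups (edges[i], edges[i + 1]].
    codes = []
    for x in values:
        idx = bisect_left(edges, x) - 1
        codes.append(max(idx, 0))  # keep the minimum in the first group
    return edges, codes


edges, codes = cut_quant(list(range(1, 11)), [0, .4, .8, 1])
print(codes)  # -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

With ten rows the groups hold four, four, and two rows, matching the 40%, 40%, and 20% shares implied by the quantile list. Evenly spaced quantiles such as [0, 1/3, 2/3, 1] would reproduce the equal-frequency mode.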

References

Stata: https://www.stata.com/help.cgi?egen

R: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cut

Python:

pandas.cut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

pandas.qcut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html