SDTL Cut Functions

November 13, 2020

George Alter

Stata, R, and Python include a function called “cut” that is used to convert a numerical variable into a set of categories. In each of these languages the “cut” function operates in two different modes, and there are actually three different modes. For this reason, SDTL has three different cut functions. The Python Pandas qcut function provides one mode overlapping with Stata as well as an additional mode.

cut_list( ) – uses a list of break-points to divide cases into groups (Stata egen, R, Python)

cut_range( ) – divides rows into k groups by dividing the range of the reference variable into groups of equal widths (R, Python)

cut_freq ( ) – divides rows into k groups with equal numbers of rows in each (Stata egen, Python qcut)

cut_quant( ) – divides rows into groups based on a list of quantile boundaries (Python qcut)

cut_list( )

isRequired

defaultValue

EXP1

VariableSymbolExpression for variable used in assigning rows to groups

true

EXP2

ValueListExpression – list of break-points

true

EXP3

“Right” if intervals are closed on the right; “Left” if intervals are closed on the left

false

“Right”

EXP4

Boolean “True” if EXP3=”Right” and lowest group is inclusive on the left. See R example

false

EXP5

Type of coding:

“Left” – Minimum value of group

“Int_code” – Integer codes are assigned to each group

Note: Labels assigned to groups are defined in a separate SDTL SetValueLabels command

false

“Left”

Stata: egen cut1 = cut(varX), at( -999, 3, 9, 999)

cut_list( )

EXP1

varX

EXP2

-999, 3, 9, 999

EXP3

“Left”

EXP4

Null

EXP5

“Left”

Stata: egen cut1 = cut(varX), at( -999, 3, 9, 999) icodes

cut_list( )

EXP1

varX

EXP2

-999, 3, 9, 999

EXP3

“Left”

EXP4

Null

EXP5

“Int_code”

R: cut(cdata$varX, c( -999, 3, 9, 999), include_lowest=TRUE )

Note: The new variable in R is a factor except when “labels=FALSE” which results in an integer vector of level codes. The resulting variable type should be declared with an SDTL SetDataType command.

R assigns the range of each group as a value label, which should be assigned in a SetValueLabels command.

cut_list( )

EXP1

varX

EXP2

-999, 3, 9, 999

EXP3

“Right”

EXP4

“True”

EXP5

“Int_code”

{“$type”: “SetValueLabels”,

“command”: “setValueLabels”,

“variables”: [

{“$type”: “VariableSymbolExpression”,

“variableName”: “cut1” }

],

“labels”: [

{“value”: “1”, “label”: “(-999,3]”},

{“value”: “2”, “label”: “(3.9]”},

{“value”: “3”, “label”: “(9,999]”}

]

},

R: cut(cdata$varX, c( -999, 3, 9, 999), right=FALSE )

Note: The new variable in R is a factor except when “labels=FALSE” which results in an integer vector of level codes. The resulting variable type should be declared with an SDTL SetDataType command.

cut_list( )

EXP1

varX

EXP2

-999, 3, 9, 999

EXP3

“Left”

EXP4

null

EXP5

“Int_code”

{“$type”: “SetValueLabels”,

“command”: “setValueLabels”,

“variables”: [

{“$type”: “VariableSymbolExpression”,

“variableName”: “cut1” }

],

“labels”: [

{“value”: “1”, “label”: “[-999,3)”},

{“value”: “2”, “label”: “[3.9)”},

{“value”: “3”, “label”: “[9,999)”}

]

},

Python (pandas): pd.cut(cdata.varX, [-999, 3, 9, 999], include_lowest=True)

Note: Pandas assigns the range of each group as a value label. Use the SDTL SetValueLabels command for value labels.

cut_list( )

EXP1

varX

EXP2

-999, 3, 9, 999

EXP3

“Right”

EXP4

“True”

EXP5

“Int_code”

{“$type”: “SetValueLabels”,

“command”: “setValueLabels”,

“variables”: [

{“$type”: “VariableSymbolExpression”,

“variableName”: “cut1d” }

],

“labels”: [

{“value”: “0”, “label”: “(-999.001, 3.0]”},

{“value”: “1”, “label”: “(3.0, 9.0]”},

{“value”: “2”, “label”: “(9.0, 999.0]”}

]

},

Python (pandas): pd.cut(cdata.varX, [-999, 3, 9, 999], right=False, labels=[‘low’, ‘med’, ‘hi’] )

Note: Use the SDTL SetValueLabels command for value labels.

cut_list( )

EXP1

varX

EXP2

-999, 3, 9, 999

EXP3

“Left”

EXP4

null

EXP5

“Int_code”

{“$type”: “SetValueLabels”,

“command”: “setValueLabels”,

“variables”: [

{“$type”: “VariableSymbolExpression”,

“variableName”: “cut1b” }

],

“labels”: [

{“value”: “0”, “label”: “low”},

{“value”: “1”, “label”: “med”},

{“value”: “2”, “label”: “hi”}

]

},

cut_range( )

isRequired

defaultValue

EXP1

VariableSymbolExpression for variable used in assigning rows to groups

true

EXP2

Number of groups to create by dividing the range of EXP1 into equal segments

true

EXP3

“Right” if intervals are closed on the right; “Left” if intervals are closed on the left

false

“Right”

EXP4

Boolean “True” if EXP3=”Right” and lowest group is inclusive on the left

false

False

R: cut(cdata$varX, 3, include.lowest = TRUE, dig.lab=4)

Note: Output of the cut function is a factor.

Default value labels are boundaries of the group, e.g. “(3.83, 6.67)”, which should be set with the SetValueLabels command if the values are known.

“dig.lab” sets the number of decimal places shown in the group label. Since “dig.lab” affects labels not data, it affects the SetValueLabels command when labels are known.

cut_range( )

EXP1

varX

EXP2

3

EXP3

“Right”

EXP4

True

R: cut(cdata$varX, 3 , right=FALSE, labels=c(‘low’, ‘med’, ‘hi’))

cut_range( )

EXP1

varX

EXP2

3

EXP3

“Left”

EXP4

null

{“$type”: “SetValueLabels”,

“command”: “setValueLabels”,

“variables”: [

{“$type”: “VariableSymbolExpression”,

“variableName”: “cut2” }

],

“labels”: [

{“value”: “1”, “label”: “low”},

{“value”: “2”, “label”: “med”},

{“value”: “3”, “label”: “hi”}

]

},

Python (pandas): pd.cut(cdata.varX, 3, precision=4)

Note:

Default value labels are boundaries of the group, e.g. “(0.992, 3.833]”, which should be set with the SetValueLabels command if they are known.

“precision” sets the number of decimal places shown in the group label. Since “precision” affects labels not data, it affects the SetValueLabels command when labels are known.

cut_range( )

EXP1

varX

EXP2

3

EXP3

“Right”

EXP4

null

Python (pandas): pd.cut(cdata.varX, 3, labels=False)

Note: “labels=False” assigns only integer indicators of the bins, instead of labels.

cut_range( )

EXP1

varX

EXP2

3

EXP3

“Right”

EXP4

null

cut_freq ( )

isRequired

defaultValue

EXP1

VariableSymbolExpression for variable used in assigning rows to groups

true

EXP2

Number of groups with equal numbers of rows in each group. Rows are assigned to groups by sorting on EXP1

true

EXP3

“Right” if intervals are closed on the right; “Left” if intervals are closed on the left

false

“Right”

EXP4

Boolean “True” if EXP3=”Right” and lowest group is inclusive on the left

false

False

Stata: egen cut2 = cut(varX), group(3)

cut_freq ( )

EXP1

varX

EXP2

3

EXP3

“Left”

EXP4

null

Stata: egen cut2a = cut(varX), group(3) label

Note: “label” creates value labels with the left-hand ends of the groups, such as “3.5-”. Labels should be set with the SetValueLabels command.

Since the boundaries of the groups depend on the data, a SetValueLabels is not possible.

cut_freq ( )

EXP1

varX

EXP2

3

EXP3

“Left”

EXP4

null

Python (pandas): pd.qcut(cdata.varX, 3)

Note: The values of cut points depend upon the data.

cut_freq ( )

EXP1

varX

EXP2

3

EXP3

“Right”

EXP4

null

cut_quant( )

isRequired

defaultValue

EXP1

VariableSymbolExpression for variable used in assigning rows to groups

true

EXP2

ValueListExpression with boundaries of groups defined by quantiles, e.g. [0, .25, .75, 1]

true

EXP3

“Right” if intervals are closed on the right; “Left” if intervals are closed on the left

false

“Right”

Python (pandas): pd.qcut(cdata.varX, 3)

cut_quant( )

EXP1

varX

EXP2

[0, .4, .8, 1]

EXP3

“Right”

References

Stata: https://www.stata.com/help.cgi?egen

R: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cut

Python:

pandas.cut https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

pandas.qcut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html