Text categories encoding
Columns with textual (categorical) data, such as job descriptions, vehicle model names, or house features, sometimes hold key information about their rows. However, these data cannot be used directly in statistical analyses such as fitting a regression model or computing a correlation matrix, because they are not numeric.
TableTorch can detect the most common word combinations in the provided text and convert them into the following numeric forms:
- Binary columns: Columns holding either 1 or 0 depending on the presence of particular words. Note that these columns are not mutually exclusive, i.e. a single row can contain several ones rather than just one.
- Single category identifier column: A numeric identifier is assigned to each category and selected for each row depending on the presence of particular words. The most specific word combination is selected if there is more than one appropriate category.
- Phrase counting sheet: An information sheet containing all of the found word combinations in the input range with their respective counts.
This article demonstrates how to use these TableTorch functions on the model name column of a vehicle dataset.
- Install TableTorch to Google Sheets from the Google Workspace Marketplace. More details on initial setup.
- Click the TableTorch icon on the right-side panel of Google Sheets.
- Select the name column and click the Text categories encoding button.
Data loading and processing might take some time. After that, the 25 most common phrases are presented for selection.
Click the Binary columns button to insert the binary columns for the selected categories. The columns will appear in a few moments.
Although they might look excessive, binary category columns often help significantly improve the accuracy of linear models.
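To make the idea of binary columns concrete, here is a minimal Python sketch of the concept. It is not TableTorch's implementation: the function name `binary_columns`, the whitespace-based word matching, and the sample model names are all assumptions made for illustration.

```python
def binary_columns(texts, phrases):
    """Return one 0/1 column per phrase, flagging rows whose text contains it."""
    columns = {}
    for phrase in phrases:
        needle = phrase.lower().split()
        column = []
        for text in texts:
            words = text.lower().split()
            # Slide a window over the row's words and compare it to the phrase.
            found = any(
                words[i:i + len(needle)] == needle
                for i in range(len(words) - len(needle) + 1)
            )
            column.append(1 if found else 0)
        columns[phrase] = column
    return columns


# Hypothetical model names, not taken from the article's dataset.
names = ["Ford Focus SE", "Ford Fiesta", "Honda Civic SE", "Toyota Corolla"]
encoded = binary_columns(names, ["ford", "se", "ford focus"])

for phrase, column in encoded.items():
    print(phrase, column)
```

The first row gets a 1 in all three columns, which shows why the columns are not mutually exclusive.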
Single identifier column
Click the Single identifier column button to produce just one additional column. Note that if the input text matches more than one category, the most specific, i.e. the longest, variant is selected. Category identifiers are sorted by frequency of occurrence in descending order, so that #1 is the most common category and #N-1 is the least common one. Identifier #N is always reserved for the Other category, which covers the rows that do not match any other category.
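The numbering rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than TableTorch's actual algorithm: "longest" is interpreted as the phrase with the most words, frequency is counted as the number of rows containing the phrase, and the names `contains_phrase` and `single_identifier_column` are hypothetical.

```python
def contains_phrase(text, phrase):
    """True if the words of `phrase` appear consecutively in `text`."""
    words, needle = text.lower().split(), phrase.lower().split()
    return any(
        words[i:i + len(needle)] == needle
        for i in range(len(words) - len(needle) + 1)
    )


def single_identifier_column(texts, phrases):
    # Assumption: rank categories by how many rows contain them, most frequent first.
    counts = {p: sum(contains_phrase(t, p) for t in texts) for p in phrases}
    ranked = sorted(phrases, key=lambda p: -counts[p])
    ids = {p: rank for rank, p in enumerate(ranked, start=1)}
    other_id = len(ranked) + 1  # identifier N, reserved for the Other category

    column = []
    for text in texts:
        matches = [p for p in phrases if contains_phrase(text, p)]
        if matches:
            # Assumption: "most specific" means the phrase with the most words.
            best = max(matches, key=lambda p: len(p.split()))
            column.append(ids[best])
        else:
            column.append(other_id)
    return column, ids, other_id


# Hypothetical model names, not taken from the article's dataset.
names = ["Ford Focus SE", "Ford Fiesta", "Ford Focus", "Honda Civic", "Ford Mustang"]
column, ids, other_id = single_identifier_column(names, ["ford", "ford focus"])
print(column)
```

Here "Ford Focus SE" matches both categories but receives the identifier of "ford focus", the longer one, while "Honda Civic" falls into Other.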
Phrase counting sheet
Finally, clicking the Phrase counting sheet button produces a separate sheet with all of the identified categories and their respective frequencies of occurrence.
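A phrase count of this kind can be approximated in a few lines of Python. The sketch below assumes that a phrase is a run of one or two consecutive words and that words are split on whitespace; how TableTorch actually identifies phrases is not described here, and `count_phrases` is a hypothetical helper.

```python
from collections import Counter


def count_phrases(texts, max_words=2):
    """Count every run of 1..max_words consecutive words across all rows."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


# Hypothetical model names, not taken from the article's dataset.
names = ["Ford Focus SE", "Ford Fiesta", "Ford Focus"]
phrase_counts = count_phrases(names)

# List phrases from most to least frequent, as a counting sheet would.
for phrase, count in phrase_counts.most_common():
    print(phrase, count)
```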
Google, Google Sheets, Google Workspace and YouTube are trademarks of Google LLC. Gaujasoft TableTorch is not endorsed by or affiliated with Google in any way.