# Data sampling

TableTorch’s **data sampling** function reads the data of the
selected range and inserts a new sheet containing separate samples
of the original data, with rows selected according to the
specified options.

It can be used for the following purposes:

- **Splitting** the data into separate train-test sets or a number of equally sized sets useful for K-Fold Cross-Validation.
- **Randomizing** the order of rows.
- **Stratified** random sampling:
  - **Uniform:** the splits should have the same number of rows belonging to each stratum.
  - **Proportional:** in every sample, the share of each stratum should be the same as it was in the original dataset.
- Sampling **with replacement:** each row has the same probability of being included in the resulting split, the number of rows can be greater than in the original dataset, and the same row may appear more than once in a particular sample.
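The difference between uniform and proportional stratified sampling can be illustrated with a small allocation sketch. It is not TableTorch's code: the function names `uniform_allocation` and `proportional_allocation` are hypothetical, and rounding leftovers are simply dropped.

```python
from collections import Counter


def uniform_allocation(strata, sample_size):
    """Same number of rows for every stratum, capped by the smallest stratum."""
    counts = Counter(strata)
    per_stratum = min(sample_size // len(counts), min(counts.values()))
    return {stratum: per_stratum for stratum in counts}


def proportional_allocation(strata, sample_size):
    """Each stratum keeps the share it had in the original dataset."""
    counts = Counter(strata)
    total = len(strata)
    # Integer division: any rounding leftovers are ignored in this sketch.
    return {stratum: sample_size * count // total
            for stratum, count in counts.items()}


# Stratum label of every row: 40 rows of A and 160 rows of B.
strata = ["A"] * 40 + ["B"] * 160

print(uniform_allocation(strata, 50))       # {'A': 25, 'B': 25}
print(proportional_allocation(strata, 50))  # {'A': 10, 'B': 40}
```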

The sampling techniques that are present on the **Sampling**
panel are the same as they are for linear
and logistic regressions. The underlying
algorithm is also the same. Thus, **sampling** could be useful to visually
review how the data is going to be split before doing a regression,
as well as to perform any other research on the samples.

The following sections apply each of the available options to the vehicle dataset.

## Start TableTorch

- Install **TableTorch** to Google Sheets via Google Workspace Marketplace. More details on initial setup.
- Click on the **TableTorch** icon on the right-side panel of Google Sheets.

## Train-test split

Select the whole dataset and click on the **Sampling** button
of the TableTorch menu.

A panel with sampling options will appear:

Click the **Collect** button and **TableTorch** will insert a
new sheet with two samples of the data, each consisting
of half of the rows of the original dataset.

Rows *12..4066* are hidden in the screenshot above in order
to demonstrate that the resulting sheet contains two separate
sets of data with identical column structure, a header row
identifying each set, and an additional row with
the names of the columns.

Sampling with replacement was not used, so each of the sets contains only unique records from the original dataset.
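For readers who prefer code, an equal train-test split without replacement can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm TableTorch runs; the function name `train_test_split` and the `train_share` and `seed` parameters are assumptions made for the example.

```python
import random


def train_test_split(rows, train_share=0.5, seed=0):
    """Shuffle a copy of the rows and cut it into two disjoint sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for the rows of a selected range
train, test = train_test_split(rows)

print(len(train), len(test))  # 5 5
```

Because every row is placed in exactly one of the two sets, no record is repeated, which matches the behaviour described above when replacement is disabled.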

## Stratified 3-Fold Cross-Validation splitting

Let’s try 3-Fold Cross-Validation splitting with **year stratum**
(see fine-tuning regressions page for its
formula) as the stratum column and stratified uniform random sampling.
The full set of options is shown in the picture below.

Click the **Collect** button in order to perform the sampling.

Some of the rows are hidden on the picture above so that the header rows of the folds and their sets are seen.

TableTorch produced 3 folds. Each of them contains a **training set**
with two thirds of the data and a **validation set** containing
a unique one third of the original dataset.

Each stratum identified by the **year stratum** column is uniformly represented
in each of the training and validation sets, i.e. the number of rows per stratum
should be the same. A slight deviation may occur if the number of rows cannot
be divided evenly by the cross-validation **k** parameter (i.e. 3, 5, or 10).
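The fold structure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not TableTorch's implementation: the function name `stratified_uniform_folds`, the `stratum_of` callback, and the choice to cap every stratum at the size of the smallest one are all made up for the example.

```python
import random
from collections import defaultdict


def stratified_uniform_folds(rows, stratum_of, k=3, seed=0):
    """Return k (training, validation) pairs with equal rows per stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[stratum_of(row)].append(row)

    # Uniform sampling: take the same number of rows from every stratum.
    per_stratum = min(len(group) for group in by_stratum.values())

    # Deal the chosen rows of each stratum round-robin into k parts.
    parts = [[] for _ in range(k)]
    for group in by_stratum.values():
        for i, row in enumerate(rng.sample(group, per_stratum)):
            parts[i % k].append(row)

    folds = []
    for i in range(k):
        validation = parts[i]
        training = [r for j, part in enumerate(parts) if j != i for r in part]
        folds.append((training, validation))
    return folds


# Toy rows: (stratum label, row id); 9 "old" rows and 12 "new" rows.
rows = [("old", i) for i in range(9)] + [("new", i) for i in range(12)]
folds = stratified_uniform_folds(rows, stratum_of=lambda r: r[0], k=3)

for training, validation in folds:
    print(len(training), len(validation))  # 12 6 for every fold
```

When the per-stratum row count is not a multiple of **k**, the round-robin step leaves the parts differing by at most one row per stratum, which is the kind of slight deviation mentioned above.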

## Sampling with replacement

**Replacement allows performing stratified random sampling
without worrying about the underrepresented strata.**

Imagine a dataset of **200** rows where **40** rows belong to stratum **A**
and **160** to stratum **B**. Stratified uniform random sampling should
produce a dataset with an identical number of rows belonging to each stratum.
Hence, with default settings, it can only produce a dataset consisting of **80** records:
**40** of **A** and **40** of **B**. Any statistical analysis done
on the produced sample will lose **120**, or **75%**, of the rows belonging
to stratum **B**, which is a significant loss of signal and
might impair the soundness of the analysis.

Sampling with replacement is designed to alleviate this shortcoming.
It does so by randomly selecting a row from the original set a predefined
number of times. Thus, if replacement is used, stratified uniform random sampling
can produce a dataset with **160** or more rows for both strata. However,
some of those rows will be duplicates, so this kind of sampling is only
suitable for certain statistical analyses, e.g. for linear regressions.
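The example with strata **A** and **B** can be reproduced with a short Python sketch. It only illustrates the general technique of drawing with replacement inside each stratum; the function name `uniform_sample_with_replacement` and its parameters are hypothetical and do not describe TableTorch's internals.

```python
import random
from collections import Counter


def uniform_sample_with_replacement(rows, stratum_of, per_stratum, seed=0):
    """Draw per_stratum rows from every stratum, allowing repeats."""
    rng = random.Random(seed)
    by_stratum = {}
    for row in rows:
        by_stratum.setdefault(stratum_of(row), []).append(row)

    sample = []
    for group in by_stratum.values():
        # random.choices draws with replacement, so k may exceed len(group).
        sample.extend(rng.choices(group, k=per_stratum))
    return sample


# 40 rows of stratum A and 160 rows of stratum B, as in the example above.
rows = [("A", i) for i in range(40)] + [("B", i) for i in range(160)]
sample = uniform_sample_with_replacement(rows, lambda r: r[0], per_stratum=160)

counts = Counter(stratum for stratum, _ in sample)
print(counts["A"], counts["B"])  # 160 160

# Stratum A has only 40 distinct rows, so its 160 draws must contain repeats.
distinct_a = {row for row in sample if row[0] == "A"}
print(len(distinct_a) <= 40)  # True
```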

If replacement is enabled, **TableTorch** uses the following heuristic
to compute the number of rows to select:

- Let **n** be the number of rows in the original dataset divided by the count of strata. E.g. for a dataset of **240** rows and **3** strata, **n** is **80**.
- If the original dataset has fewer than **1000** rows, it selects a multiple of **n** rows so as to increase the chances of all of the original rows getting into the dataset. For example, if **n** is **80**, **TableTorch** is likely to select at least **200** rows for each stratum depending on the number of columns and other circumstances.
- Otherwise, it selects **n** rows for each stratum. This may result in some of the rows being omitted from the resulting sample, but it is needed to reduce the probability of exceeding the maximum execution time or inadvertently adding more cells than the 5 million cells limit of Google Sheets.
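The heuristic above can be sketched as a small function. This is an illustration only: the name `rows_per_stratum` is hypothetical, and the `small_multiple` value of 3 is a placeholder assumption, since the actual multiple depends on the number of columns and other circumstances that are not specified here.

```python
def rows_per_stratum(total_rows, strata_count, small_multiple=3, small_limit=1000):
    """Estimate how many rows to draw per stratum when replacement is enabled.

    small_multiple is an assumed placeholder, not a documented value.
    """
    n = total_rows // strata_count
    if total_rows < small_limit:
        # Small dataset: oversample to raise the chance that every
        # original row appears in the result.
        return n * small_multiple
    # Large dataset: keep the sample small to limit run time and cell count.
    return n


print(rows_per_stratum(240, 3))   # 240
print(rows_per_stratum(3000, 3))  # 1000
```

With the assumed multiple of 3, a dataset of 240 rows and 3 strata yields 240 rows per stratum, which is consistent with the "at least 200" figure mentioned above.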

## Formulas

**TableTorch** copies the data into the resulting dataset by values,
i.e. formulas are not copied. This is done to speed up the process
and avoid exceeding quotas. The produced sheet is assumed to be
temporary and intended for subsequent
regressions or other data manipulation rather than formula
experimentation.


**Google, Google Sheets, Google Workspace** and **YouTube** are trademarks of **Google LLC**.
**Gaujasoft TableTorch** is not endorsed by or affiliated with **Google** in any way.
