Sampling is a tool for deciding how much data to collect and how often to collect it. It defines the samples to take in order to quantify a system, process, issue, or problem.
To illustrate sampling, consider a loaf of bread. How good is the bread? To find out, is it necessary to eat the whole loaf? No, of course not. To make a judgment about the entire loaf, it is necessary only to taste a sample of the loaf, such as a slice. In this case the loaf of bread being studied is known as the population of the study. The sample, the slice of bread, is a subset or a part of the population.
Now consider a whole bakery. The population of interest is no longer a loaf, but all the bread that has been made today. A sample size of one slice from one loaf is clearly inadequate for this larger population. The sample collected will now become several loaves of bread taken at set times throughout the day. Since the population is larger, the sample will also be larger. The larger the population, the larger the sample required.
In the bakery example, bread is made in an ongoing process. That is, bread was made yesterday, is being made throughout today, and will be made tomorrow. For an ongoing process, samples need to be taken to identify how the process is changing over time. Plotting the samples on control charts and studying how they change will show where and how to improve the process, and allow prediction of future performance.
For example, the bakery is interested in the weight of the loaves. The bakery does not want to weigh every single loaf, as this would be too expensive, too time consuming, and no more accurate than sampling some of the loaves. Sampling for improvement and monitoring is a matter of taking small samples frequently over time. The questions now become: how much data should be collected, and how often should it be collected?
These two questions, “how much?” and “how often?” are at the heart of sampling.
Comply with critical quality standards, reduce variability, improve profitability, and reduce costs.
Factors to consider might be changes of personnel, equipment, or materials. The questions identified in step 1 may give guidance to this step.
Common frequencies of sampling are hourly, daily, weekly, or monthly. Although frequency is usually stated in time, it can also be stated in number: every tenth part, every fifth purchase order, every other invoice, for example. If it is not clear how frequently the process changes, collect data frequently, examine the results, and then set the frequency accordingly.
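When frequency is stated in number rather than time, the selection rule is simple to write down. The following is a minimal sketch, assuming the items are available as an ordered sequence; the function name `every_nth` and the run of 35 numbered parts are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def every_nth(items: Sequence[T], n: int) -> List[T]:
    """Return the n-th, 2n-th, 3n-th, ... items of an ordered sequence."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Start at index n - 1 so that the first item picked is the n-th one.
    return list(items[n - 1::n])


# Hypothetical run of 35 numbered parts, sampled at every tenth part.
parts = list(range(1, 36))
sampled = every_nth(parts, 10)
print(sampled)  # [10, 20, 30]
```

The same rule covers "every fifth purchase order" or "every other invoice" by changing `n` to 5 or 2.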
Determine the actual frequency times.
The purpose of this step is to state the actual time to take the samples. For instance, if the frequency were determined to be daily, what time of day should the sample be taken: in the morning at 8:00 am, around midday, or late in the day around 5:00 pm? This is important because inconsistent intervals between samples will lead to data that is unreliable for further analysis. For example, if a sample is to be taken daily, and on one day it is taken at 8:00 am, the next day at 5:00 pm, and the following day at midday, the intervals between the samples are inconsistent and the collected data will also be inconsistent. The data will exhibit unusual patterns and will be less meaningful. Stating the time that the sample is to be taken will reduce this type of error. The actual time should be as close as possible to any expected changes in the process, and should be a time when taking a sample is convenient. Avoid difficult times, such as during a shift change or lunch break.
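The effect of inconsistent timing can be seen by computing the interval between successive samples. The sketch below uses the daily example from the text (8:00 am, then 5:00 pm, then midday) and compares it with a fixed 8:00 am schedule; the calendar dates and the helper name `gaps_in_hours` are assumptions made only for illustration.

```python
from datetime import datetime, timedelta


def gaps_in_hours(times):
    """Return the hours elapsed between each pair of successive sample times."""
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(times, times[1:])]


# Daily samples taken at varying times of day (dates are hypothetical).
inconsistent = [
    datetime(2024, 1, 1, 8, 0),    # day 1 at 8:00 am
    datetime(2024, 1, 2, 17, 0),   # day 2 at 5:00 pm
    datetime(2024, 1, 3, 12, 0),   # day 3 at midday
]

# Daily samples taken at the same stated time, 8:00 am.
consistent = [datetime(2024, 1, 1, 8, 0) + timedelta(days=d) for d in range(3)]

print(gaps_in_hours(inconsistent))  # [33.0, 19.0]
print(gaps_in_hours(consistent))    # [24.0, 24.0]
```

Although both schedules are nominally "daily", the first one spaces its samples 33 and then 19 hours apart, while the stated time keeps every interval at 24 hours.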
For variables data: When measuring variables data, a subgroup size larger than one is preferable because larger subgroup sizes yield greater possibilities for analysis. However, it may not be possible to get a subgroup size larger than one. Some examples of this are electricity usage per month, profit per month, sales per month, temperature of a room, and the viscosity of a fluid. In situations such as these, when a subgroup size larger than one does not make sense, the subgroup (or sample) size is equal to one.
If a subgroup size larger than one can be chosen, the size is usually between three and eight, a range that has been determined to be statistically efficient. The most commonly used subgroup size is five. When more data is desired, the frequency of taking samples, not the subgroup size, should be increased.
When a sample is taken, it should be selected to ensure that conditions within the sample are similar. If gathering a sample size of five, for example, take all five pieces in a row as they are produced in the process. This is known as a rational subgroup.
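A rational subgroup of five consecutive pieces can be sketched as follows. The loaf weights are invented for illustration, and the function names are hypothetical. Summarizing each subgroup by its mean and range is a common way to prepare variables data for control charts, but the text above does not prescribe it, so treat that step as an assumption.

```python
from statistics import mean


def rational_subgroup(measurements, start, size=5):
    """Take `size` consecutive measurements beginning at index `start`."""
    subgroup = measurements[start:start + size]
    if len(subgroup) < size:
        raise ValueError("not enough consecutive measurements for a full subgroup")
    return subgroup


def summarize(subgroup):
    """Return the (mean, range) of one subgroup."""
    return mean(subgroup), max(subgroup) - min(subgroup)


# Hypothetical loaf weights in grams, listed in production order.
weights = [502, 498, 501, 499, 500, 503, 497, 500, 501, 499]

subgroup = rational_subgroup(weights, start=0)
average, spread = summarize(subgroup)
# First five loaves in a row: mean is 500 g, range is 4 g.
print(subgroup, average, spread)
```

Taking the five pieces in a row keeps conditions within the subgroup similar, so differences between subgroups reflect changes in the process over time.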
For attributes data: The subgroup size for attributes data depends on the process being sampled. The general rule of thumb is to gather a large enough sample so that all possible characteristics being investigated will appear. That is, the sample is large enough that a “0” occurrence is rare.
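One way to make "a '0' occurrence is rare" concrete is sketched below. It rests on assumptions that are not in the text: each item independently shows the characteristic with a constant proportion `p`, so the chance of seeing none in a sample of `n` items is (1 - p)^n, and "rare" is given a chosen threshold `alpha`. The function name and the example values are hypothetical.

```python
def min_size_for_rare_zero(p: float, alpha: float) -> int:
    """Smallest n for which the chance of zero occurrences, (1 - p)**n, is at most alpha.

    Assumes independent items and a constant occurrence proportion p
    (an illustrative model, not a rule taken from the article).
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    n = 1
    # Grow the sample until an all-zero sample becomes rare enough.
    while (1 - p) ** n > alpha:
        n += 1
    return n


# If about 5% of items show the characteristic and a zero count
# should occur in at most 5% of samples (both values hypothetical):
print(min_size_for_rare_zero(p=0.05, alpha=0.05))  # 59
```

The rarer the characteristic, the larger the sample this model calls for, which matches the rule of thumb that the subgroup size depends on the process being sampled.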
Begin by answering the question, “How many items does this process produce during the frequency interval (per hour, week, etc.)?” When that number is determined, the sample size should be at least the square root of that number. For instance, if a purchasing department processes 100 purchase orders per week, an appropriate sample size would be 10 purchase orders per week (the square root of 100 is 10).
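The square-root rule can be written as a short calculation. This is a minimal sketch; the function name is hypothetical, and rounding up to the next whole item when the square root is not a whole number is an assumption that follows from "at least the square root".

```python
import math


def min_sample_size(items_per_interval: int) -> int:
    """Smallest whole sample size that is at least the square root of the count."""
    if items_per_interval < 0:
        raise ValueError("items_per_interval must not be negative")
    root = math.isqrt(items_per_interval)  # integer part of the square root
    # Round up when the count is not a perfect square.
    return root if root * root == items_per_interval else root + 1


print(min_sample_size(100))  # 10, the purchase-order example from the text
print(min_sample_size(150))  # 13, since 12 * 12 = 144 falls short of 150
```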
The above article is an excerpt from the "Sampling" chapter of Practical Tools for Continuous Improvement: Volume 1 - Statistical Tools. The full chapter provides more details on sampling.