Tuesday, May 5, 2009

Data Coding

Data Analysis: The Art And Science Of Coding And Entering Data

Data can be collected from many sources: from individuals, schools, worksites, medical care institutions, and government agencies as well as from records of all types, such as patient, work or school records. Regardless of what data are collected, from whom and under what circumstances, they usually need to be coded before being processed, analyzed and reported. This article focuses on the coding and entering of data for subsequent processing, analysis and reporting.


Why code data? Put simply, tabulating responses by computer requires a numeric value for each item of interest. For example, if you ask people whether they belong to an HMO, and the possible answers are "yes" or "no," you couldn't just put an "x" next to the response and enter an "x" into the computer. Rather, you need a number so the data can be tabulated. In this case you could code "yes" = 1 and "no" = 2. Then, when you process your data, you can count the number of 1s and 2s.
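As a rough illustration of the yes/no example above, the short Python sketch below codes a handful of answers and counts the 1s and 2s. The responses are made up for illustration, and the name HMO_CODES is hypothetical; only the scheme "yes" = 1, "no" = 2 comes from the text.

```python
from collections import Counter

# Code scheme from the text: "yes" = 1, "no" = 2.
HMO_CODES = {"yes": 1, "no": 2}

# Made-up responses, for illustration only.
responses = ["yes", "no", "yes", "yes", "no"]

# Translate each written answer into its numeric code.
coded = [HMO_CODES[r] for r in responses]

# Tabulate how often each code occurs.
counts = Counter(coded)

print(coded)      # [1, 2, 1, 1, 2]
print(counts[1])  # 3
print(counts[2])  # 2
```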

However, not all data need to be coded. For example, if age is a variable of interest, then the age itself can be used; no coding is necessary. Similarly, if weight or the number of days out of school or work is desired, then the actual number of pounds or days would be appropriate. Again, no coding is necessary.

When responses are recorded as categories rather than as actual values, however, coding is necessary. Suppose you had a question about age and had categorized the possible responses as "less than 18 years old," "18 to 35," "36 to 50," "51 to 65," and "over 65 years of age." To code these data you could use the following codes: less than 18 = 1, 18 to 35 = 2, 36 to 50 = 3, 51 to 65 = 4 and over 65 = 5. If you are interested in the state in which a person or institution is located, you might use a code such as Alabama = 1, Alaska = 2, Arizona = 3 and so forth, with the last entry being Wyoming = 50, or 51 if the District of Columbia is included. Note that in this case the code uses two digits, as there are more than nine categories.

In most instances a one-digit code is sufficient. For example, in reviewing records, a person wanting to know the highest level of education in each person's record might code: less than high school diploma = 1, high school graduate = 2, some college = 3, college graduate = 4, and graduate degree = 5.

So for each variable you wish to analyze, you should use either the actual numbers or a coded value. You should decide which before you collect your data. Having an idea of how to code your data before collecting it will help in subsequent processing and analysis.
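The age categories described above can be sketched as a small Python function. This is only an illustration: the function name code_age and the sample ages are hypothetical, and it assumes age is recorded in whole years.

```python
def code_age(age):
    """Return the category code for an age given in whole years."""
    if age < 18:
        return 1  # less than 18
    if age <= 35:
        return 2  # 18 to 35
    if age <= 50:
        return 3  # 36 to 50
    if age <= 65:
        return 4  # 51 to 65
    return 5      # over 65

# Made-up ages chosen to sit on the category boundaries.
ages = [17, 18, 35, 36, 50, 51, 65, 66]
age_codes = [code_age(a) for a in ages]
print(age_codes)  # [1, 2, 2, 3, 3, 4, 4, 5]
```

Testing the boundary ages (18, 35, 36 and so on) is a quick way to confirm that every respondent falls into exactly one category.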

In coding data, sometimes the respondent doesn't know the answer, refuses to respond, or skips an item. For example, if data were being collected about a sensitive topic such as whether a person had ever been ticketed for driving while intoxicated, a respondent may not wish to answer, whereas asking an employer the same question about one of their employees may result in a legitimate "don't know" response. Similarly, a respondent may simply skip a question inadvertently. By one convention, "don't knows" are coded as "8" and refused or missing data as "9." Coding the don't know/refused/missing data allows them to be analyzed later.
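The sketch below shows one way the 8/9 convention might be used at analysis time: the special codes are counted separately and then set aside before computing a percentage. The answers are made up, and the constant names are hypothetical.

```python
from collections import Counter

# Convention from the text: 8 = don't know, 9 = refused or missing.
DONT_KNOW = 8
REFUSED_OR_MISSING = 9

# Made-up coded answers to a yes/no item (1 = yes, 2 = no).
answers = [1, 2, 9, 1, 8, 2, 1, 9]

counts = Counter(answers)

# Keep only substantive answers for the percentage calculation.
valid = [a for a in answers if a not in (DONT_KNOW, REFUSED_OR_MISSING)]
share_yes = valid.count(1) / len(valid)

print(counts[DONT_KNOW])           # 1
print(counts[REFUSED_OR_MISSING])  # 2
print(len(valid))                  # 5
print(share_yes)                   # 0.6
```

Because the nonresponses carry their own codes rather than being left blank, they remain available for a separate look, for example to see which items were refused most often.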

Sometimes you collect data using open-ended questions such as "What did you like most about the program?" Respondents then write in their answers. In this case, you would review a number of responses and then develop a code such as the instructor = 1, the content = 2, the methods used = 3, the time of day = 4, the day of the week = 5, etc. To the extent possible, variables of interest should use either actual or precoded values, because postcoding of open-ended items is time-consuming and some subjectivity is inevitable in translating written responses into coded numerical values.
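A first pass at postcoding can sometimes be roughed out with keyword matching, as in the Python sketch below. This is an illustration only, not a method described in the article: the keyword lists, the function name postcode and the code 0 for "needs manual review" are all assumptions, and a person would still need to read the answers that do not match or that match for the wrong reason.

```python
# Hypothetical keyword lists for the codes given in the text;
# a real scheme would come from reading a sample of responses.
KEYWORDS = {
    1: ["instructor", "teacher"],                      # the instructor
    2: ["content", "material"],                        # the content
    3: ["method", "exercise"],                         # the methods used
    4: ["morning", "evening", "time of day"],          # the time of day
    5: ["saturday", "weekday", "day of the week"],     # the day of the week
}
UNCODED = 0  # assumption: flag the response for manual review


def postcode(text):
    """Return the first code whose keyword appears in the response."""
    lowered = text.lower()
    for code, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return code
    return UNCODED


# Made-up written responses.
written = [
    "The instructor was great",
    "I liked the hands-on exercises",
    "Meeting on Saturday worked well",
    "Free parking",
]
open_codes = [postcode(w) for w in written]
print(open_codes)  # [1, 3, 5, 0]
```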

Before entering your data into a computer for processing, you should develop a codebook that indicates how each variable is coded (whether it uses actual or coded values) and in which columns it appears (Babbie, 2000). A sample codebook for several variables might look like this:

Columns   Variable                   Codes

1-3       Case #                     001-150 (for 150 cases)
4         Gender                     1 = Male, 2 = Female
5-6       Month of birth             1 = January, 2 = February, etc.
7         Type of health insurance   1 = Private, 2 = Public, 3 = Uninsured,
                                     8 = Don't know, 9 = Refused/Missing
8-9       Height in inches           e.g. 64.0 inches = 64, 64.5 = 65,
                                     66.25 = 66
10        Marital status             1 = Married, 2 = Widowed,
                                     3 = Separated, 4 = Divorced,
                                     5 = Never married
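
To make the column layout concrete, here is a minimal Python sketch that writes one case as a 10-character record following the sample codebook and reads it back. The field names, the helper functions and the example case are hypothetical. Heights are rounded half up to match the codebook's examples (64.5 becomes 65); Python's built-in round() rounds halves to the nearest even number, so round(64.5) would give 64.

```python
import math

# (field name, first column, last column), 1-indexed and inclusive,
# following the sample codebook. Field names are hypothetical.
CODEBOOK = [
    ("case", 1, 3),
    ("gender", 4, 4),
    ("birth_month", 5, 6),
    ("insurance", 7, 7),
    ("height_in", 8, 9),
    ("marital", 10, 10),
]


def code_height(inches):
    """Round a height to the nearest inch, with halves rounded up."""
    return math.floor(inches + 0.5)


def encode(case):
    """Build a fixed-width record, zero-padding each value to its width."""
    parts = []
    for name, first, last in CODEBOOK:
        width = last - first + 1
        text = str(case[name]).zfill(width)
        if len(text) != width:
            raise ValueError(f"{name}={case[name]!r} does not fit in {width} column(s)")
        parts.append(text)
    return "".join(parts)


def decode(record):
    """Slice a record back into integer values by column position."""
    return {name: int(record[first - 1:last]) for name, first, last in CODEBOOK}


# Made-up case: a woman born in May, privately insured, 64.0 inches, married.
example = {
    "case": 1,
    "gender": 2,
    "birth_month": 5,
    "insurance": 1,
    "height_in": code_height(64.0),
    "marital": 1,
}
record = encode(example)
print(record)                 # 0012051641
print(decode(record) == example)  # True
```

The width check in encode() catches values that cannot fit their field, such as a month code of 123, before they shift every later column out of position.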

We will now use this codebook to enter the data for the three cases. A case is the data for each person or unit of interest. For example, if you were interested in the data from 140 people enrolled in a health promotion program, there would be 140 cases in your data set. If you were interested in looking at data for the 50 states, there would be 50 cases. There are several reasons why each case (questionnaire, etc.) should be assigned a number. One reason is that if an error in data entry is noticed during analysis, the case number can direct the person back to the original data for correction. A second reason concerns follow-up. By putting an ID number on each case at the beginning of your study, you can quickly identify those cases for which you do and do not have data. Thus, when following up nonrespondents, as in a mail or phone survey, you immediately know whom to follow up.
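
The follow-up use of case numbers can be sketched in a few lines of Python. The example assumes the 140 program participants mentioned above are numbered 1 to 140, and the set of returned questionnaires is made up.

```python
# Assumption: cases are numbered 1-140, one per program participant.
all_cases = set(range(1, 141))

# Made-up case numbers of questionnaires received so far.
returned = {1, 2, 5, 140}

# Cases with no data yet are the ones to follow up.
nonrespondents = sorted(all_cases - returned)
first_three = [f"{n:03d}" for n in nonrespondents[:3]]

print(len(nonrespondents))  # 136
print(first_three)          # ['003', '004', '006']
```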