Cleaning the Indian district

Data cleaning:

We have undertaken extensive data checking to ensure that the data are as accurate as possible. Most of the errors in the printed publications have been corrected. In general, we have checked that

the totals for all districts in a state equal the published state totals.

internal totals are correct, for instance, that the numbers of males and females sum to the number of persons.

Often, more than one internal consistency check was possible (e.g., checking both that males and females summed to persons and that all the industrial categories summed to the total workers).
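The two kinds of checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (persons, males, females), and the sample figures are hypothetical and are not taken from the database itself.

```python
# Hypothetical records, one dict per district. Field names and figures are
# illustrative assumptions, not the database's actual variables or values.
districts = [
    {"state": "A", "district": "A1", "persons": 1000, "males": 520, "females": 480},
    {"state": "A", "district": "A2", "persons": 800, "males": 410, "females": 380},
]
published_state_totals = {"A": {"persons": 1800, "males": 930, "females": 870}}


def internal_errors(rows):
    """Return the districts where males + females does not equal persons."""
    return [r["district"] for r in rows if r["males"] + r["females"] != r["persons"]]


def state_total_errors(rows, state_totals):
    """Return (state, field, district_sum, published) wherever the sums disagree."""
    problems = []
    for state, totals in state_totals.items():
        for field, published in totals.items():
            district_sum = sum(r[field] for r in rows if r["state"] == state)
            if district_sum != published:
                problems.append((state, field, district_sum, published))
    return problems


print(internal_errors(districts))
print(state_total_errors(districts, published_state_totals))
```

In this made-up sample, district A2 fails the internal check and the females column fails the state-total check, so the two checks point at the same figure from different directions.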

For the census data, the ability to check both state totals and internal consistency enabled us to identify and correct the exact source of errors in the census publications. For the census data, therefore, the computer database is more accurate than any other available source.
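One way to see why the two checks together can pin down a single wrong figure: a state-total check identifies the column that is off, an internal check identifies the district that is off, and their intersection is one cell. The sketch below assumes exactly one erroneous cell per state table. The function name, field names, and figures are hypothetical, and this is not necessarily the procedure that was actually followed.

```python
FIELDS = ("persons", "males", "females")  # hypothetical field names


def locate_single_error(rows, published):
    """Return (district, field, corrected_value) when exactly one cell is wrong.

    Returns None when there is no error or when the pattern of failures is
    ambiguous and would need manual inspection.
    """
    # Column gaps: published state total minus the sum over districts.
    col_gap = {f: published[f] - sum(r[f] for r in rows) for f in FIELDS}
    bad_cols = [f for f in FIELDS if col_gap[f] != 0]
    # Row failures: districts where males + females does not equal persons.
    bad_rows = [r for r in rows if r["persons"] != r["males"] + r["females"]]
    if len(bad_cols) != 1 or len(bad_rows) != 1:
        return None
    field, row = bad_cols[0], bad_rows[0]
    corrected = row[field] + col_gap[field]
    # Accept the correction only if it also repairs the district's own total.
    trial = dict(row, **{field: corrected})
    if trial["persons"] != trial["males"] + trial["females"]:
        return None
    return row["district"], field, corrected


rows = [
    {"district": "A1", "persons": 1000, "males": 520, "females": 480},
    {"district": "A2", "persons": 800, "males": 410, "females": 380},
]
published = {"persons": 1800, "males": 930, "females": 870}
print(locate_single_error(rows, published))
```

When several cells are wrong at once the function returns None rather than guessing, which matches the spirit of correcting only errors whose source can be identified.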

Data cleaning was more problematic for the agricultural data because internal consistency checks were not always available. Land utilisation data in Indian Agricultural Statistics provided sufficient internal checks that, together with the state totals, we were able to identify the source of most of the errors. Remaining problems are described above.

The cropwise area and production data were the most difficult to clean. Between 1955/56 and 1964/65, the source published area, production, and yield per hectare, so we could check whether the yield we calculated from area and production agreed with the published yield. After 1964/65, virtually the only indication of errors was the failure of the districts' area and production to sum to the state total. When these errors were encountered, we used one or more of the following to identify the source of the error:

Outliers in the calculated yields per hectare.

Area and production data for adjacent years.

Area and production as given in alternative sources, for instance, in state statistical abstracts or state season and crop reports.
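The yield comparison and the outlier check lend themselves to a short sketch. It assumes that area and production are in units consistent with the published yield, and it uses a 2 percent tolerance and a factor-of-two rule against the series median. These thresholds, the function names, and the sample figures are illustrative assumptions, not values taken from the source.

```python
from statistics import median


def yield_mismatch(area, production, published_yield, rel_tol=0.02):
    """True if production / area differs from the published yield by more than rel_tol."""
    if area <= 0:
        return True  # a non-positive area cannot give a meaningful yield
    calculated = production / area
    return abs(calculated - published_yield) > rel_tol * published_yield


def yield_outliers(series, factor=2.0):
    """Return the years whose calculated yield is far from the series median.

    series maps a year label to an (area, production) pair. A year is flagged
    when its yield is more than `factor` times above or below the median.
    """
    yields = {year: prod / area for year, (area, prod) in series.items() if area > 0}
    mid = median(yields.values())
    return sorted(y for y, v in yields.items() if v > factor * mid or v < mid / factor)


# Hypothetical district series for one crop: year -> (area, production).
rice = {
    "1965/66": (100.0, 120.0),
    "1966/67": (105.0, 126.0),
    "1967/68": (102.0, 1224.0),  # production looks ten times too large
    "1968/69": (110.0, 132.0),
}
print(yield_mismatch(100.0, 120.0, 1.2))
print(yield_outliers(rice))
```

A flagged year is only a candidate: comparing it with adjacent years and with alternative sources, as listed above, is what decides whether area or production is the figure in error.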

These comparisons enabled us to correct the largest of the problems. However, inconsistencies remain, and we are continuing our efforts to provide the best crop data available.

Last updated October 1, 2000

comments to: Reeve Vanneman. reeve@umd.edu