Indian District Database

Data cleaning:

We have undertaken extensive data checking to ensure that the data are as accurate as possible. Most of the errors in the printed publications have been corrected. In general, we have checked that

Often, more than one internal consistency check was possible (e.g., checking both that males and females summed to persons and that all the industrial categories summed to the total workers).
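The two internal checks named above can be sketched in a few lines. This is only an illustration, not the procedure actually used for the database: the field names (`males`, `females`, `persons`, `workers_by_industry`, `total_workers`) and the figures in the example record are hypothetical.

```python
def internal_check_failures(record, tolerance=0):
    """Return the names of the internal consistency checks a record fails.

    The field names are hypothetical; they are not the database's own
    variable names.
    """
    failures = []
    # Check 1: males and females should sum to persons.
    if abs(record["males"] + record["females"] - record["persons"]) > tolerance:
        failures.append("sex_sum")
    # Check 2: the industrial categories should sum to total workers.
    industry_sum = sum(record["workers_by_industry"].values())
    if abs(industry_sum - record["total_workers"]) > tolerance:
        failures.append("industry_sum")
    return failures


# Invented figures, for illustration only.
example = {
    "males": 520,
    "females": 480,
    "persons": 1000,
    "workers_by_industry": {
        "cultivators": 200,
        "agricultural_labourers": 100,
        "other": 50,
    },
    "total_workers": 360,  # deliberately inconsistent: the categories sum to 350
}
print(internal_check_failures(example))
```

Running both checks on the same record helps narrow down which figure is wrong: here the sex totals agree, so the error lies in the worker figures.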

For the census data, the ability to check both state totals and internal consistency enabled us to identify and correct the exact source of errors in the census publications. For the census data, therefore, the computer database is more accurate than any other available source.
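A state-total check of the kind described here might look like the following sketch. It assumes the district figures are available as (state, value) pairs and the published state totals as a dictionary. That layout, the function name, and the placeholder state names and figures are all assumptions made for illustration.

```python
from collections import defaultdict


def state_total_mismatches(district_rows, published_state_totals, tolerance=0):
    """Compare summed district values with published state totals.

    district_rows: iterable of (state, value) pairs (hypothetical layout).
    published_state_totals: dict mapping state -> published total.
    Returns {state: (district_sum, published_total)} for each mismatch.
    """
    sums = defaultdict(int)
    for state, value in district_rows:
        sums[state] += value
    return {
        state: (sums[state], published)
        for state, published in published_state_totals.items()
        if abs(sums[state] - published) > tolerance
    }


# Invented figures and placeholder state names, for illustration only.
rows = [("State A", 100), ("State A", 250), ("State B", 75), ("State B", 30)]
published = {"State A": 350, "State B": 150}
print(state_total_mismatches(rows, published))
```

A mismatch only shows that some figure in the state is wrong. Combining it with the internal checks is what allows the error to be traced to a particular district and variable.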

Data cleaning was more problematic for the agricultural data because internal consistency checks were not always available. The land utilisation data in Indian Agricultural Statistics provided enough internal checks that, together with the state totals, we were able to identify the source of most of the errors. Remaining problems are described above.

The cropwise area and production data were the most difficult to clean. Between 1955/56 and 1964/65, the source published area, production, and yield per hectare, so we could check whether the yield calculated from area and production agreed with the published yield. After 1964/65, virtually the only indication of errors was the failure of the districts' area and production to sum to the state total. When these errors were encountered, we used one or more of the following to identify the source of the error:

These comparisons enabled us to correct the largest of the problems. However, inconsistencies remain, and we are continuing our efforts to provide the best crop data available.
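For the years in which the source published yield per hectare alongside area and production, the yield check can be sketched as below. The function name, the 1% relative tolerance (an allowance for rounding in the printed figures), and the example numbers are assumptions, not details of the procedure actually used.

```python
def yield_mismatch(area_ha, production, published_yield, rel_tolerance=0.01):
    """Return True if the computed yield disagrees with the published yield.

    The computed yield is production divided by area, in whatever units the
    source uses per hectare. The tolerance is an assumed allowance for
    rounding in the printed figures.
    """
    if area_ha <= 0:
        # The yield cannot be verified without a positive area; flag the
        # record for manual review.
        return True
    computed = production / area_ha
    allowed = rel_tolerance * max(abs(published_yield), 1e-12)
    return abs(computed - published_yield) > allowed


# Invented figures, for illustration only.
print(yield_mismatch(1000, 1500, 1.5))  # computed and published yield agree
print(yield_mismatch(1000, 1500, 2.5))  # published yield disagrees
```

A disagreement shows that at least one of the three published figures is wrong but not which one, which is why further comparisons were needed to locate the error.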

