2  Data Cleaning & Processing

In each chapter or part of the book there will be a dedicated set of packages, so please ensure that you install the prerequisite packages before running the code.
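
If the packages are not yet installed, here is a minimal sketch of the installation step; where cleanR is hosted is an assumption, so adjust its installation to wherever the package is actually distributed.

# install the CRAN packages used throughout this chapter
install.packages(c("tidyverse", "readxl"))
# cleanR is assumed to live outside CRAN; install it from its own repository,
# for example with remotes::install_github("<repository>/cleanR")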

library(tidyverse)
library(readxl)
library(cleanR)

# survey_df <- cleanR::survey_data

survey_df <- read_excel("input/MoDa_Data_2025-03-13.xlsx")

Data processing and cleaning is a critical part of the workflow: it involves high-frequency checks and spot checks on the survey data to flag issues that need to be verified and corrected, while keeping a record of every alteration made to the original data. There are predetermined quality checks that can be applied to any dataset, and the data officer can also draft unique checks based on the context. This brings the flexibility of adopting existing spot checks as well as adding your own checking parameters to the process, and of generating a cleaning logbook.

Data cleaning is an essential step between data collection and data analysis. Raw data is always imperfect and needs to be prepared for a high quality analysis and overall replicability.

Data quality means having accurate, complete, and reliable information. It’s like making sure the puzzle pieces fit perfectly, so we can trust the picture they create. Good data quality helps us make better decisions and understand what’s happening.

2.1 Productivity & Coverage

This section covers the following topics:

  • assessment tracking sheet report
  • assessment productivity
  • daily valid surveys

2.1.1 Survey Coverage

# total surveys collected per admin area
survey_tracking <- survey_df %>% 
  group_by(ADMIN1Name, ADMIN2Name) %>% 
  summarise(count = n())

# add one column per data collection day
check_survey <- unique(na.omit(survey_df$today))
for(j in check_survey){
  survey_tracking[, ncol(survey_tracking) + 1] <- (survey_df %>% 
                                                     group_by(ADMIN1Name, ADMIN2Name) %>% 
                                                     summarise(val = sum(na.omit(today == j))) %>% 
                                                     arrange(ADMIN1Name, ADMIN2Name))$val
  # the for loop drops the date class from j, so restore it before naming the column
  names(survey_tracking)[ncol(survey_tracking)] <- paste0("Date:", as_date(as_datetime(j)))
}

names(survey_tracking)[names(survey_tracking) == 'count'] <- "Total_Surveys"
knitr::kable(
  head(survey_tracking, 5), caption = 'Survey Tracking Table',
  booktabs = TRUE
)
Survey Tracking Table
ADMIN1Name ADMIN2Name Total_Surveys Date:2025-04-05 Date:2025-04-06 Date:2025-04-07 Date:2025-04-08 Date:2025-04-09
Admin 1 1 36 8 6 11 4 7
Admin 1 2 24 3 3 8 6 4
Admin 1 3 31 1 4 15 5 6
Admin 1 4 31 2 5 15 6 3
Admin 1 5 42 5 9 14 7 7
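
If you prefer a vectorised alternative to the loop above, the same wide tracking table can be sketched with count() and pivot_wider() from the tidyverse; column order may differ slightly from the loop version.

# alternative sketch: one row per admin area, one column per data collection day
survey_tracking_alt <- survey_df %>% 
  filter(!is.na(today)) %>% 
  count(ADMIN1Name, ADMIN2Name, Date = as_date(today)) %>% 
  pivot_wider(names_from = Date, values_from = n, values_fill = 0, names_prefix = "Date:") %>% 
  left_join(count(survey_df, ADMIN1Name, ADMIN2Name, name = "Total_Surveys"),
            by = c("ADMIN1Name", "ADMIN2Name"))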

2.1.2 Enumerator Performance

# summary of productivity
daily_productivity <- survey_df %>% 
  arrange(today) %>% 
  group_by(today) %>% 
  summarise(surveydate_N = n())

# enumerator level productivity
enum_productivity <- survey_df %>% 
  group_by(EnuName) %>% 
  summarise(days_worked = length(unique(today)),
            total_surveys_done = n(),
            daily_average = total_surveys_done / days_worked)
knitr::kable(
  head(enum_productivity, 5), caption = 'Enumerator Productivity',
  booktabs = TRUE
)
Enumerator Productivity
EnuName days_worked total_surveys_done daily_average
Enum 1 5 193 38.6
Enum 2 5 200 40.0
Enum 3 5 193 38.6
Enum 4 5 228 45.6
Enum 5 5 186 37.2
ggplot(enum_productivity, aes(x = EnuName, y = total_surveys_done)) +
  geom_col() +
  theme_minimal()
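
As a follow-up, you can flag enumerators whose daily average deviates strongly from the rest of the team; the 50% deviation threshold below is an assumption to adapt to your survey design.

# flag enumerators whose daily average is more than 50% away from the team median
enum_flagged <- enum_productivity %>% 
  mutate(team_median = median(daily_average),
         flagged = abs(daily_average - team_median) > 0.5 * team_median)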

2.2 Standard Checks

2.2.1 Duplicated Surveys

This check will help you identify duplicated records in the data; in most cases the unique identifier is the uuid, which is auto-generated.

check_dup_surveys <- check_duplicate_uuid(data = survey_df)
The following UUID(s) are duplicated:
character(0)
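
If you want to reproduce this check without cleanR, here is a minimal dplyr sketch, assuming the unique identifier column in your data is called uuid.

# keep every record whose uuid appears more than once
manual_dup_check <- survey_df %>% 
  group_by(uuid) %>% 
  filter(n() > 1) %>% 
  ungroup() %>% 
  arrange(uuid)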

2.2.2 Check Missing Data

Before we think about missing data, let's first understand our survey tool, as constraints and relevancy (skip) logic will skip some questions when they are not relevant. Once that is clear, we can anticipate which questions are mandatory and check how many missing values we have for each of them. If you see that 60% or more of a key indicator is missing, check the tool first and then follow up with the enumerators to understand the pattern and make corrections.

check_missing_data <- get_na_response_rates(data = survey_df)
knitr::kable(
  head(check_missing_data, 5), caption = 'Missing Data Rates',
  booktabs = TRUE
)
Missing Data Rates
question num_non_response perc_non_response
ADMIN1Name ADMIN1Name 0 0
ADMIN2Name ADMIN2Name 0 0
EnuName EnuName 0 0
EnuPartner EnuPartner 0 0
EnuSex EnuSex 0 0
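
The same idea can also be expressed directly with dplyr, for example if you want to restrict the check to a set of key indicators; a minimal sketch:

# percentage of missing values per question, sorted from most to least missing
manual_na_rates <- survey_df %>% 
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>% 
  pivot_longer(everything(), names_to = "question", values_to = "perc_non_response") %>% 
  arrange(desc(perc_non_response))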

2.2.3 Check Survey Time

Typical survey time checks include:

  • surveys that ended before they started
  • surveys that started before the first day of data collection
  • surveys with a start time after today’s date

Let’s check surveys that do not end on the same day they started.

subset(survey_df, as_date(start) != as_date(end), select = c("EnuName", "start", "end"))
# A tibble: 2 × 3
  EnuName start                    end                     
  <chr>   <chr>                    <chr>                   
1 Enum 5  2025-03-13T11:36:17+0300 2025-04-14T11:25:18+0300
2 Enum 5  2025-03-13T11:08:41+0300 2025-04-14T11:25:18+0300

Next, let’s check surveys whose start time is earlier than the first day of data collection.

subset(survey_df, as_date(start) < as_date("2025-04-06"), select = c("EnuName", "start", "today"))
# A tibble: 0 × 3
# ℹ 3 variables: EnuName <chr>, start <chr>, today <dttm>
check_survey <- cleanR::survey_time(df = survey_df, time_min = 10, time_max = 30) %>% 
  log_sheet(question.name = "interview_duration",
            issue = "survey filled with less/more time",
            action = "check")
knitr::kable(
  head(check_survey, 5), caption = 'Surveys with Unexpected Duration',
  booktabs = TRUE
)
Surveys with Unexpected Duration
uuid question.name issue feedback action old.value new.value
76219b50-cd45-424044-868483-852c3fe17d49 interview_duration survey filled with less/more time check 46069.0166666667
0e4cd3f5-3dc4-434c4e-aca5a7-0fb6c891de5a interview_duration survey filled with less/more time check 46096.6166666667
51cb7382-4280-414348-87808e-6f2490e183ac interview_duration survey filled with less/more time check -121.7
25a3b7cf-0f15-46454e-8c8382-50d6873ecfa1 interview_duration survey filled with less/more time check -10.2833333333333
d2f90a3e-47a5-464d4c-92989e-57dec9b1fa02 interview_duration survey filled with less/more time check -24.9666666666667
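
A duration check of this kind essentially computes the time difference between end and start; here is a minimal sketch, assuming start and end are ISO 8601 text columns and reusing the 10 and 30 minute bounds from above.

# interview duration in minutes; keep surveys outside the expected range
duration_check <- survey_df %>% 
  mutate(interview_duration = as.numeric(difftime(ymd_hms(end), ymd_hms(start), units = "mins"))) %>% 
  filter(interview_duration < 10 | interview_duration > 30) %>% 
  select(EnuName, start, end, interview_duration)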

2.2.4 Check Other Responses

# first make a list of all other columns included in your data
other_columns <- c("RESPRelationHHH_oth",
                   "HHAsstOthCBTRecName_oth")

check_others <- check_other_responses(data = survey_df, other_columns = other_columns)
knitr::kable(
  head(check_others, 5), caption = 'Other Responses to Review',
  booktabs = TRUE
)
Other Responses to Review
uuid question.name issue feedback action old.value new.value
76219b50-cd45-424044-868483-852c3fe17d49 RESPRelationHHH_oth Other response that need to be checked and recoded translate and recode may be relative
76219b50-cd45-424044-868483-852c3fe17d49 HHAsstOthCBTRecName_oth Other response that need to be checked and recoded translate and recode other response
0e4cd3f5-3dc4-434c4e-aca5a7-0fb6c891de5a RESPRelationHHH_oth Other response that need to be checked and recoded translate and recode dumpy input for spot checks
0e4cd3f5-3dc4-434c4e-aca5a7-0fb6c891de5a HHAsstOthCBTRecName_oth Other response that need to be checked and recoded translate and recode need to translate
d7cfe208-cb21-4b414c-979c9d-fb9c1a456302 HHAsstOthCBTRecName_oth Other response that need to be checked and recoded translate and recode looong narative
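
Before translating and recoding, it can help to see how heavily each "other" option is used; a minimal sketch reusing the other_columns vector defined above:

# count non-empty "other" responses per question to prioritise translation and recoding
other_counts <- survey_df %>% 
  summarise(across(all_of(other_columns), ~ sum(!is.na(.x) & .x != ""))) %>% 
  pivot_longer(everything(), names_to = "question", values_to = "n_other_responses") %>% 
  arrange(desc(n_other_responses))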

2.2.5 Check Outliers

There are two common ways to detect outliers: first, flagging values that are more than 3 standard deviations from the mean, and secondly, flagging values that are more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile.

detect_outliers <- function(data) {
  # Select only numeric columns
  numeric_data <- data |>
    select(where(is.numeric))

  # Pivot to long format to handle all variables uniformly
  outlier_data <- numeric_data |>
    pivot_longer(everything(), names_to = "variable", values_to = "value") |>
    # flag outliers per variable rather than across the pooled data
    group_by(variable) |>
    mutate(
      # log of non-positive values is undefined, so exclude them from the log-based check
      log_value = if_else(value > 0, log(value), NA_real_),
      is_outlier_raw = abs(value - mean(value, na.rm = TRUE)) > 3 * sd(value, na.rm = TRUE),
      is_outlier_log = abs(log_value - mean(log_value, na.rm = TRUE)) > 3 * sd(log_value, na.rm = TRUE),
      type = case_when(
        is_outlier_raw & is_outlier_log ~ "both",
        is_outlier_raw ~ "raw",
        is_outlier_log ~ "log",
        TRUE ~ "none"
      )
    ) |>
    ungroup() |>
    filter(type != "none") |>
    select(variable, value, log_value, type)

  return(outlier_data)
}

outliers_check <- detect_outliers(data = survey_df)
knitr::kable(
  head(outliers_check, 5), caption = 'Outlier Check',
  booktabs = TRUE
)
Outlier Check
variable value log_value type
Lcs_em_Migration 9999 9.21024 raw
Lcs_em_Begged 9999 9.21024 raw
Lcs_stress_Saving 9999 9.21024 raw
Lcs_stress_BorrowCash 9999 9.21024 raw
Lcs_crisis_ProdAssets 9999 9.21024 raw
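
The function above implements the first approach (mean plus or minus 3 standard deviations, on the raw and log scales). Here is a minimal sketch of the second approach, flagging values more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile.

detect_outliers_iqr <- function(data) {
  data |>
    # keep only numeric columns
    select(where(is.numeric)) |>
    # long format so every variable is handled the same way
    pivot_longer(everything(), names_to = "variable", values_to = "value") |>
    group_by(variable) |>
    mutate(
      q1  = quantile(value, 0.25, na.rm = TRUE),
      q3  = quantile(value, 0.75, na.rm = TRUE),
      iqr = q3 - q1,
      is_outlier = value < q1 - 1.5 * iqr | value > q3 + 1.5 * iqr
    ) |>
    ungroup() |>
    filter(is_outlier) |>
    select(variable, value)
}

outliers_check_iqr <- detect_outliers_iqr(data = survey_df)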

2.3 Specific Checks

This is the powerhouse of the concept: it is not possible to predefine every issue that needs to be inspected in the data, given the dynamics of different livelihood zones, socio-economic status and other attributes. This specific-checks unit therefore shows you how to identify and log all issues that need to be addressed or flagged. The log sheet function topic below will guide you through mastering the process.

2.3.1 Demographics Checks

2.3.2 Food Consumption Check

The calculate_fsl_indicators function is a powerful tool for computing essential food security and livelihood (FSL) indicators, including the food consumption score, household dietary diversity, reduced coping strategy, and livelihood coping strategy, from your raw data.

survey_df <- calculate_fsl_indicators(data = survey_df,
                                 # FCS
                                 FCSStap = "FCSStap", 
                                 FCSPulse = "FCSPulse", 
                                 FCSPr = "FCSPr", 
                                 FCSVeg = "FCSVeg", 
                                 FCSFruit = "FCSFruit",
                                 FCSDairy = "FCSDairy", 
                                 FCSFat = "FCSFat", 
                                 FCSSugar = "FCSSugar", 
                                 cutoff = "Cat28", 
                                 # rCSI
                                 rCSILessQlty = "rCSILessQlty", 
                                 rCSIBorrow = "rCSIBorrow", 
                                 rCSIMealSize = "rCSIMealSize", 
                                 rCSIMealAdult = "rCSIMealAdult", 
                                 rCSIMealNb = "rCSIMealNb",
                                 # HHS
                                 HHhSNoFood_FR = "HHhSNoFood_FR", 
                                 HHhSBedHung_FR = "HHhSBedHung_FR", 
                                 HHhSNotEat_FR = "HHhSNotEat_FR", 
                                 # HDDS
                                 # HDDSStapCer = "HDDSStapCer", 
                                 # HDDSStapRoot = "HDDSStapRoot", 
                                 # HDDSVeg = "HDDSVeg", 
                                 # HDDSFruit = "HDDSFruit", 
                                 # HDDSPrMeat = "HDDSPrMeat", 
                                 # HDDSPrEgg = "HDDSPrEgg", 
                                 # HDDSPrFish = "HDDSPrFish", 
                                 # HDDSPulse = "HDDSPulse", 
                                 # HDDSDairy = "HDDSDairy", 
                                 # HDDSFat = "HDDSFat", 
                                 # HDDSSugar = "HDDSSugar", 
                                 # HDDSCond = "HDDSCond"
                                 )
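
As a transparency check, the headline scores can be recomputed by hand from the raw columns. The sketch below uses the standard food group weights for the FCS and the standard severity weights for the rCSI; the *_manual column names are only illustrative and are assumed to mirror what the function computes.

# FCS: weighted sum of 7-day food group consumption frequencies (standard weights)
# rCSI: weighted sum of coping strategy frequencies (standard severity weights)
survey_df <- survey_df %>% 
  mutate(FCS_manual = FCSStap * 2 + FCSPulse * 3 + FCSVeg * 1 + FCSFruit * 1 +
                      FCSPr * 4 + FCSDairy * 4 + FCSFat * 0.5 + FCSSugar * 0.5,
         rCSI_manual = rCSILessQlty * 1 + rCSIBorrow * 2 + rCSIMealSize * 1 +
                       rCSIMealAdult * 3 + rCSIMealNb * 1)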

By incorporating ridge charts into your analysis, you can easily identify patterns and variations in FCS and rCSI across different clusters or field monitors.

(plot_ridge_distribution(survey_df, numeric_cols = c("FCSStap", "FCSPulse", "FCSPr", "FCSVeg", "FCSFruit", "FCSDairy", "FCSFat", "FCSSugar"),
                         name_groups = "Food Groups", name_units = "Days", grouping = "EnuName"))

By carefully examining and interpreting the ridge chart, you can gain valuable insights into the distributions and flag any inconsistency for validation and review during data collection. In the chart below, we group the reduced coping strategies at the area office level, using ADMIN1Name as the grouping variable.

(plot_ridge_distribution(survey_df, numeric_cols = c("rCSILessQlty", "rCSIBorrow", "rCSIMealSize", "rCSIMealAdult", "rCSIMealNb"),
                         name_groups = "Food Coping Strategy", name_units = "Days", grouping = "ADMIN1Name"))

Similarly, you can group the distributions at the field monitor level to gain more insight into the consistency of the reported distributions across different monitors.

(plot_ridge_distribution(survey_df, numeric_cols = c("rCSILessQlty", "rCSIBorrow", "rCSIMealSize", "rCSIMealAdult", "rCSIMealNb"),
                         name_groups = "Food Coping Strategy", name_units = "Days", grouping = "EnuName"))

2.4 Apply Cleaning Log

The logbook is the main powerhouse of the workflow: it helps you flag records based on the standard and specific checks, log them in one consistent format, and send them to the field team for review and feedback. Finally, the new values from the reviewed log replace the old values in order to generate the clean dataset. Using log_sheet() you can achieve all of this.
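
Here is a minimal sketch of what applying the log amounts to, assuming a logbook with uuid, question.name, new.value and an action column in which the value "change" marks entries to apply; the helper name and the "change" label are assumptions, not part of cleanR.

# replace old values with the reviewed new values, matching on uuid and question name
apply_cleaning_log <- function(data, log) {
  changes <- dplyr::filter(log, action == "change", !is.na(new.value))
  for (i in seq_len(nrow(changes))) {
    rows <- which(data$uuid == changes$uuid[i])
    # assigning a character value may coerce a numeric column to character,
    # so re-check column types afterwards (e.g. with readr::type_convert())
    data[[changes$question.name[i]]][rows] <- changes$new.value[i]
  }
  data
}

# cleaning_log is the reviewed logbook returned by the checks above (hypothetical name)
clean_df <- apply_cleaning_log(survey_df, cleaning_log)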

2.5 Spatial Verification

To ensure the accuracy and reliability of the data, we need to run spatial verification checks that determine whether each data point was collected in the right location. For instance, let’s assume we are doing a school feeding programme baseline and want to collect 2 surveys from each of 5 schools in two districts. Ten surveys were then uploaded to the MoDa server after 3 days, and we now need to check, first, whether the schools visited are the sampled ones and, second, whether the spatial distribution of the data also aligns with those schools.
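
Here is a minimal sketch of the distance part of such a check, assuming hypothetical gps_longitude and gps_latitude columns in the survey data, a school_id column shared with a sampled_schools data frame containing lon and lat, and a 500 m acceptance threshold.

library(geosphere)

# distance in metres between each survey point and the school it reports visiting
gps_check <- survey_df %>% 
  left_join(sampled_schools, by = "school_id") %>% 
  mutate(dist_to_school_m = distHaversine(cbind(gps_longitude, gps_latitude),
                                          cbind(lon, lat)),
         flag_wrong_location = dist_to_school_m > 500)  # threshold is an assumption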