1 Introduction

In this document, we build four models using the random forest algorithm. The program is organised as follows: Section 2 covers the preparation, Section 3 describes the process of building the models, and the last section extracts the top features for each model.

2 Preparation

2.1 Load libraries

Load all required libraries in this step. You may need to install some packages if this is your first time using them.

load.lib <- c("e1071","tidyr","dplyr","caret","randomForest","ggplot2","PRROC","klaR","questionr","kableExtra")
install.lib <- load.lib[!load.lib %in% installed.packages()]
for(lib in install.lib) install.packages(lib, dependencies = TRUE)
lapply(load.lib, require, character.only = TRUE)

select <- dplyr::select #ensure select refers to dplyr::select, as other loaded packages also define a select function

2.2 Data input

We import the two datasets used to build the final modelling dataset in this section. The first is the feature table for the full set of 1000 sample paragraphs. The second is the survey results table, from which we construct the labels for each paragraph. We then join the two tables and save the output for later use.

#text feature data - 1000 paragraphs
text_feature_1000 <- readRDS("./data_input/text_feature_base_1000.rds")
#survey results data - 833 paragraphs, 1965 records
survey_response_score <- read.csv("./data_input/survey_table_final.csv")

# merge the above two tables together to form the final data
## convert the paragraph text columns to character before merging
survey_response_score$para <- as.character(survey_response_score$para)
text_feature_1000$paragraph <- as.character(text_feature_1000$paragraph)

#construct the unique question_index column used to match the two tables
survey_response_score$question_index <- 
  paste(survey_response_score$survey_group, survey_response_score$question_group, survey_response_score$index.x, sep = "_")

#merge the two datasets
merge_data <- left_join(text_feature_1000, survey_response_score, by = "question_index")
para_check <- merge_data %>% select(question_index,paragraph, para) #generate this dataset to check whether the data merged properly

2.3 Model data

Prepare the model data by removing extra columns that capture paragraph information, as those variables are not used as inputs when building the models.

2.3.1 Data clean

Clean the input data by removing paragraphs that were not rated in the survey, as well as paragraphs with invalid survey ratings (for example, some respondents gave all paragraphs the same score).

1) Remove paragraphs without survey scores

#Remove paragraphs that were not rated
model_base_data <- merge_data %>% filter(!is.na(Question_Viewed))

#check that the number of records is 1965
print(paste("The number of survey response in the model_base_data table is: ", nrow(model_base_data),";", " and the number of record in the response table survey_response_score is: ", nrow(survey_response_score), ".", "The two numbers are supposed to be the same.", sep = ""))
## [1] "The number of survey response in the model_base_data table is: 1965; and the number of record in the response table survey_response_score is: 1965.The two numbers are supposed to be the same."

2) Remove invalid responses

Some respondents (for example, Respondent_ID=11309405502) gave the same score to all paragraphs. Those scores are invalid and thus are removed from the final model datasets.

model_base_data <- 
  model_base_data %>% filter(!is.na(content.scale) & ! is.na(clarity.scale)) %>% 
  arrange(Respondent_ID)

print(paste("The number of the model data using response record is: ", nrow(model_base_data), ".", sep = ""))
## [1] "The number of the model data using response record is: 1924."

3) Remove duplicated columns

Remove duplicated columns and transform the three categorical columns (word_pos.word1, word_pos.word2, word_pos.word3) into numeric ones.

##remove duplicated variables
model_base_data <- model_base_data %>% select(-word_pos)

## convert the categorical variables (word_pos.word1, word_pos.word2, word_pos.word3) into numeric ones
word_pos_1 <- model_base_data %>% 
  dplyr::select(question_index, word_pos.word1) %>% 
  group_by(question_index) %>% 
  dplyr::count(word_pos.word1) %>% 
  spread(word_pos.word1, n) %>% select(-'$') %>% unique()
colnames(word_pos_1)[2:ncol(word_pos_1)] <- paste("pos_word1",colnames(word_pos_1)[2:ncol(word_pos_1)], sep = "_")


word_pos_2 <- model_base_data %>% 
  dplyr::select(question_index, word_pos.word2) %>% 
  group_by(question_index) %>% 
  dplyr::count(word_pos.word2) %>% 
  spread(word_pos.word2, n) %>%
  ## note: the earlier count()/spread(..., freq) version was replaced on 28/10/2020
  ## because it stopped working due to conflicts with plyr (same change applied to
  ## word_pos_1 and word_pos_3)
  select(-',') 

colnames(word_pos_2)[2:ncol(word_pos_2)] <- paste("pos_word2",colnames(word_pos_2)[2:ncol(word_pos_2)], sep = "_")

word_pos_3 <- model_base_data %>% 
  dplyr::select(question_index, word_pos.word3) %>% 
  group_by(question_index) %>% 
  dplyr::count(word_pos.word3) %>% 
  spread(word_pos.word3, n) %>% 
  select(-',')

colnames(word_pos_3)[2:ncol(word_pos_3)] <- paste("pos_word3",colnames(word_pos_3)[2:ncol(word_pos_3)], sep = "_")

## remove the category variables, and then attach the numeric variables
model_base_data <- model_base_data %>% select(-word_pos.word1, -word_pos.word2, -word_pos.word3)

## attach the numeric variables back to the model base table
model_base_data <- model_base_data %>% 
  left_join(.,word_pos_1, by ="question_index") %>%
  left_join(.,word_pos_2, by ="question_index") %>%
  left_join(.,word_pos_3, by ="question_index")

## replace all NAs with 0
model_base_data[is.na(model_base_data)] <- 0 #replace NAs with 0s - only for numeric variables

#take a look at the model base data

A snapshot of the table is shown below.

model_base_data %>% head() %>% kbl() %>%
  kable_paper() %>%
  scroll_box(width = "100%", height = "200px")
(Preview output: the first six rows of model_base_data in a scrollable table, covering the paragraph identifiers and text, the text features (word and sentence counts, readability statistics, POS tag counts and proportions, discourse-clue and parse-tree counts), the survey response fields, and the derived pos_word1/pos_word2/pos_word3 indicator columns.)

2.3.2 Model data

Construct the modelling dataset by removing columns that will not be used as input variables in the final model, such as text source, year, and paragraph.

#remove columns that are not needed as model inputs
model_response_base1 <- model_base_data %>% 
  dplyr::select(-X, -index.x.x, -index.x.y, -index.y.x, -index.y.y, -year.x,
         -year.y, -month.y, -issue.y, -para, -source.y, -source_group.y,
         -survey_group.y, -question_group.y, -para_part_extraction,-rank)

print(paste("The number of the model data using response record is: ",
            nrow(model_response_base1), "." , sep = ""))
## [1] "The number of the model data using response record is: 1924."
model_para_base1 <- model_response_base1 %>% ungroup() %>% 
  group_by(question_index) %>% 
  dplyr::mutate(content.scale.avg = mean(content.scale), clarity.scale.avg = mean(clarity.scale)) %>%
  dplyr::select(-Respondent_ID, -economic_literacy, -economics_education,
         -education_index, -economic_job, -completion_flag, -Q_Index, -policy_score_job,
         -policy_score_2_dimension, -original_job_type, -record_score_by,
         -content.scale, -clarity.scale, -Clarity, -Content) %>% unique() 

print(paste("The number of the model data using unique paragraph as an entry is: ", nrow(model_para_base1), ".", sep = ""))
## [1] "The number of the model data using unique paragraph as an entry is: 825."
model_para_base1 %>% head() %>% kbl() %>%
  kable_paper() %>%
  scroll_box(width = "100%", height = "200px")
(Preview output: the first six rows of model_para_base1 in a scrollable table, with one row per unique paragraph; the averaged survey scores content.scale.avg and clarity.scale.avg appear in the final two columns.)

3 Models

Four models are generated in this section. They are:

  • Two reasoning (Content) models from economist and non-economist perspectives
  • Two readability (clarity) models from economist and non-economist perspectives

3.1 Data preparation for models

3.1.1 Set up threshold

As discussed in section 5.2 of the paper, we remove ambiguous paragraphs with standardised survey scores between -0.4 and 0.4 when building the models. Responses with scores equal to or higher than 0.4 are labeled as “high”, those rated equal to or lower than -0.4 are labeled as “low”, and paragraphs with ratings between -0.4 and 0.4 are labeled as “middle”.

high.cutoff = 0.4
low.cutoff = -0.4

3.1.2 Create target label

Construct two columns (content_label and clarity_label) for each paragraph. The number of paragraphs under each label is shown below:

model_para_base1 <- model_para_base1 %>% mutate(
  content_label= case_when(content.scale.avg >= high.cutoff ~ "high",
                           content.scale.avg <= low.cutoff ~ "low",
                           content.scale.avg < high.cutoff & content.scale.avg > low.cutoff ~ "middle"
                           ),
  clarity_label= case_when(clarity.scale.avg >= high.cutoff ~ "high",
                           clarity.scale.avg <= low.cutoff ~ "low",
                           clarity.scale.avg < high.cutoff & clarity.scale.avg > low.cutoff ~ "middle"
                           ))
#Check the count of paragraph by each label 
model_para_base1 %>% ungroup() %>% group_by(content_label) %>% dplyr::summarise(para_count = n()) %>% 
  kbl(caption = "Composition of sample pararaphs by reasoning (content) labels") %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Composition of sample pararaphs by reasoning (content) labels
content_label para_count
high 248
low 251
middle 326
#Check the count of paragraph by each label 
model_para_base1 %>% ungroup() %>% group_by(clarity_label) %>% dplyr::summarise(para_count = n()) %>% 
  kbl(caption = "Composition of sample pararaphs by readability (clarity) labels") %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Composition of sample pararaphs by readability (clarity) labels
clarity_label para_count
high 264
low 241
middle 320

3.1.3 Model data

Split the model data into two parts: paragraphs rated by economists and paragraphs rated by non-economists.

model_para_eco <- model_para_base1 %>% filter(unique_para_scored=="Economist"|unique_para_scored=="both") %>%
  select(-'0',-scoed_flag_final, -para_unique_index,-duplicate_flag,-Score_Times, -para_scored_by, -unique_para_scored,
         -Question_Viewed, -para_scored_by,
         -paragraph_clean, -question_group.x, -survey_group.x, -source_group.x, -source.x, -paragraph, -issue.x,
         -month.x,-word_choose)

model_para_noneco <- model_para_base1 %>% filter(unique_para_scored=="Non-economist"|unique_para_scored=="both") %>%
  select(-'0',-scoed_flag_final, -para_unique_index,-duplicate_flag,-Score_Times, -para_scored_by, -unique_para_scored,
         -Question_Viewed, -para_scored_by,
         -paragraph_clean, -question_group.x, -survey_group.x, -source_group.x, -source.x, -paragraph, -issue.x,
         -month.x,-word_choose)

#check the number

print(paste("The number of paragraphs that are rated by economists is: ", nrow(model_para_eco), sep = ""))
## [1] "The number of paragraphs that are rated by economists is: 518"
print(paste("The number of paragraphs that are rated by non-economists is: ", nrow(model_para_noneco), sep = ""))
## [1] "The number of paragraphs that are rated by non-economists is: 772"

3.2 Models

Four models are built separately using the random forest algorithm. To address over-fitting concerns, we train each model on 75 per cent of the sample data and use the remaining 25 per cent to evaluate model performance. Three metrics are reported for assessing performance: the confusion matrix, the ROC curve with its AUC value, and LogLoss.
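As an outline of this workflow, the sketch below shows how one such model could be fitted and evaluated against these three metrics with the packages loaded in section 2.1, using the training/validation split constructed in section 3.2.1.1. It is a minimal illustration rather than the exact modelling code used in the subsections below; object names such as rf_fit, val_pred and val_prob are placeholders.

# minimal sketch, not the exact code of the four models below:
# fit one RF model with 5-fold cross-validation and evaluate it on the hold-out set
set.seed(1234)
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
rf_fit <- train(target ~ ., data = training, method = "rf",
                metric = "ROC", trControl = ctrl)

# 1) confusion matrix on the 25 per cent hold-out set
val_pred <- predict(rf_fit, newdata = validation)
confusionMatrix(val_pred, validation$target)

# 2) ROC curve and AUC (PRROC), using the predicted probability of the "high" class
val_prob <- predict(rf_fit, newdata = validation, type = "prob")[, "high"]
roc_obj <- roc.curve(scores.class0 = val_prob[validation$target == "high"],
                     scores.class1 = val_prob[validation$target == "low"],
                     curve = TRUE)
plot(roc_obj)

# 3) LogLoss on the hold-out set, computed directly
eps <- 1e-15
p <- pmin(pmax(val_prob, eps), 1 - eps)
y <- as.numeric(validation$target == "high")
-mean(y * log(p) + (1 - y) * log(1 - p))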

3.2.1 Model 1 Economists - Reasoning

3.2.1.1 Prepare model data

Prepare the model data for building the economist reasoning (content) model.

function_model_data_process <- function(model_data){
  # convert the target to a factor and make column names model-friendly
  model_data$target <- as.factor(model_data$target)
  names(model_data) <- gsub(x = names(model_data), pattern ='[$]', replacement = 'ds')
  names(model_data) <- gsub(x = names(model_data), pattern ='[-]', replacement = 'dash')
  # drop the punctuation-only column if it is present
  if ("sent_1st_word_," %in% colnames(model_data)) {
    model_data <- model_data %>% dplyr::select(-'sent_1st_word_,')
  }
  model_data
}
# step 1: split data into training and validation datasets
model_para_eco$index <- row.names(model_para_eco)

model_data2 <- model_para_eco %>% filter(content_label !="middle") %>% 
  ungroup() %>%
  mutate(target=case_when(content_label=="high" ~ "high",content_label=="low" ~ "low")) %>%
  select(-question_index, -index, -content.scale.avg, -clarity.scale.avg, -clarity_label, -content_label) %>% 
  unique()

model_data <- function_model_data_process(model_data2)

set.seed(1234)
validationIndex <- createDataPartition(model_data$target, p=0.75, list=FALSE) 
validation <- model_data[-validationIndex,] 
training <- model_data[validationIndex,] 

print(paste("For whole dataset, the number of rows is: ", nrow(model_data), "and columns is: ", ncol(model_data)))
## [1] "For whole dataset, the number of rows is:  295 and columns is:  273"
print(paste("For training data, the number of rows is: ", nrow(training), "and columns is: ", ncol(training)))
## [1] "For training data, the number of rows is:  222 and columns is:  273"
print(paste("For validation data, the number of rows is: ", nrow(validation), "and columns is: ", ncol(validation)))
## [1] "For validation data, the number of rows is:  73 and columns is:  273"
training %>% head() %>% kbl() %>%
  kable_paper() %>%
  scroll_box(width = "100%", height = "200px")
(Preview output: the head of the training data in a scrollable table, showing the cleaned feature columns (with $ replaced by ds and - replaced by dash in column names) and the target label in the final column.)
201 8 343 41 14.345068 36.96596 10 14 27 3 7 19 44 19 2 0 48 20 0 1 8 1 1 4 3 5 3 12 5 2 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1.44 3.35 9.09 21.05 9.09 0.96 0.00 22.97 9.57 0.00 0.48 3.83 0.48 0.48 1.91 1.44 2.39 1.44 5.74 2.39 0.96 0.00 0.48 0.00 0.00 0.48 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 10 5 2 1 0 2 0 0 0 0 0 0 0 0 9 3 4 2 0 7 0 0 0 0 0 0 3 6 70 31 20 6 0 30 0 1 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 low
170 7 277 26 13.108487 44.33676 8 10 0 1 0 21 26 13 7 2 37 9 5 1 10 0 0 7 7 0 3 5 3 10 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0.59 0.00 12.35 15.29 7.65 4.12 1.18 21.76 5.29 2.94 0.59 5.88 0.00 0.00 4.12 4.12 0.00 1.76 2.94 1.76 5.88 0.59 0.00 0.59 0.59 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 5 1 0 0 1 0 0 0 0 0 0 0 2 12 5 3 2 0 6 0 1 0 0 0 0 2 8 60 23 19 9 0 33 1 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 high

3.2.1.2 Build a basic model

Build the basic model without tuning hyperparameters. The model performance is evaluated using a confusion matrix and the ROC-AUC curve.

# Step 2: build the basic model
set.seed(226) 
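# Note: mtry below is the rounded square root of the number of columns in the training
# data, close to the conventional sqrt(p) default for classification forests.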
baseModel <- randomForest(target~., training, mtry=round(sqrt(ncol(training)),0), ntree=300) #no need to change cut-off here as we can change it in the prediction stage.
finalPredictions <- predict(baseModel, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low
##       high   32  13
##       low     5  23
##                                           
##                Accuracy : 0.7534          
##                  95% CI : (0.6386, 0.8468)
##     No Information Rate : 0.5068          
##     P-Value [Acc > NIR] : 1.411e-05       
##                                           
##                   Kappa : 0.5053          
##                                           
##  Mcnemar's Test P-Value : 0.09896         
##                                           
##             Sensitivity : 0.8649          
##             Specificity : 0.6389          
##          Pos Pred Value : 0.7111          
##          Neg Pred Value : 0.8214          
##              Prevalence : 0.5068          
##          Detection Rate : 0.4384          
##    Detection Prevalence : 0.6164          
##       Balanced Accuracy : 0.7519          
##                                           
##        'Positive' Class : high            
## 
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T)

3.2.1.3 Build the final model

Build the final model using the optimal variables and hyperparameters. The model performance is evaluated using a confusion matrix and the ROC-AUC curve.

#Step 3: build the final model using optimal variables and hyperparameters

opt_variable <- read.csv("./data_input/optvariables_para_eco_content0.4.csv")
opt_var <- opt_variable$x
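# The optimal-variable list is read from a pre-computed csv. Purely as a hypothetical
# illustration (not the code that produced that file), such a list could be generated
# with recursive feature elimination from caret, for example:
#   ctrl    <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
#   rfe_fit <- rfe(x = training %>% dplyr::select(-target), y = training$target,
#                  sizes = c(10, 20, 35, 65), rfeControl = ctrl)
#   write.csv(data.frame(x = predictors(rfe_fit)), "optvariables_example.csv", row.names = FALSE)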
training_short <- training[colnames(training) %in% opt_var] %>% as.data.frame()
training_short$target <- training$target

##Final model
set.seed(7) 
baseModel <- randomForest(target~., training_short, mtry=round(sqrt(ncol(training)),0), ntree=100) #no need to change cut-off here as we can change it in the prediction stage.

finalModel1 <- readRDS("./data_input/model/para_eco_content_final_model.rda") #import the pre-saved model so the confusion matrix results are the same as those shown in the paper
finalPredictions <- predict(finalModel1, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low
##       high   33  13
##       low     4  23
##                                           
##                Accuracy : 0.7671          
##                  95% CI : (0.6535, 0.8581)
##     No Information Rate : 0.5068          
##     P-Value [Acc > NIR] : 4.513e-06       
##                                           
##                   Kappa : 0.5326          
##                                           
##  Mcnemar's Test P-Value : 0.05235         
##                                           
##             Sensitivity : 0.8919          
##             Specificity : 0.6389          
##          Pos Pred Value : 0.7174          
##          Neg Pred Value : 0.8519          
##              Prevalence : 0.5068          
##          Detection Rate : 0.4521          
##    Detection Prevalence : 0.6301          
##       Balanced Accuracy : 0.7654          
##                                           
##        'Positive' Class : high            
## 
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

# saveRDS(finalModel1, "para_eco_content_final_model.rda")

# eco_content_roc <- data.frame(PRROC_obj$curve) 
# colnames(eco_content_roc) <- c("False Positive Rate","True Positive Rate","threshold")
# write.csv(eco_content_roc, "G:/Research/JoanH/20181022_SMP Project/RDP/graphit/data/eco_content_roc_data.csv")


##calculate the logloss as an additional performance metric
ModelMetrics::logLoss(validation$target,validation$rf_final)
## [1] 0.7545834
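As a cross-check, the binary cross-entropy underlying this metric can also be computed directly. The sketch below is illustrative only: it assumes "high" is coded as the positive class and clips probabilities to avoid log(0), so the value will differ from the one above if the factor levels are encoded the other way around.

y <- as.numeric(validation$target == "high")            # 1 = high, 0 = low
p <- pmin(pmax(validation$rf_final, 1e-15), 1 - 1e-15)  # clip predicted probabilities
-mean(y * log(p) + (1 - y) * log(1 - p))                # binary cross-entropy (log loss)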

3.2.1.4 Algorithm comparison

Compare model performance on the validation dataset across different algorithms, including Naive Bayes, GBM, GLMnet, GLM and SVM. We set the chunk option 'eval=FALSE' for this chunk to suppress its evaluation when the document is knitted, so the code below is displayed but not run.

Test model performance using different algorithms.

### 1) Model using NB - added on 22/12/2020 to address Callan Winsor's comments
set.seed(7) 


testmodelnb <- train(target ~., data= training_short, method="nb")
finalPredictions <- predict(testmodelnb, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$nb_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$nb_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$nb_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

###2) Model using GBM
set.seed(7) 
testmodelGbm <- train(target ~., data= training_short, method="gbm")
finalPredictions <- predict(testmodelGbm, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$gbm_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$gbm_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$gbm_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 


### 3) Model using GLMnet
set.seed(7) 
testmodelglm <- train(target ~ ., training_short, method="glmnet",
                      trControl=trainControl(classProbs=TRUE))

finalPredictions <- predict(testmodelglm, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$glm_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$glm_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$glm_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

###4) Model using svm
set.seed(7) 
testmodelsvm <- train(target ~ ., training_short, method="svmRadial",
                      trControl=trainControl(classProbs=TRUE))
finalPredictions <- predict(testmodelsvm, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$svm_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$svm_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$svm_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 


### 5) Logistic model using glm
#### All variables included - check the performance of the logistic model using all variables
set.seed(7) 
testmodelglmnormal <- train(target ~ ., training_short, method="glm",
                      trControl=trainControl(classProbs=TRUE))
finalPredictions <- predict(testmodelglmnormal, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$glmnormal <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$glmnormal>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$glmnormal, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 


#### 6) Logistic model with FK only - check the performance results using the logistic model with the FK variable only
set.seed(7) 
testmodelglmfk <- train(target ~ fk_grade_level, training_short, method="glm",
                      trControl=trainControl(classProbs=TRUE))
finalPredictions <- predict(testmodelglmfk, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$glmfk <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$glmfk>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$glmfk, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

3.2.2 Model 2 Non-Economists - Content

Build the content model from the non-economist perspective.

3.2.2.1 Data and models

#step 1 prepare dataset
model_para_noneco$index <- row.names(model_para_noneco)

model_data2 <- model_para_noneco %>% filter(content_label !="middle") %>% ungroup() %>%
  mutate(target=case_when(content_label=="high" ~ "high",
                          content_label=="low" ~ "low")) %>% 
  dplyr::select(-question_index, -index, -content.scale.avg, -clarity.scale.avg, -clarity_label, -content_label)

model_data <- function_model_data_process(model_data2)

set.seed(1234)
validationIndex <- createDataPartition(model_data$target, p=0.75, list=FALSE) 
validation <- model_data[-validationIndex,] 
training <- model_data[validationIndex,] 
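# createDataPartition samples within the levels of target, so the high/low class
# proportions are approximately preserved in both the training and validation sets.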

print(paste("For whole dataset, the number of rows is: ", nrow(model_data), "and columns is: ", ncol(model_data)))
## [1] "For whole dataset, the number of rows is:  457 and columns is:  273"
print(paste("For training data, the number of rows is: ", nrow(training), "and columns is: ", ncol(training)))
## [1] "For training data, the number of rows is:  344 and columns is:  273"
print(paste("For validation data, the number of rows is: ", nrow(validation), "and columns is: ", ncol(validation)))
## [1] "For validation data, the number of rows is:  113 and columns is:  273"
# #step 2: build the basic RF model with default value 
# set.seed(7) 
# 
# baseModel <- randomForest(target~., training, mtry=round(sqrt(ncol(training)),0), ntree=500) #no need to change cut-off here as we can change it in the prediction stage.
# finalPredictions <- predict(baseModel, validation, type = "Prob") %>% as.data.frame()
# validation$rf_final <- finalPredictions$high
# #set the threshold for building confusion matrix table
# validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
# confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
# 
# #ROC curve
# PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
# PRROC_obj$auc <- round(PRROC_obj$auc,3)
# plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 
# 
# saveRDS(baseModel, "para_noneco_content_base_model.rda")

# Step 3: using the optimal variables for the final model

opt_variable <- read.csv("./data_input/optvariables_para_noneco_content0.4.csv")
opt_var <- opt_variable$x[1:65]
training_short <- training[colnames(training) %in% opt_var] %>% as.data.frame()
training_short$target <- training$target
set.seed(1234) 
baseModel <- randomForest(target~., training_short, mtry=round(sqrt(ncol(training)),0), ntree=300) #no need to change cut-off here as we can change it in the prediction stage.

finalModel2 <- readRDS("./data_input/model/para_noneco_content_final_model.rda") #import the pre-saved model so the confusion matrix results are the same as those shown in the paper

finalPredictions <- predict(finalModel2, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low
##       high   41  18
##       low    16  38
##                                           
##                Accuracy : 0.6991          
##                  95% CI : (0.6057, 0.7818)
##     No Information Rate : 0.5044          
##     P-Value [Acc > NIR] : 2.081e-05       
##                                           
##                   Kappa : 0.398           
##                                           
##  Mcnemar's Test P-Value : 0.8638          
##                                           
##             Sensitivity : 0.7193          
##             Specificity : 0.6786          
##          Pos Pred Value : 0.6949          
##          Neg Pred Value : 0.7037          
##              Prevalence : 0.5044          
##          Detection Rate : 0.3628          
##    Detection Prevalence : 0.5221          
##       Balanced Accuracy : 0.6989          
##                                           
##        'Positive' Class : high            
## 
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

# saveRDS(finalModel2, "para_noneco_content_final_model.rda")

# noneco_content_roc_dt <- data.frame(PRROC_obj$curve)
# colnames(noneco_content_roc_dt)<- c("False Positive Rate",
#                                     "True Positive Rate","threshold")
# 
# write.csv(noneco_content_roc_dt,"G:/Research/JoanH/20181022_SMP Project/RDP/graphit/data/noneco_content_roc_data.csv")
##calculate the logloss as an additional performance metric
ModelMetrics::logLoss(validation$target,validation$rf_final)
## [1] 0.8198992

3.2.2.2 Model comparison

Test the model performance on the validation dataset using different algorithms.

## 1) NB
testmodelnb <- train(target ~., data= training_short, method="nb")
finalPredictions <- predict(testmodelnb, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$nb_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$nb_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$nb_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 


##2) GBM
set.seed(1234)
testmodelGbm <- train(target ~., data= training_short, method="gbm") #no need to change cut-off here as we can change it in the prediction stage.

finalPredictions <- predict(testmodelGbm, validation, type = "prob") %>% as.data.frame()
validation$gbm_final <- finalPredictions$high
validation$predict_label <- ifelse(validation$gbm_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$gbm_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

### 3) GLM

set.seed(1234)
testmodelGlm <- train(target ~., data= training_short, method="glm") #no need to change cut-off here as we can change it in the prediction stage.

finalPredictions <- predict(testmodelGlm, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$glm_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$glm_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$glm_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

###4) Model using svm
set.seed(1234)
testmodelsvm <- train(target ~., data= training_short, method="svmRadial",
                      trControl=trainControl(classProbs=TRUE)) #no need to change cut-off here as we can change it in the prediction stage.

finalPredictions <- predict(testmodelsvm, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$svm_final <- finalPredictions$high

#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$svm_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$svm_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

###5) Model using glmnet
set.seed(1234)
testmodelglmnet <- train(target ~., data= training_short, method="glmnet",
                      trControl=trainControl(classProbs=TRUE)) #no need to change cut-off here as we can change it in the prediction stage.

finalPredictions <- predict(testmodelglmnet, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$glmnet_final <- finalPredictions$high

#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$glmnet_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

PRROC_obj <- roc.curve(scores.class0 = validation$glmnet_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

###6) GLM with FK

set.seed(1234)
testmodelglmfk <- train(target ~ fk_grade_level, data= training, method="glm",
                      trControl=trainControl(classProbs=TRUE)) #no need to change cut-off here as we can change it in the prediction stage.

finalPredictions <- predict(testmodelglmfk, validation, type = "prob") %>% as.data.frame()
#ROC curve
validation$glmfk_final <- finalPredictions$high
validation$predict_label <- ifelse(validation$glmfk_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
PRROC_obj <- roc.curve(scores.class0 = validation$glmfk_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

3.2.3 Model 3 Economist - Readability

Build the readability model from the economist perspective.

#Step 1: model dataset preparation

model_para_eco$index <- row.names(model_para_eco)
model_data3 <- model_para_eco %>% filter(clarity_label !="middle") %>% ungroup() %>%
  mutate(target=case_when(clarity_label=="high" ~ "high",
                          clarity_label=="low" ~ "low")) %>%
  dplyr::select(-question_index, -index, -content.scale.avg, -clarity.scale.avg, -clarity_label, -content_label)

model_data <- function_model_data_process(model_data3)

set.seed(1234)
validationIndex <- createDataPartition(model_data$target, p=0.75, list=FALSE) 
validation <- model_data[-validationIndex,] 
training <- model_data[validationIndex,] 

print(paste("For whole dataset, the number of rows is: ", nrow(model_data), "and columns is: ", ncol(model_data)))
## [1] "For whole dataset, the number of rows is:  306 and columns is:  273"
print(paste("For training data, the number of rows is: ", nrow(training), "and columns is: ", ncol(training)))
## [1] "For training data, the number of rows is:  230 and columns is:  273"
print(paste("For validation data, the number of rows is: ", nrow(validation), "and columns is: ", ncol(validation)))
## [1] "For validation data, the number of rows is:  76 and columns is:  273"
#step 2: basic model
# set.seed(1234) 
# 
# baseModel <- randomForest(target~., training, mtry=round(sqrt(ncol(training)),0), ntree=300) #no need to change cut-off here as we can change it in the prediction stage.
 
# finalPredictions <- predict(finalModel3, validation, type = "Prob") %>% as.data.frame()
# validation$rf_final <- finalPredictions$high
# #set the threshold for building confusion matrix table
# validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
# confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
# 
# #ROC curve
# PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
# PRROC_obj$auc <- round(PRROC_obj$auc,3)
# plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 
# 

## using the optimal variables for the final model

opt_variable <- read.csv("./data_input/optvariables_para_eco_clarity0.4.csv")
opt_var <- opt_variable$x[1:35]

training_short <- training[colnames(training) %in% opt_var] %>% as.data.frame()
training_short$target <- training$target

set.seed(1234) 
baseModel <- randomForest(target~., training_short, mtry=round(sqrt(ncol(training)),0), ntree=100) #no need to change cut-off here as we can change it in the prediction stage.
#saveRDS(baseModel, "para_eco_clarity_base_model.rda")

finalModel3 <- readRDS("./data_input/model/para_eco_clarity_final_model.rda") #import the pre-saved model so the confusion matrix results are the same as those shown in the paper

finalPredictions <- predict(finalModel3, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low
##       high   28  11
##       low    10  27
##                                           
##                Accuracy : 0.7237          
##                  95% CI : (0.6091, 0.8201)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 6.031e-05       
##                                           
##                   Kappa : 0.4474          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7368          
##             Specificity : 0.7105          
##          Pos Pred Value : 0.7179          
##          Neg Pred Value : 0.7297          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3684          
##    Detection Prevalence : 0.5132          
##       Balanced Accuracy : 0.7237          
##                                           
##        'Positive' Class : high            
## 
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

##calculate the logloss as an additional performance metric
ModelMetrics::logLoss(validation$target,validation$rf_final)
## [1] 0.8000629

3.2.4 Model 4 Non-Economists - Readability

Build the readability model from the non-economist perspective.

# step 1 prepare dataset
model_para_noneco$index <- row.names(model_para_noneco)
model_data2 <- model_para_noneco %>% filter(clarity_label !="middle") %>% ungroup() %>%
  mutate(target=case_when(clarity_label=="high" ~ "high",
                          clarity_label=="low" ~ "low")) %>%
  dplyr::select(-question_index, -index, -content.scale.avg, -clarity.scale.avg, -clarity_label, -content_label)

model_data <- function_model_data_process(model_data2)

set.seed(1234)
validationIndex <- createDataPartition(model_data$target, p=0.75, list=FALSE) 
validation <- model_data[-validationIndex,] 
training <- model_data[validationIndex,] 

print(paste("For whole dataset, the number of rows is: ", nrow(model_data), "and columns is: ", ncol(model_data)))
## [1] "For whole dataset, the number of rows is:  462 and columns is:  273"
print(paste("For training data, the number of rows is: ", nrow(training), "and columns is: ", ncol(training)))
## [1] "For training data, the number of rows is:  347 and columns is:  273"
print(paste("For validation data, the number of rows is: ", nrow(validation), "and columns is: ", ncol(validation)))
## [1] "For validation data, the number of rows is:  115 and columns is:  273"
# step 2 build the basic model using default settings
# set.seed(226) 
# 
# training1 <- training %>% dplyr::select(-FRES_score) #remove FRES_score as it is correlated with FK
# 
# baseModel <- randomForest(target~., training1, mtry=round(sqrt(ncol(training)),0), ntree=300) #no need to change cut-off here as we can change it in the prediction stage.
# finalPredictions <- predict(baseModel, validation, type = "Prob") %>% as.data.frame()
# validation$rf_final <- finalPredictions$high
# #set the threshold for building confusion matrix table
# validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
# confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
# 
# #ROC curve
# PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
# PRROC_obj$auc <- round(PRROC_obj$auc,3)
# plot(PRROC_obj, 
#      xlab="False Positive Rate (1 - Specificity)", 
#      ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

# saveRDS(baseModel, "para_noneco_clarity_base_model.rda")

#Step 3 Using the optimal variables for the final model

opt_variable <- read.csv("./data_input/optvariables_para_noneco_clarity0.4.csv")
opt_var <- opt_variable$x
training_short <- training[colnames(training) %in% opt_var] %>% as.data.frame()
training_short$target <- training$target

set.seed(789) 
baseModel <- randomForest(target~., training_short, mtry=round(sqrt(ncol(training_short)),0), ntree=250) #no need to change cut-off here as we can change it in the prediction stage.
finalModel4 <- readRDS("./data_input/model/para_noneco_clarity_final_model.rda")


finalPredictions <- predict(finalModel4, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building confusion matrix table
validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low
##       high   48  27
##       low    13  27
##                                           
##                Accuracy : 0.6522          
##                  95% CI : (0.5577, 0.7386)
##     No Information Rate : 0.5304          
##     P-Value [Acc > NIR] : 0.005494        
##                                           
##                   Kappa : 0.2912          
##                                           
##  Mcnemar's Test P-Value : 0.039833        
##                                           
##             Sensitivity : 0.7869          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.6400          
##          Neg Pred Value : 0.6750          
##              Prevalence : 0.5304          
##          Detection Rate : 0.4174          
##    Detection Prevalence : 0.6522          
##       Balanced Accuracy : 0.6434          
##                                           
##        'Positive' Class : high            
## 
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
PRROC_obj$auc <- round(PRROC_obj$auc,3)
plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T) 

##calculate the logloss as an additional performance metric
ModelMetrics::logLoss(validation$target,validation$rf_final)
## [1] 0.6052968

4 Feature Importance

In this section, we extract the top features for each model using the feature importance function. The output is discussed in Section 6.3 of the paper. A snapshot of the results is shown in the chart below.

## Read models first
model_eco_content <- readRDS("./data_input/model/para_eco_content_final_model.rda")
model_eco_clarity <- readRDS("./data_input/model/para_eco_clarity_final_model.rda")
model_noneco_content <- readRDS("./data_input/model/para_noneco_content_final_model.rda")
model_noneco_clarity <- readRDS("./data_input/model/para_noneco_clarity_final_model.rda")

## Extract feature list

feature_list_table <- function(data_input){
  table_feature <- as.data.frame(data_input$importance)
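  # For the randomForest objects loaded here, $importance holds the Gini-based
  # importance values (column MeanDecreaseGini), which the table is ranked on below.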
  table_feature$feature <- row.names(table_feature)
  table_feature <- table_feature %>% arrange(desc(MeanDecreaseGini)) %>% dplyr::select(feature,MeanDecreaseGini)
  table_feature
}

## Generate the feature ranking table for each model
model_eco_content_feature <- 
  feature_list_table(model_eco_content) %>% mutate(model_name = "model_eco_content")
model_eco_clarity_feature <- 
  feature_list_table(model_eco_clarity) %>% mutate(model_name = "model_eco_clarity")
model_noneco_content_feature <- 
  feature_list_table(model_noneco_content) %>% mutate(model_name ="model_noneco_content")
model_noneco_clarity_feature <- 
  feature_list_table(model_noneco_clarity) %>% mutate(model_name ="model_noneco_clarity")

## combine the feature table for graphs

feature_graph_dt <-
  rbind(
    model_eco_content_feature[1:5,],
    model_eco_clarity_feature[1:5,],
    model_noneco_content_feature[1:5,],
    model_noneco_clarity_feature[1:5,])


library(ggplot2)
# Basic barplot
p<- 
  ggplot(data=feature_graph_dt, aes(x=feature, y=MeanDecreaseGini)) +
  geom_bar(stat="identity", fill = "blue") +
  xlab("Feature List") +
  ylab ("Importance") +
  coord_flip() +
  facet_wrap(~ model_name, ncol=2)
p
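By default ggplot2 orders the bars alphabetically by feature name within each panel. If an importance-ordered chart is preferred, a purely cosmetic variation (not part of the original output) is to reorder the feature axis by MeanDecreaseGini:

p_ordered <-
  ggplot(data=feature_graph_dt, aes(x=reorder(feature, MeanDecreaseGini), y=MeanDecreaseGini)) +
  geom_col(fill = "blue") +
  xlab("Feature List") +
  ylab ("Importance") +
  coord_flip() +
  facet_wrap(~ model_name, ncol=2)
p_ordered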

5 Session information

The session information for this program is shown below.

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4    questionr_0.7.4     klaR_0.6-15        
##  [4] MASS_7.3-53         PRROC_1.3.1         randomForest_4.6-14
##  [7] caret_6.0-86        ggplot2_3.3.3       lattice_0.20-41    
## [10] dplyr_1.0.6         tidyr_1.1.3         e1071_1.7-6        
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2           viridisLite_0.4.0    splines_4.0.3       
##  [4] foreach_1.5.1        prodlim_2019.11.13   shiny_1.6.0         
##  [7] assertthat_0.2.1     highr_0.9            stats4_4.0.3        
## [10] yaml_2.2.1           ipred_0.9-11         pillar_1.6.0        
## [13] glue_1.4.2           pROC_1.17.0.1        digest_0.6.27       
## [16] promises_1.2.0.1     rvest_1.0.0          colorspace_2.0-1    
## [19] recipes_0.1.16       htmltools_0.5.1.1    httpuv_1.6.1        
## [22] Matrix_1.2-18        plyr_1.8.6           timeDate_3043.102   
## [25] pkgconfig_2.0.3      labelled_2.8.0       haven_2.4.1         
## [28] purrr_0.3.4          xtable_1.8-4         webshot_0.5.2       
## [31] scales_1.1.1         svglite_2.0.0        later_1.2.0         
## [34] gower_0.2.2          lava_1.6.9           tibble_3.1.1        
## [37] proxy_0.4-25         combinat_0.0-8       farver_2.1.0        
## [40] generics_0.1.0       ellipsis_0.3.2       withr_2.4.2         
## [43] nnet_7.3-14          survival_3.2-7       magrittr_2.0.1      
## [46] crayon_1.4.1         mime_0.10            evaluate_0.14       
## [49] fansi_0.4.2          nlme_3.1-149         xml2_1.3.2          
## [52] forcats_0.5.1        class_7.3-17         tools_4.0.3         
## [55] data.table_1.14.0    hms_1.0.0            lifecycle_1.0.0     
## [58] stringr_1.4.0        munsell_0.5.0        compiler_4.0.3      
## [61] systemfonts_1.0.1    rlang_0.4.11         grid_4.0.3          
## [64] iterators_1.0.13     rstudioapi_0.13      miniUI_0.1.1.1      
## [67] labeling_0.4.2       rmarkdown_2.8        gtable_0.3.0        
## [70] ModelMetrics_1.2.2.2 codetools_0.2-16     DBI_1.1.1           
## [73] reshape2_1.4.4       R6_2.5.0             lubridate_1.7.10    
## [76] knitr_1.33           fastmap_1.1.0        utf8_1.2.1          
## [79] stringi_1.5.3        Rcpp_1.0.6           vctrs_0.3.8         
## [82] rpart_4.1-15         tidyselect_1.1.1     xfun_0.22