In this code, we build four models using the random forest algorithm. The program is organised as follows: Section 2 covers the preparation, Section 3 describes the model-building process, and the final section extracts the top features for each model.
Load all libraries in this step. You may need to install some packages if this is your first time using them.
<-c("e1071","tidyr","dplyr","caret", "randomForest","ggplot2","PRROC", "klaR","questionr","kableExtra")
load.lib<-load.lib[!load.lib %in% installed.packages()]
install.lib for(lib in install.lib) install.packages(lib,dependencies=TRUE)
lapply(load.lib,require,character.only = TRUE)
<- dplyr::select select
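A note on the last line: select is pinned to the dplyr version, presumably because other attached packages (for example MASS, a dependency of klaR) also export a select function that would otherwise mask dplyr's.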
In this section we import the two datasets used to build the final modelling dataset. The first is the feature table for all 1,000 sample paragraphs. The second is the survey results table, from which we construct the labels for each paragraph. We then join the two tables and save the output for later use.
#text feature data - 1000 paragraphs
text_feature_1000 <- readRDS("./data_input/text_feature_base_1000.rds")

#survey results data - 833 paragraphs, 1655 records
survey_response_score <- read.csv("./data_input/survey_table_final.csv")
# merge the above two tables together to form the final data
## create a para index by extracting the first 10 words from each paragraph
survey_response_score$para <- as.character(survey_response_score$para)
text_feature_1000$paragraph <- as.character(text_feature_1000$paragraph)

# use the unique question_index column to match the data
survey_response_score$question_index <-
  paste(survey_response_score$survey_group, survey_response_score$question_group,
        survey_response_score$index.x, sep = "_")

# merge the two datasets
merge_data <- left_join(text_feature_1000, survey_response_score, by = "question_index")
para_check <- merge_data %>% select(question_index, paragraph, para) # generate this dataset to check that the data merged properly
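One simple way to use para_check (a sketch, not part of the original code) is to count the records where the join found no matching survey response, since a failed left join leaves NA in the survey-side columns:

# illustrative check: records where the join failed have NA in `para`
sum(is.na(para_check$para))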
Prepare the model data by removing the extra columns that capture paragraph information, as those variables are not used as inputs when building the models.
Clean the input data by removing paragraphs that were not rated in the survey, as well as paragraphs with invalid survey ratings (for example, some respondents gave every paragraph the same score).
#Remove paragraphs that were not rated
model_base_data <- merge_data %>% filter(!is.na(Question_Viewed))
#check that the number of records is 1965
print(paste("The number of survey response in the model_base_data table is: ", nrow(model_base_data),";", " and the number of record in the response table survey_response_score is: ", nrow(survey_response_score), ".", "The two numbers are supposed to be the same.", sep = ""))
## [1] "The number of survey response in the model_base_data table is: 1965; and the number of record in the response table survey_response_score is: 1965.The two numbers are supposed to be the same."
Some respondents (for example, Respondent_ID=11309405502) gave the same score to all paragraphs. These scores are invalid and are therefore removed from the final model datasets.
model_base_data <- model_base_data %>%
  filter(!is.na(content.scale) & !is.na(clarity.scale)) %>%
  arrange(Respondent_ID)
print(paste("The number of the model data using response record is: ", nrow(model_base_data), ".", sep = ""))
## [1] "The number of the model data using response record is: 1924."
Remove duplicated columns, and convert the three categorical columns (word_pos.word1, word_pos.word2, word_pos.word3) into numeric indicator columns.
##remove duplicated variables
model_base_data <- model_base_data %>% select(-word_pos)

## change the categorical variables (word_pos.word1, word_pos.word2, word_pos.word3) into numeric ones
word_pos_1 <- model_base_data %>%
  dplyr::select(question_index, word_pos.word1) %>%
  group_by(question_index) %>%
  dplyr::count(word_pos.word1) %>%
  spread(word_pos.word1, n) %>% select(-'$') %>% unique()
colnames(word_pos_1)[2:ncol(word_pos_1)] <- paste("pos_word1", colnames(word_pos_1)[2:ncol(word_pos_1)], sep = "_")
word_pos_2 <- model_base_data %>%
  dplyr::select(question_index, word_pos.word2) %>%
  group_by(question_index) %>%
  dplyr::count(word_pos.word2) %>%
  spread(word_pos.word2, n) %>%
  # count() %>%
  # spread(word_pos.word2, freq) %>% ## commented out this part of code as it doesn't work anymore due to conflicts with plyr on 28/10/2020 (same changes applied to code word_pos_1 and word_pos_3 too!)
  select(-',')
colnames(word_pos_2)[2:ncol(word_pos_2)] <- paste("pos_word2", colnames(word_pos_2)[2:ncol(word_pos_2)], sep = "_")
word_pos_3 <- model_base_data %>%
  dplyr::select(question_index, word_pos.word3) %>%
  group_by(question_index) %>%
  dplyr::count(word_pos.word3) %>%
  spread(word_pos.word3, n) %>%
  # count() %>%
  # spread(word_pos.word3, freq) %>%
  select(-',')
colnames(word_pos_3)[2:ncol(word_pos_3)] <- paste("pos_word3", colnames(word_pos_3)[2:ncol(word_pos_3)], sep = "_")
## remove the categorical variables
model_base_data <- model_base_data %>% select(-word_pos.word1, -word_pos.word2, -word_pos.word3)

## attach the numeric variables back to the model base table
model_base_data <- model_base_data %>%
  left_join(., word_pos_1, by = "question_index") %>%
  left_join(., word_pos_2, by = "question_index") %>%
  left_join(., word_pos_3, by = "question_index")

## replace all NAs with 0
model_base_data[is.na(model_base_data)] <- 0 #replace NAs with 0s - only for numeric variables
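To see what the count() and spread() step above produces, here is a toy illustration (not part of the pipeline): each POS tag observed for the first word becomes its own column, with the number of survey records per paragraph as the cell value.

# toy data: paragraph q1 starts with a determiner (DT) in both of its
# survey records; q2 starts with a noun (NN) in its single record
toy <- dplyr::tibble(question_index = c("q1", "q1", "q2"),
                     word_pos.word1 = c("DT", "DT", "NN"))

toy %>%
  group_by(question_index) %>%
  dplyr::count(word_pos.word1) %>%
  spread(word_pos.word1, n)
## question_index    DT    NN
## q1                 2    NA
## q2                NA     1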
A snapshot of the table is shown below.

#take a look at the model base data
model_base_data %>% head() %>% kbl() %>%
  kable_paper() %>%
  scroll_box(width = "100%", height = "200px")
[Table output omitted: a scrollable snapshot of the first six rows of model_base_data. The table is several hundred columns wide, covering paragraph metadata (source, year, issue), word and sentence counts, readability statistics (fk_grade_level, FRES_score), POS-tag counts and proportions, discourse-clue indicators, parse-tree constituent counts, survey response fields (Respondent_ID, Clarity, Content, content.scale, clarity.scale), and the derived pos_word1_*, pos_word2_* and pos_word3_* indicator columns.]
Construct the modelling dataset by removing columns that will not be used as input variables in the final model, such as text source, year, and paragraph.
#remove unneeded columns
model_response_base1 <- model_base_data %>%
  dplyr::select(-X, -index.x.x, -index.x.y, -index.y.x, -index.y.y, -year.x,
                -year.y, -month.y, -issue.y, -para, -source.y, -source_group.y,
                -survey_group.y, -question_group.y, -para_part_extraction, -rank)

print(paste("The number of the model data using response record is: ",
            nrow(model_response_base1), ".", sep = ""))
## [1] "The number of the model data using response record is: 1924."
model_para_base1 <- model_response_base1 %>% ungroup() %>%
  group_by(question_index) %>%
  dplyr::mutate(content.scale.avg = mean(content.scale), clarity.scale.avg = mean(clarity.scale)) %>%
  dplyr::select(-Respondent_ID, -economic_literacy, -economics_education,
                -education_index, -economic_job, -completion_flag, -Q_Index, -policy_score_job,
                -policy_score_2_dimension, -original_job_type, -record_score_by,
                -content.scale, -clarity.scale, -Clarity, -Content) %>% unique()
print(paste("The number of the model data using unique paragraph as an entry is: ", nrow(model_para_base1), ".", sep = ""))
## [1] "The number of the model data using unique paragraph as an entry is: 825."
model_para_base1 %>% head() %>% kbl() %>%
  kable_paper() %>%
  scroll_box(width = "100%", height = "200px")
[Table output omitted: a scrollable snapshot of the first six rows of model_para_base1 — the same feature columns aggregated to one row per paragraph, with respondent-level fields dropped and the averaged scores content.scale.avg and clarity.scale.avg appended.]
Four models are generated in this section. They are: the economist reasoning (content) model, the non-economist reasoning (content) model, the economist readability (clarity) model, and the non-economist readability (clarity) model.
As discussed in Section 5.2 of the paper, we remove ambiguous paragraphs with standardised survey scores between -0.4 and 0.4. Responses with scores equal to or higher than 0.4 are labelled "high", and those rated equal to or lower than -0.4 are labelled "low". Paragraphs with ratings between -0.4 and 0.4 are labelled "middle".
high.cutoff = 0.4
low.cutoff = -0.4
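As a quick illustration of the labelling rule (with made-up scores, not survey data):

# apply the cutoffs above to three illustrative standardised scores
sapply(c(0.55, -0.62, 0.10), function(s)
  dplyr::case_when(s >= high.cutoff ~ "high", s <= low.cutoff ~ "low", TRUE ~ "middle"))
## expected: "high" "low" "middle"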
Construct two columns (content_label and clarity_label) for each paragraph. The sample size for each model is shown as:
model_para_base1 <- model_para_base1 %>% mutate(
  content_label = case_when(content.scale.avg >= high.cutoff ~ "high",
                            content.scale.avg <= low.cutoff ~ "low",
                            content.scale.avg < high.cutoff & content.scale.avg > low.cutoff ~ "middle"),
  clarity_label = case_when(clarity.scale.avg >= high.cutoff ~ "high",
                            clarity.scale.avg <= low.cutoff ~ "low",
                            clarity.scale.avg < high.cutoff & clarity.scale.avg > low.cutoff ~ "middle"))

#Check the count of paragraphs by each label
model_para_base1 %>% ungroup() %>% group_by(content_label) %>% dplyr::summarise(para_count = n()) %>%
  kbl(caption = "Composition of sample paragraphs by reasoning (content) labels") %>%
  kable_classic(full_width = F, html_font = "Cambria")
content_label | para_count |
---|---|
high | 248 |
low | 251 |
middle | 326 |
#Check the count of paragraphs by each label
model_para_base1 %>% ungroup() %>% group_by(clarity_label) %>% dplyr::summarise(para_count = n()) %>%
  kbl(caption = "Composition of sample paragraphs by readability (clarity) labels") %>%
  kable_classic(full_width = F, html_font = "Cambria")
clarity_label | para_count |
---|---|
high | 264 |
low | 241 |
middle | 320 |
Separate our model data into two parts: economist and non-economist.
model_para_eco <- model_para_base1 %>%
  filter(unique_para_scored == "Economist" | unique_para_scored == "both") %>%
  select(-'0', -scoed_flag_final, -para_unique_index, -duplicate_flag, -Score_Times,
         -para_scored_by, -unique_para_scored, -Question_Viewed,
         -paragraph_clean, -question_group.x, -survey_group.x, -source_group.x,
         -source.x, -paragraph, -issue.x, -month.x, -word_choose)

model_para_noneco <- model_para_base1 %>%
  filter(unique_para_scored == "Non-economist" | unique_para_scored == "both") %>%
  select(-'0', -scoed_flag_final, -para_unique_index, -duplicate_flag, -Score_Times,
         -para_scored_by, -unique_para_scored, -Question_Viewed,
         -paragraph_clean, -question_group.x, -survey_group.x, -source_group.x,
         -source.x, -paragraph, -issue.x, -month.x, -word_choose)
#check the number
print(paste("The number of paragraphs that are rated by economists is: ", nrow(model_para_eco), sep = ""))
## [1] "The number of paragraphs that are rated by economists is: 518"
print(paste("The number of paragraphs that are rated by non-economists is: ", nrow(model_para_noneco), sep = ""))
## [1] "The number of paragraphs that are rated by non-economists is: 772"
Four models are built separately using the RF algorithm. To address over-fitting concerns, we train each model on 75 per cent of the sample data and use the remaining 25 per cent to evaluate model performance. Three metrics are reported to assess model performance: the confusion matrix, the ROC curve with its AUC value, and LogLoss.
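As a reminder of the third metric, LogLoss penalises confident but wrong probability predictions. A minimal sketch of its definition, using made-up labels and probabilities rather than model output:

# LogLoss = -(1/N) * sum(y*log(p) + (1-y)*log(1-p)),
# where y = 1 for the positive class ("high") and p is the predicted probability of "high"
y <- c(1, 0, 1)        # illustrative true labels
p <- c(0.9, 0.2, 0.6)  # illustrative predicted probabilities
-mean(y * log(p) + (1 - y) * log(1 - p))  # ~0.28; lower values indicate better probability estimates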
Prepare the model data for building the economist reasoning (content) model.
function_model_data_process <- function(model_data){
  model_data$target <- as.factor(model_data$target)
  names(model_data) <- gsub(x = names(model_data), pattern = '[$]', replacement = 'ds')
  names(model_data) <- gsub(x = names(model_data), pattern = '[-]', replacement = 'dash')
  if ("sent_1st_word_," %in% colnames(model_data)) {
    model_data <- model_data %>% dplyr::select(-'sent_1st_word_,')
  }
  model_data
}
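The renaming step matters because model formulas cannot cope with '$' or '-' inside column names. For example (illustrative input only):

# POS-tag columns such as "PRP$" and "WP$" are renamed to "PRPds" and "WPds",
# which is why those names appear in the feature tables below
gsub('[$]', 'ds', c("PRP$", "WP$"))
## [1] "PRPds" "WPds"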
# step 1: split the data into training and validation datasets
model_para_eco$index <- row.names(model_para_eco)

model_data2 <- model_para_eco %>% filter(content_label != "middle") %>%
  ungroup() %>%
  mutate(target = case_when(content_label == "high" ~ "high", content_label == "low" ~ "low")) %>%
  select(-question_index, -index, -content.scale.avg, -clarity.scale.avg, -clarity_label, -content_label) %>%
  unique()

model_data <- function_model_data_process(model_data2)

set.seed(1234)
validationIndex <- createDataPartition(model_data$target, p = 0.75, list = FALSE)
validation <- model_data[-validationIndex, ]
training <- model_data[validationIndex, ]
print(paste("For whole dataset, the number of rows is: ", nrow(model_data), "and columns is: ", ncol(model_data)))
## [1] "For whole dataset, the number of rows is: 295 and columns is: 273"
print(paste("For training data, the number of rows is: ", nrow(training), "and columns is: ", ncol(training)))
## [1] "For training data, the number of rows is: 222 and columns is: 273"
print(paste("For validation data, the number of rows is: ", nrow(validation), "and columns is: ", ncol(validation)))
## [1] "For validation data, the number of rows is: 73 and columns is: 273"
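Because createDataPartition samples within each level of the target, the high/low balance should be roughly the same in the training and validation sets. A quick check (output not shown):

# compare class proportions across the two splits
prop.table(table(training$target))
prop.table(table(validation$target))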
training %>% head() %>% kbl() %>%
  kable_paper() %>%
  scroll_box(width = "100%", height = "200px")
(Scrollable table output omitted: the first six rows of the training data, covering all predictor columns — word and sentence counts, readability statistics, POS-tag counts and proportions, discourse-clue counts and parse-tree features — plus the target label.)
Build the basic model without tuning hyper-parameters. The model performance is evaluated using the confusion matrix and the ROC curve with AUC.
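The mtry value in the chunk below follows the usual classification heuristic of the square root of the number of columns, rounded to the nearest integer; with the 273 columns of this training set:

round(sqrt(273), 0)  # = 17 candidate variables tried at each split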
# Step 2: build the basic model
set.seed(226)
baseModel <- randomForest(target ~ ., training, mtry = round(sqrt(ncol(training)), 0), ntree = 300) #no need to change the cut-off here as we can change it at the prediction stage
finalPredictions <- predict(baseModel, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$rf_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 32 13
## low 5 23
##
## Accuracy : 0.7534
## 95% CI : (0.6386, 0.8468)
## No Information Rate : 0.5068
## P-Value [Acc > NIR] : 1.411e-05
##
## Kappa : 0.5053
##
## Mcnemar's Test P-Value : 0.09896
##
## Sensitivity : 0.8649
## Specificity : 0.6389
## Pos Pred Value : 0.7111
## Neg Pred Value : 0.8214
## Prevalence : 0.5068
## Detection Rate : 0.4384
## Detection Prevalence : 0.6164
## Balanced Accuracy : 0.7519
##
## 'Positive' Class : high
##
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
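The 0.5 cut-off used for the confusion matrix is only one choice; sweeping it illustrates the trade-off between sensitivity and specificity (a sketch reusing the rf_final probabilities created above):

# validation accuracy at a few alternative cut-offs
for (th in c(0.3, 0.4, 0.5, 0.6, 0.7)) {
  pred <- ifelse(validation$rf_final >= th, "high", "low")
  cat("threshold:", th, " accuracy:", round(mean(pred == validation$target), 3), "\n")
}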
Build the final model using the optimal variables and hyperparameters. The model performance is again evaluated using the confusion matrix and the ROC curve with AUC.
#Step 3: build the final model using optimal variables and hyperparameters
opt_variable <- read.csv("./data_input/optvariables_para_eco_content0.4.csv")
opt_var <- opt_variable$x
training_short <- training[colnames(training) %in% opt_var] %>% as.data.frame()
training_short$target <- training$target

##Final model
set.seed(7)
baseModel <- randomForest(target ~ ., training_short, mtry = round(sqrt(ncol(training)), 0), ntree = 100) #no need to change the cut-off here as we can change it at the prediction stage

finalModel1 <- readRDS("./data_input/model/para_eco_content_final_model.rda") #import the pre-saved model so the confusion matrix results match those shown in the paper
finalPredictions <- predict(finalModel1, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$rf_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 33 13
## low 4 23
##
## Accuracy : 0.7671
## 95% CI : (0.6535, 0.8581)
## No Information Rate : 0.5068
## P-Value [Acc > NIR] : 4.513e-06
##
## Kappa : 0.5326
##
## Mcnemar's Test P-Value : 0.05235
##
## Sensitivity : 0.8919
## Specificity : 0.6389
## Pos Pred Value : 0.7174
## Neg Pred Value : 0.8519
## Prevalence : 0.5068
## Detection Rate : 0.4521
## Detection Prevalence : 0.6301
## Balanced Accuracy : 0.7654
##
## 'Positive' Class : high
##
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
# saveRDS(finalModel1, "para_eco_content_final_model.rda")
# eco_content_roc <- data.frame(PRROC_obj$curve)
# colnames(eco_content_roc) <- c("False Positive Rate","True Positive Rate","threshold")
# write.csv(eco_content_roc, "G:/Research/JoanH/20181022_SMP Project/RDP/graphit/data/eco_content_roc_data.csv")
##calculate the logloss as an additional performance metric
ModelMetrics::logLoss(validation$target, validation$rf_final)
## [1] 0.7545834
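For context, a no-information benchmark that predicts a probability of 0.5 for every paragraph gives:

-log(0.5)  # ~0.693, the LogLoss of always predicting 0.5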
Compare model performance on the validation dataset across different algorithms, including Naive Bayes, GBM, GLMnet, GLM and SVM. We add the chunk option 'eval=FALSE' so the comparison code is displayed but not evaluated when knitting.
Test model performance using different algorithms.
### 1) Model using NB - this section was added on 22/12/2020 to address Callan Winsor's comments
set.seed(7)
baseModel <- randomForest(target ~ ., training_short, mtry = round(sqrt(ncol(training)), 0), ntree = 100) #no need to change the cut-off here as we can change it at the prediction stage

testmodelnb <- train(target ~ ., data = training_short, method = "nb")
finalPredictions <- predict(testmodelnb, validation, type = "prob") %>% as.data.frame()
validation$nb_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$nb_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$nb_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
###2) Model using GBM
set.seed(7)
baseModel <- randomForest(target ~ ., training_short, mtry = round(sqrt(ncol(training)), 0), ntree = 100) #no need to change the cut-off here as we can change it at the prediction stage
testmodelGbm <- train(target ~ ., data = training_short, method = "gbm")
finalPredictions <- predict(testmodelGbm, validation, type = "prob") %>% as.data.frame()
validation$gbm_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$gbm_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$gbm_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
###3) Model using glmnet
set.seed(7)
testmodelglm <- train(target ~ ., training_short, method = "glmnet",
                      trControl = trainControl(classProbs = TRUE))
finalPredictions <- predict(testmodelglm, validation, type = "prob") %>% as.data.frame()
validation$glm_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$glm_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$glm_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
###4) Model using svm
set.seed(7)
testmodelsvm <- train(target ~ ., training_short, method = "svmRadial",
                      trControl = trainControl(classProbs = TRUE))
finalPredictions <- predict(testmodelsvm, validation, type = "prob") %>% as.data.frame()
validation$svm_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$svm_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$svm_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
###5) Logistic model using glm
#### all variables included
set.seed(7)
testmodelglmnormal <- train(target ~ ., training_short, method = "glm",
                            trControl = trainControl(classProbs = TRUE))
finalPredictions <- predict(testmodelglmnormal, validation, type = "prob") %>% as.data.frame()
validation$glmnormal <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$glmnormal >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$glmnormal, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
####6) Logistic model with FK only - check the performance results using the logistic model with the FK variable only
set.seed(7)
testmodelglmfk <- train(target ~ fk_grade_level, training_short, method = "glm",
                        trControl = trainControl(classProbs = TRUE))
finalPredictions <- predict(testmodelglmfk, validation, type = "prob") %>% as.data.frame()
validation$glmfk <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$glmfk >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$glmfk, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
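Because the comparison chunks above are not evaluated when knitting, a convenient way to compare the algorithms when running interactively is to collect the AUCs into one table. A sketch, using a hypothetical helper auc_for() and the *_final probability columns created above:

# hypothetical helper: AUC for one vector of predicted probabilities
auc_for <- function(scores) {
  round(PRROC::roc.curve(scores.class0 = scores,
                         weights.class0 = as.numeric(validation$target))$auc, 3)
}
data.frame(model = c("rf", "nb", "gbm", "glmnet", "svm", "glm", "glm_fk"),
           auc = c(auc_for(validation$rf_final), auc_for(validation$nb_final),
                   auc_for(validation$gbm_final), auc_for(validation$glm_final),
                   auc_for(validation$svm_final), auc_for(validation$glmnormal),
                   auc_for(validation$glmfk)))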
Build the reasoning model from the non-economist perspective.
#step 1: prepare the dataset
model_para_noneco$index <- row.names(model_para_noneco)

model_data2 <- model_para_noneco %>% filter(content_label != "middle") %>% ungroup() %>%
  mutate(target = case_when(content_label == "high" ~ "high",
                            content_label == "low" ~ "low")) %>%
  dplyr::select(-question_index, -index, -content.scale.avg, -clarity.scale.avg, -clarity_label, -content_label)

model_data <- function_model_data_process(model_data2)

set.seed(1234)
validationIndex <- createDataPartition(model_data$target, p = 0.75, list = FALSE)
validation <- model_data[-validationIndex, ]
training <- model_data[validationIndex, ]
print(paste("For whole dataset, the number of rows is: ", nrow(model_data), "and columns is: ", ncol(model_data)))
## [1] "For whole dataset, the number of rows is: 457 and columns is: 273"
print(paste("For training data, the number of rows is: ", nrow(training), "and columns is: ", ncol(training)))
## [1] "For training data, the number of rows is: 344 and columns is: 273"
print(paste("For validation data, the number of rows is: ", nrow(validation), "and columns is: ", ncol(validation)))
## [1] "For validation data, the number of rows is: 113 and columns is: 273"
# #step 2: build the basic RF model with default values
# set.seed(7)
#
# baseModel <- randomForest(target~., training, mtry=round(sqrt(ncol(training)),0), ntree=500) #no need to change cut-off here as we can change it in the prediction stage.
# finalPredictions <- predict(baseModel, validation, type = "Prob") %>% as.data.frame()
# validation$rf_final <- finalPredictions$high
# #set the threshold for building confusion matrix table
# validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
# confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
#
# #ROC curve
# PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
# PRROC_obj$auc <- round(PRROC_obj$auc,3)
# plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T)
#
# saveRDS(baseModel, "para_noneco_content_base_model.rda")
# Step 3: using the optimal variables for the final model
opt_variable <- read.csv("./data_input/optvariables_para_noneco_content0.4.csv")
opt_var <- opt_variable$x[1:65]
training_short <- training[colnames(training) %in% opt_var] %>% as.data.frame()
training_short$target <- training$target
set.seed(1234)
baseModel <- randomForest(target ~ ., training_short, mtry = round(sqrt(ncol(training)), 0), ntree = 300) #no need to change the cut-off here as we can change it at the prediction stage

finalModel2 <- readRDS("./data_input/model/para_noneco_content_final_model.rda") #import the pre-saved model so the confusion matrix results match those shown in the paper

finalPredictions <- predict(finalModel2, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$rf_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 41 18
## low 16 38
##
## Accuracy : 0.6991
## 95% CI : (0.6057, 0.7818)
## No Information Rate : 0.5044
## P-Value [Acc > NIR] : 2.081e-05
##
## Kappa : 0.398
##
## Mcnemar's Test P-Value : 0.8638
##
## Sensitivity : 0.7193
## Specificity : 0.6786
## Pos Pred Value : 0.6949
## Neg Pred Value : 0.7037
## Prevalence : 0.5044
## Detection Rate : 0.3628
## Detection Prevalence : 0.5221
## Balanced Accuracy : 0.6989
##
## 'Positive' Class : high
##
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
# saveRDS(finalModel2, "para_noneco_content_final_model.rda")
# noneco_content_roc_dt <- data.frame(PRROC_obj$curve)
# colnames(noneco_content_roc_dt)<- c("False Positive Rate",
# "True Positive Rate","threshold")
#
# write.csv(noneco_content_roc_dt,"G:/Research/JoanH/20181022_SMP Project/RDP/graphit/data/noneco_content_roc_data.csv")
##calculate the logloss as an additional performance metric
ModelMetrics::logLoss(validation$target, validation$rf_final)
## [1] 0.8198992
Test the model performance using the validation dataset and different algorithms.
## 1) NB
testmodelnb <- train(target ~ ., data = training_short, method = "nb")
finalPredictions <- predict(testmodelnb, validation, type = "prob") %>% as.data.frame()
validation$nb_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$nb_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$nb_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
##2) GBM
set.seed(1234)
testmodelGbm <- train(target ~ ., data = training_short, method = "gbm") #no need to change the cut-off here as we can change it at the prediction stage

finalPredictions <- predict(testmodelGbm, validation, type = "prob") %>% as.data.frame()
validation$gbm_final <- finalPredictions$high
validation$predict_label <- ifelse(validation$gbm_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$gbm_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
###3) Model using glm
set.seed(1234)
testmodelGlm <- train(target ~ ., data = training_short, method = "glm") #no need to change the cut-off here as we can change it at the prediction stage

finalPredictions <- predict(testmodelGlm, validation, type = "prob") %>% as.data.frame()
validation$glm_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$glm_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$glm_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
###4) Model using svm
set.seed(1234)
testmodelsvm <- train(target ~ ., data = training_short, method = "svmRadial",
                      trControl = trainControl(classProbs = TRUE)) #no need to change the cut-off here as we can change it at the prediction stage
finalPredictions <- predict(testmodelsvm, validation, type = "prob") %>% as.data.frame()
validation$svm_final <- finalPredictions$high

#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$svm_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$svm_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
###5) Model using glmnet
set.seed(1234)
testmodelglmnet <- train(target ~ ., data = training_short, method = "glmnet",
                         trControl = trainControl(classProbs = TRUE)) #no need to change the cut-off here as we can change it at the prediction stage
finalPredictions <- predict(testmodelglmnet, validation, type = "prob") %>% as.data.frame()
validation$glmnet_final <- finalPredictions$high

#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$glmnet_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$glmnet_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
###6) GLM with FK
set.seed(1234)
testmodelglmfk <- train(target ~ fk_grade_level, data = training, method = "glm",
                        trControl = trainControl(classProbs = TRUE)) #no need to change the cut-off here as we can change it at the prediction stage
finalPredictions <- predict(testmodelglmfk, validation, type = "prob") %>% as.data.frame()
validation$glmfk_final <- finalPredictions$high
validation$predict_label <- ifelse(validation$glmfk_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))

#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$glmfk_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
Build the readability model from the economist perspective.
#Step 1: model dataset preparation
model_para_eco$index <- row.names(model_para_eco)
model_data3 <- model_para_eco %>% filter(clarity_label != "middle") %>% ungroup() %>%
  mutate(target = case_when(clarity_label == "high" ~ "high",
                            clarity_label == "low" ~ "low")) %>%
  dplyr::select(-question_index, -index, -content.scale.avg, -clarity.scale.avg, -clarity_label, -content_label)

model_data <- function_model_data_process(model_data3)

set.seed(1234)
validationIndex <- createDataPartition(model_data$target, p = 0.75, list = FALSE)
validation <- model_data[-validationIndex, ]
training <- model_data[validationIndex, ]
print(paste("For whole dataset, the number of rows is: ", nrow(model_data), "and columns is: ", ncol(model_data)))
## [1] "For whole dataset, the number of rows is: 306 and columns is: 273"
print(paste("For training data, the number of rows is: ", nrow(training), "and columns is: ", ncol(training)))
## [1] "For training data, the number of rows is: 230 and columns is: 273"
print(paste("For validation data, the number of rows is: ", nrow(validation), "and columns is: ", ncol(validation)))
## [1] "For validation data, the number of rows is: 76 and columns is: 273"
#step 2: basic model
# set.seed(1234)
#
# baseModel <- randomForest(target~., training, mtry=round(sqrt(ncol(training)),0), ntree=300) #no need to change cut-off here as we can change it in the prediction stage.
# finalPredictions <- predict(baseModel, validation, type = "Prob") %>% as.data.frame()
# validation$rf_final <- finalPredictions$high
# #set the threshold for building confusion matrix table
# validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
# confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
#
# #ROC curve
# PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
# PRROC_obj$auc <- round(PRROC_obj$auc,3)
# plot(PRROC_obj, xlab="False Positive Rate (1 - Specificity)", ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T)
#
## using the optimal variables for the final model
opt_variable <- read.csv("./data_input/optvariables_para_eco_clarity0.4.csv")
opt_var <- opt_variable$x[1:35]

training_short <- training[colnames(training) %in% opt_var] %>% as.data.frame()
training_short$target <- training$target

set.seed(1234)
baseModel <- randomForest(target ~ ., training_short, mtry = round(sqrt(ncol(training)), 0), ntree = 100) #no need to change the cut-off here as we can change it at the prediction stage
#saveRDS(baseModel, "para_eco_clarity_base_model.rda")

finalModel3 <- readRDS("./data_input/model/para_eco_clarity_final_model.rda") #import the pre-saved model so the confusion matrix results match those shown in the paper

finalPredictions <- predict(finalModel3, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$rf_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 28 11
## low 10 27
##
## Accuracy : 0.7237
## 95% CI : (0.6091, 0.8201)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 6.031e-05
##
## Kappa : 0.4474
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7368
## Specificity : 0.7105
## Pos Pred Value : 0.7179
## Neg Pred Value : 0.7297
## Prevalence : 0.5000
## Detection Rate : 0.3684
## Detection Prevalence : 0.5132
## Balanced Accuracy : 0.7237
##
## 'Positive' Class : high
##
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
##calculate the logloss as an additional performance metric
ModelMetrics::logLoss(validation$target, validation$rf_final)
## [1] 0.8000629
Build the readability model from the non-economist perspective.
# step 1: prepare the dataset
model_para_noneco$index <- row.names(model_para_noneco)
model_data2 <- model_para_noneco %>% filter(clarity_label != "middle") %>% ungroup() %>%
  mutate(target = case_when(clarity_label == "high" ~ "high",
                            clarity_label == "low" ~ "low")) %>%
  dplyr::select(-question_index, -index, -content.scale.avg, -clarity.scale.avg, -clarity_label, -content_label)

model_data <- function_model_data_process(model_data2)

set.seed(1234)
validationIndex <- createDataPartition(model_data$target, p = 0.75, list = FALSE)
validation <- model_data[-validationIndex, ]
training <- model_data[validationIndex, ]
print(paste("For whole dataset, the number of rows is: ", nrow(model_data), "and columns is: ", ncol(model_data)))
## [1] "For whole dataset, the number of rows is: 462 and columns is: 273"
print(paste("For training data, the number of rows is: ", nrow(training), "and columns is: ", ncol(training)))
## [1] "For training data, the number of rows is: 347 and columns is: 273"
print(paste("For validation data, the number of rows is: ", nrow(validation), "and columns is: ", ncol(validation)))
## [1] "For validation data, the number of rows is: 115 and columns is: 273"
# step 2 build the basic model using default settings
# set.seed(226)
#
# training1 <- training %>% dplyr::select(-FRES_score) #remove FRES_score as it is correlated with FK
#
# baseModel <- randomForest(target~., training1, mtry=round(sqrt(ncol(training)),0), ntree=300) #no need to change cut-off here as we can change it in the prediction stage.
# finalPredictions <- predict(baseModel, validation, type = "Prob") %>% as.data.frame()
# validation$rf_final <- finalPredictions$high
# #set the threshold for building confusion matrix table
# validation$predict_label <- ifelse(validation$rf_final>=0.5,"high","low")
# confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
#
# #ROC curve
# PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0= as.numeric(validation$target),curve=T)
# PRROC_obj$auc <- round(PRROC_obj$auc,3)
# plot(PRROC_obj,
# xlab="False Positive Rate (1 - Specificity)",
# ylab="True Positive Rate (Recall/Sensitivity)",max.plot = TRUE, min.plot = TRUE, legend= T)
# saveRDS(baseModel, "para_noneco_clarity_base_model.rda")
#Step 3: using the optimal variables for the final model
opt_variable <- read.csv("./data_input/optvariables_para_noneco_clarity0.4.csv")
opt_var <- opt_variable$x
training_short <- training[colnames(training) %in% opt_var] %>% as.data.frame()
training_short$target <- training$target

set.seed(789)
baseModel <- randomForest(target ~ ., training_short, mtry = round(sqrt(ncol(training_short)), 0), ntree = 250) #no need to change the cut-off here as we can change it at the prediction stage
finalModel4 <- readRDS("./data_input/model/para_noneco_clarity_final_model.rda")

finalPredictions <- predict(finalModel4, validation, type = "Prob") %>% as.data.frame()
validation$rf_final <- finalPredictions$high
#set the threshold for building the confusion matrix table
validation$predict_label <- ifelse(validation$rf_final >= 0.5, "high", "low")
confusionMatrix(as.factor(validation$predict_label), as.factor(validation$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 48 27
## low 13 27
##
## Accuracy : 0.6522
## 95% CI : (0.5577, 0.7386)
## No Information Rate : 0.5304
## P-Value [Acc > NIR] : 0.005494
##
## Kappa : 0.2912
##
## Mcnemar's Test P-Value : 0.039833
##
## Sensitivity : 0.7869
## Specificity : 0.5000
## Pos Pred Value : 0.6400
## Neg Pred Value : 0.6750
## Prevalence : 0.5304
## Detection Rate : 0.4174
## Detection Prevalence : 0.6522
## Balanced Accuracy : 0.6434
##
## 'Positive' Class : high
##
#ROC curve
PRROC_obj <- roc.curve(scores.class0 = validation$rf_final, weights.class0 = as.numeric(validation$target), curve = T)
PRROC_obj$auc <- round(PRROC_obj$auc, 3)
plot(PRROC_obj, xlab = "False Positive Rate (1 - Specificity)", ylab = "True Positive Rate (Recall/Sensitivity)", max.plot = TRUE, min.plot = TRUE, legend = T)
##calculate the logloss as an additional performance metric
ModelMetrics::logLoss(validation$target, validation$rf_final)
## [1] 0.6052968
In this section, we extract the top features for our models using the randomForest feature importance function. The output is discussed in Section 6.3 of the paper. A snapshot of the result is shown in the chart below.
## Read the models first
model_eco_content <- readRDS("./data_input/model/para_eco_content_final_model.rda")
model_eco_clarity <- readRDS("./data_input/model/para_eco_clarity_final_model.rda")
model_noneco_content <- readRDS("./data_input/model/para_noneco_content_final_model.rda")
model_noneco_clarity <- readRDS("./data_input/model/para_noneco_clarity_final_model.rda")

## Extract the feature list
feature_list_table <- function(data_input){
  table_feature <- as.data.frame(data_input$importance)
  table_feature$feature <- row.names(table_feature)
  table_feature <- table_feature %>% arrange(desc(MeanDecreaseGini)) %>% dplyr::select(feature, MeanDecreaseGini)
  table_feature
}
## Generate the feature ranking table for each model
model_eco_content_feature <-
  feature_list_table(model_eco_content) %>% mutate(model_name = "model_eco_content")
model_eco_clarity_feature <-
  feature_list_table(model_eco_clarity) %>% mutate(model_name = "model_eco_clarity")
model_noneco_content_feature <-
  feature_list_table(model_noneco_content) %>% mutate(model_name = "model_noneco_content")
model_noneco_clarity_feature <-
  feature_list_table(model_noneco_clarity) %>% mutate(model_name = "model_noneco_clarity")

## combine the feature tables for the graphs
feature_graph_dt <-
  rbind(
    model_eco_content_feature[1:5, ],
    model_eco_clarity_feature[1:5, ],
    model_noneco_content_feature[1:5, ],
    model_noneco_clarity_feature[1:5, ])
library(ggplot2)
# Basic barplot
p <-
  ggplot(data = feature_graph_dt, aes(x = feature, y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "blue") +
  xlab("Feature List") +
  ylab("Importance") +
  coord_flip() +
  facet_wrap(~ model_name, ncol = 2)
p
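Within each facet the bars above share a single alphabetical axis. If ordered bars are preferred, one option (a sketch, ordering features by their average importance across the four models) is:

p2 <- ggplot(feature_graph_dt,
             aes(x = MeanDecreaseGini, y = reorder(feature, MeanDecreaseGini))) +
  geom_bar(stat = "identity", fill = "blue") +
  xlab("Importance") + ylab("Feature List") +
  facet_wrap(~ model_name, ncol = 2, scales = "free_y")
p2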
The session information for this program is shown below.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.3.4 questionr_0.7.4 klaR_0.6-15
## [4] MASS_7.3-53 PRROC_1.3.1 randomForest_4.6-14
## [7] caret_6.0-86 ggplot2_3.3.3 lattice_0.20-41
## [10] dplyr_1.0.6 tidyr_1.1.3 e1071_1.7-6
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 viridisLite_0.4.0 splines_4.0.3
## [4] foreach_1.5.1 prodlim_2019.11.13 shiny_1.6.0
## [7] assertthat_0.2.1 highr_0.9 stats4_4.0.3
## [10] yaml_2.2.1 ipred_0.9-11 pillar_1.6.0
## [13] glue_1.4.2 pROC_1.17.0.1 digest_0.6.27
## [16] promises_1.2.0.1 rvest_1.0.0 colorspace_2.0-1
## [19] recipes_0.1.16 htmltools_0.5.1.1 httpuv_1.6.1
## [22] Matrix_1.2-18 plyr_1.8.6 timeDate_3043.102
## [25] pkgconfig_2.0.3 labelled_2.8.0 haven_2.4.1
## [28] purrr_0.3.4 xtable_1.8-4 webshot_0.5.2
## [31] scales_1.1.1 svglite_2.0.0 later_1.2.0
## [34] gower_0.2.2 lava_1.6.9 tibble_3.1.1
## [37] proxy_0.4-25 combinat_0.0-8 farver_2.1.0
## [40] generics_0.1.0 ellipsis_0.3.2 withr_2.4.2
## [43] nnet_7.3-14 survival_3.2-7 magrittr_2.0.1
## [46] crayon_1.4.1 mime_0.10 evaluate_0.14
## [49] fansi_0.4.2 nlme_3.1-149 xml2_1.3.2
## [52] forcats_0.5.1 class_7.3-17 tools_4.0.3
## [55] data.table_1.14.0 hms_1.0.0 lifecycle_1.0.0
## [58] stringr_1.4.0 munsell_0.5.0 compiler_4.0.3
## [61] systemfonts_1.0.1 rlang_0.4.11 grid_4.0.3
## [64] iterators_1.0.13 rstudioapi_0.13 miniUI_0.1.1.1
## [67] labeling_0.4.2 rmarkdown_2.8 gtable_0.3.0
## [70] ModelMetrics_1.2.2.2 codetools_0.2-16 DBI_1.1.1
## [73] reshape2_1.4.4 R6_2.5.0 lubridate_1.7.10
## [76] knitr_1.33 fastmap_1.1.0 utf8_1.2.1
## [79] stringi_1.5.3 Rcpp_1.0.6 vctrs_0.3.8
## [82] rpart_4.1-15 tidyselect_1.1.1 xfun_0.22