Collaborative research causes problems for research assessments because of the difficulty in fairly crediting its authors. Whilst splitting the rewards for an article amongst its authors has the greatest surface-level fairness, many important evaluations assign full credit to each author, irrespective of team size. The underlying rationales for this are labour reduction and the need to incentivise collaborative work because it is necessary to solve many important societal problems. This article assesses whether full counting changes results compared to fractional counting in the case of the UK's Research Excellence Framework (REF) 2021. For this assessment, fractional counting reduces the number of journal articles to as little as 10% of the full counting value, depending on the Unit of Assessment (UoA). Despite this large difference, allocating an overall grade point average (GPA) based on full counting or fractional counting gives results with a median Pearson correlation within UoAs of 0.98. The largest changes are for Archaeology (r=0.84) and Physics (r=0.88). There is a weak tendency for higher-scoring institutions to lose from fractional counting, with the loss being statistically significant in 5 of the 34 UoAs. Thus, whilst the apparent over-weighting of contributions to collaboratively authored outputs does not seem too problematic from a fairness perspective overall, it may be worth examining in the few UoAs in which it makes the most difference.
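As a rough illustration of the two counting methods being compared, the sketch below (not drawn from the paper's data or code) computes an institutional GPA under full and fractional counting and correlates the two; the institutions, quality scores, and author counts are invented for demonstration.

```python
# Illustrative sketch only: contrasts full and fractional counting of
# co-authored outputs and compares the resulting institutional GPAs with a
# Pearson correlation. All data below are hypothetical.
from statistics import correlation  # requires Python 3.10+

# Each output: (institution, quality score on a 1-4 scale, number of authors)
outputs = [
    ("Uni A", 4, 1), ("Uni A", 3, 5), ("Uni A", 3, 2),
    ("Uni B", 4, 12), ("Uni B", 2, 1), ("Uni B", 3, 3),
    ("Uni C", 3, 1), ("Uni C", 4, 2), ("Uni C", 2, 6),
]

def gpa(weight):
    """Weighted GPA per institution; weight(n_authors) is the credit given to each output."""
    totals = {}
    for inst, score, n_authors in outputs:
        w = weight(n_authors)
        s, wsum = totals.get(inst, (0.0, 0.0))
        totals[inst] = (s + w * score, wsum + w)
    return {inst: s / wsum for inst, (s, wsum) in totals.items()}

full = gpa(lambda n: 1.0)      # full counting: each output counts once per institution
frac = gpa(lambda n: 1.0 / n)  # fractional counting: credit shared among the authors

insts = sorted(full)
r = correlation([full[i] for i in insts], [frac[i] for i in insts])
print({i: (round(full[i], 2), round(frac[i], 2)) for i in insts}, "r =", round(r, 3))
```

In this toy setup the institution submitting many highly co-authored outputs sees the largest GPA shift under fractional counting, which is the kind of difference the paper checks for across UoAs.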
Purpose: Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task. Design/methodology/approach: Test the extent to which ChatGPT-4 can assess the quality of journal articles using a case study of the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements. Findings: ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author's significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations. Research limitations: The data is self-evaluations of a convenience sample of articles from one academic in one field. Practical implications: Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use. Originality/value: This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.
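The score-averaging step reported in the findings can be illustrated with a small sketch (again hypothetical: the scores are invented and no actual ChatGPT prompting is shown). It correlates each individual scoring round with the author's self-evaluations, then correlates the across-round means with the same self-evaluations.

```python
# Minimal sketch of the aggregation idea: average several independent LLM
# quality scores per article, then compare per-round and averaged scores
# against the author's self-evaluations. All numbers are invented.
from statistics import correlation, mean  # correlation requires Python 3.10+

self_scores = [4, 3, 3, 2, 4, 3, 2, 3]   # hypothetical self-evaluations (1-4 scale)
llm_rounds = [                            # hypothetical scores from three scoring iterations
    [3, 3, 4, 3, 3, 3, 3, 4],
    [4, 2, 3, 3, 4, 3, 2, 3],
    [3, 3, 3, 2, 4, 4, 3, 3],
]

# Correlation of each individual round with the self-evaluations
per_round = [correlation(round_scores, self_scores) for round_scores in llm_rounds]

# Correlation of the across-round averages with the self-evaluations
averaged = [mean(scores) for scores in zip(*llm_rounds)]
avg_r = correlation(averaged, self_scores)

print("per-round r:", [round(r, 3) for r in per_round], "averaged r:", round(avg_r, 3))
```

Averaging tends to smooth out round-to-round noise in the individual scores, which is consistent with the abstract's observation that the mean of 15 iterations correlated more strongly with the self-evaluations than any single round did.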
Funding: This study was funded by Research England, Scottish Funding Council, Higher Education Funding Council for Wales, and Department for the Economy, Northern Ireland as part of the Future Research Assessment Programme (https://www.jisc.ac.uk/future-research-assessment-programme).