Genre Aware Language Modeling for Indonesian Academic Writing: Building and Evaluating IndoSciBERT
DOI:
https://doi.org/10.61978/lingua.v3i3.992
Keywords:
Rhetorical Classification, Genre Aware NLP, Indonesian Academic Writing, IndoSciBERT, Rhetorical Annotation, Academic Corpus
Abstract
This study introduces a genre-annotated corpus of Indonesian academic writing and evaluates IndoSciBERT, a domain-specific language model trained on it. To address the scarcity of rhetorical datasets for low-resource languages, we compiled a 52,300-document corpus from DOAJ- and SINTA-indexed journals (2015–2025) and annotated 5,200 paragraphs using the CARS and Argumentative Zoning frameworks. We used GROBID for PDF-to-TEI conversion, TEITOK for annotation, and SIPEBI/KBBI for spelling normalization. IndoSciBERT was then fine-tuned for rhetorical classification and benchmarked against IndoBERT. It achieved an F1 score of 0.82 and an accuracy of 84.2%, outperforming the baseline and distinguishing rhetorical moves reliably. These results support the value of domain-specific modeling for educational applications. The annotated corpus supports genre analysis, pedagogy, and automated writing feedback, and it establishes a foundation for inclusive NLP. In particular, this work offers a sustainable path to strengthening academic literacy in Bahasa Indonesia through genre-aware tools.
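As a rough illustration of how the two reported metrics relate, the sketch below computes accuracy and a macro-averaged F1 score over paragraph-level rhetorical labels. The abstract does not state which F1 averaging scheme was used, so macro averaging is an assumption here, and the labels `M1`, `M2`, `M3` and the toy predictions are hypothetical placeholders rather than data from the study.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of paragraphs whose predicted label matches the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores (macro averaging, an assumption)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: four paragraphs tagged with placeholder move labels.
gold_labels = ["M1", "M1", "M2", "M3"]
pred_labels = ["M1", "M2", "M2", "M3"]

print(f"accuracy = {accuracy(gold_labels, pred_labels):.3f}")
print(f"macro F1 = {macro_f1(gold_labels, pred_labels):.3f}")
```

Accuracy and F1 can diverge when rhetorical moves are unevenly distributed, which is one reason both figures are typically reported side by side.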