Document Similarity Detection Using Cosine Similarity Based on Count Vectorizer and TF-IDF Text Representations
DOI:
https://doi.org/10.21927/ijubi.v7i2.5170
Abstract
The thesis or final project course aims to foster a culture of critical thinking and to demonstrate the ability to solve problems through the logical construction of research. Alongside these benefits, however, the course also gives rise to several problems. Plagiarism is a common one: taking another person's work, including their own opinions, and presenting it as one's own. The first step in applying technology to this problem is to detect document similarity early. In this case, the document that students must submit when proposing a thesis title is the abstract. Cosine similarity is a computationally efficient algorithm because it is easy to understand and can be applied to large-scale data. This study uses two text representation approaches, TF-IDF and Count Vectorizer. The corpus consists of 1,600 student thesis abstracts, and 30 documents are used for testing to evaluate how well cosine similarity detects similarity between abstracts. The results show that the TF-IDF representation yields a similarity of 7.72861 and Count Vectorizer yields 16.85541, a gap of 9.1268 in favor of Count Vectorizer. This is because Count Vectorizer counts word frequencies without considering whether a word is common or rare, so common words still contribute fully to the similarity.
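The effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the source does not state which library or preprocessing was used, so this version relies only on the Python standard library, a whitespace tokenizer, and one common smoothed IDF variant, all of which are assumptions. The three-sentence corpus and every function name are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace. A real pipeline would
    # likely also strip punctuation and stop words.
    return text.lower().split()


def count_vectors(docs):
    # Raw term counts per document (the Count Vectorizer idea).
    return [Counter(tokenize(d)) for d in docs]


def tfidf_vectors(docs):
    # Term count weighted by a smoothed inverse document frequency:
    # idf(t) = ln((1 + N) / (1 + df(t))) + 1, one common variant.
    counts = count_vectors(docs)
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine_similarity(a, b):
    # Dot product of two sparse vectors divided by the product of their norms.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: the documents share only the common words
# "the" and "in".
corpus = [
    "the system detects plagiarism in the thesis abstract",
    "the method classifies sentiment in the movie review",
    "the network predicts rainfall in the coastal region",
]
count_vecs = count_vectors(corpus)
tfidf_vecs = tfidf_vectors(corpus)
count_score = cosine_similarity(count_vecs[0], count_vecs[1])
tfidf_score = cosine_similarity(tfidf_vecs[0], tfidf_vecs[1])
print(f"Count Vectorizer: {count_score:.4f}")
print(f"TF-IDF:           {tfidf_score:.4f}")
```

On this toy corpus the first two documents overlap only on words that occur in every document. With raw counts those words carry full weight, so the similarity is 0.5; with TF-IDF the rarer, non-shared words are weighted more heavily and the score drops to roughly 0.26. This mirrors the direction of the gap reported in the abstract, not its magnitude.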
License
COPYRIGHT TRANSFER FORM
The copyright to this article is transferred to Alma Ata University Press if and when the article is accepted for publication. The undersigned hereby transfers any and all rights in and to the paper including without limitation all copyrights to AAU Press. The undersigned hereby represents and warrants that the paper is original and that he/she is the author of the paper, except for material that is clearly identified as to its original source, with permission notices from the copyright owners where required. The undersigned represents that he/she has the power and authority to make and execute this assignment.
We declare that:
1. This paper has not been published in the same form elsewhere.
2. It will not be submitted anywhere else for publication prior to acceptance/rejection by this Journal.
3. A copyright permission is obtained for materials published elsewhere and which require this permission for reproduction.
Furthermore, I/We hereby transfer the unlimited rights of publication of the above mentioned paper in whole to AAU Press. The copyright transfer covers the exclusive right to reproduce and distribute the article, including reprints, translations, photographic reproductions, microform, electronic form (offline, online) or any other reproductions of similar nature.
The corresponding author signs for and accepts responsibility for releasing this material on behalf of any and all co-authors. This agreement is to be signed by at least one of the authors who have obtained the assent of the co-author(s) where applicable. After submission of this agreement signed by the corresponding author, changes of authorship or in the order of the authors listed will not be accepted.
Retained Rights/Terms and Conditions
1. Authors retain all proprietary rights in any process, procedure, or article of manufacture described in the Work.
2. Authors may reproduce or authorize others to reproduce the Work or derivative works for the authors' personal use or for company use, provided that the source and the AAU Press copyright notice are indicated, the copies are not used in any way that implies AAU Press endorsement of a product or service of any employer, and the copies themselves are not offered for sale.
3. Although authors are permitted to re-use all or portions of the Work in other works, this does not include granting third-party requests for reprinting, republishing, or other types of re-use.














