Comparative Analysis of Transformer Models in Object Detection and Relationship Determination on COCO Dataset

 Raihan Atsal Hafizh (Telkom University, Bandung, Indonesia)
 (*)Kemas Rahmat Saleh Wiharja (Telkom University, Bandung, Indonesia)
 Muhammad Arya Fikriansyah (Telkom University, Bandung, Indonesia)

(*) Corresponding Author

Submitted: December 15, 2023; Published: January 10, 2024


This research investigates the integration of object detection and relationship prediction models to enhance image interpretability, asking what challenges motivate a comparative analysis of object detection and transformer models for relationship determination. The object detection model performs well, especially at lower Intersection over Union (IoU) thresholds and for larger objects, laying a solid foundation for the subsequent analyses. The transformer models GIT, GPT-2, and PromptCap are evaluated for their language generation capabilities and show noteworthy results across performance metrics, including novel keyword-based metrics. Evaluating both the object detection and transformer models provides insight into the interplay between visual and linguistic understanding in image comprehension. The study acknowledges limitations arising from dataset constraints and potential challenges in model generalization; these point to future work on refining the models and exploring broader application domains. By treating visual and textual elements together, the work contributes to the evolving landscape of computer vision and natural language processing research.
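To make the IoU thresholds mentioned in the abstract concrete, the following is a minimal sketch of how IoU between a predicted and a ground-truth box can be computed and compared against a threshold. It is not the authors' implementation. The function names `iou` and `is_match` are hypothetical, and the corner format `(x1, y1, x2, y2)` is an assumption made for illustration; COCO annotations store boxes as `[x, y, width, height]`, so they would need converting first.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) corner tuples; this format is an
    assumption for the sketch, not taken from the paper.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, truth_box, threshold=0.5):
    """Hypothetical helper: a prediction counts as correct at this threshold."""
    return iou(pred_box, truth_box) >= threshold
```

A lower threshold accepts looser localization, which is why a detector can score better at lower IoU thresholds: the same prediction may pass `is_match(..., threshold=0.1)` while failing at `threshold=0.5`.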


Object Detection; Relationship Prediction; Transformer Model; Image Understanding; Image Comprehension






Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

STMIK Budi Darma
Secretariat: Sisingamangaraja No. 338 Telp 061-7875998
