Comparative Analysis of Transformer Models in Object Detection and Relationship Determination on COCO Dataset
DOI: https://doi.org/10.30865/mib.v8i1.7158

Keywords: Object Detection, Relationship Prediction, Transformer Model, Image Understanding, Image Comprehension

Abstract
This research investigates the integration of object detection and relationship prediction models to enhance image interpretability, addressing the core question: what challenges motivate a comparative analysis of object detection and transformer models for relationship determination? The object detection model performs well, particularly at lower Intersection over Union (IoU) thresholds and for larger objects, providing a solid foundation for the subsequent analyses. The transformer models GIT, GPT-2, and PromptCap are evaluated for their language generation capabilities and show noteworthy performance, including on novel keyword-based metrics. The study also discusses limitations related to dataset constraints and model generalization, which frame the rationale for the work and point toward future refinements and broader application domains. Together, the evaluation of object detection and transformer models offers insight into the interplay between visual and linguistic understanding in image comprehension, contributing to the evolving landscape of computer vision and natural language processing research.
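As a minimal illustration of the Intersection over Union (IoU) criterion referenced in the abstract, the sketch below computes IoU for two axis-aligned bounding boxes; a detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold (e.g., 0.5). This is a generic example of the standard formula, not the evaluation code used in the study.

```python
# Standard IoU for axis-aligned boxes given as (x_min, y_min, x_max, y_max).
# Hypothetical illustration only; not the authors' implementation.

def iou(box_a, box_b):
    """Return the Intersection over Union of two boxes."""
    # Coordinates of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Intersection area is zero if the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted box versus a ground-truth box.
# IoU ~ 0.39 here, so the prediction would fail a 0.5 threshold
# but pass a more lenient 0.3 threshold.
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))
```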
License

This work is licensed under a Creative Commons Attribution 4.0 International License