Hovhannes Tamoyan


PhD Student at the
Technical University of Darmstadt


advised by Prof. Iryna Gurevych

working on

Natural Language Processing
LLM Self-Awareness, Synthetic Data Generation, and Code Generation

Research Tools

🦁 UrarTU

Machine learning framework built around a user-friendly YAML-based configuration system. It features streamlined Slurm job submission and a flexible architecture for orchestrating complex machine learning workflows.

🗂️ OrganizeNoc

All-in-one extension suite for researchers, designed to improve the management and exploration of academic materials. Key features include metadata extraction, PDF highlight processing, and AI-powered paper queries.

🐈‍⬛ tmynNLP

Natural language processing pipeline equipped with intuitive abstractions. It supports a diverse array of NLP tasks and processes them seamlessly and efficiently.

Featured Publications

LLM Roleplay: Simulating Human-Chatbot Interaction

Hovhannes Tamoyan et al.

We propose LLM-Roleplay, a fast and cheap method for generating human-chatbot dialogues using large language models (LLMs). By embodying textually described personas, LLM-Roleplay generates diverse, multi-turn dialogues that replicate the nuances of real conversational exchanges between a human and a chatbot. We evaluate this approach by comparing natural human-chatbot dialogues across different sociodemographic groups with those produced by our method. The findings demonstrate that LLM-Roleplay effectively mimics human-chatbot interactions, achieving a high indistinguishability rate between generated and real dialogues.

Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

Armen Aghajanyan*, Lili Yu*, Bowen Shi*, Ramakanth Pasunuru*, Hovhannes Tamoyan, Luke Zettlemoyer et al.

We introduce CM3Leon, a retrieval-augmented, token-based, decoder-only multi-modal language model that generates and infills both text and images. CM3Leon uses the CM3 multi-modal architecture and shows the significant advantages of scaling up and fine-tuning on diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models. With a large-scale retrieval-augmented pretraining stage and a multi-task supervised fine-tuning (SFT) stage, CM3Leon achieves state-of-the-art performance in text-to-image generation with 5 times less training compute than comparable methods, reaching a zero-shot MS-COCO FID of 4.88.

BARTSmiles: Generative Masked Language Models for Molecular Representations

Gayane Chilingaryan*, Hovhannes Tamoyan*, Ani Tevosyan*, et al.

We present BARTSmiles, a BART-like model pre-trained on molecular representations. We quantitatively show that, when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new state of the art (SOTA) on 11 tasks. Lastly, we show that standard attribution interpretability methods, when applied to BARTSmiles, highlight certain substructures that chemists use to explain specific properties of molecules.

YerevaNN’s Systems for WMT20 Biomedical Translation Task: The Effect of Fixing Misaligned Sentence Pairs

Karen Hambardzumyan, Hovhannes Tamoyan and Hrant Khachatrian

We provide systems for the en-ru and en-de language pairs for the WMT20 Biomedical Machine Translation shared task. For the en-ru pair, our submissions achieve the best BLEU scores, with the en→ru direction outperforming the other systems by a significant margin. We attribute most of the improvements to our heavy data preprocessing pipeline, which attempts to fix poorly aligned sentences in the parallel data.