Hovhannes Tamoyan

tamohannes

PhD Student at the
Technical University of Darmstadt

supervisor

Prof. Iryna Gurevych

working on
Natural Language Processing

Code Generation, LLM Benchmarking,
Synthetic Data, Agent Systems

Featured Publications

LLM Roleplay: Simulating Human-Chatbot Interaction

Hovhannes Tamoyan et al.

We propose LLM-Roleplay, a fast and inexpensive method for generating human-chatbot interaction logs (dialogues) using large language models (LLMs). By embodying textually described personas, LLM-Roleplay generates diverse, multi-turn dialogues that replicate the nuances of real conversational exchanges between humans and chatbots. We evaluate this approach by comparing natural human-chatbot dialogues across different sociodemographic groups with those produced by our method. The findings demonstrate that LLM-Roleplay effectively mimics human-chatbot interactions, achieving a high indistinguishability rate between generated and real dialogues (44% for Mixtral 8x7B, out of a possible 50%).
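
To give a flavor of the approach, here is a minimal, hypothetical sketch of persona-conditioned dialogue generation; the chat helper, persona text, and goal are placeholders, not the actual LLM-Roleplay implementation or prompts.

```python
# Minimal sketch of persona-conditioned dialogue generation (illustrative only).
# `chat` is a hypothetical stand-in for any chat-tuned LLM call.
def chat(system_prompt: str, history: list[str]) -> str:
    # Placeholder: replace with a real LLM call (API or local model).
    return f"(reply to: {history[-1] if history else system_prompt})"

persona = "You are a 34-year-old nurse, skeptical of technology, who asks short, practical questions."
goal = "Find out whether a chatbot can help plan a low-sodium diet."

dialogue: list[str] = []
user_turn = chat(f"{persona}\nYour goal: {goal}\nStart the conversation.", dialogue)
for _ in range(3):  # multi-turn exchange
    dialogue.append(f"Human: {user_turn}")
    bot_turn = chat("You are a helpful assistant chatbot.", dialogue)
    dialogue.append(f"Chatbot: {bot_turn}")
    user_turn = chat(f"{persona}\nContinue pursuing your goal: {goal}", dialogue)

print("\n".join(dialogue))
```
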
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

Armen Aghajanyan*, Lili Yu*, Bowen Shi*, Ramakanth Pasunuru*, Hovhannes Tamoyan, Luke Zettlemoyer et al.

Introducing CM3Leon, a powerful, retrieval-augmented, token-based, decoder-only multi-modal language model that excels at generating and infilling both text and images. CM3Leon builds on the CM3 multi-modal architecture and showcases the significant advantages of scaling up and fine-tuning on diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, incorporating a large-scale retrieval-augmented pretraining stage and a multi-task supervised fine-tuning (SFT) stage. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5 times less training compute than comparable methods, reaching a zero-shot MS-COCO FID of 4.88.
BARTSmiles: Generative Masked Language Models for Molecular Representations

Gayane Chilingaryan*, Hovhannes Tamoyan*, Ani Tevosyan*, et al.

We present BARTSmiles, a BART-like model pre-trained on molecular representations. We quantitatively show that, when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new SOTA on 11 tasks. Lastly, we show that standard attribution interpretability methods, when applied to BARTSmiles, highlight certain substructures that chemists use to explain specific properties of molecules.
YerevaNN’s Systems for WMT20 Biomedical Translation Task: The Effect of Fixing Misaligned Sentence Pairs

Karen Hambardzumyan, Hovhannes Tamoyan and Hrant Khachatrian

We describe our systems for the en-ru and en-de language pairs of the WMT20 Biomedical Machine Translation shared task. For the en-ru pair, our submissions achieve the best BLEU scores, with the en→ru direction outperforming the other systems by a significant margin. We attribute most of the improvement to our heavy data preprocessing pipeline, which attempts to fix poorly aligned sentences in the parallel data.
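
As a toy illustration of the kind of alignment check such a preprocessing pipeline might include (this is not the actual WMT20 system), consider a simple length-ratio filter over parallel sentence pairs:

```python
# Toy misalignment check for parallel data (illustrative only).
def looks_misaligned(src: str, tgt: str, max_len_ratio: float = 2.0) -> bool:
    s, t = len(src.split()), len(tgt.split())
    if min(s, t) == 0:
        return True
    return max(s, t) / min(s, t) > max_len_ratio  # wildly different lengths

pairs = [
    ("The patient was discharged after two days.", "Пациент был выписан через два дня."),
    ("See table 3.", "Результаты представлены в таблице 3, где показаны все группы пациентов."),
]
clean = [p for p in pairs if not looks_misaligned(*p)]
print(len(clean), "of", len(pairs), "pairs kept")
```
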

Research Tools and Frameworks

🦁 UrarTU

A machine learning framework driven by YAML-based configurations, featuring effortless Slurm job submission, experiment tracking, and commonly needed functionality built in.
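
As a rough illustration only (this is not UrarTU's actual API or configuration schema), a YAML-driven experiment with Slurm submission might look like this:

```python
# Illustrative sketch of a YAML-configured experiment handed off to Slurm.
# Config keys and the training script are hypothetical placeholders.
import subprocess
import yaml  # pip install pyyaml

CONFIG = """
experiment: mt_finetune
model: my-base-model
trainer:
  lr: 3.0e-5
  epochs: 3
slurm:
  partition: gpu
  gres: gpu:1
  time: "12:00:00"
"""

cfg = yaml.safe_load(CONFIG)

# Render a batch script from the config and submit it via sbatch (reads from stdin).
script = f"""#!/bin/bash
#SBATCH --partition={cfg['slurm']['partition']}
#SBATCH --gres={cfg['slurm']['gres']}
#SBATCH --time={cfg['slurm']['time']}
python train.py --model {cfg['model']} --lr {cfg['trainer']['lr']} --epochs {cfg['trainer']['epochs']}
"""
subprocess.run(["sbatch"], input=script, text=True, check=True)
```
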

🗂️ OrganizeNoc

A browser extension suite for researchers, designed to manage an academic paper library. Key features include metadata extraction, highlight processing, BibTeX citation export, and AI-powered querying.
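
The core metadata-to-BibTeX step it automates can be illustrated with a small sketch (the extension itself is browser-based; this is not its actual code, and the example data is made up):

```python
# Turn extracted paper metadata into a BibTeX entry (illustrative only).
def to_bibtex(meta: dict) -> str:
    key = f"{meta['authors'][0].split()[-1].lower()}{meta['year']}"
    authors = " and ".join(meta["authors"])
    return (f"@inproceedings{{{key},\n"
            f"  title     = {{{meta['title']}}},\n"
            f"  author    = {{{authors}}},\n"
            f"  booktitle = {{{meta['venue']}}},\n"
            f"  year      = {{{meta['year']}}}\n"
            f"}}")

print(to_bibtex({
    "title": "An Example Paper Title",
    "authors": ["Ada Lovelace", "Alan Turing"],
    "venue": "Proceedings of an Example Conference",
    "year": 2024,
}))
```
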

🐈‍⬛ tmynNLP

A natural language processing pipeline equipped with common abstractions, supporting a diverse array of NLP tasks.
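
The kind of composable pipeline abstraction such a toolkit provides can be sketched as follows; the class and step names here are hypothetical, not tmynNLP's actual API:

```python
# Generic sketch of a composable text-processing pipeline (illustrative only).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Pipeline:
    steps: List[Callable[[str], str]]

    def __call__(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text

lowercase = str.lower
strip_punct = lambda s: "".join(c for c in s if c.isalnum() or c.isspace())

pipe = Pipeline(steps=[lowercase, strip_punct])
print(pipe("Hello, NLP World!"))  # -> "hello nlp world"
```
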

Workshops

🧪 Experiment Tracking

Explore the essentials of experiment tracking in ML research. This presentation covers the key requirements for effective tools and offers an in-depth exploration of Aim. It includes a practical demonstration where Aim is integrated into an NMT system's fine-tuning pipeline.
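
A minimal example of the kind of Aim integration shown in the demonstration (a generic training loop; the NMT-specific parts are omitted):

```python
# Track metrics and hyperparameters with Aim in a toy training loop.
from aim import Run  # pip install aim

run = Run(experiment="nmt_finetune")
run["hparams"] = {"lr": 3e-5, "batch_size": 32}

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder for the real training loss
    run.track(loss, name="loss", step=step, context={"subset": "train"})
```
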

🔥 PyTorch Optimization

This session introduces strategies such as efficient dataloader usage, parallel computation, operator fusion, and more to improve the speed and efficiency of your PyTorch code.
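
A few of these optimizations in minimal form (assumes PyTorch 2.x for torch.compile):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
    # Efficient data loading: worker processes + pinned memory for faster host-to-GPU copies.
    loader = DataLoader(dataset, batch_size=64, num_workers=4,
                        pin_memory=True, persistent_workers=True)

    model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
    model = torch.compile(model)  # graph capture enables operator fusion

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    for x, _ in loader:
        x = x.to(device, non_blocking=True)  # overlaps the copy with compute when pinned
        model(x)

if __name__ == "__main__":  # needed for multi-worker DataLoader on spawn-based platforms
    main()
```
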