June Meetup

RSVP

Location

QualityMinds GmbH, Chiemgaustraße 116, 81549 München

Diego Miguel Lozano: I’m Feeling Lucky: The Past, Present and Future of Search

KNOWRON is a Munich-based startup that develops an AI-powered personal assistant to help deskless workers spend less time searching and more time finding. In this talk, we will cover the history of search with special emphasis on our own learnings in the last few years, in which new techniques such as Dense Passage Retrieval or Hybrid Search have greatly advanced the field of Information Retrieval.

Haris Jabbar: MorphPiece: A Linguistic Tokenizer for Large Language Models [Paper]

Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, with little consideration of linguistic features. This work proposes a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows comparable or superior performance on a variety of supervised and unsupervised NLP tasks, compared to the OpenAI GPT-2 model. Specifically, the model is evaluated on language modeling tasks, zero-shot performance on the GLUE Benchmark with various prompt templates, the Massive Text Embedding Benchmark (MTEB) for supervised and unsupervised performance, and lastly against another morphological tokenization scheme (FLOTA, Hoffmann et al., 2022). We find that the model trained on MorphPiece outperforms GPT-2 on most evaluations, at times by a considerable margin, despite being trained for about half the training iterations.

Agenda

  • 18:00 Doors open
  • 18:30 - 18:45 Organizers & Host Welcome and Introduction
  • 18:45 - 19:30 Diego Miguel Lozano: I’m Feeling Lucky: The Past, Present and Future of Search
  • 19:45 - 20:30 Haris Jabbar: MorphPiece: A Linguistic Tokenizer for Large Language Models
  • 20:30 - 22:00 Get Together With Food & Drinks

Speakers

Haris Jabbar

Haris Jabbar holds Master's degrees in Computer Engineering (from Pakistan) and Computational Science and Engineering (from TU Munich). He is currently pursuing a PhD at LMU with a focus on inducing linguistic bias in foundation language models. This presentation covers his work on including linguistic artifacts at the tokenization level.

Diego Miguel Lozano

Diego Miguel Lozano has been an NLP Engineer at KNOWRON since its early days. He is also about to complete his MSc in Informatics at TUM, with a focus on NLP.