
Blog series introduction: why embeddings matter in generative AI

The year 2023 marks an overlooked anniversary: it has been 10 years since Word2Vec was released. That paper presented self-supervised approaches to training a dense representation of a word based on its surrounding companions. The very idea of looking at context when studying language is not new, either: it dates back to John Rupert Firth, an English linguist, who wrote in 1957:

“You shall know a word by the company it keeps.”

Since 2013, word and sentence embeddings have revolutionized the way we process exabytes’ worth of textual data, and multiple applications have emerged to leverage them: recommender systems, sentiment analysis classifiers, search engines…

Moreover, on the research side, text-based machine learning goes through at least one Copernican revolution a year; transformers, in particular, cemented context-based embeddings.

Embeddings are becoming the backbone of many tasks, spanning a wide range of expertise levels.

History and formulas are fine, but just give me the code!

This series of blog posts aims to help you effectively select, evaluate, deploy, and store pre-trained embeddings with the help of AWS services. I’ll briefly introduce some concepts, and I’ll provide code, notebooks, and infrastructure as code (IaC) with the AWS CDK so that you can deploy your own state-of-the-art embedding system in a repeatable, robust manner. Your companion for this blog series will be the following GitHub repository, simply called cloud-embeddings 😁.
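To give a concrete idea of what “infrastructure as code with the CDK” looks like, here is a minimal sketch of a CDK (Python, v2) application. The stack and its name are hypothetical placeholders for illustration, not the actual contents of the cloud-embeddings repository:

```python
# Minimal AWS CDK v2 skeleton (Python). The stack below is an empty,
# hypothetical placeholder: later posts in the series would declare the
# actual resources (e.g., an endpoint hosting an embedding model) inside it.
from aws_cdk import App, Stack
from constructs import Construct


class EmbeddingStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Embedding-serving resources would be declared here.


app = App()
EmbeddingStack(app, "EmbeddingStack")
app.synth()  # emits a CloudFormation template, ready for `cdk deploy`
```

Because the whole stack is plain code, redeploying it elsewhere (another account, another region) is a matter of re-running it, which is exactly the repeatability this series is after.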

Word embeddings: a historical perspective

Word embeddings are a vehicle that allows computers to turn words into vectors, and thus to perform operations on words. By extension, sentence embeddings turn a sentence into a vector. Put differently, we can see an embedding as a mapping function between a word (or, more interestingly, a sequence of words) and a vector; a short code sketch below illustrates this mapping. Embeddings aim to capture semantic relationships between words. We won’t deep-dive into this part today, but the interested reader can find some foundational links below to gain a historical perspective.

Figure: a historical perspective on the evolution of embeddings.
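As a quick illustration of that mapping, here is a minimal sketch using the sentence-transformers library. The model name all-MiniLM-L6-v2 is just an example choice, not necessarily the one used later in this series:

```python
# A minimal sketch of the "text -> vector" mapping with sentence-transformers.
# The model choice here is illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "You shall know a word by the company it keeps.",
    "Context reveals the meaning of a word.",
    "The weather is nice today.",
]

# Each sentence is mapped to a fixed-size dense vector (384 dimensions here).
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)

# Semantically related sentences end up closer together in the vector space.
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower
```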

To summarize, embeddings are the backbone of current AI hype, especially in their text-based flavor.

Suggested reading order

While the parts are meant to be loosely coupled, here is a humble suggestion for a reading order.

Figure: suggested reading order for this series.