Introduction

For those looking to minimize time spent searching for recipes, Google is simply not sufficient. It fails to return succinct results and does not adequately support search personalization, such as filtering out recipes that include a certain ingredient. In our project, we use object detection models and large language models to enhance recipe search and provide assistance throughout the cooking process. Samsung, a major fridge retailer, has been developing fridges with internal cameras. Paired with artificial intelligence (AI) image classifiers, these cameras recognize what items are stored, build an internal inventory, and then generate personalized recipe recommendations. These fittingly named ‘smart fridges’ can also learn about the user, such as how quickly they go through an ingredient, and can create shopping lists based on the internal inventory. These features are powered by Samsung Food, “the ultimate cooking app for recipe saving, meal planning, grocery shopping, and recipe sharing”. Launched in 2023 after several years of development, the technology has been integrated into Samsung's fridges. As novel as these smart fridges may be, the average consumer may not want to spend thousands of dollars on a new fridge. A free application that simplifies and optimizes recipe search can reach a much wider range of people.

We hope to create an interactive application where a user can input pictures of their groceries and communicate with our AI chatbot for guidance in generating recipe ideas and preparing the dish. Our project performs the following steps in data analysis and result production:

  1. Object detection on a user-provided image
  2. Search of a recipe database using the ingredient list returned from the previous step
  3. Display of candidate recipes, with recipe summarization and further interaction
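The three steps above can be sketched as a simple pipeline. The function names and stub bodies below are illustrative assumptions standing in for the real components (GPT-4 Vision, the recipe search, and the summarization chatbot), not our actual implementation.

```python
# Illustrative pipeline sketch; each stage is a stub for the real component.

def detect_ingredients(image_path: str) -> list[str]:
    # Stage 1: object detection on the user's image (stubbed here).
    return ["fettuccine", "parmesan", "cream", "garlic"]

def search_recipes(ingredients: list[str]) -> list[dict]:
    # Stage 2: query a recipe database with the detected ingredients (stubbed).
    return [{"title": "Fettuccine Alfredo", "ingredients": ingredients}]

def present_recipes(recipes: list[dict]) -> str:
    # Stage 3: summarize candidates for display and further chat interaction.
    return "\n".join(r["title"] for r in recipes)

recipes = search_recipes(detect_ingredients("groceries.jpg"))
print(present_recipes(recipes))
```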

Methods

Pre-Processing

Aside from merging the datasets as previously mentioned, little additional pre-processing was needed. We used a named entity recognition (NER) model to extract, from each recipe, the key ingredients that cannot be substituted with other ingredients. For example, in fettuccine alfredo, parmesan cannot be replaced with cheddar, and the pasta used should be fettuccine.
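To illustrate the shape of this step, here is a simplified rule-based stand-in for the NER extraction: a hand-written lexicon marks which ingredients are "key" (non-substitutable). The lexicon and example recipe are assumptions for illustration only; the actual pipeline used a trained NER model.

```python
# Simplified stand-in for the NER step: a lexicon marks non-substitutable
# ingredients. The real pipeline used a trained NER model instead.

KEY_INGREDIENTS = {"fettuccine", "parmesan"}  # assumed lexicon, not real model output

def extract_key_ingredients(recipe_ingredients: list[str]) -> list[str]:
    keys = []
    for item in recipe_ingredients:
        # Keep an ingredient line if any key term appears in its text.
        if any(term in item.lower() for term in KEY_INGREDIENTS):
            keys.append(item)
    return keys

print(extract_key_ingredients(
    ["8 oz fettuccine", "1 cup parmesan cheese", "2 tbsp butter", "salt"]
))  # → ['8 oz fettuccine', '1 cup parmesan cheese']
```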

Object Detection Model

The model we used to identify individual ingredients in an image is the GPT-4 Vision model. We also tried two open-source models, ResNet-50 and YOLO; however, GPT-4 Vision performed significantly better, accurately identifying almost every ingredient in the images we tested. A limitation of the model is that it cannot estimate ingredient quantities beyond counting discrete items.
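A request to a vision-capable chat model takes roughly the following shape: the image is base64-encoded into a data URL and sent alongside a text prompt. The prompt wording and model name below are assumptions, and the actual network call is shown only as a comment.

```python
import base64

def build_vision_request(image_bytes: bytes,
                         model: str = "gpt-4-vision-preview") -> dict:
    # Encode the grocery photo as a data URL, as the chat-completions
    # vision input expects; prompt wording and model name are assumptions.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List every food ingredient visible in this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

payload = build_vision_request(b"\xff\xd8fake-jpeg-bytes")
# The request would then be sent with the OpenAI client, e.g.:
# client.chat.completions.create(**payload)
```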

Search

We tested three different options for our search model: semantic search, TF-IDF, and exact keyword matching, and evaluated them qualitatively. We compared how each search algorithm performed by checking the ingredients in the returned recipe against our list of input ingredients. From our tests, we found that semantic search made too many assumptions about our inputs: it returned recipes that used our ingredients but also required many other ingredients not included in the input. Exact matching with `str.contains()` had the opposite effect; the returned recipes matched our input but were extremely simple, often beverages requiring only 2-3 ingredients. TF-IDF sat in the middle of the spectrum: the recipes matched our input list but made some ingredient substitutions. To correct for this effect, we employed an NER model to extract key ingredients that must be included in the returned recipes.
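The TF-IDF option can be sketched in pure Python over a toy corpus. The recipes below are invented examples, and in practice a library such as scikit-learn would handle the vectorization; this sketch only shows how term-frequency/inverse-document-frequency weighting ranks recipes against an ingredient query.

```python
import math
from collections import Counter

# Toy corpus: each "document" is a recipe's ingredient text (invented examples).
RECIPES = {
    "Fettuccine Alfredo": "fettuccine parmesan cream butter garlic",
    "Lemonade": "lemon sugar water",
    "Grilled Cheese": "bread cheddar butter",
}

def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    # Term frequency weighted by inverse document frequency.
    tokenized = {name: text.split() for name, text in docs.items()}
    n = len(tokenized)
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    return {
        name: {t: (c / len(toks)) * math.log(n / df[t])
               for t, c in Counter(toks).items()}
        for name, toks in tokenized.items()
    }

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query: str) -> list[str]:
    # Vectorize the query alongside the corpus, then sort by similarity.
    vecs = tfidf_vectors(RECIPES | {"_query": query})
    q = vecs.pop("_query")
    return sorted(vecs, key=lambda name: cosine(q, vecs[name]), reverse=True)

print(rank("fettuccine parmesan butter"))
```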

Results

Search Evaluation

Search is inherently hard to evaluate due to two conflicting factors: usefulness and accuracy. To account for accuracy, we calculated cosine similarity scores between our ingredient query list and the ingredients used in the recipe. However, we also felt it was important to qualitatively examine the results of our search based on whether the returned recipe could feasibly be cooked with the ingredients in the query. We tested a number of different ingredient queries, counted how many matches there were between the recipe's ingredients and the query's ingredients, and calculated a "feasibility" score. These matches are judged subjectively by asking: with only the ingredients in the query, will the final product closely resemble the target recipe? This subjective evaluation allows for ingredient substitution and takes into account common spices or condiments found in a typical kitchen, such as salt, pepper, sugar, and water.
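The objective core of the feasibility score can be sketched as the fraction of a recipe's ingredients covered by the query, after discounting pantry staples. The staple list and the example sets below are assumptions, and this sketch omits the subjective substitution judgment described above.

```python
# Feasibility sketch: fraction of a recipe's ingredients covered by the
# query, after discounting pantry staples assumed to be in any kitchen.
PANTRY_STAPLES = {"salt", "pepper", "sugar", "water", "oil"}  # assumed list

def feasibility(query: set[str], recipe_ingredients: set[str]) -> float:
    needed = recipe_ingredients - PANTRY_STAPLES
    if not needed:
        return 1.0
    return len(needed & query) / len(needed)

score = feasibility(
    {"fettuccine", "parmesan", "cream", "garlic"},
    {"fettuccine", "parmesan", "cream", "butter", "salt", "pepper"},
)
print(round(score, 2))  # → 0.75 (3 of 4 non-staple ingredients on hand)
```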

We created four sample ingredient queries to evaluate our search. Typically, the cosine similarity scores were considerably lower than our feasibility scores.

Figure: bar chart comparing cosine similarity and feasibility scores across the four sample queries.

Application

We used Streamlit to host our application because its built-in widgets align with our product goals: users can upload their own images or use camera input directly within the Streamlit site. Streamlit served primarily as the front end for user interaction and experience, and it is also where we hosted our GPT-4 chatbot.
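The skeletal shape of such a front end is sketched below. The import is guarded so the prompt-building helper can run without Streamlit installed; the helper itself, its wording, and the widget arrangement are hypothetical illustrations rather than our app's actual code.

```python
# Skeleton of a Streamlit front end; the prompt-building helper is a
# hypothetical illustration, not the app's actual code.
try:
    import streamlit as st
except ImportError:  # lets the helper below run without Streamlit installed
    st = None

def ingredients_prompt(ingredients: list[str]) -> str:
    # Turn detected ingredients into a chatbot prompt (assumed wording).
    return "Suggest recipes using: " + ", ".join(sorted(ingredients))

if st is not None:
    st.title("PicToPlate")
    # Users can upload a photo or take one with the in-browser camera.
    image = st.file_uploader("Upload a grocery photo") or st.camera_input("Or take one")
    if image is not None:
        st.image(image)
        # ...object detection and recipe search would run here...
        st.chat_message("assistant").write("Here are some recipe ideas!")
```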

Conclusion

Our project, PicToPlate, addresses the practical need to simplify and optimize the recipe search experience for home cooks. By leveraging technologies such as GPT-4 Vision, fine-tuned search models, and retrieval-augmented generation, we have developed a solution that allows users to input images of their groceries, receive personalized recipe recommendations, and obtain real-time guidance through an AI chatbot. We are happy with what we accomplished in 10 short weeks.

Throughout the development process, we encountered challenges such as accurately detecting ingredient quantities, balancing search accuracy with usefulness, and considering UI/UX when building an application. Additionally, due to how multifaceted and ambitious our project was, it was difficult to decide how to allocate time and effort to the different parts of the whole task.

In the future, there are many improvements to be made to our product. We could develop our own object detection model specifically suited to recognizing food items. We would also like to continue fine-tuning our search to achieve higher cosine similarity scores. Lastly, there are many UI/UX elements to improve, such as decreasing loading and wait times in the application.