# EDA and Clustering for an Automated Diet Chart

# Motivation

Health and wellness are complex, and achieving and maintaining them takes a holistic approach to your lifestyle. Food is an important pillar of health and fitness. The complexity grows as we move into the details of individual food items and try to classify them as healthy or unhealthy, especially under the constraints of an individual's health and fitness goals.

Food suggested for one fitness goal might not be optimal for another. For example, high-carbohydrate food might be preferred when weight gain is the goal, but the opposite often holds for weight loss. There are millions of food items in the world, and our task is to classify and suggest food items in light of a user's health and fitness goals.

# Problem Statement

The problem statement we had was to prepare a diet chart for users based on their goals. Every goal has its own calorie requirement and percentage split of the primary nutrients (**carbohydrate, fat, protein, and fibre**). In this context it made a lot of sense to group foods together based on those primary nutrients.

How did we do it? Let's just dive right in 🤿.

- Reading, understanding, and visualising the data
- Preparing the data for modelling
- Creating the model
- Verifying the accuracy of our model

Let's get started by reading and understanding the data.

## Loading and understanding data

A summary of the dataset is shown below.

The dataset had around 1,900 data points and 88 features, out of which we chose only a few attributes: **foodName, carbs, protein, fat, fibre, weight, calorie, saturatedFat, and volume.**

Several features in our dataset had missing values, for which there can be two reasons:

- The value was intentionally left out because some food items simply don't contain that nutrient. Here a missing value means zero, especially for features like carbs, fat, protein, and fibre.
- There was an error while collecting or entering the data, and the value was skipped.

For case 1, we can replace missing values with zero; for case 2, we can remove those data points, since imputation there may introduce bias. We could also use WOE (Weight of Evidence) to impute values.
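As a sketch (with toy rows; the real dataset has the same column names), the two treatments look like this in pandas. Which NaNs fall under case 1 versus case 2 is a per-column judgment call:

```python
import numpy as np
import pandas as pd

# Toy rows standing in for our food dataset (column names from the real data).
df = pd.DataFrame({
    "foodName": ["oats", "ghee", "milk"],
    "carbs":   [66.0, np.nan, 5.0],        # NaN means "contains none" -> case 1
    "protein": [17.0, 0.0, 3.4],
    "saturatedFat": [1.2, 62.0, np.nan],   # NaN from a data-entry gap -> case 2
})

# Case 1: a missing macro-nutrient simply means zero content.
df[["carbs", "protein"]] = df[["carbs", "protein"]].fillna(0)

# Case 2: drop rows where a value was genuinely skipped,
# since imputing it could bias the clusters.
df = df.dropna(subset=["saturatedFat"]).reset_index(drop=True)
```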

We also noticed that some foods in our data were liquid and some were solid, so solid foods were measured by weight and liquids by volume. There were therefore cases where the weight cell was empty but the volume column had a value. Hence, we came up with a derived metric combining weight and volume.
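One simple way to build that combined metric is a fallback from one column to the other (a sketch; treating a millilitre of liquid on a par with a gram of solid is an assumption here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "foodName": ["rice", "milk", "juice"],
    "weight": [100.0, np.nan, np.nan],   # grams, filled for solid foods
    "volume": [np.nan, 250.0, 200.0],    # millilitres, filled for liquids
})

# Derived metric: take the weight where present, otherwise fall back to volume.
df["quantity"] = df["weight"].fillna(df["volume"])
```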

Next, the calorie value of each food item included calories from all mineral and nutrient components, although in minuscule amounts. Since we are concerned with only a few of those nutritional components, we recalculated the calorie value with the standard formula below; this was our other derived metric.

`calorie = 4*carbs + 9*fat + 4*protein`
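In code (the function name is ours), the derived calorie metric is just:

```python
def calorie_from_macros(carbs, fat, protein):
    """Standard Atwater factors: 4 kcal/g for carbs and protein, 9 kcal/g for fat."""
    return 4 * carbs + 9 * fat + 4 * protein

# e.g. 30 g carbs, 10 g fat, 20 g protein:
calorie_from_macros(30, 10, 20)  # 4*30 + 9*10 + 4*20 = 290 kcal
```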

# Standardising Values

Features like carbs, fat, protein, and fibre are measured in grams, but for our analysis we need to convert and standardise them to calorie equivalents. Since fibre does not contribute calories, we instead convert it to its content per unit weight/volume of the food item.

It is very important for a clustering algorithm that the features not be strongly correlated. But as we can see from the heatmap, fat, carbs, and protein all rise with calories. To remove that correlation, we took each nutrient as a ratio of the calculated calorie value.
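A sketch of that de-correlation step (toy numbers; in our pipeline the same ratio was taken over the full dataset):

```python
import pandas as pd

# Two toy foods with macro-nutrients in grams.
df = pd.DataFrame({"carbs": [30.0, 5.0], "fat": [10.0, 20.0], "protein": [20.0, 10.0]})

# Convert grams to calorie equivalents (4/9/4 kcal per gram)...
cal = df * pd.Series({"carbs": 4.0, "fat": 9.0, "protein": 4.0})
cal["calorie"] = cal.sum(axis=1)

# ...then divide by the calculated calorie, so each row becomes a set of
# composition fractions that no longer scales with total calories.
for col in ["carbs", "fat", "protein"]:
    cal[col] = cal[col] / cal["calorie"]
```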

Now that our data is clean and the correlations are handled, let's move to the next step: the modelling.

# Clustering

**What is Clustering?**

Clustering is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

It is **unsupervised learning**, as we don't provide any labels to the data; we are trying to divide it into subgroups based only on the features provided.

## What is K-Means Clustering?

**K-Means** is a centroid-based, or distance-based, algorithm, where we calculate distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid. The algorithm tries to make intra-cluster data points as similar as possible while keeping the clusters as far apart as possible.

It involves the following steps:

1. Choose the number of clusters, say **k**. This is the K in K-Means.
2. Select k random points in the data as the initial centroids.
3. Measure the distance between the first data point and the k centroids.
4. Assign that point to the nearest cluster, then repeat steps 3 and 4 for the rest of the data points. Once every point is in a cluster, move on.
5. Recalculate the centroid of each cluster as the mean of its points.
6. Measure the distances from the new centroids and repeat steps 3 to 5. Once an iteration no longer changes the clustering at all, we are done.
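The steps above can be sketched from scratch in NumPy (a toy implementation for illustration only; in practice a library implementation such as scikit-learn's `KMeans` adds smarter initialisation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3-4: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: each new centroid is the mean of its cluster's points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of points -> two clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X, k=2)
```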

We can assess the quality of a clustering by adding up the variation within each cluster. Since K-Means can't know the best clustering in advance, its only option is to keep track of the clusters and their total variance, and redo the whole thing with different starting points.

Since K-Means relies heavily on distances, it is very important that our features be scaled to a mean around zero and unit standard deviation. The best feature-scaling technique we can use in this case is **standardisation**.
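Standardisation is one line per feature, z = (x - mean) / std (a sketch equivalent to scikit-learn's `StandardScaler`):

```python
import numpy as np

def standardise(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features on very different scales become comparable after scaling.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Z = standardise(X)
```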

The next question is: **what should the value of k be?**

For this we use the **elbow curve** method. It gives a good idea of what k should be, based on the sum of squared distances (SSE). We pick k at the spot where the SSE starts to flatten out, forming an elbow.

Plotting SSE against k, we get the curve above. From it we can say that the optimal number of clusters, and hence the value of k, should be around 4.
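The elbow curve comes from a loop over candidate values of k. Here is a sketch on synthetic data (our real feature matrix isn't reproduced here), using scikit-learn's `KMeans` and its `inertia_` attribute for the SSE:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for our scaled nutrient features: four well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 4)) for c in (0, 3, 6, 9)])

sse = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centroid

# Plotting k against sse[k] gives the elbow curve; on this data the
# drop in SSE flattens sharply after k = 4.
```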

# Analysis of clustering

We are using **silhouette analysis** to understand the performance of our clustering. It measures the degree of separation between clusters. For each sample:

- Compute the average distance to all other data points in the same cluster (ai).
- Compute the average distance to all data points in the closest other cluster (bi).
- Compute the coefficient:

si = (bi - ai) / max(ai, bi)

The coefficient can take values in the interval [-1, 1].

- If it is 0, the sample is very close to the neighbouring clusters.
- If it is 1, the sample is far away from the neighbouring clusters.
- If it is -1, the sample has been assigned to the wrong cluster.

Therefore, we want the coefficients to be as large as possible and close to 1 to have good clusters. Let's analyse the silhouette scores in our case.

We get the result:

```
{2: 0.31757107035913174, 3: 0.34337412758235525, 4: 0.3601443169380033,
 5: 0.2970926954241235, 6: 0.29883645610373294, 7: 0.3075310165352718,
 8: 0.313105441606524, 9: 0.2902622193837789, 10: 0.29641563619062317}
```

We can clearly see that k = 4 gives the highest silhouette score, so 4 is a good choice for the optimal value of k.
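The score table above came from a loop like the following, sketched here on synthetic data with scikit-learn's `silhouette_score` (the real run used our scaled nutrient features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for our scaled nutrient data: four groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 4)) for c in (0, 3, 6, 9)])

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```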

Once we had k, we ran K-Means and formed our clusters.

Next comes prediction. Say we get a nutrient composition for a specific goal. We scale that data into the format our model accepts and predict the cluster for the given composition.

```python
import numpy as np

# Predict the cluster for the goal's (scaled) nutrient composition.
y_pred = model.predict([food_item])

# Indices of all foods that landed in that cluster.
label_index = np.where(model.labels_ == y_pred[0])
```

Once we have `label_index`, we filter those foods out of our data and compute the Euclidean distance from each of them to the given composition:

```python
dist = [np.linalg.norm(df[index] - food_item) for index in label_index[0]]
```
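End to end, the prediction step looks like this self-contained sketch (synthetic data; variable names mirror the snippets above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled food-composition matrix and the model.
rng = np.random.default_rng(0)
df = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 4)) for c in (0, 3, 6, 9)])
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(df)

# Nutrient composition derived from a user's goal, scaled the same way.
food_item = np.array([3.1, 2.9, 3.0, 3.2])

# Predict the cluster, collect its members, and rank them by distance.
y_pred = model.predict([food_item])
label_index = np.where(model.labels_ == y_pred[0])
dist = [np.linalg.norm(df[index] - food_item) for index in label_index[0]]
closest = label_index[0][int(np.argmin(dist))]  # best-matching food's row index
```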

This way, we obtain the food items most closely related to the provided composition, and we can prepare the diet the way we want. For example, if we want to further filter the clustered results into veg/non-veg categories, we can apply those filters too.

The above content is an outcome of our experience working on this problem statement.

Please feel free to reach out and comment with any feedback or suggestions.