EDA and Clustering for automated diet chart


Health and wellness are complex and need a holistic approach to your lifestyle to achieve and maintain. Food is an important pillar of health and fitness. The complexity increases as we move into the details of classifying food items as healthy or unhealthy, especially under the constraint of an individual’s health and fitness goals.

Food that is suggested for one fitness goal might not be optimal for another. For example, high-carbohydrate food might be preferred when weight gain is the goal, whereas the opposite may hold for weight loss. There is a vast number of food items in the world, and our task is to classify and suggest food items in light of the user’s health and fitness goals.

Problem Statement

Our problem was to prepare a diet chart for users based on their goals. Every goal has its own calorie requirement and its own percentages of the primary nutrients (carbohydrate, fat, protein, and fibre). In this context it made sense to group foods together based on those primary nutrients.

How did we do it? Let’s just dive right in 🤿.

Let’s get started by Reading and Understanding Data.

Loading and understanding data

A summary of the data is shown below.

The dataset had around 1900 data points and 88 features, of which we chose only a few attributes: foodName, carbs, protein, fat, fibre, weight, calorie, saturatedFat, and volume.

Several features in our dataset had missing values, for which there can be two reasons:

For case 1, we can replace the missing values with zero; for case 2, we can remove those data points, since imputation in that case may introduce bias. We could also use WOE (Weight of Evidence) to impute values.

We also noticed that some foods in our data were liquid and some were solid, so solid foods were measured by weight and liquid foods by volume. There were therefore cases where the weight cell was empty but the volume column had a corresponding value. Hence, we came up with a derived metric that combines weight and volume.

Next, the calorie value of each food item contained calories from all the mineral and nutrient components, although some in very minuscule amounts. Since we are concerned with only a few of those nutritional components, we calculated calorie according to a standard formula, and this was our other derived metric.

calorie = 4*carbs + 9*fat + 4*protein
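As a minimal sketch of the two derived metrics, assuming the data sits in a pandas DataFrame: the column names `quantity` and `calcCalorie` are hypothetical, the toy rows are illustrative only, and taking weight where present and volume otherwise is just one plausible way to combine the two columns.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only).
foods = pd.DataFrame({
    "foodName": ["rice", "milk", "almonds"],
    "carbs":    [28.0, 5.0, 22.0],
    "protein":  [2.7, 3.4, 21.0],
    "fat":      [0.3, 3.3, 49.0],
    "weight":   [100.0, np.nan, 100.0],   # filled for solid foods
    "volume":   [np.nan, 100.0, np.nan],  # filled for liquid foods
})

# Hypothetical combined measure: use weight where it exists, else volume.
foods["quantity"] = foods["weight"].fillna(foods["volume"])

# Calorie from the three macronutrients only, following the formula above.
foods["calcCalorie"] = 4 * foods["carbs"] + 9 * foods["fat"] + 4 * foods["protein"]
```

Rows where both weight and volume are missing would still have an empty `quantity` and would need separate handling.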

Standardising Values

Features like carbs, fat, protein and fibre are in grams, but for our analysis we need to convert and standardise them to their calorie equivalents. Since fibre does not contribute to calorie, we convert it to its content per unit weight/volume of the food item.

It is important in clustering for the features not to be strongly correlated. But as we can see from the heatmap, as calorie increases so do fat, carbs and protein. To remove that correlation, we took the ratio of each to the calculated calorie.
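The ratio step might look roughly like the sketch below. The output column names (`carbsShare`, `proteinShare`, `fatShare`, `fibrePerUnit`) and the `quantity` column are assumptions made for illustration, and the numbers are toy values.

```python
import pandas as pd

# Toy rows (illustrative values); "quantity" is a hypothetical weight/volume column.
foods = pd.DataFrame({
    "carbs": [28.0, 5.0], "protein": [2.7, 3.4], "fat": [0.3, 3.3],
    "fibre": [0.4, 0.0], "quantity": [100.0, 100.0],
})
calorie = 4 * foods["carbs"] + 9 * foods["fat"] + 4 * foods["protein"]

features = pd.DataFrame({
    # Share of the calculated calorie coming from each macronutrient.
    "carbsShare":   4 * foods["carbs"] / calorie,
    "proteinShare": 4 * foods["protein"] / calorie,
    "fatShare":     9 * foods["fat"] / calorie,
    # Fibre carries no calorie here, so express it per unit weight/volume.
    "fibrePerUnit": foods["fibre"] / foods["quantity"],
})
```

Rows with zero calculated calorie (water, for instance) would need to be handled before dividing.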

Heatmap of Nutrient components

Now that our data is clean and the correlations are handled, let’s move to the next step, i.e. ‘The Modelling’.


What is Clustering?

Clustering is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression and computer graphics.

It is a form of unsupervised learning, as we don’t provide any labels for the data and instead try to divide the data into subgroups based on the features provided.

What is K-means Clustering?

K-Means is a centroid-based, or distance-based, algorithm in which we calculate distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid. It tries to make the data points within a cluster as similar as possible while keeping the clusters as far apart as possible.

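It involves the following steps:

  • Choose the number of clusters k.
  • Initialise k centroids, for example by picking k data points at random.
  • Assign each data point to its nearest centroid.
  • Recompute each centroid as the mean of the points assigned to it.
  • Repeat the assignment and update steps until the assignments stop changing.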

We can assess the quality of a clustering by adding up the variation within each cluster. Since K-Means cannot know whether it has found the best clustering, its only option is to keep track of the clusters and their total variance, and then do the whole thing over again with different starting points.
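To make the restart idea concrete, here is a minimal NumPy sketch of standard K-means (Lloyd's algorithm) run from several random starts, keeping the run with the lowest total within-cluster variation. The function names `kmeans_once` and `kmeans_best_of` are ours, and this is an illustration of the general technique rather than the exact code behind our model.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid (Euclidean) for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_once(X, k, rng, n_iter=100):
    """One K-means run from a random start; returns (labels, centroids, sse)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)       # assignment step
        new_centroids = np.array([                    # update step
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):     # converged
            break
        centroids = new_centroids
    labels = nearest_centroid(X, centroids)
    sse = float(((X - centroids[labels]) ** 2).sum())  # total within-cluster variation
    return labels, centroids, sse

def kmeans_best_of(X, k, n_restarts=10, seed=0):
    """Repeat K-means from different starting points and keep the lowest-SSE run."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

# Two well-separated synthetic blobs as a toy check.
demo_rng = np.random.default_rng(1)
X = np.vstack([demo_rng.normal(0, 0.1, (20, 2)), demo_rng.normal(5, 0.1, (20, 2))])
labels, centroids, sse = kmeans_best_of(X, k=2)
```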

Since K-Means relies heavily on distance, it is very important for our features to be scaled to a mean of around zero and unit standard deviation. The feature scaling technique that fits this case is standardisation.

The next question is: what should the value of k be?

For this we use the elbow curve method. It gives a good idea of what k should be, based on the sum of squared errors (SSE), i.e. the sum of squared distances of points from their cluster centroids. We pick k at the spot where the SSE starts to flatten out, forming an elbow.
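Standardisation itself is a one-liner. The sketch below does it in plain NumPy on toy values; scikit-learn's `StandardScaler` performs the same computation with the population standard deviation. The array contents are made up for illustration.

```python
import numpy as np

# Toy feature matrix: one row per food, one column per feature.
X = np.array([[0.90, 0.05, 0.05],
              [0.20, 0.30, 0.50],
              [0.40, 0.40, 0.20]])

# Keep the per-column statistics so new inputs can be scaled the same way later.
mean = X.mean(axis=0)
std = X.std(axis=0)

X_scaled = (X - mean) / std  # each column now has mean 0 and unit standard deviation
```

The stored `mean` and `std` matter at prediction time: a new composition has to be scaled with the statistics of the data the model was fitted on, not with its own.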

Elbow Curve

We get the above curve. From this we can say that the optimal number of clusters, i.e. the value of k, should be around 4.
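The numbers behind an elbow curve can be collected as in the sketch below, here on synthetic data with four obvious groups rather than our food dataset. It assumes scikit-learn's `KMeans`, whose `inertia_` attribute holds the SSE of the fitted clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: four blobs, so the elbow is expected near k = 4.
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
X = np.vstack([c + rng.normal(0, 0.5, (50, 2)) for c in centres])

sse = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_  # sum of squared distances to the closest centroid
```

Plotting `sse` against k (with matplotlib, for example) gives the curve; the elbow is where adding one more cluster stops reducing the SSE by much.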

Analysis of clustering

We use silhouette analysis to understand the performance of our clustering. It can be used to determine the degree of separation between clusters. For each sample i:

  • Compute the average distance to all other data points in the same cluster (a_i).
  • Compute the average distance to all data points in the closest neighbouring cluster (b_i).
  • Compute the coefficient: s_i = (b_i - a_i) / max(a_i, b_i)

The coefficient can take values in the interval [-1, 1].

  • If it is 0 –> the sample is very close to the neighboring clusters.
  • If it is 1 –> the sample is far away from the neighboring clusters.
  • If it is -1 –> the sample is assigned to the wrong cluster.

Therefore, we want the coefficients to be as large as possible, close to 1, to have good clusters. Let’s analyse the silhouette score in our case.
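For intuition, the coefficient can be computed by hand. The sketch below is a small NumPy version of the standard definition, s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance to the other points of the sample's own cluster and b_i is the mean distance to the points of the nearest other cluster. The function name is ours, and it assumes at least two clusters; in practice scikit-learn's `silhouette_score` does the same job.

```python
import numpy as np

def silhouette_samples(X, labels):
    """Silhouette coefficient for every sample (needs at least two clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # leave the point itself out of a_i
        if not same.any():
            continue                         # singleton cluster: score stays 0
        a = dists[i, same].mean()
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight, distant pairs: a sensible labelling and a deliberately bad one.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_samples(X, [0, 0, 1, 1])
bad = silhouette_samples(X, [0, 1, 0, 1])
```

Averaging the per-sample values gives the single silhouette score reported for each k.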

We get the following result:

{2: 0.31757107035913174, 3: 0.34337412758235525, 4: 0.3601443169380033, 5: 0.2970926954241235, 6: 0.29883645610373294, 7: 0.3075310165352718, 8: 0.313105441606524, 9: 0.2902622193837789, 10: 0.29641563619062317}

We can clearly see that k = 4 has the highest silhouette score. Hence 4 is a good choice of k for us.

Once we had k, we ran K-Means and formed our clusters.

Next comes prediction. Let’s say we get the nutrition composition for a specific goal. We scale that data into the format that our model accepts and predict the cluster of the given composition.

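# Predict the cluster of the (already scaled) target composition
y_pred = model.predict([food_item])
# Indices of the foods that were assigned to that cluster during fitting
label_index = np.where(model.labels_ == y_pred[0])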

Once we have label_index, we filter those foods out of our data and calculate the Euclidean distance of each food item from the given composition.

# Distance from each food in the cluster to the target composition
dist = [np.linalg.norm(df[index] - food_item) for index in label_index[0]]

This way, we get the food items that are most closely related to the provided composition, and hence we can prepare the diet the way we want. For example, if we want to further filter the results from clustering by veg/non-veg type, we can apply that filter.
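Putting the prediction and distance steps together, an end-to-end sketch could look like this. It assumes a scikit-learn `KMeans` model fitted on standardised features; the food names, the nutrient shares and the helper `closest_foods` are made up for illustration and are not our actual data or code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative foods; columns are carbs, protein and fat share of calories.
food_names = np.array(["rice", "bread", "chicken", "egg white", "butter", "almonds"])
X = np.array([
    [0.90, 0.08, 0.02],
    [0.80, 0.13, 0.07],
    [0.00, 0.80, 0.20],
    [0.05, 0.90, 0.05],
    [0.00, 0.01, 0.99],
    [0.13, 0.13, 0.74],
])

# Standardise, remembering the statistics for later inputs.
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

def closest_foods(target, top_n=2):
    """Names of the foods in the target's cluster, nearest first."""
    food_item = (np.asarray(target, dtype=float) - mean) / std   # same scaling as training
    cluster = model.predict([food_item])[0]
    members = np.where(model.labels_ == cluster)[0]              # foods in that cluster
    dist = np.linalg.norm(X_scaled[members] - food_item, axis=1)
    order = members[np.argsort(dist)]
    return food_names[order][:top_n].tolist()

# A carbohydrate-heavy target composition.
suggestions = closest_foods([0.85, 0.10, 0.05])
```

Any further filter, such as veg/non-veg, can then be applied to the returned list.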

The above content is an outcome of our experience working on the above problem statement.

Please feel free to reach out and comment with any feedback or suggestions.

Data Scientist — Voyager— Gamer(PS)
