This project focuses on data gathered by the public health agency of France. The dataset and its description can be found on the Open Food Facts website.

The main tasks of the project were to:

1) Process the dataset to identify relevant variables for further processing. Automate these processes to avoid repeating these operations. The program should work if the database is slightly modified (e.g. adding entries).

2) Throughout the analysis, produce visualizations to better understand the data. Perform a univariate/multivariate analysis for each variable of interest to summarise its behavior.

3) Confirm or refute hypotheses using descriptive and explanatory multivariate analysis. Perform appropriate statistical tests (one descriptive, one explanatory) to check the significance of the results.

The figure below summarizes what is done in this work.

Figure 1: The tasks for the Future Vision Transport project

Nevertheless, the first thing to do was to select a problem and describe it.

Problem description:

We all have our own preferences when it comes to food. Some of us prefer meat or fish, others cheese, and still others beverages. In addition, we wish to maintain our health by consuming the best foods.

It is, however, difficult to make an informed decision due to the wide variety of these foods accompanied by a myriad of nutrition facts.

Below are some nutrition facts and information to assist you in making the right food choices:

  • Calories: when we consume more calories than we burn, our bodies store the excess as body fat. We may gain weight if this trend continues.

  • Fats: your body needs fats to provide energy to your cells and support their function. They also protect your organs and help keep your body warm. In addition, fats help the body absorb some nutrients and produce important hormones.

  • Carbohydrates are your body’s main source of energy: They help fuel your brain, kidneys, heart muscles, and central nervous system. For instance, fiber is a carbohydrate that aids digestion, helps you feel full, and keeps blood cholesterol levels in check.

  • Sugars: a high sugar intake can lead to higher blood pressure, inflammation, weight gain, diabetes, and fatty liver disease, all of which are linked to an increased risk of heart attack and stroke.

  • Sodium: the human body requires a small amount of sodium to conduct nerve impulses, contract and relax muscles, and maintain the proper balance of water and minerals. It is estimated that we need about 500 mg of sodium daily for these vital functions.

We need to take control of what we eat. People with an active lifestyle may require more energy, whereas people with a less active life, such as those with a sedentary job, need to eat less fatty food. Others may want or need to avoid foods containing certain nutrients, such as sugars, salt, or additives.

In my data analysis project, I propose to analyze food nutrition facts. I also propose an application that helps users understand these facts and decide which food to choose.

Importing libraries

First import libraries.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

import scipy.stats as st

from sklearn import decomposition
from sklearn import preprocessing
from functions import *
from utils import *

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

Loading dataset

df = pd.read_csv('data/fr_openfoodfacts_products.csv', sep = '\t', encoding='utf-8', decimal='.', low_memory=False)
#df = data.copy()
df.head(5)
code url creator created_t created_datetime last_modified_t last_modified_datetime product_name generic_name quantity ... ph_100g fruits-vegetables-nuts_100g collagen-meat-protein-ratio_100g cocoa_100g chlorophyl_100g carbon-footprint_100g nutrition-score-fr_100g nutrition-score-uk_100g glycemic-index_100g water-hardness_100g
0 0000000003087 http://world-fr.openfoodfacts.org/produit/0000... openfoodfacts-contributors 1474103866 2016-09-17T09:17:46Z 1474103893 2016-09-17T09:18:13Z Farine de blé noir NaN 1kg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0000000004530 http://world-fr.openfoodfacts.org/produit/0000... usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Banana Chips Sweetened (Whole) NaN NaN ... NaN NaN NaN NaN NaN NaN 14.0 14.0 NaN NaN
2 0000000004559 http://world-fr.openfoodfacts.org/produit/0000... usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Peanuts NaN NaN ... NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN
3 0000000016087 http://world-fr.openfoodfacts.org/produit/0000... usda-ndb-import 1489055731 2017-03-09T10:35:31Z 1489055731 2017-03-09T10:35:31Z Organic Salted Nut Mix NaN NaN ... NaN NaN NaN NaN NaN NaN 12.0 12.0 NaN NaN
4 0000000016094 http://world-fr.openfoodfacts.org/produit/0000... usda-ndb-import 1489055653 2017-03-09T10:34:13Z 1489055653 2017-03-09T10:34:13Z Organic Polenta NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 162 columns

EDA

# Size of the dataset
print('Dataset size')
df.shape

# Look at the type of each variable
print('Type observed for each variable')
df.dtypes

print('Count of variable types')
df.dtypes.value_counts()
Dataset size

(320772, 162)

Type observed for each variable

code                        object
url                         object
creator                     object
created_t                   object
created_datetime            object
                            ...   
carbon-footprint_100g      float64
nutrition-score-fr_100g    float64
nutrition-score-uk_100g    float64
glycemic-index_100g        float64
water-hardness_100g        float64
Length: 162, dtype: object

Count of variable types

float64    106
object      56
dtype: int64

We observe:

  • Observations and variables: 320772, 162
  • Variable types: 56 qualitative, 106 quantitative

Filtering data. Cleaning data.

# Check the missing values by displaying their percentage per variable
dd = df.isna().mean().sort_values(ascending=True)*100

fig = plt.figure(figsize=(15, 10));
# Variables on the x axis, percentage of NaN on the y axis (consistent with the axis labels)
axes = sns.barplot(x=dd.index, y=dd.values);
axes.set_xticks([]);  # 162 variable names would be unreadable
axes.set_yticks([0, 20, 40, 60, 80, 100]);
plt.title('NaN Values on entire dataset',fontsize=25);
plt.xlabel('Variables',fontsize=15);
plt.ylabel('% of NaN values',fontsize=15);
del dd;
Figure 2: Missing values in dataset.

We observe:

  • 162 variables, many of them with a large share of NaNs. We should define an application and select only the variables it needs.

First, for more clarity, we keep only the variables that have less than 40% of NaNs.

var_verify = (df.isna().mean() < 0.4)
columns40 = list(df.columns[var_verify])
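Since the project asks for the processing to be automated, the threshold-based selection can be wrapped in a small function. The following is only a sketch on toy data: the function name `columns_below_nan_ratio` and the toy columns are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def columns_below_nan_ratio(frame: pd.DataFrame, max_ratio: float = 0.4) -> list:
    """Return the columns whose share of missing values is strictly below max_ratio."""
    nan_ratio = frame.isna().mean()  # fraction of NaN per column
    return list(nan_ratio[nan_ratio < max_ratio].index)


# Toy data: 'b' is 75% missing, 'c' is 25% missing, 'a' is complete.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})
kept = columns_below_nan_ratio(toy, max_ratio=0.4)
print(kept)
```

Because the threshold is a parameter, the same call keeps working if entries are added to the database.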

The purpose of my data analysis project is to analyze the nutritional information provided by food facts. Additionally, I propose a software application that helps users understand food nutrition facts and make food choices.

Although the dataset contains other important variables, the following features are the most important for our first data analysis:

  1. Product information:

    • code
    • creator
    • brands
    • product_name
    • countries_fr
    • ingredients_text
    • serving_size
    • additives_n
    • ingredients_from_palm_oil_n
    • ingredients_that_may_be_from_palm_oil_n
    • additives_tags
    • pnns_groups_2
  2. The nutritional values used to compute the Nutri-Score
    • energy_100g
    • fat_100g
    • saturated-fat_100g
    • carbohydrates_100g
    • sugars_100g
    • fiber_100g
    • proteins_100g
    • salt_100g
    • sodium_100g
    • fruits-vegetables-nuts_100g
  3. The Nutri-Score
    • nutrition_grade_fr
    • nutrition-score-fr_100g
    • nutrition-score-uk_100g
selected_columns = []

if 'fruits-vegetables-nuts_100g' not in columns40:
    columns40.append('fruits-vegetables-nuts_100g') # needed to compute the nutri-score

if 'pnns_groups_2' not in columns40:
    columns40.append('pnns_groups_2') # need to get categories

for c in columns40:
    if not(c.endswith('_datetime')) and not(c.endswith('_t')) and not(c.endswith('_tags')):
        selected_columns.append(c)

if 'additives_tags' not in selected_columns:
    selected_columns.append('additives_tags')
    
if 'additives' in selected_columns:
    selected_columns.remove('additives')

df_selected = df[selected_columns].copy();  # copy, so that later in-place edits do not act on a view of df

Now we have our data with the selected columns. The next step is to omit unnecessary columns.

Getting rid of unnecessary columns

cols_to_delete = ['states', 'states_fr', 'countries', 'url'] # 'nutrition-score-uk_100g'

for c in cols_to_delete:
    if c in df_selected.columns:
        df_selected.drop(c, inplace=True, axis=1)
    
# Size of the dataset
print('Dataset size')
df_selected.shape

# Look at the type of each variable
print('Type observed for each variable')
df_selected.dtypes

print('Count of variable types')
df_selected.dtypes.value_counts()
Dataset size

(320772, 25)

Type observed for each variable

code                                        object
creator                                     object
product_name                                object
brands                                      object
countries_fr                                object
ingredients_text                            object
serving_size                                object
additives_n                                float64
ingredients_from_palm_oil_n                float64
ingredients_that_may_be_from_palm_oil_n    float64
nutrition_grade_fr                          object
energy_100g                                float64
fat_100g                                   float64
saturated-fat_100g                         float64
carbohydrates_100g                         float64
sugars_100g                                float64
fiber_100g                                 float64
proteins_100g                              float64
salt_100g                                  float64
sodium_100g                                float64
nutrition-score-fr_100g                    float64
nutrition-score-uk_100g                    float64
fruits-vegetables-nuts_100g                float64
pnns_groups_2                               object
additives_tags                              object
dtype: object

Count of variable types


float64    15
object     10
dtype: int64

Let us now examine our data to see how many NaN values it contains.

plot_data(df_selected)
Figure 3: Plotting the data and getting the NaN values.

We observe lots of NaNs. However, fewer columns are selected and data analysis can be done.

First, we start with a data description in order to better understand the data.

Describing data

df_selected.describe()
additives_n ingredients_from_palm_oil_n ingredients_that_may_be_from_palm_oil_n energy_100g fat_100g saturated-fat_100g carbohydrates_100g sugars_100g fiber_100g proteins_100g salt_100g sodium_100g nutrition-score-fr_100g nutrition-score-uk_100g fruits-vegetables-nuts_100g
count 248939.000000 248939.000000 248939.000000 2.611130e+05 243891.000000 229554.000000 243588.000000 244971.000000 200886.000000 259922.000000 255510.000000 255463.000000 221210.000000 221210.000000 3036.000000
mean 1.936024 0.019659 0.055246 1.141915e+03 12.730379 5.129932 32.073981 16.003484 2.862111 7.075940 2.028624 0.798815 9.165535 9.058049 31.458587
std 2.502019 0.140524 0.269207 6.447154e+03 17.578747 8.014238 29.731719 22.327284 12.867578 8.409054 128.269454 50.504428 9.055903 9.183589 31.967918
min 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 -17.860000 -6.700000 -800.000000 0.000000 0.000000 -15.000000 -15.000000 0.000000
25% 0.000000 0.000000 0.000000 3.770000e+02 0.000000 0.000000 6.000000 1.300000 0.000000 0.700000 0.063500 0.025000 1.000000 1.000000 0.000000
50% 1.000000 0.000000 0.000000 1.100000e+03 5.000000 1.790000 20.600000 5.710000 1.500000 4.760000 0.581660 0.229000 10.000000 9.000000 23.000000
75% 3.000000 0.000000 0.000000 1.674000e+03 20.000000 7.140000 58.330000 24.000000 3.600000 10.000000 1.374140 0.541000 16.000000 16.000000 51.000000
max 31.000000 2.000000 6.000000 3.251373e+06 714.290000 550.000000 2916.670000 3520.000000 5380.000000 430.000000 64312.800000 25320.000000 40.000000 40.000000 100.000000

The count, mean, standard deviation, minimum, maximum, and quartiles are computed for the numerical variables. The minimum and maximum values make outliers easy to spot: for example, sugars_100g has a negative minimum (-17.86), and several _100g variables have maxima far above 100 g. These issues are addressed in the data cleaning step below.

df_selected.describe(include=[object])
code creator product_name brands countries_fr ingredients_text serving_size nutrition_grade_fr pnns_groups_2 additives_tags
count 320749 320770 303010 292360 320492 248962 211331 221210 94491 154680
unique 320749 3535 221347 58784 722 205520 25423 5 42 41537
top 0000000003087 usda-ndb-import Ice Cream Carrefour États-Unis Carbonated water, natural flavor. 240 ml (8 fl oz) d unknown en:e322
freq 1 169868 410 2978 172998 222 5496 62763 22624 8264

For categorical variables, we obtain the count, the number of unique values, the most frequent value, and its frequency. The code variable can be omitted in future analyses since it is unique for each product. Nevertheless, it is used to detect duplicates.

# Display the number of unique values of each variable
df_selected.nunique()
code                                       320749
creator                                      3535
product_name                               221347
brands                                      58784
countries_fr                                  722
ingredients_text                           205520
serving_size                                25423
additives_n                                    31
ingredients_from_palm_oil_n                     3
ingredients_that_may_be_from_palm_oil_n         7
nutrition_grade_fr                              5
energy_100g                                  3997
fat_100g                                     3378
saturated-fat_100g                           2197
carbohydrates_100g                           5416
sugars_100g                                  4068
fiber_100g                                   1016
proteins_100g                                2503
salt_100g                                    5586
sodium_100g                                  5291
nutrition-score-fr_100g                        55
nutrition-score-uk_100g                        55
fruits-vegetables-nuts_100g                   333
pnns_groups_2                                  42
additives_tags                              41537
dtype: int64
df_selected.isna().mean().sort_values(ascending=True)
creator                                    0.000006
code                                       0.000072
countries_fr                               0.000873
product_name                               0.055373
brands                                     0.088574
energy_100g                                0.185986
proteins_100g                              0.189699
salt_100g                                  0.203453
sodium_100g                                0.203599
ingredients_text                           0.223866
additives_n                                0.223938
ingredients_from_palm_oil_n                0.223938
ingredients_that_may_be_from_palm_oil_n    0.223938
sugars_100g                                0.236308
fat_100g                                   0.239675
carbohydrates_100g                         0.240620
saturated-fat_100g                         0.284370
nutrition_grade_fr                         0.310382
nutrition-score-fr_100g                    0.310382
nutrition-score-uk_100g                    0.310382
serving_size                               0.341180
fiber_100g                                 0.373742
additives_tags                             0.517788
pnns_groups_2                              0.705426
fruits-vegetables-nuts_100g                0.990535
dtype: float64

Data cleaning

The next step is to clean up the data. Data quality and reliability are improved by identifying, correcting, and removing errors, inconsistencies, and inaccuracies. In this process, problems such as missing or duplicate data, incorrect data formatting, outliers, and inconsistencies in the data relationships are identified and resolved. In data preparation and analysis, data cleaning ensures that the data used in a study is accurate, complete, and reliable. Cleansing data is intended to produce high-quality data that can be analyzed in a reliable and meaningful manner.

Quantitative variables

Let us look at our quantitative variables, also known as numerical variables. These are variables that represent measurable quantities or numerical values. They are typically measured using numerical scales or units of measurement.

Removing rows where code is NaN

First, we delete all the rows where the code variable does not have a value.

df_selected = df_selected[~df_selected.code.isna()]

Verify duplicates by code

Next, we check for duplicates of the code variable.

## Check for duplicated values on the same code
df_selected.duplicated(['code']).sum()
0

We see that there are no duplicates for this variable.

Dropping the feature code

Finally, we drop the variable code that does not serve us anymore.

df_selected.drop(['code'], inplace=True, axis=1)

Delete observations with empty nutritional values

We delete the observations for which all the nutritional values are missing.

df_selected = df_selected[~(df_selected.energy_100g.isna() & df_selected.proteins_100g.isna() & df_selected.sugars_100g.isna() & df_selected.fat_100g.isna() &
           df_selected['saturated-fat_100g'].isna() & df_selected.fiber_100g.isna() & df_selected.sodium_100g.isna() & df_selected['fruits-vegetables-nuts_100g'].isna())]

Identifying and correcting outliers

Here, we identify and delete outliers. Nutrient values per 100 g that are negative or larger than 100 g are outliers, and we discard the corresponding observations.

mask = ~((df_selected.fiber_100g<0) | (df_selected.fiber_100g>100) |
         (df_selected.salt_100g<0) | (df_selected.salt_100g>100) |
         (df_selected['proteins_100g']<0) | (df_selected['proteins_100g']>100) |
         (df_selected['sugars_100g']<0) | (df_selected['sugars_100g']>100)
        );
df_selected = df_selected[mask];
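The same range check can be written once for any list of columns, which avoids repeating the condition for each nutrient. This is a minimal sketch on toy data; the helper name `in_range_mask` is hypothetical. As in the mask above, missing values are kept, because comparisons with NaN are false.

```python
import numpy as np
import pandas as pd


def in_range_mask(frame, columns, low=0.0, high=100.0):
    """True for rows where every listed column is missing or within [low, high]."""
    values = frame[columns]
    ok = values.isna() | ((values >= low) & (values <= high))
    return ok.all(axis=1)


toy = pd.DataFrame({
    "sugars_100g": [5.0, -17.86, 30.0, np.nan],
    "salt_100g": [0.5, 1.0, 250.0, 2.0],
})
# Row 1 has negative sugars and row 2 has more than 100 g of salt: both are dropped.
clean = toy[in_range_mask(toy, ["sugars_100g", "salt_100g"])]
print(clean)
```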

Additionally, foods contain protein, fat, and carbohydrates. Together, these cannot exceed 100 g per 100 g of product, so we also discard the observations whose sum is larger.

cols = [
    'proteins_100g',
    'fat_100g',
    'carbohydrates_100g'    
    ]   
df_selected['sum_on_g'] = df_selected[cols].abs().sum(axis=1)
df_selected['is_outlier'] = df_selected.sum_on_g>100

df_selected = df_selected[df_selected.is_outlier==False];

df_selected.drop(['sum_on_g', 'is_outlier'], inplace=True, axis=1);

Fill NaN

  • First, we fill the _100g numerical variables with 0 if they are not specified
  • Next, for nutrition_grade_fr we set values from ‘a’ to ‘e’, using the knowledge from nutrition-score-fr_100g
  • energy_100g - Data entry in Open Food Facts can be difficult and complex, so contributors may confuse kJ and kcal when entering values. We therefore recompute the energy of every product from its macronutrients, in kJ (17 kJ per gram of protein, 17 kJ per gram of carbohydrate, and 39 kJ per gram of fat)
  • Finally, we fill the counting variables (additives and palm oil ingredients) with 0 if they are not specified
cols = []
for col in df_selected.columns:
    # Nutrient columns only: skip the nutrition scores and energy_100g, which is recomputed below
    if col.endswith('_100g') and ('nutrition-score' not in col) and (col != 'energy_100g'):
        cols.append(col)
df_selected[cols] = df_selected[cols].fillna(value=0)

Next, we calculate the energy value and set it.

# 1 calorie is worth 180/43, i.e. 4.1860465116 joules, which we round to 4.186 joules.
# 1000 calories = 1 kilocalorie = 1 kcal
# The factors 17, 17 and 39 are in kJ per gram, so the result is in kJ per 100 g
df_selected['energy_100g'] = 17*df_selected.proteins_100g + 17*df_selected.carbohydrates_100g + 39*df_selected.fat_100g
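As a worked example of this formula, the sketch below computes the energy of a hypothetical product and converts it to kcal with the 4.186 factor quoted in the comment above. The function names and the product values are made up for illustration.

```python
KJ_PER_KCAL = 4.186  # rounded conversion factor quoted in the notebook comment


def energy_kj(proteins_g, carbohydrates_g, fat_g):
    """Approximate energy in kJ, using the 17/17/39 kJ per gram factors of the notebook."""
    return 17 * proteins_g + 17 * carbohydrates_g + 39 * fat_g


def kj_to_kcal(kj):
    """Convert an energy in kJ to kcal."""
    return kj / KJ_PER_KCAL


# Hypothetical product: 10 g protein, 50 g carbohydrates, 20 g fat per 100 g
kj = energy_kj(proteins_g=10, carbohydrates_g=50, fat_g=20)
kcal = kj_to_kcal(kj)
print(kj, round(kcal, 1))
```

Here 170 + 850 + 780 gives 1800 kJ, or roughly 430 kcal per 100 g.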

Now, we fill the NaN values with 0.

df_selected['additives_n'] = df_selected['additives_n'].fillna(value=0)
df_selected['ingredients_from_palm_oil_n'] = df_selected['ingredients_from_palm_oil_n'].fillna(value=0)
df_selected['ingredients_that_may_be_from_palm_oil_n'] = df_selected['ingredients_that_may_be_from_palm_oil_n'].fillna(value=0)
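The list above mentions deriving nutrition_grade_fr from nutrition-score-fr_100g, but the corresponding code is not shown. A minimal sketch of such a mapping could look as follows. It assumes the Nutri-Score thresholds of the general (solid food) case; beverages use different thresholds and are not handled. The function name `grade_from_score` is hypothetical.

```python
import math


def grade_from_score(score):
    """Map a nutrition score to a grade from 'a' to 'e'.

    Assumption: thresholds of the general (solid food) case only.
    """
    if score is None or math.isnan(score):
        return None  # no score, so no grade can be derived
    if score <= -1:
        return "a"
    if score <= 2:
        return "b"
    if score <= 10:
        return "c"
    if score <= 18:
        return "d"
    return "e"


for example in (-5, 0, 7, 15, 25):
    print(example, grade_from_score(example))
```

On the DataFrame, something like `df_selected['nutrition-score-fr_100g'].map(grade_from_score)` could then serve to fill the missing grades.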

Minimum and maximum of nutriscore

(df_selected['nutrition-score-fr_100g'].min(), df_selected['nutrition-score-fr_100g'].max())
(-15.0, 40.0)
(df_selected['nutrition-score-uk_100g'].min(), df_selected['nutrition-score-uk_100g'].max())
(-15.0, 36.0)

Qualitative variables

Qualitative variables, also known as categorical variables, are types of variables in statistics that represent qualities or characteristics that cannot be measured numerically. These variables are typically represented by non-numeric data or labels such as colors, names, types, or categories.

Creation of variables by taking top values

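def create_top(df, col, nr_top):
    # Name of the new column, e.g. 'creator_top5'
    col_name = '{0}_top{1}'.format(col , nr_top)
    # The nr_top most frequent values of the column
    ll = list(df[col].value_counts().head(nr_top).index)
    df[col_name] = df[col]
    # Every other value is grouped under 'Autre' (French for 'Other')
    df.loc[~(df[col_name].isin(ll)), col_name] = 'Autre'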

For example, we take two variables: creator and countries_fr. We keep their top 5 values and group all the remaining values under the label Autre (French for “Other”). In this way, we can see the top values.

take_top_5_col = ['creator', 'countries_fr']

for col in take_top_5_col:
    create_top(df_selected, col, 5)

We do the same for product_name, where we want to see the top 10 products.

take_top_10_col = ['product_name']

for col in take_top_10_col:
    create_top(df_selected, col, 10)
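To see what this grouping does, here is a small self-contained sketch on toy data. It is a variant of the idea that returns a new Series instead of adding a column in place; the function name `with_top_values` and the toy values are hypothetical.

```python
import pandas as pd


def with_top_values(series: pd.Series, n_top: int, other_label: str = "Autre") -> pd.Series:
    """Keep the n_top most frequent values and replace every other value by other_label."""
    top_values = series.value_counts().head(n_top).index
    return series.where(series.isin(top_values), other_label)


creators = pd.Series(
    ["usda", "usda", "usda", "kiliweb", "kiliweb", "tacite", "date-limite-app"]
)
top2 = with_top_values(creators, n_top=2)
print(top2.value_counts())
```

The two most frequent creators are kept and the two remaining ones are merged into Autre.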

Clean countries_fr variable

Some values of countries_fr carry a language prefix ending with the : character (e.g. ‘en:Tunisie’ and ‘Tunisie’ should be the same). We remove these prefixes so that each country has a single name.

df_selected.countries_fr = df_selected.countries_fr.str.replace('en:', '')
df_selected.countries_fr = df_selected.countries_fr.str.replace('es:', '')
df_selected.countries_fr = df_selected.countries_fr.str.replace('de:', '')
df_selected.countries_fr = df_selected.countries_fr.str.replace('ar:', '')
df_selected.countries_fr = df_selected.countries_fr.str.replace('nl:', '')
df_selected.countries_fr = df_selected.countries_fr.str.replace('xx:', '')


# Compare against lowercase names, since the column is lowercased before the comparison
df_selected.loc[df_selected.countries_fr.str.lower().isin(['royaume-uni', 'angleterre']), 'countries_fr'] = 'Royaume-Uni'
df_selected.loc[df_selected.countries_fr.str.lower().isin(['77-provins', 'aix-en-provence']), 'countries_fr'] = 'France'
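The prefix removal and the renaming could also be expressed as one small normalization function. The sketch below is an assumption-laden variant, not the notebook's method: it strips any two-letter language prefix (the notebook only lists en, es, de, ar, nl and xx), and the alias table contains just the cases mentioned above.

```python
import re

# Assumption: a language prefix is two lowercase letters followed by ':'
LANGUAGE_PREFIX = re.compile(r"^[a-z]{2}:")

# Lowercase alias -> canonical name (only the cases mentioned in the text)
ALIASES = {
    "royaume-uni": "Royaume-Uni",
    "angleterre": "Royaume-Uni",
    "77-provins": "France",
    "aix-en-provence": "France",
}


def normalize_country(name: str) -> str:
    """Strip the language prefix and apply the alias table."""
    stripped = LANGUAGE_PREFIX.sub("", name.strip())
    return ALIASES.get(stripped.lower(), stripped)


def normalize_countries(value: str) -> str:
    """Normalize every country of a comma-separated list."""
    return ",".join(normalize_country(part) for part in value.split(","))


print(normalize_countries("France,en:Tunisie"))
print(normalize_countries("en:Angleterre"))
```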

Correlation

Correlation between two variables refers to the statistical relationship between them. Specifically, correlation measures the degree to which two variables are related or vary together.

In general, there are two types of correlation: positive and negative. A positive correlation exists when an increase in one variable is associated with an increase in the other variable, while a negative correlation exists when an increase in one variable is associated with a decrease in the other variable.

The strength of the correlation is measured by the correlation coefficient, which ranges from -1 to 1. A correlation coefficient of -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.

Correlation is often used in statistical analysis to understand the relationship between two variables and to make predictions about one variable based on the other. However, correlation does not necessarily imply causation, and other factors may be influencing the relationship between the two variables. Therefore, the correlation should be interpreted with caution and further analysis should be conducted to establish causation.
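The three reference values of the coefficient can be illustrated on synthetic data. This is only a sketch with made-up arrays. Note that the last case has a coefficient of zero even though y is fully determined by x, because the coefficient only captures linear relationships.

```python
import numpy as np

x = np.arange(10, dtype=float)

# Exact increasing linear relation: coefficient of 1
r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]

# Exact decreasing linear relation: coefficient of -1
r_negative = np.corrcoef(x, -3 * x + 5)[0, 1]

# y depends on x, but symmetrically around the mean of x: coefficient of 0
y_symmetric = (x - x.mean()) ** 2
r_nonlinear = np.corrcoef(x, y_symmetric)[0, 1]

print(r_positive, r_negative, r_nonlinear)
```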

Between 2 quantitative variables

plot_correlation(df_selected)
  additives_n ingredients_from_palm_oil_n ingredients_that_may_be_from_palm_oil_n energy_100g fat_100g saturated-fat_100g carbohydrates_100g sugars_100g fiber_100g proteins_100g salt_100g sodium_100g nutrition-score-fr_100g nutrition-score-uk_100g fruits-vegetables-nuts_100g
additives_n nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
ingredients_from_palm_oil_n 0.120900 nan nan nan nan nan nan nan nan nan nan nan nan nan nan
ingredients_that_may_be_from_palm_oil_n 0.288444 0.184633 nan nan nan nan nan nan nan nan nan nan nan nan nan
energy_100g 0.039412 0.098584 0.031941 nan nan nan nan nan nan nan nan nan nan nan nan
fat_100g -0.078317 0.063591 0.025512 0.802379 nan nan nan nan nan nan nan nan nan nan nan
saturated-fat_100g -0.061414 0.084041 0.030113 0.507905 0.636419 nan nan nan nan nan nan nan nan nan nan
carbohydrates_100g 0.200227 0.083950 0.030712 0.540972 -0.042021 -0.047544 nan nan nan nan nan nan nan nan nan
sugars_100g 0.146411 0.058287 0.005414 0.274477 -0.063760 0.063558 0.617330 nan nan nan nan nan nan nan nan
fiber_100g -0.115047 -0.001547 -0.035574 0.249919 0.095830 0.004550 0.230889 -0.017901 nan nan nan nan nan nan nan
proteins_100g -0.101381 -0.007868 -0.038991 0.277584 0.213393 0.194014 -0.093323 -0.238110 0.231642 nan nan nan nan nan nan
salt_100g -0.003864 -0.005207 -0.016555 -0.077246 -0.046840 -0.039982 -0.067531 -0.091756 -0.032159 -0.001990 nan nan nan nan nan
sodium_100g -0.003864 -0.005207 -0.016555 -0.077246 -0.046840 -0.039983 -0.067532 -0.091756 -0.032159 -0.001990 1.000000 nan nan nan nan
nutrition-score-fr_100g 0.152258 0.111722 0.055395 0.563106 0.533356 0.632214 0.234830 0.467773 -0.161741 0.108323 0.126733 0.126734 nan nan nan
nutrition-score-uk_100g 0.150110 0.114145 0.056638 0.585692 0.557287 0.649291 0.235674 0.457226 -0.155272 0.131695 0.130602 0.130602 0.986087 nan nan
fruits-vegetables-nuts_100g -0.013594 -0.006479 -0.001290 -0.051571 -0.039096 -0.037375 -0.021815 0.016174 -0.015843 -0.047305 -0.015122 -0.015122 -0.041717 -0.058165 nan
st.pearsonr(df_selected.fat_100g, df_selected.energy_100g)[0] # linear correlation coefficient
np.cov(df_selected.fat_100g, df_selected.energy_100g, ddof=0) # covariance matrix
0.8023789000889111

array([[2.91999772e+02, 1.15335422e+04],
       [1.15335422e+04, 7.07593613e+05]])
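The two outputs above are linked: the correlation coefficient is the covariance divided by the product of the two standard deviations. As a quick check, the sketch below recomputes the coefficient from the values printed in the covariance matrix (the variable names are chosen for illustration).

```python
import math

# Values copied from the covariance matrix printed above
var_fat = 2.91999772e+02         # variance of fat_100g
cov_fat_energy = 1.15335422e+04  # covariance of fat_100g and energy_100g
var_energy = 7.07593613e+05      # variance of energy_100g

# r = cov(X, Y) / (std(X) * std(Y))
r = cov_fat_energy / math.sqrt(var_fat * var_energy)
print(round(r, 6))
```

The result agrees with the coefficient returned by st.pearsonr, about 0.8024.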

Between 1 quantitative variable and 1 qualitative variable (ANOVA)

data = get_pandas_catVar_numVar(df_selected, catVar = 'product_name_top10', numVar = 'fat_100g')

# Graphical properties (not very important)
medianprops = {'color':"black"}
meanprops = {'marker':'o', 'markeredgecolor':'black',
            'markerfacecolor':'firebrick'}

plt.figure(figsize=(20,8));
b = sns.boxplot(x="variable", y="value", data=pd.melt(data), showfliers = False,  showmeans=True, medianprops=medianprops, meanprops=meanprops);
plt.title('Top 10: product_name / fat_100g', fontsize=20);
plt.xlabel('Variable values', fontsize=15);
plt.ylabel('Values', fontsize=15);
plt.show();
Figure 4: Correlation between 1 quantitative and 1 qualitative variables. (product_name / fat_100g)

By analyzing Figure 4 we can say:

  • The fat content differs from one product to another.
  • For instance, the fat content of Potato chips, Cookies, and Popcorn is higher and more dispersed than that of salsa and pinto beans.
  • The product with the highest fat content is Extra Virgin Olive Oil, which is logical.

Note that, depending on the criteria a user sets in our application, some of these products could be excluded. For example, if a user wants cookies with less than 30 g of fat per 100 g, not all cookies qualify, and those above the limit will not be selected.

data = get_pandas_catVar_numVar(df_selected, catVar = 'product_name_top10', numVar = 'energy_100g')

# Graphical properties (not very important)
medianprops = {'color':"black"}
meanprops = {'marker':'o', 'markeredgecolor':'black',
            'markerfacecolor':'firebrick'}

plt.figure(figsize=(20,8));
b = sns.boxplot(x="variable", y="value", data=pd.melt(data), showfliers = False);
plt.title('Top 10: product_name / energy_100g', fontsize=20);
plt.xlabel('Variable values', fontsize=15);
plt.ylabel('Values', fontsize=15);
plt.show();
Figure 5: Correlation between 1 quantitative and 1 qualitative variables (product_name / energy_100g).

By analyzing Figure 5 we can say:

  • Energy behaves much like fat: products with more fat have more energy. However, the values are more dispersed for the energy variable than for the fat variable.
  • People with a sedentary job should avoid high-calorie products. For example, a person who sits all day should prefer cookies with fewer kcal.

To analyze the relationship between one quantitative and one qualitative variable, we look at the eta squared coefficient. Suppose Y is a categorical variable and X is a numerical variable. We measure the relationship between them with eta_squared = Interclass_variance/Total_variance, that is, the share of the total variance of X that is explained by the classes of Y. If eta_squared = 0, the class averages are all equal, so there is a priori no relationship between the variables Y and X. On the contrary, if eta_squared = 1, the averages per class differ and each class is made up of identical values, so there is a priori a relationship between the variables Y and X.
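The eta_squared helper used below comes from the project's functions module, which is not shown here. A minimal sketch of how such a coefficient could be computed, as the ratio of the between-class sum of squares to the total sum of squares, is given below. The name `eta_squared_sketch` and the toy data are hypothetical, and the actual helper may differ.

```python
import pandas as pd


def eta_squared_sketch(frame, cat_col, num_col):
    """Share of the variance of num_col explained by the classes of cat_col."""
    sub = frame[[cat_col, num_col]].dropna()
    overall_mean = sub[num_col].mean()
    # Total sum of squares around the overall mean
    total_ss = ((sub[num_col] - overall_mean) ** 2).sum()
    # Between-class sum of squares, weighted by class size
    groups = sub.groupby(cat_col)[num_col].agg(["count", "mean"])
    between_ss = (groups["count"] * (groups["mean"] - overall_mean) ** 2).sum()
    return between_ss / total_ss


# Each class holds identical values and the class means differ: coefficient of 1
separated = pd.DataFrame({"grade": ["a", "a", "b", "b"], "fat": [1.0, 1.0, 9.0, 9.0]})
# Both classes have the same mean: coefficient of 0
mixed = pd.DataFrame({"grade": ["a", "a", "b", "b"], "fat": [1.0, 9.0, 1.0, 9.0]})

eta_separated = eta_squared_sketch(separated, "grade", "fat")
eta_mixed = eta_squared_sketch(mixed, "grade", "fat")
print(eta_separated, eta_mixed)
```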

eta_squared(df_selected, 'product_name_top10', 'energy_100g')
0.015498074324149612
eta_squared(df_selected, 'product_name_top10', 'fat_100g')
0.029213633935751267
eta_squared(df_selected[~df_selected.nutrition_grade_fr.isna()], 'nutrition_grade_fr', 'energy_100g')
0.2982499803324311
eta_squared(df_selected[~df_selected.nutrition_grade_fr.isna()], 'nutrition_grade_fr', 'fat_100g')
0.2618975105793661

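From these eta squared values, we can say that the nutrition grade is related to the fat and energy variables (eta squared of about 0.26 and 0.30), whereas the top 10 product names explain very little of their variance (about 0.03 and 0.02).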

Between two qualitative variables

Let us pose a question: do we find the same products in different countries?

X = "product_name_top10"
Y = "countries_fr_top5"

cont = df_selected[[X,Y]].pivot_table(index=X,columns=Y,aggfunc=len,margins=True,margins_name="Total")

plt.figure(figsize=(20,8));
sns.heatmap(cont, cmap="YlGnBu", annot=True, fmt='.1f')
plt.title('Top 10 products in top 5 countries', fontsize=20);
plt.xlabel('Top 5 countries', fontsize=15);
plt.ylabel('Top 10 products', fontsize=15);
Figure 6: Correlation between 2 qualitative variables (top 10 products / top 5 countries).

By analyzing Figure 6 we can say that:

  • the top ten products are all observed in the US.
  • France is the second country that has most of the observed products.
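A contingency table like the one behind Figure 6 can also be built with pd.crosstab, which counts the co-occurrences of two categorical columns directly. The sketch below uses made-up toy data, and the column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "product": ["Cookies", "Cookies", "Salsa", "Cookies", "Salsa"],
    "country": ["États-Unis", "France", "États-Unis", "États-Unis", "États-Unis"],
})

# Rows: products, columns: countries, cells: number of observations.
# margins=True adds the row and column totals.
cont = pd.crosstab(toy["product"], toy["country"], margins=True, margins_name="Total")
print(cont)
```

Combinations that never occur appear as 0 instead of NaN, which makes the table easier to read in a heatmap.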

Data Analysis

Univariate analysis

Densities of nutritional variables

plot_density(df_selected, dt = DensityTypes.Boxplot) #dt = DensityTypes.Density
Figure 7: Densities of nutritional variables.

Visualizing the top values of some variables

plt.figure(figsize=(15,8))
sns.barplot(x=df_selected.creator_top5.value_counts(), y=df_selected.creator_top5.value_counts().index, data=df_selected);
plt.title('Top 5 creators', fontsize=20);
plt.xlabel('# of creations', fontsize=15);
plt.ylabel('Creators', fontsize=15);
plt.show();
Figure 8: Top 5 creators.
plt.figure(figsize=(15,8))
sns.barplot(x=df_selected.product_name.value_counts().head(10), y=df_selected.product_name.value_counts().head(10).index, data=df_selected);
plt.title('Top 10 product_name', fontsize=20);
plt.xlabel('# of products', fontsize=15);
plt.ylabel('Product name', fontsize=15);
plt.show();
Figure 9: Top 10 products.
plt.figure(figsize=(15,8))
sns.barplot(x=df_selected.brands.value_counts().head(10), y=df_selected.brands.value_counts().head(10).index, data=df_selected);
plt.title('Top 10 brands', fontsize=20);
plt.xlabel('# of brands', fontsize=15);
plt.ylabel('Brands', fontsize=15);
plt.show();
this is a placeholder image
Figure 10: Top 10 brands.
plt.figure(figsize=(15,8))
sns.barplot(x=df_selected.ingredients_text.value_counts().head(10), y=df_selected.ingredients_text.value_counts().head(10).index, data=df_selected);
plt.title('Top 10 ingredients_text', fontsize=20);
plt.xlabel('# of ingredients', fontsize=15);
plt.ylabel('Ingredients', fontsize=15);
plt.show();
this is a placeholder image
Figure 11: Top 10 ingredients.
plt.figure(figsize=(15,8))
sns.barplot(x=df_selected.additives_tags.value_counts().head(10), y=df_selected.additives_tags.value_counts().head(10).index, data=df_selected);
plt.title('Top 10 additives', fontsize=20);
plt.xlabel('# of additives', fontsize=15);
plt.ylabel('Additives', fontsize=15);
plt.show();
this is a placeholder image
Figure 12: Top 10 additives.
df1 =  df_selected.countries_fr.str.split(',', expand=True).melt(var_name='columns', value_name='values');
df2 = pd.crosstab(index=df1['values'], columns=df1['columns'], margins=True).All.drop('All').sort_values(ascending = False).head(10);
df2 = df2.to_frame();
#Using reset_index, inplace=True
df2.reset_index(inplace=True);

plt.figure(figsize=(15,8));
sns.barplot(y='values', x='All', data=df2);
plt.title('Top 10 countries', fontsize=20);
plt.xlabel('count', fontsize=15);
plt.ylabel('Countries', fontsize=15);
plt.show();

del df1, df2;
this is a placeholder image
Figure 13: Top 10 countries.
plot_words(df, 'countries_fr')
this is a placeholder image

Distribution of nutriscore_grade

plt.figure(figsize=(15,8))
df_selected.nutrition_grade_fr.value_counts().plot.pie(autopct="%.1f%%");
plt.title('Nutriscore grade', fontsize=20);
plt.ylabel('');
this is a placeholder image
Figure 14: nutriscore_grade distribution.

Multivariate analysis

cols = ['energy_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 
  'salt_100g', 'nutrition_grade_fr']
d = df_selected[(~df_selected['nutrition_grade_fr'].isna()) & (~df_selected['nutrition-score-fr_100g'].isna())][cols].sample(10000)

sns.pairplot(data=d, hue="nutrition_grade_fr", hue_order=['e','d','c','b','a'], 
             plot_kws = {'s': 10}, corner=True)
del d
this is a placeholder image
Figure 15: Pairplot of our dataset.

Analyzing Figure 15, we can say:

High levels of fat and saturated fat penalize the nutriscore.
Other nutritional components affect the nutriscore less, although some effects can be observed. These differences may be due to the food categories.

Some foods that are rich in caloric energy still have a good nutrition grade:

  • A good nutrition grade (‘a’ or ‘b’) with energy around 1500 can be observed when fat is below 20.
  • A good nutrition grade (‘a’ or ‘b’) with energy around 3000 can be observed when saturated fat is very low (below 10).
  • Some foods rich in carbohydrates have a good nutrition grade despite having more than 2000 in energy.

These patterns can also be seen in the following three figures.

plt.figure(figsize=(15,8));
sns.scatterplot(data=df_selected, x="fat_100g", y="energy_100g", hue="nutrition_grade_fr", hue_order=['e','d','c','b','a'])
plt.title('Interaction of fat on energy', fontsize=20);
plt.xlabel('Fat', fontsize=15);
plt.ylabel('Energy', fontsize=15);
plt.show();
this is a placeholder image
Figure 16: Scatterplot between fat_100g and energy_100g.
plt.figure(figsize=(15,8));
sns.scatterplot(data=df_selected, x="saturated-fat_100g", y="energy_100g", hue="nutrition_grade_fr", hue_order=['e','d','c','b','a'])
plt.title('Interaction of saturated fat on energy', fontsize=20);
plt.xlabel('Saturated Fat', fontsize=15);
plt.ylabel('Energy', fontsize=15);
plt.show();
this is a placeholder image
Figure 17: Scatterplot between saturated-fat_100g and energy_100g.
plt.figure(figsize=(15,8));
sns.scatterplot(data=df_selected, x="carbohydrates_100g", y="energy_100g", hue="nutrition_grade_fr", hue_order=['e','d','c','b','a'])
plt.title('Interaction of carbohydrates on energy', fontsize=20);
plt.xlabel('Carbohydrates', fontsize=15);
plt.ylabel('Energy', fontsize=15);
plt.show();
this is a placeholder image
Figure 18: Scatterplot between carbohydrates_100g and energy_100g.
plt.figure(figsize=(15,8));
sns.boxplot(x="nutrition_grade_fr", y="energy_100g", data=df_selected, showfliers = False, order = ['a', 'b', 'c', 'd', 'e'])
plt.title('Nutrition grade distributions/ Energy', fontsize=20);
plt.xlabel('Nutrition grade', fontsize=15);
plt.ylabel('Energy', fontsize=15);
plt.show();
this is a placeholder image
Figure 19: nutrition_grade_fr distributions over the energy_100g variable
plt.figure(figsize=(15,8));
sns.boxplot(x="nutrition_grade_fr", y="fat_100g", data=df_selected, showfliers = False, order = ['a', 'b', 'c', 'd', 'e'])
plt.title('Nutrition grade distributions/ fat_100g', fontsize=20);
plt.xlabel('Nutrition grade', fontsize=15);
plt.ylabel('Fat', fontsize=15);
plt.show();
this is a placeholder image
Figure 20: nutrition_grade_fr distributions over the fat_100g variable

Note that foods with different nutrition_grade_fr values can have similarly high energies. However, by preferring good foods (nutrition_grade_fr ‘a’ and ‘b’), we are likely to eat foods with less energy. The same holds for fat: by preferring foods with a better nutrition grade, we tend to choose foods with less fat.
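The reading of the boxplots can be backed by a small numeric summary: the median energy per nutrition grade. The sketch below uses invented toy values and only borrows the column names from the article, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data (invented values); None mimics a product without a grade
toy = pd.DataFrame({
    "nutrition_grade_fr": ["a", "a", "b", "d", "e", "e", None],
    "energy_100g": [150.0, 250.0, 600.0, 1500.0, 2100.0, 2300.0, 900.0],
})

summary = (
    toy.dropna(subset=["nutrition_grade_fr"])       # ignore ungraded products
       .groupby("nutrition_grade_fr")["energy_100g"]
       .agg(["median", "count"])                     # robust centre and group size
       .reindex(["a", "b", "c", "d", "e"])           # keep the a..e order, NaN if a grade is absent
)
print(summary)
```

The median is used rather than the mean because the boxplots are drawn without outliers, and the median is the statistic they display as the central line.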

Add my_category variable

plot_words(df_selected, 'pnns_groups_2')
this is a placeholder image
compute_words_freq(df_selected, 'pnns_groups_2', sep=',')
Word Frequency
0 unknown 12835
1 one-dish meals 4927
2 biscuits and cakes 4018
3 cereals 3701
4 sweets 3587
5 cheese 3516
6 milk and yogurt 2914
7 dressings and sauces 2785
8 chocolate products 2648
9 vegetables 2585
10 processed meat 2548
11 non-sugared beverages 2242
12 fish and seafood 2052
13 sweetened beverages 1952
14 appetizers 1880
15 fruit juices 1729
16 bread 1590
17 fats 1342
18 breakfast cereals 1310
19 fruits 1297
20 meat 1150
21 legumes 754
22 dairy desserts 726
23 ice cream 647
24 sandwich 640
25 nuts 565
26 pizza pies and quiche 464
27 soups 463
28 dried fruits 410
29 pastries 403
30 fruit nectars 342
31 artificially sweetened beverages 255
32 eggs 186
33 alcoholic beverages 155
34 potatoes 96
35 tripe dishes 49
36 salty and fatty products 19
categories ={
    'cheese' : ['cheese'],
    'appetizer' : ['appetizers', 'nuts', 'salty and fatty products', 'dressings and sauces'],
    'melange': ['soups', 'sandwich', 'pizza pies and quiche'],
    'juice' : ['fruit juices', 'fruit nectars'],
    'plants' : ['legumes', 'legume', 'fruits', 'Fruit', 'vegetables', 'dried fruits'],
    'sweet' : ['sweets', 'biscuits and cakes', 'chocolate products', 'dairy desserts'],
    'feculent' : ['cereals', 'bread', 'pastries', 'potatoes', 'breakfast cereals' ],
    'beverage' : ['non-sugared beverages', 'artificially sweetened beverages', 'alcoholic beverages', 'sweetened beverages'],
    'meat_fish' : ['tripe dishes', 'meat','fish and seafood', 'processed meat', 'eggs'],
    'fats' : ['fats'],
    'milk' : ['milk and yogurt', 'ice cream'],
}

def replace(df, col, key, val):
    m = [v == key for v in df[col]]
    df.loc[m, col] = val
    return df
    
df_selected2 = df_selected[~df_selected.pnns_groups_2.isna()]
plot_data(df_selected2)
this is a placeholder image
Figure 21. Plotting selected data.
# Build the new column on df_selected, because the following cells read df_selected.my_categoty
df_selected['my_categoty'] = df_selected['pnns_groups_2'].str.lower();
for new_value, old_values in categories.items():
    # Replace every original group name listed in old_values with its new category
    df_selected['my_categoty'] = df_selected['my_categoty'].replace(old_values, new_value);
    
plot_words(df_selected, 'my_categoty')
this is a placeholder image
Figure 22.
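The renaming step that produces my_categoty can also be sketched as a single lookup: the dictionary of the form {new category: [old groups]} is inverted into {old group: new category} and applied in one pass. This is an alternative formulation given for illustration, with a shortened dictionary and invented sample values; it is not the code used in the project.

```python
import pandas as pd

# Shortened version of the categories dictionary, for illustration only
categories = {
    "juice": ["fruit juices", "fruit nectars"],
    "milk": ["milk and yogurt", "ice cream"],
}

# Invert {new: [old, ...]} into {old: new}
lookup = {old: new for new, olds in categories.items() for old in olds}

groups = pd.Series(["Fruit juices", "ice cream", "unknown", None])
lowered = groups.str.lower()
# Map known groups; keep the original (lowercased) value when no rule applies
mapped = lowered.map(lookup).fillna(lowered)
print(mapped.tolist())
```

Values without a rule, such as "unknown", are kept as they are, and missing values stay missing.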
compute_words_freq(df_selected, 'my_categoty')
Word Frequency
0 unknown 12835
1 sweet 10979
2 feculent 7100
3 meat_fish 5985
4 appetizer 5249
5 plants 5046
6 onedish 4927
7 meals 4927
8 beverage 4604
9 milk 3561
10 cheese 3516
11 juice 2071
12 melange 1567
13 fats 1342
plt.figure(figsize=(15,8));
sns.boxplot(x="nutrition_grade_fr", y="energy_100g", data=df_selected[(df_selected.my_categoty == 'plants') | (df_selected.my_categoty == 'meat_fish')], showfliers = False, order = ['a', 'b', 'c', 'd', 'e'], hue =  'my_categoty')
plt.title('Nutrition grade distributions/ energy_100g', fontsize=20);
plt.xlabel('Nutrition grade', fontsize=15);
plt.ylabel('Energy', fontsize=15);
plt.show();
this is a placeholder image
Figure 23. Nutrition grade distributions over energy categorized by plants and meat_fish foods
plt.figure(figsize=(15,8));
sns.boxplot(x="nutrition_grade_fr", y="energy_100g", data=df_selected[(df_selected.my_categoty == 'beverage') | (df_selected.my_categoty == 'milk')], showfliers = False, order = ['a', 'b', 'c', 'd', 'e'], hue =  'my_categoty')
plt.title('Nutrition grade distributions/ energy_100g', fontsize=20);
plt.xlabel('Nutrition grade', fontsize=15);
plt.ylabel('Energy', fontsize=15);
plt.show();
this is a placeholder image
Figure 24. Nutrition grade distributions over energy categorized by beverage and milk foods

Difference between nutriscores

Let us now examine whether there are differences between nutrition-score-fr_100g and nutrition-score-uk_100g.

from sklearn.linear_model import LinearRegression
mask = (~df_selected['nutrition-score-fr_100g'].isna()) & (~df_selected['nutrition-score-uk_100g'].isna())
x=df_selected[mask]['nutrition-score-fr_100g']
y=df_selected[mask]['nutrition-score-uk_100g']

plt.figure(figsize=(15,8));
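# x and y are passed by keyword (recent seaborn versions reject positional vectors);
# hue is restricted to the same rows as x and y
sns.scatterplot(x=x, 
                y=y, 
                hue=df_selected.loc[mask, 'my_categoty'],
                legend='full',
                s=100);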

plt.title('Nutri score UK vs FR', fontsize=20);
plt.xlabel('nutri score fr 100g', fontsize=15);
plt.ylabel('nutri score uk 100g', fontsize=15);


#linear regression
x = np.array(x).reshape(-1, 1);
y = np.array(y).reshape(-1, 1);

reg = LinearRegression();
model = reg.fit(x, y);
plt.plot(x, model.predict(x),color='k');
plt.show()

print('y=ax with a={}\n score : {}'.format(model.coef_[0], model.score(x, y)));
this is a placeholder image
Figure 25. Difference between nutrition-score-fr and nutrition-score-uk
 y=ax with a=[1.0000532]
 score : 0.9723672948764118

The nutriscores of the two countries are very similar, and the relationship between them is well described by a linear model (slope close to 1, R² of about 0.97). However, we see some differences in how the nutriscore is computed for some categories of products:

  • beverage is considered with a smaller nutrition score
  • fats are considered with a higher nutrition score
  • cheese is considered with a higher nutrition score
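The per-category differences can be made explicit by averaging the gap between the two scores within each category. The sketch below is only an illustration: the toy numbers are arbitrary and do not reflect the dataset, and the helper column `uk_minus_fr` is hypothetical.

```python
import pandas as pd

# Toy data (arbitrary values); only the column names follow the article
toy = pd.DataFrame({
    "my_categoty": ["beverage", "beverage", "cheese", "cheese", "sweet"],
    "nutrition-score-fr_100g": [10.0, 14.0, 12.0, 15.0, 20.0],
    "nutrition-score-uk_100g": [2.0, 4.0, 17.0, 20.0, 20.0],
})

# Signed gap between the two scores for each product
toy["uk_minus_fr"] = toy["nutrition-score-uk_100g"] - toy["nutrition-score-fr_100g"]

# Mean gap per category: values far from 0 mark categories scored differently
gap = toy.groupby("my_categoty")["uk_minus_fr"].mean().sort_values()
print(gap)
```

Categories whose mean gap is close to zero lie on the regression line, and the others are the ones that stand out in the scatterplot.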

Energy for each category of foods?

plt.figure(figsize=(15,8));
sns.boxplot(x="energy_100g", y="my_categoty", data=df_selected, orient = 'h', showfliers = False,);
plt.title('Calorics food', fontsize=20);
plt.xlabel('energy 100g', fontsize=15);
plt.ylabel('Categories', fontsize=15);
plt.show();
this is a placeholder image
Figure 26. How much energy do we have in each category of foods?

Fat for each category of foods?

plt.figure(figsize=(15,8));
sns.boxplot(x="fat_100g", y="my_categoty", data=df_selected, orient = 'h', showfliers = False,);
plt.title('Fat food', fontsize=20);
plt.xlabel('fat 100g', fontsize=15);
plt.ylabel('Categories', fontsize=15);
plt.show();
this is a placeholder image
Figure 26. How much fat do we have in each category of foods?

Exploratory analysis with PCA

PCA allows us to:

  • Analyse the variability between individuals, i.e. what are the differences and similarities between individuals.
  • Analyse links between variables: are there groups of variables that are highly correlated with each other and that can be grouped into new synthetic variables?
# select the columns to include in the PCA
columns_acp = []
for c in list(df_selected.columns):
    if c.endswith('_100g'):
        columns_acp.append(c)
# copy so that the imputed columns below are written to an independent DataFrame
df_pca = df_selected[columns_acp].copy()

plot_data(df_pca)
this is a placeholder image
Figure 27. Selected data for PCA.
# Data preparation
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean');
X = np.array(df_pca['nutrition-score-fr_100g']).reshape(-1, 1);
imp.fit(X);
df_pca['nutrition-score-fr_100g'] = imp.transform(X);

X = np.array(df_pca['nutrition-score-uk_100g']).reshape(-1, 1);
imp.fit(X);
df_pca['nutrition-score-uk_100g'] = imp.transform(X);
#del X
X = df_pca.values
names = df_pca.index #["product_name"] # or data.index to get the labels
features = df_pca.columns

# Centering and scaling
std_scale = preprocessing.StandardScaler().fit(X)
X_scaled = std_scale.transform(X)
# choose the number of components to compute
n_comp = 6

# Compute the principal components
pca = decomposition.PCA(n_components=n_comp)
pca.fit(X_scaled)

X_projected = pca.fit_transform(X_scaled)


X_projected = pd.DataFrame(X_projected, index = df_pca.index, columns = ['F{0}'.format(i) for i in range(n_comp)])

X_projected
PCA(n_components=6)
F0 F1 F2 F3 F4 F5
1 2.969878 -0.398472 0.512247 0.170433 0.168039 -1.250576
2 -0.119704 -0.925610 0.434700 2.543939 0.046679 0.171516
3 2.309261 0.007418 2.223656 1.584307 0.250097 -0.581340
4 0.002351 -0.753928 -0.451536 1.405344 -0.106914 0.298048
5 0.858225 -0.791237 0.211521 1.907893 0.022654 0.200782
... ... ... ... ... ... ...
320756 0.157600 0.036311 0.451537 -0.319817 0.015153 -1.009554
320757 -1.670609 -0.420981 1.782779 1.898136 -0.025061 1.269749
320763 -2.339713 -0.139322 0.323233 -0.883152 -0.164451 -0.445308
320768 -2.492881 -0.161867 0.332084 -0.887011 -0.164449 -0.535805
320771 -1.512330 0.099590 0.084318 -1.414149 -0.217476 -0.020945

260767 rows × 6 columns

# Scree plot of the eigenvalues
display_scree_plot(pca)

# Correlation circle
pcs = pca.components_
display_circles(pcs, n_comp, pca, [(0,1),(2,3),(4,5)], labels = np.array(features))

The percentage of inertia, also known as the percentage of variance explained or the proportion of variance, measures the amount of variation in a dataset that is accounted for by a particular factor or component in a statistical analysis.

In multivariate statistical techniques such as principal component analysis (PCA), correspondence analysis, or factor analysis, percentages of inertia are used to assess the importance of each factor or component. The percentage of inertia is calculated as the proportion of the total variance of the dataset that is explained by a particular factor or component.

For example, in PCA, the first principal component accounts for the largest percentage of variance in the dataset, followed by the second principal component, and so on. The percentage of inertia for each principal component indicates how much of the variance in the dataset is accounted for by that component.

Percentages of inertia are useful in interpreting and visualizing the results of multivariate statistical analyses, as they provide a measure of the relative importance of each factor or component in explaining the variability in the data.
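The definition above can be sketched with NumPy alone: on standardized data, the variance carried by each principal component is an eigenvalue of the correlation matrix, and the percentage of inertia is that eigenvalue divided by their sum. The data below are synthetic and generated only for illustration; on the same scaled data, these ratios should agree with the `explained_variance_ratio_` attribute of scikit-learn's PCA.

```python
import numpy as np

# Synthetic data: two strongly correlated columns and one independent column
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Centering and scaling, as done before the PCA
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the correlation matrix = variance carried by each component
corr = np.cov(X_scaled, rowvar=False, ddof=0)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh is ascending, so reverse

inertia_pct = 100 * eigenvalues / eigenvalues.sum()
cumulative_pct = np.cumsum(inertia_pct)
print(inertia_pct.round(1), cumulative_pct.round(1))
```

The cumulative percentages are what a scree plot uses to decide how many components to keep, for example the smallest number that exceeds a chosen threshold.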

this is a placeholder image
Figure 28. Percentages of inertia.
this is a placeholder image
Figure 29. Correlation circle between F1 and F2.
this is a placeholder image
Figure 30. Correlation circle between F3 and F4.
this is a placeholder image
Figure 31. Correlation circle between F5 and F6.
# Projection of individuals
X_projected = pca.transform(X_scaled)
display_factorial_planes(X_projected, n_comp, pca, [(0,1),(2,3),(4,5)], alpha = 0.2)
Figure 32. Projection of individuals.

The PCA was performed with 6 components, which together capture more than 80% of the information. To study the correlation between the initial variables and the principal components obtained, we project the arrows of the correlation circles onto the axes. These correlations can be negative or positive. We observe that:

  • The variables nutrition-score-fr_100g, nutrition-score-uk_100g and energy_100g are described by F1.
  • The variable sodium_100g is described by F2.
  • The variable sugars_100g is described by F3.
  • The variable fiber_100g is described by F4.
  • The variable fruits-vegetables-nuts_100g is described by F5.
  • The variable proteins_100g is described by F6.
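The coordinates of the arrows in a correlation circle are the correlations between each initial variable and each principal component. A minimal NumPy sketch, on synthetic data and with illustrative variable roles (the comments naming fat, energy and salt are assumptions, not the project's data), shows that for standardized variables these correlations equal the eigenvector entries multiplied by the square root of the eigenvalue.

```python
import numpy as np

# Synthetic data: column 0 plays the role of fat, column 1 of energy, column 2 of salt
rng = np.random.default_rng(1)
fat = rng.normal(size=300)
energy = fat + 0.5 * rng.normal(size=300)
salt = rng.normal(size=300)
X = np.column_stack([fat, energy, salt])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the correlation matrix, sorted by decreasing eigenvalue
corr = np.corrcoef(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Arrow coordinates: loadings[j, k] = correlation between variable j and component k
loadings = eigenvectors * np.sqrt(eigenvalues)

# Coordinates of the individuals on the components
scores = X_scaled @ eigenvectors
print(loadings.round(2))
```

A variable is said to be well described by a component when the absolute value of its loading on that component is close to 1. The sign of a loading depends on the arbitrary orientation of the eigenvector.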

We have also made a projection of individuals.

K-means algorithm with PCA

cols = ['energy_100g', 'fat_100g', 'saturated-fat_100g',  'proteins_100g',  'nutrition_grade_fr', 
        'carbohydrates_100g']

df_selected_clustering = df_selected[cols]
df_selected_clustering = df_selected_clustering[~df_selected_clustering.nutrition_grade_fr.isna()]

clusters = df_selected_clustering['nutrition_grade_fr']
clusters = np.array(clusters.apply(lambda x: ord(x)-97)) # converted to numeric ('a' -> 0, ..., 'e' -> 4)
df_selected_clustering.drop('nutrition_grade_fr', inplace=True, axis=1)

features = df_selected_clustering.columns

# Centering and scaling
std_scale = preprocessing.StandardScaler().fit(df_selected_clustering)
df_selected_clustering = std_scale.transform(df_selected_clustering)


n_comp = 2
# Compute the principal components
pca = decomposition.PCA(n_components=n_comp)
pca.fit(df_selected_clustering)
X_projected = pca.fit_transform(df_selected_clustering)

X_projected# = pd.DataFrame(X_projected, index = df_pca.index, columns = ['F{0}'.format(i) for i in range(n_comp)])
PCA(n_components=2)

array([[ 3.03514692, -0.50058059],
       [ 1.08440559, -0.77137787],
       [ 3.12397384,  0.78626913],
       ...,
       [-0.88462164,  1.63264941],
       [-1.96665734,  0.54173836],
       [-2.02328748,  0.52906735]])
# Scree plot of the eigenvalues
display_scree_plot(pca)

# Correlation circle
pcs = pca.components_
display_circles(pcs, n_comp, pca, [(0,1)], labels = np.array(features))
Figure 33. Percentage of inertia and circle of correlation.
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn import metrics

K = len(np.unique(clusters))
kmeans = KMeans(n_clusters=K).fit(X_projected)

metrics.rand_score(clusters, kmeans.labels_)
0.6879621875475739
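To interpret this value, recall that the Rand score is the fraction of pairs of products on which two partitions agree: both put the pair in the same group, or both put it in different groups. It is used here instead of an accuracy because the cluster numbers returned by K-means are arbitrary. The pure-Python sketch below, with a hypothetical helper `rand_index` and toy labels, is quadratic in the number of items and is meant for understanding, not for the full dataset.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Same partition with permuted label names: perfect agreement
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Crossed partitions: only 2 of the 6 pairs agree
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

Because this score is not corrected for chance, a value such as 0.69 should be read with care: partitions with many groups agree on many "different group" pairs even when they are unrelated.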

Preparing and Saving dataset for application

Let us take the data from df_selected, keeping only the rows that have a food category value.

df_application = df_selected[~df_selected.my_categoty.isna()].drop(['ingredients_text', 'serving_size', 'additives_tags', 'pnns_groups_2', 'brands'], axis=1).reset_index()

df_application = df_application.drop('index', axis=1)

First, we use a KNN imputer to fill in the missing values of the FR/UK nutrition scores.

from sklearn.impute import KNNImputer

cols = ['energy_100g', 'fat_100g', 'saturated-fat_100g',  'proteins_100g', 
        'carbohydrates_100g', 'nutrition-score-fr_100g', 'nutrition-score-uk_100g']

df_selected_knn = df_application[cols]

imputer = KNNImputer(n_neighbors=5)  # by default, NaN values are treated as missing data
imputed_data = imputer.fit_transform(df_selected_knn)  # impute every NaN from the 5 nearest neighbours
df_selected_knn = pd.DataFrame(data=imputed_data, columns=cols)


df_application['nutrition-score-fr_100g'] = df_selected_knn['nutrition-score-fr_100g']
df_application['nutrition-score-uk_100g'] = df_selected_knn['nutrition-score-uk_100g']

Next, we fill the missing product_name values with 'unknown'.

df_application.loc[df_application.product_name.isna(), 'product_name'] = 'unknown'

Let us look at the unique values of the category we have created.

df_application['my_categoty'].unique()
array(['unknown', 'plants', 'sweet', 'melange', 'meat_fish', 'beverage',
       'appetizer', 'one-dish meals', 'feculent', 'milk', 'fats',
       'cheese', 'juice'], dtype=object)
not_beverages = ((df_application['my_categoty']!='beverage') & (df_application['my_categoty']!='juice') & (df_application['my_categoty']!='milk'))
beverages = ~not_beverages 
                 
df_application['bevarage'] = beverages
cond1 = (~df_application.bevarage & df_application['nutrition_grade_fr'].isna() & (df_application['nutrition-score-fr_100g'] <= -1))
cond2 = (~df_application.bevarage & df_application['nutrition_grade_fr'].isna() & ((df_application['nutrition-score-fr_100g'] > -1) & (df_application['nutrition-score-fr_100g'] <= 2)))
cond3 = (~df_application.bevarage & df_application['nutrition_grade_fr'].isna() & ((df_application['nutrition-score-fr_100g'] > 2) & (df_application['nutrition-score-fr_100g'] <= 10)))
cond4 = (~df_application.bevarage & df_application['nutrition_grade_fr'].isna() & ((df_application['nutrition-score-fr_100g'] > 10) & (df_application['nutrition-score-fr_100g'] <= 18)))
cond5 = (~df_application.bevarage & df_application['nutrition_grade_fr'].isna() & ((df_application['nutrition-score-fr_100g'] > 18)))
cond6 = (df_application.bevarage & df_application['nutrition_grade_fr'].isna() & ((df_application['nutrition-score-fr_100g'] <= -1)))
cond7 = (df_application.bevarage & df_application['nutrition_grade_fr'].isna() & ((df_application['nutrition-score-fr_100g'] > -1) & (df_application['nutrition-score-fr_100g'] <= 1)))
cond8 = (df_application.bevarage & df_application['nutrition_grade_fr'].isna() & ((df_application['nutrition-score-fr_100g'] > 1) & (df_application['nutrition-score-fr_100g'] <= 5)))
cond9 = df_application.bevarage & df_application['nutrition_grade_fr'].isna() & (df_application['nutrition-score-fr_100g'] > 5) & (df_application['nutrition-score-fr_100g'] <= 9)
cond10 = df_application.bevarage & df_application['nutrition_grade_fr'].isna() & (df_application['nutrition-score-fr_100g'] > 9)

df_application.loc[cond1, 'nutrition_grade_fr'] = 'a'
df_application.loc[cond2, 'nutrition_grade_fr'] = 'b'
df_application.loc[cond3, 'nutrition_grade_fr'] = 'c'
df_application.loc[cond4, 'nutrition_grade_fr'] = 'd'
df_application.loc[cond5, 'nutrition_grade_fr'] = 'e'

df_application.loc[cond6, 'nutrition_grade_fr'] = 'a'
df_application.loc[cond7, 'nutrition_grade_fr'] = 'b'
df_application.loc[cond8, 'nutrition_grade_fr'] = 'c'
df_application.loc[cond9, 'nutrition_grade_fr'] = 'd'
df_application.loc[cond10, 'nutrition_grade_fr'] = 'e'
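The ten conditions above encode two sets of score thresholds, one for solid foods and one for beverages. As an alternative formulation, they can be sketched with `pd.cut`, whose right-closed intervals match the `<=` comparisons used above. The helper `grade_from_score` and the constant names are hypothetical, and the thresholds are copied from the conditions in this article, not from an official specification.

```python
import numpy as np
import pandas as pd

GRADES = ["a", "b", "c", "d", "e"]
# Right-closed bins: (-inf, -1], (-1, 2], (2, 10], (10, 18], (18, inf)
SOLID_BINS = [-np.inf, -1, 2, 10, 18, np.inf]
# Right-closed bins: (-inf, -1], (-1, 1], (1, 5], (5, 9], (9, inf)
BEVERAGE_BINS = [-np.inf, -1, 1, 5, 9, np.inf]


def grade_from_score(score: pd.Series, is_beverage: pd.Series) -> pd.Series:
    """Map a nutrition score to a grade, with separate thresholds for beverages."""
    solid = pd.cut(score, bins=SOLID_BINS, labels=GRADES).astype(object)
    drink = pd.cut(score, bins=BEVERAGE_BINS, labels=GRADES).astype(object)
    # Keep the solid-food grade unless the product is a beverage
    return solid.where(~is_beverage, drink)


# Toy scores: six solid foods followed by two beverages
scores = pd.Series([-2.0, 0.0, 2.0, 3.0, 11.0, 19.0, 2.0, 9.5])
is_beverage = pd.Series([False] * 6 + [True, True])
grades = grade_from_score(scores, is_beverage)
print(grades.tolist())
```

To fill only the missing grades, as the conditions above do, the result could be passed to `fillna` on the existing grade column, so that grades already present in the data are left untouched.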

Finally, save the dataset.

df_application.to_csv('data/df_app.csv', index = False, header=True)

The next step of this project was to use the selected data to create an application that helps people choose their food according to the nutrition they need.
