In [1]:
# PyParis logo
from IPython.display import Image
Image("PyParis.png")
Out[1]:

Introduction to Data Analysis using Python - PyParis 2017

Francis Wolinski - Yotta Conseil

Expertise & IT Audit - Data Science - Python

0. Tutorial objectives and materials

0.1 Objectives

  • Exploring basic functionalities of Python Data Analysis Ecosystem: Numpy, Pandas, Matplotlib
  • Analysis two different datasets but conforming them so that the same analysis can be performed on both datasets
  • Perform few cross analysis on both datasets

0.2 Materials

Python 3.5+ : numpy, pandas, matplotlib

Datasets :

  • Names from US Social Security from 1880 to 2016 : https://www.ssa.gov/oact/babynames/limits.html (National data)
    • 137 CSV files named yobYYYY.txt, without header
    • Within each file: name, gender (F or M), births (int)
    • No names when less that 5 occurences in a year
  • Names from French Insee from 1900 to 2015 : https://www.insee.fr/fr/statistiques/2540004 (Fichier France)
    • 1 single TSV file: nat2015.txt, with header
    • In the file: gender (1 or 2), name, year, births (float!)
    • No names when less than 3 occurences in a year
    • Rare names are labelled as _PRENOMS_RARES
    • Years for rare names are labelled XXXX

References

  • Wes McKinney - Python for Data Analysis - O'Reilly (2013)
  • Jake VanderPlas - Python Data Science Handbook - O'Reilly (2016)
In [2]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os

# magic command to display graphics within the notebook
%matplotlib inline

# display options
from pandas import set_option
set_option("display.max_rows", 16)
set_option("precision", 3)
plt.style.use('seaborn-notebook')

1. Loading Data in a DataFrame

1.1 Loading US Data

1.1.1 Loading a single file

We first load a single CSV file using Pandas' read_csv() function.

In [3]:
# loads US data from last year
path = "datasets/names/"
filename = "yob2016.txt"
us2016 = pd.read_csv(os.path.join(path, filename),
                           header=None,
                           names=['name', 'gender', 'births'])
us2016
Out[3]:
name gender births
0 Emma F 19414
1 Olivia F 19246
2 Ava F 16237
3 Sophia F 16070
4 Isabella F 14722
5 Mia F 14366
6 Charlotte F 13030
7 Abigail F 11699
... ... ... ...
32860 Ziya M 5
32861 Ziyang M 5
32862 Zoel M 5
32863 Zolton M 5
32864 Zurich M 5
32865 Zyahir M 5
32866 Zyel M 5
32867 Zylyn M 5

32868 rows × 3 columns

We obtain a Pandas DataFrame object which represents 2D data. The info() method provides the main characteritics of the DataFrame.

In [4]:
# characteristics of a DataFrame
us2016.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32868 entries, 0 to 32867
Data columns (total 3 columns):
name      32868 non-null object
gender    32868 non-null object
births    32868 non-null int64
dtypes: int64(1), object(2)
memory usage: 770.4+ KB
In [5]:
# access to the index
us2016.index
Out[5]:
RangeIndex(start=0, stop=32868, step=1)
In [6]:
# acces to columns
us2016.columns
Out[6]:
Index(['name', 'gender', 'births'], dtype='object')
In [7]:
# acces to dimensions
us2016.shape
Out[7]:
(32868, 3)

We can access to a specific column by using the [] operator and obtain a Pandas Series object which represents 1D data. It shares the original DataFrame index.

In [8]:
# access to a specific column
us2016['name']  # us2016.name
Out[8]:
0             Emma
1           Olivia
2              Ava
3           Sophia
4         Isabella
5              Mia
6        Charlotte
7          Abigail
           ...    
32860         Ziya
32861       Ziyang
32862         Zoel
32863       Zolton
32864       Zurich
32865       Zyahir
32866         Zyel
32867        Zylyn
Name: name, dtype: object
In [9]:
# number of US unique names
us2016['name'].nunique()
Out[9]:
30294

Few operations on DataFrame: transposition, sorting and grouping

Transposition reverse rows and columns and switch index and column labels : transpose() or T method.

In [10]:
# transposition T or transpose()
us2016.T
Out[10]:
0 1 2 3 4 5 6 7 8 9 ... 32858 32859 32860 32861 32862 32863 32864 32865 32866 32867
name Emma Olivia Ava Sophia Isabella Mia Charlotte Abigail Emily Harper ... Zinn Zirui Ziya Ziyang Zoel Zolton Zurich Zyahir Zyel Zylyn
gender F F F F F F F F F F ... M M M M M M M M M M
births 19414 19246 16237 16070 14722 14366 13030 11699 10926 10733 ... 5 5 5 5 5 5 5 5 5 5

3 rows × 32868 columns

It is possible to sort a DataFrame by one or more columns and specifying the ascending : sort_values() method.

In [11]:
# top 10 US names in 2016
us2016.sort_values('births', ascending=False).head(10)
Out[11]:
name gender births
0 Emma F 19414
1 Olivia F 19246
18757 Noah M 19015
18758 Liam M 18138
2 Ava F 16237
3 Sophia F 16070
18759 William M 15668
18760 Mason M 15192
18761 James M 14776
4 Isabella F 14722

DataFrame are provided with a powerful groupby() method which splits it into chuncks - according to the values of one or more columns - and then apply a summarization function to each chunk and concatenates the whole in a new DataFrame.

This illustrates the split-apply-combine strategy : groupby() then sum() methods.

In [12]:
# US female and male births of year 2016
us2016.groupby('gender').sum()
Out[12]:
births
gender
F 1756647
M 1880674

1.1.2 Loading all files

We load and merge all CSV files within the folder names. The year is extracted from file names and added as a new column. We use the append() method which concatenates 2 DataFrames.

In [13]:
# load all data in a single DataFrame
us = pd.DataFrame()
for root, dirs, files in os.walk(path):
    for filename in files:
        if filename[0:3] != "yob":
            continue
        csv = pd.read_csv(os.path.join(root, filename),
                           header=None,
                           names=['name', 'gender', 'births'])
        year = int(filename[3:7])  # yobYYYY.txt
        csv['year'] = year
        us = us.append(csv, ignore_index=True)

us = us[['year', 'name', 'gender', 'births']]  # fancy indexing in order to reorder the columns

us
Out[13]:
year name gender births
0 1880 Mary F 7065
1 1880 Anna F 2604
2 1880 Emma F 2003
3 1880 Elizabeth F 1939
4 1880 Minnie F 1746
5 1880 Margaret F 1578
6 1880 Ida F 1472
7 1880 Alice F 1414
... ... ... ... ...
1891886 2016 Ziya M 5
1891887 2016 Ziyang M 5
1891888 2016 Zoel M 5
1891889 2016 Zolton M 5
1891890 2016 Zurich M 5
1891891 2016 Zyahir M 5
1891892 2016 Zyel M 5
1891893 2016 Zylyn M 5

1891894 rows × 4 columns

In [14]:
# characteristics of a DataFrame
us.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1891894 entries, 0 to 1891893
Data columns (total 4 columns):
year      int64
name      object
gender    object
births    int64
dtypes: int64(2), object(2)
memory usage: 57.7+ MB
In [15]:
# number of US unique names
us['name'].nunique()
Out[15]:
96174
In [16]:
# top 10 global US names within a year
us.sort_values('births', ascending=False).head(10)
Out[16]:
year name gender births
431059 1947 Linda F 99685
441430 1948 Linda F 96210
437162 1947 James M 94762
544612 1957 Michael M 92716
437163 1947 Robert M 91641
451669 1949 Linda F 91013
533150 1956 Michael M 90620
556187 1958 Michael M 90512
447469 1948 James M 88584
510799 1954 Michael M 88525
In [17]:
# global US female and male births
us.groupby('gender').sum()['births']  [['births']]
Out[17]:
gender
births   NaN
Name: births, dtype: float64

What are the top 10 US names all years? We are going to implement a function so that it can be reused later.

In [18]:
# top (10) names all years
def top(df, size=10):
    return df.groupby(['name', 'gender']).sum()['births'].sort_values(ascending=False).head(size)
In [19]:
# global US top 10 names all years
top(us)
Out[19]:
name     gender
James    M         5136026
John     M         5105834
Robert   M         4809176
Michael  M         4338144
Mary     F         4120692
William  M         4087525
David    M         3601032
Joseph   M         2592856
Richard  M         2560771
Charles  M         2378806
Name: births, dtype: int64

1.2 Loading French Data

Firstly, we load the French data in order to have a look at them.

In [20]:
# load 
fr = pd.read_csv("datasets/nat2015.txt",
                     sep='\t',
                     encoding='latin-1',
                     header=0,
                     names = ['gender', 'name', 'year', 'births'])
fr = fr[['year', 'name', 'gender', 'births']]
fr
Out[20]:
year name gender births
0 1980 A 1 3.0
1 1998 A 1 3.0
2 XXXX A 1 21.0
3 1976 AADEL 1 5.0
4 1978 AADEL 1 3.0
5 1980 AADEL 1 3.0
6 1981 AADEL 1 5.0
7 1982 AADEL 1 4.0
... ... ... ... ...
589403 1992 ÖZLEM 2 5.0
589404 1993 ÖZLEM 2 7.0
589405 2010 ÖZLEM 2 3.0
589406 2012 ÖZLEM 2 6.0
589407 2013 ÖZLEM 2 3.0
589408 2014 ÖZLEM 2 5.0
589409 2015 ÖZLEM 2 3.0
589410 XXXX ÖZLEM 2 25.0

589411 rows × 4 columns

Secondly, we reload the French data so that they have the same characteristics as the US ones :

  • columns :
    • gender mapping '1' => 'M', '2' => 'F' and switching to caterogy
    • name case including compound names with '-'
    • year astype int
    • births astype int
  • eliminate irrelevant data
  • sort the data
  • reset the index
In [21]:
# load French data
fr = pd.read_csv("datasets/nat2015.txt",
                     sep='\t',
                     encoding='latin-1',
                     header=0,
                     names = ['gender', 'name', 'year', 'births'],
                     converters = {'gender': lambda x: 'M' if x == '1' else 'F',
                                   'name': lambda x: '-'.join(map(lambda y: y.capitalize(), x.split('-'))),
                                   'births': lambda x: int(float(x))})
fr = fr[(fr['year'] != 'XXXX') &
              (fr['name'] != '_prenoms_rares') &
              (fr['name'].str.len() != 1)]
fr['year'] = fr['year'].astype(int)
fr = fr[['year', 'name', 'gender', 'births']]
fr.sort_values(['year', 'gender', 'births', 'name'], ascending=[True, True, False, True], inplace=True)
fr.reset_index(drop=True, inplace=True)
fr
Out[21]:
year name gender births
0 1900 Marie F 48727
1 1900 Jeanne F 13983
2 1900 Marguerite F 8059
3 1900 Germaine F 6981
4 1900 Louise F 6698
5 1900 Yvonne F 5099
6 1900 Madeleine F 4960
7 1900 Suzanne F 4954
... ... ... ... ...
556735 2015 Zayad M 3
556736 2015 Zayane M 3
556737 2015 Ziggy M 3
556738 2015 Zoan M 3
556739 2015 Zoubir M 3
556740 2015 Zoé M 3
556741 2015 Étan M 3
556742 2015 Évann M 3

556743 rows × 4 columns

In [22]:
# characteristics of a DataFrame
fr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 556743 entries, 0 to 556742
Data columns (total 4 columns):
year      556743 non-null int32
name      556743 non-null object
gender    556743 non-null object
births    556743 non-null int64
dtypes: int32(1), int64(1), object(2)
memory usage: 14.9+ MB
In [23]:
# top 10 global French names within a year
fr.sort_values('births', ascending=False).head(10)
Out[23]:
year name gender births
95476 1946 Jean M 53897
1705 1901 Marie F 52167
3434 1902 Marie F 51870
97892 1947 Jean M 51665
5174 1903 Marie F 50445
6931 1904 Marie F 50153
8709 1905 Marie F 48998
0 1900 Marie F 48727
10545 1906 Marie F 48465
14232 1908 Marie F 47479
In [24]:
# global French female and male births
fr.groupby('gender').sum()['births']
Out[24]:
gender
F    39990411
M    40680898
Name: births, dtype: int64
In [25]:
# global French top 10 names all years
top(fr)
Out[25]:
name      gender
Marie     F         2234906
Jean      M         1919820
Pierre    M          889306
Michel    M          820080
André     M          711757
Jeanne    F          550384
Philippe  M          538133
René      M          516492
Alain     M          506750
Louis     M          504437
Name: births, dtype: int64

2. Searching in the Data

2.1 Two letters names

We are going to implement a function that looks for names with only 2 letters. We use the str operator which tells Python to process series data as strings. Then we use the [] operator for a logical selection.

In [26]:
# making a boolean mask
mask = us['name'].str.len() == 2
mask
Out[26]:
0          False
1          False
2          False
3          False
4          False
5          False
6          False
7          False
           ...  
1891886    False
1891887    False
1891888    False
1891889    False
1891890    False
1891891    False
1891892    False
1891893    False
Name: name, dtype: bool
In [27]:
# using the mask to select data
us[mask]
Out[27]:
year name gender births
996 1880 Ed M 310
1444 1880 Wm M 14
1467 1880 Al M 12
1850 1880 Ab M 5
2880 1881 Jo F 5
2998 1881 Ed M 272
3326 1881 Al M 19
3396 1881 Wm M 15
... ... ... ... ...
1889035 2016 Jb M 6
1889463 2016 Or M 6
1889871 2016 Zi M 6
1889875 2016 Zo M 6
1890122 2016 Bb M 5
1890482 2016 En M 5
1891392 2016 Qi M 5
1891817 2016 Ye M 5

4588 rows × 4 columns

We implement a function which perform all necessary operations and run US and French data using the unique() method.

In [28]:
# searching for names with 2 letters only
def names_of_length_2(df):
    mask = df['name'].str.len() == 2
    return df[mask]['name'].unique()
In [29]:
# US names with 2 letters
names_of_length_2(us)
Out[29]:
array(['Ed', 'Wm', 'Al', 'Ab', 'Jo', 'Ah', 'Ng', 'Lu', 'Le', 'Ad', 'Cy',
       'Bo', 'Mc', 'La', 'Jc', 'Hy', 'Jb', 'De', 'Jd', 'Jt', 'El', 'Jr',
       'Ty', 'Vi', 'Ac', 'Sy', 'Lc', 'Od', 'Tj', 'Wc', 'Ld', 'Rc', 'Jw',
       'Be', 'Eb', 'Jl', 'St', 'Lo', 'Oc', 'Rb', 'Lj', 'Lb', 'Rl', 'Rd',
       'Hu', 'Lg', 'Av', 'Ma', 'Ej', 'Si', 'Su', 'Ok', 'An', 'Di', 'Ki',
       'Ky', 'Li', 'Ea', 'Mi', 'In', 'Ja', 'Zo', 'Fe', 'Aj', 'Kc', 'Ka',
       'Jj', 'Dj', 'Bj', 'Gy', 'Je', 'Ji', 'Cj', 'Ra', 'Yu', 'Pj', 'Ta',
       'My', 'Ry', 'Ho', 'Ly', 'Na', 'Rj', 'Ai', 'Tu', 'Vu', 'Ha', 'Ti',
       'Un', 'Jm', 'Pa', 'Ju', 'So', 'Vy', 'Jp', 'Ia', 'Yi', 'Du', 'Ge',
       'Va', 'Sa', 'Wa', 'Po', 'Tc', 'Ni', 'Da', 'Fu', 'Om', 'Ya', 'Me',
       'Ko', 'Ku', 'Se', 'Gi', 'Xu', 'Kt', 'Ye', 'Kd', 'Te', 'Do', 'Md',
       'Kj', 'Ii', 'Mr', 'Bg', 'No', 'Bb', 'Ke', 'Oz', 'Ze', 'Io', 'Qi',
       'Zi', 'Ny', 'Zy', 'Mj', 'Yo', 'Eh', 'Or', 'En', 'Za', 'Sj', 'Ar',
       'Ro', 'Mo', 'Oh', 'Xi', 'Tr'], dtype=object)
In [30]:
# French names with 2 letters
names_of_length_2(fr)
Out[30]:
array(['St', 'Ah', 'Jo', 'El', 'Ab', 'Si', 'Le', 'Ha', 'My', 'Al', 'Da',
       'An', 'Ly', 'Pa', 'Va', 'Ai', 'Yi', 'By', 'Tu', 'Ka', 'Ma', 'Yu',
       'Na', 'So', 'Ya', 'Vu', 'Ky', 'Lu', 'Li', 'Vi', 'To', 'Ba', 'Sy',
       'Su', 'Do', 'Ea', 'De', 'Ny', 'Mc', 'Or', 'Ty', 'La', 'Bo', 'Zi',
       'Bi'], dtype=object)

2.2 Names for which births = year

Now, we implement a function which looks for names for which births within a year are equal to the year.

In [31]:
# looking for names for wich births within a year are equal to the year
def births_is_year(df):
    return df[df['births'] == df['year']]
In [32]:
# US names with births equals year
births_is_year(us)
Out[32]:
year name gender births
76786 1907 Minnie F 1907
134324 1916 Oscar M 1916
265125 1929 Velma F 1929
274940 1930 Janice F 1930
312410 1934 Agnes F 1934
371429 1940 Vernon M 1940
421493 1946 Margie F 1946
544774 1957 Gilbert M 1957
1249432 1997 Selena F 1997
1293009 1998 Gage M 1998
1692050 2011 Eliana F 2011
1745461 2012 Avery M 2012
1859178 2016 Norah F 2016
1859179 2016 Rose F 2016
In [33]:
# French names with briths equals year
births_is_year(fr)
Out[33]:
year name gender births
3461 1902 Angèle F 1902
23212 1912 Marius M 1912
55913 1928 Gaston M 1928
68050 1933 Fernand M 1933
466147 2009 Lou F 2009

2.3 Looking for prefix

Here, we implement a function which search for prefix names.

First, we select the names starting with a specific prefix using the str operator and the startswith() method.

In [34]:
# select US names starting with 'Fran'
selection = us[us['name'].str.startswith("Fran")]
selection
Out[34]:
year name gender births
41 1880 Frances F 605
205 1880 Francis F 55
410 1880 Frankie F 17
478 1880 Frank F 13
947 1880 Frank M 3242
991 1880 Francis M 344
1053 1880 Franklin M 120
1336 1880 Francisco M 19
... ... ... ... ...
1882584 2016 Franz M 20
1882939 2016 Fransisco M 18
1883613 2016 Francois M 15
1886094 2016 Franciszek M 9
1887714 2016 Fran M 7
1890537 2016 Frances M 5
1890538 2016 Franck M 5
1890539 2016 Frandy M 5

4705 rows × 4 columns

Second, we can reuse the top function and implement a new function.

In [35]:
# looking for prefix
def look(df, prefix, **kwargs):
    selection = df[df['name'].str.startswith(prefix)]
    return top(selection, **kwargs)
In [36]:
# US names starting with "Fran"
look(us, "Fran")
Out[36]:
name       gender
Frank      M         909430
Frances    F         588601
Francis    M         284528
Franklin   M         134707
Francisco  M         131432
Frankie    M          40310
Francine   F          35747
Frankie    F          32913
Francis    F          29041
Francesca  F          28176
Name: births, dtype: int64
In [37]:
# French names starting with "Fran"
look(fr, "Fran")
Out[37]:
name             gender
Françoise        F         401522
François         M         398241
Franck           M         177720
Francis          M         166060
Francine         F          92723
France           F          28073
Frank            M          11379
Francette        F           7777
Francisco        M           7236
François-Xavier  M           7125
Name: births, dtype: int64

We can see a French compound name. What are the top French compound name?

In [38]:
# global French top 10 compound names all years
selection = fr[fr['name'].str.contains('-')]
top(selection)
Out[38]:
name           gender
Jean-Pierre    M         213143
Jean-Claude    M         172142
Jean-Luc       M         116647
Anne-Marie     F          99695
Jean-François  M          99024
Jean-Marc      M          90144
Marie-Thérèse  F          88345
Jean-Paul      M          86843
Jean-Marie     M          86223
Jean-Michel    M          85960
Name: births, dtype: int64

Which French names are involved in a compound name?

3. Putting Data into Graphs

3.1 Evolution of births for a single name (+ gender) over years

First, we select the data according to a name (Kevin) and a gender (M). Logical operators (&, | and ~) are provided for selecting data.

In [39]:
# selection of data related to Kevin (M)
selection = us[(us['name'] == "Kevin") & (us['gender'] == "M")]
selection
Out[39]:
year name gender births
103699 1912 Kevin M 9
110513 1913 Kevin M 9
117875 1914 Kevin M 12
126616 1915 Kevin M 15
135970 1916 Kevin M 18
145816 1917 Kevin M 19
155956 1918 Kevin M 21
166850 1919 Kevin M 14
... ... ... ... ...
1643354 2009 Kevin M 8767
1677700 2010 Kevin M 7322
1711516 2011 Kevin M 6558
1745343 2012 Kevin M 6216
1778794 2013 Kevin M 5929
1811995 2014 Kevin M 5886
1845095 2015 Kevin M 5164
1877871 2016 Kevin M 4526

105 rows × 4 columns

Second, we switch the year column to be the index of the DataFrame.

In [40]:
# same with year as index
selection_years = selection.set_index('year')
selection_years
Out[40]:
name gender births
year
1912 Kevin M 9
1913 Kevin M 9
1914 Kevin M 12
1915 Kevin M 15
1916 Kevin M 18
1917 Kevin M 19
1918 Kevin M 21
1919 Kevin M 14
... ... ... ...
2009 Kevin M 8767
2010 Kevin M 7322
2011 Kevin M 6558
2012 Kevin M 6216
2013 Kevin M 5929
2014 Kevin M 5886
2015 Kevin M 5164
2016 Kevin M 4526

105 rows × 3 columns

There is a strong connexion between Pandas DataFrame and Matplotlib. We can directly use the Matplotlib plot() function which applied to DataFrame or Series as a method.

In [41]:
# plot the births column along with the years
selection_years.plot(title="Births of Kevin (M) over years");

We can put every thing together in a single function and run the French data.

In [42]:
# evolution of a name (+ gender)
def plot_name_gender(df, name, gender):
    selection = df[(df['name'] == name) & (df['gender'] == gender)]
    selection_years = selection.set_index('year')
    selection_years.plot(title="Births of {} ({}) over years".format(name, gender));
In [43]:
# evolution of French Kevin (M)
plot_name_gender(fr, 'Kevin', 'M')

We will try to find later correlations between US and French names over time.

3.2 Evolution of births by gender over years

DataFrame are provided with another powerful method pivot_table(). It produces a new DataFrame with as the index the different values of a first column, as the column index the different values of second column and the values of a third column are summarized into cells by using an aggregative function.

We use here this method to obtain the births by gender over years.

In [44]:
# pivoting US data by years and gender and computing the sum of births
genders = us.pivot_table(index='year',
                         columns='gender',
                         values='births',
                         aggfunc=np.sum)
genders
Out[44]:
gender F M
year
1880 90992 110491
1881 91953 100743
1882 107847 113686
1883 112318 104627
1884 129020 114443
1885 133055 107799
1886 144533 110784
1887 145981 101415
... ... ...
2009 1834048 1980491
2010 1774186 1915331
2011 1755278 1895156
2012 1756347 1892094
2013 1749061 1885683
2014 1779496 1913434
2015 1776538 1907211
2016 1756647 1880674

137 rows × 2 columns

In [45]:
# plot number of births by gender over years
genders.plot(title="Total births by gender over years", color=['r', 'b']);

We can put every thing together in a single function and run the French data.

In [46]:
# plot number of births by gender over years
def plot_births(df):
    table = df.pivot_table(index='year',
                           columns='gender',
                           values='births',
                           aggfunc=np.sum)
    table.plot(title="Total births by gender over years", color=['r', 'b']);
In [47]:
# plot French births by gender over years
plot_births(fr)

4. Measuring the increase in naming diversity

We are using the crosstab() method in order to compute the number of different names by gender over years. In the 1880' parents had roughly 1000 names per gender to name their babies. Nowadays, parents have 14 to 19 thousands!

In [48]:
# number of different names by gender over years
diversity = pd.crosstab(us['year'], us['gender'])
diversity.head()
Out[48]:
gender F M
year
1880 942 1058
1881 938 997
1882 1028 1099
1883 1054 1030
1884 1172 1125
In [49]:
# nowadays
diversity.tail()
Out[49]:
gender F M
year
2012 19486 14229
2013 19219 14034
2014 19169 14037
2015 19054 14009
2016 18757 14111

We can put every thing together in a single function.

In [50]:
# plot evolution of number of names over years
def plot_diversity(df):
    diversity = pd.crosstab(df['year'], df['gender'])
    diversity.plot(title="Evolution of names diversity by gender over years", color=['r', 'b']);
In [51]:
# US diversity
plot_diversity(us)
In [52]:
# French diversity
plot_diversity(fr)

5. Evolution of Last Letter

We will now add a column with the last letter of each name and compute the evolution of the most frequent last letters which appears nowadays. This is computed by using the apply() method and a lambda function which extracts the last letter and put it in uppercase for readability.

In [53]:
# add a column with the last later of names
us["last"] = us["name"].apply(lambda x: x[-1].upper())
us
Out[53]:
year name gender births last
0 1880 Mary F 7065 Y
1 1880 Anna F 2604 A
2 1880 Emma F 2003 A
3 1880 Elizabeth F 1939 H
4 1880 Minnie F 1746 E
5 1880 Margaret F 1578 T
6 1880 Ida F 1472 A
7 1880 Alice F 1414 E
... ... ... ... ... ...
1891886 2016 Ziya M 5 A
1891887 2016 Ziyang M 5 G
1891888 2016 Zoel M 5 L
1891889 2016 Zolton M 5 N
1891890 2016 Zurich M 5 H
1891891 2016 Zyahir M 5 R
1891892 2016 Zyel M 5 L
1891893 2016 Zylyn M 5 N

1891894 rows × 5 columns

Then we use the crosstab() function in order to get the number of births by last letters over years.

In [54]:
# diversity of names by last letter over years
last = pd.crosstab(us['year'], us['last'])
last
Out[54]:
last A B C D E F G H I J ... Q R S T U V W X Y Z
year
1880 450 7 7 93 490 7 9 60 7 0 ... 0 89 125 76 5 1 7 4 148 4
1881 432 6 8 85 484 5 13 59 6 0 ... 0 83 116 78 4 1 7 5 142 2
1882 478 7 8 96 538 8 10 70 6 0 ... 0 87 124 83 3 1 8 5 167 4
1883 480 5 8 96 542 7 10 61 4 0 ... 0 86 122 79 4 1 9 6 151 2
1884 543 10 7 101 584 6 10 72 4 0 ... 0 100 133 83 4 1 7 4 179 4
1885 547 8 8 88 614 6 12 69 6 0 ... 0 95 130 83 3 1 6 5 174 2
1886 583 7 7 103 633 4 12 72 5 0 ... 0 96 127 87 4 1 7 4 188 3
1887 591 8 8 105 633 6 8 73 6 0 ... 0 102 121 81 4 1 7 4 173 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2009 7975 124 117 496 4438 70 96 2610 2330 56 ... 23 1250 1222 480 135 67 47 105 2210 188
2010 7659 135 123 508 4345 75 103 2621 2310 68 ... 21 1244 1182 453 137 71 42 107 2192 178
2011 7399 127 105 501 4417 76 96 2613 2327 67 ... 24 1249 1171 452 151 70 50 116 2155 170
2012 7293 117 118 476 4364 76 111 2560 2402 70 ... 23 1282 1182 479 136 74 47 119 2168 172
2013 7150 124 110 502 4333 84 103 2550 2300 63 ... 22 1256 1156 471 154 77 47 132 2147 161
2014 7050 124 105 517 4262 74 102 2523 2286 69 ... 21 1302 1152 499 140 77 51 136 2130 170
2015 6982 114 114 513 4238 78 111 2513 2386 71 ... 28 1338 1167 486 152 90 47 151 2149 157
2016 6944 126 113 525 4073 77 122 2518 2385 72 ... 30 1391 1149 510 171 80 50 156 2081 153

137 rows × 26 columns

Then we sort the letters of the last available year. We use the .iloc[] operator which enables to select rows, columns and parts of a DataFrame by indices.

In [55]:
# sort most used last letters for US names in 2015
last.iloc[-1].sort_values(ascending=False)
Out[55]:
last
N    7015
A    6944
E    4073
H    2518
I    2385
Y    2081
R    1391
L    1350
     ... 
G     122
C     113
V      80
F      77
J      72
W      50
P      31
Q      30
Name: 2016, dtype: int64

Here we put the 7 top letters in an array.

In [56]:
# put the 7 letters in an array
letters = last.iloc[-1].sort_values(ascending=False).index[:7]
letters
Out[56]:
Index(['N', 'A', 'E', 'H', 'I', 'Y', 'R'], dtype='object', name='last')

We can perform a fancy indexing on the DataFrame with the 7 letters and perform a plot.

In [57]:
# last letters DataFrame limited to 7 top last letters of 2015
last[letters]
Out[57]:
last N A E H I Y R
year
1880 223 450 490 60 7 148 89
1881 214 432 484 59 6 142 83
1882 224 478 538 70 6 167 87
1883 214 480 542 61 4 151 86
1884 244 543 584 72 4 179 100
1885 226 547 614 69 6 174 95
1886 240 583 633 72 5 188 96
1887 233 591 633 73 6 173 102
... ... ... ... ... ... ... ...
2009 7460 7975 4438 2610 2330 2210 1250
2010 7353 7659 4345 2621 2310 2192 1244
2011 7400 7399 4417 2613 2327 2155 1249
2012 7322 7293 4364 2560 2402 2168 1282
2013 7177 7150 4333 2550 2300 2147 1256
2014 7271 7050 4262 2523 2286 2130 1302
2015 7062 6982 4238 2513 2386 2149 1338
2016 7015 6944 4073 2518 2385 2081 1391

137 rows × 7 columns

We can put every thing together in a single function and run the French data.

In [58]:
# plot the evolution of diversity of names according to their last letter
def plot_last_letters(df):
    df["last"] = df["name"].apply(lambda x: x[-1].upper())
    last = pd.crosstab(df['year'], df['last'])
    letters = last.iloc[-1].sort_values(ascending=False).head(7).index
    last[letters].plot(title="Evolution of diversity of names according to their last letter");
In [59]:
# plot the evolution of diversity of US names according to their last letter
plot_last_letters(us)
In [60]:
# plot the evolution of diversity of US names according to their last letter
plot_last_letters(fr)
In [61]:
# interactive scatter plot of last letters for US and French names across years
from ipywidgets import interact, IntSlider

# prepare and normalize data
last_us = pd.crosstab(us['year'], us['last'])
last_us = last_us.div(last_us.sum(axis=1), axis=0)
last_fr = pd.crosstab(fr['year'], fr['last'])
last_fr = last_fr.div(last_fr.sum(axis=1), axis=0)

def plot_letters(year):
    data = pd.concat([last_us.loc[year], last_fr.loc[year]], axis=1, join='inner')
    plt.scatter(data.iloc[:,0], data.iloc[:,1])
    for (x, y, letter) in zip(data.iloc[:,0], data.iloc[:,1], last.columns):
        plt.text(x + 0.003, y + 0.003, letter)
        
interact(plot_letters, year=IntSlider(min=1900, max=2015, step=1, value=1900));

6. Names which gender changed over time

6.1 Global gender repartition per name

First we restrict names which appear at least 10000 times as F or M over time.

In [62]:
# find top names
top_names = us.groupby('name').sum()['births']
top_names = top_names.sort_values(ascending=False)
top_names = top_names[top_names >= 10000]
top_names
Out[62]:
name
James       5159306
John        5127501
Robert      4829274
Michael     4359820
Mary        4135851
William     4103456
David       3613916
Joseph      2603475
             ...   
Kenna         10079
Jovan         10068
Tatyana       10061
Karly         10054
Catrina       10045
Randell       10031
Gilda         10022
Marisela      10012
Name: births, dtype: int64

Select those names in the US DataFrame using the isin() method.

In [63]:
# data for top names
selection = us[us['name'].isin(top_names.index)]
selection
Out[63]:
year name gender births last
0 1880 Mary F 7065 Y
1 1880 Anna F 2604 A
2 1880 Emma F 2003 A
3 1880 Elizabeth F 1939 H
4 1880 Minnie F 1746 E
5 1880 Margaret F 1578 T
6 1880 Ida F 1472 A
7 1880 Alice F 1414 E
... ... ... ... ... ...
1891145 2016 Madelyn M 5 N
1891330 2016 Norbert M 5 T
1891353 2016 Ora M 5 A
1891365 2016 Paisley M 5 Y
1891402 2016 Rae M 5 E
1891493 2016 Rose M 5 E
1891530 2016 Samantha M 5 A
1891554 2016 Serenity M 5 Y

355075 rows × 5 columns

Secondly, we pivot the data so as to have both genders on each line sorted by names and the values are the sum of births per name and gender. We add a column with the ratio F vs M.

In [64]:
# pivot by name and gender with the sum of births
ratio_name = selection.pivot_table(index='name',
                                     columns='gender',
                                     values='births',
                                     aggfunc=np.sum)
ratio_name["ratio"] = ratio_name["F"] / (ratio_name["F"] + ratio_name["M"])
ratio_name
Out[64]:
gender F M ratio
name
Aaliyah 79367.0 89.0 0.999
Aaron 4269.0 568057.0 0.007
Abbey 17044.0 35.0 0.998
Abbie 21266.0 325.0 0.985
Abbigail 11286.0 5.0 1.000
Abby 56869.0 181.0 0.997
Abel 113.0 47412.0 0.002
Abigail 345728.0 621.0 0.998
... ... ... ...
Zane 490.0 43403.0 0.011
Zayden 129.0 14576.0 0.009
Zelda 12831.0 5.0 1.000
Zella 12056.0 NaN NaN
Zelma 16705.0 112.0 0.993
Zion 5472.0 24388.0 0.183
Zoe 116591.0 448.0 0.996
Zoey 71128.0 194.0 0.997

2469 rows × 3 columns

We can now plot the distribution of the ratio among names by using the plot() function with the option kind='hist'. We notice that most of names are mostly either female or male, but some names are possibily ambivalent.

In [65]:
# distribution of F vs M ratio for top 1000 US names
ratio_name['ratio'].plot(kind='hist');

We put everything in a function so that we can process French data.

In [66]:
# plot distribution of F vs M ratio for top 1000 names
def plot_distribution(df):
    top_names = df.groupby('name').sum()['births']
    top_names = top_names.sort_values(ascending=False)
    top_names = top_names[top_names >= 10000]
    selection = df[df['name'].isin(top_names.index)]
    ratio_name = selection.pivot_table(index=['name'],
                                         columns='gender',
                                         values='births',
                                         aggfunc=np.sum)
    ratio_name["ratio"] = ratio_name["F"] / (ratio_name["F"] + ratio_name["M"])
    ratio_name['ratio'].plot(kind='hist');
In [67]:
# distribution of F vs M ratio for top 1000 French names
plot_distribution(fr)

6.2 Plot evolution of names which gender changed over years

First we select the data related to a single name, for instance Leslie.

In [68]:
# select US data for Leslie
selection = us[us["name"] == "Leslie"]
selection
Out[68]:
year name gender births last
654 1880 Leslie F 8 E
1108 1880 Leslie M 79 E
2522 1881 Leslie F 11 E
3072 1881 Leslie M 92 E
4593 1882 Leslie F 9 E
5081 1882 Leslie M 122 E
6850 1883 Leslie F 7 E
7225 1883 Leslie M 120 E
... ... ... ... ... ...
1759787 2013 Leslie F 1111 E
1781819 2013 Leslie M 36 E
1793093 2014 Leslie F 1000 E
1814143 2014 Leslie M 61 E
1826333 2015 Leslie F 878 E
1847409 2015 Leslie M 54 E
1859438 2016 Leslie F 801 E
1880467 2016 Leslie M 45 E

274 rows × 5 columns

Then we compute the evolution of births for this name by gender over years. Note that we could have reuse the selection performed above in ratio_gender.

In [69]:
# evolution of births by gender
evolution_gender = selection.pivot_table(index='year',
                                         columns='gender',
                                         values='births')
evolution_gender
Out[69]:
gender F M
year
1880 8 79
1881 11 92
1882 9 122
1883 7 120
1884 15 125
1885 10 122
1886 8 136
1887 12 166
... ... ...
2009 1981 76
2010 1564 47
2011 1285 54
2012 1220 50
2013 1111 36
2014 1000 61
2015 878 54
2016 801 45

137 rows × 2 columns

Then we compute the F and M ratio by dividing the births by the sum of each line (axis=1) along the index (axis=0).

In [70]:
# gender evolution over years
evolution_gender = evolution_gender.div(evolution_gender.sum(axis=1), axis=0)
evolution_gender
Out[70]:
gender F M
year
1880 0.092 0.908
1881 0.107 0.893
1882 0.069 0.931
1883 0.055 0.945
1884 0.107 0.893
1885 0.076 0.924
1886 0.056 0.944
1887 0.067 0.933
... ... ...
2009 0.963 0.037
2010 0.971 0.029
2011 0.960 0.040
2012 0.961 0.039
2013 0.969 0.031
2014 0.943 0.057
2015 0.942 0.058
2016 0.947 0.053

137 rows × 2 columns

Put everything in a function and run over US and French data.

In [71]:
# plot gender ratio evolution of years
def plot_transname(df, name):
    selection = df[df["name"] == name]
    gender = selection.pivot_table(index='year',
                                   columns='gender',
                                   values='births')
    gender = gender.div(gender.sum(axis=1), axis=0)
    gender.plot(title="Evolution proportion of {} by gender over years".format(name), style={'M': 'b', 'F': 'r'});
In [72]:
# gender evolution of US Donnie
plot_transname(us, "Leslie")
In [73]:
# gender evolution of US Donnie
plot_transname(us, "Donnie")
In [74]:
plot_transname(us, 'Jean')
In [75]:
# gender evolution of French Camille
plot_transname(fr, "Camille")
In [76]:
# gender evolution of French Dominique
plot_transname(fr, "Dominique")
In [77]:
# gender evolution of French Alix
plot_transname(fr, "Alix")

7. Few cross analysis

7.1 Names with different genders in US and France

First, we partly reuse the plot_distribution() function to get the F vs M ratio for top 1000 names.

In [78]:
# add a ratio F vs M for top 1000 names
def ratio(df):
    top1000 = df.groupby('name').sum()['births']
    top1000 = top1000.sort_values(ascending=False).head(1000)
    selection = df[df['name'].isin(top1000.index)]
    ratio_name = selection.pivot_table(index=['name'],
                                         columns='gender',
                                         values='births',
                                         aggfunc=np.sum)
    ratio_name["ratio"] = ratio_name["F"] / (ratio_name["F"] + ratio_name["M"])
    return ratio_name
In [79]:
ratio(us)
Out[79]:
gender F M ratio
name
Aaliyah 79367 89 0.999
Aaron 4269 568057 0.007
Abby 56869 181 0.997
Abigail 345728 621 0.998
Abraham 203 86304 0.002
Ada 93464 277 0.997
Adam 2024 544822 0.004
Addison 110535 12623 0.898
... ... ... ...
Wyatt 387 137667 0.003
Xavier 826 125706 0.007
Yolanda 112251 537 0.995
Yvette 62953 186 0.997
Yvonne 155219 619 0.996
Zachary 1513 527858 0.003
Zoe 116591 448 0.996
Zoey 71128 194 0.997

1000 rows × 3 columns

In [80]:
ratio(fr)
Out[80]:
gender F M ratio
name
Aaron NaN 17337.0 NaN
Abdelkader NaN 9805.0 NaN
Abel 141.0 23293.0 6.017e-03
Achille 142.0 13088.0 1.073e-02
Adam 45.0 50049.0 8.983e-04
Adel NaN 7566.0 NaN
Adelaide 9801.0 243.0 9.758e-01
Adeline 54154.0 NaN NaN
... ... ... ...
Yveline 9603.0 NaN NaN
Yves 234.0 229361.0 1.019e-03
Yvette 203485.0 NaN NaN
Yvon NaN 40549.0 NaN
Yvonne 257058.0 3.0 1.000e+00
Zakaria NaN 9572.0 NaN
Zelie 7147.0 NaN NaN
Zoé 51220.0 311.0 9.940e-01

1000 rows × 3 columns

Second, we concatenate the results to get them in a single DataFrame. We need to rename the columns.

In [81]:
# US and French top 1000 names with F vs M ratio
data = pd.concat([ratio(us), ratio(fr)], axis=1)
data.columns = ['us_F', 'us_M', 'us_ratio', 'fr_F', 'fr_M', 'fr_ratio']
data
Out[81]:
us_F us_M us_ratio fr_F fr_M fr_ratio
Aaliyah 79367.0 89.0 0.999 NaN NaN NaN
Aaron 4269.0 568057.0 0.007 NaN 17337.0 NaN
Abby 56869.0 181.0 0.997 NaN NaN NaN
Abdelkader NaN NaN NaN NaN 9805.0 NaN
Abel NaN NaN NaN 141.0 23293.0 0.006
Abigail 345728.0 621.0 0.998 NaN NaN NaN
Abraham 203.0 86304.0 0.002 NaN NaN NaN
Achille NaN NaN NaN 142.0 13088.0 0.011
... ... ... ... ... ... ...
Yvon NaN NaN NaN NaN 40549.0 NaN
Yvonne 155219.0 619.0 0.996 257058.0 3.0 1.000
Zachary 1513.0 527858.0 0.003 NaN NaN NaN
Zakaria NaN NaN NaN NaN 9572.0 NaN
Zelie NaN NaN NaN 7147.0 NaN NaN
Zoe 116591.0 448.0 0.996 NaN NaN NaN
Zoey 71128.0 194.0 0.997 NaN NaN NaN
Zoé NaN NaN NaN 51220.0 311.0 0.994

1728 rows × 6 columns

Finally, we add a delta column with the absolute difference between US and French ratio and we sort the DataFrame.

In [82]:
# compute ratio absolute difference and sort by delta
data["delta"] = np.abs(data["us_ratio"] - data["fr_ratio"])
data.sort_values("delta", ascending=False).tail(10)
Out[82]:
us_F us_M us_ratio fr_F fr_M fr_ratio delta
Yveline NaN NaN NaN 9603.0 NaN NaN NaN
Yves NaN NaN NaN 234.0 229361.0 0.001 NaN
Yvette 62953.0 186.0 0.997 203485.0 NaN NaN NaN
Yvon NaN NaN NaN NaN 40549.0 NaN NaN
Zachary 1513.0 527858.0 0.003 NaN NaN NaN NaN
Zakaria NaN NaN NaN NaN 9572.0 NaN NaN
Zelie NaN NaN NaN 7147.0 NaN NaN NaN
Zoe 116591.0 448.0 0.996 NaN NaN NaN NaN
Zoey 71128.0 194.0 0.997 NaN NaN NaN NaN
Zoé NaN NaN NaN 51220.0 311.0 0.994 NaN

7.2 Correlation between US and French names over time

Here, we try to compute the correlation between US and French names (+ gender) over time. We build 2 Series with births and compute the correlation over 60 years by shifting one Series along the other.

In [83]:
def correlation(us, fr, name, gender):
    # select and pivot US data
    us_name = us[(us['name'] == name) & (us['gender'] == gender)]
    us_name = us_name.pivot_table(index='year', values='births')
    
    # select and pivot French data
    fr_name = fr[(fr['name'] == name) & (fr['gender'] == gender)]
    fr_name = fr_name.pivot_table(index='year', values='births')
    
    # get the best correlation over 60 years
    n = 60
    k = int(n / 2) - 1
    results = np.zeros(n)
    for i in range(-k, k + 1):
        results[i + k] = us_name.corr(fr_name.shift(-i))  # French data are shifted negatively
    best = results.argmax()
    print("Correlation over {} years : {:.2f}".format(best - k, results[best]))
    
    # concatenate and plot data
    data = pd.concat([us_name, fr_name], axis=1)
    data.columns=['us', 'fr']
    data.plot(title="Correlation between US and French {} ({}) over years".format(name, gender), color=['r', 'b'])
In [84]:
us_name = us[(us['name'] == "Alex") & (us['gender'] == "M")]
us_name = us_name.pivot_table(index='year', values='births')
us_name
Out[84]:
year
1880     147
1881     114
1882     172
1883     120
1884     148
1885     159
1886     161
1887     136
        ... 
2009    5165
2010    4429
2011    4027
2012    3854
2013    3544
2014    3172
2015    3096
2016    2826
Name: births, dtype: int64
In [85]:
fr_name = fr[(fr['name'] == "Alex") & (fr['gender'] == "M")]
fr_name = fr_name.pivot_table(index='year', values='births')
fr_name
Out[85]:
year
1900      5
1901      9
1902      6
1903      9
1904     12
1905     14
1906     12
1907      9
       ... 
2008    545
2009    581
2010    561
2011    501
2012    493
2013    445
2014    409
2015    405
Name: births, dtype: int64
In [86]:
correlation(us, fr, "Alex", "M")
Correlation over 5 years : 0.89
In [87]:
correlation(us, fr, "Marc", "M")
Correlation over -11 years : 0.91
In [88]:
correlation(us, fr, "Cindy", "F")
Correlation over 28 years : 0.98

8. Conclusion and summary

In this session, we have used some of the powerful Python Data Analysis Ecosystem features.

Links refer to the pandas documentation. Most of the functions and methods we have used provide a number of possible arguments.

Data representation

Loading data with *read_csv()*

  • header
  • names
  • sep
  • encoding
  • converters

Getting information

  • index: Series and DataFrame index
  • columns: DataFrame column
  • shape: Series and DataFrame dimensions
  • info(): DataFrame informations
  • values: Index and Series values
  • unique(): Series unique values
  • nunique(): Series number of unique values

Selecting data

  • head() and tail()
  • Column accessing
  • Row accessing
  • Fancy indexing
  • Logical masking

Indexing and merging data

Computing over data

  • sum()
  • str.len(), str.startswith() and str.contains()
  • *apply()* and lambda
  • isin()
  • div()

Organizing data

Displaying data

© 2017 Yotta Conseil