Infer Gender from Indian Names
The ability to programmatically and reliably infer a person's social attributes from their name is useful for a broad set of tasks, from estimating bias in media coverage of women to estimating bias in lending against certain social groups. The U.S. Census Bureau publishes lists of first and last names that can be (and are) used to infer gender, race, ethnicity, etc., from names; the Indian government produces no comparable datasets. Hence, the relationship between names and gender, ethnicity, language group, etc., has generally been inferred from small datasets constructed in an ad hoc manner.
We fill this yawning gap. Using data from the Indian Electoral Rolls (parsed data here), we estimate the proportion of people with a given first name who are female, male, or third sex (see here), by year and state.
Data
In all, we capitalize on information in the parsed electoral rolls from the following 31 states and union territories:
| Andaman | Delhi | Kerala | Puducherry |
| Andhra Pradesh | Goa | Madhya Pradesh | Punjab |
| Arunachal Pradesh | Gujarat | Maharashtra | Rajasthan |
| Assam | Haryana | Manipur | Sikkim |
| Bihar | Himachal Pradesh | Meghalaya | Tripura |
| Chandigarh | Jammu and Kashmir | Mizoram | Uttar Pradesh |
| Dadra | Jharkhand | Nagaland | Uttarakhand |
| Daman | Karnataka | Odisha | |
How is the underlying data produced?
We split each name into a first name and a last name (see the Python notebook for how we do this), then aggregate by state and first_name and tabulate prop_male, prop_female, prop_third_gender, n_female, n_male, n_third_gender. We produce both native-language rolls and English transliterations. (We use indicate to produce transliterations for Hindi rolls.)
This is used to provide the base prediction.
Because the association between prop_female and first_name may change over time, we also use each person's age. Since the data were collected in 2017, we computed the year each person was born and then grouped by birth year to produce prop_male, prop_female, prop_third_gender, n_female, n_male, n_third_gender.
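The per-year aggregation described above can be sketched as follows. This is a minimal illustration in plain Python, not the notebook's actual code: the record layout (first_name, sex, age), the ROLL_YEAR constant, and the aggregate_by_name_year helper are hypothetical names introduced for the example.

```python
from collections import defaultdict

ROLL_YEAR = 2017  # year the rolls were collected, per the text above


def aggregate_by_name_year(records):
    """Tabulate counts and proportions per (first_name, birth_year).

    Each record is a dict with hypothetical keys: 'first_name',
    'sex' ('female', 'male', or 'third_gender'), and 'age' in years.
    """
    counts = defaultdict(
        lambda: {"n_female": 0, "n_male": 0, "n_third_gender": 0}
    )
    for rec in records:
        # Normalize the name so 'Kiran' and 'kiran ' land in one group.
        name = rec["first_name"].strip().lower()
        birth_year = ROLL_YEAR - rec["age"]
        counts[(name, birth_year)]["n_" + rec["sex"]] += 1

    table = {}
    for key, c in counts.items():
        total = c["n_female"] + c["n_male"] + c["n_third_gender"]
        table[key] = dict(
            c,
            prop_female=c["n_female"] / total,
            prop_male=c["n_male"] / total,
            prop_third_gender=c["n_third_gender"] / total,
        )
    return table


# Toy records, invented for illustration only.
toy = [
    {"first_name": "Kiran", "sex": "female", "age": 37},
    {"first_name": "kiran", "sex": "male", "age": 37},
    {"first_name": "kiran ", "sex": "male", "age": 37},
    {"first_name": "kiran", "sex": "female", "age": 20},
]
result = aggregate_by_name_year(toy)
print(result[("kiran", 1980)])
```

Dropping birth_year from the grouping key gives the per-name table, and adding a state field to the key gives the per-state variant.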
Issues with underlying data
Concerns:
Voting registration lists may not be accurate; they may systematically underrepresent poor people, minorities, and similar groups.
Voting registration lists are, at best, a census of adult citizens. But to the extent there is prejudice against women, etc., that prevents them from reaching adulthood, the data bakes those biases in.
Indian names are complicated. We do not have good parsers for them yet. We have gone for the default arrangement. Please go through the notebook to look at the judgments we make. We plan to improve the underlying data over time.
For states with non-English rolls, we use libindic to transliterate the names. The transliterations are consistently bad. (We hope to make progress here. We also plan to provide a way to match in the original script.)
Gender Classifier
We start by providing a base model for first_name that gives the Bayes optimal solution: the proportion of people with that name who are women. We also provide a series of base models for cases where the state of residence and year of birth are known.
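As a worked example of the base model, the sketch below computes the female proportion from raw counts. The counts are taken from the gaurav and yasmin rows of the sample output in the "Using naampy" section of this document; the prop_female helper is a hypothetical name used only for illustration and is not part of the package.

```python
def prop_female(n_female, n_male, n_third_gender=0):
    """Share of people with a given first name who are recorded as female."""
    total = n_female + n_male + n_third_gender
    if total == 0:
        raise ValueError("name has no observations")
    return n_female / total


# Counts copied from the sample output in this document.
gaurav = prop_female(n_female=47, n_male=25625)
yasmin = prop_female(n_female=6079, n_male=58)
print(round(gaurav, 6), round(yasmin, 6))
```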
If the name does not exist in the database, we use an ML model that predicts gender from the sequences of characters in the first name.
The model was trained as a regression problem instead of a classification problem because men and women share names. (See the histogram below for the female proportion for the dataset.) The model predicts the female proportion of the name. If it is less than 0.5, we classify it as male; otherwise, we classify it as female.
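The thresholding rule above can be sketched in a few lines. The classify helper is hypothetical. Returning 1 - p as the score for male predictions is an assumption made here so that the score always refers to the predicted label; the package may compute pred_prob differently.

```python
def classify(prop_female_pred, threshold=0.5):
    """Turn a predicted female proportion into a (label, score) pair."""
    if not 0.0 <= prop_female_pred <= 1.0:
        raise ValueError("predicted proportion must lie in [0, 1]")
    if prop_female_pred < threshold:
        # Assumption: report the score for the predicted label.
        return "male", 1.0 - prop_female_pred
    return "female", prop_female_pred


print(classify(0.25))
print(classify(0.5))
```

A prediction of exactly 0.5 falls on the female side, matching the "otherwise" branch of the rule.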
Test data
MSE no weights - loss: 0.04974181950092316, metric: 0.04974181950092316
RMSE no weights - loss: 0.21903139352798462, metric: 0.2212539166212082
Test data with weights
RMSE with weights - loss: 0.21645867824554443, metric: 0.2223343402147293
MSE with weights - loss: 0.0501617006957531, metric: 0.043311625719070435
Below are the inference results using different models.
Installation
We strongly recommend installing naampy inside a Python virtual environment (see venv documentation).
pip install naampy
Usage
usage: in_rolls_fn_gender [-h] -f FIRST_NAME
[-s {andaman,andhra,arunachal,assam,bihar,chandigarh,dadra,daman,delhi,goa,gujarat,haryana,himachal,jharkhand,jk,karnataka,kerala,maharashtra,manipur,meghalaya,mizoram,mp,nagaland,odisha,puducherry,punjab,rajasthan,sikkim,tripura,up,uttarakhand}]
[-y YEAR] [-o OUTPUT]
input
Appends Electoral roll columns for prop_female, n_female, n_male
n_third_gender by first name
positional arguments:
input Input file
optional arguments:
-h, --help show this help message and exit
-f FIRST_NAME, --first-name FIRST_NAME
Name or index location of column contains the first
name
-s {andaman,andhra,arunachal,assam,bihar,chandigarh,dadra,daman,delhi,goa,gujarat,haryana,himachal,jharkhand,jk,karnataka,kerala,maharashtra,manipur,meghalaya,mizoram,mp,nagaland,odisha,puducherry,punjab,rajasthan,sikkim,tripura,up,uttarakhand},
--state {andaman,andhra,arunachal,assam,bihar,chandigarh,dadra,daman,delhi,goa,gujarat,haryana,himachal,jharkhand,jk,karnataka,kerala,maharashtra,manipur,meghalaya,mizoram,mp,nagaland,odisha,puducherry,punjab,rajasthan,sikkim,tripura,up,uttarakhand}
State name of Indian electoral rolls data
(default=all)
-y YEAR, --year YEAR Birth year in Indian electoral rolls data
(default=all)
-o OUTPUT, --output OUTPUT
Output file with Indian electoral rolls data columns
Using naampy
>>> import pandas as pd
>>> from naampy import in_rolls_fn_gender
>>> names = [{'name': 'gaurav'},
...          {'name': 'nabha'},
...          {'name': 'yasmin'},
...          {'name': 'deepti'},
...          {'name': 'hrithik'},
...          {'name': 'vivek'}]
>>> df = pd.DataFrame(names)
>>> in_rolls_fn_gender(df, 'name')
name n_male n_female n_third_gender prop_female prop_male prop_third_gender pred_gender pred_prob
0 gaurav 25625.0 47.0 0.0 0.001831 0.998169 0.0 NaN NaN
1 nabha NaN NaN NaN NaN NaN NaN female 0.755028
2 yasmin 58.0 6079.0 0.0 0.990549 0.009451 0.0 NaN NaN
3 deepti 35.0 5784.0 0.0 0.993985 0.006015 0.0 NaN NaN
4 hrithik NaN NaN NaN NaN NaN NaN male 0.922181
5 vivek 233622.0 1655.0 0.0 0.007034 0.992966 0.0 NaN NaN
>>> help(in_rolls_fn_gender)
Help on method in_rolls_fn_gender in module naampy.in_rolls_fn:
in_rolls_fn_gender(df, namecol, state=None, year=None) method of builtins.type instance
Appends additional columns from Female ratio data to the input DataFrame
based on the first name.
Removes extra space. Checks if the name is the Indian electoral rolls data.
If it is, outputs data from that row.
Args:
df (:obj:`DataFrame`): Pandas DataFrame containing the first name
column.
namecol (str or int): Column's name or location of the name in
DataFrame.
state (str): The state name of Indian electoral rolls data to be used.
(default is None for all states)
year (int): The year of Indian electoral rolls to be used.
(default is None for all years)
Returns:
DataFrame: Pandas DataFrame with additional columns:-
'n_female', 'n_male', 'n_third_gender',
'prop_female', 'prop_male', 'prop_third_gender' by first name
# If you want model predictions, use `predict_fn_gender` as shown below
from naampy import predict_fn_gender
# Named `names` rather than `input` to avoid shadowing the built-in input()
names = [
    "rajinikanth",
    "harvin",
    "Shyamsingha",
    "srihan",
    "thammam",
    "bahubali",
    "rajarajeshwari",
    "shobby",
    "tamannaah bhatia",
    "mehreen",
    "kiara",
    "shivathmika",
    "komalee",
    "nazriya",
    "nabha",
    "taapsee",
    "parineeti",
    "katrina",
    "ileana",
    "vishwaksen",
    "sampoornesh",
    "hrithik",
    "emraan",
    "rajkummar",
    "sharman",
    "ayushmann",
    "irrfan",
    "riteish"
]
print(predict_fn_gender(names))
name pred_gender pred_prob
0 rajinikanth male 0.994747
1 harvin male 0.840713
2 shyamsingha male 0.956903
3 srihan male 0.825542
4 thammam female 0.564286
5 bahubali male 0.901159
6 rajarajeshwari female 0.942478
7 shobby male 0.788314
8 tamannaah bhatia female 0.971478
9 mehreen female 0.659633
10 kiara female 0.614125
11 shivathmika female 0.743240
12 komalee female 0.901051
13 nazriya female 0.854167
14 nabha female 0.755028
15 taapsee female 0.665176
16 parineeti female 0.813237
17 katrina female 0.630126
18 ileana female 0.640331
19 vishwaksen male 0.992237
20 sampoornesh male 0.940307
21 hrithik male 0.922181
22 emraan male 0.795963
23 rajkummar male 0.845139
24 sharman male 0.858538
25 ayushmann male 0.964895
26 irrfan male 0.837053
27 riteish male 0.950755
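The in_rolls_fn_gender output shown earlier mixes two sources: names found in the electoral rolls carry prop_female and leave pred_gender empty, while unseen names carry only the model's pred_gender. A minimal sketch of collapsing both into one label per row follows. It assumes rows are handled as plain dicts with None standing in for NaN, and final_gender is a hypothetical helper, not part of naampy.

```python
def final_gender(row, threshold=0.5):
    """Prefer the electoral-roll proportion; fall back to the model label."""
    prop = row.get("prop_female")
    if prop is not None:
        return "female" if prop >= threshold else "male"
    return row.get("pred_gender")


# Values copied from the sample output in this document.
rows = [
    {"name": "gaurav", "prop_female": 0.001831, "pred_gender": None},
    {"name": "nabha", "prop_female": None, "pred_gender": "female"},
    {"name": "yasmin", "prop_female": 0.990549, "pred_gender": None},
    {"name": "hrithik", "prop_female": None, "pred_gender": "male"},
]
labels = {r["name"]: final_gender(r) for r in rows}
print(labels)
```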
Functionality
The first time you run in_rolls_fn_gender, it downloads the data from Harvard Dataverse to a local folder. On subsequent runs, the function looks for the local copy and uses it if found. Use predict_fn_gender to get model-based gender predictions from the first name.
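The download-once behavior can be illustrated with a generic caching pattern. This is a sketch of the idea only, not naampy's implementation: get_local_data, fake_download, the file name, and the cache directory are all hypothetical, and the stand-in downloader writes a small local file instead of contacting Harvard Dataverse.

```python
import os
import tempfile


def get_local_data(filename, download, cache_dir):
    """Return the path to a cached file, downloading it only if missing."""
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        download(path)  # only reached on the first call
    return path


calls = []


def fake_download(path):
    """Stand-in for a real download; records the call and writes a stub file."""
    calls.append(path)
    with open(path, "w") as fh:
        fh.write("first_name,n_female,n_male\n")


cache_dir = os.path.join(tempfile.mkdtemp(), "naampy_cache")
first = get_local_data("rolls.csv", fake_download, cache_dir)
second = get_local_data("rolls.csv", fake_download, cache_dir)
print(first == second, len(calls))
```

The second call finds the file on disk and skips the download, which is why repeated runs are faster than the first.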
License
The package is released under the MIT License.