Given names data
This data table, built by GRC Data Intelligence, has been
created by processing real world and online data files and
counting the occurrences of each given name. The file has
been cleaned, but as it has been created from real world data
it may contain real world errors. To reduce errors, only those
names found five or more times in real world data files are
released.
For the 2012-4-1 release, the file contains 239662 records,
created from analysing over 29.8 million data records from more
than 375 different sources.
The data contains:
• Unique record number
• Country information (ISO 3166 codes). This applies to the
  country in which the name was found, not where the name
  originated.
• The string as found. Given names are very diverse and
  strings are assumed correct unless clearly incorrect.
• A corrected name. Corrections are only made when there is a
  clear indication that a name is incorrect and the correct
  version is obvious. Thus it is never correct to assume that
  Boby should be Bobby, or that Bob should be Robert.
• A gender indicator (Male, Female or empty (unknown))
• A count to show the number of times this given name has
  been found in real world data. The higher this number, the
  more likely that the string is a given name. Only strings found
  5 or more times are released in this file.
• A flag to indicate whether the name string is, in fact, initials.
  For example: A.J.P.
• A flag to indicate whether there are initials as well as a given
  name in the string. For example: Edward L.P.
• A flag to indicate whether the string contains a form of
  address or a seniority indicator. For example: Mr William,
  Fred Jr.
• A flag to indicate whether the string contains multiple given
  names. For example: Claire & Robert, Jean et Thérèse
• The date that the record was added to the file, and the last
  date that the string was found in real world data.
• A list of other names to which this name is related
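As an illustration only, the fields listed above could be modelled along these lines in Python. The class name, field names, types and the "M"/"F" encoding are assumptions made for this sketch, and the record number in the example is invented; the actual column names and values are defined in the file documentation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class GivenNameRecord:
    """One row of the given names table (hypothetical field names)."""
    record_id: int                 # unique record number
    country: str                   # ISO 3166 code of the country where the name was found
    name_as_found: str             # the string exactly as found, casing preserved
    corrected_name: Optional[str]  # set only when the correct version is obvious
    gender: Optional[str]          # "M", "F" or None when unknown (encoding assumed)
    count: int                     # occurrences found in real world data
    is_initials: bool              # e.g. "A.J.P."
    has_initials: bool             # e.g. "Edward L.P."
    has_form_of_address: bool      # e.g. "Mr William", "Fred Jr."
    has_multiple_names: bool       # e.g. "Claire & Robert"
    date_added: Optional[date] = None
    date_last_found: Optional[date] = None
    related_names: List[str] = field(default_factory=list)


# Illustrative record; the record number 1 is made up for this example.
example = GivenNameRecord(
    record_id=1, country="GB", name_as_found="John", corrected_name=None,
    gender="M", count=6, is_initials=False, has_initials=False,
    has_form_of_address=False, has_multiple_names=False,
)
```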
The table attempts to capture what is found in the real world and
to lend itself to processing according to most uses. Each record
is unique by country, name as written (including casing) and
gender. You can therefore expect to find each name multiple
times within the file in this way:
Country  Name  Gender  Count
GB       JOHN          8610
GB       John          3648
GB       john          17
GB       JOHN  M       236
GB       John  M       6
GB       john  M       5
When viewing the coverage figures (here), you would therefore
not expect to have more than 50% of names gendered in any
country for a single-gender name.
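To make this concrete, the following Python sketch uses the six GB rows from the example. It folds the case variants of JOHN together and reports what fraction of the records carry a gender. The tuple layout and the function names are assumptions for illustration and are not part of the delivered file.

```python
from collections import defaultdict

# (country, name as written, gender, count) taken from the example rows;
# an empty gender string means "unknown".
ROWS = [
    ("GB", "JOHN", "", 8610),
    ("GB", "John", "", 3648),
    ("GB", "john", "", 17),
    ("GB", "JOHN", "M", 236),
    ("GB", "John", "M", 6),
    ("GB", "john", "M", 5),
]


def fold_case(rows):
    """Sum counts per (country, upper-cased name, gender)."""
    totals = defaultdict(int)
    for country, name, gender, count in rows:
        totals[(country, name.upper(), gender)] += count
    return dict(totals)


def gendered_record_share(rows):
    """Fraction of records (not occurrences) that carry a gender."""
    gendered = sum(1 for _, _, gender, _ in rows if gender)
    return gendered / len(rows)


totals = fold_case(ROWS)
print(totals[("GB", "JOHN", "")])    # 12275
print(totals[("GB", "JOHN", "M")])   # 247
print(gendered_record_share(ROWS))   # 0.5
```

In this example every spelling appears once without a gender and once with one, so exactly half of the records are gendered.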
Strings which are known not to be given names, or which are or
include family names that cannot be given names, are excluded
from the release file.
We advise extreme caution in attempting any genderisation
process on the basis of given names, but as this data is often
used for this process, we have released a second table,
containing gender information distilled from the main file. It
shows the number of occurrences found for a given name per
gender per country. For example:
Country  Name  Male  Male %  Female  Female %  Universe
BE       KRIS  62    100.00  0       0.00      62
FR       KRIS  17    100.00  0       0.00      17
GB       KRIS  14    14.46   71      85.84     83
ID       KRIS  5     100.00  0       0.00      5
PL       KRIS  6     0.00    0       0.00      6
US       KRIS  37    100.00  0       0.00      37
The gender file contains 79219 records for this release. For full
information, please refer to the documentation.
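A rough Python sketch of how a per-country gender summary of this kind could be derived from gendered rows of the main file follows. The input rows are invented for illustration ("XX" is a placeholder, not a real country), and the sketch assumes that case variants are folded together, that rows with unknown gender are ignored, and that the universe is the sum of the male and female counts. None of these assumptions is confirmed here; the documentation describes the actual method.

```python
from collections import defaultdict


def summarise_gender(rows):
    """Build {(country, NAME): summary} from (country, name, gender, count) rows.

    Rows with an unknown gender are ignored (an assumption of this sketch).
    """
    counts = defaultdict(lambda: {"M": 0, "F": 0})
    for country, name, gender, count in rows:
        if gender in ("M", "F"):
            counts[(country, name.upper())][gender] += count

    summary = {}
    for key, c in counts.items():
        universe = c["M"] + c["F"]
        summary[key] = {
            "male": c["M"],
            "male_pct": round(100.0 * c["M"] / universe, 2),
            "female": c["F"],
            "female_pct": round(100.0 * c["F"] / universe, 2),
            "universe": universe,
        }
    return summary


# Hypothetical main-file rows; not taken from the real data.
rows = [
    ("BE", "KRIS", "M", 40),
    ("BE", "Kris", "M", 22),
    ("BE", "Kris", "", 130),
    ("XX", "KRIS", "M", 5),
    ("XX", "KRIS", "F", 15),
]
result = summarise_gender(rows)
print(result[("BE", "KRIS")])
print(result[("XX", "KRIS")])
```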
If you have any questions about this file, please contact us.
Sample
View a sample of 200 records from a previous version of the file
here.
Coverage
View the coverage of this version here. The numbers released
are given in the "Found >5 times" column.
Full file documentation is available here.
Formats
Data is held in Microsoft Visual FoxPro format, but can also be
provided in these formats: FoxPro 2.x (dBase III+), comma
delimited text, tab delimited text, fixed column width text, and
Excel (only for small files of fewer than 64 000 records). Data
sets are delivered by e-mail or download link.
Prices
These files are available at the price (for both files) of only EUR
950.
This data is offered on a royalty-free basis for use in any way you
wish, with this important proviso: it may not be copied or
distributed in any way whatsoever when it can, in normal use, be
accessed by other users. In other words, if you would like to use
this data in your software package, that is allowed provided users
cannot get at, or export, the data themselves.
You will be asked to agree to our terms and conditions when
purchasing. Our terms, conditions and licensing structure can be
viewed here.
To order
To purchase the full file, follow this link to order by credit card.
Customers
Many of our customers prefer to remain nameless for competitive
reasons, and we respect this. Our customers include:
• Coconut Island Software, Inc., Kea'au, USA
• Talend, for use in their Talend Data Quality product range.
If you have any questions about any of our products, please
contact us.
© GRC Database Information 2013
GRC Data Intelligence
Expertise in Global Data