GRC Data Intelligence

Expertise in Global Data

 

 




Surnames / Family names data

 


 

This data table, built by GRC Data Intelligence, was created by processing real-world data files and counting the occurrences of each surname/family name.  Because names are very diverse, the file has been cleaned to exclude obvious errors and names containing initials, given names or additions such as seniority indicators (Sr., Jr. etc.).  However, as it is derived from real-world data, it may still contain real-world errors.  To reduce errors, only names found five or more times in real-world data files are released.  The 2011-3-1 release contains 482739 records, created by analysing over 37.3 million data records from over 410 sources.

 

The data contains:

 

Unique record number

Country information (ISO 3166 codes).  This applies to the country in which the name was found, not where the name originated.

The string as found.  Names are very diverse and during cleaning strings are assumed correct unless clearly incorrect.  

A corrected name.  Corrections are only made when there is a clear indication that a name is incorrect and the correct version is obvious.  Thus it is never correct to assume that Smyth should be Smith.  However, Andr can clearly be corrected to André.

A count to show the number of times this surname/family name has been found in real world data.  The higher this number, the more likely that the string is a name.  Only strings found 5 or more times are released in this file.
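The five fields listed above could be modelled along the following lines. This is only a minimal sketch: the field names, the column order, the tab-delimited layout and the two sample rows are assumptions made for illustration, not the layout of the released file, which is described in the documentation.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class SurnameRecord:
    """One row of the table; all field names are hypothetical."""
    record_id: int        # unique record number
    country: str          # ISO 3166 code of the country where the name was found
    name_as_found: str    # the string exactly as found, casing preserved
    corrected_name: str   # corrected form (assumed here to repeat the string when unchanged)
    count: int            # number of times the string was found


def read_records(text):
    """Parse tab-delimited text in the assumed column order."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [
        SurnameRecord(int(rid), country, found, corrected, int(count))
        for rid, country, found, corrected, count in reader
    ]


# Illustrative rows only; the record numbers, the FR code and the count of 12 are invented.
sample = "1\tGB\tSmith\tSmith\t7563\n2\tFR\tAndr\tAndré\t12\n"
records = read_records(sample)

# Every released record should meet the "5 or more" threshold.
assert all(r.count >= 5 for r in records)
```

Keeping the string as found and the corrected name as separate fields mirrors the description above: the original spelling is never overwritten, so a user can choose which of the two to match against.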

 

The table attempts to capture what is found in the real world while remaining suitable for most processing uses.  Each record is unique by country and name as written (including casing).  You can therefore expect to find the same name several times within the file, like this:

 

Country Name Count
GB SMITH 5231
GB Smith 7563
GB smith 17
GB JONES 3876
GB Jones 6252
GB jones 5
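
Because each casing variant is a separate record, a user who does not care about casing may want to combine the counts per country. The sketch below shows one way to do that with the six rows from the table above; the function name `merge_casing` and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

# (country, name as written, count), copied from the table above
rows = [
    ("GB", "SMITH", 5231),
    ("GB", "Smith", 7563),
    ("GB", "smith", 17),
    ("GB", "JONES", 3876),
    ("GB", "Jones", 6252),
    ("GB", "jones", 5),
]


def merge_casing(rows):
    """Sum counts per (country, case-folded name)."""
    totals = defaultdict(int)
    for country, name, count in rows:
        totals[(country, name.casefold())] += count
    return dict(totals)


totals = merge_casing(rows)
print(totals[("GB", "smith")])  # 12811
print(totals[("GB", "jones")])  # 10133
```

Case folding is a blunt instrument: in some languages it can merge spellings that are genuinely distinct names, so whether to apply it depends on the intended use.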

 

Coverage figures are here.

 

Strings which are known not to be surnames/family names, or which are or include given names, or which contain additional information such as forms of address or seniority indicators, are excluded from the release file.

 

For full information, please refer to the documentation.

 

If you have any questions about this file, please contact us.

 


 

Sample

 

View a sample of 200 records from the file here.

 


 

Coverage

 

View the coverage of this version here.  The numbers released are given in the "Found >5 times" column.

 

Full file documentation is available here.

 


 

Formats

 

Data is held in Microsoft Visual FoxPro format, but can also be provided in these formats: FoxPro 2.x (dBase III+), comma-delimited text, tab-delimited text, fixed-column-width text, and Excel (only for small files of fewer than 64 000 records).  Small data sets can be e-mailed; larger sets are provided on CD-ROM.

 


 

Prices

 

The file is available at the price of only EUR 495.  If you have any questions regarding the file, please contact us.

 

This data is offered on a royalty-free basis for use in any way you wish, with one important proviso: it may not be copied or distributed in any way whatsoever where it can, in normal use, be accessed by other users.  In other words, if you would like to use this data in your software package, that is allowed provided users cannot get at, or export, the data themselves.

 

You will be asked to agree to our terms and conditions when purchasing.  Our terms, conditions and licensing structure can be viewed here.

 


 

To order

 

To purchase the full file, follow this link to order by credit card.

 


 

If you have any questions, please contact us.

 


 



GRC Data Intelligence

AMSTERDAM

The Netherlands