Hadoop - look for matching names in two customer lists

I have two lists of people from different events; I would like to look for matching names of people amongst those lists, as well as matching companies. I understand that potentially there will be people with the same name in each list that are not the same people, but it will help to find the matches.

First List Example:
Name, Company, Title
John Doe, ACME Corporation, Elephant Trainer
Jane Smith, ACME Corporation, CEO
John Smith, Widgets-R-Us, Janitor
+10,000's of rows

Second List Example:
Name, Company
Fred Smith, ACME Corporation
John Smith, Widgets-R-Us
John Smith, Company XYZ
Jane Smith, Company XYZ
+10,000's of rows

Desired Output

Matching Names:
John Smith
Jane Smith

Matching Companies:
ACME Corporation
Widgets-R-Us

I am running this in an AWS environment and am new to Hadoop. Any programming language is fine. I know how to do this in Excel, but I want to be able to scale it over time with more lists of names (each in its own CSV file).

Thank you kindly!


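You need a Mapper implementation (a class extending Mapper<LongWritable, Text, Text, IntWritable>) that emits the person name or the company name as a Text key with an IntWritable value of 1.

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Derive the person name (column 0) or the company name (column 1).
        // Text has no split() method, so convert the line to a String first.
        String name = value.toString().split(",")[0].trim();
        // Emit the extracted field, not the whole input line.
        context.write(new Text(name), new IntWritable(1));
    }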

The implementation of the reduce method in the Reducer would be something similar to:

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Start at 0, otherwise every count is off by one.
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        // You get every unique name with the number of times it is repeated.
        context.write(key, new IntWritable(count));
    }

Hope this helps.
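One caveat with a plain count: a name that appears twice in the same list (such as John Smith in the second list) also gets a count of 2, so it cannot be told apart from a real cross-list match. A possible way around this is to tag each key with the list it came from and keep only the keys seen in at least two different lists. The sketch below shows that logic in plain Java with no Hadoop dependency, so it can be tried locally on small files first. The class name ListMatcher and the method matches are made up for illustration, and it assumes simple CSV rows with a header line and no quoted commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Hadoop: illustrates the map/reduce logic locally.
class ListMatcher {

    /**
     * "Map" step: take one column from every data row and pair it with the
     * id of the list it came from. "Reduce" step: keep only the values that
     * were seen in at least two different lists.
     */
    static List<String> matches(List<List<String>> lists, int column) {
        // key -> ids of the lists in which the key appears
        Map<String, Set<Integer>> sources = new TreeMap<>();
        for (int listId = 0; listId < lists.size(); listId++) {
            List<String> rows = lists.get(listId);
            // Start at 1 to skip the header row (assumed to be present).
            for (int i = 1; i < rows.size(); i++) {
                String[] fields = rows.get(i).split(",");
                if (fields.length <= column) {
                    continue; // skip rows that lack the requested column
                }
                String key = fields[column].trim();
                sources.computeIfAbsent(key, k -> new TreeSet<>()).add(listId);
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : sources.entrySet()) {
            if (e.getValue().size() >= 2) {
                result.add(e.getKey());
            }
        }
        return result; // sorted alphabetically because of the TreeMap
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList(
            "Name, Company, Title",
            "John Doe, ACME Corporation, Elephant Trainer",
            "Jane Smith, ACME Corporation, CEO",
            "John Smith, Widgets-R-Us, Janitor");
        List<String> second = Arrays.asList(
            "Name, Company",
            "Fred Smith, ACME Corporation",
            "John Smith, Widgets-R-Us",
            "John Smith, Company XYZ",
            "Jane Smith, Company XYZ");
        List<List<String>> lists = Arrays.asList(first, second);
        System.out.println("Matching names: " + matches(lists, 0));
        System.out.println("Matching companies: " + matches(lists, 1));
    }
}
```

In a real Hadoop job the same idea would probably mean emitting the source file name as the map value instead of 1 and collecting the distinct file names in the reducer, but that part is left out here.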
