Lucene.Net Underscores causing token split

I've scripted a MsSqlServer databases tables,views and stored procedures into a directory structure that I am then indexing with Lucene.net. Most of my table, view and procedure names contain underscores.

I use the StandardAnalyzer. If I query for a table named tIr_InvoiceBtnWtn01, for example, I recieve hits back for tIr and for InvoiceBtnWtn01, rather than for just tIr_InvoiceBtnWtn01.

I think the issue is the tokenizer is splitting on _ (underscore) since it is punctuation.

Is there a (simple) way to remove underscores from the punctuation list or is there another analyzer that I should be using for sql and programming languages?

Answers


Yes, the StandardAnalyzer splits on underscore. WhitespaceAnalyzer does not. Note that you can use a PerFieldAnalyzerWrapper to use different analyzers for each field - you might want to keep some of the standard analyzer's functionality for everything except table/column name.

WhitespaceAnalyzer only does whitespace splitting though. It won't lowercase your tokens, for example. So you might want to make your own analyzer which combines WhitespaceTokenizer and LowercaseFilter, or look into LowercaseTokenizer.

EDIT: Simple custom analyzer (in C#, but you can translate it to Java pretty easily):

// Chains together standard tokenizer, standard filter, and lowercase filter
class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        StandardTokenizer baseTokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        StandardFilter standardFilter = new StandardFilter(baseTokenizer);
        LowerCaseFilter lcFilter = new LowerCaseFilter(standardFilter);
        return lcFilter; 
    }
}

Need Your Help

Url Routing in Anchor Tag in asp.net 4.0?

asp.net html url-rewriting anchor asp.net-4.0

I am using listview and in listview in Item Template, i have anchor tag with href.

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.