Stemming and Multi Language
I received a question today on stemming and multi language. Basically: “Why do we need multiple fields in different languages in our Solr, and how do I test multi language stemming?”
First of all, let’s explain what stemming is. Stemming involves reducing words to their stem (or base or root) during indexing and querying in an effort to improve recall.
For example, if a document includes the phrase “Xavier walked to work every morning from Westside Parkway” and a user searches for walk, the results will correctly include that document, because both walked and walk are reduced to the same stem, walk.
Stemming is not perfect. In some cases the algorithm performs reductions that are not appropriate (for example, news -> new), but the overall improvement in results means that applying stemming is usually much better than not applying it.
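To make the idea concrete, here is a minimal sketch in Python. It is a toy suffix stripper written for this post, not the Porter or Snowball algorithm that Solr ships with, and the function names are hypothetical. It only shows why reducing both the indexed text and the query to a stem improves recall, and how a crude rule can over-stem a word like news.

```python
def toy_english_stem(word: str) -> str:
    """Strip a few common English suffixes (toy rules, not Porter/Snowball)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query: str, text: str, stem=toy_english_stem) -> bool:
    """Return True if any whitespace-separated token in text shares the query's stem."""
    query_stem = stem(query)
    return any(stem(token) == query_stem for token in text.split())


phrase = "Xavier walked to work every morning from Westside Parkway"

print(toy_english_stem("walked"))                 # walk
print(matches("walk", phrase))                    # True: walked and walk share a stem
print(matches("walk", phrase, stem=str.lower))    # False: no stemming, no match
print(toy_english_stem("news"))                   # new (an over-stemming error)
```

The third call passes str.lower in place of the stemmer, which is roughly what happens on a field with no stemming at all: the query walk no longer finds walked.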
The next step to improve results would be lemmatization, which reduces a word not to its stem but to its lemma (its dictionary form), a more advanced technique. For example, better would be reduced to good, something a suffix-stripping stemmer cannot do. This is achieved with dictionaries, but it is a much more complex process. Because of the added cost, it should be taken on at a later stage, after the effects of stemming have been analyzed in depth.
Now, going into multi language to answer the original question. One of the features that can be added to a Solr implementation for multi language applications is the use of language specific fields, where each field applies the rules of its own language, including stemming. This means that separate fields are used per language for title, body and other fields, so that each search applies the rules of the language it targets.
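As a rough illustration of what "separate fields per language" means on the indexing side, the sketch below routes a document's text into language-suffixed fields. The field names (title_en, body_es and so on), the supported language list and the helper names are assumptions made for this example; your schema may use a different naming convention. The only real Solr parameter referenced is qf, the query fields parameter of the DisMax and eDisMax query parsers.

```python
SUPPORTED_LANGUAGES = {"en", "es"}  # assumption: languages with their own fields


def to_solr_doc(doc_id: str, lang: str, title: str, body: str) -> dict:
    """Build a document whose text lands in fields for its language only."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"no language specific fields configured for {lang!r}")
    return {
        "id": doc_id,
        f"title_{lang}": title,  # e.g. title_es, analyzed with Spanish rules
        f"body_{lang}": body,    # e.g. body_es
    }


def query_fields(lang: str) -> str:
    """Return a value for the qf parameter that targets one language's fields."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"no language specific fields configured for {lang!r}")
    return f"title_{lang} body_{lang}"


doc = to_solr_doc("42", "es", "Alquileres en Madrid", "Pisos y alquileres")
print(doc)                 # {'id': '42', 'title_es': 'Alquileres en Madrid', 'body_es': 'Pisos y alquileres'}
print(query_fields("es"))  # title_es body_es
```

The point of the design is that a Spanish document is never analyzed with English rules, and a Spanish search only looks at the Spanish fields.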
Let’s see one as an example and we will use the Analysis screen to show the results. As background, the Analysis screen is part of the Solr Admin UI and it shows you how words are treated at index and query time. If you want more information on the Analyzer, please head to Solr’s Wiki: https://cwiki.apache.org/confluence/display/solr/Running+Your+Analyzer
So here is an example of stemming in English:
If you search for audits it will match audit. You can see this by going into the Analysis screen, selecting the English body field, and typing audits into the query box and audit into the index box.
And these are the results, with the index side on the left and the query side on the right. As you can see, both sides end up as the same term, so this would be a match!
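If you would rather script this check than click through the Admin UI, Solr also exposes a field analysis request handler that takes the same inputs as the Analysis screen. The sketch below only builds the request URL and does not call Solr. It assumes the handler is reachable at /analysis/field on your core, and that a field type named text_en exists as in Solr's sample schemas. The base URL, the core name mycore and the function name are placeholders.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url: str, core: str, field_type: str,
                       index_text: str, query_text: str) -> str:
    """Build a URL for Solr's field analysis handler (no request is sent)."""
    params = {
        "analysis.fieldtype": field_type,   # which analysis chain to run
        "analysis.fieldvalue": index_text,  # text analyzed as at index time
        "analysis.query": query_text,       # text analyzed as at query time
        "analysis.showmatch": "true",       # flag index tokens that match the query
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/{core}/analysis/field?{urlencode(params)}"


url = field_analysis_url("http://localhost:8983/solr", "mycore",
                         "text_en", "audit", "audits")
print(url)
# http://localhost:8983/solr/mycore/analysis/field?analysis.fieldtype=text_en&analysis.fieldvalue=audit&analysis.query=audits&analysis.showmatch=true&wt=json
```

Opening that URL in a browser, or fetching it with any HTTP client, should return the token stream after each filter for both the index and query side, which you can then compare per language.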
Now, let’s try doing this with a different language. I will use the word alquileres, which is the plural of alquiler in Spanish. Stemming in Spanish should correctly reduce the word to alquiler, but if I select a field for a different language it should not, because Solr is then applying the rules of that other language rather than the Spanish ones.
And as expected, the rule is not applied correctly. Alquileres is stemmed to alquilere and thus it is different from alquiler.
But if I change to a Spanish field, namely text_es, the Spanish stemming rules kick in, reducing alquileres to alquiler, which is correct.
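The difference between the two fields can be sketched with two toy stemmers. Neither is the real algorithm behind Solr's English or Spanish field types. They are deliberately tiny rule sets, with hypothetical names, that only show why the same word can match under one language's rules and miss under another's.

```python
VOWELS = "aeiouáéíóú"


def toy_english_stem(word: str) -> str:
    """Toy English rules: strip 'ing', 'ed' or a trailing 's'."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_spanish_stem(word: str) -> str:
    """Toy Spanish plural rules: '-es' after a consonant, '-s' after a vowel."""
    word = word.lower()
    if len(word) > 4 and word.endswith("es") and word[-3] not in VOWELS:
        return word[:-2]
    if len(word) > 3 and word.endswith("s") and word[-2] in VOWELS:
        return word[:-1]
    return word


for name, stem in (("english", toy_english_stem), ("spanish", toy_spanish_stem)):
    query, indexed = stem("alquileres"), stem("alquiler")
    print(name, query, indexed, query == indexed)
# english alquilere alquiler False
# spanish alquiler alquiler True
```

With the English-style rules only the final s is removed, so the query term and the indexed term differ. With the Spanish-style rules the plural ending es is removed as a unit and both sides agree.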
And this should apply to all languages that have a language specific field. What you need in order to work with each language is an understanding of the stemming rules for that individual language.
Hope this helps!