I received a question today about stemming and multiple languages. Basically: “Why do we need multiple fields in Solr for different languages, and how do I test multi-language stemming?” First of all, let’s explain what stemming is. Stemming involves reducing words to their stem (or base, or root) during indexing and querying in an effort to improve recall. For example, if a document includes the phrase “Xavier walked to work every morning from Westside Parkway” and a user searches for walk, then the results will correctly include that document, because walked is reduced to the stem walk.
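To make the idea concrete, here is a minimal sketch of stemming applied on both the indexing and the querying side. It is a toy: the three-suffix rule set and the names toy_stem and matches are made up for illustration, and Solr's real language analyzers apply far richer rules.

```python
# Toy suffix-stripping stemmer, for illustration only.
# Real analyzers apply many more rules than this.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word):
    """Lowercase a word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, document):
    """True when the stemmed query equals any stemmed document term."""
    doc_stems = {toy_stem(term) for term in document.split()}
    return toy_stem(query) in doc_stems


doc = "Xavier walked to work every morning from Westside Parkway"
print(matches("walk", doc))     # query and document meet at the same stem
print(matches("walking", doc))  # a different inflection of the same word
```

Because each language needs its own rules, a rule set like this one would mangle Spanish or German text, which is one reason to keep a separate field per language.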
I was having a conversation today with a person who needed some help teaching Agile to his PMs. I had a very simple response: get them started by watching the excellent training courses available on Pluralsight. So, the first thing I told him was: – Agile has proven to be a successful methodology in software… when done right
Today I am configuring spell correction in Solr 5.5. Enabling it is not very hard. Simply select which spellcheck component you want to use; see here for the alternatives: https://cwiki.apache.org/confluence/display/solr/Spell+Checking There are several, but I selected solr.IndexBasedSpellChecker, which works for what I need. I replaced the one that comes in the solrconfig and then added spellcheck under last-components. I reindexed, committed, and it works. Most people stop here, but I wanted to learn more, so here is some recommended reading to understand spellchecking better: Getting started: Spell Checking with Apache Lucene and Solr, which references a more technical post: http://norvig.com/spell-correct.html And these go into even more technical depth: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36180.pdf http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf
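Once the component is wired in, a quick way to try it is to send a misspelled term with the spellcheck parameters switched on. The sketch below only builds the request URL. The host, the core name mycore and the /select handler are assumptions, so adjust them to wherever you registered the component.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr instance
CORE = "mycore"                           # hypothetical core name


def spellcheck_url(term, count=5):
    """Build a query URL that asks Solr for spelling suggestions."""
    params = {
        "q": term,
        "spellcheck": "true",       # turn the spellcheck component on
        "spellcheck.count": count,  # maximum suggestions per term
        "wt": "json",
    }
    # Assumes the component is attached to the /select handler.
    return f"{SOLR_BASE}/{CORE}/select?{urlencode(params)}"


print(spellcheck_url("sollr"))
```

Opening the printed URL in a browser should return the normal results plus a spellcheck section, provided the index has been rebuilt as described above.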
There are multiple ways of creating cores in Solr. It is very straightforward: one way is to call Solr’s HTTP admin API with action=CREATE, and another is to use bin\solr.cmd. However, you could run into a small issue, so let me quickly explain the scenario. First of all, you can create a core using solr.cmd with the following command: bin\solr.cmd create -c <nameofthecore> A fresh new core is created, and the command echoes back the call it made: http://localhost:8983/solr/admin/cores?action=CREATE&name=othercourses&instanceDir=othercourses So what if you are curious and decide to make the call directly yourself (changing the core name, of course)? http://localhost:8983/solr/admin/cores?action=CREATE&name=othercourses&instanceDir=othercourses Well, it does not work! The hint in the error is that it can’t find some resources, namely solrconfig.xml. To solve this issue, you only need to specify which base configuration you want to use. So the call would be: http://localhost:8983/solr/admin/cores?action=CREATE&name=othercourses&instanceDir=othercourses&configSet=basic_configs And presto, you get your core! A small detail, but worth knowing what was missing.
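If you create cores from a script, the difference between the failing and the working call comes down to one parameter. Here is a small sketch that builds both URLs. The helper name create_core_url is made up for this example, and the default host is the local one used above.

```python
from urllib.parse import urlencode


def create_core_url(name, config_set=None, base="http://localhost:8983/solr"):
    """Build the admin URL that creates a core named `name`."""
    params = {"action": "CREATE", "name": name, "instanceDir": name}
    if config_set:
        # Without a configSet, Solr needs a solrconfig.xml to already
        # exist in the instance directory.
        params["configSet"] = config_set
    return f"{base}/admin/cores?{urlencode(params)}"


# The call that could not find solrconfig.xml:
print(create_core_url("othercourses"))
# The call that works, reusing the basic_configs config set:
print(create_core_url("othercourses", config_set="basic_configs"))
```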
Life is like a box of chocolates. You never know what you are going to get! A friend of mine, Katherine, volunteers ad honorem at a foundation called Lifting Hands that is aimed at helping children from a very poor neighborhood in Costa Rica learn new skills and grow up as respectable members of society. The Request: One day while we were talking about Big Data, Solr and the typical geek stuff we discuss all the time, she asked me if I wanted to go one afternoon and talk to her kids about what it was like to grow up to be a computer programmer and hopefully motivate them. There were two groups, 10-12 and 12-14 year olds. What I Thought: Piece of cake. I am pretty good at presenting. I’ve done it in front of up to 850 people, spent years as a developer evangelist for Microsoft/Artinsoft and now I enjoy creating content as a Pluralsight author. People also tell me that I am good at motivating others to get into programming given the passion that I have for this field. So my answer was a quick yes. “What could go wrong?” The Briefing: Katherine then sat down with […]
A couple of days ago I got asked: how do we monitor our cluster? Well, there are professional ways and others for the budget-conscious deployment. Here are a few options that came to my mind. You have the ping request handler, which can be used to determine if a node is up and running; this is useful if you want to configure the load balancer to determine which nodes are responding. Additionally, I’ve seen environments where a monitoring service issues several predefined queries at a predefined interval and notifies you if no response is received. Something like http://www.site24x7.com/ but behind the firewall. I do not know which monitoring services, if any, you might have. And there are more specialized tools, for example Sematext, although some of them are more Linux friendly, so it is necessary to look for Windows counterparts if you don’t have Linux. You can also use the clusterstate.json from ZooKeeper (this would be the one from prod: https:///solr/zookeeper?detail=true&path=/clusterstate.json), which will tell you the state of the nodes. You just need to do a bit of parsing, which can be done pretty easily with Json.NET, which is easy to learn. And regarding […]
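The post suggests Json.NET for the parsing. As a rough sketch of what that parsing involves, here is the same idea in Python. The sample document below is invented and only mimics the collection, shard and replica nesting of clusterstate.json; the function name unhealthy_replicas is also made up. If you read the file through the admin URL above, the response may wrap this JSON in an outer envelope that you would need to unwrap first.

```python
import json

# Invented sample that mimics the nesting of clusterstate.json.
SAMPLE = """
{
  "collection1": {
    "shards": {
      "shard1": {
        "state": "active",
        "replicas": {
          "core_node1": {"state": "active", "node_name": "host1:8983_solr"},
          "core_node2": {"state": "down", "node_name": "host2:8983_solr"}
        }
      }
    }
  }
}
"""


def unhealthy_replicas(clusterstate):
    """Return (collection, shard, replica, state) for non-active replicas."""
    problems = []
    for collection, cdata in clusterstate.items():
        for shard, sdata in cdata.get("shards", {}).items():
            for replica, rdata in sdata.get("replicas", {}).items():
                state = rdata.get("state")
                if state != "active":
                    problems.append((collection, shard, replica, state))
    return problems


print(unhealthy_replicas(json.loads(SAMPLE)))
```

A scheduled job could run a check like this and send a notification whenever the returned list is not empty.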
I had to look for empty values in a mandatory field in Solr today. Wait, what? Shouldn’t mandatory values in the index be marked as required=”true” when you are defining the field? Well yes, but some people forget to do it, or maybe the spec was not fully completed at the time when they worked on the schema, so they did not include it… just in case! (YAGNI definitely comes to mind.) Well, in any case I had to find which documents did not have the publication date (which sounds like a really, really, really mandatory field to me). So how do you identify them? Option A: Query *:* and start paginating, taking notes of which documents do not have the value… OK, this is a totally brute-force approach, but I wouldn’t be too surprised to find someone doing it. The things I have seen… Option B: Query *:* and in your fl include only id and publicationdate. Paginate or add enough rows. Very amateur, but a bit better than before. Option C: Query *:*, include only the two fields in fl and sort asc! Much better, as in your results you will have the ones with empty values at the beginning. Option […]
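As an illustration of Option C, the sketch below builds the request parameters and URL. The host, the core name mycore and the helper names are assumptions. Note also that where documents without a value land in the sort can depend on the sortMissingFirst and sortMissingLast settings of the field type, so check your schema before trusting the order.

```python
from urllib.parse import urlencode


def option_c_params(field, rows=100):
    """Parameters for Option C: two fields only, sorted ascending."""
    return {
        "q": "*:*",
        "fl": f"id,{field}",     # return only the id and the field itself
        "sort": f"{field} asc",  # documents without a value should come first
        "rows": rows,
        "wt": "json",
    }


def option_c_url(field, rows=100, base="http://localhost:8983/solr", core="mycore"):
    """Build the full select URL (host and core name are assumptions)."""
    return f"{base}/{core}/select?{urlencode(option_c_params(field, rows))}"


print(option_c_url("publicationdate"))
```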