joh

Rule/search matching issues

I have noticed that rules capture many items that apparently equivalent active searches do not.

I don't know how this fits in with this recent issue:

On 7/6/2018 at 1:22 PM, maddhin said:

Just as a comment as this is a different problem: I do have a big number of articles which should have been tagged by a rule but never got tagged. The rule is usually very simple "for new articles in account -> title or content contains "football" -> assign tag "football" - still some articles manage not to get tagged this way although "football" is even in the title... 

 

Secondly, a rule can match any occurrence of the string, whereas the active search seems limited to whole words. This is great, since it solves the issue I had with the lack of a wildcard prefix, so I have switched my active searches to rules that match content and add a tag. On its own, though, this does not seem to explain the differences I have seen between search and rule matching.

 

joh

 

 

As further proof that this is a real issue:

I have two subscriptions that sometimes post the exact same item. When this occurs, sometimes the active search picks up both items, and sometimes just one. Thus it seems to be entirely random whether an active search misses a match. Rules seem to be more successful for me. Perhaps you can comment on why active search might randomly miss items, and why there is a difference between active search and rule matching.

cheers,

joh

 

While this is true, it is not the answer to my query. I have observed differences between rules and searches that are not related to syntax, but for the sake of argument I will assume I keep missing something obvious. Even so, when comparing how active searches treat identical items, as in my example, items are missed. This shows that it has nothing to do with syntax, and is apparently a (frequent) random failure to match the criteria. To repeat: some of the missed items are identical to items that were matched, just in a different subscription, one that is working and is successfully matched on other occasions. Additionally, rules and highlighters both flag the criteria successfully when I look at the missed item.

cheers,

joh

OK, but we'll need some examples in order to give you a more adequate answer. Please send us your active search and rule criteria, along with the articles each of them matched.

OK, I haven't really been bothered to try and gather the evidence, but I can say that I have approx. 25 rules and 25 active searches, which should be the same. Some look for very general words like "updated", and some for specific terms like "resonance fluorescence", but in all cases the rules capture more items than the searches. So I can't imagine this observation isn't easily replicated by anyone else. Have you tried creating "identical" active searches and rules, and seeing what is captured after a few days or weeks? You don't see any differences?

Cheers,

joh

Yes, and even before you said that, I had already explained in this thread, and have now done so several times, that while differences arise due to syntactic differences between rules and searches, the differences are not entirely due to syntax. I gave the example of the active searches missing one of two identical feed items, when identical items are posted, and when rules catch both. I also mentioned that highlighters show the flagged criteria in the item that the searches missed. It has been quite clear from my first post and subsequent posts that I am aware of the syntactic differences, and that the whole point is that this is not my issue. If it is too much trouble for you or your team to look into this issue by creating some syntactically simple active searches and rules, and observing the differences that arise, and seeing if in all cases it is explained by syntax or not, then so be it, but please stop giving such useless and repetitive responses that are completely ignorant of what the (paying) user is saying. I wouldn't mind if you didn't wait nine days to give a stupid one-line response which is a quote of a previously useless response, I mean

cheers,

joh

OK, there is no need for such frustration here... :)

Sorry if I offended you somehow. However, rules and searches use different syntax, they run in different background threads, and our systems process millions of articles per hour. Since we don't have any examples of what you have observed, unfortunately we can't help you in this case. Of course we'll be glad to help if you provide us with examples of such issues (the rule/search setup and the matched articles from both).

There's usually a solid explanation for why a rule will or will not match. The same goes for active searches. The fact is that the team usually needs a concrete example to be able to reproduce a given issue, or just to see why it happens in the specific case. Generally there are no known issues with either feature, and the large volume of articles isn't a problem unless there's an incident with the backend. That happens very rarely and usually only results in a slight delay of both functionalities, not a degradation of results.

This is why wesson asked for examples.

Some things to note:

  • Active Search uses our Elasticsearch full text index.
  • Rules use a more traditional search & match algorithm.
  • Because Active Search uses Elasticsearch, it is useful to know how full-text index tokenization actually works. More specifically, any non-word character is used to split strings into tokens, e.g. hot-dog will be tokenized as 'hot' and 'dog'. Search terms are analyzed the same way, so a search for 'hot-dog' will likely match, because it will search for both 'hot' and 'dog' and both will be in the original document. But there might be cases where you want to look specifically for 'hot-dog' and not for 'hot' and 'dog'. You can't achieve that with ES. There are other pitfalls too, but they are too broad to list here.
  • Rules are using more traditional string comparison operators and should generally produce very reliable results.
  • Rules might "fail" in some cases:
    • An article has been updated on the original site. An article is checked against all users' rules when it is inserted, not when it is updated. Even if the updated article contains a matching term, it will not trigger the rule.
    • An article has been post-processed. We have a separate post-processing step that "enriches" articles for some feeds. It mostly adds missing pictures, but can sometimes add text, which won't be caught by rules.
    • "Invisible" whitespace in the source text, especially when you search for multi-word terms.
    • Incorrect usage (e.g. multiple keywords in one condition separated with a comma).
    • Trying to match an HTML tag or attribute. Rules do not operate on non-readable content. Or to be even more precise, strip_tags() is applied to the text before passing it to the rule.
    • Maybe even more cases...
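
To make the difference between the two matching styles in the list above concrete, here is a minimal Python sketch. It is a toy model, not Inoreader's code: `tokenize`, `token_search` and `substring_rule` are hypothetical names, splitting on `\W+` only approximates what a real Elasticsearch analyzer does, and case-insensitive rule matching is an assumption.

```python
import re


def tokenize(text):
    """Split on runs of non-word characters and lowercase the pieces.

    A rough stand-in for a full-text analyzer (assumption, not the real one).
    """
    return [t.lower() for t in re.split(r"\W+", text) if t]


def token_search(query, document):
    """True if every token of the analyzed query is a token of the document."""
    doc_tokens = set(tokenize(document))
    return all(t in doc_tokens for t in tokenize(query))


def substring_rule(term, document):
    """True if the term occurs anywhere in the document (case-insensitive)."""
    return term.lower() in document.lower()


doc = "A hot day at the dog show"
print(token_search("hot-dog", doc))    # True: 'hot' and 'dog' are both tokens
print(substring_rule("hot-dog", doc))  # False: the literal string is absent

print(token_search("foot", "football results"))    # False: whole tokens only
print(substring_rule("foot", "football results"))  # True: substring match
```

The second pair of calls mirrors the earlier observation in this thread that a rule can match a string inside a longer word while an active search is limited to whole words.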

 

We really want to dig into this more, because an algorithm can't just decide to act on its own and such spontaneous issues don't make sense. But please try to give us an example with a rule or active search and at least one article that is missed or incorrectly triggered.

As you can probably appreciate, it would be quite a lot of work for me to gather the details of all of these occurrences. Luckily there is a clear example from the last day or two.

I have an active search, which searches title and contents for the term "purcell", and a rule matching the term "purcell" in title or contents:

[screenshot: active search details]

[screenshot: rule "purcell"]

 

In the last few days these have produced the following:

[screenshot: active search results]

[screenshot: rule results]

As you can see the rule has matched one more than the search, and the item that the search missed is identical to a different one that it did match, albeit from a different feed.

Here are the two items from the two feeds expanded so you can play spot the difference:

[screenshot: item 1]

[screenshot: item 2]

As you can see they are identical, at least at the user end. Also the one that was missed by the active search is missing another tag "photon*" which corresponds to another active search I have. So this item seems to not have been active searched at all.

The feed details are here:

[screenshot: feed 1]

[screenshot: feed 2]

Although they are from the same website, it seems I am getting them through slightly different locations. My problem isn't restricted to this one feed, though: as far as I can tell, missed items can be from any feed. It is just more obvious when I see a case like this with identical items.

Hopefully you are now at least convinced that my query has nothing to do with syntax.

cheers,
joh

Thank you for the detailed report! Of course we don't expect that you will hunt down and report all cases, that's why I asked for an example only. This should be enough for us to investigate and we will get back to you with results shortly.

TLDR: It should be fixed, although it's very hard to confirm that it was the sole cause of all issues that you had.

The long story:

We caught one solid issue with the tokenizer that we are using, icu_tokenizer. It was correctly tokenizing the term "Google’s" into two tokens, "Google" and "s", but was failing to split "Google's" into separate tokens. Note that the second example uses a single quote (') instead of an apostrophe (’). Really weird and hard to catch. Just try to spot the difference in the screenshot:

[screenshot]

As a result, an Active search for "Google" couldn't pick up articles containing only "Google's" (with the single quote), while rules did indeed catch them. A regular search is also affected by this. I believe this is an edge case, but nevertheless we have fixed it by remapping ' to ’ at index time. This means that Active search is immediately fixed, but for regular search the fix will only apply to newly indexed articles.
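
The effect described above can be sketched with a toy tokenizer in Python. This is only an illustration under stated assumptions: `toy_tokenize` is a hypothetical function that imitates the reported behaviour (splitting on ’ but not on '), not icu_tokenizer itself, and `remap_quotes` stands in for the index-time remapping, whose actual implementation is not described in this thread.

```python
import re


def toy_tokenize(text):
    """Imitate the reported behaviour: split on whitespace and on the
    typographic apostrophe (U+2019), but NOT on the ASCII single quote."""
    return [t.lower() for t in re.split(r"[\s\u2019]+", text) if t]


def remap_quotes(text):
    """Index-time fix as described: remap ' to the typographic apostrophe."""
    return text.replace("'", "\u2019")


def search(term, text):
    """Whole-token match against the toy tokenizer's output."""
    return term.lower() in toy_tokenize(text)


article = "Google's new phone"
print(toy_tokenize(article))                  # ["google's", 'new', 'phone']
print(search("Google", article))              # False: token is "google's"
print(search("Google", remap_quotes(article)))  # True after remapping
```

Because the remapping happens when text is indexed, it only helps articles that pass through it, which matches the note that regular search is fixed only for newly indexed articles.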

While hunting this down, however, we noticed a more serious issue that was surely affecting Active search results at random. One of our test environments was not set up correctly and was actually sending commands to our production Elasticsearch instance. This was causing the index to be truncated intermittently, which could have compromised some search results. That would have happened very rarely, but it was a possibility, and I am almost certain that this is the root cause.

Please report to us if you notice any other issues. We have set up a test account full of active searches and corresponding rules, and we are monitoring it as well. Hopefully it is fixed for good.

Okay, that's great. I will keep an eye on it. The most recent failure I can see is from about 12h ago, which is probably before you made the change. It seems that a bunch of items posted at once by the same feed were all missed -- perhaps this fits in with your truncated index explanation? As for your statement that it should be a very rare occurrence, I don't know what that means quantitatively, but I would say that the incidence of missed items was at least in the single-figure percent range, although it is hard to say exactly given the differences that also arise due to syntax. I will see how it goes now.

Despite my persistence on this issue, rules work better for me, primarily because they are editable and generally more versatile, so I now only really keep active searches as a backup/test of the rules. Similar to your issue with the tokens, I noticed that sometimes my active searches were picking up items my rules missed, because the active searches seemed to treat hyphens and en dashes equivalently (perhaps also equivalently to spaces) -- I guess because it breaks everything into tokens as you described. My rules were missing items because I only specified the compound word with hyphens, so when an occurrence used en dashes, it was not picked up. Perhaps (as in your token solution) some equivalency could be made (optionally) within the rule options between frequently interchanged symbols, such as the different types of apostrophe (and quotes, and prime), or the many different types of hyphens and dashes. But this is just a syntax issue that can be dealt with at the user end in other ways.
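
As a rough illustration of the optional equivalence suggested above, the Python sketch below folds several dash-like characters to a plain hyphen before doing a substring comparison. It is purely hypothetical: `fold_dashes` and `rule_matches` are made-up names, the list of characters is one possible choice, and nothing in this thread says the rules engine works this way.

```python
# Dash-like characters folded to the ASCII hyphen-minus (an illustrative set):
# hyphen, non-breaking hyphen, figure dash, en dash, em dash, minus sign.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"
DASH_MAP = str.maketrans({ch: "-" for ch in DASHES})


def fold_dashes(text):
    """Replace every dash-like character in DASHES with '-'."""
    return text.translate(DASH_MAP)


def rule_matches(term, text):
    """Case-insensitive substring match after folding dashes on both sides."""
    return fold_dashes(term).lower() in fold_dashes(text).lower()


title = "Spin\u2013orbit coupling in quantum dots"  # written with an en dash
print("spin-orbit" in title.lower())      # False: plain substring match
print(rule_matches("spin-orbit", title))  # True: dashes folded first
```

Without such folding, the same result can be had at the user end by adding one rule condition per dash variant.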

cheers,
joh

Yes, the fix was deployed just a couple of hours ago. Please tell us if you spot the issue again.

While tidying up the punctuation at the input is possible, it can lead to unexpected results and confusion with rules if one doesn't know that this is happening behind the scenes. We therefore prefer to leave the input string as close to the original as possible, with only the HTML tags stripped.

Hello again.

After looking through my tags and searches now after about two weeks, I'm pretty sure this is completely resolved. Good work!

 

Actually, at first it seemed that it was still randomly failing in some cases, but this turned out to be due to some subtle syntax issues, as I will now tediously explain.

Firstly I had a search and rule looking for a particular surname. The search missed one instance (compared to the rule) for no apparent reason. However, it was because it appeared in the article as "Forename Surname" rather than "Forename Surname". The character in the former case is not a normal space, so the search was treating it all as one word and not matching. [Okay, after copying it here, it seems to have been converted to a normal space automatically, but copying it into Notepad++ I can see that it's not, so I don't know how to show you it. Perhaps you know what character it is: it looks like a normal space, but isn't.]
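
One way to find out which character it is would be to paste the text into a short Python script like the one below, which lists every character that looks like a space but is not the plain ASCII one. The function name `describe_whitespace` is made up for this sketch, and the no-break space in the example is only an assumption about what the article contained, since the original character was lost when it was pasted here.

```python
import unicodedata


def describe_whitespace(text):
    """List (index, code point, Unicode name) for each character that is
    whitespace-like or an invisible format character, excluding ASCII space."""
    found = []
    for i, ch in enumerate(text):
        if ch == " ":
            continue
        if ch.isspace() or unicodedata.category(ch) in ("Zs", "Cf"):
            name = unicodedata.name(ch, "UNKNOWN")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found


# Example with a no-break space (assumed for illustration only).
print(describe_whitespace("Forename\u00a0Surname"))
# [(8, 'U+00A0', 'NO-BREAK SPACE')]

# A normal space produces no findings.
print(describe_whitespace("Forename Surname"))
# []
```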
 
Secondly I had one search that seemed to still fail a lot. It was looking for multiple terms separated by ORs, something like:
aacc OR aa(bb)cc OR bbcc

On investigation, it seems a term like that with parentheses (perhaps only when combined with the OR statements?) causes many failures. For example, for some reason it was not matching an article that contained term 1. If I removed term 2, or just its parentheses, the search found the missed article via term 1. Putting quotes around term 2, i.e. "aa(bb)cc", also resolved the problem, so I guess the parentheses are special characters in this case. Anyway, after redoing the search with the quotation marks and saving, the results matched those of my equivalent rule. Note: when the active search is saved, the quotation marks are removed from the search title, but it looks like the proper search term is maintained.
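
The quoting workaround above can be applied mechanically. The Python sketch below builds an OR query and wraps any term containing grouping characters in double quotes. It assumes the active search box follows Elasticsearch-style query string syntax, where parentheses group sub-queries; `quote_if_needed` and `build_or_query` are hypothetical helper names, and only brackets are treated as special here so that wildcard terms such as photon* are left alone.

```python
# Characters that group or delimit sub-queries in Elasticsearch-style
# query strings. Deliberately narrow: wildcards (* and ?) are not included.
GROUPING_CHARS = set("()[]{}")


def quote_if_needed(term):
    """Wrap the term in double quotes if it contains a grouping character."""
    if any(ch in GROUPING_CHARS for ch in term):
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return term


def build_or_query(terms):
    """Join terms with OR, quoting the ones that need it."""
    return " OR ".join(quote_if_needed(t) for t in terms)


print(build_or_query(["aacc", "aa(bb)cc", "bbcc"]))
# aacc OR "aa(bb)cc" OR bbcc
```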

 

cheers,

joh

Great to hear that and thanks for getting back! Everything sounds like it's working as it should. And yes, parentheses are special symbols for Elasticsearch. Hopefully you'll have only smooth sailing from now on.
