[DC-160] Search doesn't use the Search Analyzer to escape the query Created: 30/Oct/09 Updated: 13/Jul/12 Resolved: 02/Nov/09 |
|
| Status: | Resolved |
| Project: | Doctrine 1 |
| Component/s: | Searchable |
| Affects Version/s: | 1.0.12, 1.1.4, 1.2.0-ALPHA1, 1.2.0-ALPHA2, 1.2.0-ALPHA3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Markus Lanthaler | Assignee: | Jonathan H. Wage |
| Resolution: | Can't Fix | Votes: | 0 |
| Labels: | None | ||
| Description |
|
When you use Doctrine_Search_Analyzer_Standard, special characters such as "ü" are removed or converted (e.g. to "ue") when the text is indexed. So far so good. The problem arises when a user performs a search: the query text is not passed through the analyzer, so a query for "München" is not normalized the same way as the indexed text. Using the UTF8 analyzer is not an option, because the normalization is often a desired feature. It allows a user, for example, to write the query as either "Muenchen" or "München" and still receive a relevant result. |
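The mismatch described above can be reproduced outside Doctrine. The following is only a minimal sketch: toy_normalize() and its small transliteration map are hypothetical stand-ins for what the Standard analyzer does via Doctrine_Inflector::unaccent(), not Doctrine's actual code.

```php
<?php
// Hypothetical stand-in for the Standard analyzer's normalization.
// Assumes this source file is saved as UTF-8.
function toy_normalize(string $text): string
{
    // Tiny transliteration map for illustration only; the real
    // unaccent() covers far more characters.
    $text = strtr($text, ['ü' => 'ue', 'ö' => 'oe', 'ä' => 'ae', 'ß' => 'ss']);
    // Replace anything that is not ASCII alphanumeric with a space.
    $text = preg_replace('/[^A-Za-z0-9]/', ' ', $text);

    return strtolower(trim($text));
}

$indexed  = toy_normalize('München'); // term as stored in the index
$rawQuery = 'München';                // term as typed by the user

var_dump($indexed === $rawQuery);                  // bool(false): raw query misses
var_dump($indexed === toy_normalize($rawQuery));   // bool(true)
var_dump($indexed === toy_normalize('Muenchen'));  // bool(true)
```

The lookup only succeeds when the query term goes through the same normalization as the indexed text, which is the behavior this ticket asks for.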
| Comments |
| Comment by Jonathan H. Wage [ 30/Oct/09 ] |
|
The only option I can see is to do this in Doctrine_Search_Query::query():

    $text = Doctrine_Inflector::unaccent($text);

I am not sure about doing this, though. What do you think? |
| Comment by Markus Lanthaler [ 30/Oct/09 ] |
|
I don't think that's a good idea, since it breaks the UTF8 analyzer. I would refactor the analyzers to include a method like normalize(), which each analyzer's analyze() method could then call.

Doctrine_Search_Analyzer_Standard::normalize($text, $encoding = null) would look as follows:

    public function normalize($text, $encoding = null)
    {
        // strip quote-like characters, then transliterate accented ones
        $text = preg_replace('/[\'`´"]/', '', $text);
        $text = Doctrine_Inflector::unaccent($text);
        // replace everything that is not ASCII alphanumeric with a space
        $text = preg_replace('/[^A-Za-z0-9]/', ' ', $text);
        $text = str_replace('  ', ' ', $text);

        return strtolower(trim($text));
    }

Doctrine_Search_Analyzer_Utf8::normalize($text, $encoding = null) would look as follows:

    public function normalize($text, $encoding = null)
    {
        if (is_null($encoding)) {
            $encoding = isset($this->_options['encoding']) ? $this->_options['encoding'] : 'utf-8';
        }

        // check that $text encoding is utf-8, if not convert it
        if (strcasecmp($encoding, 'utf-8') != 0 && strcasecmp($encoding, 'utf8') != 0) {
            $text = iconv($encoding, 'UTF-8', $text);
        }

        $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text);
        $text = str_replace('  ', ' ', $text);

        return mb_strtolower(trim($text), 'UTF-8');
    }

This would also remove some code duplication in the analyze() method. It could be changed to the following code in Doctrine_Search_Analyzer_Standard and removed completely from Doctrine_Search_Analyzer_Utf8:

    public function analyze($text, $encoding = null)
    {
        $text = $this->normalize($text, $encoding);
        $terms = explode(' ', $text);

        $ret = array();
        if ( ! empty($terms)) {
            foreach ($terms as $i => $term) {
                if (empty($term)) {
                    continue;
                }

                // normalize() has already lowercased the text
                if (in_array($term, self::$_stopwords)) {
                    continue;
                }

                $ret[$i] = $term;
            }
        }

        return $ret;
    }

Finally, normalize() would be called in Doctrine_Search_Query::query(). Unfortunately, I have no idea how to call it there. |
| Comment by Jonathan H. Wage [ 02/Nov/09 ] |
|
At first this seems like a good solution, but I realized it would break things even more. We allow wildcards and certain keywords in the query string: *, OR, AND, etc. If we ran the normalize() method on the query text, it would break all of that functionality. |
| Comment by Markus Lanthaler [ 02/Nov/09 ] |
|
Well, that isn't documented anywhere. The only thing I found was: "Doctrine_Search provides a query language similar to Apache Lucene. The Doctrine_Search_Query converts human readable, easy-to-construct search queries to their complex DQL equivalents which are then converted to SQL like normal." So I would rather break those special features than have the search miss existing items. But maybe there's a better place to call normalize(), perhaps where the query is analyzed and converted to a DQL statement. It should be possible to run normalize() on every search term there. |
| Comment by Jonathan H. Wage [ 02/Nov/09 ] |
|
That's what this means: "Doctrine_Search provides a query language similar to Apache Lucene". You can do things like:

    $query->query('some text* OR some more test*');

If we normalized each term/word, it would still remove those wildcards. That is what "query language similar to Apache Lucene" means. |
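The objection can be checked with a few lines of plain PHP. This is a rough sketch only: normalize_like_standard() is a hypothetical helper that repeats the character-filtering steps of the proposed Standard normalize() and leaves out Doctrine_Inflector::unaccent(), so it runs without Doctrine.

```php
<?php
// Hypothetical helper mirroring the filtering steps of the proposed
// Standard normalize(); the unaccent() call is omitted on purpose.
function normalize_like_standard(string $text): string
{
    // Everything that is not ASCII alphanumeric becomes a space,
    // which includes the "*" wildcard.
    $text = preg_replace('/[^A-Za-z0-9]/', ' ', $text);
    // Collapse runs of spaces left behind by the replacement.
    $text = preg_replace('/ +/', ' ', $text);

    return strtolower(trim($text));
}

$query = 'some text* OR some more test*';
echo normalize_like_standard($query), "\n";
// prints: some text or some more test
```

Both wildcards are gone and "OR" has been lowercased into an ordinary word, so the query language no longer has anything to act on.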
| Comment by João Veríssimo [ 13/Jul/12 ] |
|
What do you think about this solution?

    $ret = array();

    if ($term == 'OR') {
        $ret[$i] = $term;
        continue;
    }

    $lower = strtolower(trim($term));

    if (in_array($lower, parent::$_stopwords)) {
        continue;
    }

    $ret[$i] = $lower;

Will it work? |
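One way to explore this idea further is to treat operators and a trailing wildcard separately from the term itself. The sketch below is an assumption-laden illustration, not Doctrine code: normalize_query_terms(), its operator list (only uppercase OR and AND) and its stopword parameter are all hypothetical, and the accent handling discussed earlier in this ticket is left out.

```php
<?php
// Hypothetical per-term normalization that keeps query operators and a
// trailing "*" wildcard intact. Not part of Doctrine.
function normalize_query_terms(string $query, array $stopwords = []): array
{
    $operators = ['OR', 'AND']; // assumption: only these, uppercase only
    $ret = [];

    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $term) {
        // Pass operators through untouched.
        if (in_array($term, $operators, true)) {
            $ret[] = $term;
            continue;
        }

        // Remember a trailing wildcard before the filter strips it.
        $wildcard = substr($term, -1) === '*' ? '*' : '';
        $clean = strtolower(preg_replace('/[^A-Za-z0-9]/', '', $term));

        if ($clean === '' || in_array($clean, $stopwords, true)) {
            continue;
        }

        $ret[] = $clean . $wildcard;
    }

    return $ret;
}

print_r(normalize_query_terms('Some text* OR the test*', ['the']));
// yields the terms: some, text*, OR, test*
```

This sketch removes inner punctuation instead of splitting on it, and dropping a stopword next to an operator can leave that operator dangling, so both cases would need more care in a real implementation.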