Doctrine 1 / DC-160

Search doesn't use the Search Analyzer to escape the query

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Can't Fix
    • Affects Version/s: 1.0.12, 1.1.4, 1.2.0-ALPHA1, 1.2.0-ALPHA2, 1.2.0-ALPHA3
    • Fix Version/s: None
    • Component/s: Searchable
    • Labels:
      None

      Description

      When you use Doctrine_Search_Analyzer_Standard, all special characters such as "ü" are removed or converted (e.g. to "ue"). So far so good. The problem arises when a user performs a search:
      Doctrine uses the plain query, still containing the special character "ü", instead of converting it in the same way, so no results are returned.

      Using the UTF8 analyzer is not an option because the normalization is often a desired feature. For example, it lets a user write the query as either "Muenchen" or "München" and still get a relevant result.
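
The mismatch described above can be reproduced without Doctrine. The following is a minimal sketch: `sketchNormalize()` and its tiny transliteration table are hypothetical stand-ins for what the standard analyzer does at index time, not Doctrine's actual code.

```php
<?php
// Hypothetical stand-in for index-time normalization; NOT Doctrine's code.
function sketchNormalize($text)
{
    // Tiny transliteration table, for illustration only.
    $map = ['ü' => 'ue', 'ö' => 'oe', 'ä' => 'ae', 'ß' => 'ss',
            'Ü' => 'Ue', 'Ö' => 'Oe', 'Ä' => 'Ae'];
    $text = strtr($text, $map);
    $text = preg_replace('/[^A-Za-z0-9]/', ' ', $text);
    return strtolower(trim($text));
}

// Index time: the stored keyword is the normalized form.
$index = [sketchNormalize('München')];

// Query time without normalization: the raw term never matches.
$rawHit = in_array('München', $index, true);

// Query time with the same normalization: both spellings match.
$normalizedHit = in_array(sketchNormalize('München'), $index, true);
$asciiHit = in_array(sketchNormalize('Muenchen'), $index, true);
```

Here `$rawHit` is false while `$normalizedHit` and `$asciiHit` are true, which is the behavior the report asks for on the query side.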

        Activity

        João Veríssimo added a comment -

        What do you think about this solution?

        class ErpSearchAnalizer extends Doctrine_Search_Analyzer_Standard {
            public function analyze($text, $encoding = null) {
                $text = preg_replace('/[\'`´"]/', '', $text);
                $text = Doctrine_Inflector::unaccent($text);
                // disabled for * search
                //$text = preg_replace('/[^A-Za-z0-9]/', ' ', $text);
                $text = str_replace('  ', ' ', $text);
                $terms = explode(' ', $text);

                $ret = array();
                if (!empty($terms)) {
                    foreach ($terms as $i => $term) {
                        if (empty($term)) {
                            continue;
                        }
                        // keep the OR keyword as-is
                        if ($term == 'OR') {
                            $ret[$i] = $term;
                            continue;
                        }
                        $lower = strtolower(trim($term));

                        if (in_array($lower, parent::$_stopwords)) {
                            continue;
                        }

                        $ret[$i] = $lower;
                    }
                }
                return $ret;
            }
        }

        To make a search like this:

        $analize = new ErpSearchAnalizer();
        $val_user = $analize->analyze($val_user);
        $tempresult = $table->search(implode(" ", $val_user));

        Will it work?

        Jonathan H. Wage added a comment -

        That's what this means:

        Doctrine_Search provides a query language similar to Apache Lucene
        

        You can do things like

        $query->query('some text* OR some more test*');
        

        If we normalized each term/word, it would still remove those wildcards. That is what "a query language similar to Apache Lucene" means.
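
To illustrate the point, here is a small sketch that applies the character filter quoted elsewhere in this thread (`/[^A-Za-z0-9]/`) plus lowercasing to the example query. It uses only built-in PHP functions and does not call Doctrine.

```php
<?php
$query = 'some text* OR some more test*';

// Replace everything that is not a letter or digit with a space,
// then lowercase, as the normalization discussed in this thread does.
$stripped = preg_replace('/[^A-Za-z0-9]/', ' ', $query);
$normalized = strtolower(trim($stripped));

// Split on whitespace to inspect the surviving tokens.
$tokens = preg_split('/\s+/', $normalized);
```

The resulting tokens are `some`, `text`, `or`, `some`, `more`, `test`: both wildcards are gone and `OR` has become an ordinary lowercase word.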

        Markus Lanthaler added a comment -

        Well, that's not documented anywhere. The only thing I found was:

        Doctrine_Search provides a query language similar to Apache Lucene. The Doctrine_Search_Query converts human readable, easy-to-construct search queries to their complex DQL equivalents which are then converted to SQL like normal.
        

        So I would rather break those special features than have the search miss existing items. But maybe there's a better place to call normalize(): the point where the query is analyzed and converted to a DQL statement. It should be possible to run normalize() on every search term there.
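
A rough sketch of what per-term normalization could look like, assuming the query can be split on whitespace and that only `OR`, `AND` and a trailing `*` need to be preserved. Both helper functions and the transliteration table are hypothetical; this is not how Doctrine_Search_Query parses queries, and quoted phrases or other operators are not handled.

```php
<?php
// Hypothetical helper, not part of Doctrine: normalizes one bare term.
function sketchNormalizeTerm($term)
{
    // Illustrative transliteration table only.
    $map = ['ü' => 'ue', 'ö' => 'oe', 'ä' => 'ae', 'ß' => 'ss',
            'Ü' => 'Ue', 'Ö' => 'Oe', 'Ä' => 'Ae'];
    $term = strtr($term, $map);
    $term = preg_replace('/[^A-Za-z0-9]/', '', $term);
    return strtolower($term);
}

// Hypothetical helper: normalizes a query token by token.
function sketchNormalizeQuery($query)
{
    $out = [];
    foreach (preg_split('/\s+/', trim($query)) as $token) {
        if ($token === '') {
            continue;
        }
        // Leave the keywords named in this thread untouched.
        if ($token === 'OR' || $token === 'AND') {
            $out[] = $token;
            continue;
        }
        // Split off a trailing wildcard, normalize the rest, re-attach it.
        $wildcard = substr($token, -1) === '*' ? '*' : '';
        $bare = $wildcard === '' ? $token : substr($token, 0, -1);
        $normalized = sketchNormalizeTerm($bare);
        if ($normalized !== '') {
            $out[] = $normalized . $wildcard;
        }
    }
    return implode(' ', $out);
}

$result = sketchNormalizeQuery('Münch* OR Köln');
// $result is 'muench* OR koeln'
```

The design choice here is to treat operators as opaque tokens and normalize only what is left, so the index-time and query-time forms of a term agree while the query syntax survives.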

        Jonathan H. Wage added a comment -

        At first this seemed like a good solution, but I realized it would break things even more. We allow wildcards and certain keywords in the query string: *, OR, AND, etc. If we were to run the normalize() method on the query text, it would break all that functionality.

        Markus Lanthaler added a comment -

        I don't think that's a good idea, since it breaks the UTF8 analyzer. I would refactor the analyzers to include a method like normalize(). That method could then be called from each analyzer's analyze() method.

        Doctrine_Search_Analyzer_Standard::normalize($text, $encoding = null) would look as follows:

        public function normalize($text, $encoding = null) 
        { 
          $text = preg_replace('/[\'`´"]/', '', $text); 
          $text = Doctrine_Inflector::unaccent($text); 
          $text = preg_replace('/[^A-Za-z0-9]/', ' ', $text); 
          $text = str_replace('  ', ' ', $text); 
        
          return strtolower(trim($text));
        }
        

        Doctrine_Search_Analyzer_Utf8::normalize($text, $encoding = null) would look as follows:

        public function normalize($text, $encoding = null) 
        { 
          if (is_null($encoding)) { 
            $encoding = isset($this->_options['encoding']) ? $this->_options['encoding']:'utf-8'; 
          } 
        
          // check that $text encoding is utf-8, if not convert it 
          if (strcasecmp($encoding, 'utf-8') != 0 && strcasecmp($encoding, 'utf8') != 0) { 
            $text = iconv($encoding, 'UTF-8', $text); 
          } 
        
          $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text); 
          $text = str_replace('  ', ' ', $text); 
        
          return mb_strtolower(trim($text), 'UTF-8');
        }
        

        This would also make it possible to remove some code duplication in the analyze() method. It could be changed to the following in Doctrine_Search_Analyzer_Standard and removed completely from Doctrine_Search_Analyzer_Utf8:

        public function analyze($text, $encoding = null)
        {
          $text = $this->normalize($text, $encoding);
        
          $terms = explode(' ', $text);
        
          $ret = array();
          if ( ! empty($terms)) {
              foreach ($terms as $i => $term) {
                  if (empty($term)) {
                      continue;
                  }
        
                  // normalize() has already lowercased the text, so $term is used directly
                  if (in_array($term, self::$_stopwords)) {
                      continue;
                  }
        
                  $ret[$i] = $term;
              }
          }
          return $ret;
        }
        

        Finally, normalize() would be called in Doctrine_Search_Query::query(). Unfortunately, I have no idea how to call it there.
        What do you think?

        Jonathan H. Wage added a comment -

        The only option I can see is to do this in Doctrine_Search_Query::query():

        $text = Doctrine_Inflector::unaccent($text);
        

        I am not sure about doing this though. What do you think?
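
One property of this option is that transliteration alone leaves the query syntax intact. The sketch below uses `sketchUnaccent()`, a hypothetical stand-in with a deliberately tiny table, because the real Doctrine_Inflector::unaccent() is not available outside Doctrine and covers far more characters.

```php
<?php
// Hypothetical stand-in for Doctrine_Inflector::unaccent(); illustration only.
function sketchUnaccent($text)
{
    return strtr($text, ['ü' => 'ue', 'ö' => 'oe', 'ä' => 'ae', 'ß' => 'ss',
                         'Ü' => 'Ue', 'Ö' => 'Oe', 'Ä' => 'Ae']);
}

$query = 'Münch* OR Köln';

// Only the listed accented characters are replaced, so "*" and "OR" survive.
$unaccented = sketchUnaccent($query);
// $unaccented is 'Muench* OR Koeln'
```

Note that this step does not lowercase the terms or strip other punctuation, so it only closes the gap for accented characters.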


          People

          • Assignee: Jonathan H. Wage
          • Reporter: Markus Lanthaler