[DC-160] Search doesn't use the Search Analyzer to escape the query Created: 30/Oct/09  Updated: 13/Jul/12  Resolved: 02/Nov/09

Status: Resolved
Project: Doctrine 1
Component/s: Searchable
Affects Version/s: 1.0.12, 1.1.4, 1.2.0-ALPHA1, 1.2.0-ALPHA2, 1.2.0-ALPHA3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Markus Lanthaler Assignee: Jonathan H. Wage
Resolution: Can't Fix Votes: 0
Labels: None


 Description   

When you use Doctrine_Search_Analyzer_Standard, all special characters like "ü" are removed or converted, e.g. to "ue". So far so good. The problem arises when a user performs a search.
Since Doctrine uses the plain query, which still contains the special character "ü", instead of converting it in the same way, no results are returned.

Using the UTF8 analyzer is not an option because the normalization is often a desired feature. For example, it allows a user to formulate the query as either "Muenchen" or "München" and still receive a relevant result.
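
The mismatch can be reproduced outside Doctrine. The following is a minimal sketch, not Doctrine code: unaccent_sketch() is a hypothetical stand-in for Doctrine_Inflector::unaccent() that covers only the umlaut mappings described in this issue, and index_term_sketch() only approximates what the Standard analyzer does at index time.

```php
<?php
// Hypothetical stand-in for Doctrine_Inflector::unaccent(); only the
// umlaut mappings mentioned in this issue are covered (an assumption).
function unaccent_sketch($text)
{
    return strtr($text, array(
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue', 'ß' => 'ss',
    ));
}

// Rough approximation of the normalization applied when a record is indexed.
function index_term_sketch($text)
{
    $text = unaccent_sketch($text);
    $text = preg_replace('/[^A-Za-z0-9]/', ' ', $text);

    return strtolower(trim($text));
}

$indexed  = index_term_sketch('München'); // keyword as stored in the index
$rawQuery = 'München';                    // what the user types

// The raw query still contains "ü", so it cannot equal the stored keyword.
var_dump($indexed === strtolower($rawQuery));
// Running the query through the same normalization makes both spellings match.
var_dump($indexed === index_term_sketch($rawQuery));
var_dump($indexed === index_term_sketch('Muenchen'));
```

The first comparison fails while the other two succeed, which is the behaviour reported here: the index and the query only agree if both pass through the same normalization.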



 Comments   
Comment by Jonathan H. Wage [ 30/Oct/09 ]

The only option I can see is to do this in Doctrine_Search_Query::query()

$text = Doctrine_Inflector::unaccent($text);

I am not sure about doing this though. What do you think?

Comment by Markus Lanthaler [ 30/Oct/09 ]

I don't think that's a good idea since it breaks the UTF8 analyzer. I would refactor the analyzers to include a method like normalize(). That method could then be called from each analyzer's analyze() method.

Doctrine_Search_Analyzer_Standard::normalize($text, $encoding = null) would look as follows:

public function normalize($text, $encoding = null) 
{ 
  $text = preg_replace('/[\'`´"]/', '', $text); 
  $text = Doctrine_Inflector::unaccent($text); 
  $text = preg_replace('/[^A-Za-z0-9]/', ' ', $text); 
  $text = str_replace('  ', ' ', $text); 

  return strtolower(trim($text));
}

Doctrine_Search_Analyzer_Utf8::normalize($text, $encoding = null) would look as follows:

public function normalize($text, $encoding = null) 
{ 
  if (is_null($encoding)) { 
    $encoding = isset($this->_options['encoding']) ? $this->_options['encoding']:'utf-8'; 
  } 

  // check that $text encoding is utf-8, if not convert it 
  if (strcasecmp($encoding, 'utf-8') != 0 && strcasecmp($encoding, 'utf8') != 0) { 
    $text = iconv($encoding, 'UTF-8', $text); 
  } 

  $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text); 
  $text = str_replace('  ', ' ', $text); 

  return mb_strtolower(trim($text), 'UTF-8');
}

This would also remove some code duplication in the analyze() method. It could be changed to the following code in Doctrine_Search_Analyzer_Standard and removed completely from Doctrine_Search_Analyzer_Utf8:

public function analyze($text, $encoding = null)
{
  $text = $this->normalize($text, $encoding);

  $terms = explode(' ', $text);

  $ret = array();
  if ( ! empty($terms)) {
      foreach ($terms as $i => $term) {
          if (empty($term)) {
              continue;
          }

          // normalize() has already lowercased the text, so $term can be used directly
          if (in_array($term, self::$_stopwords)) {
              continue;
          }

          $ret[$i] = $term;
      }
  }
  return $ret;
}

Finally, the normalize() method would be called in Doctrine_Search_Query::query(). Unfortunately, I have no idea how to call it there.
What do you think?

Comment by Jonathan H. Wage [ 02/Nov/09 ]

At first this seemed like a good solution, but I realized it would break things even more. We allow wildcards and certain keywords in the query string: *, OR, AND, etc. If we were to run the normalize() method on the query text, it would break all that functionality.

Comment by Markus Lanthaler [ 02/Nov/09 ]

Well, that's not documented anywhere. The only thing I found was:

Doctrine_Search provides a query language similar to Apache Lucene. The Doctrine_Search_Query converts human readable, easy-to-construct search queries to their complex DQL equivalents which are then converted to SQL like normal.

So I would rather break those special features than have the search miss existing items. But maybe there's a better place to call normalize(), perhaps where the query is analyzed and converted to a DQL statement. It should be possible to run normalize() on every search term there.

Comment by Jonathan H. Wage [ 02/Nov/09 ]

That's what this means:

Doctrine_Search provides a query language similar to Apache Lucene

You can do things like

$query->query('some text* OR some more test*');

If we normalized each term/word, it would still remove those wildcards. That is what "query language similar to Apache Lucene" means.
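
To illustrate the trade-off discussed above, here is a minimal sketch of normalizing a query term by term while leaving the operators and a trailing wildcard alone. Nothing in it is Doctrine API: normalize_query_sketch() and the $normalize closure are hypothetical, the closure is a stand-in for an analyzer's normalize() that covers only the "ü" case, and only the OR, AND and * features named in this thread are handled.

```php
<?php
// Hypothetical helper, not part of Doctrine: normalizes each search term
// but keeps the operators OR/AND and a trailing "*" wildcard untouched.
function normalize_query_sketch($query, $normalizeTerm)
{
    $out = array();
    foreach (preg_split('/\s+/', trim($query)) as $token) {
        if ($token === '') {
            continue;
        }
        // Operators pass through unchanged.
        if ($token === 'OR' || $token === 'AND') {
            $out[] = $token;
            continue;
        }
        // Detach a trailing wildcard before normalizing, reattach it afterwards.
        $wildcard = '';
        if (substr($token, -1) === '*') {
            $wildcard = '*';
            $token = substr($token, 0, -1);
        }
        $term = call_user_func($normalizeTerm, $token);
        if ($term !== '') {
            $out[] = $term . $wildcard;
        }
    }

    return implode(' ', $out);
}

// Stand-in for an analyzer's normalize(); an assumption for illustration only.
$normalize = function ($term) {
    $term = strtr($term, array('ü' => 'ue', 'Ü' => 'Ue'));

    return strtolower(preg_replace('/[^A-Za-z0-9]/', '', $term));
};

echo normalize_query_sketch('Münch* OR some text*', $normalize), "\n";
// prints "muench* OR some text*"
```

Any other query syntax that Doctrine_Search_Query supports would need the same kind of special-casing, which is the maintenance cost raised in the comment above.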

Comment by João Veríssimo [ 13/Jul/12 ]

What do you think about this solution?

class ErpSearchAnalizer extends Doctrine_Search_Analyzer_Standard
{
  public function analyze($text, $encoding = null)
  {
    $text = preg_replace('/[\'`´"]/', '', $text);
    $text = Doctrine_Inflector::unaccent($text);
    // for * search
    //$text = preg_replace('/[^A-Za-z0-9]/', ' ', $text);
    $text = str_replace('  ', ' ', $text);
    $terms = explode(' ', $text);

    $ret = array();
    if ( ! empty($terms)) {
      foreach ($terms as $i => $term) {
        if (empty($term)) {
          continue;
        }

        if ($term == 'OR') {
          $ret[$i] = $term;
          continue;
        }

        $lower = strtolower(trim($term));

        if (in_array($lower, parent::$_stopwords)) {
          continue;
        }

        $ret[$i] = $lower;
      }
    }
    return $ret;
  }
}

to make a search like this:

$analyzer = new ErpSearchAnalizer();
$val_user = $analyzer->analyze($val_user);
$tempresult = $table->search(implode(" ", $val_user));

Will it work?

Generated at Mon Nov 24 05:28:58 UTC 2014 using JIRA 6.2.3#6260-sha1:63ef1d6dac3f4f4d7db4c1effd405ba38ccdc558.