[DDC-586] Repo does not find "unflushed" object Created: 14/May/10  Updated: 26/Aug/10

Status: Open
Project: Doctrine 2 - ORM
Component/s: None
Affects Version/s: 2.0
Fix Version/s: None
Security Level: All

Type: Improvement Priority: Major
Reporter: John Kleijn Assignee: Roman S. Borschel
Resolution: Unresolved Votes: 1
Labels: None


 Description   

The problem is this:

$bar = new \entity\content\ContentTag();
$bar->setName('bar');
$em->persist($bar);

$existingTag = $em->getRepository('entity\content\ContentTag')->findOneByName('bar');

Seeing as in EntityRepository "find()" queries the Unit of Work first, and "findBy()" goes directly to the persister, only remotely stored objects will be found. Now if I want a tag object to attach related tags, it would have to query by name to see if an object already exist, BUT it wont find one as the UoW has not been committed, resulting in a new one being created, ultimately resulting in a PDO error on the unique name constraint.

This can be "solved" by inserting a flush, but it is impossible to know whether a flush is required, without knowledge of what comes next. I.e. for one part to know it has to flush, it has to know another wants to fetch an object you just created.

This causes an unacceptable amount of coupling.

Somehow the repo will have to be able execute DQL against the objects in the UoW. This does not have to be full support (straight away), but it should fail (throw an exception) if the possibility exists that the UoW contains items that are excluded (e.g. the operation is not supported and the UoW still contains items).

For right now, this means the EntityManager should throw an exception if DQL is executed on the type when the UoW is not empty. Until the time that the EntityManager can query the UoW using DQL. The alternative would be to "flush" before every operation that goes to the database for data.



 Comments   
Comment by Roman S. Borschel [ 14/May/10 ]

Hi,

you mention a good point, however, this currently only affects findBy queries made through a repository. A DQL query already triggers a flush when there are pending insertions but this still has its own problems. First of, querying against the objects in the UoW is not a viable solution in my eyes.

For a regular find() (by identifier) the situation is clear anyway, you must flush prior to a lookup on an entity you previously persisted in the same request because, by definition, generated primary key values are only guaranteed to be available after the next flush.

Automatic flushing if the UoW has pending inserts (new objects) and a query is executed (either through DQL or a repository) currently has its own set of problems, namely that it is still subject to infinite recursion if such a query is triggered in an event (listener) that executes during commit of a UoW, and secondly, that it will easily lead to double-flushes that cause unnecessary overhead (currently a flush() even if nothing needs to be done is not free because the UoW actually has to check whether nothing needs to be done). Both of these problems could be addressed with some sort of flags, but the question still is whether its not better to flush manually in the first place.
That would mean, in your example, you should flush after persisting the new objects, irrespectively of what code comes next, you persisted (a) new object(s) and you want to make sure these are fully available to the rest of the script.

Comment by Roman S. Borschel [ 14/May/10 ]

Furthermore, automatic flushing when there is no transaction active is probably also not a great idea, as it may split a single unit of work (that was supposed to be atomic) into 2 without the user knowing about it. So auto-flushing should better only happen when a transaction is active (i.e. explicit transaction demarcation is used).

Comment by John Kleijn [ 14/May/10 ]

That would mean, in every example, you should flush after persisting new objects, period. If I flush in some cases and not in others, I'm asking for issues that may not be caught by tests. It's an inconsistency that I personally am not comfortable with.

Could be that I'm overlooking something, I've just started playing with D2.

Why is querying against registered objects not viable? It's not easy, granted, but it doesn't seem impossible. There should probably be a layer between the UoW and the "persisters" (Data Mappers?).

RE: the UoW double flush: state management on the UoW as a whole should prevent that. i.e. after a commit the whole UoW is clean? Just a suggestion, as I said, still getting my bearings.

On a side note I just want to say that what I've seen so far, for the better part, pleases me greatly. Kudos.

Comment by Roman S. Borschel [ 14/May/10 ]

@"That would mean, in every example, you should flush after persisting new objects, period."

Yes, if you want the objects to be visible to queries in the same request. Generally, you should flush when you complete a unit of work and that is usually not the whole request (but can be).

I don't want to "query" against registered objects because it is a) not easy b) likely a lot of code and c) very likely error-prone. And in addition I don't see this helping with solving any inconsistency. If you want to use find() you have to flush anyway because you can not find() without having the identifier in the first place, which is only available after a flush.

@RE: the UoW double flush:

Yes, like I said, it can be done but it is a compromise. Having a "clean/dirty" flag in addition to calculating the changesets of the work to do (which implicitly tells us whether the UoW is dirty) adds more code and more potential for errors. Forget to update the flag in one location and you get flushes that don't do anything, because the flag was not updated. A dirty-flag for the UoW is not really required for proper working. It is similar to the approach of maintaining a separate counter for the number of elements in a collection implementation: can make many size/count requests faster but complicates the internal implementation and increases the likelihood for errors (and lock contention for the counter in a thread-safe/concurrent implementation, an interesting case where performance goes against scalability, but I digress and that does not apply to php obviously). That said, I am not strongly opposed to doing this.

If you're interested in how this is specified by "big brother", take a look at section 3.8.7 of the JPA 2 specification. Shortly, with the default behavior it requires the implementation to ensure that unflushed changes are visible to queries which can be achieved by flushing these to the database automatically but only if a transaction is active, otherwise the implementation must not flush to the database. There is alternatively also a "MANUAL" flush mode, in that case the effect of updates made to entities in the UoW upon queries is unspecified. We do not have different flush modes anymore, however, in Doctrine.

So I see two possible ways to go here:

1) More effort, more code, (really better?)

  • Maintaining a dirty flag in the UoW (this could be done anyway at some point, even if 2) is chosen)
  • Maintaining a flag to avoid infinite recursion triggered from events within a UoW commit/flush
  • Flushing automatically when querying while there are pending inserts and a transaction is active

2) No effort, less code

  • Removing the current auto-flush on DQL queries which is still subject to infinite recursion
  • No automatic flushes, anywhere (less magic, so to speak?)
  • Clearly documenting that new, unflushed entities are not visible in subsequent queries issued in the same request, and if this is desired, a flush should be issued.

That's how I see it. Now we need some votes and volunteers for the implementation Personally, I am not sure yet about which version I prefer, 2) does not sound too bad for me.

Comment by Roman S. Borschel [ 14/May/10 ]

In Nr. 1) the case with the infinite recursion may actually be more problematic. I think you simply can not see unflushed new objects in queries made during a UoW commit.

Comment by John Kleijn [ 14/May/10 ]

When there's no in-memory objects inclusion, I'd say 2) as well. Again, I have no idea how this is implemented currently, but I would prefer something like this:

$repo->start();

$repo->register($object);

$repo->commit();

Why?

  • Commit instead of flush: "flush" has little semantic value IMO, "commit" leaves no questions: you're committing your changes (which implies that they are not, before)
  • Operating on the repo leaves no question to what you are committing: changes of the associated type and relations configured to cascade, made after start()
  • Register instead of persist: "persist" is misleading as the object is not immediately persisted, and as my example shows, may not be.

The way I see it "start" would create a UoW associated with the repo, "commit" would calculate changes and write (the enitity manager would make sure references in other UoWs are removed).

Because the way it is currently implemented (or so it seems), it's unclear when to flush and when not to flush, and unclear what I'm flushing at any one point in the code (because it is not locally isolated). If I have to decide whether to flush in some bit of client code, I am apparently making an assumption about the target entity, i.e. coupling.

I know, you already went beta, so it's unlikely you would consider such a large change, but anyway, for your consideration.

Finally, I realize I'm borderline nagging now as you've made it clear you see nothing it, but a Repository (as in the PoEAA pattern, p 322) may provide a method of fetching native in-memory objects using criteria, acting as a "buffer" between code and database. The Repository in D2 does effectively nothing but delegate to the UoW (or mostly to the underlying persister). Ref PoEAA 327 for an example of an in-memory strategy. As a final point of critique, the Repository does not always seem to be used as entry point for data requests, which is the whole point of the pattern. Most of what's in EntityManager, should be in EntityRepository ("manager" is a bit to abstract a concept to expect clear responsibilities anyway). EntityManager::find() delegates to EntityRepository, but pretty much everything else is the other way around. EntityManager would be better off named DataGateway, as that accurately describes its intended function.

I admit, it would be very difficult to use DQL on in memory objects, but it would be far superior and if it work lead to much more predictable behaviour. It's the ONLY way the data store is ever going to be truly transparent. A few examples (DQL from the docs):

SELECT
u,
UPPER(u.name) nameUpper
FROM MyProject\Model\User u

  • Fetch everything from the db
  • Select all objects from the User UoW
  • Iterate over the in memory ones and modify the name property to upper case
  • Merge the results and return

SELECT u FROM User u WHERE u.id = ?1 OR u.nickname LIKE ?2 ORDER BY u.surname DESC

  • Execute against database
  • Iterate over the User UoW, indexing by "surname", adding items that match the criteria
  • Merge the results and return

With joins it could get more complex, provided you want to intelligently merge results into existing objects. Question is whether that is really needed, but there's obviously a performance benefit. Actually this may already be implemented.

I suspect there are edge cases, rooted in DQL still being based on SQL, but in theory it should be possible. Likely you would still want to do start(), and delegate to the driver to start an actual transaction to prevent inconsistent reads... The only way to find out if it's truly feasible is to to try it, I think.

Ramble, ramble, ramble, I'm done. I know I seem critical, but it's positive critique, I love the direction you went with D2.

Comment by Roman S. Borschel [ 14/May/10 ]

Maybe I was not clear, with approach Nr.1 there would be in-memory objects inclusion (of new objects), in fact, there always is, due to the identity map. When you query for objects and some of them are already in memory, these are used, not again reconstructed.

The EntityRepository provided by Doctrine is just a convenient mechanism for writing your own repositories. There are many different understandings for what a repository is, you can make it whatever you want it to be. Is a PoEAA repository the same as a DDD repository? Anyway, the repository could be stripped of the project, it is optional, the state management is handled by the EntityManager and UnitOfWork. These are the core components. I agree that the delegation from EntityManager#find to the repository is suboptimal in this regard and should be the other way around.

Now to your question: "When should I flush?". Generally, you should flush at the end of a transaction, which in turn is a unit of work. That means, use explicit transaction demarcation. begin() ... flush() commit(). I've added some control abstractions recently that should make this even easier. I can only recommend to explicitly demarcate your transaction boundaries. As you probably know, you can not talk to your database outside of a transaction anyway. The default behavior (flush() wrapping all its stuff in a transaction) is for convenience mostly and so as not to alienate or confuse people even more who are used to autocommit mode.

Concerning the naming, we mostly stick with the JPA specification and I, for one, really like the naming and I don't want to invent new names. PoEAA is far more abstract (and the examples far too specific) than what is specified in JPA, so I recommend giving that a read. The patterns in PoEAA obviously and intentionally leave a lot of room for different variants of implementation and also leave open a lot of open questions (many of the difficult questions especially, it is for a reason that the author recommends using an existing tool instead of writing your own). In my opinion it is just not feasible to query in-memory objects in a generic way, all the examples in PoEAA do not have generic but rather concrete code examples, which is obviously a lot easier. The feasible strategy, and that is what we do, is to do in-memory lookups only when querying by PK, otherwise the query is executed and afterwards nevertheless any objects reused that are already in memory (based on the PK) and not reconstructed. This is the approach we use.

Thanks for your input, I do see that you are an experienced fellow in object-relational persistence, maybe we can see you as a committer some day?

Comment by Roman S. Borschel [ 14/May/10 ]

@ "SELECT u, UPPER(u.name) nameUpper FROM MyProject\Model\User u"

This selects all users and their names in uppercase, the uppercase names are scalar values, the users are not modified! Scalar values are separate from objects.

@ "... and unclear what I'm flushing at any one point in the code"

flush() means: Synchronize the in-memory state of my objects with the database, making any changes that are only in-memory persistent. Nothing more, nothing less.

Again, objects are always reused based on the identity map and the state that is in-memory prevails, unless you use refresh() or execute a query with the Query::HINT_REFRESH query hint. All objects you fetch from DQL, be it as a root object or as a joined association, are first looked up in-memory (but after the SQL query has been issued!).

Maybe we have been talking past each other here, what I refer to as not feasible is querying the in-memory objects first in some way, even before the SQL query. This is just too complicated and error-prone, except for the simple case of a PK lookup and that is where we do it already.

Comment by John Kleijn [ 14/May/10 ]

> Scalar values are separate from objects.

Right. Bad example.

> flush() means: Synchronize the in-memory state of my objects with the database, making any changes that are only in-memory persistent. Nothing more, nothing less.

I realize that it means that, but commit() would be more obvious.

> Maybe we have been talking past each other here, what I refer to as not feasible is querying the in-memory objects first in some way, even before the SQL query. This is just too complicated and error-prone, except for the simple case of a PK lookup and that is where we do it already.

Fair enough, you don't think it's feasible, so we'll keep it at that. Maybe I'll give it a shot some time.

Generated at Tue Sep 23 02:20:46 UTC 2014 using JIRA 6.2.3#6260-sha1:63ef1d6dac3f4f4d7db4c1effd405ba38ccdc558.