
Thread: Crashing with node and VPR

  1. #1
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143

    Crashing with node and VPR

    My cute and nearly useful little C++ node is finally working in both Modeler and Layout. Renders and everything! Whee!

    Only... it crashes with an EXC_BAD_ACCESS if I fire up the VPR viewport, and according to the debugger (oh GOD I can finally attach the debugger!) it's happening while retrieving values from the LWNodalAccess* passed into the Evaluate() method, and I can't seem to figure out why. The debugger shows that the node access pointer is non-null and that there are sensible-looking values in the structure.

    Am I possibly hitting some kind of weird threading issue? Something else?

    Update: It's not nodal access after all. See thread below.
    Last edited by jrandom; 03-17-2013 at 12:02 AM.

  2. #2
    obfuscated SDK hacker Lightwolf's Avatar
    Join Date
    Feb 2003
    Location
    Stuttgart, Germany
    Posts
    13,711
    Retrieving data from LWNodalAccess* is thread safe. However, you should make sure that your evaluate() is thread safe as well. For example, don't write to member variables.
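
    To illustrate (a generic sketch, not your code): an evaluate() that writes an intermediate result into a member variable will race when several render threads call it at once, while keeping everything in locals is safe by construction.

    Code:
        // Hypothetical node class, purely for illustration.
        class Example_Node
        {
            double last_result;     // shared by all threads -- dangerous to write during evaluation

        public:
            // NOT thread safe: two threads calling this concurrently can interleave
            // their writes to last_result and read back the wrong value.
            void Evaluate_Unsafe( double input )
            {
                last_result = input * 2.0;          // write to member variable
                // ... last_result may already belong to another thread's evaluation
            }

            // Thread safe: everything lives on this thread's stack.
            void Evaluate_Safe( double input, double * output ) const
            {
                double result = input * 2.0;        // local, private to this thread
                *output = result;
            }
        };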

    If you'd like to, send me or post your evaluate and I can have a look.

    Cheers,
    Mike

  3. #3
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143
    It's also pretty sporadic. Sometimes VPR chugs away with no problems, both in and out of Draft Mode.

  4. #4
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143
    I'm working up to porting my material optimizer to C++ (and to see if it's even useful with the new Unified Sampling model), so to start easy I grabbed the internal value-caching system and turned just that into a scalar cache -- no interface, no preview, no fancy calculations. But it does update its internal cache during Evaluate (although the reads/writes should be separated by the different thread/bounce numbers).

    The old C code never seemed to have problems, but we didn't have VPR back when I wrote it. Do I need to wrap the caching of my value in a thread lock?

    The evaluate method is pretty simple:

    Code:
        void Scalar_Cache::Evaluate( LWTypes::LWNodalAccess * nodal_access,
                                     LWTypes::NodeOutputID    node_output_id,
                                     LWTypes::NodeValue       node_value )
        {
            // Only one output
            if (node_output_id != Output_Out)
                return;
            
            // Retrieve cached value, evaluating and caching if necessary
            double output = 0.0;
            
            if ( has_Valid_Cache() )
            {
                if ( !Value_Cache_p->Get_Cached_Value(output, nodal_access) )
                {
                    Input_Functions->evaluate( Input_In, nodal_access, &output );
                    
                    Value_Cache_p->Set_Cached_Value( output, nodal_access );
                }
            }
            else
                Input_Functions->evaluate( Input_In, nodal_access, &output );
            
            
            // Set output
            Output_Functions->setValue( node_value, &output );
        }

  5. #5
    obfuscated SDK hacker Lightwolf's Avatar
    Join Date
    Feb 2003
    Location
    Stuttgart, Germany
    Posts
    13,711
    Quote Originally Posted by jrandom View Post
    It's also pretty sporadic.
    Those are the most fun... it sounds like a threading issue though.

    I hope it's nothing like my recent bug hunt, which ended up being due to negative values fed into pow().
    No crash in node editor previews, but crashes on Windows (render and VPR) and black splotches on OSX (which eventually led me to the culprit).
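
    (For anyone who hits the same thing: std::pow() with a negative base and a non-integer exponent has no real-valued result, so you get a domain error / NaN, and whatever consumes that value downstream fails differently on each platform -- hence the crash-here, black-splotches-there behaviour. A minimal guard, just as a sketch:)

    Code:
        #include <algorithm>
        #include <cmath>

        // Defensive pow() for shading code: clamp the base so a negative value
        // can't produce NaN when the exponent is non-integral.
        inline double safe_pow( double base, double exponent )
        {
            return std::pow( std::max( base, 0.0 ), exponent );
        }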

    It took ages to isolate. Fun.

    Cheers,
    Mike

  6. #6
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143
    I'll keep poking away at it. Sporadic sucks -- there's no way to tell the difference between "fixed it" and "it's still broken, just not crashing right now".

    On the positive side, the new Xcode successfully breaks into the code during a crash. Handy, that.

  7. #7
    obfuscated SDK hacker Lightwolf's Avatar
    Join Date
    Feb 2003
    Location
    Stuttgart, Germany
    Posts
    13,711
    Quote Originally Posted by jrandom View Post
    ... so to start easy I grabbed the internal value-caching system and turned just that into a scalar cache -- no interface, no preview, no fancy calculations. But it does update its internal cache during Evaluate (although the read/writes should be separated by the different thread/bounce numbers).
    That's something you'd need to double and triple check. The code you posted so far seems solid enough. You might need to check the context (i.e. node editor previews can get in the way).

    Actually, I've just looked at my cache node, and it does mutex. I'll need to check if that's really necessary; I might have just added it to be on the safe side.

    Actually, looking at the code again, I also mutex copy, newtime and delete - likely due to LW messing with the node during a VPR render (i.e. for the undo stack). One reason why I requested a flag for VPR renders.
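
    (The pattern is simply that newTime, the copy handler and destroy take the same lock the evaluation path uses, so VPR can't be walking the node while LW rebuilds it. A rough sketch with std::mutex standing in for the SDK's threading utilities -- the names are made up, and destroy is the trickier case, as the rest of this thread shows:)

    Code:
        #include <mutex>

        // Hypothetical cache-node skeleton, for illustration only.
        class Cache_Node_Sketch
        {
            std::mutex node_mutex;      // same lock the evaluation path would take

        public:
            void NewTime( /* frame, time */ )
            {
                std::lock_guard<std::mutex> guard( node_mutex );
                // invalidate / rebuild the cache for the new time here
            }

            void Copy( const Cache_Node_Sketch & source )
            {
                std::lock_guard<std::mutex> guard( node_mutex );
                // duplicate cached data from 'source' here
            }
        };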

    Cheers,
    Mike

  8. #8
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143
    Lemme check... yep, my NewTime() is mutex-wrapped, but that's the only place I do that. After I do some other tests, I'll wrap delete() and that one section of Evaluate() as well.

    So weird we don't get a VPR flag.

  9. #9
    obfuscated SDK hacker Lightwolf's Avatar
    Join Date
    Feb 2003
    Location
    Stuttgart, Germany
    Posts
    13,711
    Quote Originally Posted by jrandom View Post
    Lemme check... yep, my NewTime() is mutex-wrapped, but that's the only place I do that.
    Yeah, if it's only there it won't help. So far, NewTime() has always been called from a single thread in my experience.

    Actually, starting with LW 10 there's a comring function available to pause VPR (limbo state) - that may be an option to drop the mutex and then just put VPR into limbo when deleting or modifying the cache.
    I haven't tried it out though - but I should.

    Cheers,
    Mike

    Edit: It's not documented in HTML but included in the comring header file.

  10. #10
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143
    I've never even looked at comring before. I'll dive in as well.

  11. #11
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143
    Just verified that it nearly always crashes if VPR is active and I change one of my node's connections in the node editor -- at the crash, the pointer to the internal cache object is null, even though the access happens after I check that very same pointer for null. Definitely a case of the cache object being deleted while another thread is calling Evaluate().

    Now I've got to figure out how to arrange my locks so this doesn't happen, without killing performance on lock-waits.
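
    In other words it's a check-then-use race on the cache pointer: the null check and the dereference in one thread can straddle the reset happening in another. Roughly (simplified, not the actual class):

    Code:
        #include <memory>

        struct Value_Cache { double cached_value; };

        // Member of the node in the real code; shown standalone only to illustrate the race.
        std::unique_ptr<Value_Cache> Value_Cache_p;

        // Thread A (VPR, in Evaluate)            Thread B (node editor connection change)
        // -----------------------------          -----------------------------------------
        // if (Value_Cache_p != nullptr) ...      (pointer still valid here)
        //                                        Value_Cache_p.reset();   // cache freed, pointer nulled
        // Value_Cache_p->cached_value   ...      // dereferences null -> EXC_BAD_ACCESS
        //
        // The check and the dereference aren't atomic as a pair, so the reset can land between them.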

  12. #12
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143
    Solution!

    Kinda cranky, but it works. The design is driven by the fact that evaluation needs to be fast, while node destruction doesn't. My first attempt wound up locking the entire evaluate method, and that caused a threading bottleneck. Evaluation needs to be parallel-friendly.

    Okay, so first I need two variables to coordinate evaluation with destruction:

    Code:
    volatile int count_In_Evaluation;
    bool         Destroy_Requested;
    I'm marking count_In_Evaluation as volatile so the compiler doesn't break things by optimizing away memory accesses in the destroy() callback.
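
    (As an aside, if C++11 is available, std::atomic gives a well-defined cross-thread guarantee without leaning on volatile -- just a sketch of the two variables for comparison, not what this node actually uses:)

    Code:
        #include <atomic>

        std::atomic<int> count_In_Evaluation { 0 };   // always read/written consistently across threads
        bool             Destroy_Requested = false;   // fine as a plain bool while it's only touched under the mutex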

    Here's the approach: Every thread that enters an evaluation needs to increment count_In_Evaluation, and decrement it when done. However, if the object is about to be destroyed, we want to simply abort out of the evaluation method. Here are the helper methods:

    Code:
            // -------------------------------------------------------------------- Mark_Entering_Evaluation()
            bool Mark_Entering_Evaluation()
            {
                // Attempt lock
                if (!Threading_Functions->groupLockMutex( Threading_Group_ID, MutexID_Object_Lock ) )
                    return false;
                
                // If we're marked for destruction, abort
                if (Destroy_Requested)
                {
                    Threading_Functions->groupUnlockMutex( Threading_Group_ID, MutexID_Object_Lock );
                    return false;
                }
                
                // Increment our evaluation counter and unlock
                count_In_Evaluation++;
                
                Threading_Functions->groupUnlockMutex( Threading_Group_ID, MutexID_Object_Lock );
                return true;
            }
            
            // -------------------------------------------------------------------- Mark_Leaving_Evaluation()
            void Mark_Leaving_Evaluation()
            {
                // Attempt lock
                if (!Threading_Functions->groupLockMutex( Threading_Group_ID, MutexID_Object_Lock ) )
                    throw std::runtime_error("Cache_Node_Base::Mark_Leaving_Evaluation() -- Unable to lock object");
                
                // Decrement our evaluation counter and unlock
                count_In_Evaluation--;
                
                // Unlock
                Threading_Functions->groupUnlockMutex( Threading_Group_ID, MutexID_Object_Lock );
            }
    Here's the evaluation method. Note that if we can't successfully mark for evaluation (meaning that the object is currently ready for destruction), we get the heck out without doing anything. Otherwise we go on our merry way and call our "leaving" method on the way out. This approach means that the object is locked only for very brief periods of time at the very beginning and very end of evaluation. The rest of evaluation is fully parallel.

    Code:
        // -------------------------------------------------------------------- Evaluate()
        void Scalar_Cache::Evaluate( LWTypes::LWNodalAccess * nodal_access,
                                     LWTypes::NodeOutputID    node_output_id,
                                     LWTypes::NodeValue       node_value )
        {
            // Only one output
            if (node_output_id != Output_Out)
                return;
            
            // Coordinate evaluation with destruction. Thank you VPR. :(
            if (!Mark_Entering_Evaluation())
                return;
            
            // Retrieve cached value, evaluating and caching if necessary
            double output = 0.0;
            
            if (has_Valid_Cache())
            {
                if (!Value_Cache_p->Get_Cached_Value(output, nodal_access))
                {
                    Input_Functions->evaluate       ( Input_In, nodal_access, &output );
                    Value_Cache_p->Set_Cached_Value ( output,   nodal_access );
                }
            }
            else
                Input_Functions->evaluate ( Input_In, nodal_access, &output );
            
            // Set output and coordinate with VPR
            Output_Functions->setValue( node_value, &output );
            
            Mark_Leaving_Evaluation();
        }
    The last thing we have to do is coordinate our destroy() call. It pained me to do it this way, but it works -- when destroy() is called, we first mark the object for destruction (so no new threads will increment our evaluation counter), then wait until that counter hits zero.

    Here are the helper methods for that:

    Code:
            // -------------------------------------------------------------------- Mark_for_Destruction()
            void Mark_for_Destruction()
            {
                // Attempt lock
                if (!Threading_Functions->groupLockMutex( Threading_Group_ID, MutexID_Object_Lock ) )
                    throw std::runtime_error("Cache_Node_Base::Mark_for_Destruction() -- Cannot lock object");
                
                // Let the object know the end is nigh...
                Destroy_Requested = true;
                
                // Unlock
                Threading_Functions->groupUnlockMutex( Threading_Group_ID, MutexID_Object_Lock );
            }
            
            // -------------------------------------------------------------------- is_In_Evaluation()
            bool is_In_Evaluation()
            {
                // Attempt lock
                if (!Threading_Functions->groupLockMutex( Threading_Group_ID, MutexID_Object_Lock ) )
                    throw std::runtime_error("Cache_Node_Base::is_In_Evaluation() -- Cannot lock object");
                
                // Get evaluation status
                bool evaluation_status = count_In_Evaluation > 0;
                
                // Unlock and return
                Threading_Functions->groupUnlockMutex( Threading_Group_ID, MutexID_Object_Lock );
                
                return evaluation_status;
            }
    The destroy() method marks the object as ready for destruction, then repeatedly checks the evaluation counter until no more evaluations are in progress:

    Code:
            // ------------------------------------------------------------------------ Destroy()
            XCALL_(static void)
            Destroy (void * cache_node_vptr)
            {
                // Get strongly-typed pointer to object
                Derived_t * derived_cache_node = static_cast< Derived_t* > (cache_node_vptr);
                
                // Indicate to other threads we wish to destroy object
                derived_cache_node->Mark_for_Destruction();
                
                // Wait until no more threads are evaluating the node
                while (derived_cache_node->is_In_Evaluation())
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                
                // Destroy object
                delete derived_cache_node;
            }
    I can now successfully manipulate the cache node in the node editor while VPR is running. No crashes. And all without locking up the entire evaluation method.
    Last edited by jrandom; 03-17-2013 at 11:35 AM.

  13. #13
    obfuscated SDK hacker Lightwolf's Avatar
    Join Date
    Feb 2003
    Location
    Stuttgart, Germany
    Posts
    13,711
    That's a neat trick... I'm just not sure whether there's a fluke chance that it may break - if evaluate is called right as the delete itself is taking place and the mutex is already gone (assuming it is a member variable).

    I'll need to whip up a test scene to see if the limbo state would help in delete.

    Cheers,
    Mike

  14. #14
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143
    I used the same lock for both evaluation and deletion, and I check Destroy_Requested in the Mark_Entering_Evaluation() method, so... lemme think for a second...

    If evaluate is called as deletion is taking place, then either Destroy_Requested is already set and the evaluation aborts, or Destroy_Requested has not yet been set -- in which case Mark_Entering_Evaluation() takes the lock, destroy() tries to lock the same mutex and has to wait, Mark_Entering_Evaluation() increments the counter and unlocks, and only then does destroy() set the Destroy_Requested flag and busy-wait on the counter.

    I think that's solid, but I'm not 100% certain. Threading can be such a nightmare.

    Edit: Whoops, read your comment wrong. Okay, so if the C++ delete is called as an evaluation starts... that would break if I were using auto-nulling wrappers around LightWave's mutex IDs. In fact, this is why the C++ version broke in the first place -- the caching node stores the pointer to the value cache class in a std::unique_ptr, which auto-nulls on destruction, whereas the C version didn't, so the values in memory were freed but unaltered and still technically accessible in Evaluate().

    However, since I'm just using the straight-up LightWave mutex ID value type, I believe it's either crash-proof, or so unlikely to crash as to be practically crash-proof. Interestingly enough, this would be more crash-prone if built with a C++ compiler that auto-nulls POD members upon class destruction. I think the Xbox 360 works this way.
    Last edited by jrandom; 03-17-2013 at 01:10 PM.

  15. #15
    eye kan kode gud jrandom's Avatar
    Join Date
    Dec 2009
    Location
    Seattle, WA
    Posts
    1,143
    P.S. -- Limbo state? What is that?
