
Fix caching allocator of out-of-tree device is destructed before the destruction of tensors cached by autocast #126677

Closed

Conversation

@cdzhan (Contributor) commented May 20, 2024


Root Cause

An out-of-tree device extension is loaded after torch (it lives in a different .so), so the global variable cached_casts may be constructed before the extension's caching allocator. Static objects are destructed in reverse order of construction at process exit, so the caching allocator is torn down first, and the tensors still held by cached_casts are then freed against an allocator that no longer exists.

Fix

Lazily initialize cached_casts so it is constructed on first use rather than at library load, which corrects the construction (and hence destruction) order relative to the caching allocator.
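In essence, the change replaces the namespace-scope cached_casts global with a function-local static behind an accessor. A minimal sketch of the pattern, using simplified placeholder types rather than the actual autocast code:

```cpp
#include <memory>
#include <unordered_map>

// Placeholder key/value types for illustration only; the real cached_casts
// in aten/src/ATen/autocast_mode.cpp keys on the source tensor and target
// dtype and stores the cached, casted Tensor.
using CacheKey = const void*;
using CacheValue = std::shared_ptr<int>;

// Before: a namespace-scope global. Its constructor runs when libtorch's
// static initializers run, typically before the out-of-tree extension (and
// its caching allocator) is loaded, so at exit it is destructed after the
// allocator and frees its cached tensors against a dead allocator.
//
// static std::unordered_map<CacheKey, CacheValue> cached_casts;

// After: a function-local static ("Meyers singleton"). Construction, and the
// registration of its destructor, are deferred to the first call. As long as
// that first use happens after the extension's allocator exists, the cache is
// destructed before the allocator at exit.
static std::unordered_map<CacheKey, CacheValue>& get_cached_casts() {
  static std::unordered_map<CacheKey, CacheValue> cached_casts;
  return cached_casts;
}

// All former direct accesses to cached_casts go through the accessor, e.g.:
void clear_autocast_cache_example() {
  get_cached_casts().clear();
}
```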

How to Reproduce && Test

Modify the test case TestAutocastGPU.test_cast_cache_is_global in test/test_autocast.py to run on your out-of-tree device. You will see the following failure at the end of the test.

----------------------------------------------------------------------                                                                                                                                                                                                                                              
Ran 1 test in 4.812s                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                    
OK                                                                                                                                                                                                                                                                                                                  
free: 0x30080ff44000400                                                                                                                                                                                                                                                                                             
terminate called after throwing an instance of 'c10::Error'                                                                                                                                                                                                                                                         
  what():  invalid device pointer: 0x30080ff44000400                                                                                                                                                                                                                                                                
Exception raised from free at /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/framework/core/caching_allocator.cpp:1609 (most recent call first):                                                                                                                                                 
frame #0: <unknown function> + 0x118fe1 (0x7ffaef4d3fe1 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                                        
frame #1: <unknown function> + 0x11b1c4 (0x7ffaef4d61c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                                        
frame #2: <unknown function> + 0x117677 (0x7ffaef4d2677 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                                        
frame #3: <unknown function> + 0x11a2bf (0x7ffaef4d52bf in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #4: <unknown function> + 0x11a186 (0x7ffaef4d5186 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #5: <unknown function> + 0x119fde (0x7ffaef4d4fde in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #6: <unknown function> + 0x119d2e (0x7ffaef4d4d2e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #7: <unknown function> + 0x119be0 (0x7ffaef4d4be0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #8: <unknown function> + 0x119977 (0x7ffaef4d4977 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #9: <unknown function> + 0x119313 (0x7ffaef4d4313 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #10: <unknown function> + 0x118b4c (0x7ffaef4d3b4c in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                  
frame #11: c10::Error::Error(c10::SourceLocation, std::string) + 0x34 (0x7ffaef4d27c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)   
frame #12: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x7f (0x7ffaef4d04ed in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                      
frame #13: torch_mlu::MLUCachingAllocator::Native::NativeCachingAllocator::free(void*) + 0xe6 (0x7ff9a8eeb112 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)                                                                                                             
frame #14: torch_mlu::MLUCachingAllocator::Native::local_raw_delete(void*) + 0x3b (0x7ff9a8ed9480 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)
frame #15: std::unique_ptr<void, void (*)(void*)>::~unique_ptr() + 0x50 (0x7ffb0a5ea322 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x1269890 (0x7ffb0a5e4890 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)                   
frame #17: <unknown function> + 0x1269928 (0x7ffb0a5e4928 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)                   
frame #18: <unknown function> + 0x127572c (0x7ffb0a5f072c in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x1275758 (0x7ffb0a5f0758 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #20: <unknown function> + 0xb9bc7 (0x7ffaef474bc7 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                                        
frame #21: <unknown function> + 0xb97bc (0x7ffaef4747bc in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #22: <unknown function> + 0xdbc50 (0x7ffaef496c50 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #23: c10::TensorImpl::~TensorImpl() + 0x82 (0x7ffaef49157e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                               
frame #24: c10::TensorImpl::~TensorImpl() + 0x1c (0x7ffaef4915aa in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                               
frame #25: <unknown function> + 0x2f596d9 (0x7ffaf24fc6d9 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #26: <unknown function> + 0x2f589c2 (0x7ffaf24fb9c2 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #27: <unknown function> + 0x2f57b92 (0x7ffaf24fab92 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                                                                                                                                                                                
frame #28: <unknown function> + 0x2f5c228 (0x7ffaf24ff228 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #29: <unknown function> + 0x30f3f70 (0x7ffaf2696f70 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #30: <unknown function> + 0x30f3f90 (0x7ffaf2696f90 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                                                                                                                                                                                
frame #31: <unknown function> + 0x30f5004 (0x7ffaf2698004 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x30f5024 (0x7ffaf2698024 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #33: <unknown function> + 0x31207f0 (0x7ffaf26c37f0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #34: <unknown function> + 0x3120814 (0x7ffaf26c3814 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x30f51e8 (0x7ffaf26981e8 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #36: <unknown function> + 0x30f5148 (0x7ffaf2698148 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #37: <unknown function> + 0x316ecea (0x7ffaf2711cea in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                                                                                                                                                                                
frame #38: <unknown function> + 0x468a7 (0x7ffb0c9ed8a7 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                                                                                                                         
frame #39: on_exit + 0 (0x7ffb0c9eda60 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                
<omitting python frames>                                                                                                                                  
frame #47: __libc_start_main + 0xf3 (0x7ffb0c9cb083 in /lib/x86_64-linux-gnu/libc.so.6)                                                                   
                                                                                                                                                          
Aborted (core dumped)                                                                                                                                     

cc @mcarilli @ptrblck @leslie-fang-intel @jgong5 @albanD @ezyang

pytorch-bot bot commented May 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126677

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 086e2da with merge base d9c3485:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@albanD (Collaborator) commented May 20, 2024

Hey!
This will not help if the cache is initialized before the extension is loaded though?

I would be curious in general: how do we handle a Tensor on a privateuse1 device that outlives the .so providing that support (let's say we manually unload the shared lib to trigger this, without static initialization ordering being involved)? @FFFrog might have an idea?

@drisspg added the triaged label May 20, 2024
@ezyang (Contributor) left a comment

Even if it doesn't completely solve the problem, this seems harmless enough

@ezyang (Contributor) commented May 21, 2024

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label May 21, 2024
@pytorchmergebot (Collaborator)

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@ezyang added the topic: not user facing label May 21, 2024
@ezyang (Contributor) commented May 21, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@cdzhan deleted the cdzhan-patch-4 branch May 21, 2024 06:20
@cdzhan (Contributor, Author) commented May 21, 2024

> Hey! This will not help if the cache is initialized before the extension is loaded though?

Yes, currently there is such a possibility. For example, torch.clear_autocast_cache might be called before the extension package is imported. In fact, from what I can recall, several issues have been caused by extension packages not being loaded promptly, so we generally require users to load our extension package as early as possible to enable the functionality it provides (normally the user should import the extension immediately after importing torch). Of course, we should try to avoid this situation as much as possible, and I believe this kind of problem will improve greatly once autoload is supported in the future.

@FFFrog (Collaborator) commented May 21, 2024

> Hey! This will not help if the cache is initialized before the extension is loaded though?
>
> I would be curious in general: how do we handle a Tensor on a privateuse1 device that outlives the .so providing that support (let's say we manually unload the shared lib to trigger this, without static initialization ordering being involved)? @FFFrog might have an idea?

Sorry, I can't find any other good ideas to solve this problem other than trying to avoid it.

In my opinion, defining a global variable directly is almost the same as defining a static variable inside a function, with just a few differences:

Similarities:
The destruction of such variables is achieved by registering their destructors with the atexit mechanism. The registered functions run in the reverse order of registration, so the earlier a destructor is registered, the later it executes.

Differences:
Defining a global variable directly: registration is triggered when the .so is loaded (and, of course, dependent .so files are registered first).
Defining a static variable in a function: registration is triggered the first time the function is called.

Therefore, we can defer the registration of cached_casts via the latter, and it will then be destructed first (before the allocator is destructed) when the program exits.
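For illustration, here is a small standalone C++ program (not PyTorch code; all names are hypothetical stand-ins for cached_casts and the extension's allocator) that makes this registration and destruction order observable:

```cpp
#include <cstdio>

// Prints when it is constructed and destructed, so the exit-time order is visible.
struct Tracer {
  const char* name;
  explicit Tracer(const char* n) : name(n) { std::printf("construct %s\n", name); }
  ~Tracer() { std::printf("destruct  %s\n", name); }
};

// Registered with the exit machinery as soon as static initializers run,
// i.e. when the .so containing it is loaded (think: cached_casts as a plain
// global in libtorch).
static Tracer eager_cache{"eager_cache"};

// Registered only on the first call (think: cached_casts behind an accessor).
static Tracer& lazy_cache() {
  static Tracer t{"lazy_cache"};
  return t;
}

int main() {
  // Stands in for the extension's caching allocator, constructed after the
  // library holding eager_cache was already loaded.
  static Tracer allocator{"allocator"};
  lazy_cache();  // first use: its destructor is registered last of the three
  return 0;
  // Printed destruction order (reverse of registration):
  //   destruct lazy_cache
  //   destruct allocator
  //   destruct eager_cache
  // The lazily initialized cache is torn down before the allocator, while the
  // eager global would outlive it.
}
```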

I completely agree that autoloading would fundamentally avoid this problem, but it only avoids it rather than solving it.

Labels
ciflow/trunk, Merged, module: amp (automated mixed precision), open source, topic: not user facing, triaged