
Fix caching allocator of out-of-tree device is destructed before the destruction of tensors cached by autocast #126677

Closed

Conversation

@cdzhan (Contributor) commented May 20, 2024


Root Cause

An out-of-tree device extension is loaded after torch (it lives in a different .so), so the global variable cached_casts may be constructed before the extension's caching allocator. Static objects are destructed in reverse order of construction at process exit, so the caching allocator is torn down first, and the tensors still held by cached_casts are then freed against an allocator that no longer exists.

Fix

Lazily initialize cached_casts so it is constructed on first use rather than at library load, which corrects the construction (and hence destruction) order relative to the caching allocator.
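In essence, the change replaces the namespace-scope cached_casts global with a function-local static behind an accessor. A minimal sketch of the pattern, using simplified placeholder types rather than the actual autocast code:

```cpp
#include <memory>
#include <unordered_map>

// Placeholder key/value types for illustration only; the real cached_casts
// in aten/src/ATen/autocast_mode.cpp keys on the source tensor and target
// dtype and stores the cached, casted Tensor.
using CacheKey = const void*;
using CacheValue = std::shared_ptr<int>;

// Before: a namespace-scope global. Its constructor runs when libtorch's
// static initializers run, typically before the out-of-tree extension (and
// its caching allocator) is loaded, so at exit it is destructed after the
// allocator and frees its cached tensors against a dead allocator.
//
// static std::unordered_map<CacheKey, CacheValue> cached_casts;

// After: a function-local static ("Meyers singleton"). Construction, and the
// registration of its destructor, are deferred to the first call. As long as
// that first use happens after the extension's allocator exists, the cache is
// destructed before the allocator at exit.
static std::unordered_map<CacheKey, CacheValue>& get_cached_casts() {
  static std::unordered_map<CacheKey, CacheValue> cached_casts;
  return cached_casts;
}

// All former direct accesses to cached_casts go through the accessor, e.g.:
void clear_autocast_cache_example() {
  get_cached_casts().clear();
}
```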

How to Reproduce && Test

Modify the test case TestAutocastGPU.test_cast_cache_is_global in test/test_autocast.py to run on your out-of-tree device. You will see the following failure at the end of the test.

----------------------------------------------------------------------                                                                                                                                                                                                                                              
Ran 1 test in 4.812s                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                    
OK                                                                                                                                                                                                                                                                                                                  
free: 0x30080ff44000400                                                                                                                                                                                                                                                                                             
terminate called after throwing an instance of 'c10::Error'                                                                                                                                                                                                                                                         
  what():  invalid device pointer: 0x30080ff44000400                                                                                                                                                                                                                                                                
Exception raised from free at /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/framework/core/caching_allocator.cpp:1609 (most recent call first):                                                                                                                                                 
frame #0: <unknown function> + 0x118fe1 (0x7ffaef4d3fe1 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                                        
frame #1: <unknown function> + 0x11b1c4 (0x7ffaef4d61c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                                        
frame #2: <unknown function> + 0x117677 (0x7ffaef4d2677 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                                        
frame #3: <unknown function> + 0x11a2bf (0x7ffaef4d52bf in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #4: <unknown function> + 0x11a186 (0x7ffaef4d5186 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #5: <unknown function> + 0x119fde (0x7ffaef4d4fde in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #6: <unknown function> + 0x119d2e (0x7ffaef4d4d2e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #7: <unknown function> + 0x119be0 (0x7ffaef4d4be0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #8: <unknown function> + 0x119977 (0x7ffaef4d4977 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #9: <unknown function> + 0x119313 (0x7ffaef4d4313 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #10: <unknown function> + 0x118b4c (0x7ffaef4d3b4c in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                  
frame #11: c10::Error::Error(c10::SourceLocation, std::string) + 0x34 (0x7ffaef4d27c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)   
frame #12: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x7f (0x7ffaef4d04ed in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                      
frame #13: torch_mlu::MLUCachingAllocator::Native::NativeCachingAllocator::free(void*) + 0xe6 (0x7ff9a8eeb112 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)                                                                                                             
frame #14: torch_mlu::MLUCachingAllocator::Native::local_raw_delete(void*) + 0x3b (0x7ff9a8ed9480 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)
frame #15: std::unique_ptr<void, void (*)(void*)>::~unique_ptr() + 0x50 (0x7ffb0a5ea322 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x1269890 (0x7ffb0a5e4890 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)                   
frame #17: <unknown function> + 0x1269928 (0x7ffb0a5e4928 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)                   
frame #18: <unknown function> + 0x127572c (0x7ffb0a5f072c in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x1275758 (0x7ffb0a5f0758 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #20: <unknown function> + 0xb9bc7 (0x7ffaef474bc7 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                                        
frame #21: <unknown function> + 0xb97bc (0x7ffaef4747bc in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #22: <unknown function> + 0xdbc50 (0x7ffaef496c50 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                              
frame #23: c10::TensorImpl::~TensorImpl() + 0x82 (0x7ffaef49157e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                               
frame #24: c10::TensorImpl::~TensorImpl() + 0x1c (0x7ffaef4915aa in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)                                                                                                                                                                               
frame #25: <unknown function> + 0x2f596d9 (0x7ffaf24fc6d9 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #26: <unknown function> + 0x2f589c2 (0x7ffaf24fb9c2 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #27: <unknown function> + 0x2f57b92 (0x7ffaf24fab92 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                                                                                                                                                                                
frame #28: <unknown function> + 0x2f5c228 (0x7ffaf24ff228 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #29: <unknown function> + 0x30f3f70 (0x7ffaf2696f70 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #30: <unknown function> + 0x30f3f90 (0x7ffaf2696f90 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                                                                                                                                                                                
frame #31: <unknown function> + 0x30f5004 (0x7ffaf2698004 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x30f5024 (0x7ffaf2698024 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #33: <unknown function> + 0x31207f0 (0x7ffaf26c37f0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #34: <unknown function> + 0x3120814 (0x7ffaf26c3814 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x30f51e8 (0x7ffaf26981e8 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #36: <unknown function> + 0x30f5148 (0x7ffaf2698148 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                      
frame #37: <unknown function> + 0x316ecea (0x7ffaf2711cea in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)                                                                                                                                                                                
frame #38: <unknown function> + 0x468a7 (0x7ffb0c9ed8a7 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                                                                                                                         
frame #39: on_exit + 0 (0x7ffb0c9eda60 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                
<omitting python frames>                                                                                                                                  
frame #47: __libc_start_main + 0xf3 (0x7ffb0c9cb083 in /lib/x86_64-linux-gnu/libc.so.6)                                                                   
                                                                                                                                                          
Aborted (core dumped)                                                                                                                                     

cc @mcarilli @ptrblck @leslie-fang-intel @jgong5 @albanD @ezyang

pytorch-bot bot commented May 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126677

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 086e2da with merge base d9c3485:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@albanD (Collaborator) commented May 20, 2024

Hey!
This will not help if the cache is initialized before the extension is loaded though?

I would be curious in general: how do we handle a Tensor on a privateuse1 device that outlives the .so providing that support (let's say we manually unload the shared lib to trigger this, without static initialization ordering being involved)? @FFFrog might have an idea?

@drisspg added the triaged label May 20, 2024
@ezyang (Contributor) left a comment

Even if it doesn't completely solve the problem, this seems harmless enough

@ezyang (Contributor) commented May 21, 2024

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label May 21, 2024
@pytorchmergebot (Collaborator)

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@ezyang added the topic: not user facing label May 21, 2024
@ezyang (Contributor) commented May 21, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@cdzhan deleted the cdzhan-patch-4 branch May 21, 2024 06:20
@cdzhan (Contributor, Author) commented May 21, 2024

> Hey! This will not help if the cache is initialized before the extension is loaded though?

Yes, currently there is such a possibility. For example, torch.clear_autocast_cache might be called before the extension package is imported. In fact, from what I can recall, several issues have been caused by extension packages not being loaded promptly, so we generally require users to load our extension package as early as possible to enable the functionality it provides (normally the user should import the extension immediately after importing torch). Of course, we should try to avoid this situation as much as possible, and I believe this kind of problem will improve greatly once autoload is supported in the future.

@FFFrog (Collaborator) commented May 21, 2024

> Hey! This will not help if the cache is initialized before the extension is loaded though?
>
> I would be curious in general: how do we handle a Tensor on a privateuse1 device that outlives the .so providing that support (let's say we manually unload the shared lib to trigger this, without static initialization ordering being involved)? @FFFrog might have an idea?

Sorry, I can't find any other good ideas to solve this problem other than trying to avoid it.

In my opinion, defining a global variable directly is almost the same as defining a static variable inside a function, with just a few differences:

Similarities:
The destruction of such variables is achieved by registering their destructors with the atexit mechanism. The registered functions run in the reverse order of registration, so the earlier a destructor is registered, the later it executes.

Differences:
Defining a global variable directly: registration is triggered when the .so is loaded (and, of course, dependent .so files are registered first).
Defining a static variable in a function: registration is triggered the first time the function is called.

Therefore, we can defer the registration of cached_casts via the latter, and it will then be destructed first (before the allocator is destructed) when the program exits.
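For illustration, here is a small standalone C++ program (not PyTorch code; all names are hypothetical stand-ins for cached_casts and the extension's allocator) that makes this registration and destruction order observable:

```cpp
#include <cstdio>

// Prints when it is constructed and destructed, so the exit-time order is visible.
struct Tracer {
  const char* name;
  explicit Tracer(const char* n) : name(n) { std::printf("construct %s\n", name); }
  ~Tracer() { std::printf("destruct  %s\n", name); }
};

// Registered with the exit machinery as soon as static initializers run,
// i.e. when the .so containing it is loaded (think: cached_casts as a plain
// global in libtorch).
static Tracer eager_cache{"eager_cache"};

// Registered only on the first call (think: cached_casts behind an accessor).
static Tracer& lazy_cache() {
  static Tracer t{"lazy_cache"};
  return t;
}

int main() {
  // Stands in for the extension's caching allocator, constructed after the
  // library holding eager_cache was already loaded.
  static Tracer allocator{"allocator"};
  lazy_cache();  // first use: its destructor is registered last of the three
  return 0;
  // Printed destruction order (reverse of registration):
  //   destruct lazy_cache
  //   destruct allocator
  //   destruct eager_cache
  // The lazily initialized cache is torn down before the allocator, while the
  // eager global would outlive it.
}
```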

I completely agree that autoloading would fundamentally avoid this problem, but it only avoids it rather than solving it.

Labels
ciflow/trunk, Merged, module: amp (automated mixed precision), open source, topic: not user facing, triaged