Given a Module that has been symbolically traced, this example demonstrates how to write a pass that splits its nodes into a set of N partitions and reassembles the graph ...
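A minimal sketch of such a pass, assuming the existing `torch.fx.passes.split_module.split_module` utility is an acceptable way to do the splitting and reassembly. `TinyModel` and `split_into_n_partitions` are hypothetical names introduced only for illustration. Nodes are assigned to contiguous chunks in topological order, an assumption made here so that no dependency cycle can arise between partitions.

```python
import torch
import torch.fx
from torch.fx.passes.split_module import split_module


class TinyModel(torch.nn.Module):
    """Hypothetical example module; not taken from the original text."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        y = self.linear(x)
        z = torch.relu(y)
        return z + x


def split_into_n_partitions(module, n):
    """Trace `module`, split its compute nodes into `n` partitions, reassemble."""
    traced = torch.fx.symbolic_trace(module)

    # Only compute nodes are partitioned; inputs, attributes and the output
    # node are left for split_module to wire up.
    compute_nodes = [
        node
        for node in traced.graph.nodes
        if node.op not in ("placeholder", "get_attr", "output")
    ]
    total = len(compute_nodes)

    # Contiguous chunks in topological order: partition ids never decrease
    # along a dependency edge, so partitions cannot depend on each other
    # cyclically.
    partition_of = {
        node: (index * n) // total for index, node in enumerate(compute_nodes)
    }

    def assign_partition(node):
        # Fall back to partition 0 for any node not seen above.
        return partition_of.get(node, 0)

    # Returns a GraphModule whose submodules (submod_0, submod_1, ...) each
    # hold one partition and whose top-level graph calls them in order.
    return split_module(traced, module, assign_partition)
```

Calling `split_into_n_partitions(TinyModel(), 2)` should give a module that computes the same result as the original, with one child submodule per non-empty partition. A round-robin assignment is also possible, but it can create cycles between partitions for some graphs, which is why contiguous chunks are used in this sketch.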
Inductor splits a large reduction op into a few Triton kernels. In this case, the FX graph segment returned from get_kernel_metadata does not work. For example: def simple_sum_reduction(x): return torch.sum ...
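To make the idea of a split reduction concrete, here is a conceptual sketch in plain PyTorch. It is not Inductor's implementation and does not generate Triton kernels; `two_stage_sum` and `num_chunks` are hypothetical names. It only illustrates how one large sum can be computed as several partial reductions followed by a final reduction, which is why a single source-level op can end up corresponding to more than one kernel.

```python
import torch
import torch.nn.functional as F


def two_stage_sum(x, num_chunks=4):
    """Sum all elements of a non-empty tensor `x` in two stages."""
    flat = x.reshape(-1)

    # Pad with zeros so the data divides evenly into num_chunks pieces;
    # zeros do not change the sum.
    pad = (-flat.numel()) % num_chunks
    if pad:
        flat = F.pad(flat, (0, pad))

    # Stage 1: each chunk is reduced independently (conceptually one kernel).
    partial_sums = flat.reshape(num_chunks, -1).sum(dim=1)

    # Stage 2: the partial results are reduced to a scalar (a second kernel).
    return partial_sums.sum()
```

For `torch.arange(10, dtype=torch.float32)` and four chunks, the partial sums are 3, 12, 21 and 9, and the final result is 45, matching `torch.sum`.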