ImageNet images are all different sizes, but neural networks need a fixed-size input.
One solution is to take the largest crop that fits in the image, centered on the image's center point. This works but has some drawbacks. Often, important parts of the object of interest are cut out, and there are even cases where the correct object is missing entirely while an object belonging to a different class remains visible, so the model is trained on a wrong label for that image.
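For concreteness, here's roughly what I mean by the center-crop approach (just a minimal sketch using Pillow; the output size of 224 is an arbitrary choice):

```python
from PIL import Image

def center_crop_resize(path, out_size=224):
    """Take the largest centered square crop, then resize to out_size x out_size."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)              # largest square that fits inside the image
    left = (w - side) // 2
    top = (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((out_size, out_size), Image.BILINEAR)
```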
Another solution would be to use the entire image and zero-pad it so that every image has the same dimensions. This seems like it would interfere with training, though: the model might learn to look for vertical/horizontal bands of black near the edges of images.
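And here's the padding approach I have in mind (again just a sketch, assuming Pillow and NumPy; scale the longer side down to the target size, then pad the shorter side with zeros):

```python
import numpy as np
from PIL import Image

def pad_to_square(path, out_size=224):
    """Resize so the longer side equals out_size, then zero-pad the shorter side."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = out_size / max(w, h)
    img = img.resize((max(1, round(w * scale)), max(1, round(h * scale))), Image.BILINEAR)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)  # black background
    nw, nh = img.size
    top = (out_size - nh) // 2
    left = (out_size - nw) // 2
    canvas[top:top + nh, left:left + nw] = np.asarray(img)
    return canvas
```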
What is commonly done?