Abstract Segmentation of ultra-high resolution images is increasingly demanded, yet poses significant challenges for algorithm efficiency, in particular considering the (GPU) memory limits. Current approaches either downsample an ultra-high resolution image or crop it into small patches for separate processing. Either way, the loss of local fine details or global contextual information results in limited segmentation accuracy. We propose collaborative Global-Local Networks (GLNet) to effectively preserve both global and local information in a highly memory-efficient manner. GLNet is composed of a global branch and a local branch, taking the downsampled entire image and its cropped local patches as respective inputs. For segmentation, GLNet deeply fuses feature maps from the two branches, capturing both the high-resolution fine structures from zoomed-in local patches and the contextual dependency from the downsampled input. To further resolve the potential class imbalance problem between background and foreground regions, we present a coarse-to-fine variant of GLNet, also being
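The global-local input preparation and feature-map fusion described above can be sketched in a toy example. This is a minimal illustration, not the paper's implementation: the `downsample`, `crop_patch`, and `fuse` helpers are hypothetical names, branch outputs are stand-ins for real CNN feature maps, and the fusion is shown as simple channel concatenation of the spatially aligned global region with the local patch.

```python
import numpy as np

def downsample(img, factor):
    # naive stride-based downsampling of an H x W x C image
    # (a real pipeline would use bilinear interpolation)
    return img[::factor, ::factor, :]

def crop_patch(img, top, left, size):
    # extract a full-resolution local patch
    return img[top:top + size, left:left + size, :]

def upsample_nearest(feat, factor):
    # nearest-neighbor upsampling to match the patch resolution
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse(global_region, local_feat):
    # channel-wise concatenation of the aligned global-branch
    # region with the local-branch feature map
    return np.concatenate([global_region, local_feat], axis=-1)

# toy "ultra-high resolution" image: 512 x 512, 3 channels
img = np.zeros((512, 512, 3))

g = downsample(img, 4)            # 128 x 128 global input
p = crop_patch(img, 0, 0, 128)    # 128 x 128 local patch at full res

# the patch covers the top-left 128 x 128 region, which maps to the
# top-left 32 x 32 region of the 1/4-scale global input
g_region = g[0:32, 0:32, :]
g_up = upsample_nearest(g_region, 4)   # back to 128 x 128

fused = fuse(g_up, p)
print(fused.shape)   # (128, 128, 6): global + local channels combined
```

The key point the sketch captures is that both branches contribute spatially aligned information about the same region: the local patch supplies fine detail at full resolution, while the upsampled global crop supplies context from the whole downsampled image.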