Crop yield estimation is important for national food security, people's livelihoods, and the environment. Timely and accurate estimation of crop yield at the field scale is of great significance for crop management, harvest, and trade, and it ultimately enables farmers to optimize inputs and economic return. We selected an irrigated wheat field in a region near Kaifeng, Henan province, for this study. The terrain in that region is undulating and spatially variable. We used a low-altitude unmanned aerial vehicle (UAV) remote sensing platform equipped with a multispectral camera, a thermal infrared camera, and an RGB camera to simultaneously obtain different remote sensing parameters during the key growth stages of wheat. Based on the extracted spectral reflectance, thermal infrared temperature, and digital elevation information, we quantified the spatial variability of remote sensing parameters and growth indices under different terrain characteristics. We also analyzed the correlations of vegetation indices and temperature parameters with wheat yield. Using four machine learning methods, namely multiple linear regression (MLR), partial least squares regression (PLSR), support vector regression (SVR), and random forest regression (RFR), we compared the yield estimation capability of single-modal data with that of multimodal data fusion frameworks. The results showed that slope was an important factor affecting crop growth and yield. We observed significant differences in remote sensing parameters among slope grades. Soil water content, plant water content, and above-ground biomass at the three growth stages were significantly correlated with slope. Most of the vegetation indices and temperature parameters at the three growth stages were also significantly correlated with yield.
Based on the strength of their correlation with yield, seven vegetation indices (NDVI, GNDVI, EVI2, OSAVI, SAVI, NDRE, and WDRVI) and two temperature parameters (NRCT and CTD) were selected as the final input variables for the model. Within the single-modal data framework, the yield model built on the vegetation indices outperformed the model built on the temperature parameters, and the highest accuracy was obtained with an RFR model based on vegetation indices at the filling stage (R² = 0.724, RMSE = 614.72 kg hm⁻², MAE = 478.08 kg hm⁻²). For dual-modal data fusion, the highest accuracy was obtained at the flowering stage with an RFR model combining the temperature parameters and the vegetation indices (R² = 0.865, RMSE = 440.73 kg hm⁻², MAE = 374.86 kg hm⁻²). Accuracy was higher still with multimodal data fusion: an RFR model based on vegetation indices, temperature parameters, and slope information at the flowering stage reached R² = 0.893, RMSE = 420.06 kg hm⁻², and MAE = 352.69 kg hm⁻², and the best validation result was also obtained with flowering-stage fusion (R² = 0.892, RMSE = 423.55 kg hm⁻², MAE = 334.43 kg hm⁻²). The results show that a multimodal data fusion framework that incorporates terrain factors and is combined with RFR can fully exploit the complementary and synergistic roles of different remote sensing information sources. This effectively improves the accuracy and stability of the yield estimation model and provides a reference and support for crop growth monitoring and yield estimation.
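The three accuracy measures reported above (R², RMSE, and MAE) can be illustrated with a minimal sketch. It assumes that R² denotes the coefficient of determination, 1 - SS_res/SS_tot; the abstract does not state which definition was used, and a squared Pearson correlation would give different values. The function name `regression_metrics` and the sample yields are hypothetical and are not taken from the study.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, and MAE for measured vs. estimated yield.

    Assumption: R2 is the coefficient of determination
    (1 - SS_res / SS_tot), not the squared Pearson correlation.
    RMSE and MAE carry the units of the inputs (e.g. kg hm-2).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


if __name__ == "__main__":
    # Hypothetical plot-level yields in kg hm-2, for illustration only.
    measured = [6200.0, 7100.0, 5800.0, 6900.0]
    estimated = [6000.0, 7300.0, 6100.0, 6700.0]
    print(regression_metrics(measured, estimated))
```

Because RMSE squares each residual before averaging, it penalizes large errors more heavily than MAE does. This is consistent with every model reported above having an RMSE larger than its MAE.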