Building energy data has been used for decades to understand energy flows in buildings and plan for future energy demand. Recent market, technology and policy drivers have resulted in widespread data collection by stakeholders across the buildings industry. Consolidation of independently collected and maintained datasets presents a cost-effective opportunity to build a database of unprecedented size. Applications of the data include peer group analysis to evaluate building performance, and data-driven algorithms that use empirical data to estimate energy savings associated with building retrofits. This paper discusses technical considerations in compiling such a database using the DOE Buildings Performance Database (BPD) as a case study. We gathered data on over 750,000 residential and commercial buildings. We describe the process and challenges of mapping and cleansing data from disparate sources. We analyze the distributions of buildings in the BPD relative to the Commercial Building Energy Consumption Survey (CBECS) and Residential Energy Consumption Survey (RECS), evaluating peer groups of buildings that are well or poorly represented, and discussing how differences in the distributions of the three datasets impact use-cases of the data. Finally, we discuss the usefulness and limitations of the current dataset and the outlook for increasing its size and applications.