The goal of this work is to investigate the impact of missing values on clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering on life course trajectories, is challenging due to the categorical and multidimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues have previously been investigated on single-variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset of mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, and dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an “edit” distance using optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Comparing variables with binary values against those with multiple nominal values, we find that missing data problems are more difficult to overcome in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can result in systematic biases in dissimilarity matrices and subsequently introduce both artificial clusters and unrealistic interpretations of associated data domains. We demonstrate the use of t-distributed stochastic neighbor embedding (t-SNE) to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.

Lazar, Alina; Jin, Ling; Spurlock, Anna; Wu, Kesheng; Sim, Alex; Todd, Annika
http://dl.acm.org/citation.cfm?doid=3317030
http://dl.acm.org/citation.cfm?doid=3317030.3301294
http://dl.acm.org/ft_gateway.cfm?id=3301294&ftid=2043165&dwn=1

Comparison of Clustering Techniques for Residential Energy Behavior using Smart Meter Data
San Francisco, CA, 2017
Keywords: cluster analysis; cluster validity metrics; load shape time series

Jin, Ling; Lee, Doris; Sim, Alex; Borgeson, Sam; Wu, Kesheng; Spurlock, Anna; Todd, Annika
https://www.aaai.org/ocs/index.php/WS/AAAIW17/paper/view/15166/14673

Data quality challenges with missing values and mixed types in joint sequence analysis
Boston, MA, USA: IEEE, 12/2017

The goal of this paper is to investigate the impact of missing values in categorical time series sequences on common data analysis tasks. Being able to more effectively identify patterns in socio-demographic longitudinal data is an important component in a number of social science settings. However, performing fundamental analytical operations, such as clustering for grouping these data based on similarity patterns, is challenging due to the categorical and multi-dimensional nature of the data, and their corruption by missing and inconsistent values.
To study these data quality issues, we employ longitudinal sequence data representations, a similarity measure designed for categorical and longitudinal data, and state-of-the-art clustering methodologies based on hierarchical algorithms. The key to quantifying the similarity and difference among data records is a distance metric. Given the categorical nature of our data, we employ an “edit”-type distance using Optimal Matching (OM). Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single similarity measure. Comparing variables with binary values against those with multiple nominal values, we find that missing data problems are harder to overcome in the nominal domain than in the binary domain. Additionally, artificial clusters introduced by the alignment of leading missing values can be resolved by tuning the missing value substitution cost parameter.
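
The role of the missing value substitution cost in an OM distance can be illustrated with a minimal sketch. It is not the authors' implementation: the missing-state marker `*`, the function names, the default costs, and the choice to treat "missing" as one extra state (so that missing aligned with missing costs nothing) are all assumptions made for illustration.

```python
MISSING = "*"  # hypothetical marker for a missing state (assumption)


def substitution_cost(a, b, missing_cost=2.0, mismatch_cost=2.0):
    """Cost of substituting state a with state b.

    Assumption: "missing" is treated as one extra state, so two missing
    values match at zero cost, while missing against an observed state
    pays the tunable missing_cost.
    """
    if a == b:
        return 0.0
    if a == MISSING or b == MISSING:
        return missing_cost
    return mismatch_cost


def om_distance(s, t, indel=1.0, missing_cost=2.0, mismatch_cost=2.0):
    """Optimal matching (edit) distance between two state sequences.

    Standard dynamic program over insertions, deletions and substitutions.
    """
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel  # delete the first i states of s
    for j in range(1, m + 1):
        d[0][j] = j * indel  # insert the first j states of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = substitution_cost(s[i - 1], t[j - 1], missing_cost, mismatch_cost)
            d[i][j] = min(
                d[i - 1][j] + indel,      # deletion
                d[i][j - 1] + indel,      # insertion
                d[i - 1][j - 1] + sub,    # substitution or match
            )
    return d[n][m]


if __name__ == "__main__":
    partly_missing = list("**AAB")  # two leading missing values
    fully_observed = list("AAAAB")
    for cost in (2.0, 0.5):
        print(cost, om_distance(partly_missing, fully_observed, missing_cost=cost))
```

Under these assumptions, lowering `missing_cost` shrinks the distance between a sequence with leading missing values and an otherwise identical, fully observed sequence. This is one way such a parameter could keep incomplete records from separating into a cluster of their own.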

Lazar, Alina; Jin, Ling; Spurlock, Anna; Todd, Annika; Wu, Kesheng; Sim, Alex
https://energyanalysis.lbl.gov/publications/data-quality-challenges-missing

Predicting Baseline for Analysis of Electricity Pricing
09/2017, Vol. 5, pp. 3-20

To understand the impact of a new pricing structure on residential electricity demand, we need a baseline model that captures every factor other than the new price. The standard baseline is a randomized control group; however, a good control group is hard to design. This motivates us to develop data-driven approaches. We explored many techniques and designed a strategy, named LTAP, that can predict hourly usage years ahead. The key challenge in this process is that the daily cycle of electricity demand peaks a few hours after the temperature reaches its peak. Existing methods rely on lagged variables of recent past usage to enforce this daily cycle, and they have trouble making predictions years ahead. LTAP avoids this problem by assuming the daily usage profile is determined by temperature and other factors. In a comparison against a well-designed control group, LTAP is found to produce accurate predictions.

Kim, Taehoon; Lee, Dongeun; Choi, Jaesik; Spurlock, Anna; Sim, Alex; Todd, Annika; Wu, Kesheng
http://www.inderscience.com/info/inarticle.php?artid=88269