Augmenting Data with Semantic Features

With this experiment, we build an engine that can estimate the value of a label given one or more sparse features. A sparse feature may be a piece of text, a set of items, or a combination of other sparse features. The output estimates, called semantic features, can be used as features in other machine-learned models or in hand-tuned logic. Compared with common alternatives such as learned embeddings, features produced by this experiment are as accurate, while being interpretable and debuggable.


Intended use

Inputs and Outputs


  • Training inputs: A set of examples, each containing one or more sparse features (a text string that can be broken into a set of n-grams, or a set of items) and one or more numerical labels that reflect the target values to predict. One example is the Kaggle Kickstarter problem, which predicts how likely a campaign is to succeed given its name and some other features. In this case, the sparse feature can be the campaign name (a text string), and the label can be the binary outcome of “success” or “fail”, or a score indicating how successful the campaign is.
  • Training output: A lookup table that allows for predicting future values of labels given the values of the sparse features.


  • Evaluation inputs: The same type of sparse features used at training time. For example, the name of a campaign.
  • Evaluation output: The predicted values of the labels defined in the original training run. In the example above, this would be the predicted probability of a campaign being successful, e.g. P(succeed | “Superhero Teddy Bear”).
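
To make the idea of a lookup table concrete, here is a minimal Python sketch. It is not the experiment's actual algorithm, which is not described here. It assumes the simplest possible scheme: each word n-gram maps to the mean label of the training examples that contain it, and a prediction averages the entries of the n-grams found in the input. The function names (`ngrams`, `train`, `predict`) and the four toy campaigns with their labels are invented for illustration.

```python
from collections import defaultdict


def ngrams(text, max_n=2):
    """Return the set of lowercased word n-grams of length 1..max_n."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def train(examples, max_n=2):
    """Build a table mapping each n-gram to the mean label of examples containing it.

    Assumption: a plain per-n-gram average, chosen only to illustrate the idea.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total = 0.0
    for text, label in examples:
        total += label
        for gram in ngrams(text, max_n):
            sums[gram] += label
            counts[gram] += 1
    table = {gram: sums[gram] / counts[gram] for gram in sums}
    prior = total / len(examples)  # fallback for inputs with no known n-gram
    return table, prior


def predict(table, prior, text, max_n=2):
    """Average the table entries of the n-grams seen in training, else use the prior."""
    known = [table[g] for g in ngrams(text, max_n) if g in table]
    return sum(known) / len(known) if known else prior


# Made-up training data: (campaign name, 1 = success, 0 = fail).
examples = [
    ("Superhero Teddy Bear", 1),
    ("Teddy Bear Picnic Kit", 1),
    ("Generic Phone Case", 0),
    ("Phone Case With Stand", 0),
]
table, prior = train(examples)
estimate = predict(table, prior, "Teddy Bear Lamp")
```

Because every prediction is an average of named table entries, an unexpected estimate can be traced to the specific n-grams that produced it, which is the kind of interpretability the experiment aims for.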

Technical challenges:

This approach can be used in various machine learning tasks that involve predicting a label using sparse features, such as text, categorical features with a large and/or unknown number of possible values, or an unordered set of items, possibly also with other features.

This experiment is most helpful when sparse features are used to predict a label and the user wants the predictions to be easy to explain and debug.

As part of the application to participate in this experiment, we will ask you about your use case, data types, and/or other relevant questions to ensure that the experiment is a good fit for you.

What data do I need?

Data and label types:

This experiment has been designed to help customers convert sparse features into meaningful dense features.

  • It is likely to be effective with sparse features that are short texts, sets of items, or a small number of such sparse features in combination (e.g. three or fewer). For example, in the Kaggle Kickstarter problem described above, the sparse features can include the campaign name (short text) and its category (a set with a single item).
  • It may not be effective with long texts, or highly sparse categorical features that cannot be broken into smaller parts.
  • The number of training samples needed depends on how sparse the input features are and how many of them are used. The sparser the features and the more of them there are, the more examples are needed. More training samples usually lead to better accuracy. In general, a good range is tens of thousands to hundreds of millions of samples.

We do not accept data or labels with personally identifiable information (e.g. names or email addresses).


  • Data specs
    • The input dataset must contain at least one sparse feature and one numerical label.
    • The dataset must be stored as TSV or CSV, with a header row containing the feature and label names.
  • Configuration specs
    • The names of the sparse features in the dataset, their types (Text or Set), and the delimiter that can be used to break each of them into items.
    • The names of the labels.
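
As a rough illustration of how the data and configuration specs fit together, the Python sketch below reads a small TSV with a header row and splits each sparse feature into items using its configured delimiter. The configuration layout (the `sparse_features` and `labels` keys), the `load_examples` helper, and the two data rows are all hypothetical. The experiment's real configuration format is not documented here.

```python
import csv
import io

# Hypothetical configuration: feature names, types (Text or Set), delimiters, and labels.
# The key names are illustrative only, not the experiment's actual schema.
config = {
    "sparse_features": {
        "name": {"type": "Text", "delimiter": " "},
        "category": {"type": "Set", "delimiter": ","},
    },
    "labels": ["success"],
}

# Made-up TSV content with a header row of feature/label names.
tsv_data = (
    "name\tcategory\tsuccess\n"
    "Superhero Teddy Bear\ttoys\t1\n"
    "Generic Phone Case\taccessories,tech\t0\n"
)


def load_examples(handle, config):
    """Parse TSV rows into item lists per sparse feature plus numeric labels."""
    reader = csv.DictReader(handle, delimiter="\t")
    examples = []
    for row in reader:
        features = {
            fname: [item for item in row[fname].split(spec["delimiter"]) if item]
            for fname, spec in config["sparse_features"].items()
        }
        labels = {lname: float(row[lname]) for lname in config["labels"]}
        examples.append({"features": features, "labels": labels})
    return examples


examples = load_examples(io.StringIO(tsv_data), config)
```

In practice `handle` would be an open file rather than an in-memory string. For a CSV file, the `delimiter` argument to `csv.DictReader` would be a comma instead of a tab.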

What skills do I need?

As with all AI Workshop experiments, successful users are likely to be savvy with core AI concepts and skills in order to both deploy the experiment technology and interact with our AI researchers and engineers.