VecShare

This library provides functionality for rapidly sharing and retrieving word embeddings over the internet. (Accepted at EMNLP 2017).

VecShare: Framework for Sharing Word Embeddings

About VecShare

The vecshare Python library supports word embedding query, selection, and download. It uses indexers to regularly poll the data.world datastore for uploaded embeddings, record associated metadata, and generate lightweight signatures representing each uploaded embedding. Users can select an embedding either by specifying its name or by using the provided methods to compare their corpus against the indexed signatures and extract the embedding most similar to the target corpus.

Read more about VecShare: https://bit.ly/VecShare.

Embedding Leaderboard

Each indexed embedding is evaluated and assigned a score on 10 word pair similarity tasks. For each task, the Spearman correlation is measured between the cosine similarities of the word vectors and the human-rated similarities of the word pairs; the score is the average of these correlations.
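The scoring described above can be sketched in plain Python. This is an illustration only, not the indexer's code: the function names, the representation of a task as a list of (word1, word2, human_rating) triples, and the choice to skip pairs with a missing word are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (no constant-input guard)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


def task_score(vectors, rated_pairs):
    """Spearman correlation for one task; pairs with a missing word are skipped (assumption)."""
    model, human = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            human.append(rating)
    return spearman(model, human)


def leaderboard_score(vectors, tasks):
    """Average of the per-task Spearman correlations."""
    return sum(task_score(vectors, task) for task in tasks) / len(tasks)


# Toy data, invented for illustration.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.2, 0.8],
}
toy_task = [("cat", "dog", 9.0), ("car", "truck", 8.5), ("cat", "car", 1.0), ("dog", "truck", 2.0)]
toy_score = leaderboard_score(toy_vectors, [toy_task])
```

In the toy task the model ranks the four pairs in the same order as the human ratings, so the correlation is at its maximum.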

Highest Scoring Word Embeddings:

embedding_name      contributor  embedding_type  dimension  score
gnews_mod           jaredfern    word2vec        200        0.491233
glove_Gigaword100d  jaredfern    glove           100        0.456143
text8_emb           jaredfern    word2vec        50         0.37306
books_40            jaredfern    word2vec        100        0.303337
OANC_Written        jaredfern    word2vec        100        0.293891
econ_40             jaredfern    word2vec        100        0.290213
agriculture_40      jaredfern    word2vec        100        0.289704
govt_40             jaredfern    word2vec        100        0.288382
weather_40          jaredfern    word2vec        100        0.277633
arts_40             jaredfern    word2vec        100        0.266848

Word Pair Similarity Tasks:

Installation:

Install the VecShare Python library:

pip install vecshare

Before using the vecshare library, configure the datadotworld library with your API token. The token is available on data.world under Settings > Advanced.

Set your data.world token using:

dw configure

or

export DW_AUTH_TOKEN=<DATA.WORLD_API_TOKEN>

To avoid resetting the token, add your token as a permanent environment variable to your bash profile. See Advanced Setup for details on creating new indexers or signature methods.

Supported Functions

The VecShare Python library currently supports:

Check Available Embeddings

check(): Returns embeddings available with the current indexer as a queryable pandas.DataFrame.

The default indexer aggregates a set of embeddings by polling data.world weekly for datasets with the tag vecshare. Currently indexed embeddings are viewable at: https://data.world/jaredfern/vecshare-indexer.

See Advanced Setup if you would like to use a custom indexer.

For example:

>>> from vecshare import vecshare as vs
>>> vs.check()
        embedding_name                              dataset_name contributor  \
0            reutersr8         jaredfern/reuters-word-embeddings   jaredfern   
1         reuters21578         jaredfern/reuters-word-embeddings   jaredfern   
2                brown                    jaredfern/brown-corpus   jaredfern   
3   glove_gigaword100d        jaredfern/gigaword-glove-embedding   jaredfern   
4         oanc_written            jaredfern/oanc-word-embeddings   jaredfern   


   case_sensitive  dimension embedding_type file_format vocab_size
0           False        100       word2vec         csv       7821
1           False        100       word2vec         csv      20203     
2           False        100       word2vec         csv      15062     
3           False        100          glove         csv     399922
4           False        100       word2vec         csv      73127        

Embedding Upload or Update

Embeddings must be uploaded as a .csv file with a header in the format: [‘text’, ‘d0’, ‘d1’, … ‘d_n’], such that they can be properly indexed and accessed.

format(emb_path,vocab_size=None,dim=None,pca=False,precision=None,sep=","): Formats local embeddings for upload to the data store as needed:

1) A header is prepended to the file (text, d0, d1, …, dn).
2) Elements are delimited with ",".
3) The prefix line of the plain-text word2vec format (vocabulary size and dimension) is removed.
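To make the target file layout concrete, here is a minimal sketch of such a conversion using only the standard library. It is not the library's format() implementation: the function name word2vec_text_to_csv is hypothetical, and the sketch assumes whitespace-separated input and detects the prefix line as a first line made of exactly two integers.

```python
import csv
import io


def word2vec_text_to_csv(src, dst, sep=","):
    """Copy word vectors from a plain-text stream to a CSV stream with a header.

    src: text stream with one "<word> <v0> <v1> ..." entry per line, optionally
         preceded by a "<vocab_size> <dimension>" prefix line.
    dst: text stream that receives the CSV output.
    """
    rows = [line.split() for line in src if line.strip()]
    # Assumption: a first line of exactly two integers is the word2vec prefix line.
    if rows and len(rows[0]) == 2 and all(tok.isdigit() for tok in rows[0]):
        rows = rows[1:]
    if not rows:
        return
    dim = len(rows[0]) - 1
    writer = csv.writer(dst, delimiter=sep, lineterminator="\n")
    # Header in the documented layout: text, d0, d1, ...
    writer.writerow(["text"] + [f"d{i}" for i in range(dim)])
    writer.writerows(rows)


# Usage with in-memory streams; real files opened with open() work the same way.
source = io.StringIO("2 3\nthe 0.1 0.2 0.3\nfarm 0.4 0.5 0.6\n")
target = io.StringIO()
word2vec_text_to_csv(source, target)
csv_text = target.getvalue()
```

The two-integer heuristic would misfire on a one-dimensional embedding whose first word is a number, which is acceptable for a sketch but worth checking in real data.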

upload(set_name, emb_path, metadata = {}, summary = None): Creates a new shared embedding on data.world.

Alternatively, new embeddings can be added to the framework by uploading the embedding as a .csv file to data.world and tagging the dataset with the tag vecshare. The default indexer adds new embedding sets weekly.

Metadata associated with the embedding can be added to the dataset's description, one entry per line, in the format Field: Value.

For example:

Embedding Type: word2vec
Token Count: 6000000
Case Sensitive: False
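As an illustration of how a description in this format could be read back, the following sketch parses Field: Value lines into a dictionary. The function name parse_metadata is hypothetical, and the indexer's actual parsing rules may differ (for example in how it treats lines without a colon, which are ignored here).

```python
def parse_metadata(description):
    """Parse 'Field: Value' lines from a dataset description into a dict.

    Lines without a colon, or with an empty field or value, are ignored.
    Values are kept as strings; type conversion is left to the caller.
    """
    metadata = {}
    for line in description.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip() and value.strip():
            metadata[field.strip()] = value.strip()
    return metadata


example_description = """Embedding Type: word2vec
Token Count: 6000000
Case Sensitive: False"""

parsed = parse_metadata(example_description)
```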

Embedding Selection

signatures.avgrank(inp_dir): Returns the shared embedding most similar to the user’s target corpus, using the AvgRank method described in the VecShare paper. Note: Computation is performed locally. Users’ corpora are not shared with other users.

For example, to retrieve the vectors for specific words from a named embedding:

>>> from vecshare import vecshare as vs
>>> vs.query(['The', 'farm'], 'agriculture_40')
   text       d99       d98       d97       d96       d95   ...           d1      d0  
0   the -1.414755  0.414973  1.115698  0.034085  0.542921   ...   0.037287 -1.004704  
1  farm  0.349535 -0.379208 -0.189476  2.776809 -0.099886   ...   0.067443 -1.391604  
[2 rows x 101 columns]

Embedding Extraction

extract(emb_name, file_dir, set_name = None, download = False): Returns a pandas DataFrame containing all available word vectors for the target corpus’s vocabulary.

Parameters:

For example:

>>> from vecshare import vecshare as vs
>>> vs.extract('agriculture_40', 'Test_Input/reutersR8_all')
Embedding extraction begins.
100% (23584 of 23584) |################################| Elapsed Time: 0:01:04
Embedding successfully extracted.

              text       d99       d98       d97       d96       d95    ... \
0        designing -0.194328 -0.229856  0.455848  0.234053 -0.272354    ...
1       affiliated -0.446879 -0.519360  0.130626  0.034608  0.134680    ...
2    appropriately  0.106778  0.057186 -0.222296  0.101948  0.395122    ...
3       cincinnati -0.563716 -0.274534  0.120897  0.273457  0.383307    ...
4           choice  0.689276  1.586349  1.301351 -1.193058 -0.243053    ...
5              han -0.287583  0.237989 -0.141203  0.328414  0.401448    ...
6            begin  1.952841 -1.497073 -0.656650  2.443687  0.315941    ...
7        wednesday -1.591453 -1.419733 -0.758305  2.638620  0.323779    ...
8            wales -0.591623 -0.761353 -0.042557 -0.106776  0.004614    ...
9             much  1.971340 -2.316020  0.147194 -0.641963 -0.280868    ...

            d14       d13       d12       d11       d10        d1         d0
0      0.432226 -0.023887 -0.246207  0.429862  0.268280  0.283950   0.218664   
1      0.702217 -0.516346  0.273179  0.662874  0.106199 -0.011592   0.057832   
2     -0.174151 -0.069734 -0.255887  0.070181 -0.163013  0.093490   0.028913
3     -0.189739 -0.089899 -0.048192  0.569139  0.595834  0.421905  -0.241777
4     -1.085993 -0.054178  1.156616 -1.449286  0.267787  0.677337   2.148856  
5     -0.004664 -0.414933 -0.346377 -0.214976  0.201621  0.063539  -0.331673
6      1.587940 -0.258819  1.396479  0.637493 -1.476619 -0.487518   0.864765    
7      0.190376  0.881103  0.966915  1.543105  1.974099 -0.807656   0.800163  
8     -0.181255  0.005893 -0.718905  0.373082  0.784821  0.393715  -0.000517  
9      1.348299  0.180225  1.686486  0.535154 -2.005099 -1.424234  -2.677770    
[9320 rows x 101 columns]
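Conceptually, extraction keeps only the vectors whose words occur in the target corpus. The sketch below shows that idea on in-memory data; it is not the library's extract() implementation. The function names, the regular-expression tokenizer, and the lowercasing step (chosen because the indexed embeddings shown earlier are not case sensitive) are assumptions.

```python
import re


def corpus_vocabulary(texts):
    """Collect the set of lowercased word tokens that occur in the given texts."""
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z0-9']+", text.lower()))
    return vocab


def extract_vectors(embedding, texts):
    """Return the subset of an embedding whose words appear in the corpus.

    embedding: dict mapping word -> list of floats
    texts: iterable of document strings making up the target corpus
    """
    vocab = corpus_vocabulary(texts)
    return {word: vec for word, vec in embedding.items() if word in vocab}


# Toy embedding and corpus, invented for illustration.
toy_embedding = {"the": [0.1, 0.2], "farm": [0.3, 0.4], "stock": [0.5, 0.6]}
toy_corpus = ["The farm reported a record harvest."]
subset = extract_vectors(toy_embedding, toy_corpus)
```

Corpus words that have no vector in the embedding, such as "harvest" here, are simply absent from the result.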

Full Embedding Download

download(emb_name, set_name=None): Returns the full embedding, containing all uploaded word vectors in the shared embedding, and saves it as a .csv file in the current directory.

For example:

>>> from vecshare import vecshare as vs
>>> vs.download('agriculture_40')
              text        d0        d1        d2        d3        d4  \
0              the  1.477964  0.016078 -0.193995  1.113142  0.765398   
1               of -0.048878 -0.597735  0.196982  0.220966  1.463818   
2               to  1.932197  1.587676 -0.321938 -0.592603  0.137684   
3               in  0.294486  1.061131 -0.119670  0.611166  0.436337   
4             said -0.609932 -0.481854  0.028189  0.755433 -0.493351   
5                a  0.750953  0.342545 -0.758257  0.381944  0.824879   
6              and  0.991821 -0.252496  0.011951  0.384948  0.505785   
7              mln  0.215208  3.330005  0.458480  0.484309  1.128098   
8               vs  0.512198  3.565070 -1.698517  0.813855 -0.002396   
9             dlrs -0.026384  1.905773  1.313683  0.825797  1.981671
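A saved file in the documented layout (a text column followed by d0, d1, … columns) can also be read without pandas. The sketch below is one way to do that; the function name load_embedding_csv is hypothetical, and the name of the file written by download() is not stated here, so the example reads from an in-memory stream instead.

```python
import csv
import io
import re

# Matches dimension columns such as "d0", "d17", or "d_3".
DIM_COLUMN = re.compile(r"d_?(\d+)$")


def load_embedding_csv(handle):
    """Read an embedding CSV into a dict mapping word -> list of floats.

    Components are ordered by their column index (d0, d1, ...), regardless of
    the order in which the columns appear in the file.
    """
    reader = csv.DictReader(handle)
    dims = []
    for name in reader.fieldnames or []:
        match = DIM_COLUMN.match(name)
        if match:
            dims.append((int(match.group(1)), name))
    dims.sort()
    return {row["text"]: [float(row[name]) for _, name in dims] for row in reader}


# Columns deliberately listed in descending order, as in some outputs above.
sample = io.StringIO("text,d1,d0\nthe,0.25,1.5\nof,0.75,-0.5\n")
vectors = load_embedding_csv(sample)
```

Sorting by column index is a design choice for this sketch: the sample outputs in this document list the dimension columns in both ascending and descending order, so relying on file order could silently reverse the vectors.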

Advanced Setup

Custom Signature Methods:

Additional signature methods can be included in the library by downloading the library and adding them to the signatures.py file. To incorporate new signatures into future releases of the official VecShare library, fork the GitHub repository and submit your changes for merging.

Custom Indexers:

Custom indexers can be added by updating the following variables in the indexer.py file:

INDEXER      = <NEW INDEXER DATASET ID>
INDEX_FILE   = <NAME OF THE INDEX FILE>
EMB_TAG      = <EMB TAG>

Acknowledgements

This research was supported in part by NSF grant IIS-1351029 and the Allen Institute for Artificial Intelligence.