Importing and parsing a large CSV file with Go and App Engine's datastore

Problem description:

Locally, I am able to do the following successfully (in a task):

  • Open the CSV
  • Scan through each line (using Scanner.Scan)
  • Map the parsed CSV line to my desired struct
  • Save the struct to datastore

I see that blobstore has a reader that would allow me to read the value directly using a streaming file-like interface, but that seems to have a limit of 32MB. I also see there's a bulk upload tool, bulk_uploader.py, but it won't do all the data massaging I require, and I'd like to limit the writes (and really the cost) of this bulk insert.

How would one effectively read and parse a very large (500MB+) CSV file without the benefit of reading from local storage?

Consider the following options and see whether either works for you:

  1. Given the file size, you should consider using Google Cloud Storage (GCS) for the file. You can use the command-line utilities that GCS provides to upload the file to your bucket. Once it is uploaded, you can use the JSON API directly to work with the file and import it into your datastore layer. Take a look at the following: https://developers.google.com/storage/docs/json_api/v1/json-api-go-samples

  2. If this is a one-time import of a large file, another option is to spin up a Google Compute VM and write an app there that reads from GCS and passes the data in smaller chunks to a service running in App Engine Go, which can then accept and persist the data.

Not the solution I hoped for, but I ended up splitting the large files into 32MB pieces, uploading each to blob storage, then parsing each in a task.

It ain't pretty, but it took less time than the other options.