    • 6. Invention application
    • METHOD AND SYSTEM FOR PARALLELIZATION OF INGESTION OF LARGE DATA SETS
    • US20180253478A1
    • 2018-09-06
    • US15909846
    • 2018-03-01
    • Next Pathway Inc.
    • Badih Schoueri; Gregory Gorshtein; Vladimir Antonevich
    • G06F17/30; G06F3/06
    • G06F16/254; G06F3/0643; G06F16/10; G06F16/84
    • The present invention relates, in an embodiment, to a method for ingesting input data containing a plurality of records into a data lake. In an embodiment, the method comprises splitting the input data into a plurality of input splits consisting of a balanced number of records; reading the records from the plurality of input splits in parallel, regardless of the format and encoding of the input source; converting the input data within the records into at least one key/value pair; transforming the values of the input data into a serializable format; sorting the key/value pairs of the transformed values such that the records are sorted in the same order as they were read; writing the transformed values to an output file; and storing the output file to the data lake. The present invention also relates, in another embodiment, to a system for ingesting input data containing a plurality of records into a data lake. In an embodiment, the system comprises one or more processors adapted to execute one or more modules, the modules comprising: an input module for splitting the input data into input splits; a mapping module for transforming the input data in the input splits into a format for processing; a partition module for sorting the transformed data; and an output module for writing the sorted transformed data to an output file and determining a location on the data lake for the output file; and a driver for communicating with the one or more modules of the one or more processors via a first communication layer, the driver configuring the one or more modules and calculating the input splits.
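
The method steps in the abstract (balanced splits, parallel reads, key/value conversion, serialization, order-preserving sort, output) can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names `balanced_splits`, `map_split`, and `ingest`, the use of a thread pool, the record offset as the key, and JSON as the serializable format are all assumptions chosen for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def balanced_splits(records, n_splits):
    """Cut records into n_splits contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), n_splits)
    splits, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        # Remember each chunk's starting offset so read order can be restored.
        splits.append((start, records[start:start + size]))
        start += size
    return splits


def map_split(split):
    """Convert each record to a (key, value) pair.

    Assumption: the key is the record's original position and the value is
    the record serialized as JSON.
    """
    offset, chunk = split
    return [(offset + i, json.dumps({"raw": rec})) for i, rec in enumerate(chunk)]


def ingest(records, n_splits=4):
    """Process splits in parallel, then sort by key to recover the read order."""
    splits = balanced_splits(records, n_splits)
    pairs = []
    with ThreadPoolExecutor(max_workers=n_splits) as pool:
        futures = [pool.submit(map_split, s) for s in splits]
        # Results may arrive in any order, which is why the sort below is needed.
        for fut in as_completed(futures):
            pairs.extend(fut.result())
    pairs.sort(key=lambda kv: kv[0])
    return [value for _, value in pairs]


def write_output(lines, path):
    """Write the serialized values to an output file, one per line."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines))
```

Calling `ingest(["a", "b", "c", "d", "e"], n_splits=2)` yields five JSON strings in the original record order, which `write_output` could then store at a hypothetical data-lake path.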