PhD Studies PC - Database+R


ThePlanner

Original Poster:

5,252 posts

267 months

Thursday 23rd February 2017
I am looking for a PC for my studies. I am undecided which would be the better workflow.

My course is a PhD in Data Analytics. I have access to some decent servers at work via VPN, but I am unable to directly synchronize the data between my home and office machines.

My choices are below

Option 1 - All data is stored and processed on the laptop
Dell XPS 9560 laptop (1TB SSD / 32GB RAM)

Or

Option 2 - Data processing is undertaken on a server and summary results are then pushed to the laptop's SQL database
Used Dell R610 server + my existing laptop
The Dell server will have 128GB RAM / 1.2TB Intel PCIe SSD + 8x 600GB SAS drives + 2x Xeon X5650 (3-year-old used hardware)

My existing database is small at around 500GB and is expected to grow to 800GB. I am using R to do the analysis.

I am trying to figure out which option is best for my workflow.

Edited by ThePlanner on Thursday 23 February 07:27

nyt

1,807 posts

150 months

Thursday 23rd February 2017
SQL Server on Microsoft's cloud?

Or a machine at your work that you can remote desktop into from home.
That would allow you to leave analysis tasks running while you travel, and you would still have full-bandwidth access to your data when you're at home.



Edited by nyt on Thursday 23 February 06:56

V8LM

5,174 posts

209 months

Thursday 23rd February 2017
You will need to comply with the institutional and PhD-funder's Research Data Management policies and these will preclude using a home PC or laptop for the prime storage. Depending on the data they could preclude using any off-site/cloud facility too.

Option 1 should be out, although it is the commonly used approach.

Option 2 is better.

Better would be to SSH in to an institutional resource for all processing and use the laptop for preparation of figures, reports, etc.

Better still is to do all the work at the institution and use the time away for other stuff.


Edited by V8LM on Thursday 23 February 07:00

Du1point8

21,608 posts

192 months

Thursday 23rd February 2017
Shame it's SQL.

What kind of data and analysis?

Is it time-based data or normal data?

HappyMidget

6,788 posts

115 months

Thursday 23rd February 2017
Are you using SQL 2016 with the built-in R? It can offload the R processing to multiple servers. I would go with a decent server myself and just connect remotely, though.

HappyMidget

6,788 posts

115 months

Thursday 23rd February 2017
Also, using sql2016 and clustered columnstore indexing will massively compress the data. In my last data warehouse, 2bn rows compressed down to about 80GB from over 500GB.
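
To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the numbers quoted above ("over 500GB" and "about 80GB" are approximate), assumes decimal gigabytes, and the function names are made up for illustration. Compression on a different dataset could differ a lot.

```python
# Figures quoted in the post; both sizes are approximate.
ROWS = 2_000_000_000
UNCOMPRESSED_GB = 500
COMPRESSED_GB = 80


def compression_ratio(before_gb, after_gb):
    """Return how many times smaller the data became."""
    return before_gb / after_gb


def bytes_per_row(size_gb, rows):
    """Average bytes per row, assuming decimal gigabytes (1 GB = 1e9 bytes)."""
    return size_gb * 1e9 / rows


if __name__ == "__main__":
    ratio = compression_ratio(UNCOMPRESSED_GB, COMPRESSED_GB)
    print(f"compression ratio: {ratio:.2f}x")
    print(f"before: {bytes_per_row(UNCOMPRESSED_GB, ROWS):.0f} bytes/row")
    print(f"after:  {bytes_per_row(COMPRESSED_GB, ROWS):.0f} bytes/row")
```

That works out to roughly a 6x reduction for that particular warehouse.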

ThePlanner

Original Poster:

5,252 posts

267 months

Thursday 23rd February 2017
HappyMidget said:
Also, using sql2016 and clustered columnstore indexing will massively compress the data. In my last data warehouse, 2bn rows compressed down to about 80GB from over 500GB.
The data is already in SQL 2016 and has been optimized.

Du1point8 said:
Shame it's SQL.

What kind of data and analysis?

Is it time-based data or normal data?
The main element of my research is predictive analytics on the data.

The data contains multiple elements:
  • Weather
  • Vehicle - vehicle, person, location, speed
  • Sites of interest

V8LM said:
You will need to comply with the institutional and PhD-funder's Research Data Management policies and these will preclude using a home PC or laptop for the prime storage. Depending on the data they could preclude using any off-site/cloud facility too.

Option 1 should be out, although it is the commonly used approach.

Option 2 is better.

Better would be to SSH in to an institutional resource for all processing and use the laptop for preparation of figures, reports, etc.

Better still is to do all the work at the institution and use the time away for other stuff.

The data is provided by my employer and there are no restrictions on where I store it, except that it is not online! The project sponsor is an overseas government, so it is not subject to Data Protection Laws. I have an agreed reporting structure for the final thesis covering what data can and cannot be presented.

This has been agreed between the University in UK and the Government Department.

Du1point8 said:
Shame it's SQL.

What kind of data and analysis?

Is it time-based data or normal data?
At the moment it is SQL, as it is a direct duplicate of the data from my employer. That does not stop me from transferring to another platform, if required.

Edited by ThePlanner on Thursday 23 February 07:20


Edited by ThePlanner on Thursday 23 February 07:26

Vaud

50,501 posts

155 months

Thursday 23rd February 2017
How do they define "not online"?

ThePlanner

Original Poster:

5,252 posts

267 months

Thursday 23rd February 2017
Vaud said:
How do they define "not online"?
Not stored in Amazon's cloud or similar.

We invested $5 million last year in our own internal data/compute store for the department. I have access to this, but I do not want my research work running alongside the production environment.

We have 50x Cisco UCS servers + 360TB of EMC SAN storage
800 cores (1,600 threads) + 12,800GB RAM in total
40 servers are used as a Hadoop cluster; the remaining 10 are used as 10 Windows VMs for ease of access to the data for other users in SQL + ESRI GIS

[Image: early system design. Things were changed after initial system testing.]

[Image: hardware installed in server room]

Du1point8

21,608 posts

192 months

Thursday 23rd February 2017
sent you a PM

plasticpig

12,932 posts

225 months

Thursday 23rd February 2017
Probably teaching you to suck eggs but isn't the laptop out of the equation purely due to DB size? It depends on exactly what you are doing I guess but how big is the TempDB going to grow to on an 800GB DB?
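
A quick way to sanity-check that concern is a simple disk budget. The sketch below is Python and purely illustrative: the 1TB disk and 800GB database come from the thread, while the TempDB sizes and the allowance for the OS, applications and logs are assumptions, not measurements. Real TempDB usage depends entirely on the queries being run.

```python
def free_space_gb(disk_gb, db_gb, tempdb_gb, other_gb):
    """Space left on the disk after the database, TempDB and everything else."""
    return disk_gb - db_gb - tempdb_gb - other_gb


DISK_GB = 1000   # 1TB SSD from Option 1 (decimal GB; usable space will be lower)
DB_GB = 800      # expected database size from the thread
OTHER_GB = 100   # ASSUMPTION: OS, R, applications, transaction logs

# ASSUMPTION: illustrative TempDB sizes only
for tempdb_gb in (50, 100, 200):
    left = free_space_gb(DISK_GB, DB_GB, tempdb_gb, OTHER_GB)
    status = "fits" if left >= 0 else "does not fit"
    print(f"TempDB {tempdb_gb} GB -> {left:+.0f} GB free ({status})")
```

Even under these fairly generous assumptions there is very little headroom on a 1TB drive.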

ThePlanner

Original Poster:

5,252 posts

267 months

Thursday 23rd February 2017
plasticpig said:
Probably teaching you to suck eggs but isn't the laptop out of the equation purely due to DB size? It depends on exactly what you are doing I guess but how big is the TempDB going to grow to on an 800GB DB?
I didn't think about the TempDB growing. I have never had to worry about space restrictions before. Thanks, that could be an issue on the laptop.

So it looks like the used server, plus transferring summary data to the laptop.
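
For anyone following along, the "process on the server, push summaries to the laptop" workflow could be sketched roughly like this. It is a minimal Python illustration, not the actual setup: it uses in-memory SQLite databases as stand-ins for the two SQL Server instances, and the table and column names (vehicle_obs, speed_summary, location, speed) are hypothetical.

```python
import sqlite3


def build_summary(server):
    """Aggregate raw rows on the server side so only a small result leaves it."""
    return server.execute(
        "SELECT location, COUNT(*), AVG(speed) "
        "FROM vehicle_obs GROUP BY location ORDER BY location"
    ).fetchall()


def push_summary(laptop, rows):
    """Replace the laptop's summary table with the latest aggregated rows."""
    laptop.execute(
        "CREATE TABLE IF NOT EXISTS speed_summary "
        "(location TEXT PRIMARY KEY, n_obs INTEGER, mean_speed REAL)"
    )
    laptop.execute("DELETE FROM speed_summary")
    laptop.executemany("INSERT INTO speed_summary VALUES (?, ?, ?)", rows)
    laptop.commit()


def demo():
    """Run the whole flow against two throwaway in-memory databases."""
    server = sqlite3.connect(":memory:")   # stand-in for the big server DB
    laptop = sqlite3.connect(":memory:")   # stand-in for the laptop DB
    server.execute("CREATE TABLE vehicle_obs (location TEXT, speed REAL)")
    server.executemany(
        "INSERT INTO vehicle_obs VALUES (?, ?)",
        [("A", 50.0), ("A", 70.0), ("B", 30.0)],  # made-up sample rows
    )
    push_summary(laptop, build_summary(server))
    return laptop.execute(
        "SELECT location, n_obs, mean_speed FROM speed_summary ORDER BY location"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

The point of the design is that only the aggregated rows cross to the laptop, so the laptop never needs to hold the full 800GB.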

ThePlanner

Original Poster:

5,252 posts

267 months

Thursday 23rd February 2017
This will be fun... I will need to transport the server (35kg+) on my return flight next month.

Nothing available locally!


tankplanker

2,479 posts

279 months

Thursday 23rd February 2017
Are you looking at virtualising it on the laptop or running it natively? I personally prefer virtualisation, as it's easier to back up the entire thing and I can always port it to different hardware if needed. Granted, there is a performance penalty for not running it natively, as you have a "wrapper" around the application, but the benefits outweigh that for me.

A decent workstation class laptop is more than capable of running sizable virtual machines for this: http://www8.hp.com/us/en/workstations/zbook-15.htm...

I have a similar laptop that I use for demos/testing (can't connect to the network in a lot of places I work) that can quite happily run five virtual servers performing a reasonable workload.

The biggest initial bottleneck with any laptop running virtual servers is the hard disks. I use two high-performance SSDs built into the laptop and split the drives of my virtual servers across the two, rather than running them as RAID, to reduce contention. I found this works better than RAID for me; your mileage may vary. You could also add a 3rd or 4th drive using a decent external SSD plugged in via USB.
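
One simple way to plan that kind of split is a greedy "least-loaded drive first" placement. The Python sketch below is only an illustration of the idea, not how any hypervisor does it: the disk names are hypothetical and the weights are assumed relative I/O loads that you would have to estimate yourself.

```python
def place_disks(vdisks, drives):
    """Assign each virtual disk to the currently least-loaded physical drive.

    vdisks maps a virtual disk name to an assumed relative I/O weight.
    Returns a mapping of drive name -> list of virtual disk names.
    """
    placement = {d: [] for d in drives}
    load = {d: 0 for d in drives}
    # Heaviest first, so the busiest disks end up on different drives
    for name, weight in sorted(vdisks.items(), key=lambda kv: kv[1], reverse=True):
        target = min(drives, key=lambda d: load[d])
        placement[target].append(name)
        load[target] += weight
    return placement


if __name__ == "__main__":
    # Hypothetical virtual disks and weights, for illustration only
    vdisks = {"sql-data": 4, "sql-tempdb": 2, "sql-log": 1, "os": 1}
    print(place_disks(vdisks, ["ssd0", "ssd1"]))
```

With these made-up weights the data disk gets one SSD to itself and everything else shares the other, which is the same spirit as keeping contending workloads on separate drives.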